Efficient haplotype inference: PedPhase
With the completion of the Human Genome Project, an almost complete human genomic DNA sequence has become available.
An important next step in human genomics is to determine genetic variations among humans and the correlation between
genetic variations and phenotypic variations (such as disease status, quantitative traits, etc.).
The patterns of human DNA sequence variations can be described by SNP (single nucleotide polymorphism) haplotypes.
However, humans are diploid and, in practice, haplotype data cannot be collected directly. Instead, genotype data
are collected routinely in large sequencing projects. Hence, efficient and accurate computational methods and
computer programs for inferring haplotypes from genotypes are in high demand. We are interested in the
haplotype inference problem on pedigrees and we study haplotype reconstruction under the Mendelian law of
inheritance and the minimum recombination principle on pedigree data. We have developed different algorithms
for the minimum-recombinant haplotype configuration (MRHC) formulation and implemented them in a software
package called PedPhase. Future research plans for this project include
considering multiple small families simultaneously and haplotyping unrelated individuals on a whole-genome scale.
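To make the minimum recombination principle concrete, the sketch below counts the fewest crossovers needed for one parent to have transmitted a given haplotype. It is an illustration only, not the PedPhase algorithm: the function name `min_recombinants` and the encoding of haplotypes as strings of allele characters are assumptions made for this example, and it handles a single parent-child transmission rather than a whole pedigree.

```python
def min_recombinants(parent_hap_a, parent_hap_b, child_hap):
    """Fewest crossovers needed to copy child_hap from the two parental haplotypes.

    Returns None when some child allele matches neither parental haplotype,
    i.e. the transmission is inconsistent with Mendelian inheritance.
    All three haplotypes are assumed to have the same length.
    """
    inf = float("inf")
    cost_a = 0  # cheapest way to be copying from haplotype a at this marker
    cost_b = 0  # cheapest way to be copying from haplotype b at this marker
    for a, b, c in zip(parent_hap_a, parent_hap_b, child_hap):
        # Stay on the same haplotype for free, or switch at the cost of one crossover.
        new_a = min(cost_a, cost_b + 1) if a == c else inf
        new_b = min(cost_b, cost_a + 1) if b == c else inf
        cost_a, cost_b = new_a, new_b
    best = min(cost_a, cost_b)
    return None if best == inf else best


if __name__ == "__main__":
    print(min_recombinants("0011", "1100", "0000"))  # one crossover
    print(min_recombinants("00", "00", "01"))        # inconsistent
```

The MRHC formulation asks for a haplotype assignment over the entire pedigree that minimizes the total of such counts, which is a much harder combinatorial problem than this single-transmission case.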
Gene mapping by association: HapMiner
With the availability of abundant SNP markers, association studies provide new hope for gene mapping.
We are interested in association mapping in general and, in particular, in new methods
that incorporate haplotype information. We have developed a new algorithmic method for haplotype mapping
of case-control data based on a density-based clustering algorithm, and have proposed a new haplotype similarity
measure. Experimental results on simulated datasets obtained from the literature, and on real datasets
with known disease gene locations, show that our method can predict gene locations with high accuracy,
even when the rate of phenocopies is high. A software
package called HapMiner implements the algorithm and is available from our website.
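As a rough illustration of how a haplotype similarity score can feed a density-based view of the data, the sketch below scores two haplotypes by the length of the contiguous segment they share around a marker, then counts how many other haplotypes are "close" to each one. This is not the HapMiner similarity measure or clustering procedure; `shared_span`, `neighbor_counts`, and the `min_span` threshold are hypothetical names chosen for this example.

```python
def shared_span(hap1, hap2, locus):
    """Number of contiguous matching markers around `locus` (0 if the locus differs)."""
    if hap1[locus] != hap2[locus]:
        return 0
    left = locus
    while left > 0 and hap1[left - 1] == hap2[left - 1]:
        left -= 1
    right = locus
    while right < len(hap1) - 1 and hap1[right + 1] == hap2[right + 1]:
        right += 1
    return right - left + 1


def neighbor_counts(haps, locus, min_span):
    """For each haplotype, count the others sharing at least `min_span` markers around `locus`."""
    return [
        sum(
            1
            for j, other in enumerate(haps)
            if j != i and shared_span(hap, other, locus) >= min_span
        )
        for i, hap in enumerate(haps)
    ]


if __name__ == "__main__":
    haps = ["0011010", "1011011", "1100101"]
    print(shared_span(haps[0], haps[1], 3))
    print(neighbor_counts(haps, locus=3, min_span=3))
```

In a density-based setting, haplotypes with many neighbors around a marker form candidate clusters, and a marker where such clusters are enriched for case haplotypes is a natural candidate for the disease locus.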
SNP selection for whole genome association: PS_SNP
Large-scale
whole genome association studies are increasingly common, due in large part to
recent advances in genotyping technology. With this change in paradigm for
genetic studies of complex diseases, it is vital to develop valid, powerful,
and efficient statistical tools and approaches to evaluate such data. Marker
selection procedures to identify optimal subsets of SNPs, called tag SNPs, can greatly
improve the efficiency of whole genome association studies and have drawn much
attention recently from the computational biology community. We
have developed a novel SNP selection procedure with the following advantages over
existing approaches: 1) the method does not rely on the definition of haplotype
block and can be applied directly on genotype data, 2) our approach is computationally
efficient and scalable for whole genome studies, and 3) we incorporate the phenotype
into the SNP selection problem, improving the efficiency for any given study.
Preliminary results show that association tests using tag SNPs obtained by our
approach achieve a higher level of significance with substantial savings in genotyping
cost. The source code, written in R, is available here.
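For readers unfamiliar with tag SNP selection, the sketch below shows a common greedy heuristic that works directly on genotype data (coded 0/1/2) without defining haplotype blocks. It is not the PS_SNP procedure: it ignores the phenotype entirely, it is written in Python rather than R, and the names `r_squared`, `greedy_tag_snps`, and the 0.8 threshold are assumptions for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def greedy_tag_snps(genotypes, threshold=0.8):
    """Pick tag SNPs so that every SNP has r^2 >= threshold with some tag.

    `genotypes` is a list with one genotype vector per SNP (same individuals in each).
    """
    m = len(genotypes)
    # covers[i]: SNPs that SNP i could represent, always including itself.
    covers = [
        {j for j in range(m) if j == i or r_squared(genotypes[i], genotypes[j]) >= threshold}
        for i in range(m)
    ]
    untagged = set(range(m))
    tags = []
    while untagged:
        # Choose the SNP covering the most untagged SNPs; break ties by lower index.
        best = max(range(m), key=lambda i: (len(covers[i] & untagged), -i))
        tags.append(best)
        untagged -= covers[best]
    return sorted(tags)


if __name__ == "__main__":
    snps = [
        [0, 1, 2, 1, 0, 2],
        [0, 1, 2, 1, 0, 2],
        [2, 1, 0, 1, 2, 0],
        [0, 0, 1, 2, 2, 1],
    ]
    print(greedy_tag_snps(snps))
```

A phenotype-aware procedure such as the one described above would additionally weight SNPs by their relevance to the trait under study, which this sketch does not attempt.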
Generating samples for association studies based on HapMap data: gs
A computer program is available to generate samples for
association studies based on HapMap data. Two approaches to
generate a large number of samples with genotypes/haplotypes and
phenotypes for a case-control design have been implemented based
on a disease model that can be specified by users. The first
approach takes haplotypes from samples of the HapMap project as
inputs, and the second approach takes the pattern of haplotype
block structure as inputs. The samples produced are likely to
inherit real linkage disequilibrium (LD) patterns from human
populations, and their genotypes/haplotypes differ from the
HapMap samples. Thus a large number of replicates can be generated
to test the power of any new statistical methods for association
analyses. The samples generated by the program can also be used in
testing tag SNP selection algorithms and haplotype inference
algorithms. The program, called gs, implements both approaches and is available from our website.
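The general idea of drawing case-control samples from a haplotype pool under a user-specified disease model can be sketched as follows. This is a simplified illustration, not the gs program: it only pairs existing haplotypes at random, so it does not create new haplotypes the way gs is described as doing, and the function name, the string encoding of haplotypes, and the `penetrance` tuple (probability of disease given 0, 1, or 2 risk alleles) are assumptions for this example.

```python
import random


def simulate_case_control(haplotypes, disease_locus, penetrance, n_cases, n_controls, seed=0):
    """Draw genotypes (0/1/2 per marker) for cases and controls from a haplotype pool.

    `haplotypes` are strings of '0'/'1', with '1' taken as the risk allele at
    `disease_locus`. `penetrance[k]` is the disease probability with k risk alleles.
    Note: the loop does not terminate if cases (or controls) can never occur
    under the given pool and penetrance values.
    """
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        # Form a diploid individual from two randomly chosen haplotypes.
        h1, h2 = rng.choice(haplotypes), rng.choice(haplotypes)
        risk_copies = int(h1[disease_locus]) + int(h2[disease_locus])
        affected = rng.random() < penetrance[risk_copies]
        genotype = [int(a) + int(b) for a, b in zip(h1, h2)]
        if affected and len(cases) < n_cases:
            cases.append(genotype)
        elif not affected and len(controls) < n_controls:
            controls.append(genotype)
    return cases, controls


if __name__ == "__main__":
    pool = ["0101", "1100", "0011", "1010"]
    cases, controls = simulate_case_control(pool, 1, (0.1, 0.3, 0.6), 5, 5, seed=1)
    print(len(cases), len(controls))
```

Repeating the call with different seeds gives independent replicates, which is the kind of setup used to estimate the power of an association test.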
Other areas:
We are also interested in microarray data analysis, promoter binding site identification,
gene regulatory networks, de novo sequencing of peptides, protein interaction networks, etc.
Updated 03/28/2006.