Gene-based tests for association are increasingly seen as a useful complement to genome-wide association studies (GWAS) [1]. A gene-based approach considers association between a trait and all markers (usually SNPs) within a gene rather than each marker individually. Depending on the underlying genetic architecture, gene-based approaches can be more powerful than traditional individual-SNP-based GWAS. For example, if a gene contains more than one causative variant, then several SNPs within that gene may show marginal levels of significance that are often indistinguishable from random noise in the initial GWAS results. By making use of prior biological information and combining the effects of all SNPs in a gene into a single test statistic and p-value, the gene-based test may be able to detect these effects. Gene-based tests are also ideally suited to network (or pathway) approaches for interpreting the findings from GWAS [2,3]. These approaches are necessarily gene-centric and require a measure of the relative importance of each gene to the phenotype of interest. The gene-based approach also alleviates the multiple-testing problem faced by GWAS by considering statistical tests for only ~20,000 genes per genome, as opposed to more than half a million SNPs in a typical GWAS.
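To make the multiple-testing point concrete, the following minimal sketch compares per-test significance thresholds for the two test counts quoted above. It assumes a Bonferroni correction and a family-wise error rate of 0.05; neither is specified in the text, and the function name is purely illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


ALPHA = 0.05          # assumed family-wise error rate (not stated in the text)
N_GENES = 20_000      # approximate number of genes per genome, as quoted above
N_SNPS = 500_000      # "half a million" SNPs, a lower bound for a typical GWAS

gene_threshold = bonferroni_threshold(ALPHA, N_GENES)
snp_threshold = bonferroni_threshold(ALPHA, N_SNPS)

print(f"gene-based threshold: {gene_threshold:.2e}")
print(f"SNP-based threshold:  {snp_threshold:.2e}")
print(f"ratio: {gene_threshold / snp_threshold:.0f}x less stringent per test")
```

Under these assumptions the per-test threshold for genes is about 25 times less stringent than the per-SNP threshold, although the gain in power also depends on how signal is distributed within each gene.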
Computing a gene-based test for basic GWAS designs using permutations is conceptually simple and is currently implemented as the ‘set-based test’ in the PLINK software package [4]; however, heavy computational requirements have kept this method from being adopted on a genome-wide scale. Other gene-based tests, such as those based on genetic distances [5] or entropy [6], are often also restricted to situations where individual genotype information is available, or to specific GWAS designs (usually case-control). There are several important situations in which permutation or existing methods cannot be used; these include family-based GWAS, meta-analysis GWAS based on summary data, and DNA-pooling-based GWAS. To address these situations, our approach requires only individual marker p-values in order to compute a gene-based p-value. The test summarizes the evidence for association on a per-gene basis by considering either the full set of markers (typically SNPs) in the gene or a subset of the most significant markers (for example, the top 10% most significant SNPs). Under some genetic models, an approach considering all the markers in a gene may be most powerful; under other genetic models, focusing on just the most strongly associated markers within a gene may be more appropriate. The correct genetic model is of course not known in advance. The default, in our implementation and for the rest of this paper, is to use all markers in the gene. Our approach accounts for linkage disequilibrium (LD) between markers in a gene by using simulation from the multivariate normal distribution based on the LD structure of a set of reference individuals (HapMap CEU, YRI or CHB/JPT at ~2.1 million autosomal SNPs), or a...
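The simulation idea described above can be illustrated with a short sketch. This is not the authors' implementation; it shows one plausible way to turn per-marker p-values and an LD correlation matrix into an empirical gene-based p-value. The assumptions are ours: each two-sided p-value is converted to a 1-df chi-squared statistic, the gene statistic is the sum of the largest statistics, the LD matrix is positive definite, and the empirical p-value uses an add-one correction. All function and parameter names (`gene_based_pvalue`, `ld_corr`, `top_fraction`, `n_sim`) are hypothetical.

```python
import math
import random
from statistics import NormalDist


def cholesky(matrix):
    """Return lower-triangular L with L * L^T == matrix (positive definite input)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def gene_based_pvalue(snp_pvalues, ld_corr, n_sim=10_000, top_fraction=1.0, seed=0):
    """Empirical gene-level p-value from marker p-values and an LD matrix.

    snp_pvalues  : two-sided marker p-values, each in (0, 1]
    ld_corr      : marker-by-marker correlation (r) matrix as nested lists
    top_fraction : 1.0 uses all markers; 0.1 uses the top 10% most significant
    """
    n = len(snp_pvalues)
    k = max(1, math.ceil(top_fraction * n))   # number of markers that contribute
    std_normal = NormalDist()

    # Assumption: a two-sided p-value p corresponds to z^2 with z = Phi^{-1}(p / 2),
    # which is the upper-tail quantile of a 1-df chi-squared distribution.
    observed = [std_normal.inv_cdf(p / 2.0) ** 2 for p in snp_pvalues]
    obs_stat = sum(sorted(observed)[-k:])

    lower = cholesky(ld_corr)
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_sim):
        # Correlated standard normals: z = L * e with e ~ N(0, I), so cov(z) = ld_corr.
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = [sum(lower[i][j] * e[j] for j in range(i + 1)) for i in range(n)]
        sim_stat = sum(sorted(v * v for v in z)[-k:])
        if sim_stat >= obs_stat:
            exceed += 1

    # Add-one correction keeps the estimate strictly positive (our choice).
    return (exceed + 1) / (n_sim + 1)


# Toy example: three markers in moderate LD (r = 0.5), all marginally significant.
ld = [[1.0, 0.5, 0.5],
      [0.5, 1.0, 0.5],
      [0.5, 0.5, 1.0]]
print(gene_based_pvalue([0.01, 0.02, 0.03], ld, n_sim=5_000))
```

Setting `top_fraction=0.1` restricts both the observed and the simulated statistics to the top 10% of markers, mirroring the subset option mentioned in the text. Because the null distribution is simulated from the LD matrix alone, no individual-level genotypes are needed for the trait under study.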

