University of Michigan Center for Statistical Genetics

How MERLIN simulates data

With the --simulate option, Merlin can generate random datasets that look like the original data in terms of marker informativeness, spacing and missing data patterns. In these datasets, marker data are simulated under the null hypothesis of no linkage or association to observed phenotypes. Phenotypic measurements, including covariates, quantitative traits and affection status are preserved.

Here is what the --simulate option does:

  1. It assigns random chromosomes to founders according to the allele frequencies at each marker. Usually, alleles are simulated independently at each marker, but when the --cluster, --rsq or --dist options are used to enable clustering of markers in LD, alleles in a cluster are simulated using the haplotype frequencies for the cluster.
  2. It segregates these chromosomes through the pedigree using the relationships specified in the original pedigree file and the recombination fractions specified in the map file or linkage format data file.
  3. It replaces the original genotypes with the simulated genotypes, retaining the original pattern of missing data exactly. (For example, if individual A is untyped at marker B in the original data, individual A's genotype at marker B will be discarded in all replicates.)
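
The three steps above can be sketched in a few lines of Python. This is only an illustration of the gene-dropping idea, not Merlin's actual implementation: the function name, the data structures and the assumption that parents are listed before their children are all hypothetical, markers are treated as independent (no LD clustering), and recombination is modelled without interference.

```python
import random

def drop_genes(pedigree, freqs, thetas, typed, rng):
    """Simulate one replicate for one chromosome (illustrative sketch only).

    pedigree: list of (person, father, mother); founders have None parents,
              and parents must appear before their children (assumption).
    freqs:    one dict per marker mapping allele -> frequency.
    thetas:   recombination fractions between adjacent markers.
    typed:    set of (person, marker_index) pairs genotyped in the original data.
    """
    n_markers = len(freqs)
    haplotypes = {}  # person -> (paternal haplotype, maternal haplotype)

    def founder_haplotype():
        # Step 1: draw each allele independently from the marker's frequencies.
        return [rng.choices(list(f), weights=list(f.values()))[0] for f in freqs]

    def gamete(parent):
        # Step 2: copy from one parental haplotype, switching to the other
        # with probability theta between adjacent markers.
        pair = haplotypes[parent]
        strand = rng.randrange(2)
        transmitted = []
        for m in range(n_markers):
            if m > 0 and rng.random() < thetas[m - 1]:
                strand = 1 - strand
            transmitted.append(pair[strand][m])
        return transmitted

    for person, father, mother in pedigree:
        if father is None:
            haplotypes[person] = (founder_haplotype(), founder_haplotype())
        else:
            haplotypes[person] = (gamete(father), gamete(mother))

    # Step 3: keep a genotype only where the original data had one.
    genotypes = {}
    for person, _, _ in pedigree:
        for m in range(n_markers):
            if (person, m) in typed:
                pat, mat = haplotypes[person]
                genotypes[(person, m)] = tuple(sorted((pat[m], mat[m])))
    return genotypes

if __name__ == "__main__":
    trio = [("dad", None, None), ("mom", None, None), ("kid", "dad", "mom")]
    marker_freqs = [{"1": 0.3, "2": 0.7}, {"1": 0.5, "2": 0.5}]
    observed = {("dad", 0), ("mom", 0), ("kid", 0), ("kid", 1)}
    print(drop_genes(trio, marker_freqs, [0.1], observed, random.Random(1)))
```

Because the phenotypes never enter the function, the simulated genotypes cannot be linked to any trait, which is what makes the replicates suitable as null data.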

The net result of this is that you get a random chromosome (or genome) that is unlinked to any of your traits of interest. These simulated data are suitable for examining false positive rates in a genome scan and should allow for quirks of marker informativeness, trait distribution and selection scheme.
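
One way to use such replicates is sketched below. The code assumes you have already analysed each simulated dataset and recorded the highest statistic (for example, the maximum LOD score) seen in each replicate; the function names are hypothetical and nothing here is produced by Merlin itself. The "+1" in the p-value is a common conservative convention, not something the Merlin documentation prescribes.

```python
def false_positive_rate(null_maxima, threshold):
    """Fraction of null replicates whose best score reaches the threshold."""
    hits = sum(1 for score in null_maxima if score >= threshold)
    return hits / len(null_maxima)

def empirical_p_value(observed_max, null_maxima):
    """Empirical genome-wide p-value for the best score in the real data.

    Counts the observed scan as one extra replicate, so the estimate
    can never be exactly zero.
    """
    exceed = sum(1 for score in null_maxima if score >= observed_max)
    return (exceed + 1) / (len(null_maxima) + 1)

if __name__ == "__main__":
    # Made-up maxima from four hypothetical replicates, for illustration only.
    maxima = [0.5, 1.2, 2.1, 3.4]
    print(false_positive_rate(maxima, 2.0))
    print(empirical_p_value(3.0, maxima))
```

In practice many more replicates are needed than in this toy usage; the precision of the estimated rate is limited by the number of simulated scans.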

The simulated data can be saved to a file with the --save command line option or analysed directly with any of the regular Merlin options. Changing the random seed (with the -r command line option) generates a different set of founder chromosomes and a different segregation pattern.
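
As a sketch of how a series of replicates might be scripted, the helper below builds one Merlin command line per random seed. The --simulate, --save and -r options are the ones described above; -d, -p and -m are Merlin's usual data, pedigree and map file options; the file names, the helper name and the assumption that the executable is called merlin are illustrative only. The commands are built but not executed, so you can inspect them before passing each one to subprocess.run.

```python
def replicate_commands(datafile, pedfile, mapfile, seeds, extra=("--save",)):
    """Return one argument list per seed (hypothetical helper, builds commands only)."""
    commands = []
    for seed in seeds:
        commands.append([
            "merlin",             # assumed executable name
            "-d", datafile,
            "-p", pedfile,
            "-m", mapfile,
            "--simulate",
            "-r", str(seed),      # a different seed gives a different replicate
            *extra,               # e.g. --save, or regular analysis options instead
        ])
    return commands

if __name__ == "__main__":
    # Placeholder file names; substitute your own input files.
    for argv in replicate_commands("study.dat", "study.ped", "study.map", [101, 102, 103]):
        print(" ".join(argv))
```

Replacing the extra argument with regular analysis options would analyse each replicate directly rather than saving it.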

Newer versions of Merlin (>1.1) can simulate quantitative traits under the alternative hypothesis, where trait values are influenced by genotypes at a specific marker. This option respects the original missing data pattern and allele frequencies for the trait and SNP genotypes, but introduces association between the two. Depending on the map information you provide as input and on whether clustering of SNPs in linkage disequilibrium is enabled, it may also introduce linkage and linkage disequilibrium between the marker and neighboring SNPs.

Simulation under the alternative is enabled by combining the --simulate and --trait options. An example set of command line options might be: --simulate --trait BMI,rs9930506,0.01,0.39,0.60. This would simulate trait BMI such that marker rs9930506 accounts for 0.01 of the variance, with residual polygenic variance of 0.39 and residual environmental variance of 0.60. For more details see the MERLIN command-line reference.
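
To make the variance components concrete, here is a rough sketch of what the example specification implies. It is not Merlin's algorithm. It assumes a biallelic SNP in Hardy-Weinberg equilibrium with a purely additive effect, so that a per-allele effect a contributes variance 2p(1-p)a^2, and it simulates unrelated individuals only, for whom the polygenic and environmental components can be pooled into a single residual. The function names and the allele frequency used in the usage example are hypothetical.

```python
import math
import random

def parse_trait_spec(spec):
    """Split 'NAME,MARKER,marker_var,polygenic_var,env_var' into typed fields."""
    name, marker, marker_var, polygenic_var, env_var = spec.split(",")
    return name, marker, float(marker_var), float(polygenic_var), float(env_var)

def additive_effect(marker_var, allele_freq):
    """Per-allele effect a such that 2p(1-p)a^2 equals marker_var (assumed model)."""
    return math.sqrt(marker_var / (2.0 * allele_freq * (1.0 - allele_freq)))

def simulate_unrelated(spec, allele_freq, n, rng):
    """Return (genotype, trait) pairs for n unrelated individuals."""
    _, _, marker_var, polygenic_var, env_var = parse_trait_spec(spec)
    effect = additive_effect(marker_var, allele_freq)
    # Without relatives, polygenic and environmental variance act as one residual.
    residual_sd = math.sqrt(polygenic_var + env_var)
    data = []
    for _person in range(n):
        genotype = sum(1 for _copy in range(2) if rng.random() < allele_freq)
        trait = effect * (genotype - 2.0 * allele_freq) + rng.gauss(0.0, residual_sd)
        data.append((genotype, trait))
    return data

if __name__ == "__main__":
    spec = "BMI,rs9930506,0.01,0.39,0.60"
    print(parse_trait_spec(spec))
    # Allele frequency 0.5 is an arbitrary value chosen for illustration.
    print(additive_effect(0.01, 0.5))
    print(simulate_unrelated(spec, 0.5, 5, random.Random(1)))
```

In pedigree data the polygenic component would additionally have to be correlated between relatives according to their kinship, which this sketch deliberately leaves out.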

If you are interested in finding out more about gene-dropping simulations, two useful references are:

  1. Sawcer S, Jones HB, Judge D, Visser F, Compston A, Goodfellow PN, Clayton D (1997) Empirical genomewide significance levels established by whole genome simulations. Genet Epidemiol 14:223-9.
  2. Kruglyak L, Daly MJ (1998) Linkage thresholds for two-stage genome scans. Am J Hum Genet 62:994-7.

 
 

University of Michigan | School of Public Health | Abecasis Lab