How MERLIN simulates data
With the --simulate option, Merlin can generate random datasets that look like
the original data in terms of marker informativeness, spacing and missing data
patterns. In these datasets, marker data are simulated under the null hypothesis
of no linkage or association to observed phenotypes. Phenotypic measurements,
including covariates, quantitative traits and affection status, are preserved.
Here is what the --simulate option does:
1. It assigns random chromosomes to founders according to allele
   frequencies at each marker. Usually, alleles are simulated
   independently at each marker, but when the --cluster, --rsq or
   --dist options are used to enable clustering of markers in LD,
   alleles in a cluster are simulated using the haplotype
   frequencies for the cluster.
2. It segregates these chromosomes through the pedigree using the
   relationships specified in the original pedigree file and
   recombination fraction specified in the map file or linkage format
3. It replaces the original genotypes with these simulated genotypes,
   retaining the original pattern of missing data exactly. (For example,
   if individual A is untyped at marker B in the original data, individual
   A's genotype at marker B will be discarded in all replicates.)
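The three steps above can be sketched in a few lines of Python. This is only an illustration of the gene-dropping idea, not Merlin's own code: the function names and data layout are hypothetical, markers are simulated independently (no LD clustering), crossovers in adjacent intervals are assumed independent, and the pedigree list is assumed to name parents before their children.

```python
import random

def simulate_founder_haplotype(allele_freqs, rng):
    """Step 1: draw one allele per marker from that marker's allele frequencies."""
    return [rng.choices(list(freqs), weights=list(freqs.values()))[0]
            for freqs in allele_freqs]

def transmit(parent_haps, thetas, rng):
    """Step 2: build one gamete from a parent's two haplotypes.

    thetas[i] is the recombination fraction between markers i and i+1.
    """
    strand = rng.randrange(2)              # start on a random parental strand
    gamete = [parent_haps[strand][0]]
    for i, theta in enumerate(thetas, start=1):
        if rng.random() < theta:           # crossover in this interval
            strand = 1 - strand
        gamete.append(parent_haps[strand][i])
    return gamete

def gene_drop(pedigree, allele_freqs, thetas, typed, seed=None):
    """Simulate genotypes for everyone in a pedigree under no linkage to traits.

    pedigree: list of (person, father, mother); founders have None parents.
    typed:    set of (person, marker_index) pairs genotyped in the original data.
    """
    rng = random.Random(seed)
    haps = {}
    for person, father, mother in pedigree:
        if father is None and mother is None:
            haps[person] = (simulate_founder_haplotype(allele_freqs, rng),
                            simulate_founder_haplotype(allele_freqs, rng))
        else:
            haps[person] = (transmit(haps[father], thetas, rng),
                            transmit(haps[mother], thetas, rng))
    # Step 3: keep a simulated genotype only where the original was typed.
    genotypes = {}
    for person, (h1, h2) in haps.items():
        genotypes[person] = [
            tuple(sorted((a, b))) if (person, m) in typed else None
            for m, (a, b) in enumerate(zip(h1, h2))
        ]
    return genotypes
```

Calling gene_drop twice with the same seed gives the same replicate, and a different seed gives a different set of founder chromosomes and segregation pattern, which mirrors the role of the random seed in Merlin.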
The net result is a random chromosome (or genome)
that is unlinked to any of your traits of interest. These simulated data
are suitable for examining false positive rates in a genome scan and
should allow for quirks of marker informativeness, trait distribution and
The data can be saved to a file with the --save command line option or
analysed with any of the regular Merlin options. Changing the
random seed (with the -r command line option) generates a different set of
founder chromosomes and segregation pattern.
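One common way to turn a set of simulated replicates into a false positive rate is to record the highest score from each replicate scan and compare the observed peak against that distribution. The sketch below assumes you have already collected those per-replicate maxima yourself (for example, by running Merlin once per random seed); the function names are hypothetical and nothing here is produced by Merlin directly.

```python
def empirical_p_value(observed_max, replicate_maxima):
    """Genome-wide empirical p-value for an observed peak score.

    Counts the replicates whose maximum is at least as large as the observed
    peak; the +1 terms keep the estimate away from an impossible zero.
    """
    exceed = sum(1 for score in replicate_maxima if score >= observed_max)
    return (exceed + 1) / (len(replicate_maxima) + 1)

def empirical_threshold(replicate_maxima, alpha=0.05):
    """Score exceeded by at most a fraction alpha of the replicate maxima."""
    ranked = sorted(replicate_maxima, reverse=True)
    k = int(alpha * len(ranked))           # number of replicates allowed above
    return ranked[min(k, len(ranked) - 1)]
```

The precision of both estimates is limited by the number of replicates, so small p-values need correspondingly many simulated datasets.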
Newer versions of Merlin (>1.1) can simulate quantitative traits under the
alternative hypothesis, where trait values are influenced by genotypes at a specific
marker. This option respects the original missing data pattern and allele frequencies
for the trait and SNP genotypes, but introduces association between the two (and
potentially linkage and linkage disequilibrium between the marker and neighboring
SNPs, depending on the map information you provide as input and on whether clustering
of SNPs in linkage disequilibrium is enabled).
Simulation under the alternative is enabled by combining the --simulate and --trait options. An
example set of command line options might be: --simulate --trait BMI,rs9930506,0.01,0.39,0.60. This
would simulate trait BMI such that marker rs9930506 accounts for 0.01 of the variance, with
residual polygenic variance of 0.39 and residual environmental variance of 0.60. For more
details see the MERLIN command-line reference.
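To make the three variance components in the example concrete, the sketch below splits the --trait argument into its fields and converts the marker variance into a per-allele effect size. It assumes a purely additive biallelic marker in Hardy-Weinberg equilibrium, for which the variance explained is 2p(1-p)a^2; whether Merlin uses this parameterization internally is not stated here, and the function names and the allele frequency of 0.5 are hypothetical.

```python
from math import sqrt

def parse_trait_spec(spec):
    """Split 'TRAIT,MARKER,QTL,POLYGENIC,ENVIRONMENT' into named fields."""
    trait, marker, qtl, polygenic, environment = spec.split(",")
    return {
        "trait": trait,
        "marker": marker,
        "qtl": float(qtl),                  # variance due to the marker
        "polygenic": float(polygenic),      # residual polygenic variance
        "environment": float(environment),  # residual environmental variance
    }

def additive_effect(qtl_variance, allele_freq):
    """Per-allele effect a that satisfies 2 * p * (1 - p) * a**2 == qtl_variance."""
    return sqrt(qtl_variance / (2 * allele_freq * (1 - allele_freq)))

spec = parse_trait_spec("BMI,rs9930506,0.01,0.39,0.60")
total = spec["qtl"] + spec["polygenic"] + spec["environment"]
effect = additive_effect(spec["qtl"], allele_freq=0.5)  # 0.5 is an assumed frequency
```

In this example the three components sum to 1, so each can be read directly as a proportion of the total trait variance.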
If you are interested in finding out more about gene-dropping simulations,
two useful references are:
Sawcer S, Jones HB, Judge D, Visser F, Compston A, Goodfellow PN, Clayton D (1997)
Empirical genomewide significance levels established by whole genome simulations.
Genet Epidemiol 14:223-9.
Kruglyak L, Daly MJ (1998)
Linkage thresholds for two-stage genome scans.
Am J Hum Genet 62:994-7.