The SimPed program generates haplotype and/or genotype data for pedigree structures unconditional on disease or quantitative trait status. Data can be generated for either the autosomes or the X chromosome. The pedigrees may be very large (>2,500 individuals) and may contain multiple consanguinity and marriage loops. Haplotypes and their frequencies are user-specified and can be estimated either from the investigator’s data or from other sources such as the International HapMap Project (www.hapmap.org) [14]. For genotype data, allele frequencies must be provided; these can either be estimated from the user’s data or obtained from public sources. Haplotypes, genotypes, or a combination of the two can be generated for >20,000 marker loci, making it possible to simulate an entire chromosome's or genome's worth of marker loci. These loci can be either diallelic markers (e.g. SNPs) or multiallelic markers (e.g. microsatellites). Intermarker recombination fractions or genetic map distances can be incorporated into the simulation of the haplotype and/or genotype data. The user provides these values, obtained from genetic maps [1, 3] or through interpolation. If no genetic map is available for the markers of interest, SNP marker loci can be ordered by their sequence-based physical map position and then interpolated onto a genetic map, for example the Rutgers Combined Linkage-Physical Map [3] or the deCODE genetic map [13, 15].
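The interpolation step mentioned above can be illustrated with a minimal sketch. It assumes simple linear interpolation between the two flanking anchor markers of a reference map; the function name `interpolate_cm` and the anchor positions in the usage note are hypothetical and are not part of SimPed or of any published map.

```python
from bisect import bisect_right


def interpolate_cm(physical_bp, anchors):
    """Linearly interpolate a genetic position (cM) for a physical position (bp).

    anchors: list of (bp, cM) pairs sorted by bp, taken from a reference map.
    This is an illustrative assumption, not the method used by any specific map.
    """
    bps = [bp for bp, _ in anchors]
    if not bps[0] <= physical_bp <= bps[-1]:
        raise ValueError("position lies outside the range of the anchor markers")
    i = bisect_right(bps, physical_bp)
    if i == len(bps):
        # The query sits exactly on the last anchor marker.
        return anchors[-1][1]
    (bp0, cm0), (bp1, cm1) = anchors[i - 1], anchors[i]
    # Position within the flanking interval, scaled to the interval's cM length.
    return cm0 + (cm1 - cm0) * (physical_bp - bp0) / (bp1 - bp0)
```

For instance, with made-up anchors `[(1_000_000, 1.0), (2_000_000, 3.0)]`, a SNP at 1,500,000 bp lies halfway between the two anchors and is placed at 2.0 cM. Differences between the interpolated positions of adjacent markers then give the intermarker map distances supplied to the program.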
The user must provide the SimPed program with two files. The first contains the pedigree structure(s) in standard linkage format (e.g. GENEHUNTER [16]), with or without a disease/quantitative trait locus. Additional column(s) in this file indicate the marker loci for which data are available. The second, the parameter file, contains information on genetic map distances or intermarker recombination fractions, haplotype and allele frequencies, and the number of replicates to be simulated. The format of the parameter file makes it possible to specify genetic map distances and haplotype and allele frequencies efficiently for thousands of marker loci. The SimPed program is flexible: haplotypes/genotypes can be obtained for only a subset of family members, or the genotypes at a subset of markers can be made unknown for specified family members.
The program can be used to simulate data for large pedigrees; for example, both haplotype and genotype data were generated for a six-generation pedigree with 2,827 members, of whom 472 were founders. The SimPed program was also used to generate haplotypes and genotypes for a pedigree with 11 consanguinity loops.
The SimPed program generates haplotype and/or genotype data for pedigrees as follows. For the autosomes, every founder within the pedigree is assigned two haplotypes and/or two alleles at each marker locus, conditional on the user-specified frequencies. Once assignment is completed, each founder has two haplotypes. Starting at the top of the pedigree structure, the first offspring of a founder is randomly assigned one of the founder’s haplotypes, and the allele at the first marker on this haplotype is assigned to the offspring. It is then determined, based upon the genetic map, whether a recombination event has occurred between the first and second marker loci. With probability θ a recombination event has occurred, and the allele at the second marker locus is taken from the founder’s other haplotype. With probability (1 – θ) no recombination event has occurred, and the allele at the second marker locus is taken from the same haplotype. This procedure is repeated until alleles at all marker loci have been assigned from the founder to the offspring. The process is then repeated to assign alleles to the offspring from the other parent.
This procedure varies slightly for the simulation of marker loci on the X chromosome. Since males are hemizygous, each founder male is allocated one haplotype and/or allele conditional on the specified frequencies for all of the marker loci. Once assignment is complete for all marker loci, the haplotype is duplicated, since the standard LINKAGE pedigree file format codes males as homozygous for all genotypes on the X chromosome. The X-chromosome haplotypes of founder females are determined using the same method as for the autosomes. For non-founder males, the positions of recombination events between the two maternal haplotypes are determined as for the autosomes; once the haplotype for the non-founder male is determined, it is duplicated. Female non-founders are assigned the paternal haplotype, and their maternal haplotype is determined in exactly the same way as for the autosomes. In this manner, the haplotypes flow down the pedigree as all non-founders are assigned haplotypes conditional on parental haplotypes. Once all individuals within the pedigree have been assigned haplotypes, the genotypes are made unknown (i.e. 0 0) for those individuals and marker loci that were specified as unavailable.
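The procedure described above can be sketched in a few lines of Python. This is an illustration only, not SimPed's source code: all function names and the data layout (a haplotype as a tuple of alleles, an individual as a pair of haplotypes) are assumptions made for the example. It also assumes that, after a recombination event, transmission continues from the haplotype that was switched to until the next event.

```python
import random


def sample_founder_haplotype(haplotypes, frequencies, rng):
    """Draw one founder haplotype according to the user-specified frequencies."""
    return rng.choices(haplotypes, weights=frequencies, k=1)[0]


def transmit(hap_a, hap_b, thetas, rng):
    """Build the haplotype that a parent passes to one offspring.

    thetas[i] is the recombination fraction between marker i and marker i + 1.
    """
    if len(hap_a) != len(hap_b) or len(thetas) != len(hap_a) - 1:
        raise ValueError("haplotype and recombination-fraction lengths disagree")
    pair = (hap_a, hap_b)
    current = rng.randrange(2)          # random starting haplotype
    gamete = [pair[current][0]]         # allele at the first marker
    for i, theta in enumerate(thetas):
        if rng.random() < theta:        # recombination with probability theta
            current = 1 - current       # switch to the parent's other haplotype
        gamete.append(pair[current][i + 1])
    return tuple(gamete)


def simulate_child(father, mother, thetas, rng, x_linked=False, child_is_male=False):
    """Return (paternal, maternal) haplotypes for a non-founder.

    father and mother are (haplotype, haplotype) pairs. For X-linked markers a
    male is stored with his single haplotype duplicated.
    """
    maternal = transmit(mother[0], mother[1], thetas, rng)
    if x_linked and child_is_male:
        return (maternal, maternal)     # hemizygous son: duplicate maternal haplotype
    if x_linked:
        paternal = father[0]            # daughter receives the father's only X haplotype
    else:
        paternal = transmit(father[0], father[1], thetas, rng)
    return (paternal, maternal)


def mask_unavailable(genotypes, unavailable_markers):
    """Set genotypes at the listed marker indices to unknown, coded (0, 0)."""
    return [(0, 0) if i in unavailable_markers else g
            for i, g in enumerate(genotypes)]
```

Applying `simulate_child` to each non-founder in an order where parents always precede their offspring reproduces the top-down flow of haplotypes through the pedigree; `mask_unavailable` is applied last, once every individual has been assigned haplotypes.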