Simulation of family genomes
#FAMILY GENOME SIMULATOR
Copyright (C) 2011 Juan Caballero [Insitute for Systems Biology]
bin/personalGenome.pl -g data/hg19.fa.gz -d data/dbSNP132.var.gz -p data/CEU.var.gz -o father.var -s M
bin/personalGenome.pl -g data/hg19.fa.gz -d data/dbSNP132.var.gz -p data/CEU.var.gz -o mother.var
bin/reproduction.pl -f father.var -m mother.var -c girl1.var -r data/hg19.fa.gz
bin/reproduction.pl -f father.var -m mother.var -c boy2.var -s M
cp boy2.var boy3.var # identical twin
bin/noise.pl -i father.var -o father.var.data -s M -g data/hg19.fa.gz
bin/noise.pl -i mother.var -o mother.var.data -s F -g data/hg19.fa.gz
bin/noise.pl -i girl1.var -o girl1.var.data -s F -g data/hg19.fa.gz
bin/noise.pl -i boy1.var -o boy1.var.data -s M -g data/hg19.fa.gz
bin/noise.pl -i boy2.var -o boy2.var.data -s M -g data/hg19.fa.gz
Complete documentation:
perldoc PROGRAM
PROGRAM -h
PROGRAM --help
I split this problem in three steps (programs):
Creating a personal genome (personalGenome.pl): I use three sources for the variation: 1000genomes population variations data, dbSNP and the reference genome. The total number of variations (SNPs for now) is an expected number (~3 mill) with a little noise, or you can fix the number. De novo SNPs can be added (default 1%). You can choose the sex, but for now I not modeling recombination in sexual chromosomes. In the first round it selects population-specific SNPs (~ 1mill because data comes from SNP-chips), the second round take data from dbSNP and in the last round de novo SNPs are created. I integrate diverse options and configurations (no population SNPs, no de novo, homozygous SNPs, etc) to as flexible as possible. The output is a text table with columns: CHROM POSITION REF_ALLELE ALLELE_1 ALLELE_2 ALLELE_INFO
Artificial Sex (reproduction.pl): this script takes a personal genome from a "father" and a "mother", performs meiosis in each genome based in a distribution of recombination points and mix together to produce a "child". You can define the sex, total SNPs, de novo SNPs and other parameters. The output is the same: CHROM POSITION REF_ALLELE ALLELE_1 ALLELE_2 ALLELE_INFO
Noise/Error addition (noise.pl): this part models missing SNPs, bad base calling, half-calling and false positives SNPs as error/noise.
There are some optimization I can do, each programs uses a lot of memory (~4-5 GB), because I need to load the reference genome, but are relatively fast, 5 min to produce a genome and 5 min to produce a child.