MicroRNAs (miRNAs) are small non-coding RNAs that can regulate their target genes at the post-transcriptional level. Skeletal muscle comprises different fiber types that can be broadly classified as red, intermediate, and white. Recently, a set of miRNAs was found to be expressed in a fiber type-specific manner in red and white fiber types. However, an in-depth analysis of the miRNA transcriptome differences between all three fiber types has not been undertaken. Herein, we collected 15 porcine skeletal muscles from different anatomical locations, which were then clearly divided into red, white, and intermediate fiber types based on the ratios of myosin heavy chain isoforms. We further illustrated that three muscles, each typically representing one muscle fiber type (i.e., red: peroneal longus (PL), intermediate: psoas major muscle (PMM), white: longissimus dorsi muscle (LDM)), have distinct metabolic patterns of mitochondrial and glycolytic enzyme levels. Furthermore, we constructed small RNA libraries for PL, PMM, and LDM using a deep sequencing approach. Results showed that the differentially expressed miRNAs were mainly enriched in PL and played a vital role in myogenesis and energy metabolism. Overall, this comprehensive analysis will contribute to a better understanding of the miRNA regulatory mechanism that achieves the phenotypic diversity of skeletal muscles.
miRNA; fiber type; pig; myogenesis; energy metabolism
Short-read aligners have recently gained a lot of speed by exploiting the massive parallelism of GPUs. An emerging alternative to the GPU is Intel MIC; the Tianhe-2 supercomputer, currently top of the TOP500, is built with 48,000 MIC boards to offer ~55 PFLOPS. The CPU-like architecture of MIC allows CPU-based software to be parallelized easily; however, the performance is often inferior to GPU counterparts, as an MIC card contains only ~60 cores (while a GPU card typically has over a thousand cores).
To better utilize MIC-enabled computers for NGS data analysis, we developed a new short-read aligner MICA that is optimized in view of MIC's limitation and the extra parallelism inside each MIC core. By utilizing the 512-bit vector units in the MIC and implementing a new seeding strategy, experiments on aligning 150 bp paired-end reads show that MICA using one MIC card is 4.9 times faster than BWA-MEM (using 6 cores of a top-end CPU), and slightly faster than SOAP3-dp (using a GPU). Furthermore, MICA's simplicity allows very efficient scale-up when multiple MIC cards are used in a node (3 cards give a 14.1-fold speedup over BWA-MEM).
MICA can be readily used by MIC-enabled supercomputers for production purposes. We have tested MICA on Tianhe-2 with 90 WGS samples (17.47 Tera-bases), which can be aligned in an hour using 400 nodes. MICA has impressive performance even though MIC is only in its initial stage of development.
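The scaling and throughput figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of MICA itself: the function names are hypothetical, "in an hour" is treated as exactly one hour, and the per-node figure is simply the reported total divided evenly across the 400 nodes.

```python
# Figures quoted in the abstract above.
SPEEDUP_ONE_CARD = 4.9       # MICA on 1 MIC card vs BWA-MEM on 6 CPU cores
SPEEDUP_THREE_CARDS = 14.1   # MICA on 3 MIC cards vs the same baseline
TOTAL_TERABASES = 17.47      # 90 WGS samples aligned on Tianhe-2
NODES = 400
HOURS = 1.0                  # assumption: "in an hour" taken as exactly 1 h


def scaling_efficiency(speedup_one: float, speedup_n: float, n_cards: int) -> float:
    """Fraction of ideal linear scaling achieved when using n_cards."""
    return speedup_n / (speedup_one * n_cards)


def gigabases_per_node_hour(total_terabases: float, nodes: int, hours: float) -> float:
    """Average alignment throughput per node, in gigabases per hour."""
    return total_terabases * 1000.0 / (nodes * hours)


efficiency = scaling_efficiency(SPEEDUP_ONE_CARD, SPEEDUP_THREE_CARDS, 3)
throughput = gigabases_per_node_hour(TOTAL_TERABASES, NODES, HOURS)

print(f"Scaling efficiency with 3 cards: {efficiency:.1%}")
print(f"Average throughput: {throughput:.3f} Gb per node-hour")
```

Under these assumptions, three cards retain roughly 96% of ideal linear scaling, which is consistent with the claim of very efficient scale-up within a node.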
Availability and implementation
MICA's source code is freely available at http://sourceforge.net/projects/mica-aligner under GPL v3.
Supplementary information is available as "Additional File 1". Datasets are available at www.bio8.cs.hku.hk/dataset/mica.
To meet their metabolic needs, starved cells first activate autophagy, but activation in parallel of the general amino acid control pathway increases amino acid uptake, leading to reactivation of mTOR and down-regulation of autophagy.
Organisms have evolved elaborate mechanisms to adjust intracellular nutrient levels in response to fluctuating availability of exogenous nutrients. During starvation, cells can enhance amino acid uptake and synthesis through the general amino acid control (GAAC) pathway, whereas nonessential cellular contents are recycled by autophagy. How these two pathways are coordinated in response to starvation is currently unknown. Here we show that the GAAC pathway couples exogenous amino acid availability with autophagy. Starvation caused deactivation of mTOR, which then activated autophagy. In parallel, serum/glutamine starvation activated the GAAC pathway, which up-regulated amino acid transporters, leading to increased amino acid uptake. This elevated the intracellular amino acid level, which in turn reactivated mTOR and suppressed autophagy. Knockdown of activating transcription factor 4, the major transcription factor in the GAAC pathway, or of SLC7A5, a leucine transporter, caused impaired mTOR reactivation and much higher levels of autophagy. Thus, the GAAC pathway modulates autophagy by regulating amino acid uptake and mTOR reactivation during serum/glutamine starvation.
Cancer cells derived from different stages of tumor progression may exhibit distinct biological properties, as exemplified by the paired lung cancer cell lines H1993 and H2073. While H1993 was derived from a chemo-naive metastasized tumor, H2073 originated from the chemo-resistant primary tumor of the same patient and exhibits a strikingly different drug response profile. To understand the underlying genetic and epigenetic bases for their biological properties, we investigated these cells using a wide range of large-scale methods including whole genome sequencing, RNA sequencing, SNP array, DNA methylation array, and de novo genome assembly. We conducted an integrative analysis of both cell lines to distinguish between potential driver and passenger alterations. Although many genes are mutated in these cell lines, the combination of DNA- and RNA-based variant information strongly implicates a small number of genes including TP53 and STK11 as likely drivers. Likewise, we found a diverse set of genes differentially expressed between these cell lines, but only a fraction can be attributed to changes in DNA copy number or methylation. This set included the ABC transporter ABCC4, implicated in drug resistance, and the metastasis-associated MET oncogene. While the rich data content allowed us to reduce the space of hypotheses that could explain most of the observed biological properties, we also caution that such single-patient case studies lack statistical power and have inherent limitations.
Domesticated organisms have experienced strong selective pressures directed at genes or genomic regions controlling traits of biological, agricultural or medical importance. The genomes of native and domesticated pigs provide a unique opportunity for tracing the history of domestication and identifying signatures of artificial selection. Here we used whole-genome sequencing to explore the genetic relationships among the European native pig Berkshire and breeds that are distributed worldwide, and to identify genomic footprints left by selection during the domestication of Berkshire. Numerous genes containing nonsynonymous SNPs fall into olfactory-related categories, which are part of a rapidly evolving superfamily in the mammalian genome. Phylogenetic analyses revealed a deep phylogenetic split between European and Asian pigs rather than between domestic and wild pigs. Admixture analysis revealed a higher proportion of Chinese genetic material in the Berkshire pigs, which is consistent with the historical record regarding the breed's origin. Selective sweep analyses revealed strong signatures of selection affecting genomic regions that harbor genes underlying economic traits such as disease resistance, pork yield, fertility, tameness and body length. These discoveries confirm the history of the origin of the Berkshire pig by genome-wide analysis and illustrate how domestication has shaped the patterns of genetic variation.
A single–base pair resolution silkworm genetic variation map was constructed from 40 domesticated and wild silkworms, each sequenced to approximately threefold coverage, representing 99.88% of the genome. We identified ∼16 million single-nucleotide polymorphisms, many indels, and structural variations. We find that the domesticated silkworms are clearly genetically differentiated from the wild ones, but they have maintained large levels of genetic variability, suggesting a short domestication event involving a large number of individuals. We also identified signals of selection at 354 candidate genes that may have been important during domestication, some of which have enriched expression in the silk gland, midgut, and testis. These data add to our understanding of the domestication processes and may have applications in devising pest control strategies and advancing the use of silkworms as efficient bioreactors.
Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility of using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes.
5-methylcytosine (mC) can be oxidized by the tet methylcytosine dioxygenase (Tet) family of enzymes to 5-hydroxymethylcytosine (hmC), which is an intermediate of mC demethylation and may also be a stable epigenetic modification that influences chromatin structure. hmC is particularly abundant in mammalian brains but its function is currently unknown. A high-resolution hydroxymethylome map is required to fully understand the function of hmC in the human brain.
We present genome-wide and single-base resolution maps of hmC and mC in the human brain by combined application of Tet-assisted bisulfite sequencing and bisulfite sequencing. We demonstrate that hmCs increase markedly from the fetal to the adult stage, and in the adult brain, 13% of all CpGs are highly hydroxymethylated with strong enrichment at genic regions and distal regulatory elements. Notably, hmC peaks are identified at the 5′ splice sites at the exon-intron boundary, suggesting a mechanistic link between hmC and splicing. We report a surprising transcription-correlated hmC bias toward the sense strand and an mC bias toward the antisense strand of gene bodies. Furthermore, hmC is negatively correlated with H3K27me3-marked and H3K9me3-marked repressive genomic regions, and is more enriched at poised enhancers than active enhancers.
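The combination of the two assays can be illustrated for a single cytosine. Conventional bisulfite sequencing reads both mC and hmC as C, whereas Tet-assisted bisulfite sequencing reads only hmC as C, so the mC level can be approximated as the difference between the two. The sketch below is a simplified illustration under that assumption and is not the pipeline used in the study: the class and function names are hypothetical, the read counts are made-up placeholders, and real analyses additionally correct for conversion and protection efficiencies and apply statistical tests per site.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Read counts at one cytosine position (hypothetical container)."""
    c_reads: int  # reads showing C (base protected from conversion)
    t_reads: int  # reads showing T (base converted)

    def level(self) -> float:
        """Fraction of reads that still show C at this position."""
        total = self.c_reads + self.t_reads
        if total == 0:
            raise ValueError("no coverage at this site")
        return self.c_reads / total


def estimate_levels(bs: SiteCounts, tab: SiteCounts) -> dict:
    """Naive per-site estimates from the two assays.

    BS-seq level  ~ mC + hmC   (both read as C)
    TAB-seq level ~ hmC        (only hmC reads as C)
    """
    modified = bs.level()
    hmc = tab.level()
    mc = max(0.0, modified - hmc)  # clamp: sampling noise can push this below 0
    return {"hmC": hmc, "mC": mc, "unmodified": 1.0 - modified}


# Placeholder counts for illustration only; not data from the study.
example = estimate_levels(bs=SiteCounts(c_reads=16, t_reads=4),
                          tab=SiteCounts(c_reads=5, t_reads=15))
print(example)
```

The subtraction is only meaningful where both libraries have adequate coverage at the same site, which is one reason deep sequencing of both libraries matters for base-resolution maps.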
We provide single-base resolution hmC and mC maps in the human brain and our data imply novel roles of hmC in regulating splicing and gene expression. Hydroxymethylation is the main modification status for a large portion of CpGs situated at poised enhancers and actively transcribed regions, suggesting its roles in epigenetic tuning at these regions.
To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million nonredundant microbial genes, derived from 576.7 Gb sequence, from faecal samples of 124 European individuals. The gene set, ~150 times larger than the human gene complement, contains an overwhelming majority of the prevalent microbial genes of the cohort and likely includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, suggesting that the entire cohort harbours between 1000 and 1150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions encoded by the gene set.
Residents of the Tibetan Plateau show heritable adaptations to extreme altitude. We sequenced 50 exomes of ethnic Tibetans, encompassing coding sequences of 92% of human genes, with an average coverage of 18X per individual. Genes showing population-specific allele frequency changes, which represent strong candidates for altitude adaptation, were identified. The strongest signal of natural selection came from EPAS1, a transcription factor involved in response to hypoxia. One SNP at EPAS1 shows a 78% frequency difference between Tibetan and Han samples, representing the fastest allele frequency change observed at any human gene to date. This SNP’s association with erythrocyte abundance supports the role of EPAS1 in adaptation to hypoxia. Thus, a population genomic survey has revealed a functionally important locus in genetic adaptation to high altitude.
Meiotic recombination creates genetic diversity and ensures segregation of homologous chromosomes. Previous population analyses yielded results averaged among individuals and impacted by evolutionary pressures. Here we sequenced 99 sperm from an Asian male using the newly developed amplification method, Multiple Annealing and Looping-Based Amplification Cycles (MALBAC), to phase the personal genome and map at high resolution recombination events, which are non-uniformly distributed across the genome in the absence of selection pressure. The paucity of recombination near transcription start sites observed in individual sperm indicates such a phenomenon is intrinsic to the molecular mechanism of meiosis. Interestingly, a decreased crossover frequency accompanied by an increase in autosomal aneuploidy is observable on a global per-sperm basis.
To tackle the exponentially increasing throughput of Next-Generation Sequencing (NGS), most of the existing short-read aligners can be configured to favor speed at the expense of accuracy and sensitivity. SOAP3-dp, through leveraging the computational power of both CPU and GPU with optimized algorithms, delivers high speed and sensitivity simultaneously. Compared with widely adopted aligners including BWA, Bowtie2, SeqAlto, CUSHAW2, GEM and GPU-based aligners BarraCUDA and CUSHAW, SOAP3-dp was found to be two to tens of times faster, while maintaining the highest sensitivity and lowest false discovery rate (FDR) on Illumina reads with different lengths. Transcending its predecessor SOAP3, which does not allow gapped alignment, SOAP3-dp by default tolerates alignment similarity as low as 60%. Real data evaluation using the human genome demonstrates SOAP3-dp's power to enable more authentic variants and longer indels to be discovered. Fosmid sequencing shows a 9.1% FDR on newly discovered deletions. SOAP3-dp natively supports the BAM file format and provides the same scoring scheme as BWA, which enables it to be integrated into existing analysis pipelines. SOAP3-dp has been deployed on Amazon-EC2, NIH-Biowulf and Tianhe-1A.
Previous studies have indicated two main domestic pig dispersal routes in East Asia: one is from the Mekong region, through the upstream region of the Yangtze River (URYZ) to the middle and upstream regions of the Yellow River; the other is from the middle and downstream regions of the Yangtze River to the downstream region of the Yellow River, and then to northeast China. The URYZ was regarded as a passageway of the former dispersal route; however, this assumption remains to be further investigated. We therefore analyzed the hypervariable segments of mitochondrial DNA from 513 individual pigs mainly from Sichuan and the Tibet highlands and 1,394 publicly available sequences from domestic pigs and wild boars across Asia. In the phylogenetic tree, most of the samples fell into a mixed group that was difficult to distinguish by breed or geography. The total network analysis showed that the URYZ pigs possessed a dominant position in haplogroup A and domestic pigs shared the same core haplotype with the local wild boars, suggesting that pigs in group A were most likely derived from the URYZ pool. In addition, a region-wise network analysis determined that the URYZ contains 42 haplotypes, of which 22 are unique, indicating the high diversity in this region. In conclusion, our findings confirmed that pigs from the URYZ were domesticated in situ.
It is evident that epigenetic factors, especially DNA methylation, play essential roles in obesity development. Using the pig as a model, here we investigated the systematic association between DNA methylation and obesity. We sampled eight different adipose and two distinct skeletal muscle tissues from three pig breeds living within comparable environments but displaying distinct fat levels. We generated 1,381 gigabases (Gb) of sequence data from 180 methylated DNA immunoprecipitation (MeDIP) libraries, and provided a genome-wide DNA methylation map as well as a gene expression map for adipose and muscle studies. The analysis showed global similarities and differences among breeds, sexes and anatomic locations, and identified the differentially methylated regions (DMRs). The DMRs in promoters are highly associated with obesity development via expression repression of both known obesity-related genes and novel genes. This comprehensive map provides a solid basis for exploring epigenetic mechanisms of adipose deposition and muscle growth.
It is well established that the metabolic risk factors of obesity and its comorbidities are attributable more to adipose tissue distribution than to total adipose mass. Since emerging evidence suggests that epigenetic regulation plays an important role in the aetiology of obesity, we conducted a genome-wide methylation analysis on eight different adipose depots of three pig breeds living within comparable environments but displaying distinct fat levels using methylated DNA immunoprecipitation sequencing. We aimed to investigate the systematic association between anatomical location-specific DNA methylation status of different adipose depots and obesity-related phenotypes. We show here that compared to subcutaneous adipose tissues, which primarily modulate metabolic indicators, visceral adipose tissues and intermuscular adipose tissue, which are the metabolic risk factors of obesity, are primarily associated with impaired inflammatory and immune responses. This study presents epigenetic evidence for functionally relevant methylation differences between different adipose depots.
pig; subcutaneous adipose tissue; visceral adipose tissue; DNA methylation; MeDIP-seq
Genomic structural variants (SVs) are abundant in humans, differing from other variation classes in extent, origin, and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (i.e., copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analyzing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.
RNA-Seq, a method using next generation sequencing technologies to sequence the transcriptome, facilitates genome-wide analysis of splice junction sites. In this paper, we introduce SOAPsplice, a robust tool to detect splice junctions using RNA-Seq data without using any information of known splice junctions. SOAPsplice uses a novel two-step approach consisting of first identifying as many reasonable splice junction candidates as possible, and then, filtering the false positives with two effective filtering strategies. In both simulated and real datasets, SOAPsplice is able to detect many reliable splice junctions with low false positive rate. The improvement gained by SOAPsplice, when compared to other existing tools, becomes more obvious when the depth of sequencing is low. SOAPsplice is freely available at http://soap.genomics.org.cn/soapsplice.html.
RNA-Seq; splice junction; spliced alignment
Myopia is the most common ocular disorder worldwide, and high myopia in particular is one of the leading causes of blindness. Genetic factors play a critical role in the development of myopia, especially high myopia. Recently, the exome sequencing approach has been successfully used for the disease gene identification of Mendelian disorders. Here we show a successful application of exome sequencing to identify a gene for an autosomal dominant disorder, and we have identified a gene potentially responsible for high myopia in a monogenic form. We captured exomes of two affected individuals from a Han Chinese family with high myopia and performed sequencing analysis by a second-generation sequencer with a mean coverage of 30× and sufficient depth to call variants at ∼97% of each targeted exome. The shared genetic variants of these two affected individuals in the family being studied were filtered against the 1000 Genomes Project and the dbSNP131 database. A mutation A672G in zinc finger protein 644 isoform 1 (ZNF644) was identified as being related to the phenotype of this family. After we performed sequencing analysis of the exons in the ZNF644 gene in 300 sporadic cases of high myopia, we identified an additional five mutations (I587V, R680G, C699Y, 3′UTR+12 C>G, and 3′UTR+592 G>A) in 11 different patients. All these mutations were absent in 600 normal controls. The ZNF644 gene was expressed in human retinal and retinal pigment epithelium (RPE). Given that ZNF644 is predicted to be a transcription factor that may regulate genes involved in eye development, mutation may cause the axial elongation of eyeball found in high myopia patients. Our results suggest that ZNF644 might be a causal gene for high myopia in a monogenic form.
People with myopia see near objects more clearly than objects far away. Myopia is the most common ocular disorder worldwide, with a high prevalence in Asian (40%–70%) and Caucasian (20%–30%) populations. Although the etiologies of myopia have not yet been established, previous studies have indicated the involvement of genetic and environmental factors (such as close working habits, higher education levels, and higher socioeconomic class). Genetic factors play a critical role in the development of myopia, especially high myopia. In this study, we use exome sequencing, a powerful tool for disease gene identification, to identify a gene involved in high myopia in a monogenic form among Han Chinese. Mutations in zinc finger protein 644 isoform 1 (ZNF644) were identified as potentially responsible for the phenotype of high myopia. The main feature of high myopia is axial elongation of the eye globe. Given that ZNF644 is predicted to be a transcription factor that may regulate genes involved in eye development, a mutant ZNF644 protein may impact normal eye development and therefore may underlie the axial elongation of the eye globe in high myopia patients. Further study of the biological function of ZNF644 will provide insight into the pathogenesis of myopia.
Analysis across the genome of patterns of DNA methylation reveals a rich landscape of allele-specific epigenetic modification and consequent effects on allele-specific gene expression.
DNA methylation plays an important role in biological processes in human health and disease. Recent technological advances allow unbiased whole-genome DNA methylation (methylome) analysis to be carried out on human cells. Using whole-genome bisulfite sequencing at 24.7-fold coverage (12.3-fold per strand), we report a comprehensive (92.62%) methylome and analysis of the unique sequences in human peripheral blood mononuclear cells (PBMC) from the same Asian individual whose genome was deciphered in the YH project. PBMC constitute an important source for clinical blood tests world-wide. We found that 68.4% of CpG sites and <0.2% of non-CpG sites were methylated, demonstrating that non-CpG cytosine methylation is minor in human PBMC. Analysis of the PBMC methylome revealed a rich epigenomic landscape for 20 distinct genomic features, including regulatory, protein-coding, non-coding, RNA-coding, and repeat sequences. Integration of our methylome data with the YH genome sequence enabled a first comprehensive assessment of allele-specific methylation (ASM) between the two haploid methylomes of any individual and allowed the identification of 599 haploid differentially methylated regions (hDMRs) covering 287 genes. Of these, 76 genes had hDMRs within 2 kb of their transcriptional start sites of which >80% displayed allele-specific expression (ASE). These data demonstrate that ASM is a recurrent phenomenon and is highly correlated with ASE in human PBMCs. Together with recently reported similar studies, our study provides a comprehensive resource for future epigenomic research and confirms new sequencing technology as a paradigm for large-scale epigenomics studies.
Epigenetic modifications such as addition of methyl groups to cytosine in DNA play a role in regulating gene expression. To better understand these processes, knowledge of the methylation status of all cytosine bases in the genome (the methylome) is required. DNA methylation can differ between the two gene copies (alleles) in each cell. Such allele-specific methylation (ASM) can be due to parental origin of the alleles (imprinting), X chromosome inactivation in females, and other as yet unknown mechanisms. This may significantly alter the expression profile arising from different allele combinations in different individuals. Using advanced sequencing technology, we have determined the methylome of human peripheral blood mononuclear cells (PBMC). Importantly, the PBMC were obtained from the same male Han Chinese individual whose complete genome had previously been determined. This allowed us, for the first time, to study genome-wide differences in ASM. Our analysis shows that ASM in PBMC is higher than can be accounted for by regions known to undergo parent-of-origin imprinting and frequently (>80%) correlates with allele-specific expression (ASE) of the corresponding gene. In addition, our data reveal a rich landscape of epigenomic variation for 20 genomic features, including regulatory, coding, and non-coding sequences, and provide a valuable resource for future studies. Our work further establishes whole-genome sequencing as an efficient method for methylome analysis.
SilkDB is an open-access database for genome biology of the silkworm (Bombyx mori). Since the draft sequence was completed and SilkDB was first released 5 years ago, we have collaborated with other groups to make remarkable progress in silkworm genome research, such as the completion of a new high-quality assembly of the silkworm genome sequence as well as the construction of a genome-wide microarray to survey gene expression profiles. To accommodate these new genomic data and house more comprehensive genomic information, we have reconstructed the SilkDB database with new web interfaces. In the new version (v2.0) of SilkDB, we updated the genomic data, including genome assembly, gene annotation, chromosomal mapping and orthologous relationships, and the experimental data, such as microarray expression data, Expressed Sequence Tags (ESTs) and corresponding references. Several new tools, including SilkMap, Silkworm Chromosome Browser (SCB) and BmArray, have been developed to access silkworm genomic data conveniently. SilkDB is publicly available at the new URL of http://www.silkdb.org.
Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics.
The YH database is a server that allows the user to easily browse and download data from the first Asian diploid genome. The aim of this platform is to facilitate the study of this Asian genome and to enable improved organization and presentation of large-scale personal genome data. Powered by GBrowse, we illustrate here the genome sequences, SNPs, and sequencing reads in the MapView. The relationships between phenotype and genotype can be searched by location, dbSNP ID, HGMD ID, gene symbol and disease name. A BLAST web service is also provided for aligning query sequences against the YH genome consensus. The YH database is currently one of the three personal genome databases, organizing the original data and analysis results in a user-friendly interface, which is an endeavor to achieve fundamental goals for establishing personal medicine. The database is available at http://yh.genomics.org.cn.
Gene conversion causes a non-reciprocal transfer of genetic information between similar sequences. Gene conversion can both homogenize genes and recruit point mutations thereby shaping the evolution of multigene families. In the rice genome, the large number of duplicated genes increases opportunities for gene conversion.
To characterize gene conversion in rice, we have defined 626 multigene families in which 377 gene conversions were detected using the GENECONV program. Over 60% of the conversions we detected were between chromosomes. We found that inter-chromosomal conversions between chromosomes 1 and 5, 2 and 6, and 3 and 5 are more frequent than the genome average (Z-test, P < 0.05). The frequency of gene conversion on the same chromosome decreased with the physical distance between gene conversion partners. Ka/Ks analysis indicates that gene conversion is not tightly linked to natural selection in the rice genome. To assess the contribution of segmental duplication to gene conversion statistics, we determined the locations of conversion partners with respect to inter-chromosomal segmental duplications. Fewer than ten percent of the conversions are associated with segmental duplication. Pseudogenes in the rice genome with low similarity to Arabidopsis genes showed a greater likelihood of gene conversion than those with high similarity to Arabidopsis genes. Functional annotations suggest that at least 14 multigene families related to disease or bacterial resistance were involved in conversion events.
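The abstract does not describe how the Z-test was set up, so the following is only one plausible formulation, given as a sketch: the conversion count for one chromosome pair is standardized against the mean and sample standard deviation of the counts over all pairs, and a one-sided P value is taken from the standard normal distribution. The function name is hypothetical and the counts are made-up placeholders, not values from the study.

```python
import math
import statistics


def one_sided_z_test(target_count: float, all_counts: list) -> tuple:
    """Z-score of target_count against all_counts, with an upper-tail P value.

    Assumes the counts are roughly normally distributed, which is a
    simplification for small count data.
    """
    mean = statistics.mean(all_counts)
    sd = statistics.stdev(all_counts)  # sample standard deviation
    if sd == 0:
        raise ValueError("all counts are identical; Z-score is undefined")
    z = (target_count - mean) / sd
    p_upper = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal
    return z, p_upper


# Placeholder conversion counts per chromosome pair (illustration only).
hypothetical_counts = {
    ("chr1", "chr2"): 2,
    ("chr1", "chr3"): 3,
    ("chr1", "chr4"): 1,
    ("chr2", "chr3"): 4,
    ("chr2", "chr4"): 2,
    ("chr3", "chr4"): 3,
    ("chr4", "chr5"): 2,
    ("chr1", "chr5"): 12,
}

z, p = one_sided_z_test(hypothetical_counts[("chr1", "chr5")],
                        list(hypothetical_counts.values()))
print(f"z = {z:.2f}, one-sided P = {p:.4f}")
```

In this sketch the tested pair is included when computing the mean and standard deviation, which makes the test conservative; excluding it, or modelling the counts with a Poisson distribution, would be reasonable alternatives.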
The evolution of gene families in the rice genome may have been accelerated by conversion with pseudogenes. Our analysis suggests a possible role for gene conversion in the evolution of pathogen-response genes.
TreeFam (http://www.treefam.org) was developed to provide curated phylogenetic trees for all animal gene families, as well as orthologue and paralogue assignments. Release 4.0 of TreeFam contains curated trees for 1314 families and automatically generated trees for another 14 351 families. We have expanded TreeFam to include 25 fully sequenced animal genomes, as well as four genomes from plant and fungal outgroup species. We have also introduced more accurate approaches for automatically grouping genes into families, for building phylogenetic trees, and for inferring orthologues and paralogues. The user interface for viewing phylogenetic trees and family information has been improved. Furthermore, a new Perl API lets users easily extract data from the TreeFam MySQL database.