|Home | About | Journals | Submit | Contact Us | Français|
A fundamental concept in biology is that heritable material is passed from parents to offspring, a process called vertical gene transfer. An alternative mechanism of gene acquisition is through horizontal gene transfer (HGT), which involves movement of genetic materials between different species. Horizontal gene transfer has been found prevalent in prokaryotes but very rare in eukaryote. In this paper, we investigate horizontal gene transfer in the human genome.
From the pair-wise alignments between human genome and 53 vertebrate genomes, 1,467 human genome regions (2.6 M bases) from all chromosomes were found to be more conserved with non-mammals than with most mammals. These human genome regions involve 642 known genes, which are enriched with ion binding. Compared to known horizontal gene transfer regions in the human genome, there were few overlapping regions, which indicated horizontal gene transfer is more common than we expected in the human genome.
Horizontal gene transfer impacts hundreds of human genes and this study provided insight into potential mechanisms of HGT in the human genome.
The acquisition of genes from an organism other than a direct ancestor, which is called horizontal gene transfer (HGT), is well known in bacteria and unicellular eukaryotes [1–6]. Although the transfer of genes is thought to be crucial in prokaryotic evolution, its existence in higher organisms, including animals, is less well established [7–9]. However, in recent years, more and more instances of horizontal gene transfer have been reported in multicellular eukaryotes, even in humans [7–12]. This phenomenon has been previously reported in bovine genome, which has a high abundance of the BovB sequence of the python and copperhead reptiles. However, the BovB sequence was more similar between cows and reptiles than cows and horses. A proposed mechanism for this phenomenon was horizontal gene transfer between organisms with ticks as the media . Another project discovered the “Space Invaders,” or SPIN elements, that were highly conserved between seven distantly related species from diverse lineages. A proposed mechanism to explain the similarity of sequences between two species with a very distant common ancestor was horizontal gene transfer .
A recent study investigated the possibility of HGT in 26 animal species (10 primates, 12 flies and four nematodes) and a simplified analysis in a further 14 vertebrates. Genome-wide comparative and phylogenetic analyses show that HGT in animals typically gives rise to tens or hundreds of active ‘foreign’ genes, largely concerned with metabolism. This analyses suggest that while fruit flies and nematodes have continued to acquire foreign genes throughout their evolution, humans and other primates have gained relatively few since their common ancestor .
In this study, we analyzed the pair-wise alignment data between the human reference genome hg19  and 53 other vertebrates (41 mammalian vertebrates and 12 non-mammalian vertebrates). We identified potential HGT regions in the human genome (see Methods). Genome annotations and genome-wide signals from recent genomics or epigenetics studies [15, 16] of these sequences have been gathered to infer the functions of these predicted HGT regions, and evolutionary trends of these sequences have been analyzed. This study thus sheds light upon the prevalence of HGT and provide an insight of its underlining mechanisms.
We started with the pair-wise blastz alignments between the human reference genome hg19  and 53 other vertebrates (41 mammalian vertebrates and 12 non-mammalian vertebrates). The alignment data were then linked into chains and formed into an alignment net . More details about the pair-wise alignment data can be found in the UCSC Genome Browser website .
From the pair-wise alignment data, we identified the human genome regions that were conserved between 12 non-mammalian vertebrates and the human genome. These regions were then compared to the mammalian genomes to create HGT regions. A human genome region is called a non-mammal conserved region if this region is aligned to at least two non-mammal genomes with an identity no less than 40% and the aligned human genome region length is longer than 1000 bps. For these non-mammal conserved regions, if a region can be aligned to at most 8 (<20%) of the 41 mammals with an alignment identity ≥40% and length of the aligned region longer than 40% of the human region, this region is defined as a predicted HGT region (Fig. 1).
It’s notable that this set of predicted HGT regions are not comprehensive since some regions may be separated by small gaps between alignment sequences that cause two short, yet close, conserved sequences to be filtered out. In order to check the reliability of HGT predictions, we explored various identity thresholds in the pipeline (40%, 50 and 60%) between vertebrates and the human genome and the length coverage (40%, 20% and 0) between mammals and the human genome to calculate the number of predicted or candidate HGT regions.
To verify the conservation levels of the sequences we extracted, we randomly sampled 50 sequences from our results and verified the sequences based on the existence of homologous sequence in several groups of vertebrates using the University of California Santa Cruz (UCSC) Comparative Genomics track (Primate Chain/Net, Placental Chain/Net, Vertebrate Chain/Net) [17, 19, 20]. We grouped the 53 vertebrates into two bins: the first bin includes all non-human mammals; the second bin includes all non-mammal vertebrates. For each of our samples, we determined if less than eight mammals have homology sequences with ≥40% identity with hg19 and≥40% of the human genome region length covered.
We checked whether our sets of predicted HGT sequences overlapped with known coding genes in ENSEMBL database . We downloaded the Homo sapiens gene data from ENSEMBL database and converted them into bed format file. For each sequences dataset, we intersected the Ensembl Genes with these sequences to discover genes that had any overlap with the predicted HGT regions.
We extracted the transcript IDs from the overlapped Ensembl genes and searched these IDs through the Ensembl Human Gene sets table, in which these genes were classified as protein coding, pseudogenes, etc. .
All involved genes were generated from Ensemble Genes  and enrichment of these genes was calculated using the Database for Annotation, Visualization and Integrated Discovery (DAVID [22, 23], version 6.7) with all human genes as the background level. All items were adjusted with the modified Fisher Exact p-value (EASE score)<0.05. The gene enrichment results were shown in three aspects: 1) molecular functions, 2) biological processes, and 3) cellular components.
Using the Broad chromHMM  Chromatin State track in nine cell lines (Gm12878, H1hesc, Hepg2, Hmec, Hsmm, Huvec, K562, Nhek, Nhlf), we checked the chromatin states of HGT regions. We then compared the percentages of different chromatin states in these regions against the corresponding values in the whole human genome as the background.
We used the RepeatMasker track from the UCSC Genome Browser to generate distributions of the different types of repeats, including SINEs, LINEs, LTRs, DNA repeats and simple repeats . We divided the HGT regions and flanking regions into five regions: w1 (300 bps upstream regions of the HGT regions), w2 (the beginning 300 bps of the HGT regions), w3 (the middle regions of the HGT regions), w4 (the ending 300 bps of the HGT regions), and w5 (300 bps downstream of the HGT regions). We calculated the average proportions of different types of repeats in the five regions of our sequences and drew the average proportions curve of different types of repeats within the HGT regions and 300 bps upstream and downstream of the HGT regions.
We calculated the GC content and drew GC content curve for the HGT regions with a similar approach as the above section we did for repeats.
In order to predict regulatory functions of our sequence, regulatory related histone modification and histone variants on the sequences in H1-hESC cell lines was investigated. One transcription related histone marks: H3K4Me1; and four regulation-related histone marks and variations: H2A.Z, H3K4Me3, H3K9Ac, and H3K27Ac  were applied. In addition, one histone methylation related to polycomb-repression: H3K27Me3 [16, 24] was also included. All these datasets were downloaded from the corresponding tracks in the UCSC genome browser.
We took the phylogenetic tree directly from the Comparative Genomics track “Vertebrate Multiz Alignment & Conservation (100 Species)” [25, 26]. For each HGT region, we extracted the homologous sequences from related vertebrate genomes. Then we aligned these homology sequences by ClustalW and constructed the phylogenetic tree by Maximum Likelihood Method using MEGA6 . We compared this HGT homologous sequence based phylogenetic tree with the species phylogenetic tree and analyzed the differences between two phylogenetic trees. In order to show the predicted HGT events between these species, we used Sprit  to calculate the minimum subtree prune and regraft (SPR) distance between the species phylogenetic tree and the HGT homology sequence based phylogenetic tree.
In order to check the confidence levels of the HGT regions found in the human genome, we checked the possible transfer paths from other species to human. 22 HGT fragments (similarity between vertebrates and hg19 was larger than 60% with lengths at least 1000 bps and no alignment between mammalian genomes and hg19 was allowed) were aligned with blastn to the Nucleotide collection database of National Center for Biotechnology Information (NCBI), which consists of all nucleotide sequences available. We searched the Nucleotide collection database to find the homologous sequences of our HGT fragments in some species, which might be the media species for the HGT progress, such as parasites, bacteria or fungi.
In our pipeline (see Methods), we compared pair-wise alignment block regions between 53 vertebrates and the human genome and extracted those human genome sequences that show greater conservation with non-mammalian vertebrates than with most mammals. With the similarity threshold (40%) and the length coverage (40%) between mammalian genomes and hg19, our pipeline has found 1,467 HGT regions longer than 1,000 bps.
We located these HGT regions on the human chromosomes. The number of HGT regions varies from chromosome to chromosome and most of HGT regions are located on both ends of the chromosomes (Table 1, Fig. 2). We calculated the frequency of HGT regions in the both ends and the middle of the chromosomes, and found the frequency of HGT regions in the both ends of the chromosomes is significantly greater than the middle of the chromosomes (paired t-test p-value<0.001).
Since the HGT regions we predicted here happened after the split of primates from mammals, this discovery leads us to conjecture that the HGT fragments might be more likely to insert on both ends of the chromosomes.
We found 642 Ensembl genes overlapped with predicted HGT regions and the types of genes were identified according to the Ensemble gene annotation. The most prevalent type of genes was “protein coding”, which covers 39.1% of HGT regions, followed by “lincRNA” (6.6%) and “antisense” (2.9%). The non-gene region covers 46.1% of the HGT regions. Background data (the whole human genome) showed that “protein coding” genes cover 40.7% of human genome, followed by “lincRNA” (7.0%), and “antisense” (3.7%). The non-gene regions cover 45.7% of human genome (Table 2, Fig. 3a). Therefore, the gene type composition of overlapped genes is not significantly different with that of the background.
By using the DAVID to calculate the enrichment of overlapped genes, we discovered 435 genes that had a function annotation in the Gene Ontology (GO) database and categorized these into functional groups and analyzed their enrichment in certain functions. With the modified Fisher Exact test (p-value<0.05), we discovered one term of gene functions are significantly enriched. This term is metal ion binding (120 genes) (Table 3). We also discovered that the proportion of cell part genes is greater than that of the background (Fig. 3c, d). Overall, the most significant gene function impacted by HGT regions is ion binding.
Comparing the Chromatin States of predicted HGT regions and human genome, we found that, of the HGT regions, 61% were heterochromatin, 23% were transcription, and 6.8% were repressed. However, in the human genome, 72% were heterochromatin, 20% were transcription, and 1.6% were repressed. (Table 4, Fig. 4). It indicated that HGT regions are more active than the average level of transcription and may play an important functional role in the human genome.
Using the RepeatMasker track, we calculated the percentage of HGT regions that were known repeats, as well as the proportion of each type of repeats (simple repeat, SINE, LINE, etc.). The percentage of repeats (average: 24%) in HGT regions was lower than that of the background region (average: 41%). The most prevalent type of repeats is “simple repeat”, occupying about 10% of the whole HGT regions, followed by “Low complexity repeat” at 4.5%. Background data shows, in human genome, the most prevalent types of repeats are LINE and SINE repeats, with minimal amount of “simple repeats” (Fig. 5a).
Compared to the flanking regions, the predicted HGT regions contains higher percentages of “simple repeat” and “Low complexity repeat”. But for the other types of repeats, such as “LTR”, “SINE” and “LINE” repeats, the percentage in HGT regions is lower than the upstream and downstream regions (Fig. 5b).
We calculated the average percentage of CG content of predicted HGT regions and their 300 bps upstream and downstream regions. The average GC percentage of HGT regions is higher than the upstream and downstream regions (Fig. 5c).
We calculated the average percentage of regulatory related histone modification signals in predicted HGT regions and their 300 bp upstream and downstream regions. Results showed no significant feature of histone modification in HGT regions (Fig. 5d).
Our pipeline can find the HGT regions which have greater conservation with non-mammalian vertebrates than with most mammals. It is also important to note that we allow less than eight mammals containing homologous sequences with the human genome. Indeed, for most HGT regions we have found, some mammals do have homologous sequences with hg19 and most of them are the primates. This phenomenon indicated horizontal gene transfer happened between the ancestor of primates and the non-mammalian vertebrates.
In order to further understand the HGT mechanism, we extracted homologous sequence of the predicted horizontal gene transfer sequences in mammals and non-mammalian species, constructed sequence phylogenetic tree and compared it with species phylogenetic tree. Comparison of two phylogenetic trees showed that, in most cases, two phylogenetic trees were not consistent. Then we compared the tree constructed from predicted HGT regions to the tree constructed from species genomes using the program SPRIT that estimated the number of required subtree prune and regrafts (SPR) to transform one tree into another. It was apparent from the SPRIT output that for most HGT regions, at least three or four SPRs are required to explain the homologous sequences' phylogenetic tree topology. Each SPR may correspond to at least one HGT event, therefore we conclude that at least three or four interspecies HT events have occurred during the evolutionary history of vertebrates. Besides, when we verify the possibility of these HGT events, we found that there were at least one HGT event between the ancestor of all or most primates and the non-mammalian.
With the similarity threshold (60%) and the length coverage (0) between mammalian vertebrates and hg19, we have extracted 22 fragments with lengths more than 1000 bps. These sequences are most likely to be HGT regions. We discovered that one of these fragments, range=chr11:71072901–71074379, has homologues sequence in four primates (Human, Chimp, Gorilla, and Orangutan), one mammal (Chinese hamster), one Lepidosauria (Lizard), and one fish (Lamprey). When we have searched the Nucleotide collection database to find the homologous sequences of our HGT regions in other species, we found that several fishes (such as Cyprinus carpio, Lethenteron camtschaticum etc.) have the homologous sequences, which suggests that fish might be the origin of this HGT region. Comparing the phylogenetic trees built by HGT fragments and species genomes, we discovered at least three predicted HGT events were required to explain the inconsistency. We thought there might be some HGT events happening between fish and Primates, between fish and Lizard or between Primates and Chinese hamster (Fig. 6). Therefore, we concluded that three interspecies HGT events might have occurred during the evolutionary history of this HGT fragment.
We were not able to remove the interpolation of gene lost. However, more evidence can be found to support HGT for these regions we identified. When we searched the Nucleotide collection database to find the homologous sequences of the predicted HGT fragments, we were able to find them in some species, such as parasite, bacteria or fungi, which might be the media species for the HGT progress. Results showed that most of our predicted HGT fragments have homologues sequences in human, fish, bird, poultry and livestock. Among the 22 most reliable predicted HGT regions, 6 have homologues sequences in at least one trematode parasite of human (Table 5), such as Echinostoma and Schistosoma. Both Echinostoma and Schistosoma can infect human and other mammals. Some species of Schistosoma can also infect birds and crocodiles. Other trematodes, which though have no homologous sequence of the predicted HGT fragments, can also infect the fishes . Some of our predicted HGT fragments can be find in primates and some species of fishes and birds. From our Phylogenetic tree analysis, we predicted there might be several HGT events happening among fish, bird, primate and a few mammals, which lead us to think that the fishes and birds might be the origin of the HGT progress and trematode parasites or their common ancestor might be the media species for the HGT procedures.
In our pipeline, we can also change the similarity threshold (40%, 50, 60%) between vertebrates and hg19 and the length coverage (40%, 20%, 0) between mammalian vertebrates and hg19 to calculate the number of sequences and the total bases we extracted for HGT regions. With different threshold values, we have discovered different number of sequences. We have discovered 22 HGT regions when we set the similarity threshold to 60% and length coverage to 0. These 22 sequences are more likely to be the result of HGT events. We verified the sequences based on the existence of homologous sequence in several groups of vertebrates using the University of California Santa Cruz (UCSC) Comparative Genomics (Primate Chain/Net, Placental Chain/Net, Vertebrate Chain/Net), finding 19 sequences have an abnormal conservation. When we lowered the similarity threshold to 40% and raised the length coverage to 40%, we have discovered 1,467 predicted HGT regions. In order to verify the result of 1,467 sequences, we sampled 50 of them to check their conservation levels manually in mammals and non-mammalian species using the (UCSC) Comparative Genomics track (Primate Chain/Net, Placental Chain/Net, Vertebrate Chain/Net). We found 44 of the 50 regions (88%) have an abnormal conservation. Therefore, there might be hundreds or even more than a thousand predicted HGT regions in the human genome.
We also analyzed the HGT potential of five invariant regions among all human genomes. None of our HGT regions overlapped with these five sequences, whose corresponding track displays from the UCSC Genome Browser can be found in Table 6. However, using the UCSC Genome Browser, we discovered that we may have missed the 1st sequence because of the presence of too many mammals which were also conserved. If we allowed a few more (than 8) mammals to be existed in our pipeline, we might find part of this invariant region in our results. We may have missed the 2nd and 3rd sequences because the length of the subsequence which aligns the non-mammal with human is shorter than our threshold, 1000 bps.
There were 145 horizontal transfer genes found in Primates in a previous study . We compared our result with these 145 genes and found that very few (5) genes were in both results. Comparing our pipeline with this previous study, we found that both pipelines extracted those human genome regions that show greater conservation with distant species than with the close species. However, in previous articles, the human genes having greater conservation with non-metazoan than metazoan were identified while our pipeline looked for human genome sequences having greater conservation with non-mammals than mammals. In another words, we looked for HGT events happened more recently than those found in the previous studies. Therefore, very few genes were found in both two results.
According to our previous knowledge of biology, horizontal gene transfer in prokaryotes is a common phenomenon, and the underlying mechanism has been extensively studied. In recent years, the studies of horizontal gene transfer in Eukaryotes has gradually increased, though these findings still remain controversy. At present, these previous studies still have many shortcomings, such as lack of evidences to prove these sequences are indeed derived from the horizontal gene transfer, few studies of underlying mechanism, incapability of finding common features of these HGT regions, and so on.
In this study, we not only identified 1,467 HGT regions in the human genome, but also analyzed the phenomenon of horizontal gene transfer in Eukaryotes. We analyzed the conservation of HGT regions in different Eukaryotic genomes to determine whether these sequences are derived from the horizontal gene transfer or not. The analysis of chromosomal locations of these sequences and repeat contents lead us to find some possible features of these HGT regions. We discovered the genes that had any overlap with our extracted sequences and we found most significant gene functions impact by HGT, by using gene function enrichment analysis. Comparing the phylogenetic tree built by homologous sequences with the species phylogenetic tree, we can predict the process of horizontal gene transfer. Besides, we also discovered the homologous sequences in the common parasite of humans and other eukaryotes. It has been reported that horizontal gene transfer in Eukaryotes need some media species to spread the genomic fragments. And these media species might be intracellular parasitic bacterium (Wolbachia) or some endoparasites (schistosomiasis, Trypanosoma) . One study of Schistosoma genome found that Schistosomiasis share more orthologues with the vertebrates than they do with the ecdysozoans . So the interaction between the host genome and the parasite genome might be a possible step in HGT. We discovered the homologous sequences in the common parasites (Schistosomiasis) of humans and other Eukaryotes, which provided the evidence to prove horizontal gene transfer process we predicted is true.
In general, the majority of the HGT regions discovered by our pipeline is credible. The pipeline we built to analyze horizontal gene transfer in Eukaryotes is also reasonable. According to our previous discussion, we believe that horizontal gene transfer is more common than we expected in the human genome. This study will provide a reference for future study of the horizontal gene transfer in Eukaryotes.
Hundreds of human genes were found to be more conserved with non-mammals than with most mammals. Few of these genes overlapped with known horizontal transferred genes. This indicated that horizontal gene transfer is more common than we expected in the human genome. This study provided insight into potential mechanisms of HGT in the human genome.
This work was supported by grants from the National Natural Science Foundation of China [61272250, 61472246 and J1210047]; the National Basic Research Program of China [2013CB956103]; and the National High-Tech R&D Program (863) [2014AA02150 and 2012AA101601]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
All datasets were taken from publicly available websites. All scripts for this paper have been deposited into Github at https://github.com/huangwenze/HGT-in-Human-Genome and are freely available to all users.
CW conceived and designed the project; WH, LT and YL implemented and tested the system; CW, WH, LT, YL, NH and CS analyzed the data; CW, WH, LT and YL wrote the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Wenze Huang, Email: nc.ude.utjs@eznewgnauh.
Lillian Tsai, Email: ude.dravrah.egelloc@iastnaillil.
Yulong Li, Email: moc.liamtoh@utjsgnoluyil.
Nan Hua, Email: ude.csu@auhn.
Chen Sun, Email: nc.ude.utjs@19enrev.
Chaochun Wei, Email: nc.ude.utjs@iewcc.