Background: Chinese clearhead icefish, Protosalanx hyalocranius, is a representative icefish species with economic importance and a distinctive appearance. Because of its great economic value in China, the fish was introduced into Lake Dianchi and several other lakes from Lake Taihu half a century ago. Similar to the Sinocyclocheilus cavefish, the clearhead icefish has certain cavefish-like traits, such as a transparent body and nearly scaleless skin. Here, we sequenced the whole genome of this surface-dwelling fish and generated a draft genome assembly, aiming to explore the molecular mechanisms underlying these biological traits. Findings: A total of 252.1 Gb of raw reads were sequenced. Subsequently, a novel draft genome assembly was generated, with the scaffold N50 reaching 1.163 Mb. The genome completeness was estimated to be 98.39 % by CEGMA evaluation. Finally, we annotated 19,884 protein-coding genes and observed that repeat sequences account for 24.43 % of the genome assembly. Conclusion: We report the first draft genome of the Chinese clearhead icefish. The genome assembly will provide a solid foundation for further molecular breeding and germplasm resource protection in Chinese clearhead icefish, as well as other icefishes. It is also a valuable genetic resource for revealing the molecular mechanisms underlying the cavefish-like characters.
Icefishes (Osmeriformes, Salangidae) are widely distributed in freshwater, coastal, and estuarine habitats in East Asian countries [1–3]. Chinese clearhead icefish (Protosalanx hyalocranius; Fig. 1), a diadromous fish, mainly inhabits coastal areas and adjacent freshwaters [4–6]. As an economically important fish in China, the clearhead icefish was widely introduced into several lakes from its original habitat, Lake Taihu, half a century ago, and it has developed a resident life history in these waters [2, 7, 8]. Because its transparent body and nearly scaleless skin resemble those of the Sinocyclocheilus cavefishes, we are very interested in this surface-dwelling fish and are performing comparative genomics studies to explore the mechanisms underlying these phenotypes. However, with the rapid development of the Chinese economy in recent decades, the population size of the clearhead icefish has declined seriously because of overfishing, construction of water conservancy facilities, and water pollution. To support the sustainable development of this species in China and to reflect its biological and economic importance, here we performed whole genome sequencing of the Chinese clearhead icefish.
In this study, we applied an Illumina whole genome sequencing strategy to sequence the genome of Chinese clearhead icefish (NCBI taxonomy ID: 418454; Fishbase ID: 12236). Genomic DNA was isolated from the muscle tissue of an individual collected from Lake Taihu in Jiangsu Province, China. We constructed seven paired-end libraries, comprising three short-insert libraries (250, 500, and 800 bp) and four long-insert libraries (2, 5, 10, and 20 kb), using the standard protocol provided by Illumina (San Diego, CA, USA). Each library was then subjected to paired-end sequencing on the Illumina HiSeq 2000 platform. In total, we obtained 252.1 Gb of raw reads for further analysis.
The SOAPfilter v2.2 software with optimized parameters (-y -p -g 1 -o clean -M 2 -f 0) was used to remove low-quality raw reads (including reads with 10 or more Ns and low-quality bases), PCR duplicates, and adaptor sequences. In total, we obtained 169.0 Gb of clean reads. Subsequently, we estimated the genome size based on the 17-mer depth frequency distribution method. We applied the following formula to calculate the genome size: G = k_num/k_depth = b_num/b_depth, where k_num is the total number of k-mers from the sequencing data, k_depth is the expected coverage depth for k-mers, b_num is the total number of bases, and b_depth is the expected coverage depth of bases. Because one read of length L generates L-K+1 k-mers, k_num/b_num = (L-K+1)/L. In our current study, k_num was 10,500,000,000 and k_depth was 20. Hence, we estimated the genome size of Chinese clearhead icefish to be 525 Mb.
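The k-mer arithmetic above can be checked with a few lines of Python. This is only a minimal sketch of the stated formula; the function names are illustrative, and the read length of 100 bp used for the base-depth conversion is an assumption, since the read length is not stated in the text.

```python
def estimate_genome_size(k_num: int, k_depth: float) -> float:
    """Genome size G = k_num / k_depth (total k-mers over expected k-mer depth)."""
    if k_depth <= 0:
        raise ValueError("k_depth must be positive")
    return k_num / k_depth


def base_depth_from_kmer_depth(k_depth: float, read_length: int, k: int) -> float:
    """Convert k-mer depth to base depth.

    From k_num / k_depth = b_num / b_depth and k_num / b_num = (L - K + 1) / L,
    it follows that b_depth = k_depth * L / (L - K + 1).
    """
    if read_length < k:
        raise ValueError("read_length must be at least k")
    return k_depth * read_length / (read_length - k + 1)


# Values reported in the text: k_num = 10,500,000,000 and k_depth = 20.
genome_size = estimate_genome_size(10_500_000_000, 20)
print(f"Estimated genome size: {genome_size / 1e6:.0f} Mb")

# Assumed read length of 100 bp (hypothetical; not given in the text), K = 17.
print(f"Implied base depth: {base_depth_from_kmer_depth(20, 100, 17):.1f}x")
```

With the reported values, the first call reproduces the 525 Mb estimate.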
The filtered reads were assembled using SOAPdenovo2 v2.04.4 software with optimized parameters (pregraph -K 79 -d 1; contig -M 1; scaff -F -b 1.5 -p 16) to generate contigs and original scaffolds. The gaps were filled using GapCloser v1.12 software with default parameters and -p set to 25. Finally, we generated a draft genome assembly of 536 Mb, with the scaffold N50 reaching 1.163 Mb (Table 1).
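For readers unfamiliar with the N50 statistic quoted above, the following sketch shows the usual definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. The scaffold lengths in the example are made up for illustration and are not taken from this assembly.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example with hypothetical scaffold lengths (in bp).
toy_scaffolds = [80, 70, 50, 40, 30, 20]
print(n50(toy_scaffolds))  # 70: the two longest scaffolds (150) cover half of 290
```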
The completeness of our assembly was evaluated using both CEGMA and BUSCO. The CEGMA program (Core Eukaryotic Genes Mapping Approach; version 2.4) was run with 248 conserved Core Eukaryotic Genes to evaluate gene space completeness. The assembled genome had CEGMA completeness scores of 90.32 % and 98.39 %, calculated from the complete gene set and the partial gene set, respectively. For BUSCO, we used the representative metazoa gene set, which contains 843 single-copy genes that are widely present in metazoans, as a reference. The BUSCO values were C: 89 % [D: 10 %], F: 7.7 %, M: 2.9 %, n: 843 (C: complete [D: duplicated], F: fragmented, M: missed, n: genes). These data from CEGMA and BUSCO indicate that the assembled genome covers the majority of the gene space.
First, a de novo repeat library was constructed using RepeatModeler v1.05 and LTR_FINDER.x86_64-1.0.6 with default parameters. Then, the assembled genome sequences were aligned against RepBase v21.01 and the de novo repeat libraries to identify known and novel transposable elements using RepeatMasker v4.06. Meanwhile, Tandem Repeat Finder v4.07 with parameters “Match = 2, Mismatch = 7, Delta = 7, PM = 80, PI = 10, Minscore = 50, and MaxPeriod = 2000” was used to annotate tandem repeats. Furthermore, the RepeatProteinMask software v4.0.6 was used to predict transposable element-related proteins in our genome assembly. Overall, repeat sequences account for 24.43 % of the assembled genome (Table 1), and the de novo annotation method identified the largest share of repeat sequences among the four methods (Table 2).
In brief, we used two different methods to predict the total gene set of the clearhead icefish.
AUGUSTUS v2.5 and GENSCAN v1.0 were run to predict genes ab initio within the assembled genome, with the repetitive sequences masked as “N” to avoid predicting pseudogenes. Low-quality gene models with short length (<150 bp), premature termination, or frame shifts were removed. Finally, we identified 23,132 and 21,379 protein-coding genes using AUGUSTUS and GENSCAN, respectively (Table 3).
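The low-quality filter described above can be illustrated with a simplified sketch. The exact criteria used by the authors are not given, so the checks below are assumptions: a coding sequence shorter than 150 bp fails, a length that is not a multiple of three is taken as a proxy for a frame shift, and an in-frame stop codon before the last codon is taken as premature termination. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_quality_filter(cds: str, min_length: int = 150) -> bool:
    """Return True if a predicted coding sequence passes the simplified checks."""
    cds = cds.upper()
    if len(cds) < min_length:
        return False  # too short
    if len(cds) % 3 != 0:
        return False  # assumed proxy for a frame shift
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # A stop codon anywhere except the final position counts as premature.
    return not any(codon in STOP_CODONS for codon in codons[:-1])


# Toy sequences (hypothetical, for illustration only).
good = "ATG" + "GCT" * 50 + "TAA"
premature = "ATG" + "TAA" + "GCT" * 50 + "TAA"
print(passes_quality_filter(good), passes_quality_filter(premature))
```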
We aligned the protein sequences from six published genomes, including Danio rerio, Oryzias latipes, Takifugu rubripes, Tetraodon nigroviridis, Esox lucius, and Gasterosteus aculeatus, against our assembly to predict homology-based genes. The potential homology-based genes were searched by TblastN with an e-value cutoff of 10^-5. The TblastN results were then processed by Sorting Out Local Alignment Result (SOLAR) to obtain the best hit of each alignment. Subsequently, GeneWise v2.2.0 was used to determine the likely gene structure for the best hit of each alignment. Low-quality genes were removed using the same criteria as in the ab initio annotation described above.
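The idea of keeping one best hit per query under an e-value cutoff can be sketched as follows. This is not the SOLAR algorithm, which also chains local alignments; it is a simplified stand-in that assumes the common 12-column tabular BLAST layout (query, subject, identity, length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). The function name and the example records are hypothetical.

```python
def best_hits(lines, max_evalue: float = 1e-5) -> dict:
    """Keep the highest-scoring hit per query among hits passing the e-value cutoff."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical tabular records: protein query against assembly scaffolds.
example = [
    "protA\tscaffold1\t85.0\t200\t30\t0\t1\t200\t1000\t1600\t1e-50\t180.0",
    "protA\tscaffold7\t60.0\t150\t60\t0\t1\t150\t300\t750\t1e-12\t70.0",
    "protB\tscaffold2\t40.0\t50\t30\t0\t1\t50\t10\t160\t0.5\t20.0",
]
print(best_hits(example))
```

In this toy input, protA keeps its scaffold1 hit and protB is dropped because its only hit exceeds the cutoff.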
We employed GLEAN to integrate the predictions into a nonredundant and comprehensive gene set. The protein sequences from the GLEAN results were then aligned to the SwissProt and TrEMBL databases (Uniprot release 2011.06) by BlastP with an e-value cutoff of 10^-5, and the best hit of each protein was retained. Overall, we generated a final gene set with 19,884 genes for the Chinese clearhead icefish (Table 3).
CEGMA was run again to evaluate how well the predicted total gene set covers the eukaryotic orthologous group genes predicted by CEGMA. The predicted gene set mapped 96.4 % of the eukaryotic orthologous groups. BUSCO was also run again to assess the completeness of the predicted gene set. The BUSCO values were as follows: C: 79 % [D: 16 %], F: 9.8 %, M: 10 %, n: 843 (C: complete [D: duplicated], F: fragmented, M: missed, n: genes). The assessment values from both CEGMA and BUSCO indicate high accuracy of the annotation.
The predicted protein sequences of the clearhead icefish were aligned against several public databases (Pfam, PRINTS, ProDom, and SMART) for detection of functional motifs and domains. Finally, we found that 96.2 % of the predicted total gene set had been annotated with at least one functional assignment from other public databases (Swiss-Prot, Interpro, TrEMBL, and KEGG).
We performed phylogenomic analyses with orthologues from representative species for each clade. We used the Ensembl BioMart (www.ensembl.org; Ensembl version 76) to extract orthologues for zebrafish, fugu, stickleback, medaka, and spotted gar. This generated orthologue dataset from six species was filtered to retain only one-to-one orthologues. In addition, a new Asian arowana gene set was taken from our recent work. To extrapolate the BioMart orthologues to the arowana and clearhead icefish gene sets, we used zebrafish as the reference. We ran InParanoid for the three species pairs (zebrafish-arowana and zebrafish-clearhead icefish) at default settings (i.e., a minimum BLASTP score of 40 bits, minimum 50 % alignment span, minimum 25 % alignment coverage, and minimum inparalog confidence level of 0.05). By comparing the three InParanoid outputs, we narrowed down the list of one-to-one orthologues present in all seven species to 454 genes. Multiple alignments were subsequently performed on the proteins of each selected family using MUSCLE (version 3.8.31), and the protein alignments were converted to their corresponding CDS alignments using an in-house perl script (see supporting data). All the converted CDS sequences were concatenated into one “supergene” for each species. Nondegenerate sites extracted from the supergenes were subsequently joined into a new sequence for each species to construct a phylogenetic tree (Fig. 2) using MrBayes (GTR+gamma model, Version 3.2). Our phylogenetic data demonstrate the phylogenetic position of the clearhead icefish (Fig. 2).
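The step of narrowing orthologue tables down to genes with a one-to-one counterpart in every species can be sketched as below. This is an illustrative simplification rather than the authors' pipeline: it assumes each species contributes a list of (reference gene, orthologue) pairs relative to zebrafish, and all gene identifiers and function names are hypothetical.

```python
from collections import Counter


def one_to_one(pairs: list[tuple[str, str]]) -> dict[str, str]:
    """Keep pairs in which both the reference gene and its partner occur exactly once."""
    ref_counts = Counter(ref for ref, _ in pairs)
    other_counts = Counter(other for _, other in pairs)
    return {
        ref: other
        for ref, other in pairs
        if ref_counts[ref] == 1 and other_counts[other] == 1
    }


def shared_orthologues(per_species: dict[str, list[tuple[str, str]]]) -> dict:
    """Return reference genes that have a one-to-one orthologue in every species."""
    maps = {species: one_to_one(pairs) for species, pairs in per_species.items()}
    shared = set.intersection(*(set(m) for m in maps.values()))
    return {ref: {sp: maps[sp][ref] for sp in maps} for ref in sorted(shared)}


# Toy input with made-up gene identifiers; zebrafish genes are the reference.
toy = {
    "icefish": [("zf1", "ice1"), ("zf2", "ice2"), ("zf2", "ice3")],
    "arowana": [("zf1", "aro1"), ("zf2", "aro2"), ("zf3", "aro3")],
}
print(shared_orthologues(toy))  # zf2 is dropped: it has two icefish partners
```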
Genomic homology between the clearhead icefish and Nile tilapia was examined with i-ADHoRe 3.0 using the following settings: alignment method gg2, gap size 30, tandem gap 30, cluster gap 35, q value of 0.85, prob cutoff 0.01, anchor points 5, and multiple hypothesis correction FDR. The output was processed by the pipeline and incorporated into a relational database to which visualization programs can connect and on which additional statistical analysis can be performed. For synteny detection, the cloud mode was enabled (cluster_type = cloud) with the following settings: cloud_gap_size 20, cloud_cluster_gap 20, cloud_filter_method binomial, prob cutoff 0.01, anchor points 5, multiple hypothesis correction FDR, and level_2_only true. Finally, we identified 771 synteny blocks containing 7057 genes between the clearhead icefish and Nile tilapia.
Subsequently, protein sequences of homologous gene pairs in the identified syntenic regions were aligned using MUSCLE, and the protein alignments were then converted to CDS alignments. Finally, fourfold degenerate third-codon transversion (4DTV) values were calculated on these CDS alignments and corrected using the HKY model in the PAML package. These data indicate that the clearhead icefish also experienced the teleost-specific whole genome duplication (Fig. 3).
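To make the 4DTV statistic concrete, the sketch below computes an uncorrected value for one pair of aligned coding sequences: among codons where both sequences share a fourfold degenerate two-base prefix (standard genetic code), it reports the fraction whose third bases differ by a transversion. The HKY correction applied in the text is deliberately not reproduced here, and the function name and toy sequences are hypothetical.

```python
# Two-base codon prefixes whose third position is fourfold degenerate
# in the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}


def raw_4dtv(cds_a: str, cds_b: str) -> float:
    """Uncorrected transversion rate at fourfold degenerate third-codon sites."""
    if len(cds_a) != len(cds_b):
        raise ValueError("aligned sequences must have equal length")
    sites = 0
    transversions = 0
    for i in range(0, len(cds_a) - 2, 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue  # not a shared fourfold degenerate codon
        third_a, third_b = codon_a[2], codon_b[2]
        if third_a not in "ACGT" or third_b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        # A transversion swaps a purine for a pyrimidine or vice versa.
        if (third_a in PURINES) != (third_b in PURINES):
            transversions += 1
    return transversions / sites if sites else float("nan")


# Toy aligned pair: three fourfold sites, one transversion (T vs A in the first codon).
print(raw_4dtv("GCTGGACCCATG", "GCAGGGCCTATG"))
```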
We generated a draft genome assembly of the Chinese clearhead icefish. The novel genome data were deposited in publicly accessible repositories to promote further biological research, molecular breeding, and resource protection of this representative and valuable icefish.
Supporting data and materials are available in the GigaScience GigaDB database, with the raw genome sequences deposited in the SRA under the bioproject number PRJNA328051.
The authors declare that they have no competing interests.
This study was supported by a grant from the Natural Science Foundation of Jiangsu Province (No. BK2012093), fish investigation in Taihu Lake (No. TH2016WT007), National Infrastructure of Fishery Germplasm Resources (No. 2016DKA30470), Basic Research Funds from Freshwater Fisheries Research Center (No. 2013JBFM07), Special Project on the Integration of Industry, Education and Research of Guangdong Province (No. 2013B090800017), Shenzhen Special Program for Future Industrial Development (No. JSGG20141020113728803), and Zhenjiang Leading Talent Program for Innovation and Entrepreneurship.
KL, PX, QS, DX, JX, CB, and ZZ conceived the project. MZ, XY, HY, JC, GX, DF, JQ, SJ, and JH collected the samples and extracted the genomic DNA. JL, CB, and HY performed the genome assembly and data analysis. JL, CB, QS, KL, XP, KL, YY, and ZZ wrote the paper.