Since processes in well-known model organisms have specific features different from those in Bos taurus, the organism under study, a good way to describe gene regulation in ruminant embryos would be a species-specific consideration of closely related species to cattle, sheep and pig. However, as highlighted by a recent report, gene dictionaries in pig are smaller than in cattle, bringing a risk to reduce the gene resources to be mined (and so for sheep dictionaries). Bioinformatics approaches that allow an integration of available information on gene function in model organisms, taking into account their specificity, are thus needed. Besides these closely related and biologically relevant species, there is indeed much more knowledge of (i) trophoblast proliferation and differentiation or (ii) embryogenesis in human and mouse species, which provides opportunities for reconstructing proliferation and/or differentiation processes in other mammalian embryos, including ruminants. The necessary knowledge can be obtained partly from (i) stem cell or cancer research to supply useful information on molecular agents or molecular interactions at work in cell proliferation and (ii) mouse embryogenesis to supply useful information on embryo differentiation. However, the total number of publications for all these topics and species is great and their manual processing would be tedious and time consuming. This is why we used text mining for automated text analysis and automated knowledge extraction. To evaluate the quality of this “mining”, we took advantage of studies that reported gene expression profiles during the elongation of bovine embryos and defined a list of transcription factors (or TF, n = 64) that we used as biological “gold standard”. When successful, the “mining” approach would identify them all, as well as novel ones.
To gain knowledge on molecular-genetic regulations in a non model organism, we offer an approach based on literature-mining and score arrangement of data from model organisms. This approach was applied to identify novel transcription factors during bovine blastocyst elongation, a process that is not observed in rodents and primates. As a result, searching through human and mouse corpuses, we identified numerous bovine homologs, among which 11 to 14% of transcription factors including the gold standard TF as well as novel TF potentially important to gene regulation in ruminant embryo development. The scripts of the workflow are written in Perl and available on demand. They require data input coming from all various databases for any kind of biological issue once the data has been prepared according to keywords for the studied topic and species; we can provide data sample to illustrate the use and functionality of the workflow.
To do so, we created a workflow that allowed the pipeline processing of literature data and biological data, extracted from Web of Science (WoS) or PubMed but also from Gene Expression Omnibus (GEO), Gene Ontology (GO), Uniprot, HomoloGene, TcoF-DB and TFe (TF encyclopedia). First, the human and mouse homologs of the bovine proteins were selected, filtered by text corpora and arranged by score functions. The score functions were based on the gene name frequencies in corpora. Then, transcription factors were identified using TcoF-DB and double-checked using TFe to characterise TF groups and families. Thus, among a search space of 18,670 bovine homologs, 489 were identified as transcription factors. Among them, 243 were absent from the high-throughput data available at the time of the study. They thus stand so far for putative TF acting during bovine embryo elongation, but might be retrieved from a recent RNA sequencing dataset (Mamo et al. , 2012). Beyond the 246 TF that appeared expressed in bovine elongating tissues, we restricted our interpretation to those occurring within a list of 50 top-ranked genes. Among the transcription factors identified therein, half belonged to the gold standard (ASCL2, c-FOS, ETS2, GATA3, HAND1) and half did not (ESR1, HES1, ID2, NANOG, PHB2, TP53, STAT3).
A workflow providing search for transcription factors acting in bovine elongation was developed. The model assumed that proteins sharing the same protein domains in closely related species had the same protein functionalities, even if they were differently regulated among species or involved in somewhat different pathways. Under this assumption, we merged the information on different mammalian species from different databases (literature and biology) and proposed 489 TF as potential participants of embryo proliferation and differentiation, with (i) a recall of 95% with regard to a biological gold standard defined in 2011 and (ii) an extension of more than 3 times the gold standard of TF detected so far in elongating tissues. The working capacity of the workflow was supported by the manual expertise of the biologists on the results. The workflow can serve as a new kind of bioinformatics tool to work on fused data sources and can thus be useful in studies of a wide range of biological processes.