In our paper we addressed the research question: “Has machine translation achieved sufficiently high quality to translate PubMed titles for patients?”. We analyzed statistical machine translation output for six foreign language - English translation pairs (bi-directionally). We built a high-performing in-house system and evaluated its output for each translation pair on a large scale, with both automated BLEU scores and human judgment. In addition to the in-house system, we also evaluated Google Translate’s performance specifically within the biomedical domain. We report high performance for German, French and Spanish -- English bi-directional translation pairs for both Google Translate and our system.
One of the aims of patient-centered medicine is to empower patients in medical decision making. According to the US Census Bureau, 18 percent (47 million people) of the US population aged five and over reported in 2000 that they spoke a language other than English at home [1]. To fulfill the promise of patient-centered medicine for non-English-speaking patients in the United States (US), it is immensely valuable to make English-language biomedical text available in foreign languages. The Census Bureau estimates that about 45 million Hispanics live in the United States, and many of them are Spanish-only speakers [2]. Other US residents besides native Spanish speakers would also benefit from accessing biomedical information in their native tongue, even if they can communicate in English.
In addition to patient-centered medicine, clinical trials also require the translation of biomedical text. An increasing number of clinical trials require cross-border and cross-language enrollment in order to have a sufficiently diverse representation of the human gene pool. There is also a growing need to collect and aggregate disease-specific information across countries and continents to achieve a meaningful sample size for rare diseases. In international research, much of the clinical information is locked into free text in different languages. Accessing this information, either automatically with Natural Language Processing tools or by human investigators, is much easier if automated, timely, high-quality and scalable translations are available.
In the past two decades, statistical machine translation (SMT) has become the dominant approach to machine translation (MT) due to its robustness, good performance, and the fact that it does not require manually crafted rules [3]. There are state-of-the-art translation engines that were developed for general translation purposes. One of the most sophisticated publicly accessible machine translation engines is Google Translate [4]. It is unclear whether the Google Translate system has any specific training for the biomedical domain. To our knowledge, our work is the first evaluation of Google Translate for the biomedical domain.
In this paper we present the results of our experiments to evaluate a state-of-the-art general-purpose system (Google Translate) and an in-house statistical machine translation system focused on the biomedical field. In our work we build on the success of publicly released statistical machine translation components and a downloadable parallel biomedical corpus. We evaluate the performance of Google Translate and our system against the human-generated parallel corpora using an automated scoring system. To round out the evaluation process, we employed human annotators to judge the quality of the machine translation systems’ output.
In the “Background” section we will describe the most relevant machine translation efforts in the non-biomedical domain and some of the earlier translation works that are focused on biomedical text. In “Methods” we will provide a detailed description of the task, the data and evaluation approaches. In “Results” we will show the findings from the automated and human translation evaluations. In “Discussion” we will analyze the results and finally we will present the “Conclusions”.
There are two streams of related work that we intend to cover for this paper. First, we will describe a few selected general-purpose machine translation works that are most relevant for our topic and evaluation approaches. Second, we will provide details of the two biomedical focused translation efforts known to us.
There are two types of MT evaluation: human evaluation and automatic evaluation. In human evaluation, bilingual speakers are presented with source sentences and translations produced by an MT system, and asked to judge the fluency and adequacy of the translations on a 1–5 scale [5]. This approach is intuitive and the results are easy to understand. However, human evaluation is slow, labor intensive, expensive, and cannot be reused. It is also subjective and may not be sensitive to small changes in MT quality. Because of these disadvantages, human evaluation cannot be used to monitor the effect of daily changes to an MT system in order to weed out bad ideas from good ideas [6].
To address these limitations, Papineni and his colleagues proposed an automatic measure called BLEU [6]. The main idea behind the measure is that the closer an MT translation is to a professional human translation, the better it is. To calculate the BLEU score, MT translations are compared with reference human translations and n-gram (n=1,2,3,4) precisions are calculated, where n-gram precision is the percentage of word n-grams in an MT translation that also occur in the corresponding human reference translations. The BLEU score is defined as the geometric mean of the n-gram precisions multiplied by a brevity penalty (which penalizes an MT translation that is shorter than the reference translations).
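The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring script used in the study: it assumes whitespace-tokenized input, exactly one reference translation per title, and no smoothing. The function names `ngram_counts` and `corpus_bleu` are hypothetical. Following the original BLEU definition, each n-gram match is clipped to the number of times that n-gram occurs in the reference.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the word n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU for parallel lists of token lists.

    Assumption: one reference per candidate, no smoothing.
    Returns a value between 0 and 1.
    """
    matched = [0] * max_n   # clipped n-gram matches, per order
    total = [0] * max_n     # n-grams produced by the MT system, per order
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_ngrams = ngram_counts(cand, n)
            ref_ngrams = ngram_counts(ref, n)
            matched[n - 1] += sum(
                min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items()
            )
            total[n - 1] += sum(cand_ngrams.values())
    if min(matched) == 0:
        return 0.0  # some n-gram order has no match at all
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: only output shorter than the reference is penalized.
    brevity = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * math.exp(log_precision)
```

With this sketch, a candidate identical to its reference scores 1.0, while a candidate that is a correct but truncated prefix of the reference keeps perfect n-gram precision and is reduced only by the brevity penalty. BLEU is often reported multiplied by 100.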
Very often the human reference translations are already available from various bilingual texts. Compared to human evaluation, calculating BLEU is cheap and quick, and it can be done frequently to assess daily changes to MT systems. BLEU also correlates well with human judgment. Recently, other automatic measures such as TER and METEOR have been proposed [7, 8]. In this paper, because BLEU is the best-known and most commonly used automatic measure in the SMT community, we will use it for evaluation, in addition to human judgment.
We know of only one translation tool, or more accurately a cross-language tool, that was developed specifically for the PubMed text corpus. BabelMeSH was developed for Medline/PubMed [9]. BabelMeSH is a cross-language tool for searching Medline/PubMed articles in the user’s native language. However, the tool is not intended as a full-text translation engine. It focuses only on searching Medical Subject Headings (MeSH) terms in PubMed by utilizing the foreign language entries in the Unified Medical Language System (UMLS) Metathesaurus.
Turner et al. are working on a machine translation system focused on public health documentation [10]. Their goal is to improve the availability of health materials for individuals with Limited English Proficiency, and to develop fundamentally new machine translation technology designed to adapt generic systems to the health care domain. Ultimately, they want to eliminate health disparities caused by language barriers and improve access to pertinent multilingual health information for patients.
In our paper we address the research question: “Has machine translation achieved sufficiently high quality to translate PubMed titles for patients?”. In more general terms, we will answer the question: “Are we there yet?”. Is it possible to start using statistical machine translation systems to generate high-quality, large-scale PubMed title translations? We will evaluate the output of Google Translate and of our in-house system, which we will call BioMT from now on. When we started this research we were specifically interested in the human-judged quality of system-generated translations.
We selected the freely leasable database of Medline/PubMed articles as the foundation of our corpora. The biomedical parallel corpus was constructed from the foreign language titles and their corresponding English translated titles of Medline/PubMed articles. The database had (as of March 2010) over 17 million titles and covered 55 languages with the vast majority being English-only titles.
For the parallel corpora we extracted French, Spanish, German, Hungarian, Turkish and Polish titles and corresponding human English translations. The Medline/PubMed database consists of XML files that include XML tags for foreign titles (so-called Vernacular Titles) and the English translations (so-called Article Titles). Figure 1 presents an example for a German-English title pair.
For the pre-processing steps, we used regular expressions to find the <ArticleTitle> and corresponding <VernacularTitle> tags and extract their contents from the XML files. We randomly split the data into 80% training, 10% test, and 10% development sets. Table 1 shows the descriptive statistics of our parallel training, development and test corpora (after filtering out training sentences that were longer than 40 words).
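The extraction and the random split can be sketched as follows. This is an illustration under stated assumptions rather than the pre-processing script used in the study: it assumes each record is passed in as a separate string, that the two tags carry no attributes, and that the split is driven by a fixed random seed. The helper names `extract_pair` and `split_corpus` and the sample record are hypothetical.

```python
import random
import re

# Assumption: tags appear without attributes, one title of each kind per record.
ARTICLE_TITLE = re.compile(r"<ArticleTitle>(.*?)</ArticleTitle>", re.DOTALL)
VERNACULAR_TITLE = re.compile(r"<VernacularTitle>(.*?)</VernacularTitle>", re.DOTALL)


def extract_pair(record_xml):
    """Return (foreign_title, english_title), or None if either tag is missing."""
    english = ARTICLE_TITLE.search(record_xml)
    foreign = VERNACULAR_TITLE.search(record_xml)
    if english is None or foreign is None:
        return None
    return foreign.group(1).strip(), english.group(1).strip()


def split_corpus(pairs, seed=0):
    """Shuffle the title pairs and split them 80% / 10% / 10%."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_test = len(shuffled) // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    dev = shuffled[n_train + n_test:]  # the remainder is the development set
    return train, test, dev


# Hypothetical record, for illustration only.
record = (
    "<Article><ArticleTitle>Therapy of asthma.</ArticleTitle>"
    "<VernacularTitle>Therapie des Asthmas.</VernacularTitle></Article>"
)
pair = extract_pair(record)
```

Regular expressions are sufficient here because only two flat text fields are needed. A full XML parser would be the safer choice if the titles could contain nested markup.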
To build BioMT we used the Moses toolkit, a state-of-the-art open-source phrase-based SMT system, and followed its step-by-step guide [11]. Moses builds a translator for S=>T (S is the source and T is the target language) in three stages: training, tuning and testing.
At the training stage, Moses first learns word-to-word translation and distortion models from the training data using IBM Models 1–5, then uses the models to find the word alignment between each sentence pair in the training data, and next uses the word alignment to build a phrase table and a reordering model [12]. The phrase table stores the probability that a source phrase translates into a target phrase. “Phrase” in this context refers to a word n-gram, not necessarily a linguistic phrase. A reordering model captures how likely source phrases are to be reordered on the target side. Finally, Moses uses the SRILM package [13] to build an n-gram language model from the target side of the training corpus.
A tri-gram language model was trained on the target side of the parallel training corpus using the SRILM package. The translation and reordering models relied on “intersect” symmetrized word-to-word alignments. That is, word alignment is run in both directions (source to target and target to source), and each resulting alignment can be seen as a set of source word - target word pairs. Moses takes the intersection of the two sets and uses it as the final word alignment.
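The “intersect” symmetrization amounts to a set intersection over alignment links. The sketch below illustrates the idea only and is not Moses code: it assumes each directional alignment is given as a set of word-index pairs, and the function name `intersect_alignments` and the example links are hypothetical.

```python
def intersect_alignments(source_to_target, target_to_source):
    """Keep only the links proposed by both directional alignments.

    source_to_target: set of (source_index, target_index) links
    target_to_source: set of (target_index, source_index) links
    """
    # Flip the second alignment so both sets use (source, target) order.
    flipped = {(s, t) for (t, s) in target_to_source}
    return source_to_target & flipped


# Hypothetical three-word sentence pair.
forward = {(0, 0), (1, 1), (2, 1)}   # source-to-target direction
backward = {(0, 0), (1, 1), (2, 2)}  # target-to-source direction
final_alignment = intersect_alignments(forward, backward)
```

In this example the two directions agree on the first two words but disagree on the third, so the final alignment keeps only the two shared links. Intersection favors precision: it yields fewer but more reliable alignment points than the union of the two sets would.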
The goal of the tuning stage is to learn good weights of the translation, reordering and language models. The tuning is done by running the machine translation system with various weight combinations on a set of new sentences and choosing the combination that produces good translation results (the evaluation is described below in more detail). For tuning we experimented with several tuning sizes: we used the first 200/300/400/500 lines of the development data set.
The last stage is testing (also called decoding). Moses uses the models learned in the training stage and model weights chosen in the tuning stage to translate sentences in the test data. We used the default settings recommended by Moses for the decoder.
To compare the performance of our system and the Google translation engine, we submitted the test corpora (Table 1) for each language pair and each translation direction (that is, Foreign to English and English to Foreign) to Google Translate. We used the publicly available Google Translator API to connect to the Google service [14].
To measure the impact of the size of the training corpus on the performance of the machine translation system, we experimented with different training sizes by changing the number of foreign titles with corresponding human reference translations in the training corpus. We measured the BLEU scores while evaluating on the same test corpus.
We implemented both automated (BLEU score) and human evaluation processes to measure translation quality. While BLEU is a standard method of evaluation in the general field of machine translation, we are aware of the importance of generating linguistically and culturally appropriate translations for patients [15]. In order to measure the cultural and linguistic appropriateness of the translations we hired bilingual and monolingual judges. We had two bilingual judges each for Spanish, Hungarian and Polish, and one bilingual judge for French. We had no bilingual judges for German and Turkish, so English to German and English to Turkish translations were not evaluated. We also hired two monolingual (English-only) judges who evaluated Foreign to English translations across all six languages.
The bilingual judges evaluated the translation quality in both directions (Foreign to English and English to Foreign). The monolingual judges evaluated the quality of the Foreign to English translations only. All judges were untrained in translation evaluation. The monolingual judges were recent BSc (Statistics and Anthropology) graduates. The bilingual judges were all native speakers of the evaluated language and lived in the US. The bilingual judges held a Master’s or doctoral degree as their highest degree. One of the Hungarian judges and the single French judge are co-authors of the paper, but none of the other judges had any involvement in the study other than evaluating the quality of the translations.
For human evaluation, 100 titles were randomly selected from the test set for each language with their corresponding foreign and English reference translations. The corresponding 100 BioMT and Google translations were selected as well. The source titles, the reference (human) translations and the two systems’ (BioMT and Google) outputs were presented to the judges who scored the translation quality for “Fluency” and “Content” on a 1–5 scale (1/worst and 5/best). The judges were also asked to indicate which translation they considered better. Figure 2 demonstrates an example from the French title and translation set with corresponding questions and scores from the judge.
“9_8” indicates the title number in the 100-title set. The example illustrates that the judge gave 5 for “Fluency” to both systems’ output, scored the “Content” 4 and 5, and indicated that the second system provided a better translation. The order in which BioMT’s and Google’s outputs were printed was randomly switched for each of the 100 titles to avoid developing a bias against either “SYS1” or “SYS2” as presented in the files. While the investigators kept track of which system corresponded to Google and which to BioMT, the judges were unaware of this information. The scores were collected with an automated process.
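One way to implement this kind of blinded presentation is sketched below. The paper does not describe the actual procedure, so this is only an assumption-laden illustration: it uses a fair coin flip per title with a fixed seed, and the names `blind_pairs` and `unblind_vote` are hypothetical.

```python
import random


def blind_pairs(biomt_outputs, google_outputs, seed=0):
    """Randomly assign the two systems to the SYS1/SYS2 slots for each title.

    Returns the pairs as shown to the judges plus a private key that
    records which system was printed in which slot.
    """
    rng = random.Random(seed)
    presented, key = [], []
    for biomt, google in zip(biomt_outputs, google_outputs):
        if rng.random() < 0.5:
            presented.append((biomt, google))
            key.append(("BioMT", "Google"))
        else:
            presented.append((google, biomt))
            key.append(("Google", "BioMT"))
    return presented, key


def unblind_vote(vote, key_entry):
    """Map a judge's vote (1 or 2, or 0 for a tie) back to a system name."""
    if vote == 0:
        return None
    return key_entry[vote - 1]
```

Only the investigators hold the key, so a vote of 1 or 2 can be mapped back to BioMT or Google after the judges have returned their files.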
The judges were instructed to evaluate the translation characteristics as follows: “Content: How well the main message of the source sentence is communicated in the translation even if the translation’s fluency is terrible.” and “Fluency: How human like is the translation as a sentence in the target language?”. To answer the last question, “Which is better? Sys1 vs Sys2 (1 vs 2)):” the judges could answer 1 (SYS1 is considered a better translation), 2 (SYS2 is considered a better translation) or 0 (both translations are considered the same quality). Scores of “0” were discarded before running a Chi-square analysis on the scores for the fifth question.
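The vote analysis can be illustrated with a short sketch. The paper does not state the exact form of the Chi-square test, so the code assumes a goodness-of-fit test against an even split between the two systems (one degree of freedom), applied after the “0” votes are dropped. The function name `vote_chi_square` and the vote counts in the example are made up for illustration.

```python
import math


def vote_chi_square(votes):
    """Chi-square goodness-of-fit test of the 1-vs-2 votes against a 50/50 split.

    Votes of 0 (no preference) are discarded first.
    Returns (votes_for_1, votes_for_2, statistic, p_value).
    """
    n1 = sum(1 for v in votes if v == 1)
    n2 = sum(1 for v in votes if v == 2)
    decisive = n1 + n2
    if decisive == 0:
        raise ValueError("no decisive votes to test")
    expected = decisive / 2
    statistic = ((n1 - expected) ** 2 + (n2 - expected) ** 2) / expected
    # For one degree of freedom the chi-square tail probability
    # equals erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return n1, n2, statistic, p_value


# Made-up votes: 30 for SYS1, 50 for SYS2, 20 ties.
n1, n2, statistic, p_value = vote_chi_square([1] * 30 + [2] * 50 + [0] * 20)
```

For these made-up counts the statistic is 5.0, which exceeds the 3.841 critical value for one degree of freedom, so the difference would be significant at p<0.05.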
The following legend applies to each table and figure with language pairs indicated (FtE = French to English, EtF = English to French, HtE = Hungarian to English, EtH = English to Hungarian, PtE = Polish to English, EtP = English to Polish, StE = Spanish to English, EtS = English to Spanish, GtE = German to English, EtG = English to German, TtE = Turkish to English, EtT = English to Turkish).
Table 2 shows the BLEU scores for each pair of translations. To produce the BLEU results shown in Table 2, the BioMT system was trained on all of the available training data. For example, in order for BioMT to achieve a BLEU score of 45.46 for the French to English translation direction (as measured on 55,598 title translations in the French test corpus), the BioMT system was trained on all 443,862 training titles and their corresponding human (reference) translations.
Figure 3 shows the BLEU scores for each language pair for both the Google (G) and the BioMT systems.
Table 3 shows the impact of the training corpus’ size on the BLEU evaluation scores for BioMT.
Including the bidirectional and monolingual scoring we collected 26 text files from the judges. Each file included the original text and the judges’ scores (as presented in Figure 2 above). The judges made five scoring decisions for each of the 100 titles and corresponding translations for each of the files. Altogether the human judges made 13,000 scoring decisions (26*100*5). Table 4 shows the number of scoring decisions for each direction of the studied translations. The number of scoring decisions per language pair depends on the availability of bilingual judges as described in the Methods section.
Table 5 presents the averages of the human judgment scores (monolingual and, where available, bilingual) for the fluency and content of each translation per machine translation system. The table also presents the 95 percent confidence interval boundaries for the means. Boldface in the “Mean” column indicates a statistically significant difference (tested by non-overlapping 95 percent confidence intervals) in favor of the particular system.
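The significance criterion used here can be sketched as follows. The paper does not say how the intervals were computed, so this illustration assumes a normal approximation (mean ± 1.96 standard errors, using the sample standard deviation). The function names `mean_ci95` and `intervals_overlap` and the scores in the example are hypothetical.

```python
import math
import statistics


def mean_ci95(scores):
    """Return (mean, lower, upper) of a 95% confidence interval for the mean.

    Assumption: normal approximation, sample standard deviation.
    """
    mean = statistics.fmean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True if two (mean, lower, upper) intervals share at least one point."""
    return a[1] <= b[2] and b[1] <= a[2]


# Made-up 1-5 judgment scores for two systems.
system_a = mean_ci95([4, 5, 4, 5])
system_b = mean_ci95([1, 2, 1, 2])
significant = not intervals_overlap(system_a, system_b)
```

Requiring non-overlapping intervals is a conservative criterion: two means can differ significantly under a direct two-sample test even when their 95 percent intervals overlap slightly.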
Figure 6 presents the mean fluency and content scores. The results are plotted per translation pair for both systems.
Table 6 presents the judges’ “voting” decisions (SYS1 vs SYS2) in response to the “Which system is better?” question. Bold font indicates a statistically significant difference by the Chi-square test (p<0.05) in favor of a system.
Table 7 presents the BLEU scores for each language pair when the BioMT system is trained on all available data but tested only on the same 100 titles that were used for human evaluation.
In summary, the BioMT system achieved numerically higher BLEU scores in nine translation directions, while Google Translate had numerically higher scores in three. Only the Hungarian-English language pair showed split results between opposite directions of translation. The mean value of the human judges’ decisions was numerically higher for BioMT for “fluency” in four and for “content” in two translation directions. Meanwhile, Google Translate achieved numerically higher “fluency” scores in six and “content” scores in eight cases. Cumulatively, the human judges voted BioMT’s translation the better output in four translation directions and Google Translate’s in six. Statistical significance tests did not always support the numerically higher performance findings. Finally, the results indicate that increasing the size of the training corpus continues to improve the performance of the BioMT system as measured by the automated BLEU score.
Figures 3 (BLEU scores per languages and systems) and 6 (human judged scores of fluency and content) show good albeit not perfect correlation between BLEU and human judgment across the studied language pairs. (Figure 3 presents two additional scores compared to Figure 6, because we could generate BLEU statistics for English to German and English to Turkish translations while we had no access to bilingual human judges for those translation directions.) These findings corroborate published results from the general-purpose machine translation field, that the BLEU score is a viable automatic measure of translation quality in the biomedical domain, as well.
On the other hand, while higher BLEU scores indicated that BioMT provided better quality translations for most of the translation directions (Figure 3 and Table 2), statistical significance tests of the human judgments did not support this finding. Table 5 illustrates that the judges more frequently scored fluency and content higher for the output of Google Translate than for BioMT. Google Translate was scored higher than BioMT for six translations on fluency and eight translations on content. BioMT was scored higher than Google Translate for four translations on fluency and two on content. The fluency and content scores correlated remarkably well. Only for English to French and Polish to English translations did the judges’ scores split across systems (higher fluency scores for BioMT but higher content scores for Google). The differences between the two systems’ translation quality scores were statistically significant for only two translation pairs (Hungarian to English and Turkish to English).
Table 6 shows a mixed picture for the human judgment scores. BioMT was “voted” by the judges four times as the better system (French to English, English to French, English to Polish and English to Spanish). Google was “voted” six times as the system with better translations (Hungarian to English, English to Hungarian, Polish to English, Spanish to English, German to English, and Turkish to English). The Google “wins” were more pronounced (by numeric absolute value) and were found statistically significant (via the Chi-square test at p<0.05) in four cases (Hungarian to English, English to Hungarian, Spanish to English, and Turkish to English).
After aligning the three evaluation methods (Figure 3 and Tables 5 and 6) we found that out of the 12 BLEU scores only three did not align (at least partially) with the results of at least one of the human judgment methods (fluency/content or voting for better system output). For English to Hungarian, Spanish to English and German to English translations, the BLEU statistics pointed in the opposite direction from the human judgments. For two translation pairs (English to German and English to Turkish) we do not have human judgment data.
Table 3 and Figures 4 and 5 show that as the size of the training corpora increases so does the BLEU score. This is not surprising, as statistical systems tend to do better with larger amounts of training data. The data also point to a plateau effect for translation pairs where we had sufficiently large training corpora to experiment meaningfully with the size of the training data (English to French, French to English, German to English, and English to German). However, none of the studied translation pairs has “arrived” at the plateau yet. It is likely that as the parallel corpora accumulate, the quality of the machine translation will improve even without further breakthroughs in the translation algorithms. This is good news for investigators working on or planning to work on biomedical machine translation systems.
Finally, Table 7 shows the BLEU scores on the small test sets with 100 titles, which is the set used for human judgment. The results correlate exceptionally well with results from large (occasionally 600 times larger) test sets, and they allow us to compare BLEU and human judgment on the same test sets.
It is noteworthy that building an in-house high-performance statistical machine translation system that produces results comparable to the state-of-the-art Google Translate (for translating PubMed titles), according to both human judgment and automated BLEU measurements, is relatively straightforward. All the “parts” necessary to build the system are available as open source components. The parallel corpora are also leasable (free of charge) from the National Library of Medicine. Compared to using an off-the-shelf translation system such as Google Translate, which is a black box to the public, the advantages of having an in-house machine translation system are enormous. The in-house system is trained on in-domain data (PubMed titles), it can be re-trained when more training data become available (which is the case as the number of PubMed titles increases over time), and more training data will result in better translation performance, as shown in Table 3. In addition, maintenance of the in-house system requires minimal or no effort, as both the collection of the accumulating parallel corpora and the retraining of the system are easy to automate.
One limitation of our research is that we did not have the same parallel corpora across languages. This makes it impossible to compare translation quality across the studied languages. A second limitation is that the judges were untrained in scoring translation output. However, this limitation is somewhat mitigated by the fact that the translation outputs are intended for “untrained” users (e.g. patients who do not speak English), and if a future version of BioMT is deployed, its output will be read and interpreted by “untrained” users. In future research we will address the limitations mentioned. We will also develop post-processing steps specific to the biomedical domain to enhance the quality of the translations. We plan to explore the capabilities of the in-house translation engine to translate PubMed abstracts, in addition to titles.
In answering our research question, we conclude that “We are almost there for some languages but very far for others”. For languages (German, Spanish and French) with large training corpora already accumulated in PubMed, translating the titles with high quality machine translation is almost a reality. For these languages the average fluency and content (human judgment) scores were all above four on a five-point scale, and in the case of Spanish-English and English-Spanish translations the mean scores were very close to the maximum. For languages with small training corpora, the translation quality was very low. Based on the BLEU statistics we conclude that at the present state of statistical machine translation -- in order to generate high quality translations -- the studied language needs a training corpus with at least 100K lines of parallel reference titles. Furthermore, the results support BLEU as a viable machine translation evaluation approach in the biomedical domain.
Kristina Toutanova from Microsoft Research provided invaluable comments for a pilot version of this work. Dina Demner-Fushman from NLM brought the parallel corpora to our attention. We greatly appreciate their assistance. The project described was supported by Grant Numbers 1K99LM010227-01 and 7R00LM010227-03 from the National Library of Medicine. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Library of Medicine. The authors would like to thank three anonymous reviewers for their suggestions.