The BLEU scores (per language pair and system) and the human-judged scores of fluency and content show good, albeit not perfect, correlation between BLEU and human judgment across the studied language pairs. (The BLEU results include two additional scores compared to the human judgments, because we could generate BLEU statistics for English to German and English to Turkish translations while we had no access to bilingual human judges for those translation directions.) These findings corroborate published results from the general-purpose machine translation field: the BLEU score is a viable automatic measure of translation quality in the biomedical domain as well.
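For readers unfamiliar with the metric, the following is a minimal sketch of computing corpus-level BLEU with NLTK; the tokenized title pairs below are invented for illustration, not drawn from our test sets.

```python
# Minimal sketch of corpus-level BLEU, assuming pre-tokenized output.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is scored against a list of reference translations.
references = [
    [["effects", "of", "aspirin", "on", "platelet", "aggregation"]],
    [["a", "randomized", "trial", "of", "statin", "therapy"]],
]
hypotheses = [
    ["effect", "of", "aspirin", "on", "platelet", "aggregation"],
    ["a", "randomized", "study", "of", "statin", "therapy"],
]

# Smoothing avoids zero scores on short titles that miss an n-gram order.
smooth = SmoothingFunction().method1
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"corpus BLEU: {score:.3f}")
```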
On the other hand, while higher BLEU scores indicated that BioMT provided better quality translations for most of the translation directions, statistical significance tests of the human judgments did not support this finding. The judges more frequently scored both fluency and content higher for the output of Google Translate than for BioMT: Google Translate scored higher than BioMT in six translation directions for fluency and eight for content, whereas BioMT scored higher than Google Translate in four directions for fluency and two for content. The fluency and content scores correlated remarkably well; only for the English to French and Polish to English translations did the judges split across systems (higher fluency scores for BioMT but higher content scores for Google Translate). The differences in translation quality scores between the two systems were statistically significant for only two translation pairs (Hungarian to English and Turkish to English).
The “voting” results show a mixed picture for the human judgment scores. The judges “voted” BioMT the better system four times (French to English, English to French, English to Polish, and English to Spanish). Google Translate was “voted” the system with the better translations six times (Hungarian to English, English to Hungarian, Polish to English, Spanish to English, German to English, and Turkish to English). The Google Translate “wins” were more pronounced in absolute terms and were statistically significant (via the chi-square test at p<0.05) in four cases (Hungarian to English, English to Hungarian, Spanish to English, and Turkish to English).
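To make the voting test concrete, here is a minimal sketch of such a chi-square test using SciPy; the vote counts are hypothetical placeholders, not the study's actual tallies.

```python
# Hedged sketch of the chi-square test on judge "votes" for one
# translation direction; counts below are hypothetical.
from scipy.stats import chisquare

votes_biomt, votes_google = 30, 60  # hypothetical judge votes

# Under the null hypothesis the two systems are equally likely to be
# preferred, so the expected counts split the total evenly (the default).
stat, p = chisquare([votes_biomt, votes_google])
print(f"chi2 = {stat:.2f}, p = {p:.4f}")  # p < 0.05 -> significant preference
```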
After aligning the three evaluation methods (BLEU, fluency/content judgments, and system voting), we found that of the 12 BLEU scores only three did not align, at least partially, with the results of at least one of the human judgment methods. For the English to Hungarian, Spanish to English, and German to English translations, the BLEU statistics pointed in the opposite direction from the human judgments. For two translation pairs (English to German and English to Turkish) we have no human judgment data.
Our corpus-size experiments show that as the size of the training corpus increases, so does the BLEU score. This is not surprising, as statistical systems tend to perform better with larger amounts of training data. The data also point to a plateau effect for the translation pairs where the training corpora were large enough to experiment meaningfully with training-data size (English to French, French to English, German to English, and English to German). However, none of the studied translation pairs has reached the plateau yet. It is likely that, as the parallel corpora accumulate, the quality of the machine translation will improve even without further breakthroughs in translation algorithms. This is good news for investigators working on, or planning to work on, biomedical machine translation systems.
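One way to quantify such a plateau is to fit a saturating curve to (corpus size, BLEU) points and inspect its asymptote; the sketch below illustrates this under an assumed exponential-saturation form, with invented data points rather than our measurements.

```python
# Illustrative plateau check: fit BLEU(n) = a * (1 - exp(-b * n))
# to (training-corpus size, BLEU) points; data are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def saturating(n, a, b):
    return a * (1.0 - np.exp(-b * n))

sizes = np.array([50_000, 100_000, 200_000, 400_000, 800_000])  # sentence pairs
bleu = np.array([18.0, 22.5, 26.0, 28.5, 30.0])                 # hypothetical BLEU

(a, b), _ = curve_fit(saturating, sizes, bleu, p0=(35.0, 1e-5))
print(f"estimated ceiling: {a:.1f} BLEU")  # asymptote the curve approaches
```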
Finally, the BLEU scores on the small, 100-title test sets (the sets used for human judgment) correlate exceptionally well with the results from the large (occasionally 600 times larger) test sets, and they allow us to compare BLEU and human judgment on the same test sets.
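Such agreement between small-set and full-set scores can be summarized with a simple correlation coefficient, as in the sketch below; the paired BLEU values are hypothetical placeholders.

```python
# Sketch of quantifying small-set vs. full-set BLEU agreement with a
# Pearson correlation; the paired scores are hypothetical.
from scipy.stats import pearsonr

small_set = [24.1, 31.5, 18.9, 27.3, 22.0]  # hypothetical BLEU, 100-title sets
full_set  = [23.8, 32.0, 19.4, 26.9, 21.5]  # hypothetical BLEU, full test sets

r, p = pearsonr(small_set, full_set)
print(f"Pearson r = {r:.3f} (p = {p:.4f})")
```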
It is noteworthy that building an in-house, high-performance statistical machine translation system that produces results comparable to the state-of-the-art Google Translate (for translating PubMed titles), according to both human judgment and automated BLEU measurements, is relatively straightforward. All the “parts” necessary to build the system are available as open-source components, and the parallel corpora can be leased (free of charge) from the National Library of Medicine. Compared to an off-the-shelf translation service such as Google Translate, which is a black box to the public, the advantages of an in-house machine translation system are enormous. The in-house system is trained on in-domain data (PubMed titles); it can be retrained as more training data become available (which is the case, since the number of PubMed titles increases over time); and, as our corpus-size experiments show, more training data result in better translation performance. In addition, maintaining the in-house system requires minimal effort, as both the collection of the accumulating parallel corpora and the retraining of the system are easy to automate.
One limitation of our research is that we did not have the same parallel corpora across languages, which makes it impossible to compare translation quality across the studied languages. A second limitation is that the judges were untrained in scoring translation output. This limitation is somewhat mitigated by the fact that the translation outputs are intended for “untrained” users (e.g., patients who do not speak English); if a future version of BioMT is deployed, its output will be read and interpreted by “untrained” users. In future research we will address these limitations. We will also develop post-processing steps specific to the biomedical domain to enhance the quality of the translations, and we plan to explore the capabilities of the in-house translation engine for translating PubMed abstracts in addition to titles.