title chunk source category tags date_saved instance
Languages of science 8/13 https://en.wikipedia.org/wiki/Languages_of_science reference science, encyclopedia 2026-05-05T03:39:42.114015+00:00 kb-cron

=== Machine translation ===
Scientific publication was the first major use case for machine translation, with early experiments dating back to 1954. Development in this area slowed after 1965 because of the increasing predominance of English, limitations in computing infrastructure, and the shortcomings of the leading approach, rule-based machine translation. By design, rule-based methods favored translation between a few major languages (English, Russian, French, German, ...), since a "transfer module" needed to be developed for "each pair of languages"; this led to a combinatorial explosion as more languages were considered.

After the 1980s, the field of machine translation was revived as it underwent a "full-scale paradigm shift": explicit rules were replaced by statistical and machine learning methods applied to a large aligned corpus. By that time, most of the demand no longer came from scientific publishing but from commercial documents such as technical and engineering manuals.

A second paradigm shift occurred in the 2010s with the development of deep learning methods, which can be partially trained on a non-aligned corpus (i.e., "zero-shot translation"). Because they require little supervision, deep learning models make it possible to incorporate a wider diversity of languages, as well as a wider diversity of linguistic contexts within one language. The results are significantly more accurate than with rule-based machine translation: after 2018, automated translation of PubMed biomedical abstracts was deemed superior to human translation for a few languages (such as Portuguese). Scientific publications are an appropriate use case for neural-network translation models, since they work best "in restricted fields for which it has a lot of training data."
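The combinatorial explosion described for rule-based systems can be made concrete with a small count. The following is a minimal sketch in Python, assuming one transfer module per language pair. The function name `transfer_modules_needed` and the `directed` flag are hypothetical and only illustrate the arithmetic; they do not describe any historical system.

```python
def transfer_modules_needed(num_languages: int, directed: bool = True) -> int:
    """Count the transfer modules needed to cover every language pair.

    Assumption (illustrative): one module per pair. With directed=True,
    A->B and B->A are counted as two separate modules.
    """
    if num_languages < 0:
        raise ValueError("num_languages must be non-negative")
    ordered_pairs = num_languages * (num_languages - 1)
    return ordered_pairs if directed else ordered_pairs // 2


if __name__ == "__main__":
    # The count grows quadratically with the number of languages covered.
    for n in (4, 10, 24):
        print(f"{n} languages -> {transfer_modules_needed(n)} modules")
    # 4 languages -> 12 modules
    # 10 languages -> 90 modules
    # 24 languages -> 552 modules
```

Under this assumption, going from 4 to 24 languages multiplies the number of modules by 46, which gives a sense of why coverage tended to stay limited to a few major languages.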
In 2021, there were "few in-depth studies on the efficiency of Machine Translation in social science and the humanities" since "most research in translation studies are focused on technical, commercial or law texts". Uses of machine translation are especially difficult to estimate, since freely available tools such as Google Translate have become ubiquitous. "There is an emerging yet rapidly increasing need for machine translation literacy among members of the scientific research and scholarly communication communities. Yet in spite of this, there are very few resources to help these community members acquire and teach this type of literacy."

In an academic setting, machine translation serves a variety of uses. The production of written translations remains constrained by a lack of accuracy and, consequently, of efficiency, since post-editing an imperfect machine translation should ideally take less time than producing a human translation. Automated translation of foreign-language text is more widespread in the context of a literature survey or "information assimilation", since the quality requirements are generally lower and a global understanding of a text is sufficient. The impact of machine translation on linguistic diversity in science depends on these uses:

If machine translation for assimilation purposes makes it possible, in principle, for researchers to publish in their own language and still reach a wide audience, then machine translation for dissemination purposes could be seen to favor the opposite and to support the use of a common language for research publication.

The increased use of machine translation has created concerns about "uniform multilingualism". Research in the field has largely focused on English and a few major European languages; "While we live in a multilingual world, this is paradoxically not taken into account by machine translation". English has often been used as a pivot language, serving as a hidden intermediary state in translation between two non-English languages. From a training corpus, probabilistic methods tend to favor the most expected possible translation and to rule out more unusual alternatives. "A common argument against the statistical methods in translation is that when the algorithm suggests the most probable translation, it eliminates alternative options and makes the language of the text so produced conform to well-documented modes of expression." While deep learning models can deal with a wider diversity of language constructs, they can still be limited by collection bias in the original corpus; "the translation of a word can be affected by the prevailing theories or paradigms in the corpus harvested to train the AI".

In its 2022 research assessment of open science, the Council of the European Union welcomed the "promising developments that have recently emerged in the area of automatic translation"; the Council supported a more widespread use of "semi-automatic translation of scholarly publications within Europe" because of its "major potential in terms of market creation".
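The concern raised above, that always suggesting the most probable translation eliminates alternative options, can be illustrated with a toy example. This is a minimal sketch in Python, not a description of any real translation system: the candidate phrases and their probabilities are invented, and the helper names `most_probable` and `sample` are hypothetical.

```python
import random

# Hypothetical candidate translations of one source phrase.
# The probabilities are invented purely for illustration.
candidates = {
    "research findings": 0.70,
    "results of the inquiry": 0.20,
    "fruits of the investigation": 0.10,
}


def most_probable(options: dict) -> str:
    """Return the single highest-probability candidate (argmax selection)."""
    return max(options, key=lambda phrase: options[phrase])


def sample(options: dict, rng: random.Random) -> str:
    """Draw one candidate in proportion to its probability."""
    phrases = list(options)
    weights = [options[p] for p in phrases]
    return rng.choices(phrases, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Argmax selection yields the same phrase on every call.
greedy_outputs = {most_probable(candidates) for _ in range(1000)}

# Sampling can also surface the less expected phrasings.
sampled_outputs = {sample(candidates, rng) for _ in range(1000)}

print(greedy_outputs)  # {'research findings'}
print(sorted(sampled_outputs))
```

In this toy setting the less common phrasings never appear under argmax selection, however many times the phrase is translated, which is the homogenizing effect the quoted argument describes.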

== Open science and multilingualism ==