Parallel Corpora

Bilingual and Multilingual Parallel Corpora

Expert-curated parallel corpora covering nearly 400 language pairs and numerous multilingual combinations, for training language models and improving the performance of machine translation engines.


The corpora include bilingual and multilingual segments that consist of corpus-derived, manually curated full sentences and short phrases with their corresponding equivalents in other languages.


These segments are based on dictionary usage examples, created and refined by expert linguists and translators worldwide to illustrate typical language patterns in general language use and in specialized vertical domains.

The languages include: Arabic, Chinese (Simplified), Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Italian, Japanese, Korean, Latin, North Sami, Norwegian, Polish, Portuguese (Brazil / Portugal), Russian, Spanish, Swedish, and Turkish.

In addition to general language vocabulary, there are segments for more than one hundred vertical domains.

Parallel Corpora – Multilingual Sample (sport)

Arabic   تمركز كل المشتركين على خط الانطلاق.
Chinese S.   所有的参赛者都在起跑线上。
Danish   Alle konkurrencedeltagerne står på startlinjen.
Dutch   Alle deelnemers staan aan de start.
English   All the competitors are on the starting line.
French   Tous les concurrents sont sur la ligne de départ.
German   Alle Wettstreiter sind auf der Startlinie.
Greek   Όλοι οι αθλητές είναι στη γραμμή της αφετηρίας.
Hebrew   כל המִתְחָרים עומדים על קו הזינוק.
Italian   Tutti i concorrenti sono sulla linea di partenza.
Japanese   全(すべ)ての選手がスタートラインに立(た)っている。
Norwegian   Alle konkurrentene står på startlinjen.
Polish   Wszyscy rywale są na linii startu.
Portuguese Br.   Todos os competidores estão na linha de partida.
Portuguese Pt.   Todos os concorrentes estão na linha de partida.
Russian   Все уча́стники соревнова́ния собрали́сь на ста́рте.
Spanish   Todos los competidores están en la línea de salida.
Swedish   Alla deltagarna står på startlinjen.
Turkish   Bütün yarışçılar start çizgisinin üstündeler.
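
To illustrate how one multilingual segment relates to the bilingual language pairs listed under Datasets, here is a minimal Python sketch. The dictionary layout and the `bilingual_pairs` helper are assumptions made for illustration only; the actual delivery format of the corpora is not described on this page.

```python
from itertools import combinations

# Hypothetical in-memory form of one multilingual segment:
# language label -> sentence. Not the actual corpus file format.
segment = {
    "English": "All the competitors are on the starting line.",
    "French": "Tous les concurrents sont sur la ligne de départ.",
    "German": "Alle Wettstreiter sind auf der Startlinie.",
}


def bilingual_pairs(segment):
    """Return every unordered language pair with its two aligned sentences."""
    return [
        ((src, tgt), (segment[src], segment[tgt]))
        for src, tgt in combinations(sorted(segment), 2)
    ]


pairs = bilingual_pairs(segment)
for (src, tgt), (src_text, tgt_text) in pairs:
    print(f"{src} - {tgt}: {src_text} | {tgt_text}")
```

Under this assumed layout, a segment available in n languages yields n(n-1)/2 bilingual pairs, so the three sentences above give three pairs.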

Datasets

Arabic - Chinese Simplified

Segments:    15,982

Arabic - Portuguese Portugal

Segments:    15,990

Arabic - Danish

Segments:    39,351

Arabic - Dutch

Segments:    39,464

Arabic - English

Segments:    11,917

Arabic - French

Segments:    16,225

Arabic - German

Segments:    57,499

Arabic - Greek

Segments:    16,070

Arabic - Hebrew

Segments:    16,355

Arabic - Italian

Segments:    14,161

Arabic - Japanese

Segments:    18,138

Arabic - Norwegian

Segments:    39,124

Domains

Lexicala datasets classify word senses into more than 100 domains.


Acoustics
Music


Architecture
Cartography


Chemistry
Pharmacology


Culinary
Drinks


Electricity
Energy


Geography
Geology


Grammar
Linguistics


Literature
Publishing


Military
Police


Theology
Religion


Agriculture
Botanics
Environment


Anthropology
Archeology
Philosophy


Culture
History
Politics


Education
School
University


Games
Leisure time & hobbies


Geometry
Mathematics
Statistics


Maritime
Nautical
Oceanography


Mythology
Psychology
Sociology


Journalism
Law
Occupation


Astronomy
Meteorology
Optics
Physics


Clothing
Cosmetics
Dress
Fashion


Radio
Technology
Telephone
Television


Anatomy
Genetics
Health
Medicine
Physiology


Aeronautics
Aviation
Automobiles
Rail
Transportation


Anatomy
Biology
Ecology
Genetics
Physiology
Zoology


Administration
Advertising
Commerce
Economics
Finance
Industry
Marketing


Art
Cinema
Color
Dance
Entertainment
Music
Photography
Theatre


Computers
Data
Electronics
Engineering
Informatics
Internet
IT
Technical
Technology
Telecommunication


Construction
Family
Furniture
Hygiene
Measurements & units
Mechanics

Space
Sport
Time
Tourism