Parallel Corpora for Machine Translation

Parallel corpora are available for nearly 400 language pairs and numerous multilingual combinations, including 10 million quality segments.

The segments consist of corpus-derived, manually curated full sentences and short phrases, together with their translation equivalents. They are based on dictionary usage examples created over the years by expert linguists, lexicographers and translators worldwide.

The data can be used in training language models to boost the performance of Neural Machine Translation engines.

The languages include: Arabic, Chinese (Simplified), Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Italian, Japanese, Korean, Latin, North Sami, Norwegian, Polish, Portuguese (Brazil / Portugal), Russian, Spanish, Swedish, and Turkish.

In addition to general language vocabulary, there are segments for more than one hundred vertical domains.

Datasets

Arabic - Chinese Simplified

Segments:    15,982

Arabic - Portuguese Portugal

Segments:    15,990

Arabic - Danish

Segments:    39,351

Arabic - Dutch

Segments:    39,464

Arabic - English

Segments:    11,917

Arabic - French

Segments:    16,225

Arabic - German

Segments:    57,499

Arabic - Greek

Segments:    16,070

Arabic - Hebrew

Segments:    16,355

Arabic - Italian

Segments:    14,161

Arabic - Japanese

Segments:    18,138

Arabic - Norwegian

Segments:    39,124

Domains


Acoustics
Music


Architecture

Cartography


Chemistry

Pharmacology


Culinary

Drinks


Electricity

Energy


Geography

Geology


Grammar
Linguistics


Literature

Publishing


Military

Police


Theology

Religion


Agriculture

Botany
Environment


Anthropology

Archeology
Philosophy


Culture
History
Politics


Education

School
University


Games
Leisure time & hobbies


Geometry

Mathematics
Statistics


Maritime

Nautical
Oceanography


Mythology

Psychology
Sociology


Journalism

Law
Occupation


Astronomy

Meteorology
Optics
Physics


Clothing

Cosmetics
Dress
Fashion


Radio

Technology
Telephone
Television


Anatomy

Genetics
Health
Medicine
Physiology


Aeronautics

Aviation
Automobiles
Rail
Transportation


Anatomy

Biology
Ecology
Genetics
Physiology
Zoology


Administration

Advertising
Commerce
Economics
Finance
Industry
Marketing


Art

Cinema
Color
Dance
Entertainment
Music
Photography
Theatre

 


Computers

Data
Electronics
Engineering
Informatics
Internet
IT
Technical
Technology
Telecommunication

Construction
Family
Furniture
Hygiene
Measurements & units
Mechanics

Space
Sport
Time
Tourism

Contact

Contact us to ask about our resources and services.