Parallel Corpora For Machine Translation

We offer parallel corpora for nearly 400 language pairs and for numerous multilingual combinations, comprising 10 million bilingual segments with 100 million tokens in twenty-three languages.

The segments consist of manually curated full sentences and short phrases with their translation equivalents. They are based on corpus evidence and frequency, and were originally created by our editors and translators worldwide as usage examples for dictionary entries.

The data can be used to train machine learning models and to boost the performance of neural machine translation engines.

The languages include Arabic, Chinese (Simplified), Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Italian, Japanese, Korean, Latin, North Sami, Norwegian, Polish, Portuguese (Brazilian and European), Russian, Spanish, Swedish, and Turkish.

See all the datasets here.

Besides general language vocabularies, there are segments for a hundred vertical domains.

Datasets

Domains


Acoustics, Music
Architecture, Cartography
Chemistry, Pharmacology
Culinary, Drinks
Electricity, Energy
Geography, Geology
Grammar, Linguistics
Literature, Publishing
Military, Police
Theology, Religion
Agriculture, Botanics, Environment
Anthropology, Archeology, Philosophy
Culture, History, Politics
Education, School, University
Games, Leisure time & hobbies
Geometry, Mathematics, Statistics
Maritime, Nautical, Oceanography
Mythology, Psychology, Sociology
Journalism, Law, Occupation
Astronomy, Meteorology, Optics, Physics
Clothing, Cosmetics, Dress, Fashion
Radio, Technology, Telephone, Television
Anatomy, Genetics, Health, Medicine, Physiology
Aeronautics, Aviation, Automobiles, Rail, Transportation
Anatomy, Biology, Ecology, Genetics, Physiology, Zoology
Administration, Advertising, Commerce, Economics, Finance, Industry, Marketing
Art, Cinema, Color, Dance, Entertainment, Music, Photography, Theatre
Computers, Data, Electronics, Engineering, Informatics, Internet, IT, Technical, Technology, Telecommunication
Construction, Family, Furniture, Hygiene, Measurements & units, Mechanics
Space, Sport, Time, Tourism

Contact

Contact us to access the API, subscribe to Lexicala Review, or ask about our data and services.