PARALLEL CORPORA FOR AI

Parallel corpora are available for over 350 language pairs and numerous multilingual combinations, with 9 million bilingual segments and 90 million tokens across twenty languages.

The segments are manually curated full sentences and short phrases paired with their translation equivalents. Our editors and translators worldwide originally created them as usage examples for dictionary entries, selecting them on the basis of corpus evidence and frequency.

Language service providers can use the data to train machine learning models and to improve their neural machine translation solutions.

The languages include Arabic, Chinese (Simplified), Danish, Dutch, English, French, German, Greek, Hebrew, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Brazilian and European), Russian, Spanish, Swedish, and Turkish.

In addition to general-language vocabulary, there are parallel corpora for about a hundred specialized domains, listed below.

DOMAINS

Administration
Advertising
Aeronautics
Agriculture
Anatomy
Anthropology
Archeology
Architecture
Art
Astrology
Astronomy
Automobiles
Aviation
Biology
Botany
Cartography
Chemistry
Cinema
Clothing
Color
Commerce
Computers
Construction
Cosmetics
Culinary
Culture
Dance
Data
Dress
Drinks
Drugs
Ecology
Economics
Education
Electricity
Electronics
Energy
Engineering
Entertainment
Environment
Family
Fashion
Finance
Furniture
Games
Genetics
Geography
Geology
Geometry
Grammar
Health
History
Hygiene
Industry
Informatics
Internet
IT
Journalism
Law
Leisure time and hobbies
Linguistics
Literature
Maritime
Marketing
Mathematics
Measurements and units
Mechanics
Medicine
Meteorology
Military
Music
Mythology
Nautical
Occupation
Oceanography
Optics
Pharmacology
Philosophy
Photography
Physics
Physiology
Police
Politics
Post
Psychology
Publishing
Radio
Rail
Religion
School
Sex
Sociology
Space
Sport
Statistics
Technical
Technology
Telecommunication
Telephone
Television
Theatre
Theology
Time
Tourism
Transportation
University
Zoology
