University of Vienna students’ internship at Lexicala
The interns – Karin Niederreiter, Bettina Pátzay, and Nadezda Sazanova – have started to explore three challenging projects on hate speech and offensive language. Their work involves experimental computational linguistics tasks on English, German, Hungarian and Russian vocabularies.
The results of these research projects are expected to be made available toward the end of Q1 2024.
WordPlay Games, word puzzle video games with Lexicala resources
WordPlay Games have released a new line of SpellStruck word puzzle video games for five languages, including definitions from Lexicala monolingual resources.
The games display word definitions from Lexicala in Spanish, German, French, Italian, and Brazilian Portuguese, adding to the popular English-language application that was released earlier this year and features definitions from Wordnik.
SpellStruck, featuring Disney characters, is available on the Apple Arcade subscription service.
Presenting a poster on LLM training data at Copenhagen University Language Technology Conference
Winter is coming, and Lexicala CEO, Ilan Kernerman, is heading to the cold north for the heartwarming annual conference of the Language Technology Center at Copenhagen University (CST). Sprogteknologisk Konference 2023 will be held on Thursday, November 23, bringing together local and foreign developers, researchers, users and students, and featuring diverse talks on language-centered AI.
Ilan will present a poster entitled ‘Integrating Multi-layer Lexical Data in Language Model Training,’ which describes the unique characteristics of Lexicala’s multi-layer language data and how they can enhance the training and performance of LLMs and MT systems, as well as other NLP applications.
Our innovative multi-layer method combines human creation and curation with automated data processing. It starts with a meticulous exploration of the foundations of each language; the evidence gathered and analyzed enables “mapping its DNA” in minute detail to create a monolingual core. We then add L2 equivalents to produce a bilingual pair, and juxtapose translations in further languages to form a multilingual layer around the initial L1 core. This core and its multilingual satellites can be further cross-lingualized and extrapolated across other language networks, all relying on a comprehensive technical infrastructure and overall framework.
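To make the layering concrete, here is a minimal Python sketch of how a monolingual core entry with attached translation layers could be modeled. It is an illustration only, not Lexicala's actual schema or API: the class names (`Sense`, `Entry`), the method names, and the sample entry are all assumptions invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    """One sense of a headword in the monolingual (L1) core. Hypothetical model."""
    definition: str
    examples: list[str] = field(default_factory=list)
    # Target-language code -> equivalents attached to this specific sense.
    translations: dict[str, list[str]] = field(default_factory=dict)


@dataclass
class Entry:
    """A headword with its senses; translations hang off each sense."""
    headword: str
    language: str
    senses: list[Sense] = field(default_factory=list)

    def add_translation(self, sense_index: int, target_lang: str, equivalent: str) -> None:
        # Adding L2 equivalents to a sense builds the bilingual layer.
        self.senses[sense_index].translations.setdefault(target_lang, []).append(equivalent)

    def bilingual_pairs(self, target_lang: str) -> list[tuple[str, str]]:
        # (L1 headword, L2 equivalent) pairs, in sense order.
        return [(self.headword, t)
                for s in self.senses
                for t in s.translations.get(target_lang, [])]

    def pivot_pairs(self, lang_a: str, lang_b: str) -> list[tuple[str, str]]:
        # Candidate cross-lingual pairs: equivalents that share the same L1 sense.
        return [(a, b)
                for s in self.senses
                for a in s.translations.get(lang_a, [])
                for b in s.translations.get(lang_b, [])]


# Illustrative sample entry (made up for this sketch, not taken from Lexicala data).
entry = Entry("bank", "en", [
    Sense("an institution that keeps and lends money"),
    Sense("the land along the side of a river"),
])
entry.add_translation(0, "de", "Bank")
entry.add_translation(1, "de", "Ufer")
entry.add_translation(0, "fr", "banque")
entry.add_translation(1, "fr", "rive")

print(entry.bilingual_pairs("de"))
print(entry.pivot_pairs("de", "fr"))
```

Attaching translations to individual senses rather than to the headword is the design choice that matters here: it keeps “Bank” aligned with “banque” and “Ufer” with “rive” when the L1 core is used as a pivot between two other languages.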
Besides the usual sense disambiguation, definitions, examples, and multiword expressions, the data features rich semantic and syntactic information including morphology lists linking inflected forms to main lemmas, offensive language taxonomies, domain and register classification, synonyms and antonyms, grammatical details, phonetic transcription, spelling variations, alternative scripts, cross-references, etc.
Most relevant for LLMs and MT are the usage examples that stem from traditional learner’s dictionaries. They are conceived by experts who identify typical language patterns and design short phrases and full sentences that illustrate their use for (foreign) language learners.
Incorporating such quality lexical resources in the initial stages of language model training goes beyond feeding masses of data into LLM training, offering advantages in efficiency, precision, compactness, and savings (work, time, costs), as well as fewer copyright issues.
Moreover, shifting the emphasis to the preliminary model training phases also reduces the need for extensive (Automated) Post-Editing and Quality Estimation, and mitigates the challenges originating from LLM hallucination, bias, and inconsistency.
The poster will highlight the prime Danish language resource of Lexicala, which is based on data from the legacy Den Danske Ordbog (DDO, The Danish Dictionary) of the Society for Danish Language and Literature (DSL).
Over the years we have substantially revised and adapted the DDO entries to create the Danish monolingual layer in our Global series. It has since served as a base for developing bilingual and multilingual layers, including a trilingual Danish-English-Korean dataset that we tailor-made in 2018-2020 for Naver Corporation in Korea, with the help of Lexical Computing.
It’s been a while since my last visit to Copenhagen, well before the Covid-19 outbreak. I look forward to meeting our colleagues at CST and making new acquaintances!
Distribution agreement between ELDA and Lexicala for the dissemination of multilingual lexical data
We are happy to announce that ELDA, the European Language resources Distribution Agency, is starting to disseminate the language resources of Lexicala to its clients.
TAUS conference 2023, Salt Lake City
Lexicala CEO, Ilan Kernerman, is attending the TAUS annual conference in Salt Lake City, meeting with industry experts, learning about the implications of Large Language Models for Machine Translation, and presenting our relevant work at Lexicala.
Ilan is honored to take part in a panel discussion on the Localization Business – The Provider Perspective, along with Mathijs Sonnemans (Blackbird.io), Spence Green (Lilt), Jeffrey Jean-Paul Kiser (Acolad) and Jan Gordecki (WeLocalize), moderated by Renato Beninatto (Nimdzi).
Ilan will also attend a pre-conference workshop on GenAI in Localization, to gain insights on the emerging landscape of LLMs, including governance and risk management, and explore use cases and applications.
Last but not least, Ilan will showcase the innovative methods and added value of Lexicala in the AI Revolution Readiness Contest, alongside leading MT professionals, focusing on the vital role of quality lexical data for fostering LLM training and performance.
The fifth plenary meeting of NexusLinguarum
This COST (European Cooperation in Science and Technology) Action on building a European network for Web-centred linguistic data science will end next April, and much work remains in planning the final stages. A large part of the meeting was thus devoted to the Roadmap track, and Lexicala’s CEO chaired a session on industry adoption of LLD (Linguistic Linked Data), with valuable help from Dimitar Trajanov.
While distinguishing LLD from LLOD (Linguistic Linked Open Data), the participants explained how LLD is being adopted in industry and what benefits it brings, as well as how LLD and LLMs can contribute to each other, opening new ground and exciting possibilities.