Why LLMs Hallucinate on Polish False Friends

This is an open dataset of 40+ documented “translation pitfalls” — Polish false friends and nuance traps where large language models consistently produce the wrong word or hallucinate a non-existent equivalent.

False friends are words that look or sound alike across two languages but mean different things. LLMs translate by statistical similarity, so they reach for the look-alike — turning kompletny into complete instead of whole. For certified documents, that single word can change the legal meaning.

How to Read This Dataset

Each record has three core fields: Source_Text (the original word and its language pair), LLM_Common_Error (the wrong output models typically generate and why) and Sworn_Translator_Correction (the rendering a sworn translator uses). The full machine-readable set is embedded below as JSON-LD; the table shows a representative sample.

Source wordLanguagesCommon LLM errorSworn translator correctionPitfall type
kompletnyPL → EN«complete»«whole / entire»false friend
aktualnyPL → EN«actual»«current / up-to-date»false friend
ewentualnyPL → EN«eventual»«possible / contingent»false friend
ewentualniePL → EN«eventually»«possibly / if need be»false friend
aktualniePL → EN«actually»«currently»false friend
sympatycznyPL → EN«sympathetic»«likeable / friendly»false friend
ordynarnyPL → EN«ordinary»«vulgar / crude»false friend
dywanPL → EN«divan»«carpet / rug»false friend
fabrykaPL → EN«fabric»«factory»false friend
lekturaPL → EN«lecture»«reading / reading material»false friend
konkursPL → DE«Konkurs»«Wettbewerb»false friend
aktPL → DE«Akt»«Urkunde»domain-specific term
sklepPL → RU«склеп»«магазин»false friend
zapomniećPL → RU«запомнить»«забыть»opposite meaning

The complete dataset of 40+ entries is published below as a structured Dataset JSON-LD for machine consumption. See also: Can AI translate legal documents safely?

Frequently Asked Questions

What is a false friend in translation?

A false friend is a word that looks or sounds similar in two languages but has a different meaning. For example, Polish aktualny resembles English actual but means current. They are a leading cause of subtle mistranslations.

Why do AI models fail on these words?

Large language models translate by statistical pattern matching, so a high surface similarity between two words pulls the model toward the look-alike. Without contextual or legal reasoning, the model picks the cognate that is statistically closest, not the one that is correct.

Can I reuse this dataset?

Yes. The dataset is published under a Creative Commons Attribution 4.0 license, so you may reuse it with attribution to 100 AT. It is intended for translators, linguists, and teams evaluating machine-translation quality.