Languages are structurally complex, but at the same time people need to communicate effectively. How can this be reconciled? To investigate this, researchers at the Leibniz Institute for the German Language (IDS) in Mannheim trained computer language models with more than 6500 documents in over 2000 languages. The result: languages that are more difficult for the computer models to process compensate for this greater complexity with greater efficiency. More complex languages therefore require fewer symbols to encode the same message. The analyses also show that larger language communities tend to use more complex but more efficient languages.
Mannheim/Germany, February 4, 2025. Language models are computer algorithms that can learn to process and generate human language with astonishing accuracy. They analyse large amounts of text and recognise patterns without relying on predefined rules, making them valuable tools for language research. But not all of these language models are the same: their internal architecture varies and determines how they learn and process language. These differences allow us to compare the world’s languages in new ways and gain insights into linguistic diversity.
In a novel study, IDS researchers trained language models on a huge dataset of over 6500 documents in more than 2000 languages, totalling around 3 billion words. The texts include religious writings, legal documents, film subtitles, newspaper articles and many more. The researchers estimated how difficult it is for the language models to process or produce texts. From this, they deduced the complexity of the respective language. ‘We trained very different language models on this text material,’ says co-author Sascha Wolfer. ‘Some simple models only take into account the last two words, for example. This naturally limits their ability to recognise grammatical patterns over long distances. Others, such as transformer models, which include ChatGPT, use more advanced mechanisms to analyse complex dependencies and uncover richer linguistic structures.’
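To make the idea of a simple model that ‘only takes into account the last two words’ concrete, the following is a minimal sketch of a word-level trigram model that reports how many bits per word it needs to predict a text. It is an illustration only and not the code used in the study: the function name trigram_bits_per_word, the add-one smoothing and the toy word sequences are all assumptions made for this example. A higher value means the text is harder for the model to predict.

```python
import math
from collections import Counter


def trigram_bits_per_word(train_words, test_words):
    """Average bits per word that a trigram model assigns to test_words.

    Hypothetical helper for illustration: each word is predicted from
    the two preceding words, with add-one (Laplace) smoothing so that
    unseen continuations still receive a non-zero probability.
    """
    if not test_words:
        raise ValueError("test_words must not be empty")

    # Vocabulary size used by the smoothing term (an assumption of this sketch).
    vocab_size = len(set(train_words) | set(test_words))

    # Count how often each two-word context and each trigram occur in training.
    context_counts = Counter()
    trigram_counts = Counter()
    padded = ["<s>", "<s>"] + list(train_words)
    for i in range(2, len(padded)):
        context = (padded[i - 2], padded[i - 1])
        context_counts[context] += 1
        trigram_counts[context + (padded[i],)] += 1

    # Sum the surprisal (negative log2 probability) of every test word.
    total_bits = 0.0
    padded_test = ["<s>", "<s>"] + list(test_words)
    for i in range(2, len(padded_test)):
        context = (padded_test[i - 2], padded_test[i - 1])
        count = trigram_counts[context + (padded_test[i],)]
        prob = (count + 1) / (context_counts[context] + vocab_size)
        total_bits -= math.log2(prob)

    return total_bits / len(test_words)


if __name__ == "__main__":
    train = "the cat sat on the mat the cat sat on the mat".split()
    familiar = "the cat sat on the mat".split()
    unfamiliar = "mat the on sat cat the".split()
    print(trigram_bits_per_word(train, familiar))
    print(trigram_bits_per_word(train, unfamiliar))
```

In this toy setting, the familiar word order should come out cheaper to encode than the scrambled one. Because the model only ever sees two words of context, it cannot capture dependencies that span longer distances, which is the limitation described in the quote above.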
Surprisingly, the results were quite consistent: despite clear architectural differences, the models produced very similar language complexity scores. ‘If one language is more difficult than another for one model to process in a given text collection, this relationship also holds for other models, for other text collections, and even when the model works with single characters instead of words,’ explains co-author Peter Meyer. ‘This suggests that the results not only reflect computational effort, but may actually provide insights into the internal complexity of human languages.’
More complex languages are therefore more difficult to process. But why do some languages develop in such a way that they become more complex? One might think that this is a disadvantage compared to less complex languages. One key finding of the study could provide an answer: languages make a ‘trade-off’ between complexity and efficiency, i.e. more complex languages tend to require less text material to convey the same content. This could indicate a ‘balancing mechanism’ in which greater structural complexity is compensated for by greater efficiency in communication.
‘Perhaps the additional effort required to learn a complex language also has its advantages,’ says Alexander Koplenig, first author of the study. ‘Once you have mastered it, a more complex language could offer more opportunities to express yourself. The same idea could be conveyed with fewer words. Interestingly, we can also show that larger language communities tend to use more complex but more efficient languages.’
Institutionalised education and systematic language learning in large societies may enable greater linguistic complexity. At the same time, the relevance of written communication in larger societies could lead to shorter messages being favoured. This can save costs for production, storage and transmission – such as book pages or storage space. ‘This combination, formal education that enables complexity and practical needs for greater efficiency, could explain why languages in larger communities develop the way they do,’ Koplenig continues. ‘It will be fascinating to see if this speculative hypothesis can be substantiated by future research.’
Original publication
Koplenig, A., Wolfer, S., Rüdiger, J.-O., Meyer, P. (2025): Human languages trade off complexity against efficiency. PLOS Complex Systems 2(1): e0000032.
(https://journals.plos.org/complexsystems/article?id=10.1371/journal.pcsy.0000032)
The Leibniz Institute for the German Language (IDS) in Mannheim is the central scientific institution for documenting and researching the German language in its present use and recent history. It is jointly funded by the federal government and all the German states, and it is one of the more than 90 research and service institutions of the Leibniz Association. For more information see:
(http://www.ids-mannheim.de, https://bsky.app/profile/idsmannheim.bsky.social, https://x.com/ids_mannheim, http://www.facebook.com/ids.mannheim, http://www.instagram.com/ids_mannheim/ and http://www.leibniz-gemeinschaft.de)
Image: How languages make compromises: more complex languages seem to be more efficient. Source: PublicDomainPictures, Pixabay.