Language models

From Tournesol

Language models are algorithms designed for natural language processing. The dominant paradigm is that of probabilistic predictive models: given the preceding text, the model assigns a probability to each of its possible continuations Computerphile-19.
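To make the predictive paradigm concrete, here is a minimal sketch of a bigram model, one of the simplest probabilistic language models: it estimates the probability of the next word from counts of which words followed the previous word in a corpus. The corpus and the function names (train_bigram_counts, next_word_probabilities) are hypothetical, chosen only for illustration. Modern language models condition on much longer contexts and are not built this way.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical, for illustration only).
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

def train_bigram_counts(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probabilities(counts, prev):
    """Estimate P(next word | prev) from relative frequencies."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the word was never seen with a successor
    return {word: n / total for word, n in followers.items()}

counts = train_bigram_counts(corpus)
probs = next_word_probabilities(counts, "the")

# Print the candidate continuations of "the", most probable first.
for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"P({word} | the) = {p:.2f}")
```

In this toy corpus, "cat" follows "the" in two of six occurrences, so it receives the highest estimated probability.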

Such models are usually building blocks of more applied algorithms, such as moderation, recommendation and search algorithms, which are deployed at massive scale by platforms such as Google, YouTube, Facebook, Twitter and LinkedIn, among others.

Word vectors

The basic principle underlying most modern language models, popularized by word2vec MikolovSCCD-13, is to map words to vectors in such a way that words with similar meanings are represented by similar vectors. More precisely, two words are considered similar when they are used in similar contexts ZettaBytes-17.

Remarkably, arithmetic on word vectors yields meaningful results. For example, when asked for king - man + woman, word2vec answers queen MikolovCCD-13.
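The analogy computation can be sketched as follows. This is only an illustration under stated assumptions: the three-dimensional vectors are hand-made (real word2vec vectors are learned from text and have hundreds of dimensions), and the helper names cosine and analogy are hypothetical. The sketch assumes the answer is the vocabulary word whose vector has the highest cosine similarity with the result of the arithmetic, excluding the three query words.

```python
import math

# Hand-made toy vectors (hypothetical values, NOT real word2vec output).
# The dimensions loosely stand for: royalty, maleness, femaleness.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # prints "queen" for these toy vectors
```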


Transformers

In recent years, the main breakthrough in language models has been the introduction of transformers, an architecture built on attention mechanisms VaswaniSPUJ+17. Transformers have since been shown to scale remarkably well: ever larger models trained on ever larger datasets have achieved increasingly impressive performance. In particular, larger models have been shown to handle multiple language tasks at once and to perform few-shot learning.
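For readers unfamiliar with attention, the following is a minimal sketch of scaled dot-product attention, the core operation of the transformer architecture, written in plain Python. Each query is compared with every key, the scores are turned into weights with a softmax, and the output is the corresponding weighted average of the values. The toy inputs are hypothetical, and the sketch omits everything else a real transformer contains (learned projections, multiple heads, masking, feed-forward layers).

```python
import math

def softmax(xs):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])  # dimension of the key vectors
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

# Toy inputs (hypothetical): the single query points in the direction of the first key.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[5.0, 0.0]]

result = attention(queries, keys, values)
print(result)
```

Because the query is aligned with the first key, most of the weight goes to the first value vector, so the first coordinate of the output dominates.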

Successively larger models include OpenAI's GPT-2 (1.5B parameters) RadfordWCLAS-19 and GPT-3 (175B parameters) BrownMRSK+20, and Google's Switch Transformer (1T parameters) FeduSZ-21. A weakened version of GPT-3 can be prompted at

The scalability and rushed deployment of transformers have arguably led to an AI race. This raises serious concerns, as AI safety and AI ethics may be neglected. Google's dismissal of the two co-leads of its AI ethics team after they criticized large language models is particularly concerning NYTimes-21.

Concerns about rushed deployment

One of the main concerns is that these models replicate the biases and the misinformation of their training datasets Abid-20 AbidFZ-21 McGuffieNewhouse-20.

BenderGMS-21 raise concerns about the environmental impact of large language models, their monopolization, their biases and their misuse TechnologyReview-20.

Privacy concerns have also been raised for language models trained on large datasets containing sensitive data, especially when such models are queried or prompted for autocompletion PanZJY-20 ZouZBZ-20 CarliniTWJH+20 InanRWJR+21.