This tutorial (i) introduces the mathematical foundations of statistical models of type-token distributions (known as LNRE models, for Large Number of Rare Events), including recent work on bootstrapped confidence sets and corrections for non-randomness; (ii) shows how to put these models to practical use in NLP tasks with the zipfR implementation, an add-on package for the widely-used statistical programming environment R; and (iii) discusses applications of type-token statistics with a particular focus on quantitative measures of productivity and type-richness.
Its aim is to equip participants with the knowledge, skills and tools to deal properly with low-frequency data and highly skewed type-token distributions in their linguistic research and NLP applications.
The LREC tutorial took place on Monday, 7 May 2018 (afternoon session).
The Corpus Linguistics tutorial takes place on Monday, 22 July 2019 (afternoon session).
Instructor: Stefan Evert (FAU Erlangen-Nürnberg)
Motivation
Type-token statistics based on Zipf’s law play an important supporting role in many natural language processing tasks as well as in the linguistic analysis of corpus data. On the one hand, type-token analysis has been applied to tasks such as Good-Turing smoothing, stylometry and authorship attribution, patholinguistics, measuring morphological productivity, studies of type richness (e.g. of an author’s vocabulary), and coverage estimates for treebank grammars and other language models. On the other hand, virtually all probability estimates obtained from corpus data (from psycholinguistic frequency norms and the collocational strength of multiword expressions to the supervised and unsupervised training of statistical models) are affected by the skewed frequency distribution of natural language described by Zipf’s law. Recent work has shown that the significance of low-frequency data can be substantially overestimated even by methods previously believed to be robust, such as the log-likelihood ratio.
However, many researchers are not familiar with the specialised mathematical techniques required for a statistical analysis of type-token distributions, in particular the so-called LNRE models based on Zipfian type density functions (Baayen 2001). Most off-the-shelf NLP software packages also fail to provide reliable estimation methods for Zipf-like frequency distributions and other necessary functionality. As a result, arbitrary cutoff thresholds are applied to low-frequency data instead of adjusting the statistical estimators; type-token analysis, if carried out at all, is based on intuitive but problematic measures such as the type-token ratio (TTR); and empirical observations of coverage or vocabulary size cannot reliably be extrapolated to larger sample sizes.
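The sample-size dependence of the TTR is easy to demonstrate. The following minimal sketch (plain R, with a made-up Zipf-like population rather than real corpus data, so the numbers are purely illustrative) draws random samples of increasing size and shows that the TTR keeps falling as the sample grows, which is why TTR values from texts of different lengths cannot be compared directly.

    ## Illustration only (not from the tutorial materials): the type-token ratio
    ## of samples from a Zipf-like population keeps decreasing as the sample grows.
    set.seed(42)
    probs <- (1:100000)^(-1.1)        # hypothetical Zipfian population of 100,000 types
    probs <- probs / sum(probs)
    for (N in c(1000L, 10000L, 100000L)) {
      tokens <- sample(seq_along(probs), N, replace=TRUE, prob=probs)
      V <- length(unique(tokens))     # observed vocabulary size
      cat(sprintf("N = %6d   V = %6d   TTR = %.3f\n", N, V, V / N))
    }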
Tutorial outline
- Motivation: introduction to Zipf's law and type-token distributions; overview of applications and their requirements.
- Basic concepts & notation: type-frequency rankings, vocabulary growth curve, frequency spectrum, Zipf-Mandelbrot law; overview of quantitative measures of lexical diversity and productivity (see the first sketch after this outline).
- LNRE models: statistical analysis of type-token distributions with LNRE models, covering the type density function, expected frequency spectrum, asymptotic variance, goodness-of-fit test and estimation of model parameters (see the model-fitting sketch below).
- Applications & examples: hands-on examples in R/zipfR for various applications, including coverage estimates (How many typos are there on the internet?), literary stylometry (How many words did Shakespeare know?), patholinguistics (Is lexical diversity an early indicator of dementia?), morphological productivity (Which word-formation processes are productive?) and Zipfian priors (Good-Turing smoothing, properly chance-corrected association measures); the extrapolation calls in the model-fitting sketch below give a flavour of such estimates.
- Advanced techniques: LNRE models as a basis for parametric simulation studies; confidence intervals for model parameters; significance tests for differences between samples; effects of non-randomness and LNRE models based on document frequency (see the bootstrap sketch below).
- Challenges: problems affecting LNRE models, in particular non-randomness of texts, deviations from the Zipf-Mandelbrot law and the robustness of parameter estimation from small samples; recent approaches for improving LNRE models and significance tests are discussed.
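The sketches below are not part of the official course materials (the full worked examples are linked under "Code & data"), but they give a first impression of the hands-on part. The first one covers the basic quantities from the outline, using the Italian ri- prefix data sets (ItaRi.spc, ItaRi.emp.vgc) shipped with zipfR and only standard zipfR accessor functions.

    ## Basic concepts with zipfR: frequency spectrum, vocabulary growth curve,
    ## TTR and hapax-based productivity, using the bundled ItaRi data sets.
    library(zipfR)

    N(ItaRi.spc)                      # sample size (number of tokens)
    V(ItaRi.spc)                      # vocabulary size (number of types)
    Vm(ItaRi.spc, 1)                  # hapax legomena (types occurring exactly once)

    V(ItaRi.spc) / N(ItaRi.spc)       # type-token ratio (TTR)
    Vm(ItaRi.spc, 1) / N(ItaRi.spc)   # Baayen's productivity measure P

    plot(ItaRi.spc)                   # barplot of the frequency spectrum
    plot(ItaRi.emp.vgc)               # observed vocabulary growth curve V(N)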
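The second sketch fits an LNRE model to the observed spectrum (a finite Zipf-Mandelbrot model here; "zm" and "gigp" are also available), inspects the goodness-of-fit reported by summary(), and extrapolates vocabulary size and hapax count to a larger sample, which is the kind of computation behind the stylometry and coverage questions listed above.

    ## LNRE model estimation, goodness-of-fit and extrapolation with zipfR.
    library(zipfR)

    fzm <- lnre("fzm", ItaRi.spc)     # estimate fZM parameters from the observed spectrum
    summary(fzm)                      # parameters + multivariate chi-squared goodness-of-fit

    ## expected vs. observed frequency spectrum at the observed sample size
    plot(ItaRi.spc, lnre.spc(fzm, N(ItaRi.spc)), legend=c("observed", "fZM"))

    ## extrapolation to ten times the observed sample size
    EV(fzm, 10 * N(ItaRi.spc))        # expected vocabulary size E[V]
    EVm(fzm, 1, 10 * N(ItaRi.spc))    # expected number of hapax legomena E[V_1]

    ## expected vocabulary growth curve, plotted against the observed one
    fzm.vgc <- lnre.vgc(fzm, N=N(ItaRi.emp.vgc))
    plot(ItaRi.emp.vgc, fzm.vgc, legend=c("observed", "fZM"))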
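Finally, a bootstrap sketch for the confidence intervals mentioned under "Advanced techniques". It assumes a recent zipfR release (at least 0.6-70, e.g. the interim release linked under "Code & data" below), in which lnre() accepts a bootstrap= argument and a confint() method for LNRE models is available; the parameter name refers to the fZM model.

    ## Parametric bootstrapping for LNRE models (recent zipfR versions only).
    library(zipfR)

    fzm <- lnre("fzm", ItaRi.spc, bootstrap=40)  # 40 bootstrap replicates (can be slow)
    confint(fzm, "alpha")                        # confidence interval for the Zipfian shape parameter
    ## other model parameters (and, depending on the release, derived quantities
    ## such as the population vocabulary size) can be queried in the same way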
Course materials
- LREC 2018: Slides (PDF, 3.2 MB), Handout (PDF, 4up, 2.4 MB)
- Corpus Linguistics 2019: Slides (PDF, 3.8 MB), Handout (PDF, 4up, 2.9 MB)
Updated on 22 July 2019
Code & data
Extensive hands-on examples are demonstrated during the tutorial, based on the LNRE implementation in the zipfR package for R. Since the tutorial does not leave enough room for a practice session, the full example code will be made available here for download, with detailed explanations.
- Code examples & data sets
- Interim zipfR release & additional data
Updated on 22 July 2019