What Every Computational and Corpus Linguist Should Know About Type-Token Distributions and Zipf’s Law

Tutorial at LREC 2018 (Miyazaki) & Corpus Linguistics 2019 (Cardiff)
This tutorial (i) introduces the mathematical foundations of statistical models of type-token distributions (known as LNRE models, for Large Number of Rare Events), including recent work on bootstrapped confidence sets and corrections for non-randomness; (ii) shows how to put these models to practical use in NLP tasks with the zipfR implementation, an add-on package for the widely-used statistical programming environment R; and (iii) discusses applications of type-token statistics with a particular focus on quantitative measures of productivity and type-richness.
Its aim is to equip participants with the knowledge, skills and tools to deal properly with low-frequency data and highly skewed type-token distributions in their linguistic research and NLP applications.
The LREC tutorial took place on Monday, 7 May 2018 (afternoon session). The Corpus Linguistics tutorial takes place on Monday, 22 July 2019 (afternoon session).
Instructor: Stefan Evert (FAU Erlangen-Nürnberg)
Type-token statistics based on Zipf’s law play an important supporting role in many natural language processing tasks as well as in the linguistic analysis of corpus data. On the one hand, type-token analysis has been applied to tasks such as Good-Turing smoothing, stylometry and authorship attribution, patholinguistics, measuring morphological productivity, studies of type richness (e.g. of an author’s vocabulary), and coverage estimates for treebank grammars and other language models. On the other hand, virtually all probability estimates obtained from corpus data, ranging from psycholinguistic frequency norms and the collocational strength of multiword expressions to the supervised and unsupervised training of statistical models, are affected by the skewed frequency distribution of natural language expressed by Zipf’s law. Recent work has shown that the significance of low-frequency data can be overestimated substantially even by methods previously believed to be robust, such as the log-likelihood ratio.
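To make the skew concrete, here is a minimal sketch of the classic Zipfian rank-frequency law, f(r) ≈ k / r^a: the constants k = 60000 and a = 1 are purely illustrative choices, not figures from the tutorial, but they show how quickly frequency drops with rank.

```r
# Classic Zipf's law: frequency is (roughly) inversely proportional to rank.
# k and a are illustrative constants, not estimates from any real corpus.
k <- 60000   # frequency of the top-ranked type
a <- 1       # Zipfian exponent
f <- function(r) k / r^a

f(1)     # most frequent type: 60000 occurrences
f(10)    # the rank-10 type is already ten times rarer
f(1000)  # deep in the long tail of low-frequency types
```

Even in this idealised form, a handful of types account for most of the tokens while the vast majority of types are rare, which is exactly why low-frequency data dominates type-token statistics.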
However, many researchers are not familiar with the specialised mathematical techniques required for a statistical analysis of type-token distributions, in particular the so-called LNRE models based on Zipfian type density functions (Baayen 2001). Most off-the-shelf NLP software packages also fail to provide reliable estimation methods for Zipf-like frequency distributions and other necessary functionality. As a result, researchers apply arbitrary cutoff thresholds to low-frequency data rather than adjusting their statistical estimators; type-token analysis, if carried out at all, is based on intuitive but problematic measures such as the type-token ratio (TTR); and empirical observations of coverage or vocabulary size cannot reliably be extrapolated to larger sample sizes.
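The core problem with the TTR can be demonstrated in a few lines of base R. The sketch below samples tokens from a hypothetical Zipf-distributed population of 10,000 types (the population size, exponent and seed are illustrative assumptions) and computes the TTR at three sample sizes:

```r
# Illustrative simulation: the TTR = V/N systematically decreases with
# sample size N, so TTR values from texts of different lengths are not
# comparable. Population and seed are arbitrary choices for illustration.
set.seed(42)
ranks <- 1:10000
probs <- (1 / ranks) / sum(1 / ranks)   # Zipfian type probabilities (a = 1)

ttr <- sapply(c(1e3, 1e4, 1e5), function(N) {
  tokens <- sample(ranks, N, replace = TRUE, prob = probs)
  length(unique(tokens)) / N            # observed types / observed tokens
})
round(ttr, 3)   # TTR shrinks steadily as the sample grows
```

Because the TTR is confounded with sample size in this way, principled comparisons require an LNRE model rather than the raw ratio.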
Extensive hands-on examples will be demonstrated using the LNRE implementation in the zipfR package for R. Since the tutorial does not leave enough room for a practice session, the full example code will be made available here for download, with detailed explanations.
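As a taste of what such a session looks like, the sketch below fits an LNRE model to a frequency spectrum shipped with zipfR and extrapolates the expected vocabulary size. It uses the package's `ItaRi.spc` example data (productivity of the Italian prefix ri-) and the finite Zipf-Mandelbrot model; the choice of data set and model type is illustrative, not prescribed by the tutorial.

```r
# Sketch of a typical zipfR workflow (requires install.packages("zipfR"))
library(zipfR)

data(ItaRi.spc)      # frequency spectrum of the Italian prefix ri-
N(ItaRi.spc)         # sample size in tokens
V(ItaRi.spc)         # observed vocabulary size in types
Vm(ItaRi.spc, 1)     # number of hapax legomena (frequency-1 types)

# fit a finite Zipf-Mandelbrot LNRE model to the observed spectrum
model <- lnre("fzm", ItaRi.spc)
summary(model)

# extrapolate the expected vocabulary size to twice the observed sample
EV(model, 2 * N(ItaRi.spc))
```

Unlike the raw TTR, such extrapolations come with an explicit population model, so they remain valid when comparing samples of different sizes.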
zipfR
v0.6-66 for R 3.6.0: source code (Linux) – Mac OS X – Windows