What Every Computational and Corpus Linguist Should Know About Type-Token Distributions and Zipf’s Law
Tutorial at LREC 2018 (Miyazaki) & Corpus Linguistics 2019 (Cardiff)


This tutorial (i) introduces the mathematical foundations of statistical models of type-token distributions (known as LNRE models, for Large Number of Rare Events), including recent work on bootstrapped confidence sets and corrections for non-randomness; (ii) shows how to put these models to practical use in NLP tasks with the zipfR implementation, an add-on package for the widely used statistical programming environment R; and (iii) discusses applications of type-token statistics with a particular focus on quantitative measures of productivity and type-richness.
Its aim is to equip participants with the knowledge, skills and tools to deal properly with low-frequency data and highly skewed type-token distributions in their linguistic research and NLP applications.
The LREC tutorial took place on Monday, 7 May 2018 (afternoon session). The Corpus Linguistics tutorial takes place on Monday, 22 July 2019 (afternoon session).
Instructor: Stefan Evert (FAU Erlangen-Nürnberg)
Type-token statistics based on Zipf’s law play an important supporting role in many natural language processing tasks as well as in the linguistic analysis of corpus data. On the one hand, type-token analysis has been applied to tasks such as Good-Turing smoothing, stylometrics and authorship attribution, patholinguistics, measuring morphological productivity, studies of type-richness (e.g. of an author’s vocabulary), as well as coverage estimates for treebank grammars and other language models. On the other hand, virtually all probability estimates obtained from corpus data (ranging from psycholinguistic frequency norms over the collocational strength of multiword expressions to supervised and unsupervised training of statistical models) are affected by the skewed frequency distribution of natural language expressed by Zipf’s law. Recent work has shown that the significance of low-frequency data can be overestimated substantially even by methods previously believed to be robust, such as the log-likelihood ratio.
However, many researchers are not familiar with the specialised mathematical techniques required for a statistical analysis of type-token distributions, in particular so-called LNRE models based on Zipfian type density functions (Baayen 2001). Most off-the-shelf NLP software packages also fail to provide reliable estimation methods for Zipf-like frequency distributions and other necessary functionality. As a result, arbitrary cutoff thresholds for low-frequency data are applied rather than adjusting statistical estimators; type-token analysis, if carried out at all, is based on intuitive but problematic measures such as the type-token ratio (TTR); and empirical observations of coverage or vocabulary size cannot reliably be extrapolated to larger sample sizes.
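One reason the TTR is problematic is that it depends on sample size: the same underlying vocabulary yields a lower TTR the more tokens are sampled. The following minimal Python sketch illustrates this on synthetic data. It is not part of the tutorial materials and does not use zipfR; the helper names (`zipf_sample`, `type_token_ratio`), the vocabulary size of 50,000 types and the Zipf exponent of 1 are all assumptions chosen purely for illustration.

```python
import random
from collections import Counter

def zipf_sample(n_tokens, n_types=50_000, exponent=1.0, seed=42):
    """Draw n_tokens type IDs from a finite Zipfian population.

    Illustrative assumption: the type at rank r has probability
    proportional to 1 / r**exponent.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_types + 1)]
    return rng.choices(range(n_types), weights=weights, k=n_tokens)

def type_token_ratio(tokens):
    """TTR = V / N: number of distinct types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

tokens = zipf_sample(100_000)

# TTR of increasingly long prefixes of one and the same sample
ttr_by_size = {n: type_token_ratio(tokens[:n]) for n in (1_000, 10_000, 100_000)}
for n, ttr in ttr_by_size.items():
    print(f"N = {n:>7,}  TTR = {ttr:.3f}")

# Proportion of types that occur exactly once (hapax legomena) in the full sample
frequencies = Counter(tokens)
hapax_share = sum(1 for f in frequencies.values() if f == 1) / len(frequencies)
print(f"hapax share of types at N = 100,000: {hapax_share:.3f}")
```

Because the TTR falls as the sample grows, TTR values from texts of different lengths are not directly comparable, even though every prefix here was drawn from one and the same population.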
Extensive hands-on examples will be demonstrated based on the LNRE implementation in the zipfR package for R. Since the tutorial does not leave enough room for a practice session, the full example code will be made available here for download with detailed explanations.
zipfR
v0.666 for R 3.6.0: source code (Linux) – Mac OS X – Windows