zipfR: user-friendly LNRE modelling in R

current version: 0.6-70 (October 2020)

Introduction

The zipfR package implements “Large Number of Rare Events” (LNRE) models for type-token statistics on word frequencies and other type-rich distributions, whether linguistic or non-linguistic. It provides utilities for various kinds of analyses commonly used in lexical statistics, such as the extrapolation of vocabulary growth curves and frequency estimation for rare events, building on the powerful R environment for statistical computing.

Like R, zipfR is free and open source.

A few examples of application areas where the zipfR functionalities can be useful:

Investigating the quantitative productivity of word formation processes and other linguistic phenomena
Estimating and comparing vocabulary size and other measures of lexical richness in language acquisition or stylometry studies
Estimating the number of types of a certain category in a large population based on a limited sample, e.g., the number of typos on the Web
Providing empirically supported Bayesian priors for lexical probabilities

zipfR is being developed by Stefan Evert and Marco Baroni.

If you publish work based on zipfR, please cite this page and/or one of the following papers:

Evert, Stefan and Baroni, Marco (2006). The zipfR library: Words and other rare events in R. Presentation at useR! 2006: The Second R User Conference, Vienna, Austria.
Evert, Stefan and Baroni, Marco (2007). zipfR: Word frequency distributions in R. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Posters and Demonstrations Session, Prague, Czech Republic.

The mathematics behind zipfR are described in the technical report Inside zipfR (currently work in progress).

Obtaining zipfR

Source code and binary versions of the zipfR package can conveniently be downloaded from CRAN, the Comprehensive R Archive Network, using R's built-in package manager. If you need help with installation from CRAN, please take a look at the zipfR tutorial available from the Getting started section. The current version is v0.6-70, released on 13 October 2020.
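
For the CRAN release, installation from within R usually comes down to the two standard commands sketched here. This is a minimal example rather than a full installation guide; depending on your setup, R may ask you to choose a CRAN mirror first.

```r
# Install the released version of zipfR from CRAN
# (R may prompt for a mirror the first time).
install.packages("zipfR")

# Load the package in the current R session.
library(zipfR)
```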

Nightly builds of the cutting-edge version are available from the R-Forge repository. Note that these development builds may occasionally be buggy or unstable, so you should only use them if you feel comfortable as a beta tester (and it is highly recommended that you join the zipfR mailing list in this case). To install the cutting-edge version of zipfR, enter the R command

install.packages("zipfR", repos="http://R-Forge.R-project.org")

or point your GUI package installer to the R-Forge repository at http://R-Forge.R-project.org/.

Getting started

The best way to get started with zipfR is to install it and work your way through the tutorial. Now would also be a good time to sign up for the mailing list.

Download the current version of the tutorial now!

When you have finished the tutorial, you should browse the comprehensive package documentation to find out more about its functionalities. If you want to learn more about the motivation behind zipfR, the math and some applications, we recommend that you take a look at the materials from our ESSLLI 2006 course Counting Words and at the LREC 2018 / Corpus Linguistics 2019 tutorial What Every Computational Corpus Linguist Should Know About Type-Token Distributions. The papers listed in the Background reading section below may also be of interest to you.

Auxiliary scripts

zipfR accepts several input formats, including simple frequency lists and plain samples in one-token-per-line format. Thus, it should be easy to extract suitable input data using standard corpus processing tools.
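
To make the two kinds of input concrete, the following sketch derives a type-frequency list and a frequency spectrum from a small token vector using base R only. The toy data and all variable names are hypothetical, and the code deliberately does not use zipfR's own import functions; it only illustrates what the data look like.

```r
# Toy sample in one-token-per-line format (hypothetical data).
# A real corpus file could be read with readLines() instead.
tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "end")

# Type-frequency list: each distinct type with its frequency f,
# most frequent type first.
tfl <- sort(table(tokens), decreasing = TRUE)

# Frequency spectrum: number of types V_m that occur exactly m times.
spc <- table(as.integer(tfl))

N  <- length(tokens)  # sample size in tokens
V  <- length(tfl)     # vocabulary size in types
V1 <- sum(tfl == 1)   # number of hapax legomena (types with f = 1)

print(tfl)
print(spc)
```

Note that both summaries discard the order of the tokens, so either one can be produced with standard corpus tools without keeping the full text.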

However, we also provide two Perl pre-processing scripts that might be useful for certain studies:

compute_emp_vgc.pl: This script takes a corpus in one-token-per-line format and computes its observed vocabulary and hapax growth curves (observed growth curves cannot be generated from frequency lists, since they require information on the ordering of the tokens in the input). Note that the Perl script is more efficient for large corpora than the built-in zipfR functions.

randomization_experiments.pl: This script takes a CWB-encoded corpus as input and computes spectra and vocabulary growth curves for multiple randomizations of arbitrary text segments (e.g. sentences or documents). The script is meant for rather advanced tests of the randomness assumption underlying LNRE modelling and for handling large corpora. Most users will have no use for it.
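
The idea behind both scripts can be sketched in a few lines of base R. The example below is an illustration under simplifying assumptions, not a port of the Perl code: it computes observed vocabulary and hapax growth curves from an ordered token vector, then repeats the computation after permuting whole text segments. The function name growth_curves() and the toy data are hypothetical.

```r
# Observed growth curves for a token vector in corpus order.
growth_curves <- function(tokens) {
  # For each position: how many times has this type occurred so far,
  # counting the current token?
  occurrence <- ave(seq_along(tokens), tokens, FUN = seq_along)
  data.frame(
    N  = seq_along(tokens),
    # V(N): a first occurrence adds a new type.
    V  = cumsum(occurrence == 1),
    # V1(N): a first occurrence adds a hapax, a second one removes it.
    V1 = cumsum((occurrence == 1) - (occurrence == 2))
  )
}

# Hypothetical toy corpus, split into segments (e.g. sentences).
segments <- list(c("a", "b", "a"), c("c", "b"), c("a", "d"))
observed <- growth_curves(unlist(segments))

# One randomization: permute whole segments, keep their internal order.
set.seed(42)
shuffled <- growth_curves(unlist(segments[sample(length(segments))]))

print(observed)
print(shuffled)
```

The curves may differ at intermediate sample sizes, but they always end at the same V and V1, since the full sample contains the same tokens in either order. This also shows why a frequency list alone is not enough: the running counts depend on where each token occurs.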

Contact information & zipfR mailing list

We would love to hear from you: feedback, feature requests, offers to collaborate on the development of the toolkit, invitations, money, … ;-)

Stefan Evert (Corpus Linguistics Group, FAU Erlangen-Nürnberg):
stefan evert AT fau de | http://www.stefan-evert.de/
Marco Baroni (CIMeC, University of Trento):
marco baroni AT unitn it | http://clic.cimec.unitn.it/marco/

You should also subscribe to the zipfR mailing list, where you can get help and hold discussions with other users.

Background reading

H. Baayen (2001). Word frequency distributions. Kluwer, Dordrecht (the companion LEXSTATS toolkit for LNRE analysis is available here).

M. Baroni (2008). Distributions in text. In Anke Lüdeling and Merja Kytö (eds.), Corpus Linguistics: An International Handbook. Berlin: Mouton de Gruyter.

M. Baroni and S. Evert (2007). Words and echoes: Assessing and mitigating the non-randomness problem in word frequency distribution modeling. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic.

S. Evert (2004a). A simple LNRE model for random character sequences. In Proceedings of JADT 2004, 411–422.

S. Evert (2004b, published 2005). The statistics of word cooccurrences: Word pairs and collocations. Ph.D. thesis, Institut für maschinelle Sprachverarbeitung, University of Stuttgart.

S. Evert (2020). Inside zipfR. Technical report. [Work in progress. Provides mathematical details on the algorithms implemented in the zipfR package.]

S. Evert and M. Baroni (2006a). Testing the extrapolation quality of word frequency models. In Proceedings of Corpus Linguistics 2005, Birmingham, UK.

S. Evert and M. Baroni (2006b). The zipfR library: Words and other rare events in R. Presentation at useR! 2006: The Second R User Conference, Vienna, Austria.

S. Evert and M. Baroni (2007). zipfR: Word frequency distributions in R. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Posters and Demonstrations Session, Prague, Czech Republic. [poster]

F. J. Tweedie and R. H. Baayen (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323–352.

Change log & old version downloads

Whenever possible, you should use the most recent version of the zipfR package from CRAN or R-Forge (see installation instructions). Here, we provide source code packages of old versions as a fallback in case there are any incompatibilities with legacy platforms or existing R scripts.

Version 0.6-70 (October 2020)
source code (Linux) – Mac OS X (R 4.0.2) – Windows (R 4.0.2)
- parallelization of lnre.bootstrap() for efficient parametric bootstrapping, especially confidence intervals for LNRE model parameters
- additional productivity measures and some other improvements
- lnre.productivity.measures() computes approximate expectations or bootstrapped confidence intervals for productivity measures
- plot() method for LNRE models (log type density or cumulative probability distribution) and improved plots of type frequency lists (relative frequencies, LNRE distributions)
- all plot methods accept list of objects in first argument to avoid do.call() trickery
- lnre() constructor supports user-defined cost function for parameter estimation
- bug fix: tfl2spc() no longer includes types with f = 0 in frequency spectrum
- new default colour scheme and line styles
Version 0.6-66 (July 2019)
source code (Linux) – Mac OS X (R 3.6.0) – Windows (R 3.6.0)
- parameter estimation now performs multiple runs with different random start values for improved robustness
- efficient parametric bootstrapping from LNRE model with lnre.bootstrap() and bootstrap.confint()
- rlnre() can directly generate type-frequency lists (essential for large samples)
Version 0.6-44 (July 2018)
source code (Linux) – Mac OS X (R 3.5.0) – Windows (R 3.5.0)
- compute various productivity measures from observed frequency spectrum, type-frequency list, vocabulary growth curve or token vector
- default minimization algorithm changed to Nelder-Mead (Custom often failed to converge on problematic data sets)
- new default cost function gof based on multivariate chi-squared statistic, closely related to maximum likelihood estimation
- confint() method for LNRE models
- merge() method for two or more type frequency lists
- new example data sets from Evert & Lüdeling (2001) and Baayen (2001)
Version 0.6-10 (August 2017)
source code (Linux)
- maintenance release for compatibility with new CRAN checks and restrictions
- plotting utilities (zipfR.begin.plot(), zipfR.end.plot(), etc.) are deprecated
- zipfR tutorial has been rewritten as genuine package vignette
- reading and writing type frequency lists is more robust and allows character encoding of the disk file to be declared
- Zipf-ranking plots for type-frequency lists
- removed zipfR.legend() function
- added package citation details, and upgraded license to GPL v3
Version 0.6-4 (January 2007)
source code (Linux)
- various bug fixes and improvements
- new convenience function read.multiple.objects()
Version 0.6 (first public release, August 2006)
source code (Linux)
- first public release on CRAN
- improved parameter estimation, can be fine-tuned by users with choice of cost functions and minimization algorithms
- default settings tuned to work well for most data sets
- minor bug fixes and improved documentation, updated version of tutorial
