zipfR: user-friendly LNRE modelling in R
current version: 0.6-70 (October 2020)
The zipfR package implements “Large Number of Rare Events” (LNRE) models for type-token statistics on word frequencies and other type-rich distributions, whether linguistic or non-linguistic. It provides utilities for various kinds of analyses commonly used in lexical statistics, such as the extrapolation of vocabulary growth curves and frequency estimation for rare events, building on the powerful R environment for statistical computing.
Like R, zipfR is free and open source.
A few examples of application areas where the zipfR functionalities can be useful:
zipfR is being developed by Stefan Evert and Marco Baroni.
If you publish work based on zipfR, please cite this page and/or one of the following papers:
Source code and binary versions of the zipfR package can conveniently be downloaded from CRAN, the Comprehensive R Archive Network, using R's built-in package manager. If you need help with installation from CRAN, please take a look at the zipfR tutorial available from the Getting started section. The current version is v0.6-70, released on 13 October 2020.
Nightly builds of the cutting-edge version are available from the R-Forge repository. Note that these development builds may occasionally be buggy or unstable, so you should only use them if you feel comfortable as a beta tester (and it is highly recommended that you join the zipfR mailing list in this case). To install the cutting-edge version of zipfR, enter the R command
install.packages("zipfR", repos="http://R-Forge.R-project.org")
or point your GUI package installer to the R-Forge repository at http://R-Forge.R-project.org/.
The best way to get started with zipfR is to install it and work your way through the tutorial. Now would also be a good time to sign up for the mailing list.
Download the current version of the tutorial now!
When you have finished the tutorial, you should browse the comprehensive package documentation to find out more about its functionalities. If you want to learn more about the motivation behind zipfR, the math and some applications, we recommend that you take a look at the materials from our ESSLLI 2006 course Counting Words and at the LREC 2018 / Corpus Linguistics 2019 tutorial What Every Computational Corpus Linguist Should Know About Type-Token Distributions. The papers listed in the Background reading section below may also be of interest to you.
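As a taste of what a first session can look like, here is a minimal sketch. It assumes that zipfR has been installed from CRAN and uses the ItaRi.spc frequency spectrum shipped with the package. The choice of the Zipf-Mandelbrot model ("zm") and of twice the observed sample size as the extrapolation point are illustrative assumptions only; the tutorial remains the authoritative walk-through.

```r
# Assumes zipfR has been installed from CRAN; the guard skips the demo otherwise.
if (requireNamespace("zipfR", quietly = TRUE)) {
  library(zipfR)

  data(ItaRi.spc)                  # frequency spectrum shipped with the package
  N0 <- N(ItaRi.spc)               # observed sample size (tokens)
  V0 <- V(ItaRi.spc)               # observed vocabulary size (types)

  model <- lnre("zm", ItaRi.spc)   # fit a Zipf-Mandelbrot LNRE model
  print(model)

  EV_N0  <- EV(model, N0)          # expected vocabulary size at the observed N
  EV_2N0 <- EV(model, 2 * N0)      # extrapolation to twice the sample size
  cat("observed V:", V0, " E[V(N0)]:", EV_N0, " E[V(2*N0)]:", EV_2N0, "\n")
}
```

Comparing the observed vocabulary size with the model's expectation at the same sample size gives a quick impression of the fit before trusting any extrapolation.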
zipfR accepts several input formats, including simple frequency lists and plain samples in one-token-per-line format. Thus, it should be easy to extract suitable input data using standard corpus processing tools.
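To make the two kinds of input concrete, the following sketch derives a type frequency list and a frequency spectrum from a small token vector using base R only. The toy tokens and the variable names (tfl, spc) are hypothetical, and the code illustrates the underlying counts rather than the exact file formats zipfR expects; those are described in the package documentation.

```r
# Toy sample in one-token-per-line form (hypothetical data; a real corpus
# would typically be read from a file, one token per line).
tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "end")

# Type frequency list: one row per type with its frequency f.
counts <- table(tokens)
tfl <- data.frame(type = names(counts), f = as.integer(counts),
                  stringsAsFactors = FALSE)
tfl <- tfl[order(-tfl$f, tfl$type), ]

# Frequency spectrum: Vm = number of types that occur exactly m times.
spc_counts <- table(tfl$f)
spc <- data.frame(m = as.integer(names(spc_counts)),
                  Vm = as.integer(spc_counts))

N <- sum(tfl$f)   # sample size in tokens
V <- nrow(tfl)    # vocabulary size in types
```

The same counts can of course be produced with any standard corpus processing tool.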
However, we also provide two Perl pre-processing scripts that might be useful for certain studies:
compute_emp_vgc.pl: This script takes a corpus in one-token-per-line format and computes its observed vocabulary and hapax growth curves (observed growth curves cannot be generated from frequency lists, since they require information on the ordering of the tokens in the input). Note that the Perl script is more efficient for large corpora than the built-in zipfR functions.
randomization_experiments.pl: This script takes a CWB-encoded corpus as input and computes spectra and vocabulary growth curves for multiple randomizations of arbitrary text segments (e.g. sentences or documents). The script is meant for rather advanced tests of the randomness assumption lying behind LNRE modeling and for handling large corpora. Most users will have no use for it.
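The ideas behind the two scripts can be sketched in a few lines of base R. This is not a port of either Perl script, and it does not read CWB-encoded corpora. The function names observed_vgc and randomized_vgc, the representation of segments as a list of character vectors, and the toy data are all assumptions made for illustration.

```r
# Observed vocabulary (V) and hapax (V1) growth curves for a token sequence.
observed_vgc <- function(tokens) {
  # occ[i] = how often the type of token i has been seen up to position i
  occ <- ave(seq_along(tokens), tokens, FUN = seq_along)
  data.frame(
    N  = seq_along(tokens),
    V  = cumsum(occ == 1),                 # a new type on its first occurrence
    V1 = cumsum((occ == 1) - (occ == 2))   # no longer a hapax on the second
  )
}

# V(N) for several random re-orderings of whole segments (e.g. documents).
# `segments` is a hypothetical list of character vectors, one per segment.
randomized_vgc <- function(segments, n_rand = 10, seed = 1) {
  set.seed(seed)
  t(sapply(seq_len(n_rand), function(i) {
    shuffled <- unlist(sample(segments), use.names = FALSE)
    observed_vgc(shuffled)$V
  }))
}

tokens <- c("a", "b", "a", "c", "b", "a")
vgc <- observed_vgc(tokens)

segments <- list(c("a", "b"), c("a", "c"), c("b", "a"))
runs <- randomized_vgc(segments, n_rand = 5)   # one row per randomization
```

Because the growth curves depend on token order, they need the running token sequence rather than a frequency list. Shuffling whole segments instead of single tokens keeps within-segment repetition intact, which is what makes this kind of comparison informative about the randomness assumption.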
We would love to hear from you: feedback, feature requests, offers to collaborate on the development of the toolkit, invitations, money, … ;-)
Stefan Evert: http://www.stefan-evert.de/
Marco Baroni: http://clic.cimec.unitn.it/marco/
You should also subscribe to the zipfR mailing list, where you can get help and hold discussions with other users.
H. Baayen (2001). Word frequency distributions. Kluwer, Dordrecht (the companion LEXSTATS toolkit for LNRE analysis is available here).
M. Baroni (2008). Distributions in text. In Anke Lüdeling and Merja Kytö (eds.), Corpus Linguistics: An International Handbook. Berlin: Mouton de Gruyter.
M. Baroni and S. Evert (2007). Words and echoes: Assessing and mitigating the non-randomness problem in word frequency distribution modeling. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic.
S. Evert (2004a). A simple LNRE model for random character sequences. In Proceedings of JADT 2004, 411–422.
S. Evert (2004b, published 2005). The statistics of word cooccurrences: Word pairs and collocations. Ph.D. thesis, Institut für maschinelle Sprachverarbeitung, University of Stuttgart.
S. Evert (2020). Inside zipfR. Technical report. [Work in progress. Provides mathematical details on the algorithms implemented in the zipfR package.]
S. Evert and M. Baroni (2006a). Testing the extrapolation quality of word frequency models. In Proceedings of Corpus Linguistics 2005, Birmingham, UK.
S. Evert and M. Baroni (2006b). The zipfR library: Words and other rare events in R. Presentation at useR! 2006: The Second R User Conference, Vienna, Austria.
S. Evert and M. Baroni (2007). zipfR: Word frequency distributions in R. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Posters and Demonstrations Session, Prague, Czech Republic. [poster]
F. J. Tweedie and R. H. Baayen (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323–352.
Whenever possible, you should use the most recent version of the zipfR package from CRAN or R-Forge (see installation instructions). Here, we provide source code packages of old versions as a fallback in case there are any incompatibilities with legacy platforms or existing R scripts.
lnre.bootstrap() for efficient parametric bootstrapping, especially confidence intervals for LNRE model parameters
lnre.productivity.measures() computes approximate expectations or bootstrapped confidence intervals for productivity measures
plot() method for LNRE models (log type density or cumulative probability distribution) and improved plots of type frequency lists (relative frequencies, LNRE distributions)
[…] do.call() trickery
lnre() constructor supports user-defined cost function for parameter estimation
tfl2spc() no longer includes types with f = 0 in frequency spectrum
lnre.bootstrap() and bootstrap.confint() […]
rlnre() can directly generate type-frequency lists (essential for large samples)
Custom […] often failed to converge on problematic data sets)
gof […] based on multivariate chi-squared statistic, closely related to maximum likelihood estimation
confint() method for LNRE models
merge() method for two or more type frequency lists
[…] zipfR.begin.plot(), zipfR.end.plot(), etc.) are deprecated
zipfR.legend() function
read.multiple.objects() […]