Statistical Analysis of Corpus Data with R is an online course originally developed by Marco Baroni and Stephanie Evert starting in 2007. Since then it has grown thanks to contributions, feedback and suggestions from many generous people. This Web page provides slides, exercises, worked examples, and illustrative data sets, as well as an R package accompanying the course.

All materials and R code are made available under open licenses (CC-BY-SA for the course and GPL 3 for program code). All source files are available in the SVN repository of the SIGIL R-Forge page. Share and enjoy!

Recent changes:

2025-06-19: corpora package v0.7 has been released on CRAN.
2024-06-06: The SIGIL course has finally moved to its proper home on R-Forge.
2024-06-06: Slides for Unit #9 (inter-annotator agreement) are available online now.

R package

This course is supported by an R package that contains useful functions for analysing corpus frequency data, some illustrative data sets, and a number of convenience functions. R code examples in the course slides and exercises make use of these functions and data sets, so please make sure that you have installed the R package. Further add-on packages that will be needed in this course are listed in Unit 1.

The corpora package (version 0.7, new on 10.06.2025) is available on CRAN and can be installed with any standard R package manager.
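A minimal sketch of the installation step from the R console, assuming a working internet connection and a configured CRAN mirror. The variable names pkg and corpora_available are only illustrative, and the interactive() guard is an added precaution so that non-interactive scripts do not trigger a download.

```r
pkg <- "corpora"

# Check whether the package is already available, without attaching it.
if (!requireNamespace(pkg, quietly = TRUE)) {
  if (interactive()) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  } else {
    message("Package '", pkg, "' is not installed; ",
            "run install.packages(\"", pkg, "\") in an R session.")
  }
}

# Attach the package and report its version if it is now available.
corpora_available <- requireNamespace(pkg, quietly = TRUE)
if (corpora_available) {
  library(pkg, character.only = TRUE)
  cat("corpora version:", as.character(packageVersion(pkg)), "\n")
}
```

RStudio users can achieve the same through the Packages pane.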

In addition to data sets and utilities directly connected to the SIGIL course, the corpora package provides useful functionality for analysing corpus frequency data, such as:

efficient p-values for Fisher's exact test on 2x2 contingency tables (fisher.pval)
efficient confidence intervals for proportions (prop.cint)
reference implementations of keyness measures, including the recently suggested LRC (keyness, v0.6)
many small convenience functions (qw, cont.table, sample.df, corpora.palette, colVector, …)
reference implementations of association measures for collocation analysis (v0.7)
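As a rough illustration of the first two items, the sketch below compares two samples with invented counts (10 hits in 100 items vs. 25 hits in 100 items). It computes base R reference values first and then calls fisher.pval and prop.cint if the corpora package is installed. The argument order (k1, n1, k2, n2) and (k, n) is an assumption based on the package help pages, so please check ?fisher.pval and ?prop.cint. The two p-values need not agree exactly, since implementations can define two-sided p-values differently.

```r
# Invented counts for illustration only (not course data)
k1 <- 10; n1 <- 100   # sample 1: 10 hits among 100 items
k2 <- 25; n2 <- 100   # sample 2: 25 hits among 100 items

# Base R reference: 2x2 contingency table with one column per sample
tbl <- matrix(c(k1, n1 - k1, k2, n2 - k2), nrow = 2,
              dimnames = list(c("hit", "other"), c("sample1", "sample2")))
p_base  <- fisher.test(tbl)$p.value      # Fisher's exact test, two-sided
ci_base <- binom.test(k1, n1)$conf.int   # exact 95% interval for k1 / n1

cat("base R p-value:", format(p_base, digits = 3), "\n")
cat("base R 95% CI :", format(ci_base, digits = 3), "\n")

# Same questions with the corpora package, which takes counts directly
# (argument order assumed from the package documentation)
if (requireNamespace("corpora", quietly = TRUE)) {
  p_corpora  <- corpora::fisher.pval(k1, n1, k2, n2)
  ci_corpora <- corpora::prop.cint(k1, n1)
  print(p_corpora)
  print(ci_corpora)
}
```

Passing counts rather than a pre-built table makes it easy to apply the same test to whole vectors of frequencies, which is the typical situation in keyword and collocation analysis.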

Archive of corpora package versions:

version 0.7 (06/2025): Windows – Mac OS X – source code
version 0.6 (08/2023): Windows – Mac OS X – source code
version 0.5-1 (03/2022): Windows – Mac OS X – source code
version 0.5 (06/2016): Windows – Mac OS X – source code

SIGIL course units

In order to follow worked examples and solve exercises, it is recommended that you put all relevant data and code files in an RStudio project directory (or your current working directory). All code examples in the slides and exercises will make this assumption. Some slides may still refer to data sets in the SIGIL package, which was rejected by CRAN. Please use the corpora package instead, making sure that you have installed version 0.6 or newer.
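Before starting a unit, a quick check along the following lines can confirm that the setup matches these assumptions. It is only a sketch: it uses the Unit 1 data files as an example, and the variable names (unit1_files, missing_files, version_ok) are illustrative.

```r
# Data files for Unit 1, expected in the current working directory
unit1_files <- c("brown.stats.txt", "lob.stats.txt")

present <- file.exists(unit1_files)
names(present) <- unit1_files
missing_files <- unit1_files[!present]

if (length(missing_files) > 0) {
  message("Not found in ", getwd(), ": ",
          paste(missing_files, collapse = ", "))
}

# The course materials assume corpora version 0.6 or newer
min_version <- package_version("0.6")
version_ok <- requireNamespace("corpora", quietly = TRUE) &&
  packageVersion("corpora") >= min_version

if (!version_ok) {
  message("Please install or update the corpora package (>= ",
          as.character(min_version), ").")
}
```

In an RStudio project the working directory is set to the project folder automatically, so the files only need to be copied there.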

Unit 1: General introduction / First steps in R
- handouts: slides (1.2 MiB) – print version (0.8 MiB)
- exercise sheet – solution notes
- data: brown.stats.txt – lob.stats.txt
Unit 2: Corpus frequency data & statistical inference
- handouts: slides (7.6 MiB) – print version (4.7 MiB)
- worked example on BNC frequency comparisons: RMarkdown – PDF
- exercise sheet – solution notes
- data: passives.brown.csv – passives.lob.csv – bnc_queries.tbl – bnc_metadata_utf8.tbl (1.2 MiB)
Unit 3: Descriptive and inferential statistics for continuous data
- part 1: slides (1.0 MiB) – print version (0.8 MiB)
- part 2: slides (0.5 MiB) – print version (0.4 MiB)
- worked example on the effectiveness of a corpus-driven language course: RMarkdown – PDF
- exercise sheet – solution notes
Unit 4: Collocations, keywords & contingency tables
- measuring keyness: slides (5.3 MiB) – print version (4.9 MiB) – screencast (MP4, 195 MiB) (new on 11.09.2023)
- worked example on keyword analysis: RStudio project 04_keyness_hands_on.zip (5.8 MiB) – HTML report (new on 13.09.2023)
- screencast from hands-on session: part 1 (MP4, 348 MiB) – part 2 (MP4, 217 MiB) (new on 17.09.2023)
- collocations part 1: slides (0.8 MiB) – print version (0.5 MiB)
- collocations part 2: slides (1.0 MiB) – print version (0.7 MiB)
- worked example on collocations and keywords: RMarkdown – PDF – PNG figure (for typesetting)
- exercise sheet – solution notes
- data: brown_bigrams.tbl – krenn_pp_verb.tbl
Unit 5: Word frequency distributions and Zipf's law: Using add-on packages
- handouts: slides (1.6 MiB) – print version (1.4 MiB)
- worked example in the zipfR package tutorial: PDF
- exercise sheet – solution notes
- data: bigrams.100k.tfl (1.1 MiB) – bigrams.100k.spc
- much more information on type-token distributions can be found in Stephanie's LNRE tutorial
Unit 6: Regression and the general linear model
Unit 7: Multivariate analysis (updated on 06.04.2023)
- overview talk: slides (5.8 MB) – handout (5.8 MB) – PowerPoint (86 MB, with embedded animations) (updated on 04.04.2023)
- mathematical background: slides (0.8 MiB) – print version (0.8 MiB) (update pending)
- worked examples: Multivariate analysis in R – Geometric Multivariate Analysis (updated on 06.04.2023)
- RStudio project: 07_project.zip (7.3 MB; ZIP archive including all code & data) (updated on 06.04.2023)
Unit 8: The non-randomness of corpus data & generalised linear models
- handouts: slides (6.0 MiB) – print version (4.5 MiB)
- worked example on the frequency of passives: RMarkdown – PDF
- data: passives_by_text.tbl
Unit 9: Inter-annotator agreement (new on 06.06.2024)
- handouts: slides (0.7 MiB) – print version (0.4 MiB)
Unit 10: Bootstrapping
