Statistical Analysis of Corpus Data with R is an online course originally developed by Marco Baroni and Stephanie Evert starting in 2007. Since then it has grown thanks to contributions, feedback and suggestions from many generous people. This Web page provides slides, exercises, worked examples, and illustrative data sets, as well as an R package accompanying the course.

All materials and R code are made available under open licenses (CC-BY-SA for the course and GPL 3 for program code). All source files are available in the SVN repository of the SIGIL R-Forge page. Share and enjoy!

Recent changes:

R package

This course is supported by an R package that contains functions for analysing corpus frequency data, some illustrative data sets, and a few convenience functions. R code examples in the course slides and exercises make use of these functions and data sets, so please make sure that you have installed the R package. Further add-on packages that will be needed in this course are listed in Unit 1.
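One way to make sure the package is installed is sketched below. It assumes that the package in question is the corpora package distributed via CRAN and that a CRAN mirror is reachable. The helper name `ensure_package` is hypothetical and not part of the course materials.

```r
# Hypothetical helper (not part of the course materials): install a package
# from CRAN only if it is not already available in the local library.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can now be loaded.
  requireNamespace(pkg, quietly = TRUE)
}

# In an interactive session, install (if needed) and attach the corpora package.
if (interactive()) {
  ensure_package("corpora")
  library(corpora)
}
```

The same helper could be reused for the add-on packages listed in Unit 1.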

In addition to data sets and utilities directly connected to the SIGIL course, the corpora package provides useful functionality for analysing corpus frequency data, such as:

Archive of older corpora package versions:

SIGIL course units

In order to follow worked examples and solve exercises, it is recommended that you put all relevant data and code files in an RStudio project directory (or your current working directory). All code examples in the slides and exercises will make this assumption. Some slides may still refer to data sets in the SIGIL package, which was rejected by CRAN. Please use the corpora package instead, making sure that you have installed version 0.6 or newer.
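Before starting on a unit, it may help to check both assumptions from the R console. The following is a minimal sketch. The helper `has_min_version` is a hypothetical name introduced here for illustration, and the check simply returns FALSE if the package is not installed.

```r
# Hypothetical helper: TRUE if `pkg` is installed and its version is at
# least `minimum` (given as a string such as "0.6").
has_min_version <- function(pkg, minimum) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= minimum
}

# Directory in which R resolves relative file names; this should be the
# RStudio project directory holding the course files.
getwd()

# The downloaded data and code files should be listed here.
list.files()

# Check that a sufficiently recent corpora package is installed.
has_min_version("corpora", "0.6")
```

If the last call returns FALSE, installing or updating the corpora package should resolve it.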

