Wikipedia Data Analysis Toolkit

An extensible toolkit for Wikipedia data analytics, based on Python and R.

Installation

Install WikiDAT to process and analyze the collection of public Wikipedia data dump files (history of all edits, administrative actions, etc.).

Instructions »
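
For orientation, here is a minimal Python sketch (separate from WikiDAT's own tooling) that downloads a single public dump file. The choice of simplewiki and the exact file name are illustrative assumptions; check the listings at dumps.wikimedia.org for the files currently offered.

```python
# Minimal sketch: fetch one public dump file by hand (assumed file name).
# simplewiki is small, which makes it convenient for a first test run.
import requests

URL = ("https://dumps.wikimedia.org/simplewiki/latest/"
       "simplewiki-latest-pages-meta-history.xml.bz2")

with requests.get(URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("simplewiki-history.xml.bz2", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            out.write(chunk)
print("Download complete")
```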

Documentation

Learn details about the format and content of Wikipedia dump files, and how to use and extend WikiDAT for your own projects.

Read »
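
As background, dump files follow the MediaWiki XML export format: each `<page>` element wraps a `<title>` and a series of `<revision>` elements, each with a `<timestamp>` and a `<contributor>`. The sketch below (not WikiDAT's own parser) streams revision metadata from a compressed history dump using only the standard library; it assumes Python 3.8+ for the `{*}` namespace wildcard.

```python
# Minimal sketch: stream revision metadata from a pages-meta-history
# dump without loading the whole XML file into memory.
import bz2
import xml.etree.ElementTree as ET

def iter_revisions(path):
    """Yield (page_title, timestamp, username) tuples from a .xml.bz2 dump."""
    with bz2.open(path, "rb") as f:
        title = None
        for _, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # strip the export namespace
            if tag == "title":
                title = elem.text
            elif tag == "revision":
                ts = elem.findtext("{*}timestamp")
                # username is None for anonymous (IP) contributors
                user = elem.findtext("{*}contributor/{*}username")
                yield title, ts, user
                elem.clear()  # keep memory usage roughly constant
            elif tag == "page":
                elem.clear()

# Print the first revision as a smoke test
for rev in iter_revisions("simplewiki-history.xml.bz2"):
    print(rev)
    break
```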

Examples

These case studies introduce and explain methods, processes and practical hints for analyzing Wikipedia data, including reproductions of previous studies.

Browse »
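
To give a flavor of what a case study looks like, the hypothetical mini-analysis below reuses the `iter_revisions` helper sketched above to count edits per month; a real case study would add data cleaning and richer statistics on top of this.

```python
# Hypothetical mini case study: monthly edit counts for one wiki.
from collections import Counter

monthly = Counter()
for _, ts, _ in iter_revisions("simplewiki-history.xml.bz2"):
    monthly[ts[:7]] += 1  # timestamps are ISO 8601, e.g. "2004-05-17T12:34:56Z"

for month in sorted(monthly):
    print(month, monthly[month])
```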

  • Project name: WikiDAT
  • Main author: Felipe Ortega
  • License: GPLv3

WikiDAT is an extensible toolkit for Wikipedia data analytics, based on Python and R. Currently supported database backends include MySQL and MariaDB. Additional database engines may also be supported in the future.
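
As an illustration of how an analysis script might talk to such a backend, the sketch below uses PyMySQL; the connection details, database name and revision table layout are assumptions made for the example, not WikiDAT's documented schema.

```python
# Hypothetical query against a MySQL/MariaDB backend: edits per year.
# Assumes a database "wikidat" with a table revision(rev_timestamp DATETIME).
import pymysql

conn = pymysql.connect(host="localhost", user="wikidat",
                       password="secret", database="wikidat")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT YEAR(rev_timestamp) AS yr, COUNT(*) "
            "FROM revision GROUP BY yr ORDER BY yr"
        )
        for year, n_edits in cur.fetchall():
            print(year, n_edits)
finally:
    conn.close()
```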

WikiDAT includes several case studies describing exemplary projects using Wikipedia data sources. Each case study implements a different type of analysis, and its results are returned in the form of data files, figures, PDF reports or interactive web pages. Case studies also include Python and R code embedded in literate programming documents (such as IPython notebooks or R Markdown). These documents show how to replicate each case study, including the data preparation, cleaning and analysis steps.
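
A typical cell in such a document might turn aggregates into a figure. The sketch below plots the monthly counts from the earlier case-study example with matplotlib; the output file name and styling are arbitrary choices.

```python
# Plot the monthly edit counts computed in the earlier sketch.
import matplotlib.pyplot as plt

months = sorted(monthly)               # "monthly" from the case-study sketch
counts = [monthly[m] for m in months]

plt.figure(figsize=(8, 3))
plt.plot(range(len(months)), counts)
step = max(1, len(months) // 10)       # label roughly ten ticks
plt.xticks(range(0, len(months), step), months[::step], rotation=45)
plt.ylabel("Edits per month")
plt.title("Monthly edit activity (illustrative)")
plt.tight_layout()
plt.savefig("monthly_edits.png")
```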

The long-term goal is to progressively incorporate more case examples, covering the most illustrative and salient quantitative analyses of Wikipedia data. In the future, this may also involve distributed computing tools (Hadoop, Spark) for analyzing very large data sets in high-resolution studies.