Full history dump files contain the complete log of all edits performed by any user on any page of a Wikipedia language edition. They therefore include the full edit history of pages in all namespaces (Talk, User, User talk, etc.), not only encyclopedic articles.

Standard filename format

These files are named using the following format:

File: {lang}-{date}-{pages-meta-history}[num].xml[-pSTART-pEND].{compress}

Examples: eswiki-20150429-pages-meta-history1.xml.7z
          enwiki-20150304-pages-meta-history1.xml-p000000010p000002944.7z

Mandatory elements are shown in {}, while optional elements are shown in []. The meaning of each field is:

  • lang: Identifier of the Wikipedia language. By convention, the ISO 639 code of the language is prepended to the term wiki, which identifies Wikipedia dumps (for example, eswiki or enwiki). Dumps from other Wikimedia projects use their own identifier ("wikiquote", "wikibooks", etc.).
  • date: The date on which the dump file was produced. For large dump files, this date does not correspond to the date of the last revision included in the file (compression and integrity checks may take some time).
  • pages-meta-history: Identifier of the type of dump file, in this case the complete history of all edits on all pages in a Wikipedia language.
  • num: For large Wikipedia languages with many pages and edits, producing a single dump file would be impractical. In these cases, the dump is split into several files, usually by page id in ascending order, and all of them must be processed to recover the complete dump for that language. The extreme case is the English Wikipedia, whose complete dump is split into many chunks, each one recording in its name the range of page ids it contains (see below).
  • xml: Type of data file. Currently, edit dumps are only provided in XML format.
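  • pSTART-pEND: Optional field, gives the identifiers of the first and the last page whose information is included in this file. These identifiers are the same as the field <id> in element <page> (see below). In actual file names the two values are zero-padded and written without a separating hyphen, as in the second example above (p000000010p000002944).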
  • compress: Extension identifying the algorithm used to compress the file. It is customary to use either 7zip (LZMA) or bzip2 to compress these files, as they can be very large in their original form.
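
The naming scheme above can be checked mechanically. The following is a minimal sketch in Python, assuming the file names follow the two examples shown earlier; the pattern, the function name parse_dump_name and the returned keys are illustrative choices, not part of any official tool.

```python
import re

# Pattern sketched from the naming scheme described above (an assumption:
# real dumps may use variants that this pattern does not cover).
DUMP_NAME = re.compile(
    r"^(?P<lang>[a-z0-9_]+)"                     # e.g. eswiki, enwiki
    r"-(?P<date>\d{8})"                          # e.g. 20150429
    r"-pages-meta-history"                       # type of dump
    r"(?P<num>\d+)?"                             # optional chunk number
    r"\.xml"
    r"(?:-p(?P<start>\d+)p(?P<end>\d+))?"        # optional page id range
    r"\.(?P<compress>7z|bz2)$"                   # compression extension
)


def parse_dump_name(filename):
    """Return the fields of a dump file name as a dict, or None if no match."""
    match = DUMP_NAME.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields; zero-padded page ids become plain integers.
    for key in ("num", "start", "end"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields


print(parse_dump_name("enwiki-20150304-pages-meta-history1.xml-p000000010p000002944.7z"))
```

Sorting the chunks of a large dump by the start field is one simple way to process them in ascending page id order.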

File content

Heading elements

  • <mediawiki>: This is the root element. It declares the XML namespace and the URL of the schema describing the XML elements (version 0.10 in the examples below), along with the schema version and the language of the site.
  • <siteinfo>: Includes general information about this Wikipedia site, with the following subelements:
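    • <sitename>: Name of this Wikipedia site (in the corresponding language).
    • <dbname>: Internal database name of this site (for example, enwiki).
    • <base>: URL of the main page of this Wikipedia site, from which the base URL for all pages can be derived.
    • <generator>: Version of the MediaWiki engine that produced this dump.
    • <case>: Convention for letter case in page titles (usually first-letter, meaning that the first letter of a title is capitalized automatically).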
    • <namespaces>: A list of the name (localized for that language) and the internal numerical code for each namespace in this MediaWiki site. Codes are important, since they identify the namespace for each <page> element.

Below you can find XML code snippets with examples for each of these elements.


<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/
  http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">

  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.25wmf10</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      [... rest of namespace definitions ...]
    </namespaces>
  </siteinfo>

Namespace         Code
Media             -2
Special           -1
Main (none)       0
Talk              1
User              2
User talk         3
Wikipedia         4
Wikipedia talk    5
File              6
File talk         7
MediaWiki         8
MediaWiki talk    9
Template          10
Template talk     11
Help              12
Help talk         13
Category          14
Category talk     15

Warning: The above list of namespaces is non-exhaustive. It only shows some of the most common namespaces and their standard codes in Wikipedia. Please refer to the database table namespaces created by WikiDAT for the actual list of namespaces and codes in the Wikipedia dump that you are analyzing.
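
When the WikiDAT table is not at hand, the list of namespaces can also be read directly from the <siteinfo> block of the dump. The sketch below assumes Python's standard xml.etree.ElementTree module and a small hand-made sample; the function name read_namespaces is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened sample modeled on the <siteinfo> snippet shown earlier.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10" xml:lang="en">
  <siteinfo>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
    </namespaces>
  </siteinfo>
</mediawiki>"""


def read_namespaces(xml_text):
    """Map each namespace code (int) to its localized name (str)."""
    root = ET.fromstring(xml_text)
    # ElementTree reports tags as "{uri}name"; reuse the "{uri}" prefix of the
    # root so the code does not depend on a specific schema version.
    prefix = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    codes = {}
    for ns in root.iter(prefix + "namespace"):
        # The main namespace (code 0) has no name, so its text is empty.
        codes[int(ns.get("key"))] = ns.text or ""
    return codes


print(read_namespaces(SAMPLE))
```

For a real dump, which is far too large to load as a single string, the same elements would have to be read with a streaming parser instead.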

Main elements

The main body of these XML dump files is a list of <page> elements. Each <page> block gives descriptive information about a wiki page, along with the complete collection of all <revision> elements (edit actions) performed on that page over time.

  • <page>: Element containing information about a wiki page and its complete collection of edits (as a sublist of <revision> elements).
    • <title>: String, the title of this page.
    • <ns>: Integer, the code of the namespace in which this page is stored.
    • <id>: Positive integer, unique numerical identifier for this page.
    • <revision>: Element encapsulating information about a single edit on this page. For each page element, there will be a list of subelements of this type describing the complete record of all edits performed on this page.
  • <revision>: An edit performed on the corresponding page, indicated by its parent <page> element.
    • <id>: Positive integer, unique numerical identifier for this revision. This identifier is unique across the entire database, not only within this page.
    • <parentid>: Positive integer, identifier of the previous revision (the parent of this version), following the same coding as for <id>. If absent, this is the first revision for this page and the parent revision is assumed to be NULL.
    • <timestamp>: String, timestamp (date and time) indicating when this revision was performed. No local time zone is recorded: the trailing Z indicates that the time is expressed in UTC. See the example below for details about the specific format.
    • <contributor>: Element showing information about the user who performed this revision. There are two options:
      • Anonymous editor: Only the IP address is provided, as a subelement.
      • Registered editor: Both the unique numerical identifier of the user (for the whole database) and the login name of the user are provided as subelements. See the example code below.
    • <minor />: If present, this empty tag indicates that this is a minor revision. If absent, this is not a minor revision.
    • <comment>: String, contains the comment introduced by the user to summarize the changes introduced in this revision.
    • <model>: String, identifies the model to interpret the content inside <text>.
    • <format>: String, indicates the format to parse the wiki text inside <text>.
    • <text>: String containing the full content of the page after that revision, including all MediaWiki markup and HTML content.
    • <sha1>: String, provides the SHA-1 hash computed on the text content of this revision.

  <page>
    <title>States of India by Punjabi speakers</title>
    <ns>0</ns>
    <id>18949912</id>
    <revision>
    [... metadata and content of first revision ...]
    </revision>
    <revision>
    [... metadata and content of second revision ...]
    </revision>
    [... rest of revisions for this page ...]
  </page>

<revision>
      <id>276395626</id>
      <parentid>275465083</parentid>
      <timestamp>2009-03-10T23:39:42Z</timestamp>
      <contributor>
        <username>Anshuk</username>
        <id>1432885</id>
      </contributor>
      <minor />
      <comment>Disambiguate [[Punjabi]] to [[Punjabi language]] using [[:en:Wikipedia:Tools/Navigation_popups|popups]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">This is a '''list of States and Union Territories of India by speakers of [[Punjabi language|Punjabi]]''' as of [http://www.censusindia.gov.in/Census_Data_2001/Census_Data_Online/Language/Statement3.htm census 2001]. Gross population figures are  [http://www.censusindia.gov.in/Census_Data_2001/Census_data_finder/C_Series/Population_by_religious_communities.htm available online.]
	{| class=&quot;wikitable&quot;
	|-
	! Rank || State || Punjabi speakers
	
	[... rest of text in this revision ...]
      </text>
      <sha1>c1zvmeq0c3ndwm2en1x1hs000efg5ou</sha1>
</revision>
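
The <sha1> value can in principle be recomputed from the content of <text>. The sketch below rests on an assumption that is not stated in this document: that the value is the SHA-1 digest of the UTF-8 encoded text, written in base 36 and left-padded with zeros to 31 characters (the length of the value in the example above). The function name sha1_base36 is hypothetical.

```python
import hashlib

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def sha1_base36(text):
    """SHA-1 of the UTF-8 text, in base 36, zero-padded to 31 characters.

    Assumption: this mirrors the encoding used for <sha1> in the dumps.
    """
    value = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    out = ""
    while value:
        value, remainder = divmod(value, 36)
        out = DIGITS[remainder] + out
    return out.rjust(31, "0")


print(sha1_base36("example revision text"))
```

A comparison against the dump is only meaningful when the text is taken exactly as stored, after XML entities such as &quot; have been decoded by the parser.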

Timestamp format: YYYY-MM-DDTHH:MM:SSZ
The 'T' character separates the date from the time, and the trailing 'Z' indicates that the time is expressed in UTC (ISO 8601 notation).
Example: <timestamp>2005-08-30T11:37:22Z</timestamp>
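
A timestamp in this format can be converted into a time-zone-aware value with Python's standard datetime module. This is a minimal sketch, assuming the trailing Z is the ISO 8601 designator for UTC; the function name parse_timestamp is hypothetical.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a dump timestamp such as 2005-08-30T11:37:22Z into an aware datetime."""
    # The literal Z in the format string only matches the trailing character;
    # the UTC time zone is attached explicitly afterwards.
    naive = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return naive.replace(tzinfo=timezone.utc)


print(parse_timestamp("2005-08-30T11:37:22Z").isoformat())
```

Keeping the time zone attached avoids silent errors when timestamps are later compared with values expressed in local time.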

Anonymous editor:

<contributor>
  <ip>24.251.243.233</ip>
</contributor>

Registered editor:

<contributor>
  <username>MyLogin Name</username>
  <id>3344555</id>
</contributor>
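
The two kinds of <contributor> element can be told apart by checking which subelements are present. The sketch below works on standalone fragments like the ones above, which carry no XML namespace; inside a real dump the tag names are qualified with the namespace declared in <mediawiki>. The function name describe_contributor and the keys of the returned dict are hypothetical.

```python
import xml.etree.ElementTree as ET


def describe_contributor(elem):
    """Summarize a <contributor> element as a dict."""
    ip = elem.findtext("ip")
    if ip is not None:
        # Anonymous editor: only the IP address is available.
        return {"anonymous": True, "ip": ip}
    user_id = elem.findtext("id")
    return {
        "anonymous": False,
        "username": elem.findtext("username"),
        "user_id": int(user_id) if user_id else None,
    }


anonymous = ET.fromstring("<contributor><ip>24.251.243.233</ip></contributor>")
registered = ET.fromstring(
    "<contributor><username>MyLogin Name</username><id>3344555</id></contributor>"
)
print(describe_contributor(anonymous))
print(describe_contributor(registered))
```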

Example content

The following snippet shows a small but complete example of the content of one of these files:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ 
http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="sco">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>scowiki</dbname>
    <base>http://sco.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.25wmf12</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
      <namespace key="100" case="first-letter">Portal</namespace>
      <namespace key="101" case="first-letter">Portal talk</namespace>
      <namespace key="828" case="first-letter">Module</namespace>
      <namespace key="829" case="first-letter">Module talk</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>Inglis leid</title>
    <ns>0</ns>
    <id>2</id>
    <revision>
      <id>7</id>
      <timestamp>2005-06-22T10:17:05Z</timestamp>
      <contributor>
        <ip>24.251.198.251</ip>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">Tha '''Inglis''' (English) leid is a west [[Gairmanic leid]] at cam frae Ingland an thats forebear wis [[auld Inglis]]. Tha name &quot;English&quot; cams frae tha pairt o [[Gairmanie]] caw'd &quot;Angeln&quot;. Inglis is tha waruld's seicont maist widelie spaken first leid, an his aboot 340 million hameborn speikers waruldwide.

[[en:English language]]</text>
      <sha1>6m5yxiaalrm6te7e3x3fiw1aq7wk9ir</sha1>
    </revision>
  </page>
</mediawiki>
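
Since real dump files are far too large to load into memory, a file with this structure is usually read as a stream. The following is a minimal sketch using iterparse from Python's standard xml.etree.ElementTree module. The function names and the keys of the yielded dicts are illustrative, and the second revision in the sample is made up to show <parentid> and <minor />.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Inglis leid</title>
    <ns>0</ns>
    <id>2</id>
    <revision>
      <id>7</id>
      <timestamp>2005-06-22T10:17:05Z</timestamp>
      <contributor><ip>24.251.198.251</ip></contributor>
      <text xml:space="preserve">first version</text>
    </revision>
    <revision>
      <id>9</id>
      <parentid>7</parentid>
      <timestamp>2005-06-23T08:00:00Z</timestamp>
      <contributor><username>Example</username><id>5</id></contributor>
      <minor />
      <text xml:space="preserve">second version</text>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the "{namespace}" prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(source):
    """Yield one dict of metadata per <revision>, reading the file as a stream."""
    path = []            # local names of the elements currently open
    page, rev = {}, {}
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = local(elem.tag)
        if event == "start":
            path.append(name)
            continue
        path.pop()
        parent = path[-1] if path else None
        # <id> appears under <page>, <revision> and <contributor>, so the
        # parent element decides which identifier is being read.
        if parent == "page" and name in ("title", "ns", "id"):
            page[name] = elem.text
        elif parent == "revision" and name in ("id", "parentid", "timestamp"):
            rev[name] = elem.text
        elif parent == "revision" and name == "minor":
            rev["minor"] = True
        elif name == "revision":
            yield {
                "page_id": int(page["id"]),
                "page_title": page["title"],
                "ns": int(page["ns"]),
                "rev_id": int(rev["id"]),
                "parent_id": int(rev["parentid"]) if "parentid" in rev else None,
                "timestamp": rev["timestamp"],
                "minor": rev.get("minor", False),
            }
            rev = {}
            elem.clear()   # release the revision text once it is processed
        elif name == "page":
            page = {}
            elem.clear()


for revision in iter_revisions(io.BytesIO(SAMPLE)):
    print(revision)
```

The same function accepts any binary file object, so a bzip2-compressed dump could be passed as bz2.open(path, "rb"); 7z files need an external tool or a third-party library, since the Python standard library does not read that format.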