Full history dump files contain the complete log of all edits performed by any user on any page in a Wikipedia language site. They therefore cover the edit history of pages in all namespaces (Talk, User, User talk, etc.), not only encyclopedic articles.
These files are named using the following format:
File: {lang}-{date}-{pages-meta-history}[num].xml[-pSTARTpEND].{compress}
Examples: eswiki-20150429-pages-meta-history1.xml.7z
enwiki-20150304-pages-meta-history1.xml-p000000010p000002944.7z
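Mandatory elements are shown in {}, while optional elements are shown in []. The optional pSTART and pEND values give the range of page identifiers covered by the file, where each identifier is the value of <id> in element <page> (see below).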
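As an illustration of the naming format, the following minimal sketch splits a dump file name into its fields with a regular expression. The pattern is inferred from the two examples above and is not part of any official tool: the function name `parse_dump_name`, the assumption that the language code is followed by a literal `wiki`, and the accepted compression extensions (`7z` and `bz2`) are all hypothetical choices.

```python
import re

# Pattern inferred from the two example file names; the "wiki" suffix and
# the list of compression extensions are assumptions, not a specification.
DUMP_NAME = re.compile(
    r"^(?P<lang>[a-z_]+)wiki-"
    r"(?P<date>\d{8})-"
    r"pages-meta-history(?P<num>\d+)?\.xml"
    r"(?:-p(?P<start>\d+)p(?P<end>\d+))?"
    r"\.(?P<compress>7z|bz2)$"
)


def parse_dump_name(name):
    """Return the fields of a dump file name as a dict, or None if no match."""
    match = DUMP_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    # Numeric fields are optional; convert them only when present.
    for key in ("num", "start", "end"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields


print(parse_dump_name("eswiki-20150429-pages-meta-history1.xml.7z"))
print(parse_dump_name(
    "enwiki-20150304-pages-meta-history1.xml-p000000010p000002944.7z"))
```

For the second name, the `start` and `end` fields hold the page identifier range (10 and 2944); for the first name both are `None`.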
<mediawiki>
: This is the root element. It provides information about the XML namespace and the URL of the schema describing the XML elements (export-0.10 in the examples below).
<siteinfo>
: Includes general information about this Wikipedia site, with the following subelements:
<sitename>
: Name of this Wikipedia site (in the corresponding language).
<base>
: Base URL for all pages in this Wikipedia site.
<generator>
: Version of the MediaWiki engine that produced this dump.
<case>
: Convention for text case in page titles (usually first-letter, meaning that only the first letter of a title is case-insensitive).
<namespaces>
: A list of the name (localized for that language) and the internal numerical code for each namespace in this MediaWiki site. Codes are important, since they identify the namespace for each <page> element.
Below you can find XML code snippets with examples for each of these elements.
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/
http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>enwiki</dbname>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.25wmf10</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="-1" case="first-letter">Special</namespace>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Talk</namespace>
<namespace key="2" case="first-letter">User</namespace>
[... rest of namespace definitions ...]
</namespaces>
</siteinfo>
Namespace | Code |
---|---|
Media | -2 |
Special | -1 |
Main (none) | 0 |
Talk | 1 |
User | 2 |
User talk | 3 |
Wikipedia | 4 |
Wikipedia talk | 5 |
File | 6 |
File talk | 7 |
MediaWiki | 8 |
MediaWiki talk | 9 |
Template | 10 |
Template talk | 11 |
Help | 12 |
Help talk | 13 |
Category | 14 |
Category talk | 15 |
Warning: The above list of namespaces is non-exhaustive. It only shows some of the most common namespaces and their standard codes in Wikipedia. Please refer to the database table namespaces created by WikiDAT for the actual list of namespaces and codes in the Wikipedia dump that you are analyzing.
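Since the list of namespaces varies between dumps, it can also be read directly from the <siteinfo> header. The sketch below is one possible way to do this with Python's standard `xml.etree.ElementTree` module; it is not how WikiDAT builds its table, and the helper names `local_name` and `read_namespaces` are hypothetical. The sample input is a shortened copy of the <siteinfo> snippet shown earlier.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <siteinfo> example, with the export namespace
# declared on the element so that it parses on its own.
SITEINFO = """<siteinfo xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <sitename>Wikipedia</sitename>
  <namespaces>
    <namespace key="-2" case="first-letter">Media</namespace>
    <namespace key="-1" case="first-letter">Special</namespace>
    <namespace key="0" case="first-letter" />
    <namespace key="1" case="first-letter">Talk</namespace>
    <namespace key="2" case="first-letter">User</namespace>
  </namespaces>
</siteinfo>"""


def local_name(tag):
    """Strip the '{namespace-uri}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def read_namespaces(siteinfo_xml):
    """Map each numerical namespace code to its (localized) name."""
    root = ET.fromstring(siteinfo_xml)
    codes = {}
    for elem in root.iter():
        if local_name(elem.tag) == "namespace":
            # The main namespace (code 0) has no name, so its text is None.
            codes[int(elem.get("key"))] = elem.text or ""
    return codes


print(read_namespaces(SITEINFO))
# {-2: 'Media', -1: 'Special', 0: '', 1: 'Talk', 2: 'User'}
```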
The main body of these XML dump files is a list of <page> elements. Each <page> block shows descriptive information about a wiki page, along with the complete collection of all <revision> elements (edit actions) performed on that page over time.
<page>
: Element containing information about a wiki page and its complete collection of edits (as a sublist of <revision> elements).
<title>
: String, the title of this page.
<ns>
: Integer, the code of the namespace in which this page is stored.
<id>
: Positive integer, unique numerical identifier for this page.
<revision>
: Element encapsulating information about a single edit on this page. For each page element, there will be a list of subelements of this type describing the complete record of all edits performed on this page.
<revision>
: An edit performed on the corresponding page, indicated by its parent <page> element.
<id>
: Positive integer, unique numerical identifier for this revision. This identifier is globally unique (across the entire database, not just within this page).
<parentid>
: Positive integer, identifier of the previous revision (the parent of this version), following the same coding as for <id>. If absent, this is the first revision for this page and the parent revision is assumed to be NULL.
<timestamp>
: String, timestamp (date and time) indicating when this revision was performed. The trailing Z indicates that the time is expressed in UTC; no local time zone is recorded. See example below for details about the specific format.
<contributor>
: Element showing information about the user who performed this revision. There are two options: an anonymous user, identified by an <ip> subelement with the IP address, or a registered user, identified by <username> and <id> subelements (see examples below).
<minor />
: If present, this single tag indicates that this is a minor revision. If absent, this is not a minor revision.
<comment>
: String, contains the comment introduced by the user to summarize the changes introduced in this revision.
<model>
: String, identifies the content model used to interpret the content inside <text>.
<format>
: String, indicates the format used to parse the wiki text inside <text>.
<text>
: String containing the full content of the page after that revision, including all MediaWiki markup and HTML content.
<sha1>
: String, provides the SHA-1 hash computed on the text content of this revision, encoded in base 36.
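The <sha1> values in the examples of this section are 31 characters long and use digits and lowercase letters, which matches a SHA-1 digest written in base 36. The following sketch recomputes such a value under that assumption, and under the further assumption that the hash is taken over the UTF-8 bytes of the revision text and left-padded with zeros to 31 characters. The function names are hypothetical, and the result should be checked against real dump data before relying on it.

```python
import hashlib

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(number):
    """Write a non-negative integer using base-36 digits (0-9, a-z)."""
    if number == 0:
        return "0"
    out = []
    while number:
        number, remainder = divmod(number, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def base36_sha1(text):
    """SHA-1 of the UTF-8 encoded text, in base 36, padded to 31 characters.

    The encoding and padding are assumptions inferred from the example
    values, not taken from the dump schema.
    """
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return to_base36(int(digest, 16)).rjust(31, "0")


value = base36_sha1("Some revision text")
print(value, len(value))
```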
<page>
<title>States of India by Punjabi speakers</title>
<ns>0</ns>
<id>18949912</id>
<revision>
[... metadata and content of first revision ...]
</revision>
<revision>
[... metadata and content of second revision ...]
</revision>
[... rest of revisions for this page ...]
</page>
<revision>
<id>276395626</id>
<parentid>275465083</parentid>
<timestamp>2009-03-10T23:39:42Z</timestamp>
<contributor>
<username>Anshuk</username>
<id>1432885</id>
</contributor>
<minor />
<comment>Disambiguate [[Punjabi]] to [[Punjabi language]] using [[:en:Wikipedia:Tools/Navigation_popups|popups]]</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">This is a '''list of States and Union Territories of India by speakers of [[Punjabi language|Punjabi]]''' as of [http://www.censusindia.gov.in/Census_Data_2001/Census_Data_Online/Language/Statement3.htm census 2001]. Gross population figures are [http://www.censusindia.gov.in/Census_Data_2001/Census_data_finder/C_Series/Population_by_religious_communities.htm available online.]
{| class="wikitable"
|-
! Rank || State || Punjabi speakers
[... rest of text in this revision ...]
</text>
<sha1>c1zvmeq0c3ndwm2en1x1hs000efg5ou</sha1>
</revision>
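To make the role of the optional subelements concrete, the sketch below reads a shortened copy of the revision above with Python's standard `xml.etree.ElementTree` module. It treats a missing <parentid> as the first revision of the page and the presence of <minor /> as the minor-edit flag. The helper names and the keys of the returned dictionary are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example revision (comment and text abbreviated).
REVISION = """<revision>
  <id>276395626</id>
  <parentid>275465083</parentid>
  <timestamp>2009-03-10T23:39:42Z</timestamp>
  <contributor>
    <username>Anshuk</username>
    <id>1432885</id>
  </contributor>
  <minor />
  <comment>Disambiguate</comment>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text xml:space="preserve">This is a list</text>
  <sha1>c1zvmeq0c3ndwm2en1x1hs000efg5ou</sha1>
</revision>"""


def local_name(tag):
    """Strip the '{namespace-uri}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def revision_summary(rev):
    """Collect the main metadata fields of a <revision> element."""
    # Only direct children are indexed, so the <id> nested inside
    # <contributor> does not overwrite the revision <id>.
    fields = {local_name(child.tag): child for child in rev}
    parent = fields.get("parentid")
    comment = fields.get("comment")
    return {
        "rev_id": int(fields["id"].text),
        # No <parentid> means this is the first revision of the page.
        "parent_id": int(parent.text) if parent is not None else None,
        # <minor /> carries no content; only its presence matters.
        "minor": "minor" in fields,
        "comment": comment.text if comment is not None else None,
    }


print(revision_summary(ET.fromstring(REVISION)))
```

In a real dump file the elements carry the export XML namespace declared on <mediawiki>, which is why the tags are compared by local name only.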
Format: YYYY-MM-DDTHH:MM:SSZ
The 'T' character separates the date from the time, and the trailing 'Z' indicates that the time is expressed in UTC.
Example: <timestamp>2005-08-30T11:37:22Z</timestamp>
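A timestamp in this format can be converted into a timezone-aware value with Python's standard `datetime` module, as in the short sketch below. The function name `parse_timestamp` is hypothetical; the trailing 'Z' is matched literally and the result is marked as UTC.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse 'YYYY-MM-DDTHH:MM:SSZ' into a UTC-aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


ts = parse_timestamp("2005-08-30T11:37:22Z")
print(ts.isoformat())  # 2005-08-30T11:37:22+00:00
```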
<contributor>
<ip>24.251.243.233</ip>
</contributor>
<contributor>
<username>MyLogin Name</username>
<id>3344555</id>
</contributor>
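The two forms of <contributor> shown above can be told apart by checking which subelements are present. The following sketch does this for both examples; the function name `describe_contributor` and the labels "anonymous" and "registered" are hypothetical choices for illustration.

```python
import xml.etree.ElementTree as ET

ANONYMOUS = "<contributor><ip>24.251.243.233</ip></contributor>"
REGISTERED = ("<contributor><username>MyLogin Name</username>"
              "<id>3344555</id></contributor>")


def local_name(tag):
    """Strip the '{namespace-uri}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def describe_contributor(elem):
    """Classify a <contributor> element as anonymous or registered."""
    fields = {local_name(child.tag): child.text for child in elem}
    if "ip" in fields:
        # Anonymous edits only record the IP address.
        return {"kind": "anonymous", "ip": fields["ip"]}
    return {
        "kind": "registered",
        "username": fields.get("username"),
        "user_id": int(fields["id"]) if fields.get("id") else None,
    }


print(describe_contributor(ET.fromstring(ANONYMOUS)))
print(describe_contributor(ET.fromstring(REGISTERED)))
```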
The following code snippet shows a complete excerpt of the content in one of these files:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/
http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="sco">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>scowiki</dbname>
<base>http://sco.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.25wmf12</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="-1" case="first-letter">Special</namespace>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Talk</namespace>
<namespace key="2" case="first-letter">User</namespace>
<namespace key="3" case="first-letter">User talk</namespace>
<namespace key="4" case="first-letter">Wikipedia</namespace>
<namespace key="5" case="first-letter">Wikipedia talk</namespace>
<namespace key="6" case="first-letter">File</namespace>
<namespace key="7" case="first-letter">File talk</namespace>
<namespace key="8" case="first-letter">MediaWiki</namespace>
<namespace key="9" case="first-letter">MediaWiki talk</namespace>
<namespace key="10" case="first-letter">Template</namespace>
<namespace key="11" case="first-letter">Template talk</namespace>
<namespace key="12" case="first-letter">Help</namespace>
<namespace key="13" case="first-letter">Help talk</namespace>
<namespace key="14" case="first-letter">Category</namespace>
<namespace key="15" case="first-letter">Category talk</namespace>
<namespace key="100" case="first-letter">Portal</namespace>
<namespace key="101" case="first-letter">Portal talk</namespace>
<namespace key="828" case="first-letter">Module</namespace>
<namespace key="829" case="first-letter">Module talk</namespace>
</namespaces>
</siteinfo>
<page>
<title>Inglis leid</title>
<ns>0</ns>
<id>2</id>
<revision>
<id>7</id>
<timestamp>2005-06-22T10:17:05Z</timestamp>
<contributor>
<ip>24.251.198.251</ip>
</contributor>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">Tha '''Inglis''' (English) leid is a west [[Gairmanic leid]] at cam frae Ingland an thats forebear wis [[auld Inglis]]. Tha name "English" cams frae tha pairt o [[Gairmanie]] caw'd "Angeln". Inglis is tha waruld's seicont maist widelie spaken first leid, an his aboot 340 million hameborn speikers waruldwide.
[[en:English language]]</text>
<sha1>6m5yxiaalrm6te7e3x3fiw1aq7wk9ir</sha1>
</revision>
</page>
</mediawiki>
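Because full history files can be very large, it is usually preferable to read them as a stream rather than loading the whole document. The sketch below shows one way to do this with `xml.etree.ElementTree.iterparse` from the Python standard library, using a shortened copy of the excerpt above as input. It assumes the file has already been decompressed; the function name `iter_pages` and the shape of the yielded tuples are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the excerpt above (siteinfo and some fields omitted).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
 version="0.10" xml:lang="sco">
  <page>
    <title>Inglis leid</title>
    <ns>0</ns>
    <id>2</id>
    <revision>
      <id>7</id>
      <timestamp>2005-06-22T10:17:05Z</timestamp>
      <contributor><ip>24.251.198.251</ip></contributor>
      <text xml:space="preserve">Tha '''Inglis''' leid</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace-uri}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, namespace code, page id, revision count) per <page>."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = ns = page_id = None
        revisions = 0
        for child in elem:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = int(child.text)
            elif name == "id":
                page_id = int(child.text)
            elif name == "revision":
                revisions += 1
        yield title, ns, page_id, revisions
        # Release the page's children once the caller has moved on.
        elem.clear()


for row in iter_pages(io.BytesIO(SAMPLE)):
    print(row)  # ('Inglis leid', 0, 2, 1)
```

This version still keeps all revisions of one page in memory until its closing </page> tag is reached, so pages with very long histories may call for handling each <revision> as soon as it ends.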