These history dump files contain metadata describing all edits performed by any user on any page in a Wikipedia language site. They cover the complete edit history of pages in all namespaces (Talk, User, User talk, etc.), not only encyclopedic articles.
These dump files do not contain the text of every revision. Instead, only descriptive metadata fields are provided. Hence, these files are considerably smaller than the full history files (pages-meta-history).
These files are named using the following format:
File: {lang}-{date}-stub-meta-history[num].xml.{compress}
Example: eswiki-20150429-stub-meta-history1.xml.gz
Mandatory elements are shown in {}, while optional elements are shown in []. In the example, {lang} is eswiki (the Spanish-language Wikipedia), {date} is 20150429, [num] is 1, and {compress} is gz.
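The naming format can be checked mechanically. The following is a minimal sketch, assuming that {lang} contains only lowercase letters, digits and underscores and that {date} is always eight digits; the pattern and the function name parse_stub_filename are illustrative and not part of WikiDAT.

```python
import re

# Pattern for {lang}-{date}-stub-meta-history[num].xml.{compress}
# (the character classes are assumptions based on the example name)
STUB_NAME = re.compile(
    r"^(?P<lang>[a-z0-9_]+)-(?P<date>\d{8})-stub-meta-history"
    r"(?P<num>\d+)?\.xml\.(?P<compress>\w+)$"
)

def parse_stub_filename(name):
    """Return the filename fields as a dict, or None if the name does not match."""
    match = STUB_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    # [num] is optional: keep None when it is absent, otherwise convert to int.
    if fields["num"] is not None:
        fields["num"] = int(fields["num"])
    return fields

print(parse_stub_filename("eswiki-20150429-stub-meta-history1.xml.gz"))
```

The optional [num] part is returned as None when the file name does not include it.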
The heading elements are identical to those in the full history version:
<mediawiki>
: This is the root element. It provides
information about the XML namespace and the URL of the schema describing
some of the XML elements (the schema version varies between dumps; the examples below use export-0.10).
<siteinfo>
: Includes general information about
this Wikipedia site, with the following subelements:
<sitename>
: Name of this Wikipedia site (in
the corresponding language).
<base>
: Base URL for all pages in this
Wikipedia site.
<generator>
: Version of the MediaWiki engine
that produced this dump.
<case>
: Convention for the case sensitivity of page titles (usually first-letter,
meaning that only the first letter of a title is case-insensitive).
<namespaces>
: A list of the name (localized
for that language) and the internal numerical code for each namespace
in this MediaWiki site. Codes are important, since they identify the
namespace for each <page>
element.
Below you can find XML code snippets with examples for each of these elements.
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/
http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>enwiki</dbname>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.25wmf10</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="-1" case="first-letter">Special</namespace>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Talk</namespace>
<namespace key="2" case="first-letter">User</namespace>
[... rest of namespace definitions ...]
</namespaces>
</siteinfo>
Namespace | Code |
---|---|
Media | -2 |
Special | -1 |
Main (none) | 0 |
Talk | 1 |
User | 2 |
User talk | 3 |
Wikipedia | 4 |
Wikipedia talk | 5 |
File | 6 |
File talk | 7 |
MediaWiki | 8 |
MediaWiki talk | 9 |
Template | 10 |
Template talk | 11 |
Help | 12 |
Help talk | 13 |
Category | 14 |
Category talk | 15 |
Warning: The above list of namespaces is non-exhaustive. It only shows some of the most common namespaces and their standard codes in Wikipedia. Please refer to the database table namespaces created by WikiDAT for the actual list of namespaces and codes in the Wikipedia dump that you are analyzing.
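Because the codes vary between sites, it can also help to read them directly from the <namespaces> block of the dump being analyzed. The sketch below does this with Python's standard xml.etree.ElementTree on a shortened, inline copy of the header. The namespace URI is taken from the example above and may differ in other dumps; namespace_map is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# XML namespace URI copied from the example header; other dumps may declare another version.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

SITEINFO = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <siteinfo>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
    </namespaces>
  </siteinfo>
</mediawiki>"""

def namespace_map(root):
    """Map each numeric namespace code to its localized name ('' for the main namespace)."""
    return {
        int(ns.get("key")): (ns.text or "")
        for ns in root.iter(NS + "namespace")
    }

codes = namespace_map(ET.fromstring(SITEINFO))
print(codes)
```

The main namespace has no name in the dump, so it is mapped to an empty string here.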
The main body of these XML dump files is a list of <page>
elements.
Each <page> block shows descriptive information about a wiki page,
along with the complete collection of all <revision>
elements
(edit actions) performed on that page over time.
The main difference with respect to the
full history version
files is that <text>
elements inside <revision>
do not contain any text. However, additional attributes provide metadata about
the text (in particular, its length in bytes).
<page>
: Element containing
information about a wiki page and its complete collection of edits (as a sublist
of <revision>
elements).
<title>
: String,
the title of this page.
<ns>
: Integer,
the code of the namespace in which this page is stored.
<id>
: Positive integer,
unique numerical identifier for this page.
<revision>
: Element
encapsulating information about a single edit on this page. For each page element,
there will be a list of subelements of this type describing the complete record of
all edits performed on this page.
<revision>
: An edit
performed on the corresponding page, indicated by its parent <page>
element.
<id>
: Positive integer,
unique numerical identifier for this revision. This identifier is unique
across the entire database, not just within this page.
<parentid>
: Positive integer,
identifier of the previous revision (the parent of this version), following
the same coding as for <id>
. If absent, this is the first
revision for this page and the parent revision is assumed to be
NULL
.
<timestamp>
: String,
timestamp (date and time) indicating when this revision was performed. The
trailing Z is the ISO 8601 designator for UTC. See example below for details about
the specific format.
<contributor>
: Element
showing information about the user who performed this revision. There are two
options: an <ip> element for anonymous edits, or <username> and <id> elements
for registered users.
<minor />
: If present, this
single tag indicates that this is a minor revision. If
absent, this is not a minor revision.
<comment>
: String, contains
the comment introduced by the user to summarize the changes introduced in this
revision.
<model>
: String, identifies
the model to interpret the content inside <text>
.
<format>
: String, indicates
the format to parse the wiki text inside <text>
.
<text>
: Empty element, with
the following attributes:
id
: Same id
of the revision.
bytes
: Length
of the text of the page on this revision, in bytes.
<sha1>
: String, provides
the SHA-1 hash computed
on the text content of this revision.
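The hash values in these dumps are 31 characters long and use digits and lowercase letters, which is consistent with a SHA-1 digest written in base 36 rather than in hexadecimal. The sketch below assumes that encoding (base 36, zero-padded to 31 characters, computed over the UTF-8 text). The function names are hypothetical, and the result should be checked against a known revision before relying on it.

```python
import hashlib

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(number):
    """Encode a non-negative integer using the digits 0-9 and a-z."""
    if number == 0:
        return "0"
    out = []
    while number:
        number, rem = divmod(number, 36)
        out.append(DIGITS[rem])
    return "".join(reversed(out))

def dump_style_sha1(text):
    """SHA-1 of the UTF-8 text in base 36, zero-padded to 31 characters (assumed encoding)."""
    digest = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    return to_base36(digest).rjust(31, "0")

value = dump_style_sha1("example wikitext")
print(value, len(value))
```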
<page>
<title>Inglis leid</title>
<ns>0</ns>
<id>2</id>
<revision>
[... metadata of first revision ...]
</revision>
<revision>
[... metadata of second revision ...]
</revision>
[... rest of revisions for this page ...]
</page>
<revision>
<id>7</id>
<timestamp>2005-06-22T10:17:05Z</timestamp>
<contributor>
<ip>24.251.198.251</ip>
</contributor>
<model>wikitext</model>
<format>text/x-wiki</format>
<text id="7" bytes="334" />
<sha1>6m5yxiaalrm6te7e3x3fiw1aq7wk9ir</sha1>
</revision>
Format: YYYY-MM-DDTHH:MM:SSZ
The 'T' character separates the date from the time, and the trailing 'Z' indicates that the time is given in UTC.
Example: <timestamp>2005-06-22T10:17:05Z</timestamp>
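A timestamp in this format can be converted with Python's standard datetime module. This is a minimal sketch that treats the trailing Z as the ISO 8601 UTC designator and attaches the UTC time zone explicitly; parse_dump_timestamp is a hypothetical helper name.

```python
from datetime import datetime, timezone

def parse_dump_timestamp(value):
    """Parse a YYYY-MM-DDTHH:MM:SSZ string into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    # strptime returns a naive datetime here, so mark it as UTC explicitly.
    return parsed.replace(tzinfo=timezone.utc)

ts = parse_dump_timestamp("2005-06-22T10:17:05Z")
print(ts.isoformat())  # 2005-06-22T10:17:05+00:00
```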
<contributor>
<ip>24.251.243.233</ip>
</contributor>
<contributor>
<username>MyLogin Name</username>
<id>3344555</id>
</contributor>
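The two forms of <contributor> shown above can be told apart by checking which child element is present. The sketch below parses the two fragments as standalone XML, so the tags carry no XML namespace; inside a real dump the tag names would need the namespace prefix. The name describe_contributor and the returned tuples are illustrative only.

```python
import xml.etree.ElementTree as ET

def describe_contributor(elem):
    """Return ('anonymous', ip) or ('registered', username, user_id)."""
    ip = elem.findtext("ip")
    if ip is not None:
        return ("anonymous", ip)
    user_id = elem.findtext("id")
    # Tolerate a missing <id> instead of failing on int(None).
    return ("registered", elem.findtext("username"), int(user_id) if user_id else None)

anon = ET.fromstring("<contributor><ip>24.251.243.233</ip></contributor>")
user = ET.fromstring(
    "<contributor><username>MyLogin Name</username><id>3344555</id></contributor>"
)
print(describe_contributor(anon))
print(describe_contributor(user))
```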
The following code snippet shows a complete excerpt of the content in one of these files:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="sco">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>scowiki</dbname>
<base>http://sco.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.25wmf24</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="-1" case="first-letter">Special</namespace>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Talk</namespace>
<namespace key="2" case="first-letter">User</namespace>
<namespace key="3" case="first-letter">User talk</namespace>
<namespace key="4" case="first-letter">Wikipedia</namespace>
<namespace key="5" case="first-letter">Wikipedia talk</namespace>
<namespace key="6" case="first-letter">File</namespace>
<namespace key="7" case="first-letter">File talk</namespace>
<namespace key="8" case="first-letter">MediaWiki</namespace>
<namespace key="9" case="first-letter">MediaWiki talk</namespace>
<namespace key="10" case="first-letter">Template</namespace>
<namespace key="11" case="first-letter">Template talk</namespace>
<namespace key="12" case="first-letter">Help</namespace>
<namespace key="13" case="first-letter">Help talk</namespace>
<namespace key="14" case="first-letter">Category</namespace>
<namespace key="15" case="first-letter">Category talk</namespace>
<namespace key="100" case="first-letter">Portal</namespace>
<namespace key="101" case="first-letter">Portal talk</namespace>
<namespace key="828" case="first-letter">Module</namespace>
<namespace key="829" case="first-letter">Module talk</namespace>
</namespaces>
</siteinfo>
<page>
<title>Inglis leid</title>
<ns>0</ns>
<id>2</id>
<revision>
<id>7</id>
<timestamp>2005-06-22T10:17:05Z</timestamp>
<contributor>
<ip>24.251.198.251</ip>
</contributor>
<model>wikitext</model>
<format>text/x-wiki</format>
<text id="7" bytes="334" />
<sha1>6m5yxiaalrm6te7e3x3fiw1aq7wk9ir</sha1>
</revision>
<revision>
<id>8</id>
<parentid>7</parentid>
<timestamp>2005-06-22T12:13:55Z</timestamp>
<contributor>
<username>Saforrest</username>
<id>5</id>
</contributor>
<minor/>
<model>wikitext</model>
<format>text/x-wiki</format>
<text id="8" bytes="351" />
<sha1>p09d2l9c1gc8tat2e3x2e72o9fxori2</sha1>
</revision>
</page>
</mediawiki>
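Since these files can be large, it is usually preferable to process them as a stream rather than load the whole document. The following sketch uses xml.etree.ElementTree.iterparse on a shortened, inline copy of the excerpt above. The namespace URI is the one from the example and may differ in other dumps; iter_revisions and the dictionary keys are hypothetical names chosen for illustration.

```python
import io
import xml.etree.ElementTree as ET

# XML namespace URI copied from the excerpt; other dumps may declare another version.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Inglis leid</title>
    <ns>0</ns>
    <id>2</id>
    <revision>
      <id>7</id>
      <timestamp>2005-06-22T10:17:05Z</timestamp>
      <contributor><ip>24.251.198.251</ip></contributor>
      <text id="7" bytes="334" />
    </revision>
    <revision>
      <id>8</id>
      <parentid>7</parentid>
      <timestamp>2005-06-22T12:13:55Z</timestamp>
      <contributor><username>Saforrest</username><id>5</id></contributor>
      <minor/>
      <text id="8" bytes="351" />
    </revision>
  </page>
</mediawiki>"""

def iter_revisions(stream):
    """Yield one dict per <revision>, clearing each <page> once it has been read."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != NS + "page":
            continue
        title = elem.findtext(NS + "title")
        for rev in elem.iter(NS + "revision"):
            parent = rev.findtext(NS + "parentid")
            yield {
                "page": title,
                "rev_id": int(rev.findtext(NS + "id")),
                "parent_id": int(parent) if parent else None,  # None for the first revision
                "minor": rev.find(NS + "minor") is not None,
                "bytes": int(rev.find(NS + "text").get("bytes")),
            }
        elem.clear()  # release the subtree of the page just processed

rows = list(iter_revisions(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

For a real dump, the stream could be opened with gzip.open(path, "rb") from the standard library instead of io.BytesIO. Each page is still held in memory until its closing tag is reached, so pages with very long histories may need a finer-grained approach.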