Wikipedia database dump

Wikipedia is a free online encyclopedia. Its content is remarkably rich and well illustrated, and even dry subjects are often described in lively language.

Wikipedia's database is now available for download, but there has never been a Chinese guide to it. Not only is there no Chinese documentation, even the English page has to be reached through a proxy, so I have copied the English material over here. I figure that downloading the data to your own computer to read at leisure is a way both to pick up knowledge and to practice English. The download link is download.wikipedia.org, which can currently be reached without a proxy.

The English documentation follows:

1 Schedule

Starting January 23, 2006 dumps will be run approximately once a week. Since the whole process takes more than a week for all databases, not all databases become available at the same time.
Note that the larger databases such as enwiki, dewiki, and jawiki can take a long time to run, especially when compressing the full edit history. If you see it stuck on one of these for a few hours, or up to nine days, don't worry -- it's not dead, it's just a lot of data.
The download site at http://download.wikimedia.org/ shows the status of each dump: whether it's in progress, when it was last run, and so on.

2 What's available?

Page content
Page-to-page link lists (pagelinks, categorylinks, imagelinks tables)
Image metadata (image, oldimage tables)
Misc bits (interwiki, site_stats tables)

3 What else?

The following may or may not be available, but all of it is public data:
Log data (protection, deletion, uploads) -- see logging.sql.gz
Dump metadata (availability, schedule)
Multi-language dumps (clusters of languages in one file)

4 What's not available?

User data: passwords, e-mail addresses, preferences, watchlists, etc.
Deleted page content
At the moment, uploaded files are dealt with separately and somewhat less regularly, but we intend to produce upload dumps more regularly again in the future.

5 Format
The main page data is provided in the same XML wrapper format that Special:Export produces for individual pages. It's fairly self-explanatory to look at, but there is some documentation at Help:Export.
Three sets of page data are produced for each dump, depending on what you need:
pages-articles.xml
Contains current version of all article pages, templates, and other pages
Excludes discussion pages ('Talk:') and user "home" pages ('User:')
Recommended for republishing content.
pages-meta-current.xml
Contains current version of all pages, including discussion and user "home" pages.
pages-meta-history.xml
Contains complete text of every revision of every page (can be very large!)
Recommended for research and archives.
The XML itself contains the complete, raw text of every revision, so the full-history files in particular can be extremely large; en.wikipedia.org would run upwards of six hundred gigabytes raw. Currently we compress these XML streams with bzip2 (.bz2 files) and, for the full-history dump, additionally with SevenZip (.7z files).
SevenZip's LZMA compression produces significantly smaller files for the full-history dumps, but doesn't do better than bzip2 for our other files.
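For a first look at the XML wrapper, you can stream a small slice of a compressed dump through bzip2 on the command line. This is only a sketch: the file name below follows the usual naming pattern but is a placeholder for whichever dump you actually downloaded.
# Peek at the first lines of a compressed page dump without unpacking it all
bzip2 -dc enwiki-latest-pages-articles.xml.bz2 | head -n 40
# Or decompress fully, keeping the .bz2 file (needs plenty of disk space)
bzip2 -dk enwiki-latest-pages-articles.xml.bz2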
Several of the tables are also dumped with mysqldump should anyone find them useful; the gzip-compressed SQL dumps (.sql.gz) can be read directly into a MySQL database but may be less convenient for other database formats.

6 What happened to the SQL dumps?
In mid-2005 we upgraded the Wikimedia sites to MediaWiki 1.5, which uses a very different database layout than earlier versions. SQL dumps of the 'cur' and 'old' tables are no longer available because those tables no longer exist.
We don't provide direct dumps of the new 'page', 'revision', and 'text' tables either because aggressive changes to the backend storage make this extra difficult: much data is in fact indirection pointing to another database cluster, and deleted pages which we cannot reproduce may still be present in the raw internal database blobs. The XML dump format provides forward and backward compatibility without requiring authors of third-party dump processing or statistics tools to reproduce our every internal hack. If required, you can use the mwdumper tool (see below) to produce SQL statements compatible with the version 1.4 schema from an XML dump.

7 Tools
Note:
The page import methods mentioned below don't automatically rebuild the auxiliary tables such as the links tables. The non-private auxiliary tables are provided as gzipped SQL dumps which can be imported directly into MySQL.
See also Meta's notes on rebuilding link tables
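As a sketch of such an import (the table, file, and database names here are placeholders, not names prescribed by the dump documentation), a gzipped SQL dump can be streamed straight into MySQL:
# Load a gzip-compressed auxiliary table dump (here: pagelinks) into an existing database
gzip -dc enwiki-latest-pagelinks.sql.gz | mysql -u wikiuser -p wikidb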

7.1 importDump.php
MediaWiki 1.5 and above includes a command-line script 'importDump.php' which can be used to import an XML page dump into the database. This requires first configuring and installing MediaWiki. It's also relatively slow; to import a large Wikipedia data dump into a fresh database you should consider mwdumper, below.
As an example invocation, with an XML file called temp.xml in the maintenance directory:
php maintenance/importDump.php < maintenance/temp.xml
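Because importDump.php doesn't rebuild the auxiliary tables (see the note above), a full import might be followed by MediaWiki's rebuild script. Treat this as a sketch based on the standard maintenance scripts shipped with MediaWiki 1.5 and above, not a step the dump documentation itself prescribes:
# Rebuild links, recentchanges and the search index after the import above
php maintenance/rebuildall.php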

7.2 mwdumper
mwdumper is a standalone program for filtering and converting XML dumps. It can produce output as another XML dump as well as SQL statements for inserting data directly into a database in MediaWiki's 1.4 or 1.5 schema.
Future versions of mwdumper will include support for creating a database and configuring a MediaWiki installation directly, but currently it just produces raw SQL which can be piped to MySQL. The program is written in Java and has been tested with Sun's 1.5 JRE and GNU's GCJ 4. Source is in our CVS; a precompiled .jar is available at http://download.wikimedia.org/tools/
Be sure to review the README.txt file, which is also provided; it explains the required invocation options. A friendlier wiki version of the README, with a few additional hints, is available at http://www.mediawiki.org/wiki/MWDumper
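A typical invocation converts an XML dump into SQL statements and pipes them straight into MySQL. The exact option syntax should be checked against the README for the build you download, and the database and file names below are placeholders:
# Convert a compressed XML dump to SQL for the MediaWiki 1.5 schema and load it
java -jar mwdumper.jar --format=sql:1.5 pages_full.xml.bz2 | mysql -u wikiuser -p wikidb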

7.3 bzip2
For the .bz2 files, use bzip2 to decompress. bzip2 comes standard with most Linux/Unix/Mac OS X systems these days. For Windows you may need to obtain it separately from the link below.
http://www.bzip.org/downloads.html
mwdumper can read the .bz2 files directly, but importDump.php requires piping like so: bzip2 -dc pages_current.xml.bz2 | php importDump.php

7.4 7-Zip
For the .7z files, you can use 7-Zip or p7zip to decompress. These are available as free software:
Windows: http://www.7-zip.org/
Unix/Linux/Mac OS X: http://p7zip.sourceforge.net/
Something like: 7za e -so pages_current.xml.7z | php importDump.php
will expand the current pages and pipe them to the importDump.php PHP script.

7.5 Perl importing script
This is a script Tbsmith made to import only pages in certain categories. It works with MediaWiki 1.5. The script itself is linked from the original page.

8 Producing your own dumps
MediaWiki 1.5 and above includes a command-line maintenance script, dumpBackup.php, which can be used to produce XML dumps directly, with or without page history. mwdumper can be used to make filtered dumps (like pages-articles.xml); this is also built into dumpBackup.php in the latest CVS.
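As a rough sketch of dumping your own wiki (the --current and --full flags are the commonly documented ones; confirm the available options against your MediaWiki version, and the output file names are just examples):
# Current revisions only
php maintenance/dumpBackup.php --current > mywiki-pages-current.xml
# Complete history of every page (can be very large)
php maintenance/dumpBackup.php --full > mywiki-pages-history.xml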
The program which manages our multi-database dump process is available in our source repository, but likely would require customization for use outside Wikimedia's cluster setup.

9 Where to go for help
If you have trouble with the dump files, you can:
Ask in #wikimedia-tech on irc.freenode.net (help is not available at all times)
Ask on wikitech-l on http://mail.wikimedia.org/
Alternatively, if you have a specific bug to report:
File a bug at http://bugzilla.wikimedia.org/
For French-speaking users, see also fr:Wikipédia:Requêtes XML

10 What about bittorrent?
BitTorrent is not currently used to distribute Wikimedia dumps, at least not officially. Of course, some torrents of dumps do exist. If you have started torrenting dumps, leave a note here.
Torrentspy search -- currently showing one wikipedia and one wikipedia-fr... 00:00, 2 June 2006 (UTC)

Original source: http://meta.wikimedia.org/wiki/Data_dumps

--------