Search Engines: Information Retrieval in Practice

Document Collections

See license information below before downloading these collections.

Wiki Small (6,043 documents) [tar.gz, 26MB] [corpus, 36MB]
Wiki Large (121,790 documents) [tar.gz, 524MB] [corpus, 715MB]
CACM (3,204 documents) [tar.gz, 1MB] [corpus, 1MB] [relevance judgments] [raw queries] [processed queries]

Notes on the Collections

Licenses

The Wiki collections contain text distributed under the GNU Free Documentation License (GFDL), which is described in more detail on the Wikipedia Copyrights page. Please note that extracting these pages and serving them on a web server constitutes a Wikipedia trademark violation.

The CACM collection comes from the Association for Computing Machinery (ACM), and is meant for research purposes.

Format, Content and Statistics

Each file distributed here is an archive of HTML documents. The tar.gz files are archived with the GNU tar program and compressed with gzip. You can decompress them if you'd like to browse through the text with a web browser. Windows users can use WinZip to decompress them. Mac and Linux users should have GNU tar already installed.
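If you would rather unpack an archive from a script than from the command line, the following is a minimal sketch using Python's standard tarfile module. The function name extract_collection and the destination directory are illustrative assumptions, not part of the distribution, and the sketch makes no assumption about how the files inside the archive are named.

```python
import tarfile
from pathlib import Path


def extract_collection(archive_path, dest_dir):
    """Unpack a gzip-compressed tar archive into dest_dir.

    Returns a sorted list of the regular files found under dest_dir
    after extraction. (Hypothetical helper; the name is an assumption.)
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # "r:gz" tells tarfile to expect gzip compression explicitly.
    with tarfile.open(archive_path, mode="r:gz") as archive:
        archive.extractall(dest)

    return sorted(p for p in dest.rglob("*") if p.is_file())
```

For example, extract_collection("wiki-small.tar.gz", "wiki-small") would unpack the Wiki Small archive into a wiki-small directory, assuming the archive has been downloaded to the current directory. As with any archive, only extract files obtained from a source you trust.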

Each file also comes in Corpus format for easy processing with Galago. Unlike other similar formats, a Corpus file makes it easy to access any individual document directly, without reading the whole file. This random access capability makes the Corpus files a bit bigger than the archives, even though the document text is compressed.

Wiki Small and Wiki Large were created from a snapshot of the English Wikipedia downloaded from static.wikipedia.org in early September 2008. All pages containing the words Talk, Category, Portal, Template, User, Image, or Wikipedia in the URL were removed from the snapshot, as were all redirect pages. The filtered snapshot was then sampled uniformly to create the collections you see here: the Large collection is a 5% sample of the filtered snapshot, and the Small collection is a 5% sample of the Large collection. The corpus files contain pseudo-URLs for each page that start with http://wiki-corpus/.
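The filtering and two-stage sampling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script actually used to build the collections: the function names, the case-sensitive substring match, the is_redirect flag, and the use of fixed-size sampling without replacement are all assumptions.

```python
import random

# Words that caused a page to be dropped, as listed in the text above.
EXCLUDED_WORDS = ("Talk", "Category", "Portal", "Template",
                  "User", "Image", "Wikipedia")


def keep_page(url, is_redirect=False):
    """Return True if a page survives filtering.

    Assumption: matching is a case-sensitive substring test on the URL.
    """
    if is_redirect:
        return False
    return not any(word in url for word in EXCLUDED_WORDS)


def uniform_sample(items, fraction, seed=0):
    """Draw a uniform sample without replacement.

    Assumption: the sample size is round(fraction * len(items)); the
    original process may instead have kept each page independently.
    """
    items = list(items)
    rng = random.Random(seed)
    k = round(fraction * len(items))
    return rng.sample(items, k)
```

Under these assumptions, the Large collection would correspond to uniform_sample(filtered_pages, 0.05) and the Small collection to uniform_sample(large, 0.05), where filtered_pages is a hypothetical list of the pages for which keep_page returns True.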

CACM is a collection of abstracts of articles published in the Communications of the ACM journal between 1958 and 1979. This collection has been used in numerous information retrieval papers, although it is now considered too small for new publications.

MD5 Hashes

wiki-small.tar.gz: ce98d0fd8c251c1bebe643384f6b0591
wiki-small.corpus: 721c358e3f168c05e7f01a4da0f06451
wiki-large.tar.gz: b895bad3e927838bf74d02354678a5f3
wiki-large.corpus: 0d63643b21b21cd1bc5affe039a206e2
cacm.corpus: 37f1b8199b2ef9b77b9495ed1ae885b3
cacm.tar.gz: 59d77bea5ea622c6aa2fc979c8036acb
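To check a download against the hashes above, something like the following sketch can be used. It relies only on Python's standard hashlib module; the names EXPECTED_MD5, md5_of_file, and verify_download are illustrative assumptions, and the hash values are copied from the list above.

```python
import hashlib

# Hashes copied from the list above, keyed by file name.
EXPECTED_MD5 = {
    "wiki-small.tar.gz": "ce98d0fd8c251c1bebe643384f6b0591",
    "wiki-small.corpus": "721c358e3f168c05e7f01a4da0f06451",
    "wiki-large.tar.gz": "b895bad3e927838bf74d02354678a5f3",
    "wiki-large.corpus": "0d63643b21b21cd1bc5affe039a206e2",
    "cacm.corpus": "37f1b8199b2ef9b77b9495ed1ae885b3",
    "cacm.tar.gz": "59d77bea5ea622c6aa2fc979c8036acb",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, name):
    """Return True if the file at path matches the published hash for name."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, verify_download("wiki-small.tar.gz", "wiki-small.tar.gz") should return True for an intact download. A mismatch usually means the download was truncated or corrupted and should be repeated.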