Search Engines: Information Retrieval in Practice

Document Collections

See license information below before downloading these collections.

Notes on the Collections


The Wiki collections contain text distributed under the GNU Free Documentation License (GFDL), which is described in more detail on the Wikipedia Copyrights page. Please note that extracting these pages and serving them on a webserver constitutes a Wikipedia trademark violation.

The CACM collection comes from the Association for Computing Machinery (ACM), and is meant for research purposes.

Format, Content and Statistics

Each file distributed here is an archive of HTML documents. The tar.gz files are archived with the GNU tar program and compressed with gzip. You can decompress them if you'd like to browse through the text with a web browser. Windows users can use WinZip to decompress them. Mac and Linux users should have GNU tar already installed.
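For readers who would rather unpack an archive from a script than from the command line, the following is a minimal sketch using Python's standard tarfile module. The function name extract_collection and the paths passed to it are hypothetical, chosen only for illustration; they are not part of the distribution.

```python
import tarfile
from pathlib import Path


def extract_collection(archive_path: str, dest_dir: str) -> list[str]:
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the member names found in the archive. Both arguments are
    placeholders (assumptions); substitute the file you downloaded.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)

    with tarfile.open(archive_path, mode="r:gz") as archive:
        members = archive.getmembers()
        # Refuse members that would be written outside the destination.
        for member in members:
            target = (dest / member.name).resolve()
            if not target.is_relative_to(dest):
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(path=dest)

    return [member.name for member in members]
```

Calling extract_collection("collection.tar.gz", "collection") with the name of a downloaded archive should leave the HTML documents under the destination directory, where they can be opened in a browser. The sketch needs Python 3.9 or later.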

Each file also comes in Corpus format for easy processing with Galago. Unlike other similar formats, a Corpus file makes it easy to access individual documents at random. The random access capability makes the files a bit bigger, but the document text is compressed.

Wiki Small and Wiki Large were created from a snapshot of the English Wikipedia downloaded in early September 2008. All pages containing the words Talk, Category, Portal, Template, User, Image, or Wikipedia in the URL were removed from the snapshot, as were all redirect pages. The filtered snapshot was then sampled uniformly to create the collections you see here: the Large collection is a 5% sample of the filtered snapshot, and the Small collection is a 5% sample of the Large collection. The corpus files contain pseudo-URLs for each page that start with http://wiki-corpus/.
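The filter-then-sample procedure described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the script actually used to build the collections: the function names are hypothetical, the word match is assumed to be a case-sensitive substring test on the URL, and "sampled uniformly" is interpreted as a fixed-size sample drawn without replacement.

```python
import random

# Words listed in the description above; a page is dropped if its URL
# contains any of them (assumed to be a case-sensitive substring match).
EXCLUDED_WORDS = (
    "Talk", "Category", "Portal", "Template", "User", "Image", "Wikipedia",
)


def keep_page(url: str, is_redirect: bool) -> bool:
    """Return True if the page survives the filtering step."""
    if is_redirect:
        return False
    return not any(word in url for word in EXCLUDED_WORDS)


def uniform_sample(items: list, fraction: float, seed: int = 0) -> list:
    """Draw a uniform sample without replacement.

    The sample size is len(items) * fraction, rounded to the nearest
    integer. The fixed seed is an assumption, used for repeatability.
    """
    rng = random.Random(seed)
    size = round(len(items) * fraction)
    return rng.sample(items, size)
```

Given a list of (url, is_redirect) pairs, the two collections would then correspond to large = uniform_sample(kept, 0.05) followed by small = uniform_sample(large, 0.05), where kept holds the URLs for which keep_page returns True. Sampling Small from Large, rather than from the full snapshot, means every Small document also appears in Large.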

CACM is a collection of abstracts of articles published in the Communications of the ACM journal between 1958 and 1979. This collection has been used in numerous information retrieval papers, although it is now considered too small for new publications.

MD5 Hashes