Tasks, Rules and Guidelines

The challenge consists of three categorization tasks, each involving different documents and a different category system. The largest category system, based on Wikipedia, contains more than 300,000 categories and 2M training documents. To the best of our knowledge, the largest category system previously used for evaluation purposes was based on the Yahoo! Directory and contained 130,000 categories and 500,000 training documents. In addition to the largest task, the challenge includes two smaller ones, based on Wikipedia and DMOZ respectively, whose scale is on the order of the first edition of the challenge.

All of the datasets in this edition are multi-label. In the two Wikipedia-based datasets in particular, each document is assigned on average to 3.2 and 4.6 categories respectively. Furthermore, the hierarchies are no longer simple tree structures, as both documents and subcategories are allowed to belong to more than one category. More statistics regarding the datasets can be found at the Datasets link.

Task 1: Dmoz (27,875 categories)

In this task, participants are asked to train and test their system using documents from the Dmoz dataset. A document may belong to more than one category but this phenomenon is rare. The hierarchy of this dataset is a tree, i.e. each node has only one parent.

Task 2: Wikipedia small (36,504 categories)

In this task, participants are asked to train and test their system using a "small" dataset of Wikipedia articles. Documents in this dataset carry multiple labels far more often than those of Task 1. The hierarchy of this dataset is a graph, i.e. each node can have more than one parent. Cycles have been removed from the graph.

Task 3: Wikipedia large (325,056 categories)

In this task, participants are asked to train and test their system using a large dataset of Wikipedia articles. The number of categories and documents increases drastically compared to Task 2. The hierarchy is again a graph, but cycles have not been removed from it.
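Since the Task 3 hierarchy may contain cycles, a system that traverses it may want to check for them first. The following is a minimal sketch, assuming the hierarchy has already been loaded as a list of (parent, child) integer pairs. That representation and the function name `has_cycle` are illustrative assumptions; the actual file format is described on the Datasets page.

```python
from collections import defaultdict, deque


def has_cycle(edges):
    """Return True if the directed graph given as (parent, child) pairs has a cycle.

    Uses Kahn's topological sort: if some nodes can never reach in-degree
    zero, they lie on or behind a cycle.
    """
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # Start from the nodes that have no parent at all.
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # Nodes left unvisited are part of, or reachable only through, a cycle.
    return visited < len(nodes)
```

A node with several parents, as allowed in Tasks 2 and 3, is handled naturally: it is visited only after all of its parents have been processed.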

The documents of all three datasets were preprocessed to remove stop words and to reduce the remaining words to their stems, using the libstemmer_c program from http://snowball.tartarus.org. The stemmed tokens were then replaced by unique integers.
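The preprocessing described above can be illustrated roughly as follows. This is only a sketch of the general idea, not the organizers' pipeline: the stop-word list, the whitespace tokenization and the identifier assignment order are assumptions, and the `stem` argument defaults to the identity function as a stand-in for the Snowball stemmer.

```python
def encode_documents(docs, stop_words, stem=lambda word: word):
    """Drop stop words, stem the rest, and map each stem to a unique integer.

    `stem` is a placeholder for a real stemmer; by default it returns the
    token unchanged.
    """
    vocab = {}    # stem -> integer identifier, assigned in order of first use
    encoded = []
    for text in docs:
        ids = []
        for token in text.lower().split():
            if token in stop_words:
                continue
            stemmed = stem(token)
            if stemmed not in vocab:
                vocab[stemmed] = len(vocab) + 1
            ids.append(vocab[stemmed])
        encoded.append(ids)
    return encoded, vocab


if __name__ == "__main__":
    docs = ["the cat sat", "a cat ran"]
    encoded, vocab = encode_documents(docs, stop_words={"the", "a"})
    print(encoded)
    print(vocab)
```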

Additional information regarding the datasets can be found on the Datasets <http://lshtc.iit.demokritos.gr/LSHTC2_datasets> page. To access this page and download the datasets, you first need to register by clicking the Login/Register button at the upper-right corner of the page.

*Two-stage Evaluation*

In order to measure classification performance, an online evaluation system will be maintained on this Web site. Participants will be able to submit their results in the format specified below, at most once per hour for each task. Once the results are submitted, the system will measure the performance of the submission by computing the accuracy, example-based F-measure, label-based macro F-measure, label-based micro F-measure and multi-label graph-induced error. As in the previous competition, a live ranking will allow participants to compare their submitted results with those of the other participants.

For more information regarding the Accuracy and the various forms of F-measure for multi-label classification, the interested reader is referred to G. Tsoumakas, I. Vlahavas, “Random k-Labelsets: An Ensemble Method for Multilabel Classification”, Proceedings of the 18th European Conference on Machine Learning (ECML 2007).
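As a rough illustration of the flat measures named above, the following sketch computes accuracy, example-based F-measure, and label-based macro and micro F-measure from sets of true and predicted categories. It uses commonly cited definitions and is not the evaluation system's code; in particular, macro F is taken here as the mean of per-label F values, and the treatment of empty label sets is an assumption. The multi-label graph-induced error is not covered.

```python
from collections import defaultdict


def multilabel_scores(true_sets, pred_sets):
    """Compute four multi-label measures from parallel lists of label sets."""
    n = len(true_sets)
    acc_sum = 0.0
    f_sum = 0.0
    tp = defaultdict(int)  # per-label true positives
    fp = defaultdict(int)  # per-label false positives
    fn = defaultdict(int)  # per-label false negatives

    for y, z in zip(true_sets, pred_sets):
        inter = len(y & z)
        union = len(y | z)
        # Assumption: two empty sets count as a perfect match.
        acc_sum += inter / union if union else 1.0
        f_sum += 2 * inter / (len(y) + len(z)) if union else 1.0
        for label in y & z:
            tp[label] += 1
        for label in z - y:
            fp[label] += 1
        for label in y - z:
            fn[label] += 1

    def f1(t, p, q):
        denom = 2 * t + p + q
        return 2 * t / denom if denom else 0.0

    labels = set(tp) | set(fp) | set(fn)
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return {
        "accuracy": acc_sum / n if n else 0.0,
        "example_f": f_sum / n if n else 0.0,
        "macro_f": macro,
        "micro_f": micro,
    }
```

Note that macro F weights every category equally, so rare categories influence it as much as frequent ones, whereas micro F is dominated by the frequent categories.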

At the closing date of the testing phase, participants will be asked to submit the following:

* A short paper describing their method, including an algorithmic description, results of dry-run tests, computational complexity estimates, hardware set-up used for training the classifiers and training times. This paper will be uploaded to the site and will be publicly available.

Additionally, participants who desire to do so can submit:

* An executable of the learning program, adhering to the software and hardware requirements mentioned below, accepting input and producing output in the form provided by the challenge.

Using the executable programs, we will run a scalability test, varying the size of the hierarchy in order to measure the computational and memory requirements of each method.

Information about the Executables

Each participant must upload a compressed directory with their system. The system has to be able to run on a standard GNU/Linux operating system (Kernel 2.6.X) with 4GB of RAM and up to 10GB disk space. Click here for more information regarding the testing operating system that we are going to use.

Each system must implement in the form of bash scripts the following four commands:


initialize [filename]

Perform all steps necessary to install the system, clean it and prepare it for use. Read [filename] which contains the hierarchy information about the categories (cat_hier.txt).


train [filename]

Read [filename] which contains the training vectors (training.txt).


classify [filename]

Read [filename] which contains the test vectors (test.txt). The output of this script should be in the form described in the next section.



finalize

Used for cleanup (may be empty). After this step, no executable should be running on the system.

Details about the purpose and the contents of each of the above-mentioned files are provided in the web page of the Datasets.

Only the four scripts above will be used to interact with the submitted systems. It is up to the participants to set up the architecture of their systems so that these scripts can communicate with each other. For example, one participant could use an auxiliary file to store the trained model, while another could start a background process with which each script communicates directly.
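The auxiliary-file approach mentioned above might look roughly like the following sketch, in which a training step writes a model to disk and a later classification step, run as a separate process, reads it back. Everything here is hypothetical: the function names, the use of pickle, and the trivial "model" that always predicts the most frequent training category are stand-ins for a participant's own code.

```python
import pickle


def train_step(training_labels, model_path):
    """Hypothetical 'train' logic: remember the most frequent category.

    `training_labels` is a list of category lists, one per training document.
    """
    counts = {}
    for labels in training_labels:
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
    model = {"default_label": max(counts, key=counts.get)}
    # Persist the model so that a later, separate process can load it.
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)


def classify_step(num_test_vectors, model_path):
    """Hypothetical 'classify' logic: load the stored model and apply it."""
    with open(model_path, "rb") as handle:
        model = pickle.load(handle)
    return [[model["default_label"]] for _ in range(num_test_vectors)]
```

With this design no process needs to stay alive between the scripts, which also makes the requirement that nothing is left running after cleanup easy to satisfy.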


Output form

The output of each system should be in the form of a “result.txt” file. This file must appear in the same directory as the above scripts after the execution of the finalize script. Each line of this file must contain the numeric identifiers of the classes of the hierarchy (separated by white space) chosen by the system for the corresponding vector of the test file.

A typical “result.txt” file for the Wikipedia large dataset should contain 452,167 lines (one for each vector in the test file) and should look like this:

543 65


456 5467 78 6945 9068

405 7868

771 5476

1015 797


1354 987 978


Please note that more information may be added to these guidelines if needed during the course of the competition. There is also a forum on the site for discussion and questions regarding the competition; please feel free to use it.