Tasks, Rules and Guidelines

Tasks

The LSHTC challenge consists of four large-scale classification tasks with partially overlapping data. These tasks emulate different scenarios for learning and using classifiers, and introduce different challenges for the machine learning community, such as large-scale datasets, hierarchical category systems and non-i.i.d. data. Participants should take part in the first task (the basic classification task) and are welcome to take part in any or all of the other three tasks described below. The total number of categories considered is 12294.

The data we are considering in the LSHTC challenge is a subpart of the ODP (Open Directory Project) directory. The documents we have retained have been indexed in two ways: (a) content vectors correspond to a direct indexing of the web pages using a standard indexing chain (pre-processing, stemming/lemmatization, stop-word removal), and (b) description vectors correspond to a translation of the ODP descriptions of the web pages and the categories into feature vectors. The ODP descriptions are manually created by ODP editors when placing new documents into the ODP hierarchy. They are thus available for each document in the ODP hierarchy, but not for new documents.
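As a toy illustration of such an indexing chain (assuming nothing about the actual pipeline used by the organizers), a single document can be turned into raw term counts with standard Unix tools; the three-word stop list below is a stand-in for a real one:

```shell
# Toy sketch of a "standard indexing chain" on one document:
# case folding, crude tokenization, stop-word removal, term counting.
echo "The cat sat on the mat" |
  tr 'A-Z' 'a-z' |                      # case folding
  tr -cs 'a-z' '\n' |                   # one term per line
  grep -vxF -e 'the' -e 'on' -e 'a' |   # drop stop words (made-up stop list)
  sort | uniq -c > terms.txt            # raw term frequencies ("content vector")
cat terms.txt
```

A real chain would also apply stemming or lemmatization before counting, which these tools do not do.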

Each indexed web page belongs to a single category; multi-labeled web pages have not been included in the dataset.

Task 1: Basic

In this task, participants are asked to train and test their system using content vectors only. This task corresponds to a standard text classification task in a large scale setting.

Task 2: Cheap

In this task, participants are asked to train their system using description vectors only; testing will then be performed on content vectors. This task simulates a scenario in which word distributions are different between training and testing (non-i.i.d. case). Only a short description (typically a summary) is used for training, while the complete text is used for classifying new data.

Task 3: Expensive

In this task, participants are asked to train their system using both content and description vectors; as before, testing is performed on content vectors only. The motivation is the same as for the previous task, using this time all the available information for training.

Task 4: Full

In this task, participants are asked to train and test their system using both content and description vectors, i.e. all the information available.

Additional information regarding the datasets can be found in the Datasets <http://lshtc.iit.demokritos.gr/node/3> link. To access this page and to download the datasets, you first need to register by clicking the Login/Register button at the upper-right corner of the page.

*Two-stage Evaluation*

In order to measure classification performance, an online evaluation system will be maintained on this web site (click here to access the evaluation system). Participants will be able to submit their results in the format specified below, at a maximum frequency of once per hour per task. Once the results are submitted, the system will measure the performance of the submission by computing the macro-average F-measure, accuracy and tree-induced error. Separate running score tables will be updated for these measures, showing the ranking of the submissions so far.
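For reference, the macro-average F-measure can be sketched as follows. This is not the official evaluation code: the file names gold.txt and pred.txt, the toy data, and the convention of averaging over the classes present in the gold standard are all our own assumptions.

```shell
# Sketch of the macro-average F-measure (not the official scorer).
# Assumed layout: one class number per line, same order as the test file.
printf '1\n1\n2\n' > gold.txt   # toy ground truth
printf '1\n2\n2\n' > pred.txt   # toy predictions
paste gold.txt pred.txt | awk '
{
  gold[$1]++; pred[$2]++;          # per-class support and prediction counts
  if ($1 == $2) tp[$1]++;          # per-class true positives
}
END {
  # Average F over classes present in the gold standard; the official
  # measure may use a different convention for empty classes.
  for (c in gold) {
    p = (pred[c] ? tp[c] / pred[c] : 0);
    r = tp[c] / gold[c];
    f = (p + r > 0 ? 2 * p * r / (p + r) : 0);
    sum += f; n++;
  }
  printf "macro-F = %.4f\n", sum / n;
}' | tee macro_f.txt
```

On the toy data both classes get F = 2/3, so the macro average is 0.6667.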

At the closing date of the testing phase, participants will be asked to submit the following:

* A short paper describing their method, including an algorithmic description, results of dry-run tests, computational complexity estimates, hardware set-up used for training the classifiers and training times. This paper will be uploaded to the site and will be available to any visitor.

Additionally, participants who desire to do so can submit:

* An executable of the learning program, adhering to the software and hardware requirements mentioned below, accepting input and producing output in the form provided by the challenge.

Using the executable programs, we will run a scalability test, varying the size of the hierarchy from a few hundred to a few thousand categories, in order to measure the computational and memory requirements of each method.

Information about the Executables

Each participant must upload a compressed directory with their system. The system upload procedure will be available by the beginning of November (keep an eye on the News section and the competition mailing list for further information). The system has to be able to run on a standard GNU/Linux operating system (kernel 2.6.x) with 4GB of RAM and up to 10GB of disk space. Click here for more information regarding the operating system on which the tests will be run.

Each system must implement the following four commands in the form of bash scripts:

initialize [filename]

Perform all steps necessary to install the system, clean it, and prepare it for use. Read [filename], which contains the hierarchy information about the categories (cat_hier.txt).

train [filename1] [filename2] [filename3]

Read [filename1], which contains the training vectors (training.txt), and [filename2], which contains the class description vectors (classDescr.txt). If no class descriptions are available for the task (as in Task 1), this argument should have the value “-none”. Finally, read [filename3], which contains the validation vectors (validation.txt).

classify [filename]

Read [filename], which contains the test vectors (test.txt). The output of this script should be in the form described in the next section.

finalize

Used for cleanup (could be empty). After this step no executable should be running on the system.

Details about the purpose and the contents of each of the above-mentioned files are provided on the Datasets web page.

Only the four scripts above will be used to interact with the submitted systems. It is up to the participants to set up the architecture of their systems so that these scripts can communicate with each other. For example, one participant could use an auxiliary file to store the trained model, while another could start a background process with which each script communicates directly.
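The first of these options, communicating through files, can be sketched as follows. Everything here is illustrative: "my_learner" is a hypothetical stand-in for the participant's actual training/prediction program, and the workdir/ layout is an arbitrary choice, not part of the challenge specification.

```shell
# Minimal skeleton of the four required scripts (a sketch, not a reference
# implementation). "my_learner" is a hypothetical placeholder binary.

cat > initialize <<'EOF'
#!/bin/bash
# initialize [filename]: start from a clean state, keep the category hierarchy.
rm -rf workdir && mkdir -p workdir
cp "$1" workdir/cat_hier.txt
EOF

cat > train <<'EOF'
#!/bin/bash
# train [filename1] [filename2] [filename3]: fit a model, persist it to a file.
descr=""
[ "$2" != "-none" ] && descr="$2"        # Task 1 passes "-none" here
./my_learner train "$1" "$descr" "$3" workdir/model
EOF

cat > classify <<'EOF'
#!/bin/bash
# classify [filename]: predict one class number per test vector.
./my_learner predict "$1" workdir/model > result.txt
EOF

cat > finalize <<'EOF'
#!/bin/bash
# finalize: clean up; result.txt stays behind, no processes keep running.
rm -rf workdir
EOF

chmod +x initialize train classify finalize
```

The background-process alternative would instead have initialize start a daemon and the other scripts talk to it, e.g. over a named pipe.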


Output format

The output of each system should be in the form of a “result.txt” file. This file must appear in the same directory as the above scripts by the time the finalize script has been executed. Each line of this file must contain a single number: the class of the hierarchy chosen by the system for the corresponding vector of the test file.

A typical “result.txt” file for the large dataset should contain 34880 lines (one per vector in the test file) and should look like this:

9
33
156
405
771
1015
1170
1354

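A hypothetical pre-submission sanity check along these lines can verify the line count and that every line is a bare class number (the file is populated here with the 8-line example; for the real large dataset the expected count would be 34880):

```shell
# Hypothetical pre-submission check for result.txt (not part of the challenge).
printf '9\n33\n156\n405\n771\n1015\n1170\n1354\n' > result.txt  # the example above
expected=8                 # use 34880 for the real large-dataset test file
lines=$(wc -l < result.txt)
bad=$(grep -cvE '^[0-9]+$' result.txt)   # lines that are not a bare number
if [ "$lines" -eq "$expected" ] && [ "$bad" -eq 0 ]; then
  echo "result.txt looks OK"
else
  echo "problem: $lines lines, $bad malformed" >&2
fi
```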

Please note that more information may be added to these guidelines if needed during the course of the competition.