Tracks, Rules and Guidelines

The challenge consists of 3 tracks, involving different category systems with different data properties and focusing on different learning and mining problems. The challenge is based on two large datasets: one created from the ODP web directory (DMOZ) and one from Wikipedia. The datasets are multi-class, multi-label and hierarchical. The number of categories ranges from roughly 13,000 to 325,000, and the number of documents from 380,000 to 2,400,000.

 

Track 1: Large Scale Hierarchical Classification.

This track is the standard large scale hierarchical classification task, based on Wikipedia data, and comprises two subtasks:

  • Medium-size: This subtask is based on a medium-sized Wikipedia dataset with a total of 36,500 categories. The participants will be provided with two versions of the dataset: a pre-processed version, where features are replaced by numeric ids, and the original text data, without any pre-processing. With the original text, participants can experiment with their own pre-processing techniques in order to boost the performance of their systems. The hierarchy of this dataset is an acyclic graph in which a node can have more than one parent.
  • Large Wikipedia: In this subtask the number of categories is expanded to roughly 325,000. The hierarchy is a graph that can have cycles. Participants will receive pre-processed data for this subtask.
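
The difference between the two hierarchies (acyclic with multiple parents versus a graph that may contain cycles) matters for any method that walks the hierarchy. The following is a minimal sketch of how one might check both properties. It assumes the hierarchy has been loaded as a list of (parent, child) id pairs; that representation and the function names are illustrative assumptions, not the format of the challenge files.

```python
from collections import defaultdict

def has_cycle(edges):
    """Return True if the directed graph given by (parent, child) pairs has a cycle.

    Uses an iterative depth-first search so that very deep hierarchies
    do not hit Python's recursion limit.
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    nodes = {n for edge in edges for n in edge}

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {n: WHITE for n in nodes}
    for start in nodes:
        if color[start] != WHITE:
            continue
        color[start] = GREY
        stack = [(start, iter(children.get(start, ())))]
        while stack:
            node, it = stack[-1]
            descended = False
            for child in it:
                if color[child] == GREY:
                    return True  # back edge to a node on the current path
                if color[child] == WHITE:
                    color[child] = GREY
                    stack.append((child, iter(children.get(child, ()))))
                    descended = True
                    break
            if not descended:
                color[node] = BLACK
                stack.pop()
    return False

def multi_parent_nodes(edges):
    """Return the set of nodes that have more than one distinct parent."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return {n for n, p in parents.items() if len(p) > 1}

# Toy hierarchy (hypothetical ids): node 4 has two parents, no cycle.
dag_edges = [(1, 2), (1, 3), (2, 4), (3, 4)]
# Adding an edge from 4 back to 1 introduces a cycle.
cyclic_edges = dag_edges + [(4, 1)]
```

On the toy data, `has_cycle(dag_edges)` is false and `has_cycle(cyclic_edges)` is true, while `multi_parent_nodes(dag_edges)` contains only node 4. A method that propagates labels to ancestors would need a visited set, as above, before being applied to the large Wikipedia hierarchy.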

 

Track 2: Multi-task Learning.

This track introduces a multi-task learning task between the DMOZ and the medium-sized Wikipedia datasets. Multi-task learning aims at improving classification in one category system by leveraging the classification results obtained in a different, yet related, category system. One makes use of the information shared between the two category systems in order to improve classification performance on each of the individual tasks. For the challenge, the participants will be provided with the DMOZ and medium-sized Wikipedia datasets under a common feature space. The participating methods will be assessed on test sets from both datasets.

 

Track 3: Refinement Learning.

By refinement we refer to the process of creating new categories in the hierarchy by splitting old ones. In the context of hierarchy development, the creation of new categories, and thus the expansion of the hierarchy, corresponds to a scenario in which users interact with the taxonomy and modify it so as to best match their needs. After the creation of several new categories, the system has to reassign documents to them. The development of automated solutions that can support this procedure efficiently has direct, practical impact. The proposed track addresses this scenario. It comprises two subtasks, a semi-supervised and an unsupervised one:

  • Semi-supervised: Participants will receive a reduced hierarchy of the DMOZ dataset comprising ca. 12,000 categories, and an expansion of it with ca. 2,000 new categories, each of which will contain only a small number of documents (around 2) that will serve as seeds for the learning phase. The objective will also be to increase the overall performance of hierarchical classification on the test set.
  • Unsupervised: In this case the participants will receive a reduced hierarchy, as in the semi-supervised subtask. No further information about the hierarchy will be provided except the number of expected classes in the expanded hierarchy. The objective here is to break a number of classes down into sub-classes. The participants will be assessed on the similarity of the constructed hierarchy to the true one.

The hierarchy for this track is a tree.
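
To make the semi-supervised subtask concrete, here is a minimal sketch of one naive baseline: build a centroid from the few seed documents of each new category, then assign each remaining document to the new category whose centroid is most similar under cosine similarity. This is only an illustration and not a method prescribed by the challenge. The sparse-dictionary document representation (feature id mapped to weight) and all function names are assumptions.

```python
import math
from collections import defaultdict

def centroid(docs):
    """Average a list of sparse documents ({feature_id: weight})."""
    total = defaultdict(float)
    for doc in docs:
        for feat, weight in doc.items():
            total[feat] += weight
    n = len(docs)
    return {feat: weight / n for feat, weight in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def assign_to_new_categories(docs, seeds):
    """Assign each document to the new category with the closest seed centroid.

    `seeds` maps a new category id to its list of seed documents.
    """
    centroids = {cat: centroid(seed_docs) for cat, seed_docs in seeds.items()}
    return [max(centroids, key=lambda cat: cosine(doc, centroids[cat]))
            for doc in docs]

# Toy example with hypothetical category ids and feature ids,
# using two seed documents per new category.
seeds = {
    "new_a": [{1: 1.0}, {1: 2.0, 2: 1.0}],
    "new_b": [{3: 1.0}, {3: 1.0, 4: 1.0}],
}
unassigned = [{1: 3.0}, {4: 2.0}]
assignments = assign_to_new_categories(unassigned, seeds)
```

With only about two seeds per category the centroids are noisy, so in practice one would probably restrict the candidates to the new children of a document's original category, which is straightforward here because the hierarchy is a tree.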

 

Evaluation

The participants will be able to upload their results to an online evaluation system. For more information please refer to the evaluation section.

 

At the closing date of the testing phase, participants will be asked to submit the following:

* A short paper describing their method, including an algorithmic description, results of dry-run tests, computational complexity estimates, the hardware set-up used for training the classifiers, and training times. This paper will be uploaded to the site and will be publicly available.