Third Pascal Challenge on
Large Scale Hierarchical Text Classification
We are pleased to announce the 3rd edition of the Pascal Large Scale Hierarchical Text Classification (LSHTC) and ECML/PKDD 2012 Discovery Challenge. The LSHTC Challenge is a hierarchical text classification competition, using large datasets. This year's challenge focuses on interesting learning problems like multi-task and refinement learning.
Hierarchies are becoming ever more popular for the organization of text documents, particularly on the Web. Web directories and Wikipedia are two examples of such hierarchies. Their widespread use brings a need for automated classification of new documents into the categories of the hierarchy. As the size of the hierarchy grows and the number of documents to be classified increases, a number of interesting machine learning problems arise. In particular, this is one of the rare situations where data sparsity remains an issue despite the vastness of available data: as more documents become available, more classes are also added to the hierarchy, and there is a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence between the classes poses both challenges and opportunities for learning methods.
The challenge consists of 3 tracks, involving different category systems with different data properties and focusing on different learning and mining problems. The challenge is based on two large datasets: one created from the ODP web directory (DMOZ) and one from Wikipedia. The datasets are multi-class, multi-label and hierarchical. The number of categories ranges roughly between 13,000 and 325,000, and the number of documents between 380,000 and 2,400,000. More information regarding the tracks and challenge rules can be found on the "Tasks, Rules and Guidelines" page.
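To make "multi-class, multi-label and hierarchical" concrete, the following is a minimal sketch on a toy example. The category names, the `PARENT` mapping and the helper functions are hypothetical illustrations only; they are not taken from the challenge data and do not reflect its actual file format.

```python
# Hypothetical toy hierarchy: child category -> parent category (None marks a root).
# These names are illustrative assumptions, not categories from the challenge datasets.
PARENT = {
    "Science": None,
    "Physics": "Science",
    "Optics": "Physics",
    "Arts": None,
    "Photography": "Arts",
}


def ancestors(category, parent=PARENT):
    """Return the category followed by all of its ancestors, from leaf to root."""
    path = []
    while category is not None:
        path.append(category)
        category = parent[category]
    return path


def expand_labels(labels, parent=PARENT):
    """Expand a document's labels to include every ancestor category."""
    expanded = set()
    for label in labels:
        expanded.update(ancestors(label, parent))
    return expanded


# Multi-label: one document may belong to several categories at once.
doc_labels = ["Optics", "Photography"]
print(sorted(expand_labels(doc_labels)))
```

In this sketch a document labelled with two leaf categories implicitly belongs to all of their ancestors as well, which is one way the dependence between classes in a hierarchy can be expressed.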
Participants will be able to submit runs continuously in order to improve their systems. This year we also plan a two-stage evaluation of the participating methods: one stage measuring classification performance and one measuring computational performance. It is important to measure both, as the two are interdependent.
In order to register for the challenge and gain access to the datasets you must have an account at the challenge Web site.
Ion Androutsopoulos, AUEB, Athens, Greece
Thierry Artières, LIP6, Paris, France
Patrick Gallinari, LIP6, Paris, France
Eric Gaussier, LIG, Grenoble, France
Aris Kosmopoulos, NCSR "Demokritos" & AUEB, Athens, Greece
George Paliouras, NCSR "Demokritos", Athens, Greece
Ioannis Partalas, LIG, Grenoble, France