Fourth Challenge on
Large Scale Hierarchical Text Classification
Please cite the following paper if you use datasets from LSHTC:
LSHTC: A Benchmark for Large-Scale Text Classification, Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artieres, George Paliouras, Eric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, Patrick Gallinari, CoRR abs/1503.08581, 2015
We are pleased to announce the 4th edition of the Large Scale Hierarchical Text Classification (LSHTC) Challenge. The LSHTC Challenge is a hierarchical text classification competition, using very large datasets. This year's challenge focuses on interesting learning problems like multi-task and refinement learning.
Hierarchies are becoming ever more popular for the organization of text documents, particularly on the Web. Web directories and Wikipedia are two examples of such hierarchies. Along with their widespread use comes the need for automated classification of new documents into the categories of the hierarchy. As the size of the hierarchy grows and the number of documents to be classified increases, a number of interesting machine learning problems arise. In particular, it is one of the rare situations where data sparsity remains an issue despite the vastness of available data: as more documents become available, more classes are also added to the hierarchy, and there is a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence of the classes poses challenges and opportunities for new learning methods.
The challenge consists of 3 tracks, involving different category systems with different data properties and focusing on different learning and mining problems. The challenge is based on two large datasets: one created from the ODP web directory (DMOZ) and one from Wikipedia. The datasets are multi-class, multi-label and hierarchical. The number of categories ranges roughly between 13,000 and 325,000, and the number of documents between 380,000 and 2,400,000. More information regarding the tracks and challenge rules can be found at the "Datasets, Tracks, Rules and Guidelines" page.
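To make the "multi-label and hierarchical" property concrete, here is a minimal Python sketch of a document that carries several categories at once, with its label set expanded to include every ancestor category. The toy hierarchy, the category ids and the helper name `with_ancestors` are all hypothetical and are not taken from the challenge data or its file formats; the sketch also assumes a category may have more than one parent.

```python
# Hypothetical toy hierarchy: category id -> list of parent category ids.
# Assumption: a category may have several parents (the hierarchy is a DAG).
PARENTS = {
    1: [],       # root
    2: [1],
    3: [1],
    4: [2],
    5: [2, 3],   # category with two parents
}


def with_ancestors(labels, parents):
    """Return the given labels plus every ancestor reachable via `parents`."""
    seen = set()
    stack = list(labels)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        # Walk upwards; unknown ids are treated as having no parents.
        stack.extend(parents.get(node, []))
    return seen


# A multi-label document: it belongs to two categories at the same time.
doc_labels = {4, 5}
expanded = with_ancestors(doc_labels, PARENTS)
print(sorted(expanded))  # [1, 2, 3, 4, 5]
```

Expanding labels upwards in this way is one common option for hierarchical classification, since it lets a classifier trained at an upper level see the documents of all its descendants; it is shown here only as an illustration, not as a method prescribed by the challenge.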
Participants will be able to submit runs continuously, in order to improve their systems.
To register for the challenge and gain access to the datasets, you must have an account on the challenge Web site.
Massih-Reza Amini, LIG, Grenoble, France
Ion Androutsopoulos, AUEB, Athens, Greece
Thierry Artières, LIP6, Paris, France
Nicolas Baskiotis, LIP6, Paris, France
Patrick Gallinari, LIP6, Paris, France
Eric Gaussier, LIG, Grenoble, France
Aris Kosmopoulos, NCSR "Demokritos" & AUEB, Athens, Greece
George Paliouras, NCSR "Demokritos", Athens, Greece
Ioannis Partalas, LIG, Grenoble, France