LSHTC: A Benchmark for Large-Scale Text Classification

LSHTC is a series of challenges that aims to assess the performance of classification systems in large-scale classification with a large number of classes (up to hundreds of thousands). Four editions of the LSHTC challenge were organized from 2010 to 2014.

On this page you can download the datasets used in the different editions of the challenge.

The datasets are provided in the LibSVM format, where each line corresponds to an instance. For each dataset, we also provide a hierarchy file which contains parent-child relations for the categories of the dataset. For further details, please refer to the corresponding paper: LSHTC: A Benchmark for Large-Scale Text Classification.
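As an illustration, a minimal parser for the multi-label LibSVM convention (comma-separated label ids, followed by feature:value pairs; the function name and example line here are hypothetical) could look like this:

```python
def parse_libsvm_line(line):
    """Parse one multi-label LibSVM line of the form 'l1,l2 f1:v1 f2:v2 ...'.

    Returns a list of label ids and a {feature_index: value} dict.
    """
    parts = line.split()
    labels = [int(label) for label in parts[0].split(",")]
    features = {}
    for token in parts[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return labels, features

# Hypothetical instance with labels 24 and 52 and two non-zero features.
labels, features = parse_libsvm_line("24,52 10:0.5 42:1.0")
```

Libraries such as scikit-learn also ship loaders for this format (e.g. `sklearn.datasets.load_svmlight_file` with `multilabel=True`).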




All datasets in this edition are multi-label. The DMOZ dataset has a tree hierarchy, while the hierarchy of the Wikipedia small dataset is a DAG. Finally, the hierarchy of the Wikipedia large dataset is a graph with cycles.
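The three hierarchy types can be told apart programmatically. A sketch (assuming each line of a hierarchy file is a whitespace-separated "parent child" pair of category ids; the function name is hypothetical):

```python
from collections import defaultdict

def classify_hierarchy(edges):
    """Classify a list of (parent, child) edges as 'tree', 'dag', or 'cyclic'."""
    children = defaultdict(list)
    parent_count = defaultdict(int)
    for parent, child in edges:
        children[parent].append(child)
        parent_count[child] += 1

    # Cycle detection with an iterative three-color depth-first search.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)  # defaults to WHITE
    nodes = set(children) | set(parent_count)
    for start in nodes:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(children[start]))]
        while stack:
            node, child_iter = stack[-1]
            advanced = False
            for nxt in child_iter:
                if color[nxt] == GRAY:   # back edge: cycle found
                    return "cyclic"
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(children[nxt])))
                    advanced = True
                    break
            if not advanced:             # all children explored
                color[node] = BLACK
                stack.pop()

    # Acyclic: it is a tree iff no category has more than one parent.
    return "tree" if all(c <= 1 for c in parent_count.values()) else "dag"
```

For example, edges [(1, 2), (1, 3)] form a tree, [(1, 3), (2, 3)] a DAG (node 3 has two parents), and [(1, 2), (2, 1)] a cyclic graph.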


The main addition in this edition of the challenge was the release of the raw text data for the small Wikipedia dataset.


The fourth edition of LSHTC ran as a Kaggle competition using the large Wikipedia dataset.


Descriptions of the participating systems can be found on GitHub.


Please cite the following paper if you use any of the LSHTC datasets:

@article{partalas2015lshtc,
  author    = {Ioannis Partalas and
               Aris Kosmopoulos and
               Nicolas Baskiotis and
               Thierry Arti{\`{e}}res and
               George Paliouras and
               {\'{E}}ric Gaussier and
               Ion Androutsopoulos and
               Massih{-}Reza Amini and
               Patrick Gallinari},
  title     = {{LSHTC:} {A} Benchmark for Large-Scale Text Classification},
  journal   = {CoRR},
  volume    = {abs/1503.08581},
  year      = {2015}
}


For further requests or questions, please contact Ioannis Partalas: firstname [dot] lastname [at] gmail [dot] com