To measure classification performance, an online evaluation system will be maintained on this Web site. Participants will be able to submit their results in the format specified below, at most once per hour for each task. Once results are submitted, the system will evaluate the submission by computing accuracy, example-based F-measure, label-based macro F-measure, label-based micro F-measure, multi-label graph-induced error, and hierarchical precision, recall and F-measure. A real-time ranking of the participating methods will be available so that participants can compare their performance with that of others.
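As an illustration of the flat (non-hierarchical) measures listed above, the following is a minimal sketch and not the official evaluator. It assumes that accuracy is the per-instance Jaccard overlap between true and predicted label sets, and that the macro F-measure is the mean of per-label F1 scores; the evaluation system may use different conventions. The function names `f1` and `flat_measures` are hypothetical.

```python
from collections import defaultdict


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def flat_measures(true_sets, pred_sets):
    """Compute flat multi-label measures from lists of label sets.

    Assumption: accuracy is the mean Jaccard overlap per instance, and
    macro F is the mean of per-label F1 (other conventions exist).
    """
    n = len(true_sets)
    accuracy = 0.0
    example_f = 0.0
    counts = defaultdict(lambda: [0, 0, 0])  # label -> [tp, fp, fn]

    for truth, pred in zip(true_sets, pred_sets):
        inter = truth & pred
        union = truth | pred
        # An instance with no true and no predicted labels scores 0.0 here.
        accuracy += len(inter) / len(union) if union else 0.0
        example_f += f1(len(inter), len(pred - truth), len(truth - pred))
        for label in inter:
            counts[label][0] += 1
        for label in pred - truth:
            counts[label][1] += 1
        for label in truth - pred:
            counts[label][2] += 1

    macro_f = sum(f1(*c) for c in counts.values()) / len(counts) if counts else 0.0
    # Micro averaging pools the counts of all labels before computing F1.
    micro_f = f1(
        sum(c[0] for c in counts.values()),
        sum(c[1] for c in counts.values()),
        sum(c[2] for c in counts.values()),
    )
    return {
        "accuracy": accuracy / n if n else 0.0,
        "example_f": example_f / n if n else 0.0,
        "macro_f": macro_f,
        "micro_f": micro_f,
    }
```

Macro averaging gives every label the same weight regardless of how often it occurs, whereas micro averaging is dominated by frequent labels, which is why both are reported.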


For more information on the hierarchical versions of precision, recall and F-measure, the interested reader is referred to Kiritchenko, S.: Hierarchical text categorization and its application to bio-informatics. Ph.D. thesis, University of Ottawa, Ottawa, Ont., Canada (2006).
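The general idea behind the hierarchical measures can be sketched as follows: each true and predicted label set is extended with all ancestors of its labels, and precision and recall are then computed on the extended sets, pooled over all instances. This is a rough sketch under stated assumptions and not the official evaluator: the hierarchy is assumed to be given as a hypothetical `parents` mapping from a node to its list of parents, and root nodes are assumed to be excluded from the extended sets. The thesis cited above remains the authoritative definition.

```python
def with_ancestors(labels, parents):
    """Return the labels plus all their ancestors, without root nodes.

    `parents` maps a node to a list of its parent nodes (assumed input
    representation); a node with no entry is treated as a root.
    """
    seen = set()
    stack = list(labels)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, ()))
    # Assumption: roots are dropped because every label shares them.
    return {node for node in seen if parents.get(node)}


def hierarchical_prf(true_sets, pred_sets, parents):
    """Hierarchical precision, recall and F-measure pooled over instances."""
    overlap = pred_total = true_total = 0
    for truth, pred in zip(true_sets, pred_sets):
        ext_true = with_ancestors(truth, parents)
        ext_pred = with_ancestors(pred, parents)
        overlap += len(ext_true & ext_pred)
        pred_total += len(ext_pred)
        true_total += len(ext_true)
    hp = overlap / pred_total if pred_total else 0.0
    hr = overlap / true_total if true_total else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf
```

With these measures, predicting a sibling of the correct class earns partial credit for the shared ancestors, while a prediction in an unrelated branch earns none.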


The evaluation of the unsupervised track 3b will be based on ontology alignment. The participating systems will be assessed with measures similar to precision, recall and F-measure. For more information, the interested reader is referred to Elias Zavlitsanos, Georgios Paliouras and George Vouros: Gold Standard Evaluation of Ontology Learning Methods through Ontology Transformation and Alignment, IEEE Transactions on Knowledge and Data Engineering, 23 (11) 1635-1648, 2011.

Please note that for the evaluation of track 3b, participants must provide two files: one with the predicted hierarchy (in the same format as the provided hierarchy), and one with the predicted labels for the test file. Hint: the expected number of new leaves in the predicted hierarchy is around 2200.


Output Format

The output of each system should be in plain text format. Each line of the file must contain the numeric identifiers of the classes of the hierarchy (separated by white space) chosen by the system for the corresponding vector of the test file. Note that, in addition to leaves, inner nodes of the hierarchy are valid classification answers.


A typical “result.txt” file for the Wikipedia large dataset should contain 452,167 lines (one for each vector in the test file) and should look like this:


543 65
456 5467 78 6945 9068
405 7868
771 5476
1015 797
1354 987 978
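A result file in this format could be produced and sanity-checked before submission along the following lines. This is a minimal sketch; the helper names `format_predictions` and `check_result_text` are hypothetical, and the requirement that every line contains at least one class is an assumption rather than something stated in these guidelines.

```python
def format_predictions(predictions):
    """Turn a list of class-id lists into result-file text, one line each."""
    lines = []
    for labels in predictions:
        if not labels:
            # Assumption: every test vector needs at least one class.
            raise ValueError("each test vector needs at least one class")
        lines.append(" ".join(str(int(label)) for label in labels))
    return "\n".join(lines) + "\n"


def check_result_text(text, expected_lines):
    """Check the line count and that every token is a non-negative integer."""
    lines = text.splitlines()
    if len(lines) != expected_lines:
        raise ValueError(
            f"expected {expected_lines} lines, found {len(lines)}"
        )
    for number, line in enumerate(lines, start=1):
        tokens = line.split()
        if not tokens or not all(token.isdigit() for token in tokens):
            raise ValueError(f"line {number} is not a list of class numbers")
    return True


# Usage: write the predictions, then check against the test-set size,
# e.g. 452,167 for the Wikipedia large dataset.
# with open("result.txt", "w") as handle:
#     handle.write(format_predictions(predictions))
```

Checking the line count locally is worthwhile because line i of the file is matched to vector i of the test file, so a single missing line would misalign every prediction after it.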


Please note that more information may be added to these guidelines during the course of the competition if needed. There is also a forum on the site for discussion and questions regarding the competition; please feel free to use it.