To measure classification performance, an online evaluation system will be maintained on this Web site. Participants will be able to submit their results in the format specified below, at most once per hour for each task. Once results are submitted, the system will evaluate the submission by computing several measures for each track. A real-time ranking of the participating systems will be available so that participants can compare their performance with that of other participants.

To avoid over-tuning of the participating systems, the results presented during the competition will be calculated on 20% of the test set. The performance on the whole test set will be reported after the end of the evaluation procedure.

Track 1

For the first track the evaluation system will report the following measures: accuracy, example-based F-measure, label-based macro F-measure, label-based micro F-measure, multi-label graph-induced error, and Lowest Common Ancestor precision, recall and F-measure.
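As an illustration of the flat multi-label measures listed above, the following is a minimal sketch using their standard textbook definitions. It is not the official evaluation package: the function names and toy data are hypothetical, and the macro average here is taken only over classes that occur in the gold or predicted labels, which may differ from the official scorer. The graph-induced error and Lowest Common Ancestor measures are not covered.

```python
from collections import Counter


def example_based_f1(gold, pred):
    """Average over examples of 2|Y ∩ Z| / (|Y| + |Z|)."""
    scores = []
    for y, z in zip(gold, pred):
        denom = len(y) + len(z)
        # Assumption: an example with no gold and no predicted labels scores 1.
        scores.append(2 * len(y & z) / denom if denom else 1.0)
    return sum(scores) / len(scores)


def label_based_f1(gold, pred):
    """Return (macro F1, micro F1) computed from per-class counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for y, z in zip(gold, pred):
        for c in y & z:
            tp[c] += 1
        for c in z - y:
            fp[c] += 1
        for c in y - z:
            fn[c] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Assumption: average only over classes seen in gold or predictions.
    labels = set(tp) | set(fp) | set(fn)
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy data (hypothetical): one set of class ids per test vector.
gold = [{543, 65}, {456}, {405, 7868}]
pred = [{543}, {456, 78}, {405, 7868}]

ex_f1 = example_based_f1(gold, pred)
macro_f1, micro_f1 = label_based_f1(gold, pred)
print(ex_f1, macro_f1, micro_f1)
```

Macro averaging weights every class equally, so rare classes influence it as much as frequent ones, whereas micro averaging pools the counts and is dominated by frequent classes.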

For more information regarding the hierarchical evaluation measures, the interested reader is referred to: Aris Kosmopoulos, Ioannis Partalas, Eric Gaussier, Georgios Paliouras and Ion Androutsopoulos, Evaluation Measures for Hierarchical Classification: a unified view and novel approaches (2013). (The package that implements the evaluation measures can be found here)

Track 2

For Track 2, the evaluation system will report accuracy, precision, and recall, together with their hierarchical variants.
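The guidelines do not spell out which hierarchical variants are used. One common formulation, shown here purely as an assumed sketch, extends each label set with all of its ancestors before computing set-based precision and recall. The `parents` mapping, the function names, and the toy hierarchy are hypothetical, and the official scorer may differ (for example, in whether the root is counted).

```python
def with_ancestors(labels, parents):
    """Return the labels together with all of their ancestors."""
    seen = set()
    stack = list(labels)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        # A node may have several parents if the hierarchy is a DAG.
        stack.extend(parents.get(node, ()))
    return seen


def hierarchical_prf(gold, pred, parents):
    """Micro-averaged precision, recall and F1 over ancestor-augmented sets."""
    overlap = gold_total = pred_total = 0
    for y, z in zip(gold, pred):
        y_aug = with_ancestors(y, parents)
        z_aug = with_ancestors(z, parents)
        overlap += len(y_aug & z_aug)
        gold_total += len(y_aug)
        pred_total += len(z_aug)
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / gold_total if gold_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy hierarchy (hypothetical): child id -> list of parent ids; 1 is the root.
parents = {2: [1], 3: [1], 4: [2], 5: [2], 6: [3]}

# Predicting a sibling (5) of the true class (4) still earns partial credit,
# because both share the ancestors 2 and 1 (the root is counted in this sketch).
h_precision, h_recall, h_f1 = hierarchical_prf([{4}], [{5}], parents)
print(h_precision, h_recall, h_f1)
```

Under this formulation, a prediction near the true class in the hierarchy is penalised less than one in a distant branch, which flat accuracy cannot express.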

Track 3

The evaluation of the Refinement Learning track will be based on ontology alignment. The participating systems will be assessed with measures similar to precision, recall and F-measure. For more information, the interested reader is referred to: Elias Zavlitsanos, Georgios Paliouras and George Vouros, Gold Standard Evaluation of Ontology Learning Methods through Ontology Transformation and Alignment, IEEE Transactions on Knowledge and Data Engineering, 23 (11) 1635-1648, 2011.

Please note that for the evaluation of Track 3 the participants must provide two files: one with the predicted hierarchy (in the same format as the provided hierarchy), and one with the predicted labels for the test file. Hint: the expected number of new leaves in the predicted hierarchy is around 2200.

Output Format

The output of each system should be in plain text format. Each line of this file must contain the classes of the hierarchy predicted by the system (separated by whitespace) for the corresponding vector of the test file. Note that, in addition to leaves, inner nodes of the hierarchy are valid classification answers.

A typical “result.txt” file (for Tracks 1 and 2) for the Wikipedia large dataset should contain 452,167 lines (as the number of vectors in the test file) and should look like this:

543 65
456 5467 78 6945 9068
405 7868
771 5476
1015 797
1354 987 978
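The following sketch shows one way to produce a file in this format and to sanity-check it before uploading. The helper names `write_results` and `check_results` are hypothetical, the check assumes class identifiers are non-negative integers as in the sample above, and it does not replicate any validation performed by the evaluation system itself.

```python
import tempfile
from pathlib import Path


def write_results(predictions, path):
    """Write one line per test vector: predicted class ids separated by spaces."""
    lines = [" ".join(str(c) for c in classes) for classes in predictions]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def check_results(path, expected_lines):
    """Check the line count and that every token is an integer class id."""
    with open(path, encoding="utf-8") as fh:
        rows = [line.split() for line in fh]
    if len(rows) != expected_lines:
        raise ValueError(f"expected {expected_lines} lines, found {len(rows)}")
    for number, row in enumerate(rows, start=1):
        # Lines without any tokens pass this check; whether empty predictions
        # are accepted by the evaluation system is not stated in the guidelines.
        if not all(token.isdigit() for token in row):
            raise ValueError(f"line {number}: non-integer class id")
    return rows


# Toy usage with three test vectors; for the Wikipedia large dataset
# expected_lines would be 452167.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "result.txt"
    write_results([[543, 65], [456, 5467, 78, 6945, 9068], [405, 7868]], out)
    rows = check_results(out, expected_lines=3)
    text = out.read_text(encoding="utf-8")
print(text)
```

Checking the line count locally is worthwhile because each line is matched to a test vector by position, so a single missing or extra line would misalign every prediction after it.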

For Track 3, in addition to the above result file, the participants should upload a text file containing the refined hierarchy in the same format as the original hierarchy provided in the tarball.

Please note that more information may be added to these guidelines during the course of the competition if needed. There is also a forum on the site for discussion and questions regarding the competition; please feel free to use it.