In task 1 we offer 3 files (train.txt, validation.txt, test.txt).
Each of them contains content vectors, which are obtained by directly indexing the web pages with a standard indexing chain (pre-processing, stemming/lemmatization, stop-word removal).
In task 2 we offer 4 files (train.txt, classDescr.txt, validation.txt, test.txt).
The train.txt contains description vectors which correspond to a translation of the ODP descriptions of the web pages into feature vectors.
The classDescr.txt file contains vectors which correspond to a translation of the ODP descriptions of the categories into feature vectors. This file does not contain a vector for every category, because the RDF of the ODP provides a description for only a subset of the classes.
Both of the above files should be used for training.
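As a minimal sketch of using both files for training, the lines of train.txt and classDescr.txt can simply be read into one list. This assumes one vector per line, which is not stated above; the helper name load_training_lines and the directory argument are hypothetical, and each line is kept as an opaque string because the vector format is not described here.

```python
from pathlib import Path


def load_training_lines(task_dir):
    """Return the non-empty lines of train.txt followed by those of classDescr.txt.

    Assumption: each file stores one vector per line. Lines are returned
    unparsed, since the vector format is not specified in this description.
    """
    lines = []
    for name in ("train.txt", "classDescr.txt"):
        path = Path(task_dir) / name
        with path.open(encoding="utf-8") as handle:
            # Skip blank lines and drop the trailing newline of each vector line.
            lines.extend(line.rstrip("\n") for line in handle if line.strip())
    return lines
```

The returned list can then be passed to whatever parser matches the actual vector format.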
The validation.txt contains description vectors which correspond to a translation of the ODP descriptions of the web pages into feature vectors.
The test.txt file contains content vectors which correspond to a direct indexing of the web pages using a standard indexing chain (pre-processing, stemming/lemmatization, stop-word removal).
In task 3 we offer 4 files (train.txt, classDescr.txt, validation.txt, test.txt).
The train.txt contains vectors which are a combination of content and description vectors of the web pages.
So each vector contains features from both the direct indexing of the web page and the translation of the ODP description of the web page.
The classDescr.txt file contains vectors which correspond to a translation of the ODP descriptions of the categories into feature vectors. This file does not contain a vector for every category, because the RDF of the ODP provides a description for only a subset of the classes.
Both of the above files should be used for training.
The validation.txt contains vectors which are a combination of content and description vectors of the web pages.
So each vector contains features from both the direct indexing of the web page and the translation of the ODP description of the web page.
The test.txt file contains content vectors which correspond to a direct indexing of the web pages using a standard indexing chain (pre-processing, stemming/lemmatization, stop-word removal).
In task 4 we offer 4 files (train.txt, classDescr.txt, validation.txt, test.txt).
The train.txt contains vectors which are a combination of content and description vectors of the web pages.
So each vector contains features from both the direct indexing of the web page and the translation of the ODP description of the web page.
The classDescr.txt file contains vectors which correspond to a translation of the ODP descriptions of the categories into feature vectors. This file does not contain a vector for every category, because the RDF of the ODP provides a description for only a subset of the classes.
Both of the above files should be used for training.
The validation.txt and test.txt files contain vectors which are a combination of content and description vectors of the web pages.
So each vector contains features from both the direct indexing of the web page and the translation of the ODP description of the web page.