Questions on class descriptions

My understanding is that Task 2 relies only on the class descriptions for training. The classDescr.txt files in the dry run data contain only 394 lines, presumably one per class description. Does that mean there is a description for only 394 out of 1139 categories? What information are we supposed to use for learning the other categories? Second point: assuming the 394 classes described are the same in Task 2, Task 3 and Task 4, why are the indexed sparse vectors corresponding to the descriptions not the same in all three classDescr.txt files? This makes no sense...

Questions on class descriptions

Task 2 relies only on description vectors for training. We never said that it relies only on the class descriptions for training. The train.txt file contains the description vectors for each site, and classDescr.txt contains the description vectors for some of the dmoz categories. Both of them should be used for training in Task 2.

The RDF of dmoz does not contain a description for every category, only for a subset of them. This is why we do not provide a description for every class. For every class you can train on the information in train.txt; for some classes, classDescr.txt provides additional information.
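
As a rough illustration of using both files together, here is a minimal sketch. The line format (a class label followed by `feature:value` pairs) is an assumption and is not stated in this thread. The sample lines and the helper names `parse_line` and `build_training_set` are hypothetical.

```python
from collections import defaultdict


def parse_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val}).

    ASSUMPTION: this sparse line format is a guess, not the documented one.
    """
    parts = line.split()
    label = int(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx)] = float(val)
    return label, features


def build_training_set(train_lines, class_descr_lines):
    """Group vectors by class; class descriptions are added as extra examples."""
    by_class = defaultdict(list)
    for line in train_lines:
        label, feats = parse_line(line)
        by_class[label].append(feats)
    # Only some classes have a description, so only those gain an extra vector.
    for line in class_descr_lines:
        label, feats = parse_line(line)
        by_class[label].append(feats)
    return dict(by_class)


# Hypothetical sample data standing in for train.txt and classDescr.txt.
train_lines = ["10 1:2 5:1", "10 2:1", "20 3:4"]
class_descr_lines = ["10 1:1 7:3"]
training = build_training_set(train_lines, class_descr_lines)
```

In this toy data, class 10 has a description and ends up with three vectors, while class 20 has none and is learned from train.txt alone.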

As for your second point, in the Dataset section we state that:

"Each feature number is associated to a stemmed word. The mapping between the integers and the stemmed words is different in each task. Therefore, the models trained on the training set of one task cannot be used on the test set of another."

This is why the classDescr.txt files seem different in every task.

Thanks ...

... for the quick reply.

It would be good to clarify that on the description page (I see I was not the only one struck with that misunderstanding).

Also: in tasks 3 and 4, where are the document descriptions then? It seems that only the content is available for train/validation/test.

Q2:
Regarding the difference in classDescr.txt: Is that a convoluted way of saying that you used different indexes for the different tasks within the same dataset?

Is there any reason for doing that?

Any chance you could make available a lightly processed plain-text version of the documents/descriptions so that we can do our own indexing + text processing?

Q2 answer

The reason for this is that we do not want somebody to be able to use files from one task in order to train a classifier and use it on the test file of another task.
So, for example, the feature with id=78 is a different token in each dataset.
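
To make the consequence concrete, here is a small sketch. The two vocabularies and their stemmed tokens are invented for illustration (the real id-to-token mappings are not published), and `decode` is a hypothetical helper.

```python
# ASSUMPTION: made-up mappings; the real id-to-token tables are not available.
task2_vocab = {78: "footbal", 79: "recip"}
task3_vocab = {78: "recip", 79: "footbal"}


def decode(vector, vocab):
    """Translate a sparse {feature_id: count} vector back into tokens."""
    return {vocab[i]: v for i, v in vector.items()}


# The same sparse vector read under each task's mapping.
vector = {78: 3, 79: 1}
as_task2 = decode(vector, task2_vocab)
as_task3 = decode(vector, task3_vocab)
```

The same ids decode to different words under each mapping, so a weight learned for feature 78 in one task would be applied to an unrelated word in another.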

A lightly processed plain-text of the documents/descriptions will not be available.

Clarifying things.

I will try to clarify things.

In task 1 we offer 3 files (train.txt, validation.txt, test.txt).

Each one of them contains content vectors which correspond to a direct indexing of the web pages using a standard indexing chain (pre-processing, stemming/lemmatization, stop-word removal).

In task 2 we offer 4 files (train.txt, classDescr.txt, validation.txt, test.txt)

The train.txt contains description vectors which correspond to a translation of the ODP descriptions of the web pages into feature vectors.

The classDescr.txt contains vectors which correspond to a translation of the ODP descriptions of the categories into feature vectors. This file does not contain a vector for each category because the RDF of the ODP does not contain a description for every class but only for a subset of them.

Both of the above files should be used for training.

The validation.txt and test.txt files contain content vectors which correspond to a direct indexing of the web pages using a standard indexing chain (pre-processing, stemming/lemmatization, stop-word removal).

In task 3 we offer 4 files (train.txt, classDescr.txt, validation.txt, test.txt)

The train.txt contains vectors which are a combination of content and description vectors of the web pages.
So each vector contains features from both the direct indexing of the web page and the translation of the ODP description of the web page.

The classDescr.txt contains vectors which correspond to a translation of the ODP descriptions of the categories into feature vectors. This file does not contain a vector for each category because the RDF of the ODP does not contain a description for every class but only for a subset of them.

Both of the above files should be used for training.

The validation.txt and test.txt files contain content vectors which correspond to a direct indexing of the web pages using a standard indexing chain (pre-processing, stemming/lemmatization, stop-word removal).

In task 4 we offer 4 files (train.txt, classDescr.txt, validation.txt, test.txt)

The train.txt contains vectors which are a combination of content and description vectors of the web pages.
So each vector contains features from both the direct indexing of the web page and the translation of the ODP description of the web page.

The classDescr.txt contains vectors which correspond to a translation of the ODP descriptions of the categories into feature vectors. This file does not contain a vector for each category because the RDF of the ODP does not contain a description for every class but only for a subset of them.

Both of the above files should be used for training.

The validation.txt and test.txt files contain vectors which are a combination of content and description vectors of the web pages.
So each vector contains features from both the direct indexing of the web page and the translation of the ODP description of the web page.
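
For quick reference, the file layout described above can be written down as a small lookup table. This is only a restatement of this post; the names `TASK_FILES` and `training_files` and the vector-type labels are made up for the sketch.

```python
# Vector type per file and task, as described in this post.
# The label strings are hypothetical shorthand, not official terminology.
TASK_FILES = {
    1: {
        "train.txt": "content",
        "validation.txt": "content",
        "test.txt": "content",
    },
    2: {
        "train.txt": "description",
        "classDescr.txt": "class description",
        "validation.txt": "content",
        "test.txt": "content",
    },
    3: {
        "train.txt": "content+description",
        "classDescr.txt": "class description",
        "validation.txt": "content",
        "test.txt": "content",
    },
    4: {
        "train.txt": "content+description",
        "classDescr.txt": "class description",
        "validation.txt": "content+description",
        "test.txt": "content+description",
    },
}


def training_files(task):
    """Return the files meant to be used for training in the given task."""
    return [
        name
        for name in TASK_FILES[task]
        if name in ("train.txt", "classDescr.txt")
    ]
```

For example, `training_files(2)` lists both train.txt and classDescr.txt, while `training_files(1)` lists only train.txt.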

OK

Dear admin,

I guess «translation» should not be taken literally here -- there is no language translation involved, right?

Also, in task 3 and task 4, there is no way to differentiate between the document *content* and the document *description* in the train.txt (and validation.txt and test.txt for T4) files provided for the challenge, correct?

Thanx.

Correct!

Correct!