Overview

This is a dataset for multi-task learning with a large number (957) of tasks, first used in [1]. It was created by selecting a subset of products from Julian McAuley's Amazon product dataset [2], for which there are at least 300 positive reviews (with scores 4 or 5) and at least 300 negative reviews (with scores 1 or 2). Each of the resulting 957 products is treated as a binary classification task of predicting whether a review is positive or negative.

Data Format

Each file corresponds to one task, i.e. one product, and its name is the corresponding ProductID in the original dataset. Within each file, every line corresponds to one review. The first 25 numbers form a 25-dimensional feature vector of the original text review, obtained by pre-processing (removing all non-alphabetical characters, converting the remaining text to lower case and removing stop words) and then applying the sentence embedding procedure of [3] with 25-dimensional GloVe word embeddings [4]. The last value, 0 or 1, is the label, negative or positive.
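A minimal loading sketch for one task file, assuming the values on each line are whitespace-separated (the exact delimiter is not specified above; adjust if the files use commas):

```python
import numpy as np

def load_task(path):
    """Load one task file: each line holds 25 feature values followed by a 0/1 label.

    Assumes whitespace-separated values; pass delimiter="," to np.loadtxt
    instead if the files turn out to be comma-separated.
    """
    data = np.loadtxt(path)
    X = data[:, :25]              # 25-dim sentence embeddings
    y = data[:, 25].astype(int)   # 0 = negative, 1 = positive
    return X, y
```

The function name and the whitespace assumption are illustrative, not part of the dataset specification.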

Experimental protocol

We suggest the following experimental protocol (as in [1]): from each task, select 500 examples as training data and use all remaining examples as test data. The evaluation measure is average per-task accuracy. Methods that use only partially labeled data should treat a subset of the 500 training examples, e.g. 400 as in [1], as labeled. Any validation data should be taken from the training part, leaving the test data unchanged.
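The protocol above can be sketched as follows. This is only an illustration of the split and the evaluation measure; the nearest-centroid classifier is a hypothetical stand-in, not the method of [1]:

```python
import numpy as np

def evaluate(tasks, train_size=500, seed=0):
    """Average per-task accuracy under the suggested protocol:
    per task, `train_size` random examples for training, the rest for testing.

    `tasks` is a list of (X, y) pairs with X of shape (n, 25) and y in {0, 1}.
    The classifier here is a simple nearest-centroid stand-in for illustration.
    """
    rng = np.random.default_rng(seed)
    accs = []
    for X, y in tasks:
        idx = rng.permutation(len(y))
        tr, te = idx[:train_size], idx[train_size:]
        # Nearest-centroid classifier fitted on the training split only.
        c0 = X[tr][y[tr] == 0].mean(axis=0)
        c1 = X[tr][y[tr] == 1].mean(axis=0)
        pred = (np.linalg.norm(X[te] - c1, axis=1)
                < np.linalg.norm(X[te] - c0, axis=1)).astype(int)
        accs.append((pred == y[te]).mean())
    # Evaluation measure: accuracy averaged over tasks.
    return float(np.mean(accs))
```

Validation data, if needed, would be carved out of the `tr` indices, keeping `te` untouched.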

Downloads

  • version 1.0, June 7 2017: MDPR.zip (749 MB)

References

If you use or adapt this data, please cite the following paper:
  • [1] A. Pentina and C. H. Lampert. "Multi-task Learning with Labeled and Unlabeled Tasks", ICML 2017
Reference for the original review data:
  • [2] J. J. McAuley, C. Targett, Q. Shi and A. van den Hengel. "Image-based recommendations on styles and substitutes", SIGIR 2015
Reference for sentence embeddings:
  • [3] S. Arora, Y. Liang and T. Ma. "A simple but tough-to-beat baseline for sentence embeddings", ICLR 2017
Reference for word embeddings:
  • [4] J. Pennington, R. Socher and C. D. Manning. "GloVe: Global vectors for word representation", EMNLP 2014