In each row of the included datasets (train.csv and test.csv), products X (description_x) and Y (description_y) are considered to refer to the same security (same_security) if they have the same ticker (ticker_x, ticker_y), even if the descriptions don't match exactly. You can use these descriptions to predict whether each pair in the test set also refers to the same security.

Train - description_x, description_y, ticker_x, ticker_y, same_security.
Test - description_x, description_y, same_security (to be predicted).
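To make the labeling rule concrete, here is a minimal sketch that derives the label from the two ticker columns and compares it with the stored same_security value. The sample rows are made up for illustration and only mimic the train.csv layout. The case-insensitive comparison, and the assumption that same_security is stored as True/False text, are mine and not something the dataset guarantees.

```python
import csv
import io

# Hypothetical rows in the train.csv layout; the real file is not reproduced here.
SAMPLE = """description_x,description_y,ticker_x,ticker_y,same_security
Apple Inc,APPLE INC COM,AAPL,AAPL,True
Apple Inc,Microsoft Corp,AAPL,MSFT,False
"""


def label_from_tickers(row):
    """Return True when both tickers match.

    Ignoring case and surrounding whitespace is an assumption, not a
    documented property of the dataset.
    """
    return row["ticker_x"].strip().upper() == row["ticker_y"].strip().upper()


rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Label implied by the tickers versus the label stored in the file.
derived = [label_from_tickers(r) for r in rows]
stored = [r["same_security"].strip().lower() == "true" for r in rows]
mismatches = sum(d != s for d, s in zip(derived, stored))
```

To run this on the real training file, swap io.StringIO(SAMPLE) for an open handle to train.csv. A non-zero mismatches count would suggest the rule needs a closer look.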

This dataset is pretty similar to the Quora Question Pairs dataset. You can also check out my kernel for dataset exploration and n-gram analysis, "N-gram analysis on stock data".

There are several good ways to approach this. Check out TF-IDF and see how far you can go with it: https://en.wikipedia.org/wiki/Tf–idf http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. You can also try n-gram analysis (check out my kernel). I would suggest using log-loss as your evaluation metric, since it scores predicted probabilities between 0 and 1 rather than hard binary labels, which are not so informative in this case. Note that the log-loss value itself is not capped at 1: it is 0 for perfect predictions and grows without bound as confident predictions turn out wrong.
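As a rough sketch of the TF-IDF idea, the standard-library code below builds character-trigram TF-IDF vectors for both descriptions in a pair, takes their cosine similarity, and scores the result with log-loss. The example pairs are made up, and the helper names (char_ngrams, fit_idf, tfidf_vector) are hypothetical. The smoothed IDF formula is my choice and is not claimed to reproduce TfidfVectorizer exactly. Using the raw cosine similarity as a probability is also an assumption: it is not calibrated, so a real model would fit a classifier on top of it.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercase, collapse whitespace, and split into overlapping character n-grams."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


def fit_idf(documents, n=3):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    df = Counter()
    for doc in documents:
        df.update(set(char_ngrams(doc, n)))
    total = len(documents)
    return {g: math.log((1 + total) / (1 + c)) + 1.0 for g, c in df.items()}


def tfidf_vector(text, idf, n=3):
    """L2-normalised TF-IDF weights as a sparse dict; unseen n-grams are dropped."""
    counts = Counter(char_ngrams(text, n))
    vec = {g: tf * idf[g] for g, tf in counts.items() if g in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two L2-normalised sparse vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Made-up pairs for illustration: (description_x, description_y, same_security).
pairs = [
    ("Apple Inc", "APPLE INC COM", 1),
    ("Vanguard 500 Index Fund", "Vanguard 500 Idx Fd", 1),
    ("Apple Inc", "Microsoft Corp", 0),
]

idf = fit_idf([d for x, y, _ in pairs for d in (x, y)])
scores = [cosine(tfidf_vector(x, idf), tfidf_vector(y, idf)) for x, y, _ in pairs]
labels = [label for _, _, label in pairs]

# Assumption: the uncalibrated similarity is used directly as P(same_security).
loss = log_loss(labels, scores)
```

Character n-grams are used here instead of whole words because security descriptions are full of abbreviations ("Idx" for "Index", "Fd" for "Fund") that word-level matching would miss.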


Quovo stock data.

