Text Similarity

Context:

Natural Language Processing (NLP), Text Similarity (lexical and semantic)

Content:

In each row of the included datasets (train.csv and test.csv), products X (description_x) and Y (description_y) are considered to refer to the same security (same_security) if they have the same ticker (ticker_x, ticker_y), even if the descriptions don't exactly match. You can use these descriptions to predict whether each pair in the test set also refers to the same security.

Dataset info:

Train: description_x, description_y, ticker_x, ticker_y, same_security

Test: description_x, description_y, same_security (to be predicted)
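
As a quick illustration of the layout above, the following sketch reads rows in the train.csv column order with Python's csv module and checks that same_security agrees with the ticker comparison. The two sample rows, their tickers, and the True/False spelling of the label are invented for illustration and are assumptions, not values taken from the real file.

```python
import csv
import io

# Hypothetical rows in the train.csv column layout (not real data).
# To read the real file, replace io.StringIO(SAMPLE) with
# open("train.csv", newline="").
SAMPLE = """description_x,description_y,ticker_x,ticker_y,same_security
Example Growth Fund A,Example Growth Fd Cl A,EXGA,EXGA,True
Example Growth Fund A,Sample Bond Index Fund,EXGA,SMBI,False
"""


def label_matches_tickers(row):
    """Return True when same_security agrees with ticker_x == ticker_y."""
    same_ticker = row["ticker_x"] == row["ticker_y"]
    # Assumes the label is stored as the text "True" or "False".
    labelled_same = row["same_security"].strip().lower() == "true"
    return same_ticker == labelled_same


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
consistent = [label_matches_tickers(r) for r in rows]
print(consistent)
```

If any entry in the resulting list is False on the real training data, the label encoding differs from the assumption made here and the parsing line should be adjusted.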

Past Research:

This dataset is quite similar to Quora Question Pairs. You can also check out my kernel, "N-gram analysis on stock data", for dataset exploration and n-gram analysis.

How to Approach:

There are several good ways to approach this. Check out the tf-idf algorithm and see how far you can go with it: https://en.wikipedia.org/wiki/Tf–idf and http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. You can also try n-gram analysis (check out my kernel). I would suggest using log loss as your evaluation metric, since it scores predicted probabilities (numbers between 0 and 1) rather than hard binary predictions, which are less informative in this case.
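
A minimal sketch of the tf-idf idea, using only the Python standard library so it runs without extra packages. It weights each word by a smoothed inverse document frequency, compares the two descriptions of a pair by cosine similarity, and scores the result with log loss. The description pairs are invented, the whitespace tokenizer is deliberately naive, and treating the raw cosine score as a probability is an assumption made for illustration; in practice you would likely use scikit-learn's TfidfVectorizer and feed the similarity into a trained classifier.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def fit_idf(documents):
    """Compute smoothed inverse document frequencies over all descriptions."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so every seen term has weight > 0.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Map a description to a sparse {term: term_count * idf} dictionary."""
    counts = Counter(tokenize(text))
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Hypothetical (description_x, description_y, same_security) triples.
pairs = [
    ("example growth fund class a", "example growth fd cl a", 1),
    ("example growth fund class a", "sample bond index fund", 0),
]

corpus = [d for x, y, _ in pairs for d in (x, y)]
idf = fit_idf(corpus)
scores = [cosine(tfidf_vector(x, idf), tfidf_vector(y, idf)) for x, y, _ in pairs]
labels = [label for _, _, label in pairs]

# Assumption: the cosine score is used directly as P(same_security).
loss = log_loss(labels, scores)
print(scores, loss)
```

Note that log loss is non-negative and has no upper bound; lower is better, and a confident wrong prediction is penalized heavily. Abbreviations such as "fd" for "fund" share no word tokens, which is one reason character n-grams may work better than whole words on this data.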

Acknowledgements:

Quovo stock data.

Data and Resources

Additional Info

Field
Source https://www.kaggle.com/rishisankineni/text-similarity
Author Rishi Sankineni
Last Updated May 2, 2021, 06:42 (UTC)
Created May 2, 2021, 06:42 (UTC)
kaggle_id 1008
kaggle_lastUpdated 2017-03-19T08:02:56.03Z
kaggle_ref rishisankineni/text-similarity