Datasets

Review Datasets

  • Yelp Dataset: 4.1M reviews and 947K tips by 1M users for 144K businesses [Download (1.8gb)]
  • SAR14 Dataset: An independent score-associated dataset of 233,600 movie reviews. [Download (120mb)]
  • Amazon Books Reviews: 12,886,488 reviews [Download (4.4gb)]
  • Amazon Music Reviews: 6,396,350 reviews [Download (2.1gb)]
  • Amazon Movie & TV Reviews: 7,850,072 reviews [Download (2.8gb)]

Sentiment Datasets

  • Large Movie Review Dataset (IMDB Review Dataset): 25,000 highly polar movie reviews for training, and 25,000 for testing [Download (80mb)]
  • Multi-Domain Sentiment Dataset [Download (30mb)]
  • Twitter Sentiment Corpus: 5,513 hand-classified tweets [Download (150kb)]

Word Embeddings

  • FastText Turkish Word Embeddings from Wikipedia: 300 dimensions [Download (3.4gb)]
  • FastText English Word Embeddings from Wikipedia: 300 dimensions [Download (9.6gb)]
  • GloVe Word Embeddings:
  • Google News Vector: (300d) [Download (1.5gb)]
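
The embedding files above are large, so it can help to read only the first few thousand vectors while experimenting. The following is a minimal sketch, assuming the FastText downloads unpack to the plain-text `.vec` layout (a header line with the vocabulary size and dimension, then one word and its values per line). The function name `load_text_vectors` and the sample words are hypothetical, and the sketch does not handle binary formats, which is how the Google News vectors are typically distributed.

```python
import io


def load_text_vectors(handle, limit=None):
    """Parse word vectors from a text stream in the .vec layout.

    Assumed layout: first line is "<vocab_size> <dim>", and every
    following line is a word followed by <dim> space-separated floats.
    """
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the vocabulary size
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break  # stop early so huge files need not be read in full
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines instead of failing
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Tiny in-memory stand-in for a real file (hypothetical values).
sample = io.StringIO("2 3\nkitap 0.1 0.2 0.3\nfilm 0.4 0.5 0.6\n")
dim, vectors = load_text_vectors(sample)
```

For a real file, the same function could be called as `load_text_vectors(open(path, encoding="utf-8"), limit=50000)`, where `path` is wherever the archive was extracted.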

Question & Answer Datasets

  • TREC-QA 2013 [Download (9mb)]

Misc Datasets

  • Books unlabeled data [Download (2.7mb)]
  • Yahoo! Answers dataset: 189,467 question and answer pairs from 20 top-level categories from the Yahoo! Answers website; 10,000 question/answer pairs per category [Download (127mb)]
  • Titles for all products on Amazon [Download (34mb)]
  • 20 NewsGroup Dataset [Download (17mb)]
  • Quora Question Pairs Dataset: over 400,000 potential duplicate question pairs [Download (55mb)]
  • Cornell Movie--Dialogs Corpus: 220,579 conversational exchanges between 10,292 pairs of movie characters [Download (1mb)]