Webis ChangeMyView Corpus 2020 (Webis-CMV-20)
The Webis-CMV-20 dataset comprises all available posts and comments in the ChangeMyView (CMV) subreddit from the foundation of the subreddit in 2005 until September 2017. From these, we have derived two sub-datasets for the tasks of persuasiveness prediction and opinion malleability prediction. In addition, the corpus comprises historical posts by CMV authors and personal characteristics derived from them.
All files are in bzip2-compressed JSON Lines format.
- threads.jsonl: contains all the selected discussion threads from CMV
- pairs.jsonl: each record contains a submission, a delta-comment, a nondelta-comment, and the similarity score of the two comments
- posts-malleability.jsonl: contains posts for opinion malleability prediction, in the format provided in the original Reddit Crawl dataset
- author_entity_category.jsonl: each record contains the author and a list of the Wikipedia entities mentioned by the author in messages across all subreddits. For each mentioned entity, we provide the following data:
[title, wikidata_id, wikipedia_page_id, mentioned_entity_title, wikifier_score, subreddit_name, subreddit_id, subreddit_category_name, subreddit_topcategory_name]
- author_liwc.jsonl: personality trait features computed with LIWC for the authors in the pairs.jsonl and posts-malleability.jsonl datasets
- author_subreddit.jsonl: for each author, statistics on the number of posts (submissions/comments) across all subreddits
- author_subreddit_category.jsonl: similar to author_subreddit.jsonl, but the statistics of each author's posts are grouped by top-categories and categories according to snoopsnoo.com
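
Since all files are bzip2-compressed JSON Lines, records can be streamed one at a time without decompressing a whole file to disk first. The following is a minimal Python sketch and not part of the corpus distribution: the helper names iter_records and peek are hypothetical, the on-disk file name (for example threads.jsonl.bz2) is an assumption, and nothing is assumed about the field names inside a record.

```python
import bz2
import json
from itertools import islice


def iter_records(path):
    """Yield one parsed JSON value per non-empty line of a bzip2-compressed JSON Lines file."""
    # bz2.open in text mode decompresses lazily, so the file is never fully loaded into memory.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline at the end of the file
                yield json.loads(line)


def peek(path, n=3):
    """Return the first n records of a file, useful for inspecting its structure."""
    return list(islice(iter_records(path), n))
```

Calling peek on a local copy of one of the files, for instance peek("threads.jsonl.bz2") if that is the name on disk, returns the first few records so their structure can be inspected before the full file is processed.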