RedMed: Extending drug lexicons for social media applications

2019-07-31T12:21:21Z (GMT) by Adam Lavertu, Russ Altman

Data associated with the RedMed project.

Details of how the data were created can be found in the associated paper:

Lavertu, A. & Altman, R. B. "RedMed: Extending drug lexicons for social media applications."
Journal of Biomedical Informatics (2019)

https://doi.org/10.1016/j.jbi.2019.103307

RedMed embedding model:

Word vectors trained on comments from health-related subreddits and optimized for drug synonym retrieval.

The RedMed model was trained using only social media data from Reddit and achieves comparable performance on the UMNSRS and MayoSRS similarity tasks. Vectors are 64-dimensional.

redmed_model_vectors.tsv.gz - Tab-separated word vectors (token\tdim1\tdim2\t...\tdim64)

redmed_model.bin - Binary word2vec file saved using gensim; can be loaded in Python with gensim
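
As a rough illustration of how the two model files might be read, the sketch below parses the gzipped TSV with the Python standard library and ranks tokens by cosine similarity, which is one simple way to retrieve candidate synonyms. The function names are hypothetical and not part of the RedMed release. The TSV reader assumes one token per row followed by 64 values and skips rows that do not match. The gensim loader assumes the .bin file is in the standard word2vec binary format; if it was saved in gensim's native format, `KeyedVectors.load` may be needed instead.

```python
import gzip
import math


def load_tsv_vectors(path, expected_dim=64):
    """Read token<TAB>dim1<TAB>...<TAB>dimN rows from a gzipped TSV file."""
    vectors = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            token, values = fields[0], fields[1:]
            if len(values) != expected_dim:
                # Assumption: rows of the wrong width (for example a header
                # line) are not vectors, so they are skipped.
                continue
            try:
                vectors[token] = [float(v) for v in values]
            except ValueError:
                # Skip rows whose values are not numeric.
                continue
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_tokens(vectors, query, top_n=10):
    """Rank all other tokens by cosine similarity to `query`."""
    target = vectors[query]
    scored = [
        (token, cosine(target, vec))
        for token, vec in vectors.items()
        if token != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


def load_bin_with_gensim(path):
    """Load the binary file, assuming the standard word2vec binary format."""
    # gensim is a third-party package; it is imported lazily so the TSV
    # helpers above work without it.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True)
```

With the files downloaded, `load_tsv_vectors("redmed_model_vectors.tsv.gz")` would return a dictionary of token to vector, and `nearest_tokens(vectors, token)` would list the closest tokens for any token present in the vocabulary. The pure-Python search is linear in vocabulary size, so for repeated queries gensim's `most_similar` on the loaded `KeyedVectors` is likely the more practical route.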

Other Files:

supp_file_1_sidebar_subreddits.txt - List of health-related subreddits based on "r/Health" and "r/Drugs" sidebars
supp_file_2_enrichment_based_subreddits.txt - List of health-related subreddits based on amount of health-related content
supp_file_3_custom_stopword_list.txt - List of stopwords based on counts derived from Reddit comments