Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction and Disambiguation
This dataset contains the models for interpretable Word Sense Disambiguation (WSD) that were used in Panchenko et al. (2017). The paper can be accessed at https://www.lt.informatik.tu-darmstadt.de/fileadmin/user_upload/Group_LangTech/publications/EACL_Interpretability___FINAL__1_.pdf
The files were computed from a 2015 dump of the English Wikipedia. Their contents:
- Induced Sense Inventories: wp_stanford_sense_inventories.tar.gz
This file contains 3 inventories (coarse, medium, fine)
- Language Model (3-gram): wiki_text.3.arpa.gz
This file contains all n-grams up to n=3 and can be loaded into an index
- Weighted Dependency Features: wp_stanford_lemma_LMI_s0.0_w2_f2_wf2_wpfmax1000_wpfmin2_p1000.gz
This file contains weighted combinations of words and context features, each with its count and an LMI significance score
- Distributional Thesaurus (DT) of Dependency Features: wp_stanford_lemma_BIM_LMI_s0.0_w2_f2_wf2_wpfmax1000_wpfmin2_p1000_simsortlimit200_feature expansion.gz
This file contains a DT of context features. The context feature similarities can be used for context expansion
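As an illustration of loading the language model into an index, the following is a minimal sketch in Python. It assumes, based only on the .arpa extension, that wiki_text.3.arpa.gz follows the standard ARPA n-gram layout (a \data\ header, one \N-grams: section per order, entries of the form "log10-probability, n-gram, optional backoff weight", and a closing \end\ marker). The function names parse_arpa and load_arpa are hypothetical and are not part of the released tools.

```python
import gzip
from typing import Dict, Iterable, Iterator, Optional, Tuple

Entry = Tuple[int, str, float, Optional[float]]


def parse_arpa(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield (order, ngram, log10_prob, backoff) from ARPA-format lines.

    Assumes the standard ARPA layout; this is not verified against the
    released file.
    """
    order = 0  # 0 means we are still in the header
    for raw in lines:
        line = raw.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue  # blank lines and header counts
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            # Section marker such as "\2-grams:" sets the current order.
            order = int(line[1:line.index("-")])
            continue
        if order == 0:
            continue  # ignore anything before the first section
        parts = line.split()
        log_prob = float(parts[0])
        ngram = " ".join(parts[1:1 + order])
        # The backoff weight is optional (usually absent for the highest order).
        backoff = float(parts[1 + order]) if len(parts) > 1 + order else None
        yield order, ngram, log_prob, backoff


def load_arpa(
    path: str, max_entries: Optional[int] = None
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Build an in-memory index: n-gram -> (log10_prob, backoff).

    max_entries caps memory use; a full Wikipedia model may not fit in RAM.
    """
    index: Dict[str, Tuple[float, Optional[float]]] = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for i, (_, ngram, log_prob, backoff) in enumerate(parse_arpa(handle)):
            if max_entries is not None and i >= max_entries:
                break
            index[ngram] = (log_prob, backoff)
    return index
```

A call such as load_arpa("wiki_text.3.arpa.gz", max_entries=100000) would read only the first entries, which is a cheap way to check whether the file matches the assumed layout before loading more. For the complete model, a disk-backed index is likely a better fit than a Python dictionary.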
For further information, consult the paper and the companion page: http://jobimtext.org/wsd/
Panchenko A., Ruppert E., Faralli S., Ponzetto S. P., and Biemann C. (2017): Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction and Disambiguation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL'2017). Valencia, Spain. Association for Computational Linguistics.