Data associated with "A collaborative filtering based approach to biomedical knowledge discovery"

2018-04-24T04:49:32Z (GMT) by Lever, Jake

This is the data set associated with the publication: "A collaborative filtering based approach to biomedical knowledge discovery" published in Bioinformatics.

The data are sets of cooccurrences of biomedical terms extracted from published abstracts and full text articles. The cooccurrences are then represented in sparse matrix form. There are three different splits of this data denoted by the prefix number on the files.

1. All - All cooccurrences combined in a single file

2. Training/Validation - All cooccurrences in publications before 2010 go in training; all novel cooccurrences in publications from 2010 go in validation

3. Training+Validation/Test - All cooccurrences in publications up to and including 2010 go in training+validation. All novel cooccurrences after 2010 go in test, provided both in year-by-year increments and combined together
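The idea of a "novel" cooccurrence in the splits above can be illustrated with a minimal sketch. It is not the code used to build these files: the record layout (term, term, publication year), the function name split_by_year, and the CUIDs are all hypothetical placeholders, and a pair is assumed to be novel when it does not appear in any earlier publication.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[str]


def split_by_year(
    records: Iterable[Tuple[str, str, int]], cutoff_year: int
) -> Tuple[Set[Pair], Set[Pair]]:
    """Split cooccurrences into (earlier, novel) sets around cutoff_year.

    Hypothetical helper: `earlier` holds pairs seen before cutoff_year,
    `novel` holds pairs seen in cutoff_year that are absent from `earlier`.
    """
    earlier: Set[Pair] = set()
    in_cutoff_year: Set[Pair] = set()
    for term_a, term_b, year in records:
        # frozenset makes the pair unordered, matching a symmetric matrix
        pair = frozenset((term_a, term_b))
        if year < cutoff_year:
            earlier.add(pair)
        elif year == cutoff_year:
            in_cutoff_year.add(pair)
    return earlier, in_cutoff_year - earlier


# Made-up records for illustration only; these are placeholder identifiers.
records = [
    ("C0000001", "C0000002", 2008),
    ("C0000002", "C0000001", 2010),  # already seen before 2010, so not novel
    ("C0000001", "C0000003", 2010),  # first seen in 2010, so novel
    ("C0000002", "C0000003", 2012),  # after the cutoff year, ignored here
]
training, validation = split_by_year(records, 2010)
```

The same set difference, applied with training+validation as the earlier set, would give the test cooccurrences for later years under the same assumption.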


Furthermore, there are subset files, which are used in some experiments to reduce the computational cost of evaluating the full set. The associated cuids.txt file links each row/column of the matrix to a UMLS Metathesaurus CUID: the first row of cuids.txt corresponds to the 0th row/column of the matrix. Note that the matrix is square and symmetric. This work was done with UMLS Metathesaurus 2016AB.
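The zero-based correspondence between cuids.txt and the matrix can be sketched as follows. This is an illustration under stated assumptions rather than a loader for the actual files: it assumes cuids.txt holds one CUID per line, uses an in-memory stand-in with placeholder CUIDs instead of the real file, and stores made-up values in a plain dictionary because the on-disk matrix format is not described here. The helper names load_cuids, set_value and get_value are hypothetical.

```python
import io
from typing import Dict, List, TextIO, Tuple


def load_cuids(handle: TextIO) -> List[str]:
    """Read one CUID per line; list position i is matrix row/column i."""
    # Every line is kept (none skipped) so that indices stay aligned.
    return [line.rstrip("\r\n") for line in handle]


# In-memory stand-in for cuids.txt with placeholder identifiers.
sample_file = io.StringIO("C0000001\nC0000002\nC0000003\n")
cuids = load_cuids(sample_file)
index_of: Dict[str, int] = {cuid: i for i, cuid in enumerate(cuids)}

# Because the matrix is symmetric, each unordered pair is stored once,
# keyed as (smaller index, larger index).
matrix: Dict[Tuple[int, int], int] = {}


def set_value(i: int, j: int, value: int) -> None:
    matrix[(min(i, j), max(i, j))] = value


def get_value(i: int, j: int) -> int:
    # Missing entries are treated as zero, as in a sparse matrix.
    return matrix.get((min(i, j), max(i, j)), 0)


# Made-up value for illustration only.
set_value(index_of["C0000003"], index_of["C0000001"], 5)
```

With this layout, get_value(0, 2) and get_value(2, 0) return the same entry, and cuids[0] names the term for row/column 0.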