A Machine Learning Approach for Data Source and Type Identification to Support Metadata Discovery
Current approaches to metadata discovery depend on manual curation, which is time-consuming. Automatic or semi-automatic metadata discovery methods are therefore critical to realizing the full potential of Big Data technologies in biomedicine, enhancing research reproducibility, and increasing efficiency in the translational biomedical sciences. We are developing a two-step metadata discovery workflow: (1) identification of data source and type from the intrinsic document structure, and (2) discovery of detailed metadata within the data by applying metadata discovery tools selected according to the source and type ascertained in (1). Here we discuss our initial results for (1) using machine learning.
In this work we included biomedical data from various sources and in different file formats: (1) human genetic variants from ClinVar (XML, tab-delimited), (2) protein structures (PDB, mmCIF, XML), (3) biomedical literature, and (4) a general English corpus. We tokenized the data files using the Natural Language Toolkit and considered several document structural features: the normalized count of numerical tokens, negative numerical tokens, the number of words, the number of capitalized words, the number of all-uppercase words, and the median token length. We developed a decision tree model on these structural features to classify data sources and types. Using scikit-learn, we evaluated the model with 10-fold cross-validation and on a test set, reporting precision, recall, and F1 score. The model distinguished protein structure, genetic variant, scientific paper, and general English files with average F1 scores of 0.997, 0.997, 0.886, and 0.919, respectively, under cross-validation, and 1, 0.999, 0.980, and 0.935 on independent test sets. Our approach shows that data sources and types can be identified automatically using only document structural features, and that it is therefore reasonable, as a next step, to programmatically associate metadata extraction tools specific to each data source and type.
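
The structural features listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the regular-expression tokenizer is a simplified stand-in for the Natural Language Toolkit, the function and feature names are hypothetical, and dividing every count by the total number of tokens is an assumption, since the normalization scheme is not specified here.

```python
import re
from statistics import median

# Simplified stand-in for an NLTK tokenizer (assumption): signed numbers,
# alphabetic runs, and single punctuation characters each become one token.
TOKEN_RE = re.compile(r"-?\d+(?:\.\d+)?|[A-Za-z]+|[^\sA-Za-z0-9]")
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")

# Hypothetical feature names, chosen for this illustration only.
FEATURE_NAMES = (
    "numeric",
    "negative_numeric",
    "words",
    "capitalized",
    "all_upper",
    "median_token_length",
)


def structural_features(text):
    """Return structural features of a document as a dict of floats."""
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        # An empty document yields an all-zero feature vector.
        return dict.fromkeys(FEATURE_NAMES, 0.0)

    total = len(tokens)
    numbers = [t for t in tokens if NUMBER_RE.fullmatch(t)]
    words = [t for t in tokens if t.isalpha()]

    return {
        # Counts are divided by the total token count (assumed normalization).
        "numeric": len(numbers) / total,
        "negative_numeric": sum(t.startswith("-") for t in numbers) / total,
        "words": len(words) / total,
        # Title-case words such as "Hello"; single capitals also count here.
        "capitalized": sum(w.istitle() for w in words) / total,
        # Words of two or more letters written entirely in upper case.
        "all_upper": sum(w.isupper() and len(w) > 1 for w in words) / total,
        "median_token_length": float(median(len(t) for t in tokens)),
    }


if __name__ == "__main__":
    # A PDB-style coordinate line versus an ordinary English sentence.
    print(structural_features("ATOM 1 N MET A 1 -8.901 4.127 -0.555"))
    print(structural_features("Hello World, this is NASA."))
```

Feature vectors of this kind, one per file, could then be passed to scikit-learn's `DecisionTreeClassifier` and scored with `cross_val_score` using `cv=10`, which would mirror the evaluation described above under the same assumptions.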