Datasets for practical model selection for prospective virtual screening

This repository contains datasets for the manuscript "Practical model selection for prospective virtual screening":

  • pria_rmi_cv.tar.gz: A compressed directory containing chemical screening data for the PriA-SSB AS, PriA-SSB FP, and RMI-FANCM FP binary datasets.  The files also contain the associated continuous % inhibition values and chemical features represented as SMILES and ECFP4 fingerprints.  The dataset has been split into five folds for cross validation.
  • pria_rmi_pcba_cv.tar.gz: A compressed directory containing chemical screening data for the PriA-SSB AS, PriA-SSB FP, and RMI-FANCM FP binary datasets as well as public PubChem BioAssay datasets.  The files also contain the PriA-SSB and RMI-FANCM continuous % inhibition values and chemical features represented as SMILES and ECFP4 fingerprints.  The dataset has been split into five folds for cross validation.  Missing values are left blank.
  • pria_prospective.csv.gz: A compressed file containing chemical screening data for the binary dataset PriA-SSB prospective.  The file also contains the continuous % inhibition values and chemical features represented as SMILES and ECFP4 fingerprints.
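
The files listed above can be inspected without fully extracting them.  The following is a minimal sketch using only the Python standard library.  The helper names list_archive_members and peek_gzipped_csv are hypothetical and not part of this repository, and the sketch makes no assumption about the column names or the directory layout inside the archives, which are not described here.

```python
import csv
import gzip
import itertools
import tarfile


def list_archive_members(archive_path):
    """Return the names of all entries in a .tar.gz archive without extracting it."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()


def peek_gzipped_csv(csv_path, n_rows=5):
    """Return the header row and up to n_rows data rows of a gzipped CSV file."""
    # newline="" lets the csv module handle line endings itself.
    with gzip.open(csv_path, mode="rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(itertools.islice(reader, n_rows))
    return header, rows
```

For example, list_archive_members("pria_rmi_cv.tar.gz") would list the fold files in the cross validation archive, and peek_gzipped_csv("pria_prospective.csv.gz") would show the column names of the prospective dataset, assuming both files have been downloaded to the working directory.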

PubChem data were provided by the PubChem database.  Follow the PubChem citation guidelines if you use the PubChem data.