KNIME workflow to calculate Tanimoto similarities based upon molecular fingerprints

2018-08-21T00:00:00Z (GMT) by Mellor, Claire

[Explanatory text prepared by Claire Mellor and Richard Marchese Robinson]

This is a workflow designed to take in a dataset of molecules in SDF format and return a matrix with pairwise Tanimoto similarities for all molecules calculated according to a variety of molecular fingerprints, e.g. MACCS and ECFP4. [N.B. Descriptors are calculated at an intermediate stage of the workflow, but are not incorporated into the output.]
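
For readers unfamiliar with the measure, the following is a minimal Python sketch of the Tanimoto similarity for binary fingerprints: the number of bits set in both fingerprints divided by the number of bits set in either. It is not part of the workflow and is not claimed to match the internals of the KNIME nodes. The function names, the representation of a fingerprint as a set of on-bit positions, and the handling of two empty fingerprints are all assumptions made for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two binary fingerprints.

    Each fingerprint is given as a collection of "on" bit positions
    (an assumed representation, chosen for illustration).
    """
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Assumption: two empty fingerprints are scored 0.0; conventions vary.
        return 0.0
    return len(a & b) / union


def pairwise_tanimoto(fingerprints):
    """Square matrix with entry [i][j] = similarity of molecules i and j."""
    return [[tanimoto(x, y) for y in fingerprints] for x in fingerprints]


if __name__ == "__main__":
    # Three toy fingerprints, not derived from real molecules.
    toy = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
    for row in pairwise_tanimoto(toy):
        print(row)
```

The resulting matrix is symmetric, with 1.0 on the diagonal for any non-empty fingerprint.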

Please note the following:

1. The full paths (i.e. including the complete directory names, which must already exist) of the input file (set in the SDF Reader node) and the output file (set in the CSV Writer node) must be edited before this workflow is applied to a new dataset.

2. The output file (e.g. tanimoto.csv) has the following structure:

Title row = row ID, [FINGERPRINT NAME]_Arr[x], ....

Here, "row ID" refers to the identifier of the relevant molecule in the input file, i.e. "RowY", where "Y" denotes the number of the molecule in the input dataset. For example, "Row0" denotes the first molecule and "Row[N - 1]" denotes the last molecule, assuming a dataset with N molecules. Similarly, "x" also denotes the identifier of the relevant molecule in the input file, such that the table entry corresponding to "row ID" = "RowY" and "[FINGERPRINT NAME]_Arr[x]" presents the Tanimoto similarity between molecule "Y" and molecule "x" in the dataset, computed in terms of the fingerprint "[FINGERPRINT NAME]".

Please note that, whilst the first row below the title row corresponds to the first molecule ("row ID" = "Row0"), the similarity values for pairwise comparisons with this first molecule appear only in the final column for the respective fingerprint.

To obtain a pairwise similarity matrix for a given fingerprint, where the entry in row I and column J corresponds to the similarity between molecule I and molecule J, the last column for that fingerprint must be moved in front of its other columns and the columns corresponding to all other fingerprints must be deleted.
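
The rearrangement described above could be scripted outside KNIME. The following is a minimal Python sketch using only the standard library. It assumes that the first column holds the row ID and that every column for a fingerprint has a name beginning with the fingerprint name followed by "_Arr". The function names, the file name "tanimoto.csv" in the usage comment, and the fingerprint label "MACCS" are illustrative assumptions, and the column names should be checked against the actual title row before use.

```python
import csv


def read_table(path):
    """Read the CSV output into a title row and a list of data rows."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        return header, list(reader)


def similarity_matrix(header, rows, fingerprint):
    """Return (row_ids, matrix) for a single fingerprint.

    Keeps only the columns whose names start with "<fingerprint>_Arr"
    (an assumed naming pattern) and moves the last of them to the front,
    so that column J lines up with molecule J.
    """
    prefix = fingerprint + "_Arr"
    # Positions of this fingerprint's columns, in file order.
    cols = [i for i, name in enumerate(header) if name.startswith(prefix)]
    if not cols:
        raise ValueError("no columns found for fingerprint " + repr(fingerprint))
    # Move the final column (comparisons with the first molecule) to the front.
    cols = [cols[-1]] + cols[:-1]
    row_ids = [row[0] for row in rows]
    matrix = [[float(row[i]) for i in cols] for row in rows]
    return row_ids, matrix


# Hypothetical usage:
#   header, rows = read_table("tanimoto.csv")
#   ids, matrix = similarity_matrix(header, rows, "MACCS")
```

A quick sanity check on the result is that the matrix should be symmetric with 1.0 on its diagonal. If it is not, the column order assumed here probably does not match the file.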