Fix incorrect usage of KMeans clustering #16

cwalker7 · 2020-10-16T18:35:30Z

I've realized that KMeans in sklearn has no option to take a distance matrix as input to its fitting function, which is how I've been using it. In other words, we have been passing an [n_sample x n_sample] matrix where it expects an [n_sample x n_feature] matrix: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans. To get an [n_sample x n_feature] matrix, it seems we would need to know beforehand what the reference (native) structure should be, with the RMSD to that reference as the single feature, and that wouldn't work for identifying multiple secondary structures.

The reason the KMeans clustering runs at all, and gives an OK solution for medoids, is that it interprets the RMSD to each frame as its own feature. This, I think, is not what we want, so I would avoid using it for RMSD-based clustering to find a native state.

These are the algorithms in sklearn that can take a distance matrix as input:

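AffinityPropagation (precomputed input is treated as a similarity matrix, so distances need converting first)
AgglomerativeClustering (several types of linkage to consider; ward linkage requires Euclidean feature input, so it does not apply to a precomputed matrix)
DBSCAN (requires careful tuning of eps, min_samples parameters)
OPTICS (requires careful tuning of min_samples, xi parameters)
SpectralClustering (precomputed input is treated as an affinity matrix, so distances need converting first)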
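
As a minimal sketch of what passing a distance matrix looks like, here is DBSCAN with metric="precomputed". The toy 1D data, eps, and min_samples values are made up for illustration and stand in for a real RMSD matrix; they are not tuned for our systems.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Toy stand-in for frames: two tight groups of 1D "conformations" (hypothetical data)
points = np.concatenate([rng.normal(0.0, 0.05, 20), rng.normal(5.0, 0.05, 20)])

# Symmetric [n_sample x n_sample] distance matrix, playing the role of pairwise RMSD
distances = np.abs(points[:, None] - points[None, :])

# metric="precomputed" tells DBSCAN that the input is already a distance matrix,
# not an [n_sample x n_feature] array
labels = DBSCAN(eps=0.5, min_samples=5, metric="precomputed").fit_predict(distances)
```

With KMeans there is no equivalent switch, which is why the same matrix gets read as n_sample features per frame.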

Density-based clustering has been tricky to get working (the parameters seem very system-dependent). I will look into the others.

cwalker7 · 2020-10-16T18:36:34Z

@tlfobe Just want you to be aware of this issue, I know you were planning on using KMeans with the cg_pyrosetta structures.

cwalker7 · 2020-10-21T16:50:49Z

Playing with a bunch of these methods, I found that none are that robust at dealing with the very noisy dataset we have for weakly cooperative transitions (need to finely tune clustering parameters, etc.).

However, if we first filter out data points with few neighbors within a cutoff radius (based on RMSD distances), we can reliably identify high-density regions of conformational space. The filtering criterion can be specified as a percentage, so it is very generally applicable. I got the idea from here: https://link.springer.com/article/10.1007/s10822-013-9689-8, where DBSCAN is used successfully on filtered data in which only the highest-density 1% of the data passes through. I will add this in a PR later today.
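
A rough sketch of the filtering idea, assuming a precomputed RMSD matrix. The function name `filter_by_density` and its parameters are hypothetical and only illustrate neighbor counting within a cutoff; this is not the code that will be in the PR.

```python
import numpy as np

def filter_by_density(distances, cutoff, keep_fraction):
    """Return sorted indices of the frames with the most neighbors within cutoff.

    distances: symmetric [n_sample x n_sample] matrix (e.g. pairwise RMSD)
    cutoff: neighbor radius, assumed > 0
    keep_fraction: fraction of frames to keep, e.g. 0.01 for the densest 1%
    """
    distances = np.asarray(distances)
    # Count neighbors within the cutoff; subtract 1 to exclude the frame itself
    neighbor_counts = (distances < cutoff).sum(axis=1) - 1
    n_keep = max(1, int(round(keep_fraction * len(distances))))
    # Stable sort on descending count so ties resolve by frame index
    order = np.argsort(-neighbor_counts, kind="stable")
    return np.sort(order[:n_keep])

# Toy data: 10 tightly packed points plus 10 widely spread "noise" points
points = np.concatenate([np.linspace(0.0, 0.09, 10), np.linspace(10.0, 100.0, 10)])
distances = np.abs(points[:, None] - points[None, :])

kept = filter_by_density(distances, cutoff=0.5, keep_fraction=0.5)
filtered = distances[np.ix_(kept, kept)]  # sub-matrix to hand to a clustering method
```

The filtered sub-matrix can then go to any of the precomputed-distance methods listed earlier in the thread.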

As for the KMeans issues, we can instead use this formulation of KMedoids: https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html
Though it doesn't seem to give decent results on the unfiltered data.
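
To show why a medoid-based method fits a distance matrix where KMeans does not, here is a plain-NumPy sketch of the alternating k-medoids idea. It is not the scikit-learn-extra implementation; the function `k_medoids` and its arguments are hypothetical and only illustrate that every step needs pairwise distances alone.

```python
import numpy as np

def k_medoids(distances, n_clusters, max_iter=100, seed=0):
    """Alternating k-medoids on a precomputed distance matrix (illustrative only)."""
    distances = np.asarray(distances)
    rng = np.random.default_rng(seed)
    # Start from randomly chosen frames as medoids
    medoids = rng.choice(len(distances), size=n_clusters, replace=False)
    for _ in range(max_iter):
        # Assign each frame to its nearest medoid
        labels = np.argmin(distances[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for k in range(n_clusters):
            members = np.flatnonzero(labels == k)
            if members.size == 0:
                continue  # keep the old medoid if a cluster ends up empty
            # New medoid: the member with the smallest total distance to the others
            within = distances[np.ix_(members, members)].sum(axis=1)
            new_medoids[k] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(distances[:, medoids], axis=1)
    return medoids, labels

# Toy data: two well-separated groups standing in for two conformational states
points = np.concatenate([np.linspace(0.0, 0.2, 20), np.linspace(5.0, 5.2, 20)])
distances = np.abs(points[:, None] - points[None, :])
medoids, labels = k_medoids(distances, n_clusters=2)
```

Each medoid is an actual frame, so it can be read directly as a representative structure, unlike a KMeans centroid.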
