Fix incorrect usage of KMeans clustering #16

Open
cwalker7 opened this issue Oct 16, 2020 · 2 comments

Comments

@cwalker7
Collaborator

I've realized that sklearn's KMeans does not accept a distance matrix in its fitting function, which is how I've been using it. In other words, we have been passing an [n_sample x n_sample] matrix rather than an [n_sample x n_feature] matrix (see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans). To get an [n_sample x n_feature] matrix, it seems we would need to know beforehand what the reference (native) structure is, so that the single feature is the RMSD to that reference, and that approach wouldn't work for identifying multiple secondary structures.

The reason KMeans runs at all, and gives an OK solution for the medoids, is that it interprets the RMSD to each frame as its own feature. I don't think this is what we want, so I would avoid using it for RMSD-based clustering to find a native state.
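
To make the point concrete, here is a minimal sketch of what happens when a square distance matrix is handed to KMeans. The toy 2-D points and the Euclidean distances are assumptions standing in for our frames and pairwise RMSDs; they are not our data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-in for trajectory frames: two tight groups of 2-D points
coords = np.vstack([
    rng.normal(0.0, 0.1, (15, 2)),
    rng.normal(5.0, 0.1, (15, 2)),
])
# Pairwise Euclidean distances stand in for the pairwise RMSD matrix
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# KMeans accepts the square matrix without complaint, but reads each
# column (distance to frame j) as a separate feature
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(dist)

# The centers live in an n_sample-dimensional "feature" space
print(dist.shape, km.cluster_centers_.shape)
```

The cluster centers come back with one coordinate per frame, which shows that the fit treated the rows of the distance matrix as feature vectors rather than as precomputed distances.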

These are the algorithms in sklearn that can take a distance matrix as input:

  • AffinityPropagation
  • AgglomerativeClustering (4 different types of linkage to consider)
  • DBSCAN (requires careful tuning of eps, min_samples parameters)
  • OPTICS (requires careful tuning of min_samples, xi parameters)
  • SpectralClustering

Density-based clustering has been tricky to get working (the parameters seem very system-dependent). I will look into the others.
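
For reference, a minimal sketch of passing a precomputed distance matrix to one of these estimators, using DBSCAN with `metric="precomputed"`. The toy points, `eps=0.5`, and `min_samples=5` are assumptions chosen for the illustration and would need tuning on real RMSD data. The other estimators in the list have their own precomputed options, and as far as I can tell AffinityPropagation and SpectralClustering expect a similarity (affinity) matrix rather than raw distances, so check the docs before reusing this pattern.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Hypothetical stand-in for frames: two tight, well-separated groups
coords = np.vstack([
    rng.normal(0.0, 0.1, (20, 2)),
    rng.normal(5.0, 0.1, (20, 2)),
])
# Square, symmetric matrix standing in for pairwise RMSD
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# metric="precomputed" tells DBSCAN that the input is already a distance matrix
db = DBSCAN(eps=0.5, min_samples=5, metric="precomputed").fit(dist)
labels = db.labels_

# Label -1 marks noise points, so it is excluded from the cluster count
n_clusters = len(set(labels) - {-1})
print(n_clusters)
```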

@cwalker7
Collaborator Author

@tlfobe Just want you to be aware of this issue, since I know you were planning on using KMeans with the cg_pyrosetta structures.

@cwalker7
Collaborator Author

After playing with a bunch of these methods, I found that none of them is very robust on the very noisy dataset we have for weakly cooperative transitions (the clustering parameters need fine tuning, etc.).

However, if we first filter out data points with few neighbors within a cutoff radius (based on RMSD distances), we can reliably identify high-density regions of conformational space. The filtering criterion can be specified as a percentage, so it is very generally applicable. I got the idea from https://link.springer.com/article/10.1007/s10822-013-9689-8, where DBSCAN is used successfully on filtered data in which only the 1% highest-density points pass through. I will add this in a PR later today.
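
A rough sketch of the filtering idea, under my own assumptions: `filter_by_density`, `cutoff`, and `keep_fraction` are hypothetical names, the toy data replaces our RMSD matrix, and the 10% threshold is only there to keep the example small. This is not the code from the paper or from the upcoming PR.

```python
import numpy as np

def filter_by_density(dist, cutoff, keep_fraction):
    """Return sorted indices of the frames with the most neighbors within cutoff."""
    # Count neighbors inside the cutoff radius, excluding the frame itself
    neighbor_counts = (dist < cutoff).sum(axis=1) - 1
    n_keep = max(1, int(round(keep_fraction * dist.shape[0])))
    # Stable sort on negated counts puts the highest-density frames first
    order = np.argsort(-neighbor_counts, kind="stable")
    return np.sort(order[:n_keep])

rng = np.random.default_rng(0)
# Hypothetical data: 50 frames in one dense basin plus 150 sparse background frames
coords = np.vstack([
    rng.normal(0.0, 0.05, (50, 2)),
    rng.uniform(2.0, 10.0, (150, 2)),
])
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

kept = filter_by_density(dist, cutoff=0.2, keep_fraction=0.10)
# Sub-matrix of distances among the surviving frames, ready for clustering
dist_filtered = dist[np.ix_(kept, kept)]
print(len(kept), dist_filtered.shape)
```

The filtered sub-matrix can then be passed to any clustering method that accepts precomputed distances, and the `kept` indices map the resulting labels back to the original frames.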

As for the KMeans issue, we can instead use this formulation of KMedoids: https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html, though it doesn't seem to give decent results on the unfiltered data.
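
To show why k-medoids fits a distance matrix where KMeans does not, here is a small NumPy sketch of the basic alternating k-medoids iteration. It is not the scikit-learn-extra implementation, and `k_medoids` and its arguments are hypothetical names. My understanding is that `sklearn_extra.cluster.KMedoids` can take the matrix directly via `metric="precomputed"`, but treat that as an assumption and check the linked docs.

```python
import numpy as np

def k_medoids(dist, n_clusters, max_iter=100, seed=0):
    """Basic alternating k-medoids on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    # Start from randomly chosen frames as medoids
    medoids = rng.choice(n, size=n_clusters, replace=False)
    for _ in range(max_iter):
        # Assign each frame to its nearest medoid
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for k in range(n_clusters):
            members = np.flatnonzero(labels == k)
            if members.size == 0:
                continue  # keep the old medoid if a cluster ends up empty
            # The medoid is the member with the smallest summed distance to the rest
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[k] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged: medoids did not change
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

rng = np.random.default_rng(1)
# Hypothetical stand-in for frames: two tight, well-separated groups
coords = np.vstack([
    rng.normal(0.0, 0.1, (20, 2)),
    rng.normal(5.0, 0.1, (20, 2)),
])
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

medoids, labels = k_medoids(dist, n_clusters=2)
print(medoids, labels)
```

Because every medoid is an actual frame, the algorithm only ever needs pairwise distances, so the RMSD matrix can be used as is.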
