I've realized that KMeans clustering in sklearn has no option to take a distance matrix in its fitting function, yet that is how I've been using it. In other words, we have been passing an [n_sample x n_sample] matrix where KMeans expects an [n_sample x n_feature] matrix. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans. To get an [n_sample x n_feature] matrix, it seems we would need to know beforehand what the reference structure (native structure) should be, with the RMSD to that reference as the single feature, and that wouldn't work for identifying multiple secondary structures.
The reason KMeans runs at all and gives an OK solution for medoids is that it treats each row of the matrix as a feature vector, so the RMSD to each frame becomes its own feature. This, I think, is not what we want, so I would avoid using it for RMSD-based clustering to find a native state.
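To make the point above concrete, here is a minimal sketch on synthetic data. The `frames` array and the Euclidean `dist` matrix are stand-ins I made up for trajectory frames and the RMSD matrix; they are not from our actual pipeline. KMeans accepts the square matrix without complaint, and the fitted centers end up in an n_sample-dimensional "distance to every frame" space:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-in for trajectory frames: two well-separated groups of 20 points.
frames = np.vstack([
    rng.normal(0.0, 0.1, (20, 3)),
    rng.normal(5.0, 0.1, (20, 3)),
])

# Pairwise Euclidean distances, used here as a stand-in for the RMSD matrix.
dist = np.linalg.norm(frames[:, None, :] - frames[None, :, :], axis=-1)

# KMeans has no "precomputed" option, so each ROW of dist is read as a feature vector.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(dist)

# The centers have one coordinate per frame, i.e. shape (n_clusters, n_sample).
centers_shape = km.cluster_centers_.shape
```

The clustering can still look reasonable on clean data like this, because frames that are close to each other also have similar distance profiles, but the objective being minimized is not the one we intended.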
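These are the algorithms in sklearn that can take a precomputed distance (or similarity) matrix as input:
AffinityPropagation (expects a similarity matrix, so distances must be converted first)
AgglomerativeClustering (several linkage types to consider; ward linkage requires Euclidean features, so it cannot be used with a precomputed matrix)
DBSCAN (requires careful tuning of the eps and min_samples parameters)
OPTICS (requires careful tuning of the min_samples and xi parameters)
SpectralClustering (expects an affinity matrix, so distances must be converted first)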
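As a rough illustration of the precomputed route, the sketch below passes a square distance matrix to DBSCAN with `metric="precomputed"` and then picks a medoid per cluster directly from that matrix. The synthetic `frames`, the `eps` and `min_samples` values, and the `medoid_index` helper are all illustrative assumptions, not tuned settings or existing code:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Hypothetical stand-in for trajectory frames: two tight groups of 30 points.
frames = np.vstack([
    rng.normal(0.0, 0.1, (30, 3)),
    rng.normal(5.0, 0.1, (30, 3)),
])

# Pairwise Euclidean distances, used here as a stand-in for the RMSD matrix.
dist = np.linalg.norm(frames[:, None, :] - frames[None, :, :], axis=-1)

# metric="precomputed" tells DBSCAN that dist is [n_sample x n_sample].
# eps and min_samples are illustrative values for this toy data only.
labels = DBSCAN(eps=0.5, min_samples=5, metric="precomputed").fit_predict(dist)


def medoid_index(dist, members):
    """Return the member with the smallest summed distance to the other members."""
    sub = dist[np.ix_(members, members)]
    return members[np.argmin(sub.sum(axis=1))]


# One medoid per cluster; label -1 marks DBSCAN noise points and is skipped.
medoids = {
    int(k): int(medoid_index(dist, np.flatnonzero(labels == k)))
    for k in np.unique(labels)
    if k != -1
}
```

Because the medoid is computed from the distance matrix alone, the same helper should work with labels from any of the algorithms listed above.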
Density-based clustering has been tricky to get working (the parameters seem very system-dependent). I will look into the others.
After playing with several of these methods, I found that none of them is robust on the very noisy dataset we have for weakly cooperative transitions; each needs its clustering parameters finely tuned.
However, if we first filter out data points with few neighbors within a cutoff radius (based on RMSD distances), we can reliably identify high-density regions of conformational space. The filtering criterion can be specified as a percentage, so it is very generally applicable. I got the idea from here: https://link.springer.com/article/10.1007/s10822-013-9689-8, where DBSCAN is used successfully on filtered data in which only the densest 1% of points pass the filter. I will add this in a PR later today.
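A minimal sketch of the filtering idea, under my own assumptions: `filter_by_density`, its `cutoff` and `keep_fraction` arguments, and the synthetic data are hypothetical and are not the code that will go into the PR or the procedure from the linked paper. It counts neighbors within the cutoff for every frame, keeps the densest fraction, and runs DBSCAN on the reduced distance matrix:

```python
import numpy as np
from sklearn.cluster import DBSCAN


def filter_by_density(dist, cutoff, keep_fraction):
    """Keep the keep_fraction of frames with the most neighbors within cutoff."""
    # Count neighbors within the cutoff; subtract 1 to exclude the frame itself.
    n_neighbors = (dist < cutoff).sum(axis=1) - 1
    n_keep = max(1, int(round(keep_fraction * len(dist))))
    # Indices of the n_keep densest frames, returned in ascending index order.
    keep = np.sort(np.argsort(n_neighbors)[::-1][:n_keep])
    return keep, dist[np.ix_(keep, keep)]


rng = np.random.default_rng(0)

# Hypothetical data: two dense groups (indices 0-79) plus sparse uniform noise.
frames = np.vstack([
    rng.normal(0.0, 0.1, (40, 3)),
    rng.normal(5.0, 0.1, (40, 3)),
    rng.uniform(-10.0, 10.0, (120, 3)),
])

# Pairwise Euclidean distances, used here as a stand-in for the RMSD matrix.
dist = np.linalg.norm(frames[:, None, :] - frames[None, :, :], axis=-1)

# Keep the densest 25% of frames; cutoff and fraction are illustrative values.
keep, dist_filtered = filter_by_density(dist, cutoff=0.5, keep_fraction=0.25)

# Cluster only the filtered frames, then map labels back to original indices.
labels = DBSCAN(eps=0.5, min_samples=5, metric="precomputed").fit_predict(dist_filtered)
frame_labels = dict(zip(keep.tolist(), labels.tolist()))
```

Specifying the filter as a fraction rather than an absolute neighbor count is what should make it transferable between systems, since the absolute density depends on trajectory length and on the cutoff.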