colors don't correspond with clusters #46

Open
rurounigit opened this issue Dec 18, 2024 · 7 comments

Comments

@rurounigit

Maybe I'm not getting this, but I thought the colors are meant to represent the clustering: items that are close to each other and in the same cluster get the same color.

I ran this for clustering (simple enough, I thought):

green<|SEP|>light green<|SEP|>red<|SEP|>blue<|SEP|>pink<|SEP|>dark green<|SEP|>dark blue<|SEP|>light blue

now I see this:
[Image: Screenshot 2024-12-18 at 13 54 18]

The yellow and red dots that are so close to each other are "light blue" and "dark blue". They are close, as they should be. But why do they have different colors? That makes no sense to me at all.

@KennethEnevoldsen

I believe they do correspond to clusters, but the 3D view makes this hard to see. How does it look if you use 2D? (It might also be that the clustering happens in higher dimensions and doesn't align with the dimensionality reduction.)

@Muennighoff correct me if I am wrong

@Muennighoff
Contributor

Note that the coloring is based on a KMeans model that is fitted on the multidimensional embeddings of the model so as @KennethEnevoldsen said, it may not always visualize nicely in 3/2D.
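
One way to see why nearby dots can still end up in different clusters: PCA is an orthogonal projection, so it can shrink the distance between two embeddings but never increase it. The NumPy sketch below illustrates this with random vectors standing in for real embeddings. The data, the dimensions, and the helper names `pca_project` and `pairwise_distances` are assumptions for illustration, not the arena's code.

```python
import numpy as np


def pca_project(x: np.ndarray, n_dims: int) -> np.ndarray:
    """Project the rows of x onto their top n_dims principal components."""
    centered = x - x.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_dims].T


def pairwise_distances(x: np.ndarray) -> np.ndarray:
    """Euclidean distance between every pair of rows."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 32))  # 8 made-up "texts" in 32 dimensions
coords = pca_project(embeddings, n_dims=3)

full = pairwise_distances(embeddings)  # distances that KMeans works with
shown = pairwise_distances(coords)     # distances visible in the 3D plot

# The projection never makes two points look farther apart than they are.
assert np.all(shown <= full + 1e-9)

# Find the pair whose distance shrank the most (the diagonal is set to 1.0).
safe_full = np.where(full > 0, full, 1.0)
ratio = np.where(full > 0, shown / safe_full, 1.0)
i, j = np.unravel_index(np.argmin(ratio), ratio.shape)
print(f"texts {i} and {j}: {full[i, j]:.2f} apart in 32-D, {shown[i, j]:.2f} apart in 3-D")
```

Two dots that look adjacent in the plot may therefore still be far apart along directions the projection discarded, and a KMeans model fitted on the full embeddings uses those directions too.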

@rurounigit
Author

rurounigit commented Dec 23, 2024

> Note that the coloring is based on a KMeans model that is fitted on the multidimensional embeddings of the model so as @KennethEnevoldsen said, it may not always visualize nicely in 3/2D.

I see. So if I (or other people) evaluate the results, what should I care about? Is it basically that only the colors matter, and the 2D/3D plot might or might not be useful? Why would dark blue and light blue not be in the same cluster (the rest is clustered by color correctly)? In 3D the two dots are very close from all angles.

@KennethEnevoldsen

Def. agree that ideally the KMeans clusters and the down-projection should align. My subjective experience is that they generally do, but do let me know if you consistently find errors.

@Muennighoff
Contributor

Muennighoff commented Dec 23, 2024

> what should I care about ?

Great question - I think a good model would embed texts such that:

  1. The KMeans model fitted on these embeddings clearly puts similar concepts in similar clusters (This is visualized via the colors)
  2. Applying a dimensionality reduction method (e.g. PCA) to the embeddings and then displaying them (e.g. in 3D/2D) will clearly have similar concepts close to each other

So I think there are two things to care about and the better model is the one for which both of those are better. If this makes sense, I can add it to the 'Rules' section on the Clustering tab?
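
As a rough way to check how far the two criteria agree for a given set of embeddings, one could compare the KMeans grouping on the full embeddings with a second KMeans grouping on the PCA coordinates. The sketch below does this with scikit-learn's adjusted Rand index. The function name `color_position_agreement` and the synthetic data are assumptions for illustration and are not part of the arena code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score


def color_position_agreement(embeddings, n_clusters, n_dims=3, seed=0):
    """Adjusted Rand index between grouping by color and grouping by position."""
    # Criterion 1 (colors): KMeans fitted on the full embeddings.
    colors = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embeddings)
    # Criterion 2 (positions): PCA coordinates, grouped by a second KMeans.
    coords = PCA(n_components=n_dims).fit_transform(embeddings)
    by_position = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(coords)
    return adjusted_rand_score(colors, by_position)


# Synthetic stand-in for embeddings: three tight, well-separated groups in 16-D.
rng = np.random.default_rng(0)
centers = 10.0 * np.eye(3, 16)
demo = np.vstack([c + 0.1 * rng.normal(size=(5, 16)) for c in centers])

score = color_position_agreement(demo, n_clusters=3)
print(f"agreement between colors and positions: {score:.2f}")
```

A score near 1 means both views group the texts the same way. Lower scores indicate the kind of color/position conflict described in this thread.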

@rurounigit
Author

> what should I care about ?
>
> Great question - I think a good model would embed texts such that:
>
>   1. The KMeans model fitted on these embeddings clearly puts similar concepts in similar clusters (This is visualized via the colors)
>   2. Applying a dimensionality reduction method (e.g. PCA) to the embeddings and then displaying them (e.g. in 3D/2D) will clearly have similar concepts close to each other
>
> So I think there are two things to care about and the better model is the one for which both of those are better. If this makes sense, I can add it to the 'Rules' section on the Clustering tab?

I mean, that is what I thought, but in many cases with 4+ clusters I got conflicting clusterings (clusters by color did not match clusters by position). In those cases there are basically two results. Which one counts, and why do we need both if having the colors and the graph does not help evaluate the result but actually confuses things and makes it harder? Either just displaying the color clusters somehow (e.g. in lists) or coloring the dots according to their coordinates seems much more logical to me. But I'm neither a mathematician nor a researcher or programmer, so just my two cents.

@Muennighoff
Contributor

> having the dots colored according to coordinates seems much more logical to me.

Yeah, maybe this is better. I think we could do this by fitting the KMeans on the dimensionality-reduced embeddings instead; the colors should then correspond more closely to the actual positions in the graph. We could consider having a toggle for this at first and see what people prefer. Should be easy to implement here, in case someone has bandwidth:

arena/models.py, line 273 in 624d345

def clustering(self, queries, model_name, ncluster=1, ndim="3D", dim_method="PCA", clustering_method="KMeans", single_ui=True):
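
A minimal sketch of what such a toggle might look like, written as a standalone function and not as a patch to `arena/models.py`. The `ncluster` and `ndim` names mirror the signature quoted above; the function name `cluster_for_plot`, the flag `cluster_on_reduced`, the `seed` argument, and the synthetic data are hypothetical, and the body of the real `clustering` method may well differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def cluster_for_plot(embeddings, ncluster, ndim="3D", cluster_on_reduced=False, seed=0):
    """Return (plot coordinates, cluster labels) for a set of embeddings."""
    n_components = 3 if ndim == "3D" else 2
    coords = PCA(n_components=n_components).fit_transform(embeddings)
    # Hypothetical toggle: fit KMeans either on the full embeddings
    # (the behavior described in this thread) or on the plotted coordinates.
    features = coords if cluster_on_reduced else embeddings
    labels = KMeans(n_clusters=ncluster, n_init=10, random_state=seed).fit_predict(features)
    return coords, labels


# Synthetic stand-in for embeddings: two well-separated groups in 16-D.
rng = np.random.default_rng(0)
centers = 10.0 * np.eye(2, 16)
demo = np.vstack([c + 0.1 * rng.normal(size=(4, 16)) for c in centers])

coords_full, labels_full = cluster_for_plot(demo, ncluster=2, ndim="2D")
coords_red, labels_red = cluster_for_plot(demo, ncluster=2, ndim="2D", cluster_on_reduced=True)
print(labels_full, labels_red)
```

With the flag on, the colors follow what is visible in the plot, at the cost of ignoring whatever structure the projection discards. Exposing both modes would let users compare them.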
