colors don't correspond with clusters #46
Comments
I believe the colors do correspond to clusters, and that the 3D view simply makes this hard to see. How does it look if you use 2D? (It might also be that the clustering happens in higher dimensions and doesn't align with the dimensionality reduction.) @Muennighoff, correct me if I am wrong. |
Note that the coloring is based on a KMeans model that is fitted on the multidimensional embeddings of the model, so, as @KennethEnevoldsen said, it may not always visualize nicely in 2D or 3D. |
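To illustrate the point above, here is a minimal sketch of how labels from a KMeans model fitted on the full embeddings can disagree with what a reduced plot shows. It is not the project's code: the synthetic 3D data, the use of PCA as the reduction, and all variable names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical "embeddings": four tight blobs in 3D. The first and last blob
# differ only along the third axis, which carries far less variance than the
# first two axes.
centers = np.array([
    [0.0, 0.0, 0.0],
    [10.0, 0.0, 0.0],
    [0.0, 10.0, 0.0],
    [0.0, 0.0, 1.0],
])
points_per_blob = 20
embeddings = np.vstack(
    [c + rng.normal(scale=0.05, size=(points_per_blob, 3)) for c in centers]
)

# Colors come from KMeans fitted on the full-dimensional embeddings.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)

# The plot uses a 2D projection (PCA is an assumed stand-in for the reduction).
projected = PCA(n_components=2).fit_transform(embeddings)

first_blob = slice(0, points_per_blob)
last_blob = slice(3 * points_per_blob, 4 * points_per_blob)

# Distance between the two blob centroids before and after the projection.
dist_full = np.linalg.norm(
    embeddings[first_blob].mean(axis=0) - embeddings[last_blob].mean(axis=0)
)
dist_plot = np.linalg.norm(
    projected[first_blob].mean(axis=0) - projected[last_blob].mean(axis=0)
)

print("labels of first blob:", set(labels[first_blob].tolist()))
print("labels of last blob: ", set(labels[last_blob].tolist()))
print("centroid distance in full space:", dist_full)
print("centroid distance in 2D plot:   ", dist_plot)
```

In this toy setup the two blobs get different colors because they are well separated in the full space, yet the projection discards most of the axis that separates them, so they land almost on top of each other in the plot.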
I see. So if I (or other people) evaluate the results, what should I care about? Basically, only the colors matter, and the 2D/3D plot might or might not be useful? Why would dark blue and light blue not be in the same cluster (the rest is clustered by color correctly)? In 3D the two dots are very close from all angles. |
Definitely agree that ideally the KMeans clusters and the down-projection should align. My subjective experience is that they generally do, but do let me know if you consistently find errors. |
Great question - I think a good model would embed texts such that:
So I think there are two things to care about and the better model is the one for which both of those are better. If this makes sense, I can add it to the 'Rules' section on the Clustering tab? |
I mean, that is what I thought, but in many cases with 4+ clusters I got conflicting clustering (clusters by color <> clusters by position). In those cases there are basically two results. Which one counts, and why do we even need both if having the colors and the graph does not help with evaluating the result but actually confuses and makes it harder? Either just displaying the color clusters somehow, e.g. in lists, or having the dots colored according to their coordinates seems much more logical to me. But I'm neither a mathematician nor a researcher or programmer, so just my two cents. |
Yeah, maybe this is better. I think we could do this by fitting the KMeans on the dimensionality-reduced embeddings instead, so that the colors more closely correspond to the actual positions in the graph. We could consider having a toggle for this at first and see what people prefer. Should be easy to implement here: Line 273 in 624d345
|
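A rough sketch of what such a toggle could look like is given below. This is not the code at the referenced line: the function name `cluster_for_plot`, the `cluster_on_reduced` flag, the choice of PCA as the reduction, and the random demo data are all hypothetical and only meant to show the idea.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def cluster_for_plot(embeddings, n_clusters, cluster_on_reduced=False,
                     n_dims=3, seed=0):
    """Return plot coordinates and cluster labels for coloring.

    Hypothetical helper: PCA stands in for whatever reduction the app uses.
    """
    # Coordinates shown in the plot.
    coords = PCA(n_components=n_dims).fit_transform(embeddings)

    # Toggle: fit KMeans either on the plotted coordinates or on the
    # original high-dimensional embeddings.
    features = coords if cluster_on_reduced else embeddings
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(features)
    return coords, labels


# Demo with random stand-in embeddings (40 texts, 16 dimensions).
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(40, 16))

coords_full, labels_full = cluster_for_plot(demo_embeddings, n_clusters=4)
coords_red, labels_red = cluster_for_plot(
    demo_embeddings, n_clusters=4, cluster_on_reduced=True
)
```

With the flag on, the colors describe the plotted positions rather than the model's full embedding space, so they are easier to read but say less about the embeddings themselves. That trade-off is presumably why a toggle is suggested rather than a replacement.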
Maybe I'm not getting this, but I thought the colors are meant to represent the clustering (what is close to each other, i.e. in the same cluster, gets the same color).
I ran this for clustering (simple enough, I thought):
green<|SEP|>light green<|SEP|>red<|SEP|>blue<|SEP|>pink<|SEP|>dark green<|SEP|>dark blue<|SEP|>light blue
Now I see this:
The yellow and red dots that are so close to each other are "light blue" and "dark blue". They are close, as they should be. But why do they have different colors? That makes no sense to me at all.