Learning Objectives Slides
- The first unsupervised learning algorithm is the topic of this class
- The K-means clustering algorithm and its visualization will be presented
- The number of clusters and how to choose it
- The application of K-means clustering to text and Iris data will be explored
- Clustering is a type of unsupervised learning
- It is used very often because labeled data is usually not available
- K-Means clustering is one of the most popular clustering algorithms
- The goal of any clustering algorithm is to find groups (clusters) in the given data
- Text Clustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]
# Turn each document into a TF-IDF vector, ignoring English stop words
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++')
# Fit the model first; cluster_centers_ does not exist before fitting
model.fit(X)
print(model.cluster_centers_.argsort()[:, ::-1])
print("Top terms per cluster:")
# For each centroid, sort the term indices by descending weight
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
# Predict the cluster of new, unseen documents
Y = vectorizer.transform(["chrome browser to open."])
prediction = model.predict(Y)
print(prediction)
Y = vectorizer.transform(["My cat is hungry."])
prediction = model.predict(Y)
print(prediction)
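The text clustering snippet imports `adjusted_rand_score` but never calls it. As a rough sketch of how it could be used, the block below scores the K-means clusters against topic labels. The `true_labels` list is a hypothetical hand assignment (0 = cats, 1 = Google) made for this illustration and is not part of the original data. `n_init` and `random_state` are set only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

# Hypothetical hand-assigned topics (0 = cats, 1 = Google); an assumption for this sketch
true_labels = [0, 0, 1, 1, 0, 0, 1, 1]

X = TfidfVectorizer(stop_words='english').fit_transform(documents)
model = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(X)

# The adjusted Rand index compares two labelings while ignoring the cluster numbering;
# 1.0 means identical groupings, values near 0 mean chance-level agreement
score = adjusted_rand_score(true_labels, model.labels_)
print("Adjusted Rand index: %.3f" % score)
```

Because the score ignores how clusters are numbered, it does not matter whether K-means calls the cat group cluster 0 or cluster 1.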
- Movie dataset clustering -> we expect movies with similar genres to be clustered in the same group
- News Article Clustering
Assume the inputs are
- Step 1 - Pick $$K$$ random points as cluster centers, called centroids
- Step 2 - Assign each $$x_i$$ to the nearest cluster by calculating its distance to each centroid
- Step 3 - Find the new cluster center by taking the average of the assigned points
- Step 4 - Repeat Steps 2 and 3 until none of the cluster assignments change
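The four steps above can be sketched in plain Python, which keeps each step visible. This is a minimal illustration, not the scikit-learn implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the empty-cluster rule, and the toy points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Plain-Python K-means following Steps 1-4 (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: pick K random data points as the initial centroids
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each x_i to the nearest centroid (Euclidean distance)
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once none of the assignments change
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the average of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # assumption: keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Toy data: two well-separated groups of 2-D points
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

Which group ends up as cluster 0 depends on the random initial centroids, so only the grouping itself is meaningful, not the cluster numbers.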
from figures import plot_kmeans_interactive
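import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
# X must be a dense 2-D array here; cdist does not accept sparse matrices
distortions = []
K = range(1, 10)
for k in K:
    km = KMeans(n_clusters=k).fit(X)
    # Distortion: average distance from each point to its nearest centroid
    distortions.append(sum(np.min(cdist(X, km.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.title('The Elbow Method showing the optimal k')
plt.show()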
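# Alternative: inertia_ is the sum of squared distances of samples to their nearest centroid
Sum_of_squared_distances = []
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_transformed)
    Sum_of_squared_distances.append(km.inertia_)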
Choose an arbitrary K
1- Compute the distances of all red points to the red centroid
2- Repeat step (1) for the other colors (purple, blue, ...)
3- Add them up
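The three steps above can be written as a short function, where each "color" is one cluster label. This is a sketch under assumptions: the name `distortion`, the `squared` flag, and the toy data are made up for the example. With `squared=True` the total corresponds to the sum of squared distances that scikit-learn exposes as `inertia_`.

```python
import math


def distortion(points, assignments, centroids, squared=False):
    """Sum the point-to-centroid distances over all clusters (illustrative sketch)."""
    total = 0.0
    for cluster_id, centroid in enumerate(centroids):
        # Steps 1-2: distances from every point of this cluster ("color") to its centroid
        distances = [
            math.dist(p, centroid)
            for p, a in zip(points, assignments)
            if a == cluster_id
        ]
        if squared:
            distances = [d ** 2 for d in distances]
        # Step 3: add them up
        total += sum(distances)
    return total


# Toy data: two clusters with known centroids
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
assignments = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 2.0)]

plain_total = distortion(points, assignments, centroids)
squared_total = distortion(points, assignments, centroids, squared=True)
print(plain_total, squared_total)
```

Repeating this for several values of K and plotting the totals gives the curve used by the elbow method.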