Part 2: Examine Tweets and Train a Model

The second program examines the data found in tweets and trains a language classifier using KMeans clustering on the tweets:

  • Examine - Spark SQL is used to gather data about the tweets: to look at a few of them and to count the total number of tweets for each of the most common user languages.
  • Train - Spark MLlib is used to apply the KMeans algorithm to cluster the tweets. The number of clusters and the number of iterations of the algorithm are configurable. After the model is trained, some sample tweets from the different clusters are shown.
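
The two steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not the program's actual Spark SQL and Spark MLlib code. The sample tweets, the `lang` and `text` field names, the bigram-hashing featurization, and every function name are assumptions made for the example.

```python
import random
import zlib
from collections import Counter

NUM_FEATURES = 64  # hypothetical feature-vector length


def top_languages(tweets, n=3):
    """Examine step: count tweets per language and return the n most common."""
    return Counter(t["lang"] for t in tweets).most_common(n)


def featurize(text, num_features=NUM_FEATURES):
    """Hash character bigrams into a fixed-length count vector (assumed featurization)."""
    vec = [0.0] * num_features
    for i in range(len(text) - 1):
        bigram = text[i:i + 2]
        # crc32 is used because it is deterministic across runs, unlike hash() on str.
        vec[zlib.crc32(bigram.encode("utf-8")) % num_features] += 1.0
    return vec


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point, centers):
    """Index of the center nearest to the point."""
    return min(range(len(centers)), key=lambda j: squared_distance(point, centers[j]))


def kmeans(points, num_clusters, num_iterations, seed=0):
    """Train step: basic k-means with a configurable cluster and iteration count."""
    rng = random.Random(seed)
    # Start from num_clusters distinct points picked at random.
    centers = [list(p) for p in rng.sample(points, num_clusters)]
    for _ in range(num_iterations):
        groups = [[] for _ in range(num_clusters)]
        for p in points:
            groups[closest(p, centers)].append(p)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers


# Hypothetical sample data standing in for collected tweets.
tweets = [
    {"lang": "en", "text": "the weather is nice today"},
    {"lang": "en", "text": "this is the best thing ever"},
    {"lang": "en", "text": "what is the time there"},
    {"lang": "es", "text": "el tiempo es muy bueno hoy"},
    {"lang": "es", "text": "que hora es en la ciudad"},
    {"lang": "fr", "text": "il fait tres beau ce matin"},
]

print(top_languages(tweets, n=2))

points = [featurize(t["text"]) for t in tweets]
centers = kmeans(points, num_clusters=3, num_iterations=10)

# Show the sample tweets grouped by their assigned cluster.
for t, p in zip(tweets, points):
    print(closest(p, centers), t["text"])
```

With so few tweets the clusters will not line up cleanly with languages. The point is only the shape of the workflow: count by language, turn each tweet into a vector, then run k-means for a chosen number of clusters and iterations.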

See here for the command to run part 2.