A small project that demonstrates the use of the Twitter API and NLP techniques. The idea is to download tweets from specified accounts (news companies), cluster the tweets into topics, detect the hottest topic, and output the most relevant news tweet from that topic.
This is not production-ready code; it is a proof of concept.
The project uses the following tools and techniques:
- Twitter API (python-twitter implementation)
- Google Word2Vec feature generation (pre-trained vectors trained on part of the Google News dataset)
- k-means clustering (sklearn.cluster.KMeans)
- Silhouette values to estimate the number of clusters (sklearn.metrics.silhouette_score)
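As a rough illustration of how the last two items fit together, the sketch below runs k-means for several candidate cluster counts and keeps the one with the highest mean silhouette score. It is not taken from this repository: the helper name `pick_num_clusters`, the candidate range, and the synthetic data standing in for tweet vectors are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_num_clusters(vectors, candidates, seed=0):
    """Return (best_k, scores): the candidate k with the highest mean silhouette.

    Hypothetical helper for illustration; not part of this project.
    """
    scores = {}
    for k in candidates:
        # Fit k-means and get one cluster label per row of `vectors`.
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(vectors)
        # Mean silhouette over all samples; higher means tighter, better-separated clusters.
        scores[k] = silhouette_score(vectors, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for tweet vectors: three well-separated blobs in 5 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
vectors = np.vstack([c + rng.normal(scale=0.5, size=(30, 5)) for c in centers])

best_k, scores = pick_num_clusters(vectors, candidates=range(2, 7))
print(best_k)
```

On real tweet vectors the silhouette curve is usually much flatter than on this toy data, so the chosen number of clusters is best treated as an estimate.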
The code is split into separate files to make debugging and testing easier.
get_tweets.py - downloads news tweets.
create_hist_dataset.py - cleans and saves dataset.
save_vectors.py - converts sentences to vectors and saves the result for further modelling.
detect_hot.py - prints out 'hot' tweets.
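One common way to turn a sentence into a single vector is to average the word vectors of its tokens. Whether save_vectors.py does exactly this is an assumption; the sketch below only illustrates the idea. The function name `sentence_vector` and the toy 3-dimensional embeddings are hypothetical, and real code would look words up in the pre-trained Google News vectors instead.

```python
def sentence_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known.

    Hypothetical helper for illustration; not part of this project.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every word vector together.
    return [sum(component) / len(known) for component in zip(*known)]


# Toy 3-dimensional embeddings (made-up values, for illustration only).
toy_vectors = {
    "storm": [1.0, 0.0, 2.0],
    "hits": [0.0, 2.0, 0.0],
    "coast": [2.0, 1.0, 1.0],
}

# "the" has no vector in the toy vocabulary, so it is skipped.
vec = sentence_vector("storm hits the coast".split(), toy_vectors, dim=3)
print(vec)  # [1.0, 1.0, 1.0]
```

Words missing from the vocabulary are skipped here, which is one simple choice among several for short, noisy text such as tweets.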
Some of the thought process is recorded in Jupyter notebooks.
NLP explore.ipynb - exploration of clustering.
Tune parameters.ipynb - exploration of tuning the heuristic parameters.