This repository contains the code for the Insight Project, developed during the first weeks. It includes Python scripts for data preprocessing, cluster computation, and the web application.
cleaning_books.py
The first step cleans up the GoodReads book data: it extracts meaningful tags and selects the science-fiction-related books.
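As an illustration of this cleaning step, here is a minimal sketch. It is not the code in cleaning_books.py: the tag sets, the record layout (a dict with "title" and "tags"), and the function names are all assumptions made for the example.

```python
import re

# Hypothetical set of tags treated as science-fiction related.
SCIFI_TAGS = {"science-fiction", "sci-fi", "scifi", "sf"}
# Hypothetical shelf names that say nothing about a book's content.
NOISE_TAGS = {"to-read", "currently-reading", "owned", "favorites"}


def normalize_tag(tag):
    """Lowercase a tag and collapse spaces/underscores into hyphens."""
    return re.sub(r"[\s_]+", "-", tag.strip().lower())


def clean_books(books):
    """Keep science-fiction books and drop uninformative tags."""
    cleaned = []
    for book in books:
        tags = {normalize_tag(t) for t in book["tags"]}
        # A book is kept only if at least one tag marks it as science fiction.
        if tags & SCIFI_TAGS:
            cleaned.append(
                {"title": book["title"], "tags": sorted(tags - NOISE_TAGS)}
            )
    return cleaned
```

Normalizing tags before filtering means that spelling variants such as "Science Fiction" and "science_fiction" are matched by the same entry.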
users_books_ratings.py
Builds the users x books ratings matrix from the database.
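A rough sketch of how such a matrix could be assembled is shown below. The database engine (SQLite here), the table name `ratings`, and its columns `user_id`, `book_id`, and `rating` are assumptions for the example; the actual schema is not described in this README.

```python
import sqlite3


def ratings_matrix(conn):
    """Return (users, books, matrix) where matrix[i][j] is the rating that
    users[i] gave to books[j], or 0 if that user did not rate that book."""
    # Hypothetical table and column names.
    rows = conn.execute(
        "SELECT user_id, book_id, rating FROM ratings"
    ).fetchall()
    users = sorted({row[0] for row in rows})
    books = sorted({row[1] for row in rows})
    user_index = {u: i for i, u in enumerate(users)}
    book_index = {b: j for j, b in enumerate(books)}
    # Start from an all-zero matrix and fill in the known ratings.
    matrix = [[0] * len(books) for _ in users]
    for user_id, book_id, rating in rows:
        matrix[user_index[user_id]][book_index[book_id]] = rating
    return users, books, matrix
```

For example, with an in-memory database:

```python
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INTEGER, book_id INTEGER, rating INTEGER)")
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", [(1, 10, 5), (2, 20, 3)])
users, books, matrix = ratings_matrix(conn)
```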
clustering.py
Performs a two-step clustering: first on the user-book ratings, then on publication year/secondary theme tag.
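The two-step idea can be sketched as follows. This is only an illustration under several assumptions: that the clustered items are books, that the first step is a k-means over each book's rating vector (a stand-in, since the algorithm used in clustering.py is not stated here), and that the second step splits each cluster by publication decade. The record layout and function names are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Tiny k-means with a deterministic start (first k points as centroids)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Move each centroid to the mean of its members; keep it if it has none.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def two_step_clusters(books, k):
    """Step 1: cluster books by rating vector. Step 2: split by decade."""
    labels = kmeans([b["ratings"] for b in books], k)
    clusters = {}
    for book, label in zip(books, labels):
        decade = book["year"] // 10 * 10
        clusters.setdefault((label, decade), []).append(book["title"])
    return clusters
```

Each resulting key is a (rating cluster, decade) pair, so books that receive similar ratings but come from different periods end up in separate groups. A secondary theme tag could be used in place of the decade in the same way.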
find_users.py
Matches a given user with similar users, according to the similarity matrix within each cluster.
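A minimal sketch of the matching step is given below. It assumes cosine similarity over rating vectors restricted to one cluster and computes the scores on the fly rather than reading them from a precomputed matrix; the similarity measure actually used in find_users.py is not stated here, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_similar_users(target, ratings, top_n=3):
    """Rank the other users by similarity to `target`.

    `ratings` maps each user to a rating vector over the books of one cluster.
    """
    scores = [
        (other, cosine_similarity(ratings[target], vector))
        for other, vector in ratings.items()
        if other != target
    ]
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]
```

Calling `find_similar_users("ann", ratings)` once per cluster gives one ranked list of candidate matches for each cluster.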
The tests folder contains unit tests for the Python scripts and some mock data for running them.