The implementation of data mining algorithms
The assignments in this repository implement algorithms for mining massive datasets, using Python and Spark.
Assignment | Topic | Framework | Related Algorithms |
---|---|---|---|
Assignment 1 | Yelp Rating Counting | Spark | Spark Operations (Transformations, Actions) |
Assignment 2 | Finding Frequent Itemsets | Spark | A-priori, SON |
Assignment 3 | Collaborative Filtering | Spark | LSH, User/Item-based CF |
Assignment 4 | Community Detection | Spark | Girvan-Newman |
Assignment 5 | Clustering | Spark | BFR algorithm |
Assignment 6 | Streaming Data Algorithms | Spark | Bloom Filter, Flajolet-Martin, Fixed-Size Sampling |
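
To give a flavor of the streaming topic in Assignment 6, here is a minimal Bloom filter sketch in plain Python. It is not the code from the assignment folder: the class name `BloomFilter`, its method names, the SHA-256-based hashing, and the default sizes are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions by salting a SHA-256 hash with the index.
        # (Assumption: any family of independent hash functions would do.)
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        # Set every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not seen"; True means "possibly seen".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for city in ["Las Vegas", "Phoenix", "Toronto"]:
    seen.add(city)

print(seen.might_contain("Phoenix"))     # an added item is always reported
print(seen.might_contain("Pittsburgh"))  # usually False; false positives are possible
```

A Bloom filter never produces false negatives, so every added item is reported as possibly present. The false-positive rate depends on the number of bits, the number of hash functions, and how many items have been added.
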
The assignments use Python 3.6 and Spark 2.3.2.
You can find each assignment's description and source code in the corresponding assignment folder.
The data set is from the Yelp Datasets Challenge and comes from the Yelp Datasets website.
Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman: [Mining of Massive Datasets]