JianhengHou/Mining_of_Massive_Datasets

Mining of Massive Datasets

The implementation of data mining algorithms

Description:

The assignments in this repository implement algorithms for mining massive datasets, written in Python with Spark.

| Assignment | Topic | Framework | Related Algorithms |
| --- | --- | --- | --- |
| Assignment 1 | Yelp Rating Counting | Spark | Spark operations (transformations, actions) |
| Assignment 2 | Finding Frequent Itemsets | Spark | A-priori, SON |
| Assignment 3 | Collaborative Filtering | Spark | LSH, user-based and item-based CF |
| Assignment 4 | Community Detection | Spark | Girvan-Newman |
| Assignment 5 | Clustering | Spark | BFR algorithm |
| Assignment 6 | Streaming Data Algorithms | Spark | Bloom filtering, Flajolet-Martin, fixed-size sampling |
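
To give a feel for the kind of algorithm listed above, here is a minimal pure-Python sketch of A-priori (the Assignment 2 topic). It is not the code from this repository, which runs on Spark; the function names `apriori` and `count_support` and the sample baskets are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def count_support(candidates, baskets):
    """Count how many baskets contain each candidate itemset."""
    return {c: sum(1 for b in baskets if c <= b) for c in candidates}


def apriori(baskets, support):
    """Return {itemset: count} for itemsets found in at least `support` baskets."""
    baskets = [frozenset(b) for b in baskets]

    # Pass 1: count single items and keep the frequent ones.
    item_counts = Counter(item for basket in baskets for item in basket)
    current = {frozenset([i]): n for i, n in item_counts.items() if n >= support}

    frequent = {}
    size = 1
    while current:
        frequent.update(current)
        size += 1

        # Candidates of the next size: every subset one item smaller must
        # already be frequent (the monotonicity property A-priori relies on).
        items = sorted({item for itemset in current for item in itemset})
        candidates = {
            frozenset(combo)
            for combo in combinations(items, size)
            if all(frozenset(s) in current for s in combinations(combo, size - 1))
        }

        # Next pass over the data: count candidates and filter by support.
        counts = count_support(candidates, baskets)
        current = {c: n for c, n in counts.items() if n >= support}
    return frequent


# Hypothetical sample data, not taken from the Yelp dataset.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]
result = apriori(baskets, support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a support threshold of 2, all three single items and all three pairs are frequent, while the triple appears in only one basket and is dropped. SON applies the same idea per data partition and then verifies the candidates globally, which is what makes it a natural fit for Spark.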

Programming Environment

Python 3.6 and Spark 2.3.2.

How To Use

The description and source code for each assignment are in the corresponding assignment folder.

Data Source

The data comes from the Yelp Dataset Challenge and is available on the Yelp Dataset website.

References

Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman: Mining of Massive Datasets
