GitHub - parth-code/SparkPageRank: Applying PageRank algorithm on the dblp dataset using spark

PageRank on the DBLP dataset using Spark

Installation requirements:

Spark 2.4.x
Hadoop 3.x.x
Java 8
Scala 2.1x.x
Sbt version 2.1x.x

I installed Hadoop and Spark directly on Windows, so I did not need Cloudera or Hortonworks to run my jars.

The algorithm is modified from the implementation at https://github.com/abbas-taher/pagerank-example-spark2.0-deep-dive/blob/master/SparkPageRank.scala
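For orientation, here is a minimal sketch of the rank update that the commonly circulated Spark PageRank example uses: each node splits its rank evenly among its out-links, and a node's new rank is 0.15 + 0.85 times the contributions it receives. It uses plain Scala collections rather than Spark RDDs so that it runs without a cluster. The object name `PageRankSketch` and the three-node graph are made up for illustration; this is not the project's actual code.

```scala
object PageRankSketch {
  // Hypothetical toy graph: each node maps to the nodes it links to.
  val links: Map[String, Seq[String]] = Map(
    "A" -> Seq("B", "C"),
    "B" -> Seq("C"),
    "C" -> Seq("A")
  )

  // One update step: every ranked node splits its rank evenly among its
  // out-links; a node's new rank is 0.15 + 0.85 * (sum of contributions).
  // Nodes that receive no contributions drop out of the result.
  def step(ranks: Map[String, Double]): Map[String, Double] = {
    val contribs: Seq[(String, Double)] = for {
      (src, dsts) <- links.toSeq
      rank        <- ranks.get(src).toSeq
      dst         <- dsts
    } yield (dst, rank / dsts.size)

    contribs
      .groupBy(_._1)
      .map { case (node, cs) => (node, 0.15 + 0.85 * cs.map(_._2).sum) }
  }

  // Start every node at rank 1.0 and apply the update a fixed number of times.
  def pageRank(iterations: Int): Map[String, Double] = {
    val initial: Map[String, Double] = links.keys.map(k => (k, 1.0)).toMap
    (1 to iterations).foldLeft(initial)((ranks, _) => step(ranks))
  }

  def main(args: Array[String]): Unit =
    pageRank(10).toSeq.sortBy(-_._2).foreach { case (n, r) => println(s"($n,$r)") }
}
```

With this formula the ranks are not normalized to sum to 1, which is why individual values can exceed 1.0.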

How to run:

1) Download the dblp.xml and dblp.dtd files from https://dblp.uni-trier.de/ and place both in the same folder.

2) Go to the project folder and run: `sbt clean compile`

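3) Run the job using one of the two methods below.

Method 1: Without jar

1) Run `sbt "run <input-file-location> <output-file-location>"` in a terminal (cmd or the IntelliJ terminal). The quotes make sbt pass both paths to `run` as arguments instead of treating them as separate sbt commands.

Use absolute paths and, on Windows, '\\' instead of '\'.

If running from IntelliJ, add the argument -Xmx6000m under Run -> Edit Configurations -> VM options.

This raises the JVM's maximum heap size to 6000 MB.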

Method 2: Using jar

1) Create the fat jar using
  `sbt assembly`

2) Submit it with
  `spark-submit --class prtest pagerank.jar <input-file-location> <output-file-location>`

The output folder will contain files named part-xxxxx. Open them as text files; they hold the resulting ranks.

Tests can be run using `sbt clean compile test`

Output Format:

Each line of the output is a (name,rank) pair of the form

(University of Paris-Sud, Orsay, France,0.8154981739701659)

(Elena,0.9850243302878131)

(John Bell,1.4621033282930214)

(University of Nice Sophia Antipolis, France,0.5678480111313514)

(Joseph Fourier University, Grenoble, France,0.8154981739701659)

(Elena Zheleva,1.3690036520596678)

(Acta Inf.,0.9850243302878131)
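Because names such as "University of Paris-Sud, Orsay, France" contain commas themselves, a naive split on the first comma would misread these lines. The following is a small sketch of one way to parse them, splitting on the last comma instead. The object name `ParseOutput` and the helper `parseLine` are hypothetical and not part of this project.

```scala
import scala.util.Try

object ParseOutput {
  // Parses one "(name,rank)" line. The split happens at the LAST comma,
  // since the name part may contain commas of its own.
  // Returns None for lines that do not match the expected shape.
  def parseLine(line: String): Option[(String, Double)] = {
    val trimmed = line.trim
    if (trimmed.length < 2 || !trimmed.startsWith("(") || !trimmed.endsWith(")")) None
    else {
      val body = trimmed.substring(1, trimmed.length - 1)
      val idx  = body.lastIndexOf(',')
      if (idx < 0) None
      else
        Try(body.substring(idx + 1).toDouble).toOption
          .map(rank => (body.substring(0, idx), rank))
    }
  }

  def main(args: Array[String]): Unit = {
    // Sample lines copied from the output shown above.
    val sample = Seq(
      "(University of Paris-Sud, Orsay, France,0.8154981739701659)",
      "(John Bell,1.4621033282930214)",
      "(Acta Inf.,0.9850243302878131)"
    )
    // Print the entries sorted by rank, highest first.
    sample.flatMap(parseLine).sortBy(-_._2).foreach { case (name, rank) =>
      println(f"$rank%.4f  $name")
    }
  }
}
```

The same function could be applied to every line of the part-xxxxx files to load the ranks for sorting or filtering.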
