This script reads a folder of files and counts the top ten countries that occur most often across the files. The main goal of the task is to try Spark SQL in action.
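Since the point of the exercise is Spark SQL, here is a minimal Scala sketch of the idea; the input path, the one-country-per-line format, and the app name are assumptions, and the API matches the Spark 1.x era implied by the scala-2.10 artifacts below:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TopCountries {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("top-countries"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Read every file in the folder; each non-empty line is assumed to
    // name one country (this layout is an assumption, not the repo's format).
    sc.textFile("data/countries/*")
      .map(_.trim)
      .filter(_.nonEmpty)
      .toDF("country")
      .registerTempTable("countries")

    // Top ten countries by number of occurrences, expressed in Spark SQL.
    sqlContext.sql(
      """SELECT country, COUNT(*) AS cnt
        |FROM countries
        |GROUP BY country
        |ORDER BY cnt DESC
        |LIMIT 10""".stripMargin)
      .collect()
      .foreach(println)
  }
}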
This file contains an R program that reads JSON vacancy listings and calculates mean salaries for programmers, designers, and system administrators, broken down by city. I chose SparkR as the implementation language because the task reads like a perfect match for what the language was designed for. However, I found SparkR incomplete and inefficient for complex tasks, so choosing it was a bad idea.
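For reference, the aggregation itself is compact when expressed through Spark SQL; below is a minimal Scala sketch of the equivalent query (the original program is SparkR, and the file name and the city/profession/salary field names are assumptions about the JSON schema):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MeanSalaries {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mean-salaries"))
    val sqlContext = new SQLContext(sc)

    // Schema is inferred from the JSON; the field names below are assumptions.
    sqlContext.read.json("vacancies.json").registerTempTable("vacancies")

    sqlContext.sql(
      """SELECT city, profession, AVG(salary) AS mean_salary
        |FROM vacancies
        |WHERE profession IN ('programmer', 'designer', 'sysadmin')
        |GROUP BY city, profession""".stripMargin)
      .show()
  }
}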
This example reads data from S3, counts the number of visits per browser per month, and saves the result into MySQL; a sketch of the job is given after the submit commands below. It is a Scala project managed by sbt. Compile it with the Assembly plugin like this:
cd ex7_browser
sbt -verbose assembly
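This assumes the sbt-assembly plugin is already on the build path; if it is not, a typical project/plugins.sbt entry looks like this (the version shown is illustrative, not necessarily the one this project pins):

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")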
Then you can submit the assembled jar to Spark:
path/to/spark-submit ./target/scala-2.10/browser-usage.jar
Lastly, I used the following command to submit it to a cluster; port 6066 is the REST submission endpoint that a standalone Spark master uses for --deploy-mode cluster:
path/to/spark-submit --deploy-mode cluster --master spark://big-data:6066 ./target/scala-2.10/browser-usage.jar
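As promised above, here is a minimal Scala sketch of what the job boils down to: an aggregation over S3 data followed by a JDBC write. The bucket path, the tab-separated (timestamp, browser) line format, the table name, and the MySQL credentials are all placeholders, and the MySQL JDBC driver is assumed to be on the classpath:

import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object BrowserUsage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("browser-usage"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Assumed input: tab-separated lines starting with an ISO timestamp
    // and a browser name; take(7) keeps the yyyy-MM month prefix.
    sc.textFile("s3n://my-bucket/logs/*")
      .map(_.split("\t"))
      .filter(_.length >= 2)
      .map(fields => (fields(0).take(7), fields(1)))
      .toDF("month", "browser")
      .registerTempTable("visits")

    val counts = sqlContext.sql(
      """SELECT month, browser, COUNT(*) AS visits
        |FROM visits
        |GROUP BY month, browser""".stripMargin)

    // Placeholder JDBC URL and credentials; mysql-connector-java must be
    // on the driver and executor classpath (e.g. bundled by assembly).
    val props = new Properties()
    props.setProperty("user", "spark")
    props.setProperty("password", "secret")
    counts.write.jdbc("jdbc:mysql://localhost:3306/stats", "browser_usage", props)
  }
}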