GettingStarted
miloveme edited this page Dec 11, 2011
·
9 revisions
This document is for those who …
- …have tried R but are not used to Hadoop and Map/Reduce.
- …have tried SQL but are not proficient with it, nor comfortable working at the command line.
- …want to keep their current R-based analysis style, or something close to it, without worrying much about distributed environments and massive data volumes.
- …want to get hands-on experience with distributed or big data analysis quickly.
Contents
- Installing RHive, configuring its settings, and simple troubleshooting.
- RHive examples
- A user guide from which R users who want to use RHive for big data analysis can learn basic instructions and the fundamentals of analysis.
RHive – GettingStarted
This document gives simple, intuitive instructions for users new to RHive.
What is RHive?
R language + Hive (an extension to Hadoop that allows SQL syntax to be used to access files stored in Hadoop and to run Map/Reduce tasks)
RHive is an R package that makes distributed data processing and big data analysis easy by letting users work with Hive and the Map/Reduce framework through R syntax.
Because it combines an R package with Hive, users can work with R and Hive at the same time, using both R and SQL syntax for big data analysis.
Things to know before using RHive
- Basic User
- R language (GNU-R) syntax
- R is an open source language for statistics and analysis (an open source alternative to SAS, SPSS, MATLAB, etc.).
- Basic Concepts of Map/Reduce
- Map/Reduce is a widely used method for processing distributed data.
- Basic Concepts of Hadoop
- Basic Concepts of Hive
- Advanced User
- Hive SQL syntax, concepts and the workings of UDF, UDAF
- Hadoop file system (HDFS)
- Advanced R syntax
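For readers new to the Map/Reduce idea listed above, the following is a minimal sketch of the concept in base R only. It runs on a single machine and does not use Hadoop, Hive, or RHive; the sample input lines and variable names are made up for illustration.

```r
# Toy input: each element stands in for one line of a (hypothetical) text file.
lines <- c("r hive hadoop", "hive sql", "r sql hive")

# Map step: split each line into words and emit a count of 1 per word.
mapped <- Map(function(line) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  setNames(rep(1L, length(words)), words)
}, lines)

# Shuffle step: flatten the emitted pairs and group the counts by word (the key).
pairs <- unlist(unname(mapped))
grouped <- split(unname(pairs), names(pairs))

# Reduce step: sum the counts for each word.
word_counts <- vapply(grouped, function(v) Reduce(`+`, v), integer(1))

print(word_counts)
```

On a real cluster the map and reduce steps run in parallel on many machines, and the grouping step is handled by the framework; the sketch only shows the shape of the computation.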
Introduction to RHive
Understanding RHive
- Lets users work with big data and Map/Reduce jobs without implementing them directly, using only R and SQL syntax.
- With only simple tasks in R, aggregation, preprocessing, and basic statistical analysis become easy.
- There is no need to fully understand Map/Reduce.
- Reference: Hive UDF
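As a rough illustration of the "only R and SQL syntax" point above, the sketch below wraps a Hive query in an R function. It assumes the RHive package is installed and a Hive server is reachable; the helper names `build_count_query` and `count_with_rhive`, and the table and column names `flights` and `carrier`, are hypothetical placeholders and not part of RHive.

```r
# Build the HiveQL text separately so it can be inspected without a cluster.
# `table` and `column` are placeholder names chosen by the caller.
build_count_query <- function(table, column) {
  sprintf("SELECT %s, COUNT(*) AS n FROM %s GROUP BY %s", column, table, column)
}

# Hypothetical helper: runs the aggregation in Hive and returns the result to R.
count_with_rhive <- function(host, table, column) {
  # Needs the RHive package and a reachable Hive server.
  stopifnot(requireNamespace("RHive", quietly = TRUE))
  RHive::rhive.init()            # initialise the RHive environment
  RHive::rhive.connect(host)     # connect to the Hive server at `host`
  on.exit(RHive::rhive.close())  # close the connection even if the query fails
  RHive::rhive.query(build_count_query(table, column))
}

# Inspecting the generated query needs no cluster.
print(build_count_query("flights", "carrier"))
```

With a working installation, a call such as `count_with_rhive("hive-server-host", "flights", "carrier")` would run the aggregation in Hive and hand the result back to the R session; the host name here is also a placeholder.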
Pros
- With nothing else required but knowledge of R and SQL syntax, anyone can process big data.
Cons
- Requires at least a basic understanding of Map/Reduce and Hive.
- Most analysts know almost nothing about these.
- Requires knowledge of SQL syntax.
- A significant share of analysts do not know SQL.
- Debugging is difficult (the root of this problem lies not in RHive but in the distributed environment).