Data Manipulation of Big data using plyrmr function #77

AVIN8233 · 2016-05-19T14:52:36Z

Hi all,
Let's say I have a very large dataset (10-20 GB) in .csv or .txt format, e.g. BIGDATA.csv.
Obviously, I don't want to read the data into memory. Instead, I want to put it into an HDFS directory with
hdfs.put(). Then, if I want to do some row or column operations, how can I use plyrmr functions?
For simplicity, let's say the data is "mtcars.csv". I want to put this file into an HDFS directory and then calculate a new column, carb.per.cycle = carb/cyl. Could you please suggest how to do this?
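To make the goal concrete, here is the computation I am after, sketched locally on R's built-in mtcars data frame with base R's transform(). It is only a small in-memory illustration (no HDFS or plyrmr involved), and the column name carb.per.cycle is just my own choice:

```r
# Local, in-memory illustration only: uses the built-in mtcars data frame,
# not a file in HDFS and not plyrmr.
data(mtcars)

# mtcars stores the number of cylinders in the column `cyl`.
# carb.per.cycle is an illustrative name for the derived column.
result <- transform(mtcars, carb.per.cycle = carb / cyl)

# Inspect the first rows of the new column next to its inputs.
head(result[, c("carb", "cyl", "carb.per.cycle")])
```

On the cluster I would like the same result, but computed on the file stored in HDFS rather than on a data frame in memory.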
I am using the Hadoop backend, set with rmr.options(backend="hadoop").
Here is what I tried, but it throws an error:

hdfs.mkdir("/user/cloudera/data")
hdfs.put("mtcars.csv","/user/cloudera/data")
bind.cols(input("/user/cloudera/data"), cycle=carb/cyl)

Output:
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob4514819939415411484.jar tmpDir=null
16/05/19 07:20:32 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
16/05/19 07:20:33 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
16/05/19 07:20:35 INFO mapred.FileInputFormat: Total input paths to process : 1
16/05/19 07:20:35 INFO mapreduce.JobSubmitter: number of splits:2
16/05/19 07:20:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1463635066767_0013
16/05/19 07:20:36 INFO impl.YarnClientImpl: Submitted application application_1463635066767_0013
16/05/19 07:20:36 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1463635066767_0013/
16/05/19 07:20:36 INFO mapreduce.Job: Running job: job_1463635066767_0013
16/05/19 07:20:48 INFO mapreduce.Job: Job job_1463635066767_0013 running in uber mode : false
16/05/19 07:20:48 INFO mapreduce.Job: map 0% reduce 0%
16/05/19 07:21:11 INFO mapreduce.Job: Task Id : attempt_1463635066767_0013_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

16/05/19 07:21:29 INFO mapreduce.Job: Task Id : attempt_1463635066767_0013_m_000000_1, Status : FAILED

AVIN8233 changed the title ~~Column Manipulation in Big data~~ Data Manipulation in Big data May 19, 2016

AVIN8233 changed the title ~~Data Manipulation in Big data~~ Data Manipulation of Big data using plyrmr function May 19, 2016
