Data Manipulation of Big data using plyrmr function #77

Open
AVIN8233 opened this issue May 19, 2016 · 0 comments
Comments

@AVIN8233

Hi all,
Let's say I have a very large data set, 10-20 GB, in .csv or .txt format (e.g. BIGDATA.csv or BIGDATA.txt).
Obviously, I do not want to read the data into memory. Instead, I want to put the data into an HDFS directory with hdfs.put(). Then, if I want to do some row or column operation, how can I use plyrmr functions?
For simplicity, let's say the data is "mtcars.csv". I want to put this file into an HDFS directory and then calculate carb.per.cycle = carb/cyl. Please suggest how to do this.
I am using rmr.options(backend="hadoop") #backend is hadoop
Here is what I tried, but it throws an error:

hdfs.mkdir("/user/cloudera/data")
hdfs.put("mtcars.csv","/user/cloudera/data")
bind.cols(input("/user/cloudera/data"), cycle=carb/cyl)

Output:
packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob4514819939415411484.jar tmpDir=null
16/05/19 07:20:32 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
16/05/19 07:20:33 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
16/05/19 07:20:35 INFO mapred.FileInputFormat: Total input paths to process : 1
16/05/19 07:20:35 INFO mapreduce.JobSubmitter: number of splits:2
16/05/19 07:20:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1463635066767_0013
16/05/19 07:20:36 INFO impl.YarnClientImpl: Submitted application application_1463635066767_0013
16/05/19 07:20:36 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1463635066767_0013/
16/05/19 07:20:36 INFO mapreduce.Job: Running job: job_1463635066767_0013
16/05/19 07:20:48 INFO mapreduce.Job: Job job_1463635066767_0013 running in uber mode : false
16/05/19 07:20:48 INFO mapreduce.Job: map 0% reduce 0%
16/05/19 07:21:11 INFO mapreduce.Job: Task Id : attempt_1463635066767_0013_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

16/05/19 07:21:29 INFO mapreduce.Job: Task Id : attempt_1463635066767_0013_m_000000_1, Status : FAILED
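
The log above only shows that the R subprocess exited with code 1; the actual R error message would be in the stderr log of the failed task attempt, which is not included here. One possible cause, which is an assumption and not confirmed anywhere in this issue, is that the CSV text is being read with rmr2's default "native" input format. Below is a minimal sketch of the same calculation using rmr2 directly with an explicit CSV input format. It assumes the file is written without a header row (so that no input split sees column names as data) and that column names are supplied by hand; the output column name carb.per.cyl is illustrative.

```r
library(rmr2)
library(rhdfs)

hdfs.init()
rmr.options(backend = "hadoop")

# Write the sample data without a header row or row names, so that every
# line of the file is a data record (assumption: header-free input).
write.table(mtcars, "mtcars.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)

hdfs.mkdir("/user/cloudera/data")
hdfs.put("mtcars.csv", "/user/cloudera/data")

# Describe the file as comma-separated text and supply the column names.
csv.format <- make.input.format("csv", sep = ",", col.names = names(mtcars))

# Map-only job: each map call receives a chunk of rows as a data frame
# and returns it with the derived column added.
result <- mapreduce(
  input = "/user/cloudera/data",
  input.format = csv.format,
  map = function(k, v) keyval(NULL, transform(v, carb.per.cyl = carb / cyl))
)

# Pull back a few rows for inspection only; for a 10-20 GB input the
# result should stay in HDFS rather than be loaded into memory.
head(values(from.dfs(result)))
```

If this sketch also fails, the stderr log of the failed map task (reachable from the job tracking URL in the output above) is the place to look for the underlying R error.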

@AVIN8233 changed the title from "Column Manipulation in Big data" to "Data Manipulation in Big data" on May 19, 2016
@AVIN8233 changed the title from "Data Manipulation in Big data" to "Data Manipulation of Big data using plyrmr function" on May 19, 2016