Can I get the raw record bytes from ebcdic file w/out parsing #656
Yes, exactly. Note that you can use the maximum possible size in the PIC clause; the field will be automatically truncated for each record if it is bigger than the record size. If the packed decimals are at fixed positions, I'd recommend splitting the segment, along the lines of the sketch below:
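A minimal sketch of such a split, with hypothetical field names, offsets, and picture clauses (the point is only that the packed decimal at a fixed offset gets its own COMP-3 declaration):

```scala
// Hypothetical split: the packed decimal at a fixed offset becomes its own
// COMP-3 field, and the surrounding bytes stay as opaque PIC X chunks.
val splitCopybook =
  """       01  RECORD.
    |           05  PART-BEFORE  PIC X(120).
    |           05  AMOUNT       PIC S9(7)V99 COMP-3.
    |           05  PART-AFTER   PIC X(500).
    |""".stripMargin
```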
Or you can get the entire record. This can be done in one of 3 ways; one of them creates a 'Record_bytes' binary field that you can process using, say, a UDF, and that way extract the packed decimals.
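For illustration, a sketch of that approach, assuming the `generate_record_bytes` reader option produces the binary `Record_bytes` column (verify the option and column names against your Cobrix version); the offset and length of the packed field are hypothetical:

```scala
import org.apache.spark.sql.functions.{col, udf}

// Catch-all copybook: the whole record as one field.
val copybook =
  """       01  RECORD.
    |           05  SEGMENT   PIC X(32000).
    |""".stripMargin

val df = spark.read
  .format("cobol")
  .option("copybook_contents", copybook)
  .option("record_format", "V")               // RDW-prefixed records
  .option("generate_record_bytes", "true")    // assumed: adds raw bytes per record
  .load("/path/to/ebcdic/file")

// Decode a COMP-3 (packed decimal) value from a slice of the raw record.
// Offset 120 and length 5 are hypothetical; adjust to your layout.
val unpackComp3 = udf { (bytes: Array[Byte]) =>
  val slice   = bytes.slice(120, 125)
  val nibbles = slice.flatMap(b => Seq((b >> 4) & 0x0F, b & 0x0F))
  val sign    = if (nibbles.last == 0x0D) -1L else 1L   // low nibble of last byte is the sign
  sign * nibbles.dropRight(1).foldLeft(0L)((acc, d) => acc * 10L + d)
}

val withAmount = df.withColumn("amount", unpackComp3(col("Record_bytes")))
```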
Thanks! That seems to be working.
Hi @yruslan. The program seems to be working fine with the above options. However, I am not able to achieve parallelization with my map functions on the DataFrame output. Once I get the Dataset after reading, I need to call a map function on each Row; this map function parses the bytes in the Row and creates a Dataset. This works fine with smaller files, but for a 1 GB file everything runs in a single-threaded fashion: I did not see any parallelism, and the number of partitions was always 1. Is there anything I should pass as a read option? Any idea how I can achieve parallelization after reading, and why I get only one partition even for a 1 GB file? I ran the program both locally and on a cluster and saw only one partition in both cases.
That is strange; parallelism should be available for bigger files no matter the options. The way parallelism works for files with variable-length records is that Cobrix first builds a sparse index of record offsets, and the index entries are then used to split the file across tasks.
Please list all the reader options again and I will try to figure out what might be causing this. Also, what are your cluster and spark-submit parameters? Are you running on YARN or some other setup? What is the number of executors?
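For reference, a sketch of the knobs worth checking, assuming the index-split options described in the Cobrix docs (`input_split_records` / `input_split_size_mb`; verify the names against your version):

```scala
val copybook =
  """       01  RECORD.
    |           05  SEGMENT   PIC X(32000).
    |""".stripMargin

val df = spark.read
  .format("cobol")
  .option("copybook_contents", copybook)
  .option("record_format", "V")
  // Assumed option: records per sparse-index split; smaller values yield
  // more splits, and therefore more partitions, for the same file.
  .option("input_split_records", "50000")
  .load("/path/to/ebcdic/file")

// Check how many partitions the source actually produced.
println(s"partitions = ${df.rdd.getNumPartitions}")

// If it is still 1, repartition explicitly before the expensive per-row map
// so the parsing work is spread across executors.
val parsed = df.repartition(64).rdd.map(row => row.toString) // replace with your own record parser
```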
I am trying to parse an EBCDIC file for which I do not have a copybook. I do know whether it has an RDW and/or a BDW. It is an old legacy-format file, and we have written our own program that knows how to parse an individual record.
Is there a way I can use the Cobrix library only to parse an individual record's EBCDIC bytes? Once I have those bytes in an RDD, I can write my own 'map' function to parse the individual segments. I have used the Cobrix library to get an individual record, with the following setup.
I have defined my copybook in a simple structure like the one below.
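The original snippet did not survive in this thread; a hypothetical reconstruction that matches the description (a single SEGMENT field and RDW-based variable-length records) might look like this:

```scala
// Hypothetical reconstruction: one catch-all field covering the whole record.
val copybook =
  """       01  RECORD.
    |           05  SEGMENT   PIC X(32000).
    |""".stripMargin

val df = spark.read
  .format("cobol")
  .option("copybook_contents", copybook)
  .option("record_format", "V")   // variable-length records preceded by an RDW
  .load("/path/to/ebcdic/file")
```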
I am able to parse the records based on the RDW value correctly. I get a Row object with only one element in it (as specified in my copybook, with the name SEGMENT). This SEGMENT comes back as a string. I convert this string to the 'ibm500' character set (back to EBCDIC) and parse it with the parsing program that we have written. Our program can parse the record based on byte positions; however, we are not able to parse the packed decimals properly. It seems that the conversion from string to/from EBCDIC bytes loses the positions and the format. Is there a way for us to get the original raw bytes, exactly as they appear in the file, as part of the Dataset we get out? In short, can the SEGMENT field in my example represent the actual raw bytes of the entire record in the file?
Question
Is the above an acceptable way to use this library? I like Cobrix's parallel processing when reading large EBCDIC files; all I need from this library is to parse the RDW/BDW value and return the entire record as raw bytes that I can feed to my own parsing logic, since I do not have a proper copybook for the byte segment.