Unable to parse the Mainframe copybook which has a COBOL datatype of BBBB which means empty spaces #734

Open
suryagits opened this issue Dec 24, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@suryagits

Describe the bug

We are using CoBrix with PySpark and executing it on AWS EMR.
We have the EBCDIC file and its corresponding copybook in an AWS S3 bucket. While trying to parse the EBCDIC file using the copybook, we get an error.

Error message :
py4j.protocol.Py4JJavaError : An error occurred while calling o2021.load : za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException : Syntax error in the copybook at line 29 : Invalid input 'BBBB' at position 29:45

Code snippet that caused the issue

import logging

try:
    file_path = f's3://{s3_bucket}/{ebcdic_file_path}'
    # Parentheses let the method chain span several lines.
    # `ebcdic` is expected to hold the encoding name, e.g. "ebcdic".
    df = (
        spark.read
        .format("cobol")
        .option("copybook_contents", copybook)
        .option("encoding", ebcdic)
        .option("schema_retention_policy", "collapse_root")
        .option("generate_record_id", True)
        .load(file_path)
    )
except Exception as e:
    log_message = f'spark job failed with error : {e}'
    logging.error(log_message)
    raise

Expected behavior

We expected Cobrix to successfully parse the EBCDIC file using the copybook, which contains a field defined with 'BBBB'.

Context

PySpark Jar dependencies :

  • cobol-parser_2.12-2.6.7.jar
  • hadoop-lzo-0.4.3.jar
  • scodec-bits_2.12-1.1.12.jar
  • scodec-core_2.12-1.11.4.jar
  • spark-cobol_2.12-2.6.7.jar
  • Operating system: AWS EMR (Linux Image)

Copybook (if possible)

                    15 EL02-267-COLNAME-A
                      20 EL02-267-COLNAME-B
                                                       PIC X(19).
                      .........
                      .........
                      .........
                      20 EL02-267-COLNAME-C  REDEFINES
                                    EL02-267-COLNAME-D
                                                       PIC 9(06)BBBB. (This is what is causing the issue we suppose)
GP5WHB        20 FILLER                 pic X(285).                      CLEAN-UP

Attach a small data file that can help reproduce the issue, if possible : Need to check the feasibility due to confidentiality of the data. Will get back.

@suryagits suryagits added the bug Something isn't working label Dec 24, 2024
@yruslan
Collaborator

yruslan commented Dec 27, 2024

Hi,

Yes, 'BBBB' is something Cobrix does not support mainly because we are not sure at the moment how to properly handle it.
This might be a relevant issue: #505

Does it work if you remove 'BBBB'? Does it produce the expected output in this case?

@suryagits
Author

Hi @yruslan ,

Thank you so much for your response!

As advised, I will try removing the 'BBBB' from my copybook file, rerun the Cobrix program, and get back to you ASAP.

Thank you

@suryagits
Author

suryagits commented Dec 31, 2024

Hi @yruslan ,

One question: could you advise on a possible replacement for 'BBBB'? That is, is there another COBOL datatype definition that is analogous to 'BBBB' and also works with Cobrix?

Please note, I have yet to try your suggestion of removing the 'BBBB'. Sorry for the delay, I will get back to you on that ASAP!

Thank you

@yruslan
Collaborator

yruslan commented Dec 31, 2024

Hi @suryagits,

Since 'B' means just inserting spaces in the data representation of the number, and because Cobrix converts numbers to Spark native binary formats, 'B' should not need a replacement. We may eventually implement it so Cobrix ignores all 'B' in numbers. We haven't done it yet since we haven't encountered such PICs in our organization so we can't confirm that ignoring 'B's would be an expected behavior.

Once you confirm that removing 'B's from PICs produces correct output in numeric fields, we are going to implement support for 'B's natively.
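The suggestion above (drop the 'B's before handing the copybook to Cobrix) can be sketched as a small text pre-processing step. This is only an illustration: `strip_b_insertions` is a hypothetical helper, not part of Cobrix, and it handles only simple numeric pictures such as `9(06)BBBB`. Whether the shortened field keeps the following fields aligned depends on the record layout, so the result should be checked against real data.

```python
import re

# Hypothetical helper (not part of Cobrix): drops runs of 'B' insertion
# characters that directly follow a numeric picture such as 9(06) or 999.
_B_AFTER_DIGITS = re.compile(r"(\bPIC\s+(?:9(?:\(\d+\))?)+)B+", re.IGNORECASE)


def strip_b_insertions(copybook: str) -> str:
    """Return the copybook text with trailing 'B's removed from numeric PICs."""
    return _B_AFTER_DIGITS.sub(r"\1", copybook)


sample = "20 COL-B  PIC 9(06)BBBB."
print(strip_b_insertions(sample))  # 20 COL-B  PIC 9(06).
```

The cleaned string could then be passed as the `copybook_contents` option in place of the original copybook text.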

@suryagits
Author

Hi @yruslan ,
First, apologies for the delay in my response; our team faced multiple operational challenges in replicating the file. Please find below my observations on the 'BBBB':

Given a copybook as below :

20 col_a
PIC X(10).
20 col_b
PIC 9(06)BBBB.

  1. First, what we understood from our team is that B stands for a blank space, hence BBBB = 4 blank spaces.

  2. When we removed the 'BBBB' from the copybook, Cobrix was able to parse our EBCDIC file and generate the output.

  3. Here, col_a, the parent column, is 10 bytes, and col_b, the child, is 6 bytes + 4 blank spaces. So if the col_b value is 123456 + 4 spaces, then after removing BBBB the output contains only 123456, which means the 4 spaces were trimmed.

  4. To compensate, we manually appended 4 spaces to the column in our Spark code after generating the output.

  5. Also, since Cobrix does not support the B's, we are trying to change the col_b definition to PIC 9(10) or X(10) after removing the 4 B's, so that it fully occupies the 10 bytes.
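Observations 4 and 5 can be sketched in plain Python. Both helpers are hypothetical names introduced for illustration, and they assume that each 'B' occupies exactly one byte of the field: `widen_to_alphanumeric` rewrites `PIC 9(n)B...B` as a `PIC X(...)` of the same total width, and `restore_blanks` is a plain-Python stand-in for the step that appends the spaces back after parsing.

```python
import re

# Matches a simple numeric picture followed by 'B's, e.g. "PIC 9(06)BBBB".
# Only the 9(n)B... shape from this issue is handled; this is an assumption,
# not a general COBOL picture parser.
_PIC_9N_B = re.compile(r"\bPIC\s+9\((\d+)\)(B+)", re.IGNORECASE)


def widen_to_alphanumeric(copybook: str) -> str:
    """Rewrite PIC 9(n)B...B as PIC X(n + number of Bs), keeping the byte width."""
    def repl(m):
        width = int(m.group(1)) + len(m.group(2))
        return f"PIC X({width})"
    return _PIC_9N_B.sub(repl, copybook)


def restore_blanks(value: str, blanks: int = 4) -> str:
    """Append the blanks that were dropped when the 'B's were removed."""
    return value + " " * blanks


print(widen_to_alphanumeric("20 col_b PIC 9(06)BBBB."))  # 20 col_b PIC X(10).
print(repr(restore_blanks("123456")))                    # '123456    '
```

In a Spark job, the padding step would more likely be expressed with column functions such as `pyspark.sql.functions.rpad` or `concat`; the pure-Python version is used here only to keep the sketch self-contained.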

So to answer your question: removing BBBB did produce usable output, and we are able to proceed. Thank you so much!

Based on the above observations, could you please let us know whether support for the B's will be added in a future Cobrix release, so that we can align with it?

Thank you

@yruslan
Collaborator

yruslan commented Jan 31, 2025

Hi @suryagits, thanks for the detailed description! It is very helpful.

Yes, I think support for 'B's can be added to Cobrix eventually. Let's keep this issue open.

Just a couple more questions to understand how Cobrix should interpret the B's.

  • When you converted PIC 9(06)BBBB to PIC X(10), what happened? Did the field end up with the 4 required spaces, or did the record size mismatch?
  • I guess, if the B's mean spaces which are not part of the data itself, PIC X(6) might work better in this case. But yes, special support in Cobrix is needed in order to insert spaces.
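The record-size question can be made concrete by comparing the byte widths of the candidate pictures. The sketch below assumes DISPLAY usage with a single-byte code page and assumes each 'B' counts as one byte of the item; `display_width` is a hypothetical helper that understands only the symbols 9, X, A and B.

```python
import re

# One picture token: a symbol optionally followed by a repeat count,
# e.g. 9(06), X, B.
_TOKEN = re.compile(r"([9XAB])(?:\((\d+)\))?", re.IGNORECASE)


def display_width(picture: str) -> int:
    """Byte width of a simple DISPLAY picture made of 9, X, A and B symbols."""
    pos, width = 0, 0
    while pos < len(picture):
        m = _TOKEN.match(picture, pos)
        if m is None:
            raise ValueError(f"unsupported picture symbol at {pos}: {picture!r}")
        # A symbol without a repeat count stands for a single position.
        width += int(m.group(2)) if m.group(2) else 1
        pos = m.end()
    return width


for pic in ("9(06)BBBB", "9(06)", "X(10)", "X(6)"):
    print(pic, display_width(pic))
```

Under these assumptions, `9(06)BBBB` and `X(10)` both span 10 bytes, while `9(06)` and `X(6)` span 6, so only the 10-byte variants would leave the offsets of the following fields unchanged.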
