
Fixed length Cobol EBCDIC file with header, body and trailer records - How to skip header and trailer #556

Closed
ManojKolisetty-git opened this issue Dec 30, 2022 · 8 comments
Labels: accepted (Accepted for implementation), bug (Something isn't working)

Comments

@ManojKolisetty-git

Background [Optional]

We have a fixed-length EBCDIC file whose structure starts with a header record, followed by body records, and ends with a trailer record. We use the Cobrix library (za.co.absa.cobrix:spark-cobol_2.12:2.6.0) in PySpark to read the file and create a dataframe.

Currently, to process the header, body and trailer we create 3 different dataframes (each with its own copybook).
This means we end up reading the file multiple times, and when fetching the body records we have to filter out the first and last records (header and trailer) using a Record_id sequence (sorting the dataframe to find the first and last ids). This approach takes a long time because the data has to be shuffled before the filter, as sketched below.
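
A minimal sketch of the pattern being described, assuming an index generated with monotonically_increasing_id (as mentioned later in the thread); spark, body_copybook and path are placeholders:

from pyspark.sql import functions as F

df = (spark.read.format("cobol")
      .option("copybook_contents", body_copybook)
      .load(path))

# Generate a per-record index, then find the first and last ids.
indexed = df.withColumn("Record_id", F.monotonically_increasing_id())
first_id = indexed.agg(F.min("Record_id")).first()[0]
last_id = indexed.agg(F.max("Record_id")).first()[0]

# The aggregations and this filter re-scan the data - the slow part.
body = indexed.filter((F.col("Record_id") > first_id) &
                      (F.col("Record_id") < last_id))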

Question

While trying to optimize, I found the file read options below, but observed that file_end_offset skips the last record from each partition instead of a single record from the file. For instance, my file is processed in 100 partitions, so 100 records (the last one from each partition) are removed.
.option("file_start_offset", 11000)
.option("file_end_offset", 11000)

Kindly suggest a way to efficiently skip the header and trailer, or to fetch the header/trailer alone.

Many Thanks
Manoj

ManojKolisetty-git added the question (Further information is requested) label Dec 30, 2022
@yruslan
Collaborator

yruslan commented Jan 2, 2023

Hi, the way we usually work with headers and footers is to create a redefine for them, e.g.:

01 RECORD.
  05 HEADER.
    ...
  05 PAYLOAD REDEFINES HEADER.
    ...
  05 FOOTER REDEFINES HEADER.

The first record of the dataframe is then treated as the header, the last one as the footer, and everything else as the payload.
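
A minimal sketch of that single-read split, assuming the copybook above, schema_retention_policy set to collapse_root (so HEADER/PAYLOAD/FOOTER become top-level struct columns), and Cobrix's generate_record_id option (which adds File_Id/Record_Id columns); redefined_copybook and path are placeholders:

from pyspark.sql import functions as F

df = (spark.read.format("cobol")
      .option("copybook_contents", redefined_copybook)
      .option("schema_retention_policy", "collapse_root")
      .option("generate_record_id", "true")
      .load(path))

# Record_Id grows with the record's position in the file, so min/max
# identify the header and footer without a full sort.
bounds = df.agg(F.min("Record_Id").alias("lo"),
                F.max("Record_Id").alias("hi")).first()

header  = df.filter(F.col("Record_Id") == bounds["lo"])
footer  = df.filter(F.col("Record_Id") == bounds["hi"])
payload = df.filter((F.col("Record_Id") > bounds["lo"]) &
                    (F.col("Record_Id") < bounds["hi"]))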

However, .option("file_end_offset", 11000) should work as well. And from what you described it looks like a bug. Will check and confirm.

@ManojKolisetty-git
Author

ManojKolisetty-git commented Jan 2, 2023

Thanks for the reply.
We have standard fixed-length EBCDIC files coming from the source in the format below. We need to read the header, footer and detail records and apply the necessary transformations/validations before loading into the target format. As you can see, we have to read the file 3 times, each with the corresponding copybook, to load the three dataframes (so that we can perform the necessary validations downstream).

Skipping the header record is easy, as it is the first record (we use the monotonically_increasing_id function to generate an index for each record).
Skipping the footer record, however, involves shuffles/repartitioning, which is time consuming: we get the last record's index and apply a filter.

I see that the file_start_offset option works at the file level, i.e. before parallelization, but file_end_offset works on the dataframe (i.e. after parallelization/partitioning).
I thought that if file_end_offset worked like file_start_offset, I could cache the result and save processing time.

=========== Copybook structures for header, record and trailer ===============

01 HEADER.
   10 ...  (header attributes)
01 RECORD.
   10 ...  (record attributes)
01 TRAILER.
   10 ...  (trailer attributes)

File layout:

Header (used to identify the file type and file date)
Record
...
Footer (contains the record count)

yruslan added the bug (Something isn't working) and accepted (Accepted for implementation) labels and removed the question (Further information is requested) label Jan 3, 2023
@yruslan
Collaborator

yruslan commented Jan 3, 2023

Note the REDEFINES in the above solution. It could help you avoid reading the file multiple times.

The bug with file_end_offset is confirmed.

yruslan added a commit that referenced this issue Jan 3, 2023
yruslan added a commit that referenced this issue Jan 3, 2023
@ManojKolisetty-git
Author

Thanks for checking the issue.
Regarding REDEFINES - the copybooks we work with don't have any redefines. They are all individual copybooks (one each for the header, record and trailer). I am not sure how we can read the entire file in a single pass with 3 different copybook contents.

@yruslan
Collaborator

yruslan commented Jan 3, 2023

The file_end_offset bug is fixed in 2.6.2. Please let us know if there are any issues.
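
For reference, picking up the fix just means bumping the artifact version mentioned at the top of the thread; a sketch for a PySpark session (the builder configuration is an assumption - adapt it to however your session is created):

from pyspark.sql import SparkSession

# Pull in the patched spark-cobol artifact (2.6.2 carries the fix).
spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "za.co.absa.cobrix:spark-cobol_2.12:2.6.2")
         .getOrCreate())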

@yruslan
Collaborator

yruslan commented Jan 3, 2023

Sorry, I might not have added enough context. I suggest using REDEFINES to solve your use case, since you need to read the header, the footer and all the records in one pass.

You can define your copybook to include the header, the footer and the record group. These groups should redefine each other. Then you can use Cobrix to read the file, and use the corresponding redefine for the first and the last record.

Take another look at the proposed copybook structure (I renamed a couple of fields for clarity):

01 ROOT.
  05 HEADER.
    ...
  05 RECORD REDEFINES HEADER.
    ...
  05 FOOTER REDEFINES HEADER.
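
Continuing the earlier sketch: with this copybook, each redefine group comes back as a struct column, so the header/payload/footer slices can each be flattened to just their own fields (group names follow the copybook above; collapse_root is assumed as before):

header_df = header.select("HEADER.*")
record_df = payload.select("RECORD.*")
footer_df = footer.select("FOOTER.*")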

@ManojKolisetty-git
Author

ManojKolisetty-git commented Jan 3, 2023

Verified the file_end_offset fix - it is working as expected, i.e. ignoring the last record (in my case). Many thanks for actioning it quickly.

One more issue I observed regarding file_start_offset and file_end_offset - it appears the offsets only accept INT; if I give any value greater than the max integer range, it errors out. Could you please change these 2 options to accept BIGINT (i.e. LongType) instead of INT?

Regarding the redefines copybook - as suggested above, I have merged the 3 copybooks under 01 ROOT. ... Now I am getting the below error: "za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 1: Invalid input '1' at position 1:6". Are there any other options I need to include?

recordsRead = (spark.read
    .format("cobol")
    .option("schema_retention_policy", "collapse_root")
    .option("ebcdic_code_page", "cp037")
    .option("string_trimming_policy", "right")
    .option("copybook_contents", merged_copybook_contents)
    .load(filename))

merged_copybook_contents - the 3 copybooks merged into this variable as per the structure below:
01 ROOT.
05 HEADER.
...
05 RECORD REDEFINES HEADER.
...
05 FOOTER REDEFINES HEADER.

@yruslan
Collaborator

yruslan commented Jan 3, 2023

The error is a syntax error. Remember that, by default, the first 6 characters of every copybook line are ignored (the classic COBOL source column convention).
See the truncate_comments option in the README for details.
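
A minimal sketch of both workarounds, assuming the merged copybook is held in a Python string (raw_copybook is a placeholder): either pad every line so the COBOL starts at column 7, or keep the lines as-is and disable column truncation via the truncate_comments option mentioned above.

# Option 1: pad each line with 6 spaces so the ignored columns 1-6 are blank.
merged_copybook_contents = "\n".join(
    "      " + line for line in raw_copybook.splitlines())

# Option 2: tell Cobrix not to truncate the leading columns (see the README):
# .option("truncate_comments", "false")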
