
Fixed length Cobol EBCDIC file with header, body and trailer records - How to skip header and trailer #556

Closed
ManojKolisetty-git opened this issue Dec 30, 2022 · 8 comments
Labels: accepted (Accepted for implementation), bug (Something isn't working)

Comments

@ManojKolisetty-git

Background [Optional]

We have a fixed-length EBCDIC file whose structure starts with a header record, followed by body records, and ends with a trailer record. We use the Cobrix library (za.co.absa.cobrix:spark-cobol_2.12:2.6.0) in PySpark to read the file and create a dataframe.

Currently, to process the header, body and trailer we create 3 different dataframes (each with its own copybook).
This means we end up reading the file multiple times, and when fetching the body records we have to filter out the first and last records (header and trailer) using a Record_id sequence (sorting the dataframe to find the first and last ids). This approach takes a long time because the data has to be shuffled before the filter, as sketched below.
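
A minimal sketch of the pattern being described, assuming an index generated with monotonically_increasing_id (as mentioned later in the thread); spark, body_copybook and path are placeholders:

from pyspark.sql import functions as F

df = (spark.read.format("cobol")
      .option("copybook_contents", body_copybook)
      .load(path))

# Generate a per-record index, then find the first and last ids.
indexed = df.withColumn("Record_id", F.monotonically_increasing_id())
first_id = indexed.agg(F.min("Record_id")).first()[0]
last_id = indexed.agg(F.max("Record_id")).first()[0]

# The aggregations and this filter re-scan the data - the slow part.
body = indexed.filter((F.col("Record_id") > first_id) &
                      (F.col("Record_id") < last_id))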

Question

While trying to optimize, I found the file read options below, but observed that file_end_offset skips the last record from each partition instead of a single record from the file. For instance, my file is processed in 100 partitions, so 100 records (the last one from each partition) are removed.
.option("file_start_offset", 11000)
.option("file_end_offset", 11000)

Kindly suggest a way to efficiently skip the header and trailer, or to fetch the header/trailer alone.

Many Thanks
Manoj

ManojKolisetty-git added the question (Further information is requested) label Dec 30, 2022
@yruslan
Collaborator

yruslan commented Jan 2, 2023

Hi, the way we usually work with headers and footers is to create a redefine for them, e.g.:

01 RECORD.
  05 HEADER.
    ...
  05 PAYLOAD REDEFINES HEADER.
    ...
  05 FOOTER REDEFINES HEADER.

The first record of the dataframe is then treated as the header, the last one as the footer, and everything else as the payload.
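
A minimal sketch of that single-read split, assuming the copybook above, schema_retention_policy set to collapse_root (so HEADER/PAYLOAD/FOOTER become top-level struct columns), and Cobrix's generate_record_id option (which adds File_Id/Record_Id columns); redefined_copybook and path are placeholders:

from pyspark.sql import functions as F

df = (spark.read.format("cobol")
      .option("copybook_contents", redefined_copybook)
      .option("schema_retention_policy", "collapse_root")
      .option("generate_record_id", "true")
      .load(path))

# Record_Id grows with the record's position in the file, so min/max
# identify the header and footer without a full sort.
bounds = df.agg(F.min("Record_Id").alias("lo"),
                F.max("Record_Id").alias("hi")).first()

header  = df.filter(F.col("Record_Id") == bounds["lo"])
footer  = df.filter(F.col("Record_Id") == bounds["hi"])
payload = df.filter((F.col("Record_Id") > bounds["lo"]) &
                    (F.col("Record_Id") < bounds["hi"]))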

However, .option("file_end_offset", 11000) should work as well. And from what you described it looks like a bug. Will check and confirm.

@ManojKolisetty-git
Author

ManojKolisetty-git commented Jan 2, 2023

Thanks for the reply.
We have standard fixed-length EBCDIC files coming from the source in the format below. We need to read the header, footer and detail records and apply the necessary transformations/validations before loading into the target format. As you can see, we have to read the file 3 times, each with the corresponding copybook, to load the three dataframes (so that we can perform the necessary validations downstream).

Skipping the header record is easy, as it is the first record (we use the monotonically_increasing_id function to generate an index for each record).
Skipping the footer record, however, involves shuffles/repartitioning, which is time consuming: we get the last record's index and apply a filter.

I see that the file_start_offset option works at the file level, i.e. before parallelization, but file_end_offset works on the dataframe (i.e. after parallelization/partitioning).
I thought that if file_end_offset worked like file_start_offset, I could cache the result and save processing time.

=========== Copybook structures for header, record and trailer ===============

01 HEADER.
   10 ...  (header attributes)
01 RECORD.
   10 ...  (record attributes)
01 TRAILER.
   10 ...  (trailer attributes)

File layout:

Header (used to identify the file type and file date)
Record
...
Footer (contains the record count)

yruslan added the bug (Something isn't working) and accepted (Accepted for implementation) labels and removed the question (Further information is requested) label Jan 3, 2023
@yruslan
Collaborator

yruslan commented Jan 3, 2023

Note the REDEFINES in the above solution. It could help you avoid reading the file multiple times.

The bug with file_end_offset is confirmed.

yruslan added a commit that referenced this issue Jan 3, 2023
yruslan added a commit that referenced this issue Jan 3, 2023
@ManojKolisetty-git
Author

Thanks for checking the issue.
Regarding REDEFINES - the copybooks we work with don't have any redefines. They are all individual copybooks (one each for the header, record and trailer). I am not sure how we can read the entire file in a single pass with 3 different copybook contents.

@yruslan
Collaborator

yruslan commented Jan 3, 2023

The file_end_offset bug is fixed in 2.6.2. Please let us know if there are any issues.
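
For reference, picking up the fix just means bumping the artifact version mentioned at the top of the thread; a sketch for a PySpark session (the builder configuration is an assumption - adapt it to however your session is created):

from pyspark.sql import SparkSession

# Pull in the patched spark-cobol artifact (2.6.2 carries the fix).
spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "za.co.absa.cobrix:spark-cobol_2.12:2.6.2")
         .getOrCreate())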

@yruslan
Collaborator

yruslan commented Jan 3, 2023

Sorry, I might not have added enough context. I suggest using REDEFINES to solve your use case, since you need to read the header, the footer and all the records in one pass.

You can define your copybook to include the header, the footer and the record group. These groups should redefine each other. Then you can use Cobrix to read the file, and use the corresponding redefine for the first and the last record.

Take another look at the proposed copybook structure (I renamed a couple of fields for clarity):

01 ROOT.
  05 HEADER.
    ...
  05 RECORD REDEFINES HEADER.
    ...
  05 FOOTER REDEFINES HEADER.
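
Continuing the earlier sketch: with this copybook, each redefine group comes back as a struct column, so the header/payload/footer slices can each be flattened to just their own fields (group names follow the copybook above; collapse_root is assumed as before):

header_df = header.select("HEADER.*")
record_df = payload.select("RECORD.*")
footer_df = footer.select("FOOTER.*")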

@ManojKolisetty-git
Author

ManojKolisetty-git commented Jan 3, 2023

Verified the file_end_offset fix - it is working as expected, i.e. ignoring the last record (in my case). Many thanks for actioning it quickly.

One more issue I observed regarding file_start_offset and file_end_offset - it appears the offsets only accept INT; if I give any value greater than the max integer range, it errors out. Could you please change these 2 options to accept BIGINT (i.e. LongType) instead of INT?

Regarding the redefines copybook - as suggested above, I have merged the 3 copybooks under 01 ROOT. ... Now I am getting the below error: "za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 1: Invalid input '1' at position 1:6". Are there any other options I need to include?

recordsRead = (spark.read
    .format("cobol")
    .option("schema_retention_policy", "collapse_root")
    .option("ebcdic_code_page", "cp037")
    .option("string_trimming_policy", "right")
    .option("copybook_contents", merged_copybook_contents)
    .load(filename))

merged_copybook_contents - the 3 copybooks merged into this variable as per the structure below:
01 ROOT.
05 HEADER.
...
05 RECORD REDEFINES HEADER.
...
05 FOOTER REDEFINES HEADER.

@yruslan
Collaborator

yruslan commented Jan 3, 2023

The error is a syntax error. Remember that, by default, the first 6 characters of every copybook line are ignored (the classic COBOL source column convention).
See the truncate_comments option in the README for details.
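
A minimal sketch of both workarounds, assuming the merged copybook is held in a Python string (raw_copybook is a placeholder): either pad every line so the COBOL starts at column 7, or keep the lines as-is and disable column truncation via the truncate_comments option mentioned above.

# Option 1: pad each line with 6 spaces so the ignored columns 1-6 are blank.
merged_copybook_contents = "\n".join(
    "      " + line for line in raw_copybook.splitlines())

# Option 2: tell Cobrix not to truncate the leading columns (see the README):
# .option("truncate_comments", "false")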
