Fixed length Cobol EBCDIC file with header, body and trailer records - How to skip header and trailer #556
Hi, the way we usually work with headers and footers is to create a REDEFINES group for them.
The first record of the dataframe is then treated as the header, the last one as the footer, and everything else as the payload.
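A minimal sketch of what such a copybook could look like. The field names, PIC clauses, and the 100-byte record length are all hypothetical, not the asker's actual layout; the point is that the header, body, and trailer groups redefine the same storage, each padded with FILLER to the same length, with content starting at column 7:

```cobol
       01  ROOT.
           05  HEADER-RECORD.
               10  HDR-ID        PIC X(4).
               10  HDR-DATE      PIC X(8).
               10  HDR-FILLER    PIC X(88).
           05  BODY-RECORD REDEFINES HEADER-RECORD.
               10  ACCOUNT-ID    PIC X(10).
               10  AMOUNT        PIC S9(7)V99 COMP-3.
               10  BODY-FILLER   PIC X(85).
           05  TRAILER-RECORD REDEFINES HEADER-RECORD.
               10  TRL-ID        PIC X(4).
               10  RECORD-COUNT  PIC 9(9).
               10  TRL-FILLER    PIC X(87).
```

With a structure like this the whole file is read once; the header and trailer fields are simply null/garbage for body rows and vice versa, and you pick the right group per row.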
Thanks for the reply. Skipping the header record is easy to filter out, since it is the first record (we use the monotonically_increasing_id function to generate an index for each record). I see that the file_start_offset option works at file level, i.e. before the data is parallelized, but file_end_offset works on the dataframe, i.e. after it is parallelized/partitioned.

Copybook structures for the header, record, and trailer:
Note: the bug with file_end_offset is confirmed.
Thanks for checking the issue.
The file_end_offset bug is fixed in 2.6.2. Please let us know if there are any issues.
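To illustrate what file-level offsets are expected to do (and what the bug broke): for a fixed-length file, file_start_offset and file_end_offset should trim bytes once per file, not once per partition. A minimal sketch in plain Python, with no Spark involved; the record length and record contents are made up:

```python
# Sketch of file-level offset trimming for a fixed-length record file.
RECORD_LEN = 10  # hypothetical fixed record length in bytes

def strip_header_trailer(data: bytes, start_offset: int, end_offset: int) -> bytes:
    """Drop start_offset bytes from the front of the FILE and end_offset from
    the back of the FILE -- once per file, never once per partition."""
    return data[start_offset: len(data) - end_offset]

# A toy file: 1 header + 3 body records + 1 trailer, each RECORD_LEN bytes.
records = [b"HDR" + b"." * 7,
           b"BODY-0001.",
           b"BODY-0002.",
           b"BODY-0003.",
           b"TRL" + b"." * 7]
blob = b"".join(records)

# Trim exactly one header record and one trailer record.
body = strip_header_trailer(blob, RECORD_LEN, RECORD_LEN)
print(len(body) // RECORD_LEN)  # 3 body records remain
```

The buggy behaviour was equivalent to applying the end trim inside every partition, so a file split into 100 partitions lost 100 records instead of 1.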
Sorry, I might not have added enough context. I suggest using REDEFINES to solve your use case when you need to read the header, the footer, and all the records in one read. You can define your copybook to include the header, the footer, and the record group, with these groups redefining each other. Then you can use Cobrix to read the file and apply the corresponding redefines to the first and the last record. Take another look at the proposed copybook structure (I renamed a couple of fields for clarity).
Verified the file_end_offset fix: it is working as expected, i.e. ignoring the last record (in my case). Many thanks for actioning it quickly.

One more issue I observed regarding file_start_offset and file_end_offset: these options appear to accept only INT, and if I give any value greater than the max integer range, it errors. Could you please change these two options to accept BIGINT (i.e. LongType) instead of INT?

Regarding the REDEFINES copybook: as suggested above, I have merged the 3 copybooks under 01 ROOT. Now I am getting the error below. Are there any other options I need to include?

za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 1: Invalid input '1' at position 1:6

recordsRead = spark.read, with merged_copybook_contents holding the 3 copybooks merged per the structure above.
The error is a syntax error. Remember that, by default, the first 6 characters of each line are ignored in copybooks.
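Since the first 6 characters of each copybook line are reserved for the sequence-number area, a copybook whose level numbers start at column 1 will be mangled when that area is trimmed, which would explain an error at position 1:6. A hypothetical helper (not part of Cobrix) that pads a merged copybook string so content begins at column 7:

```python
def pad_copybook(text: str) -> str:
    """Prefix every non-empty line with 6 spaces so copybook content
    begins at column 7, leaving columns 1-6 as the sequence area."""
    return "\n".join(
        ("      " + line) if line.strip() else line
        for line in text.splitlines()
    )

raw = "01 ROOT.\n   05 HEADER-RECORD PIC X(100)."
print(pad_copybook(raw))
```

If your merged_copybook_contents variable was built by concatenating copybooks that start at column 1, passing it through something like this before handing it to Cobrix should avoid the column-trimming problem.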
Background [Optional]
We have a fixed-length EBCDIC file whose structure is a header, body records, and a trailer. We use the Cobrix library (za.co.absa.cobrix:spark-cobol_2.12:2.6.0) in PySpark to read the file and create a dataframe.
Currently, to process the header, body, and trailer, we create 3 different dataframes (each with its own copybook).
By doing this we end up reading the file multiple times, and when fetching the body records we have to skip the first and last records (header, trailer) using the Record_id sequence (sorting the dataframe to get the first and last ids) to filter them out. This approach takes a long time because the data needs to be shuffled before the filter.
Question
When trying to optimize, I found the file read options below, but observed that file_end_offset skips the last record from each partition instead of one record from the whole file. For instance, my file is processed in 100 partitions, and the last record of each of the 100 partitions is removed.
.option("file_start_offset", 11000)
.option("file_end_offset", 11000)
Kindly suggest an approach to efficiently skip the header and trailer, or to fetch the header/trailer alone.
Many Thanks
Manoj