
variable length record optimization #521

Closed
sree018 opened this issue Oct 3, 2022 · 31 comments
Assignees
Labels
accepted (Accepted for implementation) · bug (Something isn't working)

Comments

@sree018

sree018 commented Oct 3, 2022

Background [Optional]

We have a 76 GB variable-length (BDW+RDW) multi-segment file (47 segments). The file contains 470,000,000 records with 700 columns. I am trying to convert it to a Parquet file, but Cobrix creates a single index element (a single partition). The file is parsed correctly, and the data looks right with df.show().

Question

How do I parallelize the job across executors?

df.write uses a single thread while writing the Parquet file.

Options used:

  1. input_split_records
  2. input_split_size_mb
@sree018 sree018 added the question Further information is requested label Oct 3, 2022
@yruslan
Collaborator

yruslan commented Oct 4, 2022

Hi, by default, Cobrix should be able to parallelize its work, so we'd need some additional context to understand why it is not happening in your case. Please, add the following to the issue description:

  • The version of Cobrix used
  • The full code snippet that you used to read the file
  • The command line you used to run the application

@sree018
Author

sree018 commented Oct 4, 2022

@yruslan thanks for response

The version of Cobrix used: 2.5.1, Scala 2.11, Spark 2.4.0

val cobrixOptions: Map[String, String] = Map(
  "record_format"       -> "VB",
  "bdw_adjustment"      -> "-4",
  "rdw_adjustment"      -> "-4",
  "is_rdw_big_endian"   -> "true",
  "is_bdw_big_endian"   -> "true",
  "input_split_size_mb" -> "30"
)

val df = spark.read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("copybook", copyBookPath)
  .options(cobrixOptions)
  .load(path)

val df1 = SparkUtils.flattenSchema(df, true)

df1.write.mode("overwrite").parquet(outboundpath)

(two screenshots attached showing the Spark logs and job details)

@yruslan
Collaborator

yruslan commented Oct 4, 2022

Interesting... the index consists of a single item only. This usually happens when the input file is not decoded correctly.

How many records does 'df2.count' return?

@sree018
Author

sree018 commented Oct 4, 2022

450,786,280

@yruslan
Collaborator

yruslan commented Oct 4, 2022

Looks good, thanks! Investigating

@yruslan
Collaborator

yruslan commented Oct 4, 2022

The issue is confirmed. We will fix it in the next release.

@yruslan yruslan self-assigned this Oct 4, 2022
@yruslan yruslan added bug Something isn't working accepted Accepted for implementation and removed question Further information is requested labels Oct 4, 2022
@yruslan
Collaborator

yruslan commented Oct 10, 2022

This issue should be fixed in the latest 'master' branch.
You can try it out by cloning master and building from source, or you can wait for the release of Cobrix 2.6.0, which should be soon.

@sree018
Author

sree018 commented Oct 13, 2022

Hi @yruslan

Is there a tentative date for the 2.6.0 release?

@yruslan
Collaborator

yruslan commented Oct 14, 2022

Actually, releasing 2.6.0 today. It should be in Maven Central shortly.

@sree018
Author

sree018 commented Oct 14, 2022

Hi @yruslan

I am still seeing the same issue:

IndexBuilder:214 - Index elements count:1, number of partitions =1

@yruslan
Collaborator

yruslan commented Oct 14, 2022

Please,

  • Make sure you are using Cobrix v2.6.0. This should be indicated by the Cobrix banner displayed in the logs or in the terminal (if you are using spark-shell)
  • Remove all index-related options ('input_split_size_mb' etc.)

and try again. If this doesn't work, we'll probably need to add some debug logging to investigate further.

@sree018
Author

sree018 commented Oct 14, 2022

Shall I use input_split_size_mb?

@yruslan
Collaborator

yruslan commented Oct 14, 2022

Shall I use input_split_size_mb?

No, not at this time. Once indexes are confirmed to be working, you can fine-tune with options like that. For now, let's keep all the defaults.

@sree018
Author

sree018 commented Oct 14, 2022

Hi @yruslan

just passed

val cobrixOptions: Map[String, String] = Map(
  "record_format"     -> "VB",
  "bdw_adjustment"    -> "-4",
  "rdw_adjustment"    -> "-4",
  "is_rdw_big_endian" -> "true",
  "is_bdw_big_endian" -> "true"
)

It is still creating only 1 index element.

Can you suggest the proper options for VB files?

@yruslan
Collaborator

yruslan commented Oct 14, 2022

These options seem alright.

I will try to reproduce it next week, but we'll probably need to add more debug info in order to understand why it didn't create more than 1 entry in the index.

@yruslan
Collaborator

yruslan commented Oct 19, 2022

Tried generating a big file with BDW+RDW (record_format = VB), and it generated more than 1 index entry.

Are you sure you are using the latest version?

Can you post the log line showing the spark-cobol version number?

For my env it looks like:

22/10/19 12:52:37 INFO DefaultSource: Cobrix 'spark-cobol' build 2.6.1-SNAPSHOT (2022-10-17T06:46:18Z) 

@sree018
Author

sree018 commented Oct 19, 2022

Hi @yruslan

Is it 2.6.0 or 2.6.1 ?

I am using 2.6.0

@yruslan
Collaborator

yruslan commented Oct 19, 2022

Correct, it should be 2.6.0 for you. Do you see the log line? Please post it here.

@sree018
Author

sree018 commented Oct 19, 2022

(screenshot attached showing the startup log lines)

@yruslan
Collaborator

yruslan commented Oct 19, 2022

The version is correct, but I'm also seeing 'segment_id_level8', 'segment_id_level17', etc.

Are the options you specified in the cobrixOptions map above complete, or are there other options that you are passing to Cobrix but haven't mentioned?

@sree018
Author

sree018 commented Oct 19, 2022

It is a VB multi-segment file.
Sorry for not mentioning it.

@yruslan
Collaborator

yruslan commented Oct 19, 2022

Please, list all the options. It might be important.

Or try loading the data file without specifying the multisegment options.

@sree018
Author

sree018 commented Oct 19, 2022

This file is VB and has 10 segments.

@yruslan
Collaborator

yruslan commented Oct 19, 2022

Dear @sree018, I'm currently unable to reproduce the issue. To succeed I need as much context as possible. With all due respect, could you please provide all the options that you are passing to Cobrix? Otherwise we are playing hide and seek, and, to be honest, I'm quite busy and may not have enough time to guess which options could have such an effect. Thank you.

@sree018
Author

sree018 commented Oct 19, 2022

(screenshot attached listing the full set of options)

Sorry for the miscommunication. Those are the options I used.

@yruslan
Collaborator

yruslan commented Oct 19, 2022

Thanks! Try removing all options that start with 'segment_id_level', and see if the number of index elements is still the same.
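If the options are kept in a Scala map, as in the snippets earlier in this thread, the segment_id_level entries can be stripped programmatically before passing the map to the reader. A minimal sketch; the segment values shown here are purely hypothetical:

```scala
// Options map in the style used earlier in this thread; the
// "segment_id_level*" values are hypothetical examples.
val cobrixOptions: Map[String, String] = Map(
  "record_format"     -> "VB",
  "is_rdw_big_endian" -> "true",
  "segment_id_level0" -> "SEG0",
  "segment_id_level1" -> "SEG1"
)

// Keep everything except the segment_id_level* options.
val withoutSegmentLevels: Map[String, String] =
  cobrixOptions.filterNot { case (key, _) => key.startsWith("segment_id_level") }
```

The trimmed map can then be passed to `.options(withoutSegmentLevels)` to check whether the index builder produces more than one element.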

@sree018
Author

sree018 commented Oct 26, 2022

Hi @yruslan

I removed all the segment_id_level options, and it is now able to create multiple partitions.

@yruslan
Collaborator

yruslan commented Oct 26, 2022

Did you achieve the expected result?

I'm asking because I've never encountered a situation with more than 2 levels of hierarchy in practice, that's why it was surprising for me to see 17 levels of hierarchy.

@sree018
Author

sree018 commented Oct 26, 2022

Yes. We need to add functionality that separates the data by segment level and stores it in a database.
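The splitting step can be sketched independently of Spark: group the flattened records by their segment identifier so each group can be written to its own table. A pure-Scala illustration (the Record type and segment codes are hypothetical; in the actual job this would be a filter or partitionBy on the segment id column of the DataFrame):

```scala
// Hypothetical stand-in for a flattened row; in the real job each row
// would come from the DataFrame produced by SparkUtils.flattenSchema.
case class Record(segmentId: String, payload: String)

val records = Seq(
  Record("A", "row1"),
  Record("B", "row2"),
  Record("A", "row3")
)

// Group rows by segment so each group can be stored separately.
val bySegment: Map[String, Seq[Record]] = records.groupBy(_.segmentId)
```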

@sree018
Author

sree018 commented Oct 26, 2022

For fixed-length or variable-length (RDW) files, is there an option to set a custom HDFS block size?

@yruslan
Collaborator

yruslan commented Oct 26, 2022

For fixed-length or variable-length (RDW) files, is there an option to set a custom HDFS block size?

Sorry, I don't understand the question. Do you want to save output files with a block size different from the default configured for the HDFS cluster? If yes, that is a Spark feature, not a Cobrix feature:
https://stackoverflow.com/a/40959126/1038282
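Following the linked answer, the block size is controlled through Spark's Hadoop configuration rather than through Cobrix. A sketch, assuming a SparkSession named spark and using 128 MB purely as an example value:

```scala
// Desired block size in bytes: 128 MB, chosen here only as an example.
val blockSizeBytes: Long = 128L * 1024 * 1024

// With an active SparkSession this would be applied before writing, e.g.:
//   spark.sparkContext.hadoopConfiguration.set("dfs.blocksize", blockSizeBytes.toString)
//   spark.sparkContext.hadoopConfiguration.set("parquet.block.size", blockSizeBytes.toString)
```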

@yruslan yruslan closed this as completed Nov 16, 2022