
variable length record optimization #521

Closed
sree018 opened this issue Oct 3, 2022 · 31 comments
Assignees
Labels
accepted (Accepted for implementation) · bug (Something isn't working)

Comments

@sree018

sree018 commented Oct 3, 2022

Background [Optional]

We have a 76 GB variable-length (BDW+RDW) multi-segment file (47 segments). The file contains 470,000,000 records with 700 columns. I am trying to convert it to a Parquet file, but Cobrix creates a single index element (a single partition). The file is parsed correctly, and the data looks right with df.show().

Question

How do I parallelize the job across executors?

df.write uses a single thread while writing the Parquet file.

Options used:

  1. input_split_records
  2. input_split_size_mb
@sree018 sree018 added the question Further information is requested label Oct 3, 2022
@yruslan
Collaborator

yruslan commented Oct 4, 2022

Hi, by default, Cobrix should be able to parallelize its work, so we'd need some additional context to understand why it is not happening in your case. Please, add the following to the issue description:

  • The version of Cobrix used
  • The full code snippet that you used to read the file
  • The command line you used to run the application

@sree018
Author

sree018 commented Oct 4, 2022

@yruslan thanks for response

The version of Cobrix used: 2.5.1, Scala 2.11, Spark 2.4.0

val cobrixOptions: Map[String, String] = Map(
  "record_format"       -> "VB",
  "bdw_adjustment"      -> "-4",
  "rdw_adjustment"      -> "-4",
  "is_rdw_big_endian"   -> "true",
  "is_bdw_big_endian"   -> "true",
  "input_split_size_mb" -> "30"
)

val df = spark.read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("copybook", copyBookPath)
  .options(cobrixOptions)
  .load(path)

val df1 = SparkUtils.flattenSchema(df, true)

df1.write.mode("overwrite").parquet(outboundpath)

(two screenshots attached showing the Spark logs and job details)

@yruslan
Collaborator

yruslan commented Oct 4, 2022

Interesting... the index consists of a single item only. This usually happens when the input file is not decoded correctly.

How many records does 'df2.count' return?

@sree018
Author

sree018 commented Oct 4, 2022

450,786,280

@yruslan
Collaborator

yruslan commented Oct 4, 2022

Looks good, thanks! Investigating

@yruslan
Collaborator

yruslan commented Oct 4, 2022

The issue is confirmed. We will fix it in the next release.

@yruslan yruslan self-assigned this Oct 4, 2022
@yruslan yruslan added bug Something isn't working accepted Accepted for implementation and removed question Further information is requested labels Oct 4, 2022
@yruslan
Collaborator

yruslan commented Oct 10, 2022

This issue should be fixed in the latest 'master' branch.
You can try it out by cloning master and building from source, or you can wait for the release of Cobrix 2.6.0, which should be soon.

@sree018
Author

sree018 commented Oct 13, 2022

Hi @yruslan

Is there a tentative date for the 2.6.0 release?

@yruslan
Collaborator

yruslan commented Oct 14, 2022

Actually, releasing 2.6.0 today. It should be in Maven Central shortly.

@sree018
Author

sree018 commented Oct 14, 2022

Hi @yruslan

I am still seeing the same issue:

IndexBuilder:214 - Index elements count:1, number of partitions =1

@yruslan
Collaborator

yruslan commented Oct 14, 2022

Please,

  • Make sure you are using Cobrix v2.6.0. This should be indicated by the Cobrix banner displayed in the logs or in the terminal (if you are using spark-shell)
  • Remove all index-related options ('input_split_size_mb' etc.)

and try again. If this doesn't work, we'll probably need to add some debug logging to investigate further.

@sree018
Author

sree018 commented Oct 14, 2022

Shall I use input_split_size_mb?

@yruslan
Collaborator

yruslan commented Oct 14, 2022

Shall I use input_split_size_mb?

No, not at this time. Once indexes are confirmed to be working, you can fine-tune with options like that. For now, let's keep all the defaults.

@sree018
Author

sree018 commented Oct 14, 2022

Hi @yruslan

just passed

val cobrixOptions: Map[String, String] = Map(
  "record_format"     -> "VB",
  "bdw_adjustment"    -> "-4",
  "rdw_adjustment"    -> "-4",
  "is_rdw_big_endian" -> "true",
  "is_bdw_big_endian" -> "true"
)

It is still creating only 1 index element.

Can you suggest the proper options for VB files?

@yruslan
Collaborator

yruslan commented Oct 14, 2022

These options seem alright.

I will try to reproduce it next week, but we'll probably need to add more debug info in order to understand why it didn't create more than 1 entry in the index.

@yruslan
Collaborator

yruslan commented Oct 19, 2022

Tried generating a big file with BDW+RDW (record_format = VB), and it generated more than 1 index entry.

Are you sure you are using the latest version?

Can you post the log line showing the spark-cobol version number?

For my env it looks like:

22/10/19 12:52:37 INFO DefaultSource: Cobrix 'spark-cobol' build 2.6.1-SNAPSHOT (2022-10-17T06:46:18Z) 

@sree018
Author

sree018 commented Oct 19, 2022

Hi @yruslan

Is it 2.6.0 or 2.6.1 ?

I am using 2.6.0

@yruslan
Collaborator

yruslan commented Oct 19, 2022

Correct, it should be 2.6.0 for you. Do you see the log line? Please post it here.

@sree018
Author

sree018 commented Oct 19, 2022

(screenshot attached showing the startup log lines)

@yruslan
Collaborator

yruslan commented Oct 19, 2022

The version is correct, but I'm also seeing 'segment_id_level8', 'segment_id_level17', etc.

Are the options you specified in the cobrixOptions map above complete, or are there other options that you are passing to Cobrix but haven't mentioned?

@sree018
Author

sree018 commented Oct 19, 2022

It is a VB multi-segment file.
Sorry for not mentioning it.

@yruslan
Collaborator

yruslan commented Oct 19, 2022

Please, list all the options. It might be important.

Or try loading the data file without specifying the multisegment options.

@sree018
Author

sree018 commented Oct 19, 2022

This file is VB and has 10 segments.

@yruslan
Collaborator

yruslan commented Oct 19, 2022

Dear @sree018, I'm currently unable to reproduce the issue. To succeed I need as much context as possible. With all due respect, could you please provide all the options that you are passing to Cobrix? Otherwise we are playing hide and seek, and, to be honest, I'm quite busy and may not have enough time to guess which options could have such an effect. Thank you.

@sree018
Author

sree018 commented Oct 19, 2022

(screenshot attached listing the full set of options)

Sorry for the miscommunication. Those are the options I used.

@yruslan
Collaborator

yruslan commented Oct 19, 2022

Thanks! Try removing all options that start with 'segment_id_level', and see if the number of index elements is still the same.
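If the options are kept in a Scala map, as in the snippets earlier in this thread, the segment_id_level entries can be stripped programmatically before passing the map to the reader. A minimal sketch; the segment values shown here are purely hypothetical:

```scala
// Options map in the style used earlier in this thread; the
// "segment_id_level*" values are hypothetical examples.
val cobrixOptions: Map[String, String] = Map(
  "record_format"     -> "VB",
  "is_rdw_big_endian" -> "true",
  "segment_id_level0" -> "SEG0",
  "segment_id_level1" -> "SEG1"
)

// Keep everything except the segment_id_level* options.
val withoutSegmentLevels: Map[String, String] =
  cobrixOptions.filterNot { case (key, _) => key.startsWith("segment_id_level") }
```

The trimmed map can then be passed to `.options(withoutSegmentLevels)` to check whether the index builder produces more than one element.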

@sree018
Author

sree018 commented Oct 26, 2022

Hi @yruslan

I removed all the segment_id_level options, and it is now able to create multiple partitions.

@yruslan
Collaborator

yruslan commented Oct 26, 2022

Did you achieve the expected result?

I'm asking because I've never encountered a situation with more than 2 levels of hierarchy in practice, that's why it was surprising for me to see 17 levels of hierarchy.

@sree018
Author

sree018 commented Oct 26, 2022

Yes. We need to add functionality that separates the data by segment level and stores it in a database.
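The splitting step can be sketched independently of Spark: group the flattened records by their segment identifier so each group can be written to its own table. A pure-Scala illustration (the Record type and segment codes are hypothetical; in the actual job this would be a filter or partitionBy on the segment id column of the DataFrame):

```scala
// Hypothetical stand-in for a flattened row; in the real job each row
// would come from the DataFrame produced by SparkUtils.flattenSchema.
case class Record(segmentId: String, payload: String)

val records = Seq(
  Record("A", "row1"),
  Record("B", "row2"),
  Record("A", "row3")
)

// Group rows by segment so each group can be stored separately.
val bySegment: Map[String, Seq[Record]] = records.groupBy(_.segmentId)
```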

@sree018
Author

sree018 commented Oct 26, 2022

For fixed-length or variable-length (RDW) files, is there an option to set a custom HDFS block size?

@yruslan
Collaborator

yruslan commented Oct 26, 2022

For fixed-length or variable-length (RDW) files, is there an option to set a custom HDFS block size?

Sorry, I don't understand the question. Do you want to save output files with a block size different from the default configured for the HDFS cluster? If yes, that is a Spark feature, not a Cobrix feature:
https://stackoverflow.com/a/40959126/1038282
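Following the linked answer, the block size is controlled through Spark's Hadoop configuration rather than through Cobrix. A sketch, assuming a SparkSession named spark and using 128 MB purely as an example value:

```scala
// Desired block size in bytes: 128 MB, chosen here only as an example.
val blockSizeBytes: Long = 128L * 1024 * 1024

// With an active SparkSession this would be applied before writing, e.g.:
//   spark.sparkContext.hadoopConfiguration.set("dfs.blocksize", blockSizeBytes.toString)
//   spark.sparkContext.hadoopConfiguration.set("parquet.block.size", blockSizeBytes.toString)
```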

@yruslan yruslan closed this as completed Nov 16, 2022