variable length record optimization #521
Comments
Hi, by default, Cobrix should be able to parallelize its work, so we'd need some additional context to understand why it is not happening in your case. Please, add the following to the issue description:
@yruslan thanks for the response. The version of Cobrix used: 2.5.1, Scala 2.11, Spark 2.4.0.

```scala
val cobrixOptions: Map[String, String] = Map(
  "record_format" -> "VB"
  // …remaining options truncated in the original comment…
)
val df = spark.read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("copybook", copyBookPath)
  .options(cobrixOptions)
  .load(path)
val df1 = SparkUtils.flattenSchema(df, true)
df1.write.mode("overwrite").parquet(outboundpath)
```
Interesting... the index consists of a single item only. This usually happens when the input file is not decoded correctly. How many records does 'df2.count' return?
450,786,280
Looks good, thanks! Investigating.
The issue is confirmed. We will fix it in the next release.
This issue should be fixed in the latest 'master' branch.
Hi @yruslan, is there any tentative date for the 2.6.0 release?
Actually, releasing 2.6.0 today. It should be in Maven Central shortly.
Hi @yruslan, I am still seeing the same issue: IndexBuilder:214 - Index elements count: 1, number of partitions = 1
Please, … and try again. If this doesn't work, we'd probably need to add some debug logging to investigate further.
Shall I use input_split_size_mb?
No, not at this time. When indexes are confirmed to be working, you can fine-tune with options like this. But for now, let's keep all the defaults.
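For later, a minimal sketch of what that fine-tuning could look like once indexing works (copyBookPath and path are the placeholders from the snippet above; the split size of 300 MB is an arbitrary example value):

```scala
// Sketch only: once indexes are confirmed working, input_split_size_mb
// can be used to control the target amount of data per partition.
val df = spark.read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("copybook", copyBookPath)
  .option("record_format", "VB")         // BDW+RDW records, as above
  .option("input_split_size_mb", "300")  // arbitrary example value
  .load(path)
```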
Hi @yruslan, I just passed:

```scala
val cobrixOptions: Map[String, String] = Map(
  "record_format" -> "VB"
  // …remaining options truncated in the original comment…
)
```

It is still creating only 1 index entry. Can you suggest the proper options for VB files?
These options seem alright. Will try to reproduce next week. But probably we'd need to add more debug info in order to understand why it didn't create more than 1 entry in the index.
Tried generating a big file with BDW+RDW (record_format = VB), and it generated more than 1 index entry. Are you sure you are using the latest version? Can you post the log line showing the spark-cobol version number? For my env it looks like:
Hi @yruslan, is it 2.6.0 or 2.6.1? I am using 2.6.0.
Correct, should be 2.6.0 for you. Do you see the log line? Please, post it here.
The version is correct, but I'm also seeing 'segment_id_level8', 'segment_id_level17', etc. Are the options you specified in …
It is a VB, multi-segment file.
Please, list all the options. It might be important. Or try loading the data file without specifying multisegment options.
This file is VB and it has 10 segments present.
Dear @sree018, currently I'm unable to reproduce the issue. In order for me to succeed, I need as much context as possible. With all due respect, could you please provide all the options that you are passing to Cobrix? Otherwise we are playing hide and seek; I'm quite busy, to be honest, and might not have enough time to guess what kind of options you might have passed to Cobrix to have such an effect. Thank you.
Thanks! Try removing all options that start with 'segment_id_level', and see if the number of index elements is still the same.
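For illustration, a stripped-down load along those lines (only record_format is kept; every segment_id_level* option is dropped):

```scala
// Minimal sketch: load the VB file with no segment_id_level* options at all,
// then check the IndexBuilder log line for the index element count.
val cobrixOptions: Map[String, String] = Map("record_format" -> "VB")

val df = spark.read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("copybook", copyBookPath)
  .options(cobrixOptions)
  .load(path)
```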
Hi @yruslan, I removed all segment levels, and it is able to create multiple partitions.
Did you achieve the expected result? I'm asking because I've never encountered a situation with more than 2 levels of hierarchy in practice; that's why it was surprising for me to see 17 levels of hierarchy.
Yes, we need to add functionality to separate the data into segment levels and store it in a database.
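A hedged sketch of one way to do that on the Spark side (the column name seg_id is hypothetical; in practice it would be whichever field carries the segment discriminator in the copybook):

```scala
import org.apache.spark.sql.functions.col

// Hypothetical sketch: "seg_id" is an assumed column name, not one defined
// in this thread. Writing partitioned by segment gives one directory per
// segment value, which can then be loaded into the database per segment.
df1.write
  .mode("overwrite")
  .partitionBy("seg_id")
  .parquet(outboundpath)

// Alternatively, filter out a single segment explicitly:
val seg01 = df1.filter(col("seg_id") === "01")
```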
For fixed-length or variable-length (RDW) files, is there any option to set a custom HDFS block size?
Sorry, I don't understand the question. Do you want to save output files with a block size that is different from the defaults configured for the HDFS cluster? If yes, it is a Spark feature, not a Cobrix feature:
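A sketch of the kind of Spark/Hadoop settings meant here (the sizes are arbitrary examples; dfs.blocksize is the standard Hadoop client property for the block size of newly created HDFS files, and parquet.block.size is the Parquet row-group size; neither is Cobrix-specific):

```scala
// Sketch, not Cobrix-specific: request a 256 MB HDFS block size for files
// created by this job, and a 128 MB Parquet row-group size for the output.
spark.sparkContext.hadoopConfiguration
  .set("dfs.blocksize", (256L * 1024 * 1024).toString)

df1.write
  .option("parquet.block.size", (128L * 1024 * 1024).toString)
  .mode("overwrite")
  .parquet(outboundpath)
```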
Background [Optional]
We have a 76 GB variable-width, variable-length (BDW+RDW) multi-segment file (47 segments); the file contains 470,000,000 records with 700 columns. I am trying to convert it to a Parquet file. Cobrix is creating a single index entry (a single partition). Parsing works, and I am able to see the data correctly with df.show().
Question
How do I parallelize the job across executors?
df.write is using a single thread while writing the Parquet file.
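One generic Spark-side mitigation, independent of the Cobrix fix discussed above (the partition count of 200 is an arbitrary example), is to repartition before writing; note that it does not parallelize the read itself:

```scala
// Generic workaround sketch: force a shuffle so the Parquet write runs as
// multiple tasks even when the source DataFrame has a single partition.
// The single-threaded read of the source file is not helped by this.
df1.repartition(200)
  .write
  .mode("overwrite")
  .parquet(outboundpath)
```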
Options used: