data in string format apart from native target datatype format #545

saikumare-a · 2022-12-12T15:27:28Z

Background

Cobrix converts the data to native types (decimal, integer, etc.) based on the copybook information.

Feature

It would be helpful to have an option that just splits the record into columns and keeps them in string format (as is, without any trimming) instead of converting them to native types. This would provide the following benefits:

If there is a discrepancy between the data and the copybook, having all columns as strings would help in debugging issues.
Downstream applications could run data quality checks and report issues (currently invalid data becomes null in Spark).

Example [Optional]

A simple example if applicable.

Proposed Solution [Optional]

Solution Ideas

One approach could be to handle this via option("debug", "original").

yruslan · 2022-12-12T15:42:32Z

Nice idea, but it can only work for fields with 'DISPLAY' usage, and it also depends on the encoding (ASCII/EBCDIC).
Binary, BCD, and floating-point numbers contain bytes that can't be converted to characters.
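
To illustrate the point about non-DISPLAY fields, here is a minimal sketch in plain Python. It uses the standard `cp037` codec as a stand-in for EBCDIC and is not Cobrix code; the sample value 123 is an assumption chosen for illustration.

```python
# Illustration only: plain Python codecs, not Cobrix internals.
# cp037 is used here as one common EBCDIC code page (an assumption).

# A number stored with DISPLAY usage: one EBCDIC character per digit.
display_bytes = "123".encode("cp037")      # bytes F1 F2 F3

# The same number stored as COMP-3 (packed decimal): two digits per byte,
# with the sign in the last nibble (0xC means positive).
packed_bytes = bytes([0x12, 0x3C])

display_text = display_bytes.decode("cp037")
packed_text = packed_bytes.decode("cp037")

# The DISPLAY field decodes to readable digits.
print(repr(display_text))

# The packed field decodes to control characters, so a "string" view of it
# carries no readable information.
print(packed_text.isprintable())
```

The DISPLAY bytes round-trip to readable text, while the packed bytes map to control characters, which is why a string view is only meaningful for DISPLAY fields.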

.option("debug", "true") aka .option("debug", "hex") works well for investigating copybook discrepancy issues, and can be used for quality control, e.g. expecting nulls to be only for '0x00 0x00...' byte stream.

.option("debug", "raw") helps preserve the original data, which you can convert to strings if you want.

Can you give a concrete example (field, its PIC, and value) that would help debugging it as a string?

yruslan · 2022-12-13T08:15:29Z

After thinking about it, the above feature makes sense for ASCII files, but not for EBCDIC.
I see how it could be helpful for ASCII.

saikumare-a · 2022-12-13T08:45:24Z

Thanks for the reply. As you rightly said, this would be very useful for ASCII files.

Please share your thoughts on adding this feature (plan, timeline, etc.). Thank you for the support.

yruslan · 2022-12-15T15:56:50Z

It is hard to say for certain. Maybe end of this year, or Jan next year.

yruslan · 2022-12-19T14:39:55Z

This is done and available in the latest 'master'

saikumare-a · 2022-12-19T15:57:51Z

Hi @yruslan ,

Thanks a lot. I am from the Python world and have no idea how to create the jar file. Could you help with the steps to create a jar file, or attach the jar file to this issue, so that I can test it and let you know?

yruslan · 2022-12-19T16:01:51Z

Sure. Which Spark and Scala version are you using?

saikumare-a · 2022-12-19T16:09:14Z

Using Spark 3.1.2, Scala 2.12

currently using the below cobrix version

groupId: za.co.absa.cobrix
artifactId: spark-cobol_2.12
version: 2.6.1

yruslan · 2022-12-20T10:09:52Z

Here, you can try this one:
spark-cobol-assembly-2.6.2-SNAPSHOT.zip

saikumare-a · 2022-12-20T10:45:15Z

Awesome, validated and working as expected. Thanks for the quick turnaround with this enhancement

saikumare-a · 2022-12-20T10:54:37Z

Hi @yruslan,

With option("debug","string"), we see string data in the <col_name>_debug fields. How about showing this string data in the actual fields instead of the <col_name>_debug fields? This would put the original data in the actual columns, and downstream applications can take care of the next steps.

One option is to handle this after Cobrix with custom code,
but handling it in Cobrix might help other Cobrix users.
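
The post-Cobrix workaround mentioned above could look roughly like the sketch below. It is a hypothetical helper, not part of Cobrix: it works only on a list of column names, assumes a flat (non-nested) schema, and assumes the debug columns use the `_debug` suffix described in this comment.

```python
from typing import List, Tuple


def debug_to_actual(columns: List[str], suffix: str = "_debug") -> List[Tuple[str, str]]:
    """Build (source, target) column pairs that replace each parsed column
    with its string debug counterpart, renamed to the original column name.

    Columns that have no debug counterpart are passed through unchanged.
    """
    debug_cols = {c for c in columns if c.endswith(suffix)}
    pairs = []
    for c in columns:
        if c in debug_cols:
            # Keep the debug column, but under the base column name.
            pairs.append((c, c[: -len(suffix)]))
        elif c + suffix not in debug_cols:
            # No debug counterpart exists, so keep the column as is.
            pairs.append((c, c))
        # Otherwise: a parsed column that has a debug counterpart is dropped.
    return pairs


# Hypothetical column names for illustration.
columns = ["ID", "ID_debug", "NAME", "NAME_debug", "FILLER"]
pairs = debug_to_actual(columns)
print(pairs)
```

In PySpark the pairs could then be applied with something like `df.select([F.col(src).alias(dst) for src, dst in pairs])`, where `F` is `pyspark.sql.functions`; nested group fields would need additional handling.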

yruslan · 2022-12-21T07:55:56Z

So basically what you need is to slice ASCII records based on field lengths from a copybook, with all columns as strings, right?

I think in ASCII files you can only have numbers with usage DISPLAY. So if numbers could be retained as strings, it could help you, right?
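
As a rough illustration of what slicing an ASCII record by field lengths means, here is a minimal plain-Python sketch. The layout, field names, and sample record are hypothetical, and this is not how Cobrix implements it.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (field name, length in characters), as if derived from
# a copybook with PIC X(5), PIC 9(4), and PIC X(3) fields.
LAYOUT: List[Tuple[str, int]] = [("NAME", 5), ("AMOUNT", 4), ("CODE", 3)]


def slice_record(record: str, layout: List[Tuple[str, int]]) -> Dict[str, str]:
    """Split a fixed-width record into string fields without trimming or
    type conversion, so invalid values remain visible."""
    fields = {}
    offset = 0
    for name, length in layout:
        fields[name] = record[offset:offset + length]
        offset += length
    return fields


# "12A4" is not a valid number for PIC 9(4); a typed parse would yield null,
# while the string view keeps the original characters.
row = slice_record("JOHN 12A4XY ", LAYOUT)
print(row)
```

Keeping every field as an untrimmed string lets a downstream data quality step see exactly which characters caused a field to fail numeric parsing.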

Here is another feature request related to this: #25

saikumare-a · 2022-12-21T08:09:13Z

Yes, correct,

Is #25 already available in Cobrix? If yes, please add this information to the documentation, as I don't see it there.

yruslan · 2022-12-21T14:20:18Z

No, it is not implemented yet. But it is in the plans to implement it in the future.

saikumare-a added the enhancement New feature or request label Dec 12, 2022

saikumare-a changed the title ~~data in native string format instead of string format~~ data in string format instead of native target datatype format Dec 12, 2022

saikumare-a changed the title ~~data in string format instead of native target datatype format~~ data in string format apart from native target datatype format Dec 12, 2022

yruslan added the accepted Accepted for implementation label Dec 13, 2022

yruslan added a commit that referenced this issue Dec 19, 2022

#545 Add support for 'string' debug columns for ASCII file format (D,…

566ef0c

… D2).

yruslan added a commit that referenced this issue Dec 19, 2022

#545 Add support for 'string' debug columns for ASCII file format (D,…

b705e10

… D2).

This was referenced Dec 27, 2022

IndexBuilder step running in single thread for ASCII variable length files #543

Closed

option variable_occurs: true not working #553

Closed

yruslan closed this as completed Apr 17, 2023
