data in string format apart from native target datatype format #545

Closed
saikumare-a opened this issue Dec 12, 2022 · 14 comments
Labels
accepted Accepted for implementation enhancement New feature or request

Comments

@saikumare-a

Background

Cobrix converts data to native types (decimal, integer, etc.) based on the copybook information.

Feature

An option to simply split each record into columns and keep them as strings (as is, without any trimming), instead of converting them to native types, would be helpful and would provide the following benefits:

  1. If there is a discrepancy between the data and the copybook, having all columns as strings would help in debugging issues.
  2. It would help downstream applications perform data quality checks and report issues (currently, invalid data becomes null in Spark).

Proposed Solution [Optional]

Solution Ideas

  1. One approach could be to handle this using option("debug", "original").
@saikumare-a saikumare-a added the enhancement New feature or request label Dec 12, 2022
@saikumare-a saikumare-a changed the title from "data in native string format instead of string format" to "data in string format instead of native target datatype format" Dec 12, 2022
@saikumare-a saikumare-a changed the title from "data in string format instead of native target datatype format" to "data in string format apart from native target datatype format" Dec 12, 2022
@yruslan
Collaborator

yruslan commented Dec 12, 2022

Nice idea, but it can only work for fields with 'DISPLAY' usage, and it also depends on the encoding (ASCII/EBCDIC).
Binary, BCD, and floating-point numbers contain bytes that can't be converted to characters.
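
To illustrate the point about non-DISPLAY fields, here is a minimal sketch in plain Python (not Cobrix code). The `unpack_comp3` helper and the sample byte values are assumptions made up for this example; they follow the usual COMP-3 (packed decimal) layout of two BCD digits per byte with the sign in the last nibble.

```python
def unpack_comp3(raw: bytes) -> int:
    """Decode a packed decimal (COMP-3) value.

    Hypothetical helper for illustration: every byte holds two BCD digits,
    except the last one, which holds one digit and the sign nibble
    (0xD means negative; 0xC or 0xF means positive/unsigned).
    """
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)
    sign_nibble = raw[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    return -value if sign_nibble == 0x0D else value


# PIC 9(5) USAGE DISPLAY in an ASCII file: one character per digit.
display_field = b"12345"
# PIC S9(5) COMP-3 holding +12345: digits packed into nibbles.
packed_field = bytes([0x12, 0x34, 0x5C])

display_text = display_field.decode("ascii")   # readable as-is
packed_text = packed_field.decode("latin-1")   # contains a control character
```

The DISPLAY bytes decode directly to readable text, while the packed bytes only make sense after numeric decoding, which is why a "string view" of such fields is of limited use.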

.option("debug", "true"), aka .option("debug", "hex"), works well for investigating copybook discrepancy issues and can be used for quality control, e.g. expecting nulls only for a '0x00 0x00...' byte stream.
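
The quality-control idea above can be sketched as a small check in plain Python. The function name `is_suspicious_null` is hypothetical, and the sketch assumes the debug value is available as a string of hex digits (optionally separated by spaces); the exact format produced by Cobrix should be verified against real output.

```python
from typing import Optional


def is_suspicious_null(value: Optional[object], debug_hex: Optional[str]) -> bool:
    """Flag a parsed null whose source bytes were not all 0x00.

    Assumption: debug_hex is a hex string such as "F1F2C1" or "00 00 00".
    """
    if value is not None:
        return False          # the field parsed successfully
    if not debug_hex:
        return False          # nothing to compare against
    hex_digits = debug_hex.replace(" ", "")
    # A legitimate "empty" field is expected to be all zero bytes.
    return any(ch != "0" for ch in hex_digits)
```

A row where the parsed column is null but the hex value is not all zeros likely points to bad data or a copybook mismatch.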

.option("debug", "raw") helps preserve the original data, which you can convert to a string if you want.
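
For PySpark users, the options mentioned above could be passed roughly as follows. This is a sketch under assumptions: `read_with_debug` is a hypothetical wrapper, the paths are placeholders supplied by the caller, and the spark-cobol jar is assumed to be on the Spark classpath. Only the debug values named in this comment are accepted here.

```python
def read_with_debug(spark, copybook_path: str, data_path: str, mode: str = "hex"):
    """Read a mainframe file with Cobrix debug fields enabled.

    Hypothetical helper: `spark` is an existing SparkSession with the
    spark-cobol package available; `mode` is one of the debug values
    mentioned in this thread.
    """
    allowed = {"true", "hex", "raw"}
    if mode not in allowed:
        raise ValueError(f"Unsupported debug mode: {mode!r}")
    return (
        spark.read.format("cobol")
        .option("copybook", copybook_path)
        .option("debug", mode)
        .load(data_path)
    )
```

Usage would look like `df = read_with_debug(spark, "/path/to/copybook.cpy", "/path/to/data", "raw")`, with both paths replaced by real locations.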

Can you give a concrete example (field, its PIC, and value) where having it as a string would help with debugging?

@yruslan
Collaborator

yruslan commented Dec 13, 2022

After thinking about it, the above feature makes sense for ASCII files, but not for EBCDIC.
I see how it could be helpful for ASCII.

@yruslan yruslan added the accepted Accepted for implementation label Dec 13, 2022
@saikumare-a
Author

Thanks for the reply. As you rightly said, this would be very useful for ASCII files.

Please share your thoughts on adding this feature (plan, timeline, etc.). Thank you for the support.

@yruslan
Collaborator

yruslan commented Dec 15, 2022

It is hard to say for certain. Maybe end of this year, or Jan next year.

@yruslan
Collaborator

yruslan commented Dec 19, 2022

This is done and available in the latest 'master'.

@saikumare-a
Author

Hi @yruslan ,

Thanks a lot. I am from the Python world and have no idea how to create the jar file. Could you help with the steps to create a jar file, or attach the jar file to this issue, so that I can test it and let you know?

@yruslan
Collaborator

yruslan commented Dec 19, 2022

Sure. Which Spark and Scala version are you using?

@saikumare-a
Author

saikumare-a commented Dec 19, 2022

Using Spark 3.1.2, Scala 2.12

Currently using the following Cobrix version:

groupId: za.co.absa.cobrix
artifactId: spark-cobol_2.12
version: 2.6.1

@yruslan
Collaborator

yruslan commented Dec 20, 2022

Here, you can try this one:
spark-cobol-assembly-2.6.2-SNAPSHOT.zip

@saikumare-a
Author

Awesome, validated and working as expected. Thanks for the quick turnaround on this enhancement.

@saikumare-a
Author

Hi @yruslan,

With option("debug", "string"), we see string data in the <col_name>_debug fields. How about showing this string data in the actual fields instead of the <col_name>_debug fields? This would show the original data in the actual columns, and downstream applications could take care of the next steps.

One option is to handle this after Cobrix with custom code,
but handling it in Cobrix might help other Cobrix users.
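
The "custom code after Cobrix" option could look something like the sketch below. It is not part of Cobrix: `promote_debug_columns` is a hypothetical helper that relies only on the `<col_name>_debug` naming described above, and it handles top-level columns only (nested structures would need recursive handling).

```python
from typing import List

DEBUG_SUFFIX = "_debug"


def promote_debug_columns(columns: List[str]) -> List[str]:
    """Build select expressions that put each debug string in its base column.

    For every column that has a matching `<name>_debug` sibling, the debug
    value is selected under the base name; other columns pass through.
    """
    present = set(columns)
    exprs = []
    for name in columns:
        is_debug = name.endswith(DEBUG_SUFFIX)
        if is_debug and name[: -len(DEBUG_SUFFIX)] in present:
            continue  # folded into its base column below
        debug_name = name + DEBUG_SUFFIX
        source = debug_name if debug_name in present else name
        exprs.append(f"`{source}` AS `{name}`")
    return exprs


# With a DataFrame read using option("debug", "string"):
#   df_strings = df.selectExpr(*promote_debug_columns(df.columns))
```

Building the expressions from column names keeps the helper independent of Spark, so it can be checked without a cluster.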

@yruslan
Collaborator

yruslan commented Dec 21, 2022

So basically what you need is to slice ASCII records based on field lengths from a copybook, with all columns as strings, right?

I think in ASCII files you can only have numbers with usage DISPLAY. So if numbers could be retained as strings, it could help you, right?

Here is another feature request related to this: #25
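
The slicing described in this comment can be sketched in plain Python as follows. The layout and the sample record are invented for illustration; in practice the field names and lengths would come from the copybook.

```python
from typing import Dict, List, Tuple


def slice_record(record: str, layout: List[Tuple[str, int]]) -> Dict[str, str]:
    """Cut a fixed-width ASCII record into string fields, without trimming."""
    fields = {}
    offset = 0
    for name, length in layout:
        fields[name] = record[offset: offset + length]
        offset += length
    return fields


# Hypothetical layout: (field name, length in characters).
layout = [("ID", 5), ("NAME", 10), ("AMOUNT", 7)]
# AMOUNT contains an invalid character ("A") that numeric parsing would reject.
record = "00042JOHN DOE  00123A5"

fields = slice_record(record, layout)
```

Because nothing is converted or trimmed, the invalid AMOUNT value stays visible as text instead of turning into null, which is the data quality benefit raised in the original request.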

@saikumare-a
Author

Yes, correct.

Is #25 already available in Cobrix? If yes, please add this to the documentation, as I don't see it there.

@yruslan
Collaborator

yruslan commented Dec 21, 2022

No, it is not implemented yet. But it is in the plans to implement it in the future.
