
Remove metadata associated with column #590

Closed
kbarathi opened this issue Mar 20, 2023 · 5 comments
Labels
accepted Accepted for implementation enhancement New feature or request

Comments

@kbarathi

Background [Optional]

I'm using Cobrix v2.6.2 to convert COBOL files to Parquet. I noticed that your team recently added 'maxLength' metadata to Spark schema string fields in v2.6.0. This metadata is causing validation issues for us during dataframe transformations, and we are having a hard time removing it from each field.

scala> df.schema("CODE").metadata
res0: org.apache.spark.sql.types.Metadata = {"maxLength":2}

Question

I do not need this metadata to be associated with the fields. Is there a way to disable metadata generation? Something like .option("metadata", false)?
Appreciate your response.

@kbarathi kbarathi added the question Further information is requested label Mar 20, 2023
@yruslan yruslan added enhancement New feature or request accepted Accepted for implementation and removed question Further information is requested labels Mar 21, 2023
@yruslan
Collaborator

yruslan commented Mar 21, 2023

Sure, I will add an option to disable metadata generation. By default, metadata will still be generated, though, because it helps when migrating data to relational databases. E.g. {"maxLength":2} corresponds to VARCHAR(2) in a relational schema.

Just out of curiosity, what is the purpose of the validation, and why is the metadata an issue here?

@yruslan
Collaborator

yruslan commented Mar 22, 2023

This should now be available on the current master branch.

Use

.option("metadata", "false")
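For context, a minimal sketch of how the new option fits into a typical Cobrix read; the copybook and data paths are hypothetical placeholders, and the "metadata" option assumes a Cobrix version that includes this fix:

```scala
// Sketch: reading a mainframe file with Cobrix while disabling
// schema-metadata generation. Paths below are placeholders.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy") // hypothetical path
  .option("metadata", "false")                 // disable maxLength metadata
  .load("/path/to/data")                       // hypothetical path

// With the option set, string fields should carry no maxLength metadata,
// so df.schema("CODE").metadata is empty.
```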

@kbarathi
Author

Thanks for your quick response @yruslan.

The purpose of the validation is this: when we send a file from one program to another, we run a schema validation through Spark on arrival. To do this, we create a default StructType schema that we compare against the schema that comes in from the other program. This comparison covers the entirety of the schema, including hidden values such as metadata, since we use the .diff() method in Spark Scala. Because not all of the data we use goes through Cobrix, we can't change the expected StructFields to contain metadata, and that is where our issue stemmed from.
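For anyone on an older Cobrix version without the option, a possible workaround is to strip metadata from the incoming schema before diffing. This sketch uses only the standard Spark API (StructField.copy and Metadata.empty) and is not part of Cobrix itself:

```scala
import org.apache.spark.sql.types.{Metadata, StructType}

// Rebuild the schema with empty metadata on every top-level field so a
// field-by-field comparison ignores maxLength. Note: this does not
// descend into nested StructTypes.
def stripMetadata(schema: StructType): StructType =
  StructType(schema.fields.map(_.copy(metadata = Metadata.empty)))
```

Since StructType is a Seq[StructField], the cleaned schema can then be compared directly, e.g. stripMetadata(df.schema).diff(expectedSchema).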

@yruslan
Collaborator

yruslan commented Mar 22, 2023

Thanks for describing your use case!

@yruslan
Collaborator

yruslan commented Apr 12, 2023

This was released in 2.6.5.

@yruslan yruslan closed this as completed Apr 12, 2023