
Remove metadata associated with column #590

Closed
kbarathi opened this issue Mar 20, 2023 · 5 comments
Labels
accepted Accepted for implementation enhancement New feature or request

Comments

@kbarathi

Background [Optional]

I'm using Cobrix v2.6.2 to convert COBOL files to Parquet. I noticed that your team recently added 'maxLength' metadata to Spark schema string fields in v2.6.0. This metadata is causing validation issues for us during dataframe transformations, and we are having a hard time removing it from each field.

scala> df.schema("CODE").metadata
res0: org.apache.spark.sql.types.Metadata = {"maxLength":2}

Question

I do not need this metadata to be associated with the fields. Is there a way to disable metadata generation? Something like .option("metadata", false)?
Appreciate your response.

@kbarathi kbarathi added the question Further information is requested label Mar 20, 2023
@yruslan yruslan added enhancement New feature or request accepted Accepted for implementation and removed question Further information is requested labels Mar 21, 2023
@yruslan
Collaborator

yruslan commented Mar 21, 2023

Sure, I will add an option to disable metadata generation. By default, metadata will still be generated, though, because it helps when migrating data to relational databases. E.g. {"maxLength":2} corresponds to VARCHAR(2) in a relational schema.

Just out of curiosity, what is the purpose of the validation, and why is the metadata an issue here?

@yruslan
Collaborator

yruslan commented Mar 22, 2023

This should now be available on the current master branch.

Use

.option("metadata", "false")
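For context, a minimal sketch of how the new option fits into a typical Cobrix read; the copybook and data paths are hypothetical placeholders, and the "metadata" option assumes a Cobrix version that includes this fix:

```scala
// Sketch: reading a mainframe file with Cobrix while disabling
// schema-metadata generation. Paths below are placeholders.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy") // hypothetical path
  .option("metadata", "false")                 // disable maxLength metadata
  .load("/path/to/data")                       // hypothetical path

// With the option set, string fields should carry no maxLength metadata,
// so df.schema("CODE").metadata is empty.
```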

@kbarathi
Author

Thanks for your quick response @yruslan.

The purpose of the validation is this: when we send a file from one program to another, we run a schema validation through Spark on arrival. To do this, we create a default StructType schema that we compare against the schema that comes in from the other program. This comparison covers the entirety of the schema, including hidden values such as metadata, since we use the .diff() method in Spark Scala. Because not all of the data we use goes through Cobrix, we can't change the expected StructFields to contain metadata, and that is where our issue stemmed from.
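For anyone on an older Cobrix version without the option, a possible workaround is to strip metadata from the incoming schema before diffing. This sketch uses only the standard Spark API (StructField.copy and Metadata.empty) and is not part of Cobrix itself:

```scala
import org.apache.spark.sql.types.{Metadata, StructType}

// Rebuild the schema with empty metadata on every top-level field so a
// field-by-field comparison ignores maxLength. Note: this does not
// descend into nested StructTypes.
def stripMetadata(schema: StructType): StructType =
  StructType(schema.fields.map(_.copy(metadata = Metadata.empty)))
```

Since StructType is a Seq[StructField], the cleaned schema can then be compared directly, e.g. stripMetadata(df.schema).diff(expectedSchema).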

@yruslan
Collaborator

yruslan commented Mar 22, 2023

Thanks for describing your use case!

@yruslan
Collaborator

yruslan commented Apr 12, 2023

This was released in 2.6.5.

@yruslan yruslan closed this as completed Apr 12, 2023