SparkUtils.flattenSchema method throws null pointer exception #466
Comments
The LOC causing the NPE, in function flattenStructArray:
var maxInd = df.agg(max(expr(s"size(
I understand the above code depends on the data because of the OCCURS DEPENDING ON implementation, so it will throw an NPE when there are zero records in the dataframe.
Potential fix. Approach 1: before calculating maxInd, check whether the dataframe is empty. If it is empty, get the maximum index of a group from the copybook for OCCURS or OCCURS DEPENDING ON; for an empty dataset there is no need to derive maxInd from the data. This will return the dataframe with the maximum possible columns from OCCURS or OCCURS DEPENDING ON. In my view Approach 1 is good. Kindly let me know your view.
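For illustration, here is a minimal sketch of the failure mode (the column name ARRAY_FIELD and the local SparkSession are assumptions for the repro, not taken from the Cobrix source):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{expr, max}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// An empty DataFrame with an array column, similar to what an empty
// data file read with an OCCURS copybook produces.
val df = Seq.empty[(Int, Seq[Int])].toDF("ID", "ARRAY_FIELD")

val row = df.agg(max(expr("size(ARRAY_FIELD)"))).collect()(0)
// max(...) over zero rows is null, so reading it as a primitive Int
// throws a NullPointerException:
val maxInd = row.getInt(0)
```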
Thanks for the report! We will fix the NPE. The tricky part is getting this info from the copybook. I have some ideas, will let you know after I try them.
Sure, thanks. Please also check Approach 1, where we can pass the copybook contents as an optional parameter and derive the maximum occurrences from it. For such a copybook, in the case of a zero-byte file the dataset should still be generated with all the possible flattened columns.
The idea worked. When creating a Spark schema from a copybook, 'minElements' and 'maxElements' metadata fields are added to arrays generated from OCCURS. This way the program no longer needs to determine the maximum array size from the data itself. If the metadata fields are not available, the program falls back to the old way of getting these maximums, by querying the data. You can check the update in the master branch.
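As a rough sketch of how such metadata can be used (the key names follow the comment above, but the helper itself is illustrative, not the actual Cobrix code):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{expr, max}
import org.apache.spark.sql.types.StructField

// Prefer the 'maxElements' metadata written at schema-creation time;
// fall back to querying the data only when the metadata is missing,
// guarding against the empty-DataFrame case that caused the NPE.
def maxArraySize(df: DataFrame, field: StructField, path: String): Int = {
  if (field.metadata.contains("maxElements")) {
    field.metadata.getLong("maxElements").toInt
  } else {
    val row = df.agg(max(expr(s"size($path)"))).collect()(0)
    if (row.isNullAt(0)) 0 else row.getInt(0)
  }
}
```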
@yruslan Thanks a lot. Any tentative date for the 2.4.8 release?
Probably by the end of the week. But I would encourage you to check whether the update works for you before then.
Sure, let me pull the changes and check via unit test cases.
As we are currently on 2.1.3, which uses Spark 2.4.5, I suppose the newer version should be backward compatible, since we currently may not use Spark 3.x.
@yruslan I checked with the below copybook:
01 RECORD.
As per expectation I should get the columns below:
|COUNT|GROUP_0_INNER_COUNT|GROUP_0_INNER_GROUP_0_FIELD|GROUP_0_INNER_GROUP_1_FIELD|GROUP_0_INNER_GROUP_2_FIELD|GROUP_1_INNER_COUNT|GROUP_1_INNER_GROUP_0_FIELD|GROUP_1_INNER_GROUP_1_FIELD|GROUP_1_INNER_GROUP_2_FIELD|
The FIELD columns should number 2 × 3 = 6, but instead I am getting:
|COUNT|GROUP_0_INNER_COUNT|GROUP_0_INNER_GROUP_0_FIELD|GROUP_1_INNER_COUNT|GROUP_1_INNER_GROUP_0_FIELD|
Did you use the spark-cobol dependency with version 2.4.8-SNAPSHOT?
@yruslan Yes, I am using the master branch and it is with version 2.4.8-SNAPSHOT.
Strange. What is the code snippet you are using?
I am running a simple test case with the below code, using the master branch itself. I didn't change anything in the POM or the main code. Is it working for you as expected, i.e. creating the columns below?
COUNT|GROUP_0_INNER_COUNT|GROUP_0_INNER_GROUP_0_FIELD|GROUP_0_INNER_GROUP_1_FIELD|GROUP_0_INNER_GROUP_2_FIELD|GROUP_1_INNER_COUNT|GROUP_1_INNER_GROUP_0_FIELD|GROUP_1_INNER_GROUP_1_FIELD|GROUP_1_INNER_GROUP_2_FIELD|
val df = spark
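The snippet above is truncated; for reference, a typical spark-cobol read plus flattening might look like the sketch below (the copybook variable and file path are placeholders, not from the original thread):

```scala
import za.co.absa.cobrix.spark.cobol.utils.SparkUtils

val df = spark.read
  .format("cobol")
  .option("copybook_contents", copybook) // the copybook text shown earlier
  .load("/path/to/zero_byte_file")       // placeholder path to an empty data file

// Flatten OCCURS arrays into top-level columns; this is the method
// that previously threw the NPE on empty data.
val flatDf = SparkUtils.flattenSchema(df)
flatDf.printSchema()
```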
I confirm that inner OCCURS were not flattened properly for empty files. It is fixed. You can pull the latest master and try again. It is good that you checked, otherwise we wouldn't have spotted it!
@yruslan Let me test again after your commit.
Working fine.
Describe the bug
The SparkUtils.flattenSchema method throws an NPE for an empty data frame that has OCCURS groups, or whose schema contains arrays.
To Reproduce
Create a data frame from an empty data file and a copybook with OCCURS, then flatten the schema with the SparkUtils.flattenSchema method.
Expected behaviour
It should return the flattened schema with zero records, since it is an empty dataset.