SparkUtils.flattenSchema method throws null pointer exception #466
Comments
The LOC causing the NPE, in function flattenStructArray:
var maxInd = df.agg(max(expr(s"size(
I understand the above code depends on the data because of the OCCURS DEPENDING ON implementation, so it will throw an NPE when there are zero records in the dataframe.
Potential fix. Approach 1: before calculating maxInd, check whether the dataframe is empty. If it is empty, get the maximum index of a group from the copybook for OCCURS or OCCURS DEPENDING ON; for an empty dataset there is no need to derive maxInd from the data. This will return the dataframe with the maximum possible columns from OCCURS or OCCURS DEPENDING ON. In my view Approach 1 is good. Kindly let me know your view.
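For illustration, here is a minimal sketch of the failure mode (the column name ARRAY_FIELD and the local SparkSession are assumptions for the repro, not taken from the Cobrix source):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{expr, max}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// An empty DataFrame with an array column, similar to what an empty
// data file read with an OCCURS copybook produces.
val df = Seq.empty[(Int, Seq[Int])].toDF("ID", "ARRAY_FIELD")

val row = df.agg(max(expr("size(ARRAY_FIELD)"))).collect()(0)
// max(...) over zero rows is null, so reading it as a primitive Int
// throws a NullPointerException:
val maxInd = row.getInt(0)
```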
Thanks for the report! We will fix the NPE. The tricky part is getting this info from the copybook. I have some ideas, will let you know after I try them.
Sure, thanks. Please also check Approach 1, where we can pass the copybook contents as an optional parameter and derive the maximum occurrences from it. For such a copybook, in the case of a zero-byte file the dataset should still be generated with all the possible flattened columns.
The idea worked. When creating a Spark schema from a copybook, 'minElements' and 'maxElements' metadata fields are added to arrays generated from OCCURS. This way the program no longer needs to determine the maximum array size from the data itself. If the metadata fields are not available, the program falls back to the old way of getting these maximums, by querying the data. You can check the update in the master branch.
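As a rough sketch of how such metadata can be used (the key names follow the comment above, but the helper itself is illustrative, not the actual Cobrix code):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{expr, max}
import org.apache.spark.sql.types.StructField

// Prefer the 'maxElements' metadata written at schema-creation time;
// fall back to querying the data only when the metadata is missing,
// guarding against the empty-DataFrame case that caused the NPE.
def maxArraySize(df: DataFrame, field: StructField, path: String): Int = {
  if (field.metadata.contains("maxElements")) {
    field.metadata.getLong("maxElements").toInt
  } else {
    val row = df.agg(max(expr(s"size($path)"))).collect()(0)
    if (row.isNullAt(0)) 0 else row.getInt(0)
  }
}
```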
@yruslan Thanks a lot. Any tentative date for the 2.4.8 release?
Probably by the end of the week. But I would encourage you to check whether the update works for you before then.
Sure, let me pull the changes and check via unit test cases.
As we are currently on 2.1.3, which uses Spark 2.4.5, I suppose the newer version should be backward compatible, since we currently may not use Spark 3.x.
@yruslan I checked with the below copybook:
01 RECORD.
As per expectation I should get the columns below:
|COUNT|GROUP_0_INNER_COUNT|GROUP_0_INNER_GROUP_0_FIELD|GROUP_0_INNER_GROUP_1_FIELD|GROUP_0_INNER_GROUP_2_FIELD|GROUP_1_INNER_COUNT|GROUP_1_INNER_GROUP_0_FIELD|GROUP_1_INNER_GROUP_1_FIELD|GROUP_1_INNER_GROUP_2_FIELD|
The FIELD columns should number 2 × 3 = 6, but instead I am getting:
|COUNT|GROUP_0_INNER_COUNT|GROUP_0_INNER_GROUP_0_FIELD|GROUP_1_INNER_COUNT|GROUP_1_INNER_GROUP_0_FIELD|
Did you use the spark-cobol dependency with version 2.4.8-SNAPSHOT?
@yruslan Yes, I am using the master branch and it is with version 2.4.8-SNAPSHOT.
Strange. What is the code snippet you are using?
I am running a simple test case with the below code, using the master branch itself. I didn't change anything in the POM or the main code. Is it working for you as expected, i.e. creating the columns below?
COUNT|GROUP_0_INNER_COUNT|GROUP_0_INNER_GROUP_0_FIELD|GROUP_0_INNER_GROUP_1_FIELD|GROUP_0_INNER_GROUP_2_FIELD|GROUP_1_INNER_COUNT|GROUP_1_INNER_GROUP_0_FIELD|GROUP_1_INNER_GROUP_1_FIELD|GROUP_1_INNER_GROUP_2_FIELD|
val df = spark
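The snippet above is truncated; for reference, a typical spark-cobol read plus flattening might look like the sketch below (the copybook variable and file path are placeholders, not from the original thread):

```scala
import za.co.absa.cobrix.spark.cobol.utils.SparkUtils

val df = spark.read
  .format("cobol")
  .option("copybook_contents", copybook) // the copybook text shown earlier
  .load("/path/to/zero_byte_file")       // placeholder path to an empty data file

// Flatten OCCURS arrays into top-level columns; this is the method
// that previously threw the NPE on empty data.
val flatDf = SparkUtils.flattenSchema(df)
flatDf.printSchema()
```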
I confirm that inner OCCURS were not flattened properly for empty files. It is fixed. You can pull the latest master and try again. It is good that you checked, otherwise we wouldn't have spotted it!
@yruslan Let me test again after your commit.
Working fine.
Describe the bug
The SparkUtils.flattenSchema method throws an NPE for an empty data frame that has OCCURS groups, or whose schema contains arrays.
To Reproduce
Create a data frame from an empty data file and a copybook with OCCURS, then flatten the schema with the SparkUtils.flattenSchema method.
Expected behaviour
It should return the flattened schema with zero records, since it is an empty dataset.