fix: Join function does not properly handle divergent schemas being j… #4560
Tried running the code example present in the issue
This gives the following output:
While going through the tables of the streams, whenever a column is missing from the result schema, I add it to the result schema on the fly. Hence all the added columns stay in the result and get filled with nulls.
Had a question related to the L/R blocks being slightly different, but looks good to me otherwise.
I think you said you are still going to add more tests here, is that right?
The product changes look good to me. This version of join has lots of strange edge cases and is hard to understand. Hopefully we can replace it with something just as useful but less difficult to reason about.
For the tests you will be adding, what additional cases did you have in mind? The cases that I think tend to be confusing and bug prone are those where some of the columns in the group key overlap with the join key, and/or where the group key is different between the two sides. Adding some cases like these that include missing columns/heterogeneous schemas seems like a good idea.
I can still think of some cases that should produce an error, like when a column exists on both sides but has a different type. I wonder if we have cases that cover that?
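To make the type-mismatch case concrete, here is a rough sketch of the kind of check such a test could exercise. It is not taken from the Flux codebase: `Col`, `ColType`, and `checkCompatible` are hypothetical names used only for illustration, and the real join would work against Flux's own column metadata.

```go
package main

import "fmt"

// ColType is a hypothetical stand-in for a column's data type.
type ColType string

// Col is a hypothetical simplified column descriptor.
type Col struct {
	Label string
	Type  ColType
}

// checkCompatible returns an error if a column label appears on both
// sides of the join with different types. Columns that exist on only
// one side are fine, since they can be null-filled.
func checkCompatible(left, right []Col) error {
	leftTypes := make(map[string]ColType, len(left))
	for _, c := range left {
		leftTypes[c.Label] = c.Type
	}
	for _, c := range right {
		if lt, ok := leftTypes[c.Label]; ok && lt != c.Type {
			return fmt.Errorf("column %q has type %s on the left and %s on the right",
				c.Label, lt, c.Type)
		}
	}
	return nil
}

func main() {
	left := []Col{{"_time", "time"}, {"_value", "float"}}
	right := []Col{{"_time", "time"}, {"_value", "string"}, {"host", "string"}}
	if err := checkCompatible(left, right); err != nil {
		fmt.Println("join error:", err)
	}
}
```

A test built on this idea would assert that the mismatched `_value` column produces an error, while the extra `host` column on the right does not.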
9aabdb9 to 2f99ef7
Added the following test cases today -
In the third case, the code joins only the very first table from the left and right streams and discards the rest of the tables. I am looking into it.
2f99ef7 to 7133d44
…oined together
When one of the input streams passed into join contains tables with different schemas, it causes join to fail. This is because join produces the schema of its final output based on the first table it finds in each of the two input streams it receives. It does this once at the beginning of the transformation. Then, while it's processing tables, if it finds a column that it doesn't recognize as part of the schema, it throws an error.
Ideally, the join transformation would be able to handle this situation gracefully by just adding the newly found column to the schema, and populating any rows that don't have a value for that column with nulls. The goal of this fix is to modify join so that it can do just that.
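The behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `Row`, `Table`, `resultBuilder`, `ensureColumn`, and `appendTable` are hypothetical names, and rows are modelled as plain maps rather than Flux's columnar tables. The point is that an unknown column extends the schema and back-fills earlier rows with nulls instead of raising an error.

```go
package main

import "fmt"

// Row is a hypothetical stand-in for one table row; an absent key means null.
type Row map[string]any

// Table is a hypothetical simplified table: ordered column names plus rows.
type Table struct {
	Cols []string
	Rows []Row
}

// resultBuilder accumulates output rows against a schema that can grow.
type resultBuilder struct {
	cols  []string
	index map[string]int
	rows  [][]any
}

func newResultBuilder() *resultBuilder {
	return &resultBuilder{index: make(map[string]int)}
}

// ensureColumn adds name to the schema if it is new and back-fills
// every row written so far with nil for that column.
func (b *resultBuilder) ensureColumn(name string) {
	if _, ok := b.index[name]; ok {
		return
	}
	b.index[name] = len(b.cols)
	b.cols = append(b.cols, name)
	for i := range b.rows {
		b.rows[i] = append(b.rows[i], nil)
	}
}

// appendTable extends the schema with any new columns, then writes the
// table's rows, leaving nil wherever the table has no such column.
func (b *resultBuilder) appendTable(t Table) {
	for _, c := range t.Cols {
		b.ensureColumn(c)
	}
	for _, r := range t.Rows {
		rec := make([]any, len(b.cols))
		for _, c := range t.Cols {
			rec[b.index[c]] = r[c]
		}
		b.rows = append(b.rows, rec)
	}
}

func main() {
	b := newResultBuilder()
	b.appendTable(Table{
		Cols: []string{"_time", "_value"},
		Rows: []Row{{"_time": 1, "_value": 10.0}},
	})
	// The second table has a different schema: no _value, but a host column.
	b.appendTable(Table{
		Cols: []string{"_time", "host"},
		Rows: []Row{{"_time": 2, "host": "a"}},
	})
	fmt.Println(b.cols)
	for _, rec := range b.rows {
		fmt.Println(rec)
	}
}
```

In this sketch the schema ends up as the union of all columns in first-seen order, so the first row gets a null `host` and the second row gets a null `_value`.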
Related issues - #4310
fixes: #4506 #4315
Done checklist