Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do case insensitive comparison between dereferenced fields and internal ORC field names #7350

Closed
wants to merge 3 commits into from
Closed

Do case insensitive comparison between dereferenced fields and internal ORC field names #7350

wants to merge 3 commits into from

Conversation

willmostly
Copy link
Contributor

Trino will currently return NULLs if table metadata contains a struct that uses caps, e.g. entryId struct<custId:string,acctGuid:string,requestDate:string> and the ORC internal field definition does not. Reading the code, I actually suspect that NULL would be returned even if the internal field name did match the casing in the table def'n, but I have not tested this scenario.

@cla-bot cla-bot bot added the cla-signed label Mar 19, 2021
@findepi
Copy link
Member

findepi commented Mar 19, 2021

We should definitely have a test for this.

@willmostly
Copy link
Contributor Author

FYI - this product test will fail without the code change introduced here.

@findepi
Copy link
Member

findepi commented Mar 23, 2021

I added a label to run all Hive product tests and also made small cleanup in a test and added an additional assertion.
@willmostly Let me know the build results.

@findepi
Copy link
Member

findepi commented Mar 24, 2021

build was almost green (expcet main and tests which seemed unrelated). i did restart all of it to be on the safe side

Copy link
Member

@phd3 phd3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willmostly thanks for making the change, lgtm. Also, could you please squash your commits.

toList())));
}
else {
projectionsByColumnIndex = projections.stream()
.collect(Collectors.groupingBy(
HiveColumnHandle::getBaseHiveColumnIndex,
mapping(
column -> column.getHiveColumnProjectionInfo().map(HiveColumnProjectionInfo::getDereferenceNames).orElse(ImmutableList.<String>of()),
column -> column.getHiveColumnProjectionInfo()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is now getting a bit longer at two places, we could extract it out to simplify. (I'd leave it up to you if you'd like to tackle it separately.)

    projectionsByColumnName = projections.stream()
            .collect(Collectors.groupingBy(
                        HiveColumnHandle::getBaseColumnName,
                        mapping(OrcPageSourceFactory::getDereferencesAsList, toList())));


    private static List<String> getDereferencesAsList(HiveColumnHandle column)
    {
        return column.getHiveColumnProjectionInfo()
                .map(info -> info.getDereferenceNames().stream()
                        .map(dereference -> dereference.toLowerCase(ENGLISH))
                        .collect(toList()))
                .orElse(ImmutableList.<String>of());
    }

assertThat(query("SELECT c_struct.testCustId FROM " + tableName)).containsOnly(row("1234"));
assertThat(query("SELECT c_struct.testcustid FROM " + tableName)).containsOnly(row("1234"));
assertThat(query("SELECT c_struct.requestDate FROM " + tableName)).containsOnly(row("some day"));
setProjectionPushdownEnabled(onTrino().getConnection(), false);
Copy link
Member

@phd3 phd3 Apr 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onTrino.getConnection() and query( use different connections, so seems like disabling the pushdown doesn't get applied really. Can you use onTrino().executeQuery? (Or something similar to AbstractTestHiveViews#setSessionProperty)

willmostly and others added 3 commits April 8, 2021 23:41
if failed with permission denied when setting session properties in
suite-2,3
@findepi
Copy link
Member

findepi commented Apr 9, 2021

CI hit #7535

@findepi
Copy link
Member

findepi commented Apr 9, 2021

Merged as 8bac015, thanks!

@findepi findepi added this to the 356 milestone Apr 9, 2021
@findepi findepi closed this Apr 9, 2021
@findepi findepi mentioned this pull request Apr 9, 2021
9 tasks
@aakashnand
Copy link
Member

aakashnand commented Jul 13, 2021

Even though this has been fixed. @willmostly @findepi I found out that. I was getting NULL even for small case nested struct type of data until the Trino version 355. For example, I created an internal table in a hive using the following JSON with added NULL values and I was still getting NULL when I executed select data.timeonsite from small_case_int_hdfs;.

Data

{"data":{"visits":"1","hits":"8","pageviews":"4","timeonsite":"107","sessionqualitydim":"99"}}
{"data":{"visits":"1","hits":"19","pageviews":"10","timeonsite":"250","sessionqualitydim":"56"}}
{"data":{"visits":"1","hits":"10","pageviews":"5","timeonsite":"1439","sessionqualitydim":"23"}}
{"data":{"visits":"1","hits":"10","pageviews":"5","sessionqualitydim":"45"}}
{"data":{"visits":"1","hits":"10","pageviews":"5","sessionqualitydim":"45"}}
{"data":{"visits":"1","hits":"19","pageviews":"10","sessionqualitydim":"56"}}
{"data":{"visits":"1","hits":"19","pageviews":"10","timeonsite":"250"}}
{"data":{"visits":"1","hits":"19","pageviews":"10","timeonsite":"250"}}
{"data":{"visits":"1","hits":"19","pageviews":"10","timeonsite":"250"}}
{"data":{"visits":"1","hits":"19","pageviews":"10","timeonsite":"250"}}

Terminal Output

trino:test_schema> select data.timeonsite from small_case_int_hdfs;
 timeonsite 
------------
 NULL       
 NULL       
 NULL       
 NULL       
 NULL       
 NULL       
 NULL       
 NULL       
 NULL       
 NULL    
(10 rows)
Query 20210713_080046_00000_a9fyn, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
3.26 [10 rows, 0B] [7 rows/s, 0B/s] 

@findepi
Copy link
Member

findepi commented Jul 14, 2021

@aakashnand please consider creating a new issue to avoid this being forgotten about

@aakashnand
Copy link
Member

@findepi my above comment was just a note and the problem is solved after this PR. Should I still create a new issue and say that it was fixed by this PR? I just wanted to highlight that the problem is not only for camel cases but also for a small cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants