Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create new anonymizer for new engine #266

Merged
merged 22 commits into from
May 29, 2023

Conversation

matthewryanwells
Copy link

@matthewryanwells matthewryanwells commented May 17, 2023

Description

Created a new listener for the SQLSyntaxParser to anonymize all queries going through the new engine, calling the old anonymizer only if the query falls back to the legacy engine.

Through the update of the anonymizer there are a few differences between between the current and old anonymization including:

  • Labels all tables as identifier instead of table
  • Includes using dots to access elements of table such as identifier.identifier instead of simplifying as just identifier
  • Capitalizes all functions
  • Adds a space between all nodes except for before and after a dot, and before a comma
  • Includes the negative symbol when anonymizing numbers, e.g. number * - number instead number * number

Issues Resolved

opensearch-project#821

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov
Copy link

codecov bot commented May 17, 2023

Codecov Report

❗ No coverage uploaded for pull request base (integ-sql-anonymizer@a924d5a). Click here to learn what that means.
The diff coverage is n/a.

@@                   Coverage Diff                   @@
##             integ-sql-anonymizer     #266   +/-   ##
=======================================================
  Coverage                        ?   97.17%           
  Complexity                      ?     4146           
=======================================================
  Files                           ?      372           
  Lines                           ?    10409           
  Branches                        ?      711           
=======================================================
  Hits                            ?    10115           
  Misses                          ?      287           
  Partials                        ?        7           
Flag Coverage Δ
sql-engine 97.17% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Signed-off-by: Matthew Wells <[email protected]>
Copy link

@Yury-Fridlyand Yury-Fridlyand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add IT

@GumpacG
Copy link

GumpacG commented May 23, 2023

Not sure if intended or if it's a big deal but I found a difference between legacy and and V2 anonymization.

Query: SELECT SUM(int0) FROM calcs
Legacy: SELECT SUM(identifier) FROM table
V2: SELECT SUM(identifier) FROM identifier

Note that calcs is anonymized as identifier instead of table.

@GumpacG
Copy link

GumpacG commented May 23, 2023

Could you explain what the AnonymizerListenerTest is testing? I'm not sure I see or understand what the expected results are.

@matthewryanwells
Copy link
Author

Not sure if intended or if it's a big deal but I found a difference between legacy and and V2 anonymization.

Query: SELECT SUM(int0) FROM calcs
Legacy: SELECT SUM(identifier) FROM table
V2: SELECT SUM(identifier) FROM identifier

Note that calcs is anonymized as identifier instead of table.

Because of the way that we are anonymizing there are a few differences between the new and legacy anonymizers, the main and most obvious one is that table is anonymized to identifier rather than to table, I mentioned that other differences in the ticket description.

I can look into if there is any possible way of changing this but currently it just looks at what kind of node it is and all identifiers and tables have the same ID node.

@matthewryanwells
Copy link
Author

Could you explain what the AnonymizerListenerTest is testing? I'm not sure I see or understand what the expected results are.

The anonymizerListener is an implementation of the ParseTreeListener interface, included in it is 4 different methods that I need to implement. Even though I only need the visitTerminal method I still have to include the other three and leave them blank (because I want them to do nothing). I was receiving Jacoco test coverage failures because I had no test for the visitErrorNode method so I needed to create a test file to create a test to test an empty function that doesn't return anything.

@GumpacG
Copy link

GumpacG commented May 23, 2023

Could you explain what the AnonymizerListenerTest is testing? I'm not sure I see or understand what the expected results are.

The anonymizerListener is an implementation of the ParseTreeListener interface, included in it is 4 different methods that I need to implement. Even though I only need the visitTerminal method I still have to include the other three and leave them blank (because I want them to do nothing). I was receiving Jacoco test coverage failures because I had no test for the visitErrorNode method so I needed to create a test file to create a test to test an empty function that doesn't return anything.

I wonder if there's a better way of doing this. A question to ask is why is visitErrorNode not covered while exitEveryRule is. The test just looks odd is all but maybe not a big deal.

@matthewryanwells
Copy link
Author

Could you explain what the AnonymizerListenerTest is testing? I'm not sure I see or understand what the expected results are.

The anonymizerListener is an implementation of the ParseTreeListener interface, included in it is 4 different methods that I need to implement. Even though I only need the visitTerminal method I still have to include the other three and leave them blank (because I want them to do nothing). I was receiving Jacoco test coverage failures because I had no test for the visitErrorNode method so I needed to create a test file to create a test to test an empty function that doesn't return anything.

I wonder if there's a better way of doing this. A question to ask is why is visitErrorNode not covered while exitEveryRule is. The test just looks odd is all but maybe not a big deal.

I certainly agree that it is very weird, I can spend some time looking into it

Signed-off-by: Matthew Wells <[email protected]>
…er, simplified identifier.identifier to identifier, and added IT

Signed-off-by: Matthew Wells <[email protected]>
matthewryanwells and others added 4 commits May 25, 2023 14:21
… method tests are conducted, added back in dot notation for identifiers

Signed-off-by: Matthew Wells <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
case TIMESTAMP:
case BACKTICK_QUOTE_ID:
if (previousType == FROM) {
anonymizedQueryString += "table";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this will only work for the first table. For example, if we have a join or alias, we will get identifiers.

e.g. FROM table1 as t1, table2 as t2

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this is fine... we don't need to fix this.

@matthewryanwells matthewryanwells merged commit 95a2cd8 into integ-sql-anonymizer May 29, 2023
@matthewryanwells matthewryanwells deleted the dev-sql-anonymizer branch May 29, 2023 22:34
Yury-Fridlyand pushed a commit that referenced this pull request May 30, 2023
* Create new anonymizer for new engine (#266)

* Created anonymizer listener for anonymizing SQL queries through the new engine
Signed-off-by: Matthew Wells <[email protected]>

* Update for review comments

Signed-off-by: Andrew Carbonetto <[email protected]>

* added missing file header, change public variable to private

Signed-off-by: Matthew Wells <[email protected]>

---------

Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Matthew Wells <[email protected]>
Co-authored-by: Andrew Carbonetto <[email protected]>
MitchellGale pushed a commit that referenced this pull request Jun 12, 2023
* Create new anonymizer for new engine (#266)

* Created anonymizer listener for anonymizing SQL queries through the new engine
Signed-off-by: Matthew Wells <[email protected]>

* Update for review comments

Signed-off-by: Andrew Carbonetto <[email protected]>

* added missing file header, change public variable to private

Signed-off-by: Matthew Wells <[email protected]>

---------

Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Matthew Wells <[email protected]>
Co-authored-by: Andrew Carbonetto <[email protected]>
MitchellGale pushed a commit that referenced this pull request Jun 12, 2023
* Create new anonymizer for new engine (#266)

* Created anonymizer listener for anonymizing SQL queries through the new engine
Signed-off-by: Matthew Wells <[email protected]>

* Update for review comments

Signed-off-by: Andrew Carbonetto <[email protected]>

* added missing file header, change public variable to private

Signed-off-by: Matthew Wells <[email protected]>

---------

Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Matthew Wells <[email protected]>
Co-authored-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Mitchell Gale <[email protected]>
MitchellGale pushed a commit that referenced this pull request Jun 21, 2023
…earch-project#1677)

* Create new anonymizer for new engine (#266)

* Created anonymizer listener for anonymizing SQL queries through the new engine
Signed-off-by: Matthew Wells <[email protected]>

* Update for review comments

Signed-off-by: Andrew Carbonetto <[email protected]>

* added missing file header, change public variable to private

Signed-off-by: Matthew Wells <[email protected]>

---------

Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Matthew Wells <[email protected]>
Co-authored-by: Andrew Carbonetto <[email protected]>
(cherry picked from commit 62120fd)

Co-authored-by: Matthew Wells <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants