Add another flexible way for JDBC paging (manual mode) #95

Merged
merged 10 commits into from
Dec 22, 2021

@dingx1 dingx1 commented Dec 10, 2021

Release notes

Added the jdbc_paging_mode option. Set it to explicit to write the pagination yourself in the statement and avoid the initial count query, or to auto to delegate pagination to the underlying library.

What does this PR do?

This commit introduces a new configuration option to avoid the initial count select statement that Sequel executes for paginated queries.
Paginated queries can now run with or without the initial count. When jdbc_paging_mode is set to explicit, the plugin executes paged queries until it reaches a page with fewer rows than the page size, instead of relying on the total row count.
In this case the SQL statement has to use the pagination keywords (like LIMIT and OFFSET) explicitly, receiving :offset and :size as implicit parameters.
When jdbc_paging_mode is set to auto, pagination happens automatically and the user does not need to write a paginated query.
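
The stop condition described above can be sketched in a few lines of Ruby. This is only an illustration of the idea, not the plugin's code: `read_all_pages` and `fetch_page` are hypothetical names, and `fetch_page` stands in for running the user's statement with :offset and :size bound.

```ruby
# Sketch only: `fetch_page` is a hypothetical callable that stands in for
# running the user's statement with :offset and :size bound.
def read_all_pages(page_size, fetch_page)
  rows = []
  offset = 0
  loop do
    page = fetch_page.call(offset, page_size)
    rows.concat(page)
    # A page shorter than page_size (including an empty one) is the last page.
    break if page.length < page_size
    offset += page_size
  end
  rows
end

# In-memory stand-in for a 16-row table, read 7 rows at a time.
table = (1..16).to_a
requested_offsets = []
fetch_page = lambda do |offset, size|
  requested_offsets << offset
  table[offset, size] || []
end

rows = read_all_pages(7, fetch_page)
```

No count query is needed here: the loop stops as soon as a page comes back short. When the row count is an exact multiple of the page size, the last request returns an empty page.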

Why is it important/What is the impact to the user?

In some circumstances, like a stored procedure call with pagination bounds, the initial count query is simply not meaningful.
When there are only a few pages to retrieve, the initial count query can be unnecessary overhead.
In some complex nested queries, the count can generate extra work that is not useful.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files (and/or docker env variables)
  • I have added tests that prove my fix is effective or that my feature works

Author's Checklist

  • check it with a local Logstash test.

How to test this PR locally

  • spin up a local DB
  • configure a JDBC input with pagination and the new jdbc_paging_mode option set to explicit, and check that it retrieves all the expected rows. The logs should contain one line for each paged query
input {
  jdbc {
    jdbc_driver_library => "/path/mysql-connector-java-8.0.26.jar"
    jdbc_driver_class => "Java::com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/test_logstash"

    jdbc_user => "user"
    jdbc_password => "s3cret"

    jdbc_default_timezone => "UTC"

    schedule => "* * * * *"

    use_column_value => true
    tracking_column => "log_id"
    tracking_column_type => "numeric"
    last_run_metadata_path => ".last_run"

    statement => "SELECT * FROM data_log WHERE log_id > :sql_last_value LIMIT :size OFFSET :offset"
    jdbc_paging_enabled => true
    jdbc_page_size => 7
    jdbc_paging_mode => "explicit"
  }
}

output {
  stdout {
    codec => "rubydebug"
  }
}

Related issues

Use cases

Simple query with small result sets

In a simple query like:

SELECT * from test_table

the automatic pagination would generate the count query plus the paginated queries.
With small result sets it issues two queries instead of one.
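
A rough way to compare the two modes is to count the statements each one issues. The sketch below rests on two assumptions: automatic mode runs one count query followed by one query per page, and explicit mode keeps fetching until a page comes back short. Both method names are hypothetical.

```ruby
# Assumption: automatic mode runs one count query, then one query per page.
def auto_query_count(total_rows, page_size)
  1 + (total_rows.to_f / page_size).ceil
end

# Assumption: explicit mode fetches until a page comes back short, so it
# needs one query per full page plus one final short (or empty) page.
def explicit_query_count(total_rows, page_size)
  total_rows / page_size + 1
end

puts auto_query_count(5, 100)      # → 2
puts explicit_query_count(5, 100)  # → 1
```

Under these assumptions the saving matters most when the whole result fits in a single page.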

A stored procedure that requires pagination information

Suppose you have a stored procedure like

CALL fetch_my_data(:sql_last_value, <offset>, <size>)

the automatic pagination would generate an invalid SQL statement:

SELECT count(*) AS count FROM ( CALL fetch_my_data(:sql_last_value) ) AS t1 LIMIT 1
SELECT * FROM ( CALL fetch_my_data(:sql_last_value) ) AS t1 LIMIT 10 OFFSET 0
.
.

which is not what's expected.

Nested queries with pagination on the inner query

Suppose the query you want to paginate is the inner one, as in:

SELECT * FROM test_table2 WHERE id IN (SELECT * FROM test_table LIMIT <offset>, <size>)

Without this feature, the automatic pagination would generate the following SQL:

SELECT count(*) AS count FROM ( 
  SELECT * FROM test_table2 WHERE id IN (SELECT * FROM test_table) 
) AS t1 LIMIT 1;

SELECT * FROM ( 
  SELECT * FROM test_table2 WHERE id IN (SELECT * FROM test_table) 
) AS t1 LIMIT 10 OFFSET 0;
.
.

which is the pagination applied to the outer query instead of the inner one.
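
With explicit placeholders, the pagination stays wherever the statement author puts it. The Ruby sketch below uses plain text substitution to show which statement each page would run. It is an illustration only: `render_page` is a hypothetical helper, and a real database library would bind :offset and :size as parameters rather than splice text.

```ruby
# Illustration only: plain text substitution to show the statement each page
# would run. A real driver binds :offset and :size as parameters instead.
def render_page(statement, offset, size)
  statement.gsub(":offset", offset.to_s).gsub(":size", size.to_s)
end

# Same nested query as above, with the placeholders in the inner query.
statement = "SELECT * FROM test_table2 WHERE id IN " \
            "(SELECT * FROM test_table LIMIT :offset, :size)"

# First three pages with a page size of 10.
pages = (0..2).map { |page_no| render_page(statement, page_no * 10, 10) }
puts pages
```

The outer query is left untouched on every page; only the inner LIMIT bounds change.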

@andsel andsel left a comment

LGTM

@andsel andsel added the enhancement New feature or request label Dec 13, 2021
@andsel andsel requested a review from karenzone December 13, 2021 09:25
@@ -55,6 +55,9 @@ def setup_jdbc_config
# Be aware that ordering is not guaranteed between queries.
config :jdbc_paging_enabled, :validate => :boolean, :default => false

# Whether to use manual mode during the JDBC paging
config :jdbc_paging_manual_mode, :validate => :boolean, :default => false

sorry for the late review, did you guys consider having better naming for the feature? e.g.

jdbc_paging_enabled => true
jdbc_paging_mode => offset

any chance this reads better (than "manual")?
(jdbc_paging_mode => count would be the default value)

should also keep things open for changing the paging strategy again (offset paging is a bit of an anti pattern although for mostly static data and LS' needs of at least once delivery it's sufficient), wdyt?

@kares I like your suggestion; defining a string value for the strategy is more readable than using a boolean flag.
However this feature doesn't cover only static data but also use cases where the Sequel automatic paging doesn't apply correctly, mostly:

  • procedure calls with paging limits
  • paging in inner queries in case of nested SQL queries

I see, maybe we need better naming there - haven't thought about it that much.
however "manual" just reads a bit weird - had no idea what that could mean SQL wise.

what I did is check Sequel's each_page and it's using a COUNT(...) initial query (the details on how the statement is transformed into a count is usually adapter specific), haven't checked any adapter specific overrides for each_page.

@kares to me the naming you are proposing is good, better than the boolean flag. So I would use it.

@kares @andsel @karenzone Thanks for your time. I also think this suggestion is good and will improve the PR accordingly.
I have a small question about the name.
Actually, this feature involves two changes:

  1. The paging condition is not added automatically; the user writes it explicitly.
  2. The count query is not run first.

The first one is the main purpose of this feature.
So, will users easily understand the two values 'count' and 'offset'?
If no one has other names to suggest, I will implement it according to the current suggestion.

Good point, we could use

  • automatic for jdbc_paging_mode => count and
  • explicit for jdbc_paging_mode => offset

@kares is it a better naming in your opinion?

those sound better. instead of automatic, maybe jdbc_paging_mode => auto, to choose the best available strategy based on the driver (Sequel does this). jdbc_paging_mode => explicit or manual both sound okay to me.

@andsel andsel requested a review from kares December 17, 2021 08:37
 # @yieldparam row [Hash{Symbol=>Object}]
-def perform_query(db, sql_last_value, jdbc_paging_enabled, jdbc_page_size)
+def perform_query(db, sql_last_value, jdbc_paging_enabled, jdbc_paging_manual_mode, jdbc_page_size)

would be nice if there was a new class as it's a different algorithm e.g. ExplicitStatementHandler instead of hacking the algorithm into the existing one.

nice idea, I'll come up with some commits to switch this

andsel commented Dec 21, 2021

@kares I've integrated your latest great suggestion; could you please take another look at this?

@andsel andsel force-pushed the feature/query_manual_paging branch from 2c21b3b to e456bf7 Compare December 21, 2021 14:45

@kares kares left a comment

LGTM, well done!

@andsel andsel merged commit 5055d64 into logstash-plugins:main Dec 22, 2021

andsel commented Dec 22, 2021

Many thanks @dingx1 for all of this 🥇 it was published with https://rubygems.org/gems/logstash-integration-jdbc/versions/5.2.0


dingx1 commented Dec 22, 2021

Wow, it's finally done. Great! ^_^
Thank you @andsel and everyone for your help!
