Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle multiple hosts in the connection string, where only one host h… #476

Closed
wants to merge 8 commits into from

Conversation

guynaor
Copy link

@guynaor guynaor commented Aug 21, 2022

…as writable session. Will skip read-only hosts

This small change adds a test to see if the failure of the connection is because the host is read-only, and the session has a read-write requirement.

With this change a connection string of the format: postgres://192.168.1.20:5432,192.168.2.30:5432/liba_app?target_session_attrs=read-write&sslmode=require will work correctly, even if the first host (192.168.1.20) is the read only host in a primary/secondary style cluster

@larskanis
Copy link
Collaborator

Thank you for working on this issue!

@@ -764,6 +764,10 @@ def new(*args)
# Seems to be no authentication error -> try next host
errors << err
return nil
elsif err.to_s =~ /.*session is read-only.*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this is language dependant. If the server is configured for a different language this condition will always fail.

Not sure how to solve is better especially that I'm on vacation currently, but I would read the libpq sources and the wireshark output to get an idea. Will do this when I'm back.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked deeper at the code, and the libpq code as well. I still can't find any flag or parameter at the connection that will tell us this is the case for the failure.
I do have a different idea, and would love your view on it: libpq knows how to handle this situation internally. It means that if we pass the whole URI to libpq connect, it will parse and handle it itself. Is this something you would consider as a solution?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I committed new code that does what I proposed before (send the URI based strings directly to libpq) to allow you to see how it can work. I tested and this works fine, as libpq knows how to handle multiple hosts with read-write requirements.
Would love to get your feedback on that, and extend it as a general way to connect

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small change to remove warnings from the tests.

Please note that many of the tests fail now, as I first want to make sure this change is accepted in principle, as it is a major change to how we call libpq. If it is, I will work on cleaning up the tests, and add deeper tests for the reda-only case in a cluster

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for handling multiple hosts one after another is the connect_timeout parameter. That parameter is not handled by libpq, when doing non-blocking connection, which is necessary for Fiber.scheduler compatibility. Also the address resolution is done in this step, which would be a blocking operation otherwise. So #459 fixed these two issues in pg-1.4.0.

I don't think it makes sense to revert to the behaviour of pg-1.3.x, that doesn't respect connect_timeout or losing Fiber.scheduler compatibility as in pg-1.2.x.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larskanis this latest version restored connection.rb to what it was, and now just has the additional tests for a read only and read-write sessions

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guynaor I also created a test case that checks the target_session_attrs parameter properly some days ago, but didn't have the time to push it. Will try to work on this today.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried again on my actual DB cluster, and it fails when the read-only server is the first on the list. Here is the output:

irb(main):003:0> c = PG::Connection.new "postgres://[email protected]:5432,172.63.80.8:5432/liba_app?target_session_attrs=read-write&sslmode=require"
Traceback (most recent call last):
       14: from /home/bc-runner/.rbenv/versions/2.6.10/bin/irb:23:in `<main>'
       13: from /home/bc-runner/.rbenv/versions/2.6.10/bin/irb:23:in `load'
       12: from /home/bc-runner/.rbenv/versions/2.6.10/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
       11: from (irb):3
       10: from (irb):3:in `rescue in irb_binding'
        9: from /home/bc-runner/.rbenv/versions/2.6.10/lib/ruby/gems/2.6.0/gems/pg-1.4.3/lib/pg/connection.rb:672:in `new'
        8: from /home/bc-runner/.rbenv/versions/2.6.10/lib/ruby/gems/2.6.0/gems/pg-1.4.3/lib/pg/connection.rb:721:in `connect_to_hosts'
        7: from /home/bc-runner/.rbenv/versions/2.6.10/lib/ruby/gems/2.6.0/gems/pg-1.4.3/lib/pg/connection.rb:721:in `each_with_index'
        6: from /home/bc-runner/.rbenv/versions/2.6.10/lib/ruby/gems/2.6.0/gems/pg-1.4.3/lib/pg/connection.rb:721:in `each'
        5: from /home/bc-runner/.rbenv/versions/2.6.10/lib/ruby/gems/2.6.0/gems/pg-1.4.3/lib/pg/connection.rb:735:in `block in connect_to_hosts'
        4: from /home/bc-runner/.rbenv/versions/2.6.10/lib/ruby/gems/2.6.0/gems/pg-1.4.3/lib/pg/connection.rb:735:in `each'
        3: from /home/bc-runner/.rbenv/versions/2.6.10/lib/ruby/gems/2.6.0/gems/pg-1.4.3/lib/pg/connection.rb:737:in `block (2 levels) in connect_to_hosts'
        2: from /home/bc-runner/.rbenv/versions/2.6.10/lib/ruby/gems/2.6.0/gems/pg-1.4.3/lib/pg/connection.rb:761:in `connect_internal'
        1: from /home/bc-runner/.rbenv/versions/2.6.10/lib/ruby/gems/2.6.0/gems/pg-1.4.3/lib/pg/connection.rb:609:in `async_connect_or_reset'
PG::ConnectionBad (connection to server at "172.63.107.80", port 5432 failed: session is read-only)

As you can see, it fails as the first IP is the replica IP

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larskanis I think I know where the problem is coming from on my real cluster, compared to the test DB. The real cluster machine will not respond with a connection OK, and then let us query it for read-write. It will fail to even create the connection. So the fix might be to actually do the test manually as I did in one of my revisions. But I'd like to hear your opinion on it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larskanis after digging into the connection code, libpq source, and protocol documentation, I finally understood the root cause of the issue. When a connection is established, the server sends a set of params to the client, which can be retrieved with param_status(). Two params are relevant for the read-only mode: in_hot_standby and default_transaction_read_only. In both cases, the connection is considered read-only and will fail to establish. At that stage, we can't check the param_status, as the connection was issued the finish call. In order to work around this, I added an instance variable which I initialize before we finish the connection. Similar to how @last_status is handled.
Let me know if this solution is in the right direction.

larskanis added a commit to larskanis/ruby-pg that referenced this pull request Oct 1, 2022
larskanis added a commit to larskanis/ruby-pg that referenced this pull request Oct 2, 2022
larskanis added a commit to larskanis/ruby-pg that referenced this pull request Oct 2, 2022
larskanis added a commit to larskanis/ruby-pg that referenced this pull request Oct 4, 2022
larskanis added a commit to larskanis/ruby-pg that referenced this pull request Oct 4, 2022
larskanis added a commit to larskanis/ruby-pg that referenced this pull request Oct 5, 2022
larskanis added a commit to larskanis/ruby-pg that referenced this pull request Oct 8, 2022
@larskanis
Copy link
Collaborator

Hi @guynaor! After thinking a bit more about this issue, I propose to solve it per #485. That means to rely on the connection startup handling of libpq instead of reinventing more of it's logic in ruby.

The connect_timeout parameter was the reason for switching to host-by-host iteration in ruby (introduced in pg-1.4), but I think timeout handling is good enough how it's implemented in #485 now.

@guynaor
Copy link
Author

guynaor commented Oct 10, 2022

Hi @larskanis,

Thanks for looking at it thoroughly! I agree that the solution of letting pqlib handle it is best. I think this will solve the issues I had, and will let the Ruby gem, follow changes in libpq automatically.

THanks again,

Guy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants