AWS RDS Aurora Serverless connection hangs #325

Closed
joelvh opened this issue Jan 21, 2020 · 42 comments

@joelvh

joelvh commented Jan 21, 2020

I've been using the pg gem in an AWS Lambda function, connecting to AWS RDS Aurora Serverless PostgreSQL. The connection would hang periodically when executing a query. It was not always the same query, but it would eventually start to hang consistently.

I wanted to see if anyone else has seen this behavior or if it might be something specific to my environment. I posted a stacktrace using Sequel here: jeremyevans/sequel#1660 (comment)

At first I thought it might be that the gems were compiled with the wrong version of the PostgreSQL client, but I don't think that is the issue. Maybe this is considered out-of-scope for the pg project, but for what it's worth, I'm using the lambci/lambda image to replicate the AWS Lambda environment.

My Dockerfile installs PostgreSQL and gems like this:

FROM lambci/lambda:build-ruby2.5

# RUN yum install -y postgresql postgresql-devel
RUN yum install -y wget
# See: https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-6-x86_64/
RUN wget https://yum.postgresql.org/10/redhat/rhel-6.9-x86_64/postgresql10-libs-10.7-1PGDG.rhel6.x86_64.rpm
RUN wget https://yum.postgresql.org/10/redhat/rhel-6.9-x86_64/postgresql10-10.7-1PGDG.rhel6.x86_64.rpm
RUN wget https://yum.postgresql.org/10/redhat/rhel-6.9-x86_64/postgresql10-devel-10.7-1PGDG.rhel6.x86_64.rpm
RUN rpm -ivh postgresql10-libs-10.7-1PGDG.rhel6.x86_64.rpm
RUN rpm -ivh postgresql10-10.7-1PGDG.rhel6.x86_64.rpm
RUN rpm -ivh postgresql10-devel-10.7-1PGDG.rhel6.x86_64.rpm
RUN gem update bundler
# See: https://github.com/stevenringo/lambda-ruby-pg-nokogiri/blob/master/Makefile#L51
RUN bundle config --local build.pg --with-pg-config=/usr/pgsql-10/bin/pg_config
RUN bundle config --local silence_root_warning true
ENV PATH "$PATH:/usr/pgsql-10/bin"
# ENV POSTGRES_LIB=/usr/pgsql-10/lib
# RUN ls -la /usr/lib64

ADD Gemfile /var/task/Gemfile
ADD Gemfile.lock /var/task/Gemfile.lock

RUN bundle install --standalone --path /var/task/vendor/bundle --clean
RUN mkdir -p /var/task/lib
RUN cp -a /usr/pgsql-10/lib/*.so.* /var/task/lib/

Could this be a case where the pg gem is having a hard time with an unusual response from AWS RDS Aurora Serverless?

Thanks in advance.

@joelvh changed the title from "AWS Aurora Serverless connection hangs" to "AWS RDS Aurora Serverless connection hangs" on Jan 21, 2020
@cbandy
Contributor

cbandy commented Jan 21, 2020

Both Lambda and Aurora Serverless sound like cases where connections should be handled carefully. In the Sequel issue, I see you are using a connection pool.

Questions that come to mind:

  1. Is the traffic to this Lambda function steady or sporadic?
  2. Is the traffic to this Aurora cluster steady or sporadic?
  3. What is happening at the Aurora cluster when the query hangs?
  4. How many connections are free/idle versus used in the connection pool?
  5. How many connections are idle versus active in the Aurora cluster?
  6. Are you using the Sequel connection validator?
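Regarding number 6: enabling it in Sequel is only a couple of lines. A minimal sketch, assuming the app connects through a Sequel DB handle (the DATABASE_URL name is just a placeholder):

require 'sequel'

DB = Sequel.connect(ENV.fetch('DATABASE_URL'))
DB.extension(:connection_validator)
# -1 = validate a connection every time it is checked out of the pool;
# the default only revalidates connections that have been idle for 3600 seconds.
DB.pool.connection_validation_timeout = -1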

@joelvh
Author

joelvh commented Jan 21, 2020

Hi @cbandy, thanks for the feedback. The traffic to the Lambda function was steady, and so the same goes for the Aurora cluster. However, I wasn't able to get much insight into the Aurora cluster and what was happening from its point of view as the connection hung.

I'll review our setup to double-check the areas you've mentioned. I hadn't heard of the Sequel connection validator. It does remind me that we tried several settings such as pool size and number of threads. We also set the timeout to around 5 seconds or so, yet (as Jeremy mentioned) no error was raised by pg that Sequel could handle, and the request timed out at 30 seconds (the Lambda max).

Do you think that using dead pooled connections would cause those symptoms where pg hangs and doesn't throw some kind of error?

@cbandy
Contributor

cbandy commented Jan 21, 2020

However, I wasn't able to get much insight into the Aurora cluster and what was happening from that point of view as the connection hung.

You'll need to go to Aurora separately to see its view of the world, e.g. is it scaling at that moment? Does it report fewer than the maximum number of connections? Is the connection in question visible there? If so, is it idle, active, or waiting?
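As a rough sketch of what to look for, something like this run from a second connection shows each session's state while the client is stuck (host/db/user are placeholders, and it assumes Aurora exposes the stock PostgreSQL 10 pg_stat_activity view):

require 'pg'

admin = PG.connect(host: host, dbname: db, user: user)
admin.exec(<<~SQL).each { |row| p row }
  SELECT pid, state, wait_event_type, wait_event,
         now() - query_start AS running_for
    FROM pg_stat_activity
   WHERE datname = current_database()
SQL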

Do you think that using dead pooled connections would cause those symptoms where pg hangs and doesn't throw some kind of error?

I think so, but I don't have experience with this exact setup. Narrowing it down depends on what the other end of the connection sees/believes.

@joelvh
Author

joelvh commented Jan 21, 2020

@cbandy I'll work on testing things out a bit more and report back. Thanks for the help!

@spikex

spikex commented Jan 24, 2020

I don't have any more details to add yet, but I can confirm we're seeing this same thing on two different projects. Both are Rails apps running on EC2 instances, not Lambda.

@copiousfreetime

I work with @spikex. I was able to get a somewhat repeatable example for a few hours on one of our customers' systems (ActiveRecord 6, pg 1.1.4), and it was a select that returned 3 columns from all rows in the table. Say I had a table t with exactly 50,000 rows:

  • if the AR query resolved to select a,b,c from t then it would hang,
  • if the AR query resolved to select a,b,c from t limit 49999 it would return fine
  • if the AR query resolved to select a,b,c from t limit 50000 it would hang.

I did all of these via AR in irb, and then did the resulting query via the raw ruby PG connection and got the same results.

Doing the queries via psql from the same client machine as the rails machine resulted in successes on all fronts.

Unfortunately, a couple of hours later, while attempting to get trace data, the hanging behavior stopped on my test table. It didn't happen for a few days, and now it is coming back on a different table. From my perspective, this is turning into a heisenbug. I'm waiting for this to happen consistently again and then I'll be doing a full trace, and possibly a tcpdump.

From the Aurora Serverless perspective, this is an over-allocated cluster with very low load, and currently I'm not allowing it to change size so that we can rule out a scaling event as the cause.

I know this doesn't really help all that much at the moment, I'll have more information when I can get a proper capture. Just letting folks know that there's definitely something here.

@spikex

spikex commented Feb 11, 2020

One additional data point. On our other project, I migrated the client from Serverless to Servered (Serverful? Severicious?) and the problem went away.

@joelvh
Author

joelvh commented Feb 11, 2020

@marcilio @danilop I've seen you guys write about Aurora Postgres Serverless at AWS - have you heard of this kind of issue, given that it may be the serverless flavor of Aurora causing our problem? I hope it's not a reach to pull you guys in on a potentially Ruby-specific issue. Thanks!

@larskanis
Collaborator

larskanis commented Feb 11, 2020

@copiousfreetime Thank you for sharing the details of your investigation! I don't have a good idea so far, maybe you could try to use the synchronous query API per PG::Connection#sync_exec instead of just exec the next time. Just in case you get a chance to do so.

@copiousfreetime

@larskanis After a long time of not happening, it is happening again right now. I have a query that is supposed to return 440 rows, and I've managed to get a solid pg trace using PG::Connection#trace. Since there are some email addresses and other customer data in the traces, I've created a private repo to store them and invited you, @cbandy and @spikex to the repo to take a look if you want.

The main results are:

the query

This query should return 440 rows of data

    SELECT "form_submissions".*
      FROM "form_submissions"
     WHERE "form_submissions"."type" = 'ObservationSubmission'
       AND (processed_at is not null)

Using the following script I attempted to run the query using sync_exec, async_exec, and async_exec_params, all of those with and without an ORDER BY created_at clause.

#!/usr/bin/env ruby

sql = "SELECT \"form_submissions\".* FROM \"form_submissions\" WHERE \"form_submissions\".\"type\" = 'ObservationSubmission' AND (processed_at is not null)"

ar_connection = ObservationSubmission.connection
pg_connection = ar_connection.instance_variable_get("@connection") # PGConnection

# hangs
# method = "async_exec_params"
# method = "async_exec"

# works fine
method = "sync_exec"

File.open("#{method}.trace", "w+") do |f|
  # f.sync = true
  pg_connection.trace(f)
  result = pg_connection.send(method, sql)
  i = 0
  rows = []
  result.each do |row|
    i += 1
    rows << row.values
  end

  puts "#{i} done"
end

A ruby-pg trace was captured for each run.

results

async_exec_params is not included, as its trace turned out to be byte-for-byte equivalent to the async_exec trace.

method       order by   result
sync_exec    no         440 rows
sync_exec    yes        440 rows
async_exec   no         hang
async_exec   yes        440 rows

oddity

Since I had to kill the async_exec process, I thought to turn off file buffering on the trace file. When I did this and added f.sync = true (line 16 in trace.rb) to the File that was receiving the pg trace, the async_exec worked correctly and returned 440 rows.

This was repeatable: I took the file sync out and the async_exec would hang; put it back in and it would work fine.

thoughts

The differences between the async and sync traces weren't large; mostly it was the last row that just got cut off, which to me seems like a buffering issue of some type.

@copiousfreetime

copiousfreetime commented Mar 22, 2020

Update on the sync/async/order by: neither the ORDER BY nor the File sync appears to be an indicator. I have a different query, exactly the same type as above, with an order by on it, and it has started to hang now when it wasn't hanging earlier. This new query does not hang consistently in isolation.

@cbandy
Contributor

cbandy commented Mar 24, 2020

Hmm. All of the traces appear cut off. Perhaps there needs to be an f.flush or f.fsync at the end of the block (before the file is closed).
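Something along these lines would rule out Ruby-side buffering of the trace file; just a sketch, reusing pg_connection and sql from the script above:

File.open("async_exec.trace", "w") do |f|
  f.sync = true                    # write each trace line straight through to the OS
  pg_connection.trace(f)
  result = pg_connection.async_exec(sql)
  puts "#{result.ntuples} rows"
  pg_connection.untrace            # detach the trace stream before the file is closed
  f.flush                          # and flush anything still sitting in Ruby's buffer
end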

Each of the traces ends in the middle of a column value of a DataRow message. For example, this is a 4-byte length of 1546 then 1546 bytes of the value:

From backend (#4)> 1546
From backend (1546)> ...

Unless I'm misunderstanding, the last trace line comes from here at which point the data has already been read off the wire and the entire value should be visible in the trace file.

@aniketschneider

aniketschneider commented May 22, 2020

I'm pretty sure I'm experiencing this issue as well. I've tested with pg 0.19 and 1.2.3, talking to aurora serverless cluster (postgres 10.7 compatibility) and a postgres 10.7 RDS server instance for comparison.

I encountered the issue using a workflow that uses a cursor to scan through different tables 1000 rows at a time, and for a number of larger tables it will scan through a chunk of the table before stalling on a particular slice of rows. This is generally reproducible, so if it stalls on rows 171000 to 172000, it will stall on the same set of rows if I start the cursor at offset 170000 instead. Oddly, starting directly at offset 171000 sometimes fails and sometimes doesn't.

I've pg_dumped out tables and loaded them into a different database on the same cluster and the behavior is the same on the copy. Adding or removing indexes didn't appear to change anything either. As a side oddity though, renaming certain columns will make the workflow hang on a different set of 1000 rows (so at offset 367000 instead of 171000, for example).

I've compared traces for the postgres server and the serverless cluster and didn't observe any significant differences (the table ID was different, and certain single-letter protocol messages were repeated different numbers of times but I didn't see a pattern there). Note that these were traces after being interrupted using SIGINT, and with File.sync = true. When I kill -9 instead, the trace is cut off after the From in From backend in the middle of a row. Unfortunately I've had no luck replicating the issue with artificial data, so I can't easily share the trace files, but if you're interested I can try to find a way to sanitize them.

I've tested with sync_exec vs async_exec, and at first I thought that the problem was only with async_exec, but after attempting to monkey-patch Sequel to use sync_exec I discovered that the workflow using sync_exec also stalls, just on different offsets in each table.

@cbandy
Contributor

cbandy commented May 23, 2020

@aniketschneider This sounds a little bit like it stalls after the same number of bytes transferred. What happens if you try a query like the following? It returns 1 GiB of text as 1 MiB rows.

SELECT rpad('',1024*1024,'x') FROM generate_series(1,1024)

@aniketschneider

@cbandy I thought I saw that query hang the first couple times I tried it, but either I didn't wait long enough for it to complete or it doesn't fail consistently, because now it's succeeding every time I run it.

However, when I use a cursor to scan through the results of that query, I can get it to fail:

require 'pg' # v 1.2.3

# removed connection info

conn = PG.connect(
  host: host,
  dbname: db,
  user: user
)

batch_size = ARGV[0]

sql = "SELECT rpad('',1024*1024,'x') FROM generate_series(1,1024)"

cursor_name = "bug_replication_cursor"

File.open("trace.out", "w") do |f|
  f.sync = true
  conn.trace(f)

  conn.transaction do |conn|
    conn.exec("declare #{cursor_name} no scroll cursor without hold for #{sql}")
  
    slice_num = 0
    begin
      slice = conn.async_exec("fetch forward #{batch_size} from #{cursor_name}")
      puts "Slice #{slice_num}, #{slice.count} rows"
      slice_num += 1
    end while slice.any?
  end
end

I tried this with various batch sizes:

Batch size   Hangs before completion?
1 to 6       no
7 to 8       yes
9            no
10 to 15     yes
16           no
17 to 19     yes
128          yes
256          no

Hard to find a pattern there; on the plus side, this is entirely artificial data, so perhaps it can be replicated on other clusters.

@cbandy
Contributor

cbandy commented May 26, 2020

When a particular batch size hangs, does it consistently/repeatedly do so on a particular slice_num?

@aniketschneider

Sort of. With a batch size of 8, for example, I am seeing it hang on anywhere from slice_num 1 to 4, but I've never seen it get higher than that.

This doesn't really line up with how I saw it happen on our ETL data, where it seemed like it was always hanging on the same slice_num.

@aniketschneider

Just to eliminate some potential false signals here, I've stripped out the transaction and the cursor and am seeing roughly the same behavior. I say "roughly" because this version seems to have a slightly different set of batch sizes for which it does not hang; for example, it does not hang on batch size 4, but does hang on 3.

require 'pg'

# removed connection info

conn = PG.connect(
  host: host,
  dbname: db,
  user: 'postgres'
)

if ARGV.length >= 1
  batch_size = ARGV[0]
end

sql = "SELECT rpad('',1024*1024,'x') FROM generate_series(1,#{batch_size})"

File.open("trace.out", "w") do |f|
  f.sync = true
  conn.trace(f)

  slice_num = 0
  begin
    slice = conn.async_exec(sql)
    puts "Slice #{slice_num}, #{slice.count} rows"
    slice_num += 1
  end while slice.any?
end

@cbandy
Contributor

cbandy commented May 26, 2020

Another test could be streaming / single-row-mode: https://www.rubydoc.info/gems/pg/PG/Connection#set_single_row_mode-instance_method
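A rough sketch of that test, using only pg gem calls and the conn/sql from the scripts above; in single-row mode each PG::Result carries at most one row, so the client never has to buffer a large result set:

conn.send_query(sql)
conn.set_single_row_mode
rows = 0
while (result = conn.get_result)
  result.check              # raises if the server reported an error
  rows += result.ntuples    # 1 per row, 0 for the final summary result
  result.clear
end
puts "#{rows} rows"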

@krazyjakee

krazyjakee commented Jun 24, 2020

In our testing...

conn = PG::Connection.open(...)

conn.exec(query) # doesn't work (hangs)

conn = PG::Connection.open(...)

conn.send_query query
conn.get_result # works (doesn't hang)

conn = PG::Connection.open(...)

conn.send_query query
conn.block # doesn't work (hangs)

So the issue is in .block where wait_socket_readable is called.

I believe this is the same issue as #297.

@brenes

brenes commented Jun 29, 2020

Not sure if this can give any clue for the developers to debug the issue, but here is my experience with Ruby and Aurora Serverless.

We have a Rails 4 application with pg v0.18.4. The app has been active for 5 years without an issue in its connection to RDS PostgreSQL.

The moment we migrated to Aurora Serverless we started to notice the behaviour described in this issue. Some queries just hung and never returned any result. However, the Aurora logs indicated no error and all the queries were being resolved in good time.

We wrote a dirty ActiveRecord extension to deal with this situation: https://gist.github.com/brenes/ca6a45c1399e413642cafa87f1c8b19d

After this we started to see a different behaviour. Every time the database tried to scale (up or down), every query froze for some seconds (between 5 and 30 seconds and, in a few cases, up to 5 minutes). Searching the logs for the message we set in the gist, we saw an almost perfect match, but we are not sure whether those queries would have hung forever or would have returned a result through async_exec.

Speaking with Amazon Support, they confirmed to us how Aurora Serverless works:

  1. They have a proxy in front of the database where all the connections are established.
  2. Once they decide a scale (up or down) is required, every query sent enters a queue where it waits for the scaling to happen.
  3. The database tries to find a point where scaling is safe (i.e. no active query is modifying the database).
  4. The database scales.
  5. All the queries in the queue are resolved.

Points 3 and 4 are what take up most of those 5-second to 5-minute pauses.

Maybe all this proxy dance is what is causing the issues with the async_exec method? It could be that some of the queries used as examples above trigger a scale-up operation (due to CPU use, maybe?) and that creates the problem.

Sorry if all this is not quite helpful and thanks for the hard work!

@krazyjakee

krazyjakee commented Jun 30, 2020

@brenes Thanks for the write-up here. Your extension is very interesting as we also had trouble scaling up or down and it's now clear why.

Your symptom is equally valid as a bug. I believe what you're describing is a different symptom than ours, but of the same underlying issue.

@sprite2005

sprite2005 commented Jul 11, 2020

I am having the same issues, running the latest pg gem with the latest Rails. I also tried other versions.

Here is one query that hangs. It only hangs for specific combinations of LIMIT and OFFSET. For example, increasing or decreasing either LIMIT or OFFSET by one makes it work fine. One odd thing is that if I try the Photo Load by itself, it loads fine.

irb(main):015:0> User.includes(:photos).order(:display_name, :created_at).page(11).per(25).each { |e| puts "HANGING" }
User Load (10.7ms) SELECT "users".* FROM "users" ORDER BY "users"."display_name" ASC, "users"."created_at" ASC LIMIT $1 OFFSET $2 [["LIMIT", 25], ["OFFSET", 250]]
^C Photo Load (463863.1ms) SELECT "photos".* FROM "photos" WHERE "photos"."user_id" IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25) ORDER BY "photos"."created_at" ASC [["user_id", "376971a1-1444-4bd7-8b07-ea87ff76ee69"], ["user_id", "d81c0d84-be0f-410f-bf93-0164d56d85ce"], ["user_id", "04347252-77e0-4582-967b-ded4e147ca08"], ["user_id", "dced8a64-629e-45f7-b04a-eb9b87824d42"], ["user_id", "5c83951c-cee7-454f-844c-67eb7a889f4d"], ["user_id", "2584154a-1dcc-4ecc-9370-2eb6bcf2be24"], ["user_id", "1bd3ea4e-e25d-4ebd-8b2f-29c2af4a2db5"], ["user_id", "70279c04-ef33-402b-adfc-8e079d603f64"], ["user_id", "3271e9eb-fb82-4f13-8c56-d8a5f11ae16f"], ["user_id", "9696f1d4-dc8f-4649-a8af-9caca6b0fb5d"], ["user_id", "4719fddb-15a2-4f25-919f-a2936a01b821"], ["user_id", "edc76477-46bc-4172-a2cf-5881e9224229"], ["user_id", "1717b022-62ca-4793-a5d3-2686669f26e4"], ["user_id", "03ea9be9-1331-4f0a-9c05-95d542c03ddf"], ["user_id", "b79b1b80-0734-4674-b152-98e5982d24dc"], ["user_id", "62afa220-6e37-42d2-9bda-b16e09180f16"], ["user_id", "584f9836-ddaf-4231-b613-1aee5c0c6bab"], ["user_id", "6f168913-e6c5-4fa9-9d5b-d4c4a0d6eb9c"], ["user_id", "3425484b-657d-44e1-ba9d-dae659523a49"], ["user_id", "b0e87545-16ba-4462-b41e-017704f31723"], ["user_id", "7b79628b-61ec-401d-9f72-05187cad3e9e"], ["user_id", "9feb7ac2-e91f-4886-a7a6-a02f2ffe0a91"], ["user_id", "f6c5af79-f5e0-448a-ae4e-d5dafdc11f54"], ["user_id", "b95f91a9-0e6c-48fa-b4f4-03087cf77783"], ["user_id", "f16203b4-27c3-4561-8d3b-3129502bc528"]]
Traceback (most recent call last):
2: from (irb):15
1: from (irb):15:in `rescue in irb_binding'
IRB::Abort (abort then interrupt!)

PG will just hang forever. It does not throw an error.

If I set PG::Connection.async_api = false the query completes as expected.

@fitzee

fitzee commented Jul 16, 2020

Possibly the same issue here, but not using Ruby. I am using SQLAlchemy (Python) and the aurora-data-api extensions for pg. For the most part, I can query fine. Periodically the simplest of statements hang and subsequently time out (maybe 5% of all operations, not restricted to any one of select, insert, update, etc.).

Access patterns and volumes are very predictable: a rule fires every 2 hours and runs a Lambda function that truncates a table and then inserts ~5000 records (in batches). That's it, no queries, just inserts. On the backside, I have QuickSight querying these datasets at around the same cadence (every 2 hours or so).

This was working perfectly until around May 20th. Prior to then, not a single timeout - everything else related to volume/traffic had not changed either. Logs, Cloudwatch etc. don't seem to be useful. Really odd.

@aniketschneider

aniketschneider commented Jul 27, 2020

We've been running in circles on this on and off for a couple months, and got halfway to switching to a non-serverless solution, when we found that turning off SSL on the client connection made the stalling behavior disappear (that is, connecting with the sslmode: 'disable' option). That's probably not a reasonable step to take for every environment where this is happening, but hopefully it will be helpful for someone!
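For anyone who wants to try the same thing with the raw gem, it is the standard libpq parameter; a sketch with placeholder connection details (and only sensible on a network you trust, since it turns TLS off entirely):

require 'pg'

conn = PG.connect(
  host:    host,
  dbname:  db,
  user:    user,
  sslmode: 'disable'   # libpq setting; skips TLS and, with it, the stalls described above
)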

@sprite2005

We've been running in circles on this on and off for a couple months, and got halfway to switching to a non-serverless solution, when we found that turning off SSL on the client connection made the stalling behavior disappear (that is, connecting with the sslmode: 'disable' option). That's probably not a reasonable step to take for every environment where this is happening, but hopefully it will be helpful for someone!

If you don't want to disable SSL you can add 'PG::Connection.async_api = false' in an initializer. It fixed the problem for me.
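In a Rails app that can be a one-line initializer; the file name below is just a suggestion:

# config/initializers/pg_sync_api.rb
require 'pg'

PG::Connection.async_api = false  # makes #exec and friends use the synchronous libpq calls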

@krazyjakee

@sprite2005 This is undocumented so can you tell me what are the consequences of doing this?

Thanks

@sprite2005

@sprite2005 This is undocumented so can you tell me what are the consequences of doing this?

Thanks

It will use sync_exec instead of async_exec. I've been running with async_api = false in production and on Aurora Serverless for a few months now and have not had any issues since doing that. Docs state this:

#exec is an alias for #async_exec which is almost identical to #sync_exec . #sync_exec is implemented on the simpler synchronous command processing API of libpq, whereas #async_exec is implemented on the asynchronous API and on ruby's IO mechanisms. Both methods ensure that other threads can process while waiting for the server to complete the request, but #sync_exec blocks all signals to be processed until the query is finished. This is most notably visible by a delayed reaction to Control+C. It's not recommended to use explicit sync or async variants but #exec instead, unless you have a good reason to do so.

@davidfromdk

davidfromdk commented Oct 19, 2020

@joelvh: What did you end up doing? Did you figure out a solution using Sequel? Could you elaborate a bit?

Thanks!

@joelvh
Author

joelvh commented Oct 19, 2020

@davidfromdk I ended up using an RDS Aurora instance - abandoned the Serverless flavor for the time being. Have been tracking this thread, but haven't yet had an opportunity to test the latest findings.

@spikex

spikex commented Oct 19, 2020

@davidfromdk We switched to "non-Serverless" as well.

@krazyjakee

@davidfromdk sorry to not provide more information but we also switched to non-serverless. It resolved many, many issues for us.

@davidfromdk

Thanks for the elaboration everybody 😄 🚀

@l33z3r

l33z3r commented Nov 12, 2020

Hey guys, I think it is due to a known limitation of serverless. See https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.html#aurora-serverless.limitations

In particular:

Currently, certain SQL queries and commands, such as cursors with hold (DECLARE name ... CURSOR WITH HOLD FOR query); session-level advisory locks; temporary relations; and asynchronous notifications (LISTEN, NOTIFY commands) prevent scaling and are not recommended. For more information, see Autoscaling for Aurora Serverless.

Would that make sense?

@joelvh
Author

joelvh commented Dec 2, 2020

@l33z3r that's a good find. Looks like they've been expanding that list of limitations fairly recently based on the history of those page edits: https://github.com/awsdocs/amazon-aurora-user-guide/blame/master/doc_source/aurora-serverless.md

I can't speak to whether the listed limitations overlap with the findings in this thread. But maybe someone with deeper knowledge of pg, or possibly @jeremyevans, will see something jump out at them here.

@jeremyevans
Contributor

Unfortunately, I don't think I can help here, as I don't have any experience with Aurora. Unless this issue can be reproduced on stock PostgreSQL, it sounds like an Aurora specific bug and I think this issue should be closed.

@joelvh
Author

joelvh commented Dec 2, 2020

Thanks @jeremyevans - I think the hope is that someone with deeper knowledge of pg or of how the Sequel gem works might have an idea of what settings to tweak to side-step some of the listed limitations and see if that allows use of Aurora.

Aside from that, it seems the best we can do is try and track the Aurora docs (which seem to finally have more detail) and keep giving that service a try in the future.

@ka19

ka19 commented Dec 2, 2020

It looks like the issue is with the size of the SSL frame. Whenever the query response ends with a larger SSL frame (>8 KB), the client gets stuck. That's because in the loop below, PQconsumeInput only reads a maximum of 8 KB of decrypted data, but OpenSSL reads the entire SSL frame into memory in order to decrypt it. is_readable then returns false because the entire response has not yet been read into the libpq buffer, and the loop goes on to wait for more data to arrive on the socket, which never arrives because OpenSSL has already read all of it.

This doesn't happen with stock PostgreSQL because stock PostgreSQL only sends a maximum of 8 KB of data in a single SSL frame.

Also, it seems this issue no longer reproduces for me. Does it for anyone else?

    while ( !(retval=is_readable(conn)) ) {
        if ( ptimeout ) {
            gettimeofday(&currtime, NULL);
            timersub(&aborttime, &currtime, &waittime);
        }

        /* Is the given timeout valid? */
        if( !ptimeout || (waittime.tv_sec >= 0 && waittime.tv_usec >= 0) ){
            /* Wait for the socket to become readable before checking again */
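            /* Per the analysis above: when the rest of a large SSL frame has already
             * been decrypted and buffered inside OpenSSL rather than left on the
             * socket, the socket never becomes readable again and this wait blocks
             * forever. */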
            ret = rb_wait_for_single_fd( sd, RB_WAITFD_IN, ptimeout ? &waittime : NULL );
        } else {
            ret = 0;
        }

        if ( ret < 0 ){
            rb_sys_fail( "rb_wait_for_single_fd()" );
        }

        /* Return false if the select() timed out */
        if ( ret == 0 ){
            return NULL;
        }

        /* Check for connection errors (PQisBusy is true on connection errors) */
        if ( PQconsumeInput(conn) == 0 ){
            rb_raise( rb_eConnectionBad, "PQconsumeInput() %s", PQerrorMessage(conn) );
        }
    }

@sprite2005

@krazyjakee did you have to migrate via snapshot or is there a better way to do it? I'm looking to get off serverless also but hoping to avoid downtime.

@larskanis
Collaborator

It seems that this issue has been fixed on the Aurora side.

@davidfromdk

davidfromdk commented Oct 13, 2022

@larskanis: Out of curiosity. How do you know? :-)

Thanks!

@larskanis
Collaborator

@davidfromdk Just a guess, since the comment above says it's no longer reproducible and no one has complained since then.
