AWS RDS Aurora Serverless connection hangs #325
Both Lambda and Aurora Serverless sound like cases where connections should be handled carefully. In the Sequel issue, I see you are using a connection pool. Questions that come to mind:
Hi @cbandy, thanks for the feedback. The traffic to the Lambda function was steady, and so was the traffic to the Aurora cluster. However, I wasn't able to get much insight into the Aurora cluster and what was happening from its point of view while the connection hung. I'll review our setup to double-check the areas you've mentioned. I hadn't heard of the Sequel connection validator. It does remind me that we tried several settings, such as pool size and number of threads. We also set the timeout to around 5 seconds or so, yet (as Jeremy mentioned) no error was raised. Do you think that using dead pooled connections would cause these symptoms?
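For reference, a minimal sketch of enabling the Sequel connection_validator extension mentioned above, assuming the stock Sequel extension (the connection URL is a placeholder):

require 'sequel'

# Placeholder connection string; substitute the real Aurora endpoint and credentials.
DB = Sequel.connect('postgres://user:pass@aurora-endpoint/dbname')

# Validate pooled connections before handing them out.
DB.extension(:connection_validator)

# -1 validates on every checkout instead of only after an idle period.
DB.pool.connection_validation_timeout = -1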
You'll need to go to Aurora separately to see its view of the world: e.g., is it scaling at that moment? Does it report fewer than max connections? Is the connection in question visible there? If so, is it idle, active, or waiting?
I think so, but I don't have experience with this exact setup. Narrowing it down depends on what the other end of the connection sees/believes.
@cbandy I'll work on testing things out a bit more and report back. Thanks for the help!
I don't have any more details to add yet, but I can confirm we're seeing this same thing on two different projects. Both are Rails apps running on EC2 instances, not Lambda.
I work with @spike. I was able to get a somewhat repeatable example for a few hours on one of our customers' systems (ActiveRecord 6, pg 1.1.4); it was a select that returned 3 columns from all rows in the table. Say I had a table t with exactly 50,000 rows:
I did all of these via AR in irb, and then ran the resulting query via the raw ruby PG connection and got the same results. Doing the queries via [...]

Unfortunately, a couple of hours later, while attempting to get trace data, the hanging behavior stopped on my test table. It didn't happen for a few days, and now it is coming back on a different table. From my perspective, this is turning into a heisenbug. I'm waiting for this to happen consistently again, and then I'll be doing a full trace, and possibly a tcpdump.

From the Aurora Serverless perspective, this is an over-allocated cluster with very low load, and currently I'm not allowing it to change size so that we can rule a scaling event out as the cause. I know this doesn't really help all that much at the moment; I'll have more information when I can get a proper capture. Just letting folks know that there's definitely something here.
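A hypothetical sketch of the kind of comparison described above: the same three-column select run through ActiveRecord and through the raw PG::Connection underneath it (the table, column names, and connection details are made up, not the original query):

require 'active_record'
require 'pg'

# Placeholder connection settings.
ActiveRecord::Base.establish_connection(
  adapter: 'postgresql', host: 'aurora-endpoint', database: 'mydb', username: 'me'
)

sql = 'SELECT a, b, c FROM t'  # hypothetical 3-column select over all 50,000 rows

# Via ActiveRecord.
result = ActiveRecord::Base.connection.select_all(sql)
puts "AR: #{result.rows.size} rows"

# Via the raw ruby-pg connection underneath ActiveRecord.
pg = ActiveRecord::Base.connection.raw_connection
res = pg.async_exec(sql)       # the async path AR uses; this is where the hang was reported
puts "pg: #{res.ntuples} rows"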
One additional data point. On our other project, I migrated the client from Serverless to Servered (Serverful? Severicious?) and the problem went away.
@copiousfreetime Thank you for sharing the details of your investigation! I don't have a good idea so far; maybe you could try to use the synchronous query API per [...]
@larskanis After a long time of not happening, it is happening again right now. I have a query that is supposed to return 440 rows, and I've managed to get a solid pg trace. The main results are below.

The query: this query should return 440 rows of data.
Using the following script, I attempted to run the query using both the sync and async APIs.
A ruby-pg trace was captured for each run.

Results:
Oddity: since I had to kill the async_exec process, I thought to turn off file buffering on the trace file. When I did this and added file sync, the query completed. This was repeatable: I took the file sync out and async_exec would hang; put it back in and it would work fine.

Thoughts: the trace differences between the async and the sync runs weren't a whole lot; mostly it was just the last row that got cut off, which to me seems like a buffering issue of some type.
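A hypothetical reconstruction of that kind of comparison script, tracing the same query through both sync_exec and async_exec with file buffering disabled (connection details and the query file are placeholders, not the original script):

require 'pg'

# Placeholders for the real connection info and the 440-row query.
conn = PG.connect(host: 'aurora-endpoint', dbname: 'mydb', user: 'me')
sql  = File.read('the_query.sql')

%w[sync_exec async_exec].each do |method|
  File.open("trace_#{method}.out", "w") do |f|
    f.sync = true   # per the oddity above: with file sync on, async_exec completed
    conn.trace(f)   # write the libpq protocol trace to the file
    res = conn.public_send(method, sql)
    puts "#{method}: #{res.ntuples} rows"
  end
  conn.untrace
end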
Update on the sync/async/order by - the [...]
Hmm. All of the traces appear cut off. Perhaps there needs to be an explicit flush. Each of the traces ends in the middle of a column value of a DataRow message. For example, this is a 4-byte length of 1546, then 1546 bytes of the value:
Unless I'm misunderstanding, the last trace line comes from here, at which point the data has already been read off the wire and the entire value should be visible in the trace file.
I'm pretty sure I'm experiencing this issue as well. I've tested with pg 0.19 and 1.2.3, talking to an Aurora Serverless cluster (postgres 10.7 compatibility) and a postgres 10.7 RDS server instance for comparison.

I encountered the issue using a workflow that uses a cursor to scan through different tables 1000 rows at a time, and for a number of larger tables it will scan through a chunk of the table before stalling on a particular slice of rows. This is generally reproducible, so if it stalls on rows 171000 to 172000, it will stall on the same set of rows if I start the cursor at offset 170000 instead. Oddly, starting directly at offset 171000 sometimes fails and sometimes doesn't.

I've pg_dumped out tables and loaded them into a different database on the same cluster, and the behavior is the same on the copy. Adding or removing indexes didn't appear to change anything either. As a side oddity, though, renaming certain columns will make the workflow hang on a different set of 1000 rows (so at offset 367000 instead of 171000, for example).

I've compared traces for the postgres server and the serverless cluster and didn't observe any significant differences (the table ID was different, and certain single-letter protocol messages were repeated different numbers of times, but I didn't see a pattern there). Note that these were traces after being interrupted using SIGINT, and with [...]. I've tested with [...]
@aniketschneider This sounds a little bit like it stalls after the same number of bytes transferred. What happens if you try a query like the following? It returns 1 GiB of text as 1 MiB rows.

SELECT rpad('',1024*1024,'x') FROM generate_series(1,1024)
@cbandy I thought I saw that query hang the first couple of times I tried it, but either I didn't wait long enough for it to complete or it doesn't fail consistently, because now it's succeeding every time I run it. However, when I use a cursor to scan through the results of that query, I can get it to fail:

require 'pg' # v 1.2.3
# removed connection info
conn = PG.connect(
  host: host,
  dbname: db,
  user: user
)

batch_size = ARGV[0]
sql = "SELECT rpad('',1024*1024,'x') FROM generate_series(1,1024)"
cursor_name = "bug_replication_cursor"

File.open("trace.out", "w") do |f|
  f.sync = true
  conn.trace(f)
  conn.transaction do |conn|
    conn.exec("declare #{cursor_name} no scroll cursor without hold for #{sql}")
    slice_num = 0
    begin
      slice = conn.async_exec("fetch forward #{batch_size} from #{cursor_name}")
      puts "Slice #{slice_num}, #{slice.count} rows"
      slice_num += 1
    end while slice.any?
  end
end

I tried this with various batch sizes:
Hard to find a pattern there; on the plus side, this is entirely artificial data, so perhaps it can be replicated on other clusters.
When a particular batch size hangs, does it consistently/repeatedly do so on a particular [...]
Sort of. With a batch size of 8, for example, I am seeing it hang on anywhere from [...]

This doesn't really line up with how I saw it happen on our ETL data, where it seemed like it was always hanging on the same [...]
Just to eliminate some potential false signals here, I've stripped out the transaction and the cursor and am seeing roughly the same behavior. I say "roughly" because this version seems to have a slightly different set of batch sizes for which it does not hang; for example, it does not hang on batch size 4, but does hang on 3.

require 'pg'
# removed connection info
conn = PG.connect(
  host: host,
  dbname: db,
  user: 'postgres'
)

if ARGV.length >= 1
  batch_size = ARGV[0]
end

sql = "SELECT rpad('',1024*1024,'x') FROM generate_series(1,#{batch_size})"

File.open("trace.out", "w") do |f|
  f.sync = true
  conn.trace(f)
  slice_num = 0
  begin
    slice = conn.async_exec(sql)
    puts "Slice #{slice_num}, #{slice.count} rows"
    slice_num += 1
  end while slice.any?
end
Another test could be streaming / single-row-mode: https://www.rubydoc.info/gems/pg/PG/Connection#set_single_row_mode-instance_method
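A minimal sketch of that streaming test, using the documented send_query / set_single_row_mode / get_result pattern (connection details are placeholders; the query is the 1 GiB example from above):

require 'pg'

conn = PG.connect(host: 'aurora-endpoint', dbname: 'mydb', user: 'me')  # placeholders
sql  = "SELECT rpad('',1024*1024,'x') FROM generate_series(1,1024)"

conn.send_query(sql)
conn.set_single_row_mode      # results now arrive one row at a time

rows = 0
while (res = conn.get_result) # nil once the query has been fully consumed
  res.check                   # raise if the server reported an error
  rows += res.ntuples         # 1 per single-row result, 0 for the final status result
end
puts "streamed #{rows} rows"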
In our testing...

conn = PG::Connection.open(...)
conn.exec(query)       # doesn't work (hangs)

conn = PG::Connection.open(...)
conn.send_query query
conn.get_result        # works (doesn't hang)

conn = PG::Connection.open(...)
conn.send_query query
conn.block             # doesn't work (hangs)

So the issue is in the blocking wait (conn.block). I believe this is the same issue as #297.
Not sure if this can give any clue for the developers to debug the issue, but here is my experience with Ruby and Aurora Serverless.

We have a Rails 4 application with pg v0.18.4. The app has been active for 5 years without an issue in its connection to an RDS PostgreSQL instance. The moment we migrated to Aurora Serverless, we started to notice the behaviour described in this issue: some queries just hung and never returned any result. However, the Aurora logs indicated no error, and all the queries were being resolved in good time.

We wrote a dirty ActiveRecord extension to deal with this situation: https://gist.github.com/brenes/ca6a45c1399e413642cafa87f1c8b19d

After this we started to see a different behaviour. Every time the database tries to scale (up or down), every query just freezes for some seconds (between 5 and 30 seconds and, in a few cases, up to 5 minutes). Searching the logs for the message we set in the gist, we saw an almost perfect match, but we are not sure whether those queries would have hung forever or whether they would have returned a result through [...]

Speaking with Amazon Support, they have confirmed to us how Aurora Serverless works.
Points 3 and 4 are what take up most of those 5-second to 5-minute pauses. Maybe all this proxy dance is what is causing the issues with the async_exec method? Could it be that some of the queries used as examples before trigger a scale-up operation (due to CPU use, maybe?) and create the problem? Sorry if all this is not quite helpful, and thanks for the hard work!
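The linked gist is not reproduced here; purely as a hypothetical illustration of that kind of guard (the module name, deadline, and logging are invented, and this is not the linked extension), one could prepend a retry-with-deadline wrapper around the adapter's execute method:

require 'timeout'

module RetryHangingQueries
  MAX_ATTEMPTS = 3
  DEADLINE     = 10 # seconds; made-up value

  def execute(sql, name = nil)
    attempts = 0
    begin
      attempts += 1
      Timeout.timeout(DEADLINE) { super }
    rescue Timeout::Error
      Rails.logger.warn("query exceeded #{DEADLINE}s, reconnecting (attempt #{attempts}): #{sql}")
      reconnect!             # drop the (possibly dead) connection and open a fresh one
      retry if attempts < MAX_ATTEMPTS
      raise
    end
  end
end

ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(RetryHangingQueries)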
@brenes Thanks for the write-up here. Your extension is very interesting, as we also had trouble scaling up or down and it's now clear why. Your symptom is equally valid as a bug. I believe what you're describing is a different symptom than ours, but of the same issue.
I am having the same issues, running the latest pg gem with the latest Rails (I also tried other versions). Here is one query that hangs. It only hangs for specific combinations of LIMIT and OFFSET; for example, increasing or decreasing either LIMIT or OFFSET by one makes it work fine. One odd thing is that if I try the Photo Load by itself, it loads fine.

irb(main):015:0> User.includes(:photos).order(:display_name, :created_at).page(11).per(25).each { |e| puts "HANGING" }

PG will just hang forever. It does not throw an error. If I set PG::Connection.async_api = false, the query completes as expected.
Possibly the same issue here, but not using Ruby. I am using SQLAlchemy (Python) and the aurora-data-api extensions for pg. For the most part, I can query fine. Periodically the simplest of statements hang and subsequently time out (maybe 5% of all operations, not restricted to any one of select, insert, update, etc.).

Access patterns and volumes are very predictable: a rule fires every 2 hours and runs a Lambda function that truncates a table and then inserts ~5000 records (in batches). That's it: no queries, just inserts. On the backside, I have QuickSight querying these datasets at around the same cadence (every 2 hours or so).

This was working perfectly until around May 20th. Prior to then, not a single timeout; everything else related to volume/traffic had not changed either. Logs, CloudWatch, etc. don't seem to be useful. Really odd.
We've been running in circles on this on and off for a couple of months, and got halfway to switching to a non-serverless solution, when we found that turning off SSL on the client connection made the stalling behavior disappear (that is, connecting with the [...])
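The truncated parenthetical above presumably refers to libpq's sslmode connection parameter; a minimal sketch of that workaround (host, database, and user are placeholders), with the obvious caveat that traffic to the cluster is then unencrypted:

require 'pg'

conn = PG.connect(
  host:    'aurora-endpoint',  # placeholder
  dbname:  'mydb',
  user:    'me',
  sslmode: 'disable'           # skip SSL entirely; avoids the stalls described above
)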
If you don't want to disable SSL, you can add PG::Connection.async_api = false in an initializer. It fixed the problem for me.
@sprite2005 This is undocumented, so can you tell me what the consequences of doing this are? Thanks
It will use sync_exec instead of async_exec. I've been running with async_api = false in production and on Aurora Serverless for a few months now and have not had any issues since doing that. The docs state this:

#exec is an alias for #async_exec which is almost identical to #sync_exec. #sync_exec is implemented on the simpler synchronous command processing API of libpq, whereas #async_exec is implemented on the asynchronous API and on ruby's IO mechanisms. Both methods ensure that other threads can process while waiting for the server to complete the request, but #sync_exec blocks all signals to be processed until the query is finished. This is most notably visible by a delayed reaction to Control+C. It's not recommended to use explicit sync or async variants but #exec instead, unless you have a good reason to do so.
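In a Rails app, that workaround would live in an initializer along these lines (the file name is arbitrary):

# config/initializers/pg_sync_api.rb (hypothetical file name)
require 'pg'

# Route PG::Connection#exec and friends through libpq's synchronous API,
# per the workaround discussed above. Trade-off: signals (e.g. Ctrl+C)
# are not processed until each query finishes.
PG::Connection.async_api = false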
@joelvh: What did you end up doing? Did you figure out a solution using [...]? Thanks!
@davidfromdk I ended up using an RDS Aurora instance - abandoned the Serverless flavor for the time being. Have been tracking this thread, but haven't yet had an opportunity to test the latest findings.
@davidfromdk We switched to "non-Serverless" as well.
@davidfromdk sorry to not provide more information, but we also switched to non-serverless. It resolved many, many issues for us.
Thanks for the elaboration everybody 😄 🚀
Hey guys, I think it is due to a known limitation of Serverless. See https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.html#aurora-serverless.limitations

In particular: "Currently, certain SQL queries and commands, such as cursors with hold (DECLARE name ... CURSOR WITH HOLD FOR query); session-level advisory locks; temporary relations; and asynchronous notifications (LISTEN, NOTIFY commands) prevent scaling and are not recommended. For more information, see Autoscaling for Aurora Serverless."

Would that make sense?
@l33z3r that's a good find. Looks like they've been expanding that list of limitations fairly recently, based on the history of those page edits: https://github.com/awsdocs/amazon-aurora-user-guide/blame/master/doc_source/aurora-serverless.md

I can't speak to whether the listed limitations overlap with the findings in this thread. But maybe someone who has deeper knowledge of [...]
Unfortunately, I don't think I can help here, as I don't have any experience with Aurora. Unless this issue can be reproduced on stock PostgreSQL, it sounds like an Aurora specific bug and I think this issue should be closed. |
Thanks @jeremyevans - I think the hope is that someone with deeper knowledge of [...]

Aside from that, it seems the best we can do is try to track the Aurora docs (which seem to finally have more detail) and keep giving that service a try in the future.
It looks like the issue is with the size of the SSL frame. Whenever the query response ends with a bigger SSL frame (>8 KB), the client gets stuck. It's because, in the loop below, PQconsumeInput only reads a maximum of 8 KB of decrypted data, but OpenSSL reads the entire SSL frame into memory in order to decrypt it. is_readable will return false because the entire response has not yet been read into the pq buffer, and then the loop goes on to wait for more data to arrive on the socket, which never arrives because OpenSSL has already read the entire frame. This doesn't happen with stock Postgres, because stock Postgres only sends a maximum of 8 KB of data in a single SSL frame.

Also, it seems this issue is no longer reproducible for me. Is it for anyone else?
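As a rough Ruby-level sketch of the failure mode being described (an illustration only, not ruby-pg's actual C implementation), the async path behaves roughly like this:

require 'pg'

def naive_async_exec(conn, sql)
  conn.send_query(sql)
  loop do
    conn.consume_input            # PQconsumeInput: reads/decrypts a bounded chunk
    break unless conn.is_busy     # a complete result is buffered; safe to fetch it
    # If OpenSSL has already pulled the rest of a large SSL frame off the socket,
    # the socket never becomes readable again and this wait blocks forever.
    IO.select([conn.socket_io])
  end
  conn.get_result
end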
@krazyjakee did you have to migrate via snapshot or is there a better way to do it? I'm looking to get off serverless also but hoping to avoid downtime.
It seems that this issue has been fixed on the Aurora side.
@larskanis: Out of curiosity. How do you know? :-) Thanks! |
@davidfromdk Just a guess, since the above comment says it's no longer reproducible and no one complained since then.
I've been using the pg gem in an AWS Lambda function, connecting to AWS RDS Aurora Serverless PostgreSQL. The connection would hang periodically when executing a query. It was not always the same query, but it would eventually start to hang consistently. I wanted to see if anyone else has seen this behavior or if it might be something specific to my environment. I posted a stacktrace using Sequel here: jeremyevans/sequel#1660 (comment)

At first I thought it might be that the gems were compiled with the wrong version of the PostgreSQL client, but I don't think that is the issue. Maybe this is considered out of scope for the pg project, but for what it's worth, I'm using the lambci/lambda image to replicate the AWS Lambda environment. My Dockerfile installs PostgreSQL and gems like this: [...]

Could this be a case where the pg gem is having a hard time with an unusual response from AWS RDS Aurora Serverless? Thanks in advance.