Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing unconfirmed message always results in an error #152

Closed
zachmartin9 opened this issue Apr 28, 2021 · 13 comments · Fixed by #178
Closed

Removing unconfirmed message always results in an error #152

zachmartin9 opened this issue Apr 28, 2021 · 13 comments · Fixed by #178
Labels

Comments

@zachmartin9
Copy link

Currently when a message is acked/nacked, code was added to make sure the correct message is being removed from the sent messages queue. The function to achieve this is located below from the src/ChannelWrapper.js file:

function removeUnconfirmedMessage(arr, message) {
    const toRemove = arr.indexOf(message);
    if (toRemove === -1) {
        throw new Error(`Message is not in _unconfirmedMessages!`);
    }
    const removed = arr.splice(toRemove, 1);
    return removed[0];
}

The issue is messages that are pushed to the queue are Objects and therefore the line of code const toRemove = arr.indexOf(message) will always result in -1.

@jwalton
Copy link
Owner

jwalton commented Apr 28, 2021

indexOf() works for objects:

message = {};
x = [{}, message, {}];
x.indexOf(message); // Returns 1.

There's pretty good test coverage for this project, too. Do you have an example that's causing the behavior you are seeing?

@zachmartin9
Copy link
Author

My situation is more along the lines of this:

message = {};
x = [{}, {}, {}];
x.indexOf(message); // Returns -1.

When logging the arr and message from the removeUnconfirmedMessages function, the Object in the array and the message Object have identical structure and properties, but it seems the reference is not equal.

@jwalton
Copy link
Owner

jwalton commented Apr 28, 2021

That would be true if you were in that situation - indexOf() uses === to decide if a value in the array is the value you are searching for.

But... how would you ever get into that situation? removeUnconfirmedMessage() is only ever called in _publishQueuedMessages() with a message that was pushed onto the unconfirmed messages array at the start the function. The exact object message refers to should always be present in this._unconfirmedMessages when we call removeUnconfirmedMessage(). It should be impossible to hit this exception. Are you directly modifying _unconfirmedMessages in your code?

@zachmartin9
Copy link
Author

I am not modifying _unconfirmedMessages. The only operations I am performing are: open a connection, create channels, send to queue, and close connections.

Regarding removeUnconfirmedMessage(), I completely agree based on the code in _publishQueuedMessages() that the exception should never be hit. I am not sure what edge case I have produced that is causing the exception to be thrown.

@jwalton
Copy link
Owner

jwalton commented Apr 29, 2021

I'm certainly interested in fixing this, but unless you can provide a test case or a sample repo that reproduces this, I'm not sure how I can help diagnose this. If your messages have some kind of unique identifier, you might try adding some logging to removeUnconfirmedMessage() to see if your message is somehow being double-freed, or if the message really is in the queue but has somehow been cloned.

@zachmartin9
Copy link
Author

Unfortunately, the repo this module is being used in belongs to a private company so I cannot provide it as an example. In the meantime, I will try to create a test case that can replicate the issue I am experiencing and provide that to you.

Thanks for the discussion and feedback.

@fitzhavey
Copy link

@zachmartin9 Hi, I've found myself running into something similar and have checked that I also am not modifying _unconfirmedMessages. Did you ever find the cause or a solution to this one? 🤞

@fitzhavey
Copy link

fitzhavey commented Jul 21, 2021

@jwalton some team mates have done some digging and found that it seems the logic on reconnection clears the _uncomfirmedMessages array.

This could lead to a race condition on reconnection that could produce this error.

luddd3 added a commit to luddd3/node-amqp-connection-manager that referenced this issue Aug 23, 2021
@jwalton
Copy link
Owner

jwalton commented Aug 26, 2021

I was speculating with luddd3 that it might be such a race condition over here: #166 although it's not obvious to me how such a race condition would occur. We'd need to get the "disconnect" from amqplib, and then get a confirm after the disconnect (although maybe that's exactly what's happening).

  • Send the message, but the confirm gets held up.
  • Disconnect, move the message to _messages.
  • Get the confirm, can't find the message in _unconfirmedMessages.

We could go looking for the message in the _messages array and remove it there, but if that's what is happening there's a non-zero chance that the message is going to end up getting sent twice - we could:

  • Send the message, but the confirm gets held up.
  • Disconnect, move the message to _messages.
  • Connect
  • Send the message from _messages, move it back to _unconfirmedMessages.
  • Get the original confirm, remove the from _unconfirmedMessages.
  • Get the second confirm.

(Although I admit that sounds very unlikely). Hang on, I'll raise a PR...

@jwalton
Copy link
Owner

jwalton commented Aug 26, 2021

Oh wait, we only move messages from _unconfirmedMessages to _messages on the reconnect - it would have to be exactly that second scenario. :/

jwalton added a commit that referenced this issue Aug 26, 2021
The unit tests for this had a fatal flaw - they assumed that if the
connection dropped, we'd get nothing back for any in-flight messages.
This isn't true, though - we'd actually get back an error from amqplib
when the underlying connection fails.  This fixes the tests to reflect
this.  If we rely on amqplib to reject such messages, then moving all
messages from _unconfirmedMessages to _messages on a
reconnect now becomes superfluous.

I also reworked `_publishQueuedMessages()` to be more synchronous.
As it stood, I had to add a lot of pointless delays in my tests to make
sure that the `then` after publishing a message had time to run.

fix #152
@jwalton
Copy link
Owner

jwalton commented Aug 26, 2021

I think I have a fix for this in #178. I think you hit the nail on the head @michaelfitzhavey. That block of code was the problem. We should be relying on errors from amqplib on the disconnect to remove messages from the _unconfrimedMessages list.

@jwalton
Copy link
Owner

jwalton commented Aug 26, 2021

@michaelfitzhavey If you want to, you can review the fix too. I'd just like someone to eyeball this before I push it to prod. :)

github-actions bot pushed a commit that referenced this issue Aug 26, 2021
## [3.5.2](v3.5.1...v3.5.2) (2021-08-26)

### Bug Fixes

* Fix handling of resending messages during a disconnect. ([e1457a5](e1457a5)), closes [#152](#152)

### Performance Improvements

* Send messages to underlying channel in synchronous batches. ([b866ef2](b866ef2))
@jwalton
Copy link
Owner

jwalton commented Aug 26, 2021

🎉 This issue has been resolved in version 3.5.2 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants