Segmentation Fault when using Continuous Query Notification #1009

wvanderdeijl · 2018-11-21T16:00:49Z

I am creating a CQN subscription for select * from demo with only SUBSCR_QOS_ROWIDS as qos. This works fine, unless I do a lot of transactions very quickly. For example a SQL script doing 10 update statements on that table with a commit between each update. This creates 10 notification to the node container and that frequently (but not always) crashes the node process with a 'Segmentation Fault' or 'Aborted' message on the console.

A reproducable test case with two docker containers in a docker-compose can be found at https://github.com/wvanderdeijl/oracle-leech/tree/segfault

The console output from the node container with the subscription:

node_1_5b71996e877c | >> callback of type 6
node_1_5b71996e877c | Aborted
node_1_5b71996e877c | npm ERR! code ELIFECYCLE
node_1_5b71996e877c | npm ERR! errno 134
node_1_5b71996e877c | npm ERR! [email protected] start: `ts-node ./index.ts`
node_1_5b71996e877c | npm ERR! Exit status 134
node_1_5b71996e877c | npm ERR! 
node_1_5b71996e877c | npm ERR! Failed at the [email protected] start script.
node_1_5b71996e877c | npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
node_1_5b71996e877c | 
node_1_5b71996e877c | npm ERR! A complete log of this run can be found in:
node_1_5b71996e877c | npm ERR!     /root/.npm/_logs/2018-11-21T15_36_44_404Z-debug.log
oracle-leech_node_1_5b71996e877c exited with code 134

The first line (callback of type 6) is from my javascript callback being invoked with the first message. The second message never arrives as the node process has crashed.

Usually the nodejs container simply crashes with a message 'Aborted', but sometimes it crashes with 'Segmentation Fault'. When that happens the segfault-handler npm module we use gets a chance to write a crash.log with a stack trace. One example of such a log file:

PID 436 received SIGSEGV for address: 0x0
/usr/src/app/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x2b90)[0x7f56a37dfb90]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f56c1554890]
/usr/src/app/node_modules/oracledb/build/Release/oracledb.node(_ZN15njsSubscription13CreateMessageEP16dpiSubscrMessage+0x49)[0x7f56ba5ab879]
/usr/src/app/node_modules/oracledb/build/Release/oracledb.node(_ZN15njsSubscription19ProcessNotificationEP10uv_async_s+0x65)[0x7f56ba5ac365]
node[0xa4732f]
node[0xa58018]
node(uv_run+0x14b)[0xa47c6b]
node(_ZN4node5StartEPN2v87IsolateEPNS_11IsolateDataERKSt6vectorISsSaISsEES9_+0x565)[0x8e5255]
node(_ZN4node5StartEiPPc+0x462)[0x8e34a2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f56c11bbb45]
node[0x89dd45]

This stack makes me believe the error is in the CreateMessage method of Subscription which was invoked by ProcessNotification The issue only seems to reproduce when we get a number of messages in a very short timeframe. I am no C++ developer, but looking at Subscription.cpp it seems like the message lives as a property on the subscription itself (subscription->message). Could it be that this is a concurreny issue and some of the native code is already starting processing the second message and putting that message on the subscription while the processing of the first message has not completed yet.

Answer the following questions:

Running Node version 10.13.0 running on x64 linux (the offical node:10 docker container)
Using node-oracledb 3.0.1
Using instantclient 18.3.0.0.0 on linux64 (docker container)
Database Enterprise Edition v11.2.0.1

Command to reproduce once the docker-compose (with an oracle EE container and a node container) are up:

docker-compose exec node sh -c "/opt/oracle/instantclient_18_3/sqlplus system/oracle@oracle/EE.oracle.docker @update.sql"

This SQL script is just 10 update statements on the table with a commit between each statement: https://github.com/wvanderdeijl/oracle-leech/blob/segfault/update.sql

NodeJS typescript code for the listener that fails can be found at https://github.com/wvanderdeijl/oracle-leech/blob/segfault/index.ts

The text was updated successfully, but these errors were encountered:

anthony-tuininga · 2018-11-21T21:50:10Z

I am no C++ developer, but looking at Subscription.cpp it seems like the message lives as a property on the subscription itself (subscription->message). Could it be that this is a concurrency issue and some of the native code is already starting processing the second message and putting that message on the subscription while the processing of the first message has not completed yet.

Considering that ProcessNotification() is called on the main Node.js thread that would seem impossible, barring a bug in Node.js itself, of course! But it is likely still a concurrency issue, if not there. :-)

Thanks for providing such a complete test case and the steps for reproducing it. Hopefully we can indeed replicate the issue.

wvanderdeijl · 2018-11-21T22:31:48Z

Would be great of you can figure this one out. CQN seems the perfect match for our scenario but these stability issues make use hesitant to rely on CQN.
It reproduces almost every time in my machine, so if you want me to do some more debugging or logging just let me know.

anthony-tuininga · 2018-11-21T22:37:08Z

I just tried running your docker-compose setup. It all appeared to work but running the final command fails with SP2-0310: unable to open file "update.sql". Running the commands by copy/paste yields ORA-00942: table or view does not exist. So something seems a bit off yet. What script runs the setup for the database?

wvanderdeijl · 2018-11-21T22:42:38Z

The table is created by the node container with the listening process. See line 48 at
https://github.com/wvanderdeijl/oracle-leech/blob/segfault/index.ts#L48
Also on initial startup of the compose the creation of the EE database takes quite a while.

Be sure to use the segfault branch of the repo. The master branch is not in a working state.

anthony-tuininga · 2018-11-21T22:56:02Z

Ah. I first built it with master. I then rebuilt everything, but running docker-compose up yields the following problems:

node_1    | Error: EACCES: permission denied, scandir '/usr/src/app'
node_1    | (node:8) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'get' of undefined
node_1    |     at errorHandler (/usr/local/lib/node_modules/npm/lib/utils/error-handler.js:205:18)
node_1    |     at /usr/local/lib/node_modules/npm/bin/npm-cli.js:78:20
node_1    |     at cb (/usr/local/lib/node_modules/npm/lib/npm.js:228:22)
node_1    |     at /usr/local/lib/node_modules/npm/lib/npm.js:266:24
node_1    |     at /usr/local/lib/node_modules/npm/lib/config/core.js:83:7
node_1    |     at Array.forEach (<anonymous>)
node_1    |     at /usr/local/lib/node_modules/npm/lib/config/core.js:82:13
node_1    |     at f (/usr/local/lib/node_modules/npm/node_modules/once/once.js:25:25)
node_1    |     at /usr/local/lib/node_modules/npm/lib/config/core.js:112:20
node_1    | (node:8) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
node_1    | (node:8) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
node_1    | TypeError: Cannot read property 'loaded' of undefined
node_1    |     at process.<anonymous> (/usr/local/lib/node_modules/npm/lib/utils/error-handler.js:40:18)
node_1    |     at process.emit (events.js:182:13)
node_1    | /usr/local/lib/node_modules/npm/lib/utils/error-handler.js:205
node_1    |   if (npm.config.get('json')) {
node_1    |                  ^
node_1    | 
node_1    | TypeError: Cannot read property 'get' of undefined
node_1    |     at process.errorHandler (/usr/local/lib/node_modules/npm/lib/utils/error-handler.js:205:18)
node_1    |     at process.emit (events.js:182:13)
node_1    |     at process._fatalException (internal/bootstrap/node.js:491:27)
oracle-leech_node_1 exited with code 7

Any further suggestions?

wvanderdeijl · 2018-11-21T23:06:57Z

That's weird. I’ll try it myself tomorrow in a fully clean working directory and will get back to you

wvanderdeijl · 2018-11-22T07:24:53Z

I don't really understand what is going wrong, but I expect it has something to do with the /usr/src/app being a volume that I mapped to your docker host directory. Looks like docker somehow doesn't have permissions on your host.

Anyhow; I've updated the segfault branch to run everything in the docker containers without host volume mapping. Could you try to pull the latest version from git and then follow the instructions in the README.md?

anthony-tuininga · 2018-11-22T16:11:16Z

Ok. I updated my git repo and ran the commands specified. Everything came up properly this time without any issues. I then ran the command to perform the update....and that worked, too. I ran it over a dozen times without any problems! :-)

So I hacked update.sql and got it to run 1000 times in an anonymous PL/SQL block. That successfully crashed node. So now I'll take it apart and see if I can figure out why!

wvanderdeijl · 2018-11-22T23:12:45Z

“Good” to hear you also have the same issue. Let’s hope that’s a big step towards fixing this. Let me know if there is anything else I can do

anthony-tuininga · 2018-11-29T21:17:04Z

Looking at this issue I discovered that all of the notifications are running the same thread, so this is not a concurrency issue. Instead, it seems to be due to the fact that too many notifications get backed up. The workaround (and probably better implementation anyway!) is to use notification grouping, as in the following code:

const cqnOptions = {
    callback: cqnCallback,
    sql: "select * from demo",
    groupingClass: oracledb.SUBSCR_GROUPING_CLASS_TIME,
    groupingValue: 1,  // number of seconds
    groupingType: oracledb.SUBSCR_GROUPING_TYPE_SUMMARY
};

If you're getting that many notifications this is going to be an improvement anyway! You can take a look at the documentation as well.

wvanderdeijl · 2018-12-01T08:10:52Z

We've tried grouping but that has the major downside of losing access to actual rowid's and we need to do a full table refresh. So we would like to use individual messages, but in my setup this already crashes node when I send just a couple of messages. It is not a very busy database but it might happen that a few messages hit at the same time. In my docker setup it already fails at the second message.
Have you found where the queueing takes place? Would it help if I accept the messages as quickly as possible in my JavaScript code and debounce there?

anthony-tuininga · 2018-12-01T16:31:05Z

One other possibility to consider: use another method to determine which rows need to be processed (last updated column, separate table containing primary keys of affected rows, etc.) so you can treat the CQN message like a signal.

Regarding processing the messages as quickly as possible, that's probably a good idea but I don't think that will eliminate the issue completely. Perhaps you can simply comment out the real processing temporarily to see if you can process more messages then? The queuing is taking place inside the Oracle Client libraries. We are hoping to talk to one of the owners of that section of the code to see if there is anything that can address this issue and will get back to you once we have more information.

cjbj · 2018-12-05T01:42:08Z

@wvanderdeijl I just pushed a patch from @anthony-tuininga to master that seems to resolve the CQN crash. Can you help test & review it?

anthony-tuininga · 2018-12-05T01:50:39Z

And it actually did turn out to be a concurrency issue. :-) And I was able to process better than 10,000 notifications in under a second, so I think we're good to go.

alpertandogan · 2018-12-07T13:58:56Z

And it actually did turn out to be a concurrency issue. :-) And I was able to process better than 10,000 notifications in under a second, so I think we're good to go.

I have the same problem with CQN.

How do you make 10,000 under a second. I just make max 20 .

anthony-tuininga

#1018 (comment)

anthony-tuininga · 2018-12-07T15:58:12Z

Can you take a look at this comment?

As for performance? That's simply a matter of a fast machine and what you are doing in your notification callback! All I did (effectively) was count the number of notifications, so that's going to process much more quickly than something that performs a query, for example!

alpertandogan · 2018-12-07T16:09:07Z

When i recive CQN from Server i only run console.log(message);

async function myCallback(message) {
console.log(message);
}

But i use this scripts (run well with 10 instead of 50 ) from other clients connect to oracle NODEjs crash with ORA-24550
begin
for i in 1...50 loop
UPDATE msgtable set xxxx ; COMMIT;
end loop;
end;

anthony-tuininga · 2018-12-07T16:10:52Z

Have you recompiled node-oracledb with the patch that was provided?

alpertandogan · 2018-12-08T09:31:19Z

I will compile and test monday. Thank s a lot.

cjbj · 2019-01-05T22:06:44Z

@alpertandogan how did your testing work out?

cjbj · 2019-01-23T00:29:55Z

Node-oracledb 3.1 was just released with a fix in this area. There may still be an issue on Windows, but we haven't reproduced in development.

anthony-tuininga added the bug label Nov 21, 2018

anthony-tuininga mentioned this issue Dec 7, 2018

Core dump when doing CQN on same connection when 50>records insert and commit #1018

Closed

cjbj closed this as completed Jan 23, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Segmentation Fault when using Continuous Query Notification #1009

Segmentation Fault when using Continuous Query Notification #1009

wvanderdeijl commented Nov 21, 2018 •

edited

Loading

anthony-tuininga commented Nov 21, 2018

wvanderdeijl commented Nov 21, 2018

anthony-tuininga commented Nov 21, 2018 •

edited

Loading

wvanderdeijl commented Nov 21, 2018

anthony-tuininga commented Nov 21, 2018

wvanderdeijl commented Nov 21, 2018

wvanderdeijl commented Nov 22, 2018

anthony-tuininga commented Nov 22, 2018

wvanderdeijl commented Nov 22, 2018

anthony-tuininga commented Nov 29, 2018

wvanderdeijl commented Dec 1, 2018

anthony-tuininga commented Dec 1, 2018

cjbj commented Dec 5, 2018

anthony-tuininga commented Dec 5, 2018

alpertandogan commented Dec 7, 2018

anthony-tuininga commented Dec 7, 2018

alpertandogan commented Dec 7, 2018

anthony-tuininga commented Dec 7, 2018

alpertandogan commented Dec 8, 2018

cjbj commented Jan 5, 2019

cjbj commented Jan 23, 2019

Segmentation Fault when using Continuous Query Notification #1009

Segmentation Fault when using Continuous Query Notification #1009

Comments

wvanderdeijl commented Nov 21, 2018 • edited Loading

Answer the following questions:

anthony-tuininga commented Nov 21, 2018

wvanderdeijl commented Nov 21, 2018

anthony-tuininga commented Nov 21, 2018 • edited Loading

wvanderdeijl commented Nov 21, 2018

anthony-tuininga commented Nov 21, 2018

wvanderdeijl commented Nov 21, 2018

wvanderdeijl commented Nov 22, 2018

anthony-tuininga commented Nov 22, 2018

wvanderdeijl commented Nov 22, 2018

anthony-tuininga commented Nov 29, 2018

wvanderdeijl commented Dec 1, 2018

anthony-tuininga commented Dec 1, 2018

cjbj commented Dec 5, 2018

anthony-tuininga commented Dec 5, 2018

alpertandogan commented Dec 7, 2018

anthony-tuininga commented Dec 7, 2018

alpertandogan commented Dec 7, 2018

When i recive CQN from Server i only run console.log(message);

async function myCallback(message) { console.log(message); }

anthony-tuininga commented Dec 7, 2018

alpertandogan commented Dec 8, 2018

cjbj commented Jan 5, 2019

cjbj commented Jan 23, 2019

wvanderdeijl commented Nov 21, 2018 •

edited

Loading

anthony-tuininga commented Nov 21, 2018 •

edited

Loading

async function myCallback(message) {
console.log(message);
}