Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation Fault when using Continuous Query Notification #1009

Closed
wvanderdeijl opened this issue Nov 21, 2018 · 21 comments
Closed

Segmentation Fault when using Continuous Query Notification #1009

wvanderdeijl opened this issue Nov 21, 2018 · 21 comments
Labels

Comments

@wvanderdeijl
Copy link

wvanderdeijl commented Nov 21, 2018

I am creating a CQN subscription for select * from demo with only SUBSCR_QOS_ROWIDS as qos. This works fine, unless I do a lot of transactions very quickly. For example a SQL script doing 10 update statements on that table with a commit between each update. This creates 10 notification to the node container and that frequently (but not always) crashes the node process with a 'Segmentation Fault' or 'Aborted' message on the console.

A reproducable test case with two docker containers in a docker-compose can be found at https://github.com/wvanderdeijl/oracle-leech/tree/segfault

The console output from the node container with the subscription:

node_1_5b71996e877c | >> callback of type 6
node_1_5b71996e877c | Aborted
node_1_5b71996e877c | npm ERR! code ELIFECYCLE
node_1_5b71996e877c | npm ERR! errno 134
node_1_5b71996e877c | npm ERR! [email protected] start: `ts-node ./index.ts`
node_1_5b71996e877c | npm ERR! Exit status 134
node_1_5b71996e877c | npm ERR! 
node_1_5b71996e877c | npm ERR! Failed at the [email protected] start script.
node_1_5b71996e877c | npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
node_1_5b71996e877c | 
node_1_5b71996e877c | npm ERR! A complete log of this run can be found in:
node_1_5b71996e877c | npm ERR!     /root/.npm/_logs/2018-11-21T15_36_44_404Z-debug.log
oracle-leech_node_1_5b71996e877c exited with code 134

The first line (callback of type 6) is from my javascript callback being invoked with the first message. The second message never arrives as the node process has crashed.

Usually the nodejs container simply crashes with a message 'Aborted', but sometimes it crashes with 'Segmentation Fault'. When that happens the segfault-handler npm module we use gets a chance to write a crash.log with a stack trace. One example of such a log file:

PID 436 received SIGSEGV for address: 0x0
/usr/src/app/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x2b90)[0x7f56a37dfb90]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f56c1554890]
/usr/src/app/node_modules/oracledb/build/Release/oracledb.node(_ZN15njsSubscription13CreateMessageEP16dpiSubscrMessage+0x49)[0x7f56ba5ab879]
/usr/src/app/node_modules/oracledb/build/Release/oracledb.node(_ZN15njsSubscription19ProcessNotificationEP10uv_async_s+0x65)[0x7f56ba5ac365]
node[0xa4732f]
node[0xa58018]
node(uv_run+0x14b)[0xa47c6b]
node(_ZN4node5StartEPN2v87IsolateEPNS_11IsolateDataERKSt6vectorISsSaISsEES9_+0x565)[0x8e5255]
node(_ZN4node5StartEiPPc+0x462)[0x8e34a2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f56c11bbb45]
node[0x89dd45]

This stack makes me believe the error is in the CreateMessage method of Subscription which was invoked by ProcessNotification The issue only seems to reproduce when we get a number of messages in a very short timeframe. I am no C++ developer, but looking at Subscription.cpp it seems like the message lives as a property on the subscription itself (subscription->message). Could it be that this is a concurreny issue and some of the native code is already starting processing the second message and putting that message on the subscription while the processing of the first message has not completed yet.

Answer the following questions:

  • Running Node version 10.13.0 running on x64 linux (the offical node:10 docker container)
  • Using node-oracledb 3.0.1
  • Using instantclient 18.3.0.0.0 on linux64 (docker container)
  • Database Enterprise Edition v11.2.0.1

Command to reproduce once the docker-compose (with an oracle EE container and a node container) are up:

docker-compose exec node sh -c "/opt/oracle/instantclient_18_3/sqlplus system/oracle@oracle/EE.oracle.docker @update.sql"

This SQL script is just 10 update statements on the table with a commit between each statement: https://github.com/wvanderdeijl/oracle-leech/blob/segfault/update.sql

NodeJS typescript code for the listener that fails can be found at https://github.com/wvanderdeijl/oracle-leech/blob/segfault/index.ts

@anthony-tuininga
Copy link
Member

I am no C++ developer, but looking at Subscription.cpp it seems like the message lives as a property on the subscription itself (subscription->message). Could it be that this is a concurrency issue and some of the native code is already starting processing the second message and putting that message on the subscription while the processing of the first message has not completed yet.

Considering that ProcessNotification() is called on the main Node.js thread that would seem impossible, barring a bug in Node.js itself, of course! But it is likely still a concurrency issue, if not there. :-)

Thanks for providing such a complete test case and the steps for reproducing it. Hopefully we can indeed replicate the issue.

@wvanderdeijl
Copy link
Author

Would be great of you can figure this one out. CQN seems the perfect match for our scenario but these stability issues make use hesitant to rely on CQN.
It reproduces almost every time in my machine, so if you want me to do some more debugging or logging just let me know.

@anthony-tuininga
Copy link
Member

anthony-tuininga commented Nov 21, 2018

I just tried running your docker-compose setup. It all appeared to work but running the final command fails with SP2-0310: unable to open file "update.sql". Running the commands by copy/paste yields ORA-00942: table or view does not exist. So something seems a bit off yet. What script runs the setup for the database?

@wvanderdeijl
Copy link
Author

The table is created by the node container with the listening process. See line 48 at
https://github.com/wvanderdeijl/oracle-leech/blob/segfault/index.ts#L48
Also on initial startup of the compose the creation of the EE database takes quite a while.

Be sure to use the segfault branch of the repo. The master branch is not in a working state.

@anthony-tuininga
Copy link
Member

Ah. I first built it with master. I then rebuilt everything, but running docker-compose up yields the following problems:

node_1    | Error: EACCES: permission denied, scandir '/usr/src/app'
node_1    | (node:8) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'get' of undefined
node_1    |     at errorHandler (/usr/local/lib/node_modules/npm/lib/utils/error-handler.js:205:18)
node_1    |     at /usr/local/lib/node_modules/npm/bin/npm-cli.js:78:20
node_1    |     at cb (/usr/local/lib/node_modules/npm/lib/npm.js:228:22)
node_1    |     at /usr/local/lib/node_modules/npm/lib/npm.js:266:24
node_1    |     at /usr/local/lib/node_modules/npm/lib/config/core.js:83:7
node_1    |     at Array.forEach (<anonymous>)
node_1    |     at /usr/local/lib/node_modules/npm/lib/config/core.js:82:13
node_1    |     at f (/usr/local/lib/node_modules/npm/node_modules/once/once.js:25:25)
node_1    |     at /usr/local/lib/node_modules/npm/lib/config/core.js:112:20
node_1    | (node:8) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
node_1    | (node:8) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
node_1    | TypeError: Cannot read property 'loaded' of undefined
node_1    |     at process.<anonymous> (/usr/local/lib/node_modules/npm/lib/utils/error-handler.js:40:18)
node_1    |     at process.emit (events.js:182:13)
node_1    | /usr/local/lib/node_modules/npm/lib/utils/error-handler.js:205
node_1    |   if (npm.config.get('json')) {
node_1    |                  ^
node_1    | 
node_1    | TypeError: Cannot read property 'get' of undefined
node_1    |     at process.errorHandler (/usr/local/lib/node_modules/npm/lib/utils/error-handler.js:205:18)
node_1    |     at process.emit (events.js:182:13)
node_1    |     at process._fatalException (internal/bootstrap/node.js:491:27)
oracle-leech_node_1 exited with code 7

Any further suggestions?

@wvanderdeijl
Copy link
Author

That's weird. I’ll try it myself tomorrow in a fully clean working directory and will get back to you

@wvanderdeijl
Copy link
Author

I don't really understand what is going wrong, but I expect it has something to do with the /usr/src/app being a volume that I mapped to your docker host directory. Looks like docker somehow doesn't have permissions on your host.

Anyhow; I've updated the segfault branch to run everything in the docker containers without host volume mapping. Could you try to pull the latest version from git and then follow the instructions in the README.md?

@anthony-tuininga
Copy link
Member

Ok. I updated my git repo and ran the commands specified. Everything came up properly this time without any issues. I then ran the command to perform the update....and that worked, too. I ran it over a dozen times without any problems! :-)

So I hacked update.sql and got it to run 1000 times in an anonymous PL/SQL block. That successfully crashed node. So now I'll take it apart and see if I can figure out why!

@wvanderdeijl
Copy link
Author

“Good” to hear you also have the same issue. Let’s hope that’s a big step towards fixing this. Let me know if there is anything else I can do

@anthony-tuininga
Copy link
Member

Looking at this issue I discovered that all of the notifications are running the same thread, so this is not a concurrency issue. Instead, it seems to be due to the fact that too many notifications get backed up. The workaround (and probably better implementation anyway!) is to use notification grouping, as in the following code:

const cqnOptions = {
    callback: cqnCallback,
    sql: "select * from demo",
    groupingClass: oracledb.SUBSCR_GROUPING_CLASS_TIME,
    groupingValue: 1,  // number of seconds
    groupingType: oracledb.SUBSCR_GROUPING_TYPE_SUMMARY
};

If you're getting that many notifications this is going to be an improvement anyway! You can take a look at the documentation as well.

@wvanderdeijl
Copy link
Author

We've tried grouping but that has the major downside of losing access to actual rowid's and we need to do a full table refresh. So we would like to use individual messages, but in my setup this already crashes node when I send just a couple of messages. It is not a very busy database but it might happen that a few messages hit at the same time. In my docker setup it already fails at the second message.
Have you found where the queueing takes place? Would it help if I accept the messages as quickly as possible in my JavaScript code and debounce there?

@anthony-tuininga
Copy link
Member

One other possibility to consider: use another method to determine which rows need to be processed (last updated column, separate table containing primary keys of affected rows, etc.) so you can treat the CQN message like a signal.

Regarding processing the messages as quickly as possible, that's probably a good idea but I don't think that will eliminate the issue completely. Perhaps you can simply comment out the real processing temporarily to see if you can process more messages then? The queuing is taking place inside the Oracle Client libraries. We are hoping to talk to one of the owners of that section of the code to see if there is anything that can address this issue and will get back to you once we have more information.

@cjbj
Copy link
Member

cjbj commented Dec 5, 2018

@wvanderdeijl I just pushed a patch from @anthony-tuininga to master that seems to resolve the CQN crash. Can you help test & review it?

@anthony-tuininga
Copy link
Member

And it actually did turn out to be a concurrency issue. :-) And I was able to process better than 10,000 notifications in under a second, so I think we're good to go.

@alpertandogan
Copy link

And it actually did turn out to be a concurrency issue. :-) And I was able to process better than 10,000 notifications in under a second, so I think we're good to go.

I have the same problem with CQN.

How do you make 10,000 under a second. I just make max 20 .

anthony-tuininga

#1018 (comment)

@anthony-tuininga
Copy link
Member

Can you take a look at this comment?

As for performance? That's simply a matter of a fast machine and what you are doing in your notification callback! All I did (effectively) was count the number of notifications, so that's going to process much more quickly than something that performs a query, for example!

@alpertandogan
Copy link

When i recive CQN from Server i only run console.log(message);

async function myCallback(message) {
console.log(message);
}

But i use this scripts (run well with 10 instead of 50 ) from other clients connect to oracle NODEjs crash with ORA-24550
begin
for i in 1...50 loop
UPDATE msgtable set xxxx ; COMMIT;
end loop;
end;

@anthony-tuininga
Copy link
Member

Have you recompiled node-oracledb with the patch that was provided?

@alpertandogan
Copy link

I will compile and test monday. Thank s a lot.

@cjbj
Copy link
Member

cjbj commented Jan 5, 2019

@alpertandogan how did your testing work out?

@cjbj
Copy link
Member

cjbj commented Jan 23, 2019

Node-oracledb 3.1 was just released with a fix in this area. There may still be an issue on Windows, but we haven't reproduced in development.

@cjbj cjbj closed this as completed Jan 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants