Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1737830: Sporadic error seen in AWS lambda after upgrading to 1.14.0 driver #936

Closed
spdaley opened this issue Oct 15, 2024 · 12 comments
Assignees
Labels
bug Something isn't working status-fixed_awaiting_release The issue has been fixed, its PR merged, and now awaiting the next release cycle of the connector. status-triage_done Initial triage done, will be further handled by the driver team

Comments

@spdaley
Copy link

spdaley commented Oct 15, 2024

Info

  1. What version of NodeJS driver are you using?
"snowflake-sdk": "1.14.0"
  1. What operating system and processor architecture are you using?
AWS Lambda Linux Instances
  1. What version of NodeJS are you using?
Node 20
  1. What are the component versions in the environment (npm list)?
too many to list?
  1. What did you do?
Unfortuantely, I don't have a complete runnable program for you. This was just something that started happening in our logs after ugprading to 1.14.0. Our application runs in AWS Lambda. We generally try to preserve the snowflake connections across lambda invocations. This typically works just fine. After upgrading to 1.14.0 we started noticing very sporadic errors in the log which say:

Encountered an error when sending the request. Details: {}

There were not a large number of these but enough to set off our error monitoring. Also, the ones we found did not seem to be causing any issues.

From the logs it looks like when some of the lambda invocations start, a snowflake heartbeat event will get started and if the lambda finishes quickly we end up seeing an error in a subsequent invocation. See logs below.

Note, one other thing is these lambdas have multiple data sources that they deal with and sometimes do not actually talk to snowflake during an invocation. I believe all of the incidents where I saw this happen were during executions that never used snowflake in the end.
  1. What did you expect to see?
No error
  1. Can you collect debug logs?
We have some logs now. I've included when the lambdas started and finished in the logs but I've pruned out everything else we were doing. 

"2024-10-10T16:10:09.234Z","REPORT RequestId: 8c70ef49-88c6-4792-ad95-4345f126859d	Duration: 87.92 ms	Billed Duration: 88 ms	Memory Size: 1024 MB	Max Memory Used: 468 MB"
"2024-10-10T16:10:09.234Z","END RequestId: 8c70ef49-88c6-4792-ad95-4345f126859d"
"2024-10-10T16:10:09.209Z","[4:10:09.209 PM]: Encountered an error when sending the request. Details: {}"
"2024-10-10T16:10:09.146Z","START RequestId: 8c70ef49-88c6-4792-ad95-4345f126859d Version: 2842"
"2024-10-10T16:10:07.442Z","REPORT RequestId: 5d5b2ef4-3a97-4334-87bb-a5696f10c641	Duration: 92.46 ms	Billed Duration: 93 ms	Memory Size: 1024 MB	Max Memory Used: 468 MB"
"2024-10-10T16:10:07.442Z","END RequestId: 5d5b2ef4-3a97-4334-87bb-a5696f10c641"
"2024-10-10T16:10:07.436Z","[4:10:07.436 PM]: OCSP validation disabled for cn58973.us-east-2.aws.snowflakecomputing.com"
"2024-10-10T16:10:07.433Z","[4:10:07.433 PM]: CALL POST with timeout 90000: https://CN58973.us-east-2.aws.snowflakecomputing.com/session/heartbeat?requestId=736de361-4d9e-495e-b870-47787f3b0370"
"2024-10-10T16:10:07.433Z","[4:10:07.433 PM]: Get agent with id: https://cn58973.us-east-2.aws.snowflakecomputing.com-noKeepAlive from cache"
"2024-10-10T16:10:07.398Z","[4:10:07.398 PM]: OCSP validation disabled for cn58973.us-east-2.aws.snowflakecomputing.com"
"2024-10-10T16:10:07.396Z","[4:10:07.396 PM]: CALL POST with timeout 90000: https://CN58973.us-east-2.aws.snowflakecomputing.com/session/token-request"
"2024-10-10T16:10:07.396Z","[4:10:07.395 PM]: Get agent with id: https://cn58973.us-east-2.aws.snowflakecomputing.com-noKeepAlive from cache"
"2024-10-10T16:10:07.355Z","[4:10:07.355 PM]: OCSP validation disabled for cn58973.us-east-2.aws.snowflakecomputing.com"
"2024-10-10T16:10:07.353Z","[4:10:07.350 PM]: CALL POST with timeout 90000: https://CN58973.us-east-2.aws.snowflakecomputing.com/session/heartbeat?requestId=736de361-4d9e-495e-b870-47787f3b0370"
"2024-10-10T16:10:07.353Z","[4:10:07.350 PM]: Get agent with id: https://cn58973.us-east-2.aws.snowflakecomputing.com-noKeepAlive from cache"
"2024-10-10T16:10:07.353Z","[4:10:07.349 PM]: Issuing heartbeat call"
"2024-10-10T16:10:07.349Z","START RequestId: 5d5b2ef4-3a97-4334-87bb-a5696f10c641 Version: 2842"

As stated above, this run didn't actually use the snowflake connection at all. It was performing an action against a different data source. We were running the 1.13.0 driver for a few weeks without seeing this. It was only after upgrading to 1.14.0 that we started to seeing.

I'm sorry it's not much to go on. I noticed in the release notes for 1.14.0 there was a fix to the heartbeat callback handler. Not sure if this is related or not but it was the only thing noticeable I saw in this path.

@spdaley spdaley added the bug Something isn't working label Oct 15, 2024
@github-actions github-actions bot changed the title Sporadic error seen in AWS lambda after upgrading to 1.14.0 driver SNOW-1737830: Sporadic error seen in AWS lambda after upgrading to 1.14.0 driver Oct 15, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Oct 15, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added the status-triage Issue is under initial triage label Oct 15, 2024
@sfc-gh-dszmolka
Copy link
Collaborator

hi and thank you for raising this issue! one notable change in 1.14.0 regarding the heartbeats was #908. Obviously we could just revert that in a patch to test if that single change fixes the issue for you, but then we would not get any details about the actual error it causes.

Unfortunately it seems during JSON.stringify-ing the error, it gets lost :( So I made a small patch with patch-package which does nothing else, just changes how we log the error thrown.

snowflake-sdk+1.14.0.patch

Would it be possible for you to:

  • install [email protected] as usual
  • apply the patch with any of the three mechanisms described in Applying patches section of patch-package
  • wait for the issue to happen again and collect the logs and share them.

The same way we can revert #908 too in a subsequent step to get proof it caused the error (or the opposite) but if we were doing it right now, there would be no details of the errors it causes, so I'm hoping we can do it in two steps.
Thank you in advance !

@spdaley
Copy link
Author

spdaley commented Oct 15, 2024

hi and thank you for raising this issue! one notable change in 1.14.0 regarding the heartbeats was #908. Obviously we could just revert that in a patch to test if that single change fixes the issue for you, but then we would not get any details about the actual error it causes.

Thanks. I should be able to do that. It may take a day or so but I'll let you know once I get results.

@spdaley
Copy link
Author

spdaley commented Oct 16, 2024

We've got this error a few more times now since I deployed the patched snowflake. Here's the additional log we get now:

[12:18:59.899 AM]: TypeError: conn.getId is not a function
    at /var/task/resolvers/node_modules/@oort-dev/cnt-services-infra/dist/src/common/SnowflakeConnection.js:41:63
    at options.callback (/opt/nodejs/node_modules/snowflake-sdk/lib/services/sf.js:1333:28)
    at Object.callback (/opt/nodejs/node_modules/snowflake-sdk/lib/services/sf.js:676:43)
    at /opt/nodejs/node_modules/snowflake-sdk/lib/http/base.js:38:24
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  request:

@sfc-gh-dszmolka
Copy link
Collaborator

sfc-gh-dszmolka commented Oct 16, 2024

thank you so much @spdaley, this helps! before going into disabling configuration which worked perfectly for you prior to installing 1.14.0 with #908 , do you have some time to try one more patch ?

snowflake-sdk+1.14.0.patch

This is the same as before (adding/fixing logging) plus reverting #908 . Nothing else was changed. So with this patch, it should be very easy to get proof if really the aforementioned PR caused the problems for you or not. If the errors cease to happen with 1.14.0 + patch which reverts 908, we should have a good indication where it comes from.

Thank you !

@spdaley
Copy link
Author

spdaley commented Oct 16, 2024

I can give that patch a shot too. One thing I didn't notice before is that our code seems to be where that error originates. Line 41 of that file looks like this:

            localConnection.connect((err, conn) => {
                if (err) {
                    reject(err);
                }
                else {
                    logger.debug('Connection Success: ', conn.getId()); <------ this is line 41
                    ...
                }
            });

None of this code has changed between 1.13.1 and 1.14.0 though. We've only been changing the driver.
Is this the callback that is getting passed to the heartbeat now? Is the callback being executed without the conn? Or with something else? Or is there a bug in our callback? It pretty much follows your docs other than we are using arrow functions instead of a regular function.

@spdaley
Copy link
Author

spdaley commented Oct 16, 2024

I have a stand-alone script that reproduces it now. Do the following:

  1. npm install snowflake-sdk 1.14.0
  2. Modify your installed environment and apply the first patch which logs the exception
  3. Optional Modify node_modules/snowflake-sdk/lib/connection/connection.js and change line 179 (give or take) to self.keepalive = setInterval(self.heartbeat, 10 * 1000, callback); ... note I did this because it looks like the minimum heartbeat frequency is 900 seconds? I got tired of waiting so just hard-coded it to 10. The same issue happens with 900 seconds though.
  4. Modify the following script and set the SNOWFLAKE_ values to something appropriate for your env
  5. Run it with node
  6. You should get output like this:
{"level":"TRACE","message":"[10:38:34.003 AM]: CALL POST with timeout 90000: https://SF13288.us-east-2.aws.snowflakecomputing.com/queries/v1/query-request?requestId=8200ecbd-debe-4bca-9d3c-e5f34aa6680a"}
{"level":"TRACE","message":"[10:38:34.091 AM]: socket reused = true"}
Failed to execute statement due to the following error: SQL compilation error:
Object 'YOUR_TABLE' does not exist or not authorized.
{"level":"DEBUG","message":"[10:38:44.001 AM]: Issuing heartbeat call"}
{"level":"TRACE","message":"[10:38:44.003 AM]: Get agent with id: https://sf13288.us-east-2.aws.snowflakecomputing.com-noKeepAlive from cache"}
{"level":"TRACE","message":"[10:38:44.004 AM]: CALL POST with timeout 90000: https://SF13288.us-east-2.aws.snowflakecomputing.com/session/heartbeat?requestId=9697869b-8e50-41bd-9803-165bb51ec236"}
{"level":"TRACE","message":"[10:38:44.099 AM]: socket reused = true"}
{"level":"DEBUG","message":"[10:38:44.170 AM]: Encountered an error when sending the request. Details: "}
{"level":"DEBUG","message":"[10:38:44.174 AM]: TypeError: conn.getId is not a function\n    at /Users/seadaley/workspace/test2/test-backup.js:52:85\n    at options.callback (/Users/seadaley/workspace/test2/node_modules/snowflake-sdk/lib/services/sf.js:1333:28)\n    at Object.callback (/Users/seadaley/workspace/test2/node_modules/snowflake-sdk/lib/services/sf.js:676:43)\n    at /Users/seadaley/workspace/test2/node_modules/snowflake-sdk/lib/http/base.js:38:24\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {\n  req"}

For this test I didn't actually set a valid table because I didn't really care. I just wanted something in there to let me know the connection was working.

The script

const snowflake = require('snowflake-sdk');

snowflake.configure({
  logLevel: "DEBUG"
});

// Constants for connection
const SNOWFLAKE_ACCOUNT = 'your_account';
const SNOWFLAKE_USER = 'your_username';
const SNOWFLAKE_PASSWORD = 'your_password';
const SNOWFLAKE_DATABASE = 'your_database';
const SNOWFLAKE_SCHEMA = 'your_schema';
const SNOWFLAKE_WAREHOUSE = 'your_warehouse';
const SNOWFLAKE_ROLE = 'your_role';
const SNOWFLAKE_TABLE = 'your_table';

// Create a Snowflake connection object
const connection = snowflake.createConnection({
  account: SNOWFLAKE_ACCOUNT,
  username: SNOWFLAKE_USER,
  password: SNOWFLAKE_PASSWORD,
  database: SNOWFLAKE_DATABASE,
  schema: SNOWFLAKE_SCHEMA,
  warehouse: SNOWFLAKE_WAREHOUSE,
  role: SNOWFLAKE_ROLE,
  clientSessionKeepAlive: true
});

// Connect to Snowflake and return a promise
function connectToSnowflake() {
  console.log('Connecting to the DB');
  return new Promise((resolve, reject) => {
    connection.connect((err, conn) => {
      if (err) {
        reject('Unable to connect: ' + err.message);
      } else {
        console.log(`Successfully connected to Snowflake with connection ID: ${conn.getId()}`);
        resolve(conn);
      }
    });
  });
}

// Function to execute the query
function executeQuery() {
  return new Promise((resolve, reject) => {
    const query = `SELECT COUNT(*) FROM ${SNOWFLAKE_TABLE}`;

    connection.execute({
      sqlText: query,
      complete: (err, stmt, rows) => {
        if (err) {
          reject('Failed to execute statement due to the following error: ' + err.message);
        } else {
          console.log('Query successfully executed.');
          console.log('Count: ', rows[0]['COUNT(*)']);
          resolve();
        }
      }
    });
  });
}

// Main function to connect and then execute the query
async function main() {
  try {
    await connectToSnowflake();
    await executeQuery();
  } catch (error) {
    console.error(error);
  }
}

// Run the main function
main();

process.stdin.resume();

Other note, if I apply the second patch which undoes the callback/this heartbeat change, and I run this script, I get this output:

{"level":"DEBUG","message":"[10:41:08.364 AM]: Issuing heartbeat call"}
{"level":"TRACE","message":"[10:41:08.366 AM]: Get agent with id: https://sf13288.us-east-2.aws.snowflakecomputing.com-noKeepAlive from cache"}
{"level":"TRACE","message":"[10:41:08.367 AM]: CALL POST with timeout 90000: https://SF13288.us-east-2.aws.snowflakecomputing.com/session/heartbeat?requestId=72248e8d-6a5e-4e0e-bc66-74cb035173fb"}
{"level":"TRACE","message":"[10:41:08.451 AM]: socket reused = true"}
{"level":"DEBUG","message":"[10:41:08.520 AM]: Heartbeat response {\"data\":null,\"code\":null,\"message\":null,\"success\":true}"}

@spdaley
Copy link
Author

spdaley commented Oct 16, 2024

Just for full disclosure, we've been running patch 2 in our environment for half a day and have not seen these errors.

@sfc-gh-dszmolka
Copy link
Collaborator

this is excellent, appreciated @spdaley ! both for testing the patch(es) and the self-contained repro. now we have a strong indication that the issue you're seeing is connected to the change introduced in #908. We'll see how to best address this.

@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage labels Oct 17, 2024
@sfc-gh-dszmolka
Copy link
Collaborator

fix incoming on #939

@sfc-gh-dszmolka sfc-gh-dszmolka added the status-pr_pending_merge A PR is made and is under review label Oct 23, 2024
@sfc-gh-dszmolka
Copy link
Collaborator

PR is merged and will be part of the next release

@sfc-gh-dszmolka sfc-gh-dszmolka added status-fixed_awaiting_release The issue has been fixed, its PR merged, and now awaiting the next release cycle of the connector. and removed status-pr_pending_merge A PR is made and is under review labels Nov 4, 2024
@sfc-gh-dszmolka
Copy link
Collaborator

released with Snowflake Node.js driver v1.15.0 in the October 2024 release cycle

@sfc-gh-dszmolka
Copy link
Collaborator

also changed the logs to avoid empty ({}) Error messages logged (#944). will be rolled out with November 2024 release cycle.
your repro helped tons in testing the change, really appreciated @spdaley !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working status-fixed_awaiting_release The issue has been fixed, its PR merged, and now awaiting the next release cycle of the connector. status-triage_done Initial triage done, will be further handled by the driver team
Projects
None yet
Development

No branches or pull requests

4 participants
@spdaley @sfc-gh-dszmolka @sfc-gh-snow-drivers-warsaw-dl and others