-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRIVERS-2578 Drivers use polling SDAM on AWS Lambda #1452
Changes from all commits
e29f381
2445f62
e676bbb
e749ada
c580f05
fc6449c
934990e
f509738
94b7206
4da5456
0cccfab
73fcd6f
acf3ed9
9e2acb0
fa108ae
0eb23a0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -91,6 +91,28 @@ Round trip time. The client's measurement of the duration of one hello or legacy | |
The RTT is used to support `localThresholdMS`_ from the Server Selection spec | ||
and `timeoutMS`_ from the `Client Side Operations Timeout Spec`_. | ||
|
||
FaaS | ||
```` | ||
|
||
A Function-as-a-Service (FaaS) environment like AWS Lambda. | ||
|
||
serverMonitoringMode | ||
```````````````````` | ||
|
||
The serverMonitoringMode option configures which server monitoring protocol to use. Valid modes are | ||
"stream", "poll", or "auto". The default value MUST be "auto": | ||
|
||
- With "stream" mode, the client MUST use the streaming protocol when the server supports | ||
baileympearson marked this conversation as resolved.
Show resolved
Hide resolved
|
||
it or fall back to the polling protocol otherwise. | ||
- With "poll" mode, the client MUST use the polling protocol. | ||
- With "auto" mode, the client MUST behave the same as "poll" mode when running on a FaaS | ||
platform or the same as "stream" mode otherwise. The client detects that it's | ||
running on a FaaS platform via the same rules for generating the ``client.env`` | ||
handshake metadata field in the `MongoDB Handshake spec`_. | ||
|
||
Multi-threaded or asynchronous drivers MUST implement this option. | ||
See `Why disable the streaming protocol on FaaS platforms like AWS Lambda?`_ and | ||
`Why introduce a knob for serverMonitoringMode?`_ | ||
|
||
Monitoring | ||
'''''''''' | ||
|
@@ -203,7 +225,7 @@ Clients use the streaming protocol when supported | |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
When a monitor discovers that the server supports the streamable hello or legacy hello | ||
command, it MUST use the `streaming protocol`_. | ||
command and the client does not have `streaming disabled`_, it MUST use the `streaming protocol`_. | ||
|
||
Single-threaded monitoring | ||
`````````````````````````` | ||
|
@@ -491,6 +513,22 @@ connected to a server that supports the awaitable hello or legacy hello commands | |
This protocol requires an extra thread and an extra socket for | ||
each monitor to perform RTT calculations. | ||
|
||
.. _streaming is disabled: | ||
|
||
Streaming disabled | ||
`````````````````` | ||
|
||
The streaming protocol MUST be disabled when either: | ||
|
||
- the client is configured with serverMonitoringMode=poll, or | ||
- the client is configured with serverMonitoringMode=auto and a FaaS platform is detected, or | ||
- the server does not support streaming (eg MongoDB < 4.4). | ||
|
||
When the streaming protocol is disabled the client MUST use the `polling protocol`_ | ||
and MUST NOT start an extra thread or connection for `Measuring RTT`_. | ||
|
||
See `Why disable the streaming protocol on FaaS platforms like AWS Lambda?`_. | ||
|
||
Streaming hello or legacy hello | ||
``````````````````````````````` | ||
|
||
|
@@ -528,9 +566,8 @@ Measuring RTT | |
|
||
When using the streaming protocol, clients MUST issue a hello or legacy hello | ||
command to each server to measure RTT every heartbeatFrequencyMS. The RTT command | ||
MUST be run on a dedicated connection to each server. For consistency, | ||
clients MAY use dedicated connections to measure RTT for all servers, even | ||
those that do not support awaitable hello or legacy hello. (See | ||
MUST be run on a dedicated connection to each server. Clients MUST NOT use | ||
dedicated connections to measure RTT when the streaming protocol is not used. (See | ||
`Monitors MUST use a dedicated connection for RTT commands`_.) | ||
|
||
Clients MUST update the RTT from the hello or legacy hello duration of the initial | ||
|
@@ -584,8 +621,8 @@ current monitoring connection. (See `Drivers cancel in-progress monitor checks`_ | |
Polling Protocol | ||
'''''''''''''''' | ||
|
||
The polling protocol is used to monitor MongoDB <= 4.4 servers. The client | ||
`checks`_ a server with a hello or legacy hello command and then sleeps for | ||
The polling protocol is used to monitor MongoDB < 4.4 servers or when `streaming is disabled`_. | ||
The client `checks`_ a server with a hello or legacy hello command and then sleeps for | ||
heartbeatFrequencyMS before running another check. | ||
|
||
Marking the connection pool as ready (CMAP only) | ||
|
@@ -661,6 +698,12 @@ The event API here is assumed to be like the standard `Python Event | |
heartbeatFrequencyMS = heartbeatFrequencyMS | ||
minHeartbeatFrequencyMS = 500 | ||
stableApi = stableApi | ||
if serverMonitoringMode == "stream": | ||
streamingEnabled = True | ||
elif serverMonitoringMode == "poll": | ||
streamingEnabled = False | ||
else: # serverMonitoringMode == "auto" | ||
ShaneHarvey marked this conversation as resolved.
Show resolved
Hide resolved
|
||
streamingEnabled = not isFaas() | ||
|
||
# Internal Monitor state: | ||
connection = Null | ||
|
@@ -671,8 +714,6 @@ The event API here is assumed to be like the standard `Python Event | |
rttMonitor = RttMonitor(serverAddress, stableApi) | ||
|
||
def run(): | ||
# Start the RttMonitor. | ||
rttMonitor.run() | ||
while this monitor is not stopped: | ||
previousDescription = description | ||
try: | ||
|
@@ -700,7 +741,10 @@ The event API here is assumed to be like the standard `Python Event | |
serverSupportsStreaming = description.type != Unknown and description.topologyVersion != Null | ||
connectionIsStreaming = connection != Null and connection.moreToCome | ||
transitionedWithNetworkError = isNetworkError(description.error) and previousDescription.type != Unknown | ||
if serverSupportsStreaming or connectionIsStreaming or transitionedWithNetworkError: | ||
if streamingEnabled and serverSupportsStreaming and not rttMonitor.started: | ||
# Start the RttMonitor. | ||
rttMonitor.run() | ||
Comment on lines
+744
to
+746
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a reason we want to disable the RTT monitor for servers that don't support streaming heartbeat (i.e. server versions before 4.4)? From reading the rationale "Why disable the streaming protocol on FaaS platforms like AWS Lambda?", it sounds like the goal of not starting the RTT monitor is to reduce open connections for FaaS environments. However, the suggested change would disable the RTT monitor for all databases earlier than 4.4, not just FaaS environments. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This isn't a new requirement - drivers shouldn't be creating RTT monitors when the polling protocol is used. The rtt monitors are only necessary when using the awaitable hello protocol, because then a round trip to the server can take up to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The spec wording and pseudo-code don't match your assertion. The Measuring RTT section specifically says drivers can still use a separate RTT monitor for any server version:
The current pseudo-code from the Monitor thread section also describes always starting the RTT monitor, independent of server version: class Monitor(Thread):
...
def run():
# Start the RttMonitor.
rttMonitor.run()
... The new spec wording in this PR only suggests disabling the RTT monitor when streaming is disabled, but nothing about whether or not it's supported:
Disabling the RTT monitor when streaming isn't supported is allowed by the spec. However, putting it in the pseudo-code makes it seem like a new requirement that conflicts with the spec wording. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @ShaneHarvey Can you opine on this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes this PR changes the rules to disallow using an extra thread/socket for RTT on <4.4 servers. The rationale is that the extra connection limits scaling and puts unnecessary load on the server, especially in Lambda where MongoClients are not shared. Although the motivation is mostly Lambda this is still a beneficial change for all environments. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To @matthewdale 's point, should this section from Measuring RTT be removed or updated to reflect these rule changes? :
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point. I updated this section to say clients MUST NOT use dedicated connections to measure RTT in this case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the clarification. Does that imply that the polling monitor should measure RTT (e.g. to support CSOT)? For the Go driver, that would significantly increases the scope of this change because we'd have to build RTT monitoring capability into the polling monitor. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm confused by this question because the polling protocol always included RTT monitoring. But yes, the polling protocol should measure RTT via it's own heartbeats, not with a separate dedication connection. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looking closer, the actual change in the Go driver would probably be minimal. It does add complexity because there is no longer a single source of RTT data for CSOT, but the change only impacts a few lines of code. Consider this comment thread resolved. |
||
if (streamingEnabled and (serverSupportsStreaming or connectionIsStreaming)) or transitionedWithNetworkError: | ||
continue | ||
|
||
wait() | ||
|
@@ -733,13 +777,13 @@ The event API here is assumed to be like the standard `Python Event | |
response = connection.handshakeResponse | ||
elif connection.moreToCome: | ||
response = read next helloCommand exhaust response | ||
elif previousDescription.topologyVersion: | ||
elif streamingEnabled and previousDescription.topologyVersion: | ||
# Initiate streaming hello or legacy hello | ||
if connectTimeoutMS != 0: | ||
set connection timeout to connectTimeoutMS+heartbeatFrequencyMS | ||
response = call {helloCommand: 1, helloOk: True, topologyVersion: previousDescription.topologyVersion, maxAwaitTimeMS: heartbeatFrequencyMS} | ||
else: | ||
# The server does not support topologyVersion. | ||
# The server does not support topologyVersion or streamingEnabled=False. | ||
response = call {helloCommand: 1, helloOk: True} | ||
|
||
# If the server supports hello, then response.helloOk will be true | ||
|
@@ -1140,6 +1184,32 @@ the "awaited" field on server heartbeat events so that applications can | |
differentiate a slow heartbeat in the polling protocol from a normal | ||
awaitable hello or legacy hello heartbeat in the new protocol. | ||
|
||
Why disable the streaming protocol on FaaS platforms like AWS Lambda? | ||
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' | ||
|
||
The streaming protocol relies on the assumption that the client | ||
can read the server's heartbeat responses in a timely manner, otherwise the | ||
client will be acting on stale information. In many FaaS platforms, like AWS | ||
Lambda, host applications will be suspended and resumed many minutes later. | ||
This behavior causes a build up of heartbeat responses and the client can end | ||
up spending a long time in a catch up phase processing outdated responses. | ||
This problem was discovered in `DRIVERS-2246`_. | ||
|
||
Additionally, the streaming protocol requires an extra connection and thread | ||
per monitored server which is expensive on platforms like AWS Lambda. The | ||
extra connection is particularly inefficient when thousands of AWS instances | ||
and thus thousands of clients are used. | ||
|
||
We decided to make polling the default behavior when running on FaaS platforms | ||
like AWS Lambda to improve scalability, performance, and reliability. | ||
|
||
Why introduce a knob for serverMonitoringMode? | ||
'''''''''''''''''''''''''''''''''''''''''''''' | ||
|
||
The serverMonitoringMode knob provides a workaround in cases where the polling | ||
protocol would be a better choice but the driver is not running on a FaaS | ||
platform. It also provides a workaround in case the FaaS detection | ||
logic becomes outdated or inaccurate. | ||
|
||
Changelog | ||
--------- | ||
|
@@ -1159,6 +1229,8 @@ Changelog | |
:2022-04-05: Preemptively cancel in progress operations when SDAM heartbeats timeout. | ||
:2022-10-05: Remove spec front matter reformat changelog. | ||
:2022-11-17: Add minimum RTT tracking and remove 90th percentile RTT. | ||
:2023-10-05: Add serverMonitoringMode and default to the polling protocol on FaaS. | ||
Clients MUST NOT use dedicated connections to measure RTT when using the polling protocol. | ||
|
||
---- | ||
|
||
|
@@ -1183,4 +1255,6 @@ Changelog | |
.. _Why synchronize clearing a server's pool with updating the topology?: server-discovery-and-monitoring.rst#why-synchronize-clearing-a-server-s-pool-with-updating-the-topology? | ||
.. _Client Side Operations Timeout Spec: /source/client-side-operations-timeout/client-side-operations-timeout.rst | ||
.. _timeoutMS: /source/client-side-operations-timeout/client-side-operations-timeout.rst#timeoutMS | ||
.. _Why does the pool need to support closing in use connections as part of its clear logic?: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#Why-does-the-pool-need-to-support-closing-in-use-connections-as-part-of-its-clear-logic? | ||
.. _Why does the pool need to support closing in use connections as part of its clear logic?: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#Why-does-the-pool-need-to-support-closing-in-use-connections-as-part-of-its-clear-logic? | ||
.. _DRIVERS-2246: https://jira.mongodb.org/browse/DRIVERS-2246 | ||
.. _MongoDB Handshake spec: /source/mongodb-handshake/handshake.rst#client-env |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's worth extending the test plan for handshakes (https://github.com/mongodb/specifications/blob/master/source/mongodb-handshake/handshake.rst#test-plan) to check that the the initial handshake uses OP_QUERY? As of this ticket, there is a possibility that polling could be used with a server versions >= 4.4 . And AFAIK this is novel and so potentially important to ensure that correct handshake logic is being implemented.
It's probably not very important, as the wire message between streaming / polling are not that different. But I do notice we don't check this anywhere and so it could be useful to do so as a part of this change.
Here is an example of how the Go Driver intends to check this: https://github.com/mongodb/mongo-go-driver/blob/fe237bb0079574fb76c859332218d7c49921210d/mongo/integration/handshake_test.go#L177-L182
CC: @baileympearson
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this idea is worthwhile but as a new drivers ticket since this PR doesn't change the OP_QUERY vs OP_MSG behavior.