Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow serializing SM state #239

Merged
merged 3 commits into from
Nov 19, 2024
Merged

Conversation

singpolyma
Copy link
Contributor

It is useful in some applications to get a snapshot of the stream management state for storage outside the process, in order to recover from crashes and other things. The get_sm_state already present is not suitable for this because it returns a live object tied in to the current context and such, and containing much unneeded internal-api data.

Introduce a new get_sm_serializable_state that creates instead a fixed snapshot, handing full memory ownership to the caller, and a dual set_sm_serializable_state which takes in the bare minimum needed values in order to resume and sets up a new sm_state based on these values.

@sjaeckel
Copy link
Member

Thanks for the PR.

Exposing library internals is against the self-set rules of this library and therefore not mergeable. By exposing the struct members, this struct is from then on part of the API&ABI and changing anything in it would require a major version increase.

IMO it would be better if we were exporting everything for the user, maybe into a binary blob!? An API could look something like:

/* This takes ownership of `sm_state`. It allocates a new blob buffer,
serializes all the data contained within the `sm_state` and returns
an opaque buffer `blob` of len `blen`.
After serialization it first zeroizes, then free's `sm_state` completely.

The data can IMO be in an arbitrary format of your choosing, feel
free to propose something. It should be simple enough and robust.

The only feature that format requires from my side is "versioning". */
int xmpp_sm_state_serialize(xmpp_sm_state_t *sm_state, void **blob, size_t *blen);
/* Reverse the serialization */
int xmpp_sm_state_deserialize(xmpp_ctx_t *ctx, const void *blob, size_t blen, xmpp_sm_state_t **sm_state);

This doesn't mean that this blob must not be "text editor compatible", but my target is to take the burden away from the user to care about handling that struct. Doing that might be OK for you, but it won't be OK for others.

Regarding the "protocol" of the exported data I'm not sure what we should do ... Maybe use the SSH wire format? c.f. RFC4251 Ch.5, it's simple enough to be re-implemented and pretty robust, also it supports all features we require if I'm not mistaken. IMO we don't have to care about low-level details like e.g. endianness, we should expect this data to only be read on the machine that it was written from.

What do you think? Let's discuss this, before diving into another implementation.

@singpolyma
Copy link
Contributor Author

singpolyma commented Feb 27, 2024

Exposing library internals is against the self-set rules of this library and therefore not mergeable

That's why I wrote a new struct instead of re-using any of the internals.

As you say though, I don't really care how I get this data, so it doesn't have to be a struct. So long as I can get the 4 values I need: current value of sent counter, current value of received counter, current sm id, and currently queued stanzas.

I guess what you're saying is that technically I could structure my code to not know about those 4 values. Looking at what I'm doing I guess that's true since I always just store them as-is and retrieve them as-is. I can't really think of a reason I would need to introspect except debugging so maybe that's fine.

This takes ownership of sm_state

I definitely need to be able to serialize without taking ownership, since strophe needs to keep ownership in order to stay connected and keep updating the state as it goes. I call this a lot (on every sm state change, though I don't have events for quite all of them in strophe yet but it's on my todo list).

Regarding the "protocol" of the exported data I'm not sure what we should do ... Maybe use the SSH wire format? c.f. RFC4251 Ch.5

This could work. I propose instead using CBOR (RFC8949) -- it will be a few more bytes, but not more complex to generated, and can be parsed by more tooling and has an official encoding for lists of strings, which a main part of the data here. We would of course only accept our own output as input and not a full arbitrary CBOR parser. So the format could be something like:

version sent (Big Endian)           received (Big Endian)   id (max length 32)  queue (max length 32)
\x01    \x1a\x00\x00\x00\x01        \x1a\x00\x00\x00\x01    \x62id              \x81\x59\x00\x03</>

@sjaeckel
Copy link
Member

sjaeckel commented Feb 28, 2024

I propose instead using CBOR (RFC8949)

It looks like we can trim this down to an acceptable complexity.

queue (max length 32)

Why would you limit to 32? I could understand to 24, but 32 requires to implement two types, 0x80..0x97 the array Major Type with the length included + 0x98 the array Major Type with "one-byte uint8_t for n".

From the cost of implementation complexity I would prefer to go with a "always non-preferred" way of encoding all required types as their 0x*a variant "four-byte uint32_t (follows|for n)". This would then basically be a type-tagged version of the SSH wire format, but I see your point of tooling to read that data.

Your example would then look as follows:

version                 sent (Big Endian)           received (Big Endian)   id
\x1a\x00\x00\x00\x01    \x1a\x00\x00\x00\x01        \x1a\x00\x00\x00\x01    \x7a\x00\x00\x00\x02id

queue
\x9a\x00\x00\x00\x01\x7a\x00\x00\x00\x03</>

FYI I've corrected the type of the queue entry from \x5* to \x7*, since we're always dealing with UTF-8 strings.

I definitely need to be able to serialize without taking ownership, since strophe needs to keep ownership in order to stay connected and keep updating the state as it goes.

I'm not sure whether I like that idea. The current way of only allowing the retrieval of the sm_state while disconnected is intentional, since otherwise we can't be sure whether that sm_state is valid or whether it's a stale one that had been updated internally already.

Maybe @pasis wants to add something to that discussion?

@singpolyma
Copy link
Contributor Author

I'm not sure whether I like that idea. The current way of only allowing the retrieval of the sm_state while disconnected is intentional, since otherwise we can't be sure whether that sm_state is valid or whether it's a stale one that had been updated internally already.

So I'll give you a better overview of what I'm doing (currently doing this in js with xmpp.js but planning to add a native target that uses libstrophe):

Every time a stanza comes in, or goes into the queue, or gets confirmed by an ack, or is known to have failed after a resume, I snapshot the sm state and serialize it, storing in persistent storage. This allows me to hydrate the state and attempt a resume even after a full crash or other unceremonious termination. In most cases this allows me to resume saving lots of time and keeping the stream intact. In the worst case, something has diverged and the resume fails pushing me back to a vanilla reconnect that I would have needed anyway without this.

I do understand that as a consumer of this API it's my job to make sure I snapshot every time there is a meaningful change, otherwise my serialized state will be useless.

@sjaeckel
Copy link
Member

sjaeckel commented Apr 9, 2024

Thanks for the explanation.

I guess we have to give up something for your use-case then, and I thought whether it wouldn't be better if it were more interactive? I'm thinking of a callback mechanism you can register which gets called each time we update our SM state.

My first idea looked like follows (NB: I'm not sure whether that's complete, I may have missed something :) )

enum xmpp_event {
   /* `h` == 0 -> enabled
    * `h` != 0 -> resumed
    * `data` == NULL
    */
   XMPP_SM_ENABLE,
   /* `h` == `sm_handled_nr`
    * `data` == NULL
    */
   XMPP_UPDATE_H,
   /* `h` == 0
    * `data` == stanza
    */
   XMPP_Q_ADD,
   /* `h` == 0
    * `data` equals to `data` of a previous `Q_ADD` event
    */
   XMPP_Q_DROP,
   /* `h` != 0
    * `data` equals to `data` of a previous `Q_ADD` event
    */
   XMPP_SM_Q_MOVE,
   /* `h` != 0
    * `data` equals to `data` of a previous `SM_Q_MOVE` event
    */
   XMPP_SM_Q_DROP,
};

typedef int (*xmpp_sm_callback)(xmpp_conn_t *conn, void *ctx, const char *id, enum xmpp_event ev, uint32_t h, const char *data);

int xmpp_sm_state_set_callback(xmpp_conn_t *conn, xmpp_sm_callback cb, void *ctx);
int xmpp_sm_state_restore(xmpp_conn_t *conn, enum xmpp_event ev, uint32_t h, const char *data);

The library calls the callback each time it changes the SM state with what happened.
To restore, the user has to replay the sequence of all events that haven't been DROPped resp. the last value of UPDATE_H.

This would be bandwidth friendly, but handling the data and restoring is less user friendly.

The complexity of the handling&restore, even though it's not so super complex, brought me back to what you basically proposed. Handing over the complete state via the callback, but in a serialized form.

typedef int (*xmpp_sm_callback)(xmpp_conn_t *conn, void *ctx, const void *sm_state, size_t sm_state_len);

int xmpp_sm_state_set_callback(xmpp_conn_t *conn, xmpp_sm_callback cb, void *ctx);
int xmpp_sm_state_restore(xmpp_conn_t *conn, const void *sm_state, size_t sm_state_len);

The library calls the callback each time it changes the SM state passing over the serialized state.
To restore the user has to pass the last state in.

What do you think?

Another question: what do you currently do with the "regular send queue"?

@singpolyma
Copy link
Contributor Author

What do you think?

Since that kind of callback is exactly what I need to simulate on my side either way, it seems pretty much ideal to me if we can get it as you say.

Another question: what do you currently do with the "regular send queue"?

I'm not sure what you mean here. Is there a second queue in use that is not tied to SM state?

@sjaeckel
Copy link
Member

sjaeckel commented Apr 9, 2024

What do you think?

Since that kind of callback is exactly what I need to simulate on my side either way, it seems pretty much ideal to me if we can get it as you say.

I guess the question should've been clearer then :D

What do you think of the two approaches? Could you live with the second one?

Another question: what do you currently do with the "regular send queue"?

I'm not sure what you mean here. Is there a second queue in use that is not tied to SM state?

Yes, there's the regular send queue which buffers stanzas before they get sent out.

  1. xmpp_send*() puts the (rendered) stanza in the send queue
  2. xmpp_run_once() tries to send the "next" element of the send queue
  3. after a stanza has been sent completely it is put in the SM queue

There's a mechanism of stanza counting involved to determine what was received on the server side

  1. we request the servers' h value by sending an <r>(equest) stanza after we put the first stanza into the send queue
  2. we drop all stanzas that are marked as received by the server when receiving the <a>(nswer) which contains the servers' current h

You can retrieve the regular send queue elements via xmpp_conn_send_queue_drop_element(), which will pop them off the list. This is as it is, since the initial use case was not what you're intending to do, but a flaky connection that maybe is too slow to transmit all data and the user wants to remove elements from the send queue if it goes above a certain limit of queued stanzas.

FYI I've found two issues after re-reviewing the implementation. This only matters if you drop send queue elements and I will provide a fix soon.

@singpolyma
Copy link
Contributor Author

What do you think of the two approaches? Could you live with the second one?

The second one being where there is an event that fires every time the SM state changes and it includes the serialized payload? Yes, this is ideal for me.

Yes, there's the regular send queue which buffers stanzas before they get sent out.

Hmm, that's unexpected. I would consider this part of the SM state really (on my side) since in the event of a reconnect these are still things that need to get re-tried (even if they were never tried once). So I guess if I save what you think of as SM state without this there will be a data loss race condition.

Would it be crazy from your side to include the contents of this queue in the new "state changed" event and restore it with the same restore call?

@sjaeckel
Copy link
Member

Would it be crazy from your side to include the contents of this queue in the new "state changed" event and restore it with the same restore call?

No, I also think this should be included in the serialized data.

The serialized format could then look like this:

uint32           version
array_of_string  send_queue
uint32           sent
uint32           received
string           id
map_of_string    sm_queue

Where the sm_queue map uses the h value as key and the stanza as value.

Am I missing something?

@singpolyma
Copy link
Contributor Author

I'm not sure why sm_queue needs to be a map? But I'm happy to encode it that way if it's better for you

@sjaeckel
Copy link
Member

sjaeckel commented Apr 10, 2024

I'm not sure why sm_queue needs to be a map?

Because we need to keep track of the h<->stanza relation. In the resumed response from the server, when trying resumption, we receive the last h the server processed and then we have to drop all stanzas with smaller h that are maybe still in the queue, but have to re-send all stanzas with higher h.

@sjaeckel
Copy link
Member

sjaeckel commented Apr 10, 2024

We could also calculate the h value from sent, that's true.
I thought it's maybe easier&cleaner as proposed. If you think it's sufficient or maybe even better as an array, we can also do it like that.

@singpolyma
Copy link
Contributor Author

I'm happy to do it as a map if that's easier. From outside I just get a blob and hand it back so doesn't matter too much to me

@singpolyma
Copy link
Contributor Author

I have pushed a new draft based on our discussion, see if this matches what you thought

@singpolyma singpolyma marked this pull request as ready for review October 21, 2024 18:48
Makefile.am Show resolved Hide resolved
@sjaeckel
Copy link
Member

sjaeckel commented Nov 19, 2024

I took the liberty to work a bit on this. I hope you're fine with the refactor!?

Just FTR: One can run a (failing) GH action locally with act [0].

  1. act -j release-test will spawn the docker container and run the failing CI job.
  2. docker ps -a will show you the ID of the still running container after failure.
  3. docker exec -it <container ID> /bin/bash will spawn a shell for you in that container.

[0] https://github.com/nektos/act

@singpolyma
Copy link
Contributor Author

@sjaeckel sure, seems likely to have equivalent functionality and still passes the test 👍

singpolyma and others added 3 commits November 19, 2024 16:26
It is useful in some applications to get a snapshot of the stream
management state for storage outside the process, in order to recover
from crashes and other things.  The get_sm_state already present is not
suitable for this because it returns a live object tied in to the
current context and such, and containing much unneeded internal-api
data.

Introduce xmpp_sm_state_set_callback which is called every time the
state changes with a serialized state, and a dual xmpp_sm_state_restore
which takes in this serialized state and sets up a new sm_state based on
that.

The serialization is considered opaque from the PoV of the API, but is
based on CBOR to facilitate easy debugging.
* Refactor `xmpp_sm_state_restore()`

Add new internal functions to load u32 and string values. This also brings
length checks in order to verify we don't read more sm_state than there
potentially is and type checks to ensure we have the correct CBOR types.

Each element is added to the queue right after creation, so it won't leak
in case creating its content fails.

* Refactor `sm_state_serialize()`

* Store & restore ints as/from big endian, CBOR requires that

* Bring API names in line with the usual naming

* Auto-format sources

Signed-off-by: Steffen Jaeckel <[email protected]>
`xmpp_run_once()` uses `FD_SET()` to determine which sockets to `select()`
on. The socket gets initialized to `INVALID_SOCKET = -1` inside
`xmpp_conn_new()` which leads to a buffer overflow once `FD_SET()` is
called.

This is only exposed when enabling a higher optimization level, which
was only done in the `release-test` CI job.

* Fix this testcase by initializing the socket to a possible value.
* Build the Valgrind CI jobs with `-O2`.
* Make the output of the `release-test` more verbose.

Signed-off-by: Steffen Jaeckel <[email protected]>
@sjaeckel sjaeckel merged commit e3d734c into strophe:master Nov 19, 2024
39 checks passed
@singpolyma singpolyma deleted the serializable-sm-state branch November 20, 2024 03:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants