The Streaming Data Sync Protocol (SDSP) allows publishers of realtime data to efficiently synchronize an object with decoupled subscribers. The protocol is designed to be interoperable with any underlying platforms and transports. The protocol is particularly effective when there is a single publisher and one or more decoupled subscribers (pub/sub pattern).
Unlike the many existing protocols that enable sync between publishers and subscribers, SDSP is designed to be agnostic to the underlying streaming/realtime protocol, the data model and payload, the delta algorithm, the optional history retrieval feature, and the platform. Additionally, the publisher can be decoupled from subscribers. For example, a protocol such as Meteor DDP may appear to have similar goals, yet it requires the publisher and subscriber to be coupled and communicate directly, it limits the payload types that can be synchronized, and it is primarily designed for bi-directional synchronization.
SDSP is therefore a good fit when:
- You want to publish objects and subsequent changes in real time from typically one publisher to any number of subscribers
- Your data objects are sufficiently large that sending a delta for each update is more efficient than republishing the entire object (as a guide, objects over 1KB normally satisfy this requirement)
- You have any underlying stream, realtime transport or queue (such as a WebSocket, Kafka, AMQP or Ably) to publish data deltas to subscribers
- You want the data sync updates to be portable without any coupling to the publisher. For example, a publisher could publish an update over HTTP, that data could be republished by a 3rd party into a queue, and then later be exposed to another 3rd party over Pub/Sub channels. The subscriber should be able to trivially reconstruct the entire data object without any access to the publisher or its systems
- Optionally, if you are publishing large objects that are unsuitable for the underlying transport due to their size, you have another storage mechanism such as Amazon S3 or Google Cloud Storage
For more background information on the motivation for this spec, please read the vendor neutral open approach to decoupled data synchronization article.
The design goals of the protocol are:
- Minimal complexity - implementing a publisher or subscriber from scratch should be simple
- Bandwidth efficiency - simplicity and portability take precedence over bandwidth efficiency, however the protocol must be sufficiently efficient to be useful
- Portability - the protocol aims to be sufficiently portable to operate on top of any streaming, realtime or queuing protocols, use any storage mechanism, allow any delta algorithm to be used and support any payload type
- Decoupled - the subscriber should never need to contact the publisher to reconstruct the original object and apply updates it receives
- Plug-in design - the libraries developed to implement this protocol should aim to be portable and not couple themselves to any realtime, streaming or storage technology. Instead, a plug-in design should be used for the transport, storage, history etc.
- Open protocol - this is a fledgling protocol at present, however contributions from the community and vendors are not only welcome, but necessary to make this protocol useful. We hope that, in time, a number of free and open source libraries will be made available using this protocol and that the standard can be more formally defined
- "Delta Algorithm" is the algorithm used to generate a delta between two Objects.
- "Data Frame" is the data structure that is published over the Transport by a Publisher that includes one or more of: the data delta generated from the old and new object or the underlying applied change; the unique incrementing serial number; the Object UID; the delta algorithm used
- "History Plug-in" is a required plug-in, that using a pre-defined interface, allows the Library to retrieve Transport History in a uniform way. Some plug-ins may be bundled with the Library, however the Library must support 3rd party plug-ins
- "Library" is the client library that is used to create Data Frames ready for publishing on the Transport, and also consume Data Frames received from the Transport. The Library additionally provides APIs to generate data deltas from old and new Objetcts and publish and retrieves Object from an Object Storage Location
- "Object Storage Location" is a URI, preferably accessible over the Internet (this may not always be possible), containing the location of
- "Object Storage Plug-in" is an optional plug-in, that using a pre-defined interface, allows the Library to persist an Object and generate a URI and Object UID. Object Storage is most commonly used to avoid publishing large Objects on the Transport
- "Object UID" is the unique identifier assigned to an Object when persisted to the Object Storage Location. This can be assigned prior to being stored by the Publisher, or can be issued Storage service when issuing an Object Storage Location. In fact, the Object Storage Location itself is a UID, although it can be verbose and thus a shorted UID may be preferred.
- "Object" is an object, text or binary type that a Publisher wants to broadcast, along with any changes, to any number of subscribers. Subscribers consume Data Frames construct an Object that is kept in sync with each subsequent Data Frame
- "Publisher" is the device that is publishing the Object and generating Data Frames from the Library
- "Storage" is any available storage system that is used to store arbitrary binary or text objects
- "Subscriber" is the device that is consuming Data Frames and using the Library to construct the Object and emit updates
- "Transport" is any underlying delivery mechanism for the updates. This could for example be a realtime transport (such as a pub/sub channel), a stream (such as Kafka), a message queue (such as AMQP) or even just a static representation of the data perhaps exposed over HTTP with periodic polling
- "Transport History" is a service that allows previously published Data Frames on the Transport to be retrieved upon request
- "TTL" (commonly know as Time to live) is the mechanism that limits the amount of time History and Objects are available for an Object to retrieve. It is recommended that any additional authentication information is required to retrieve the Object from this URI and should instead be embedded in the URI. If separate authentication credentials are required, it will likely make it more difficult for Subscribers to retrieve the Object
Before a Publisher can start the sync process for an object and publish deltas for Subscribers, the Object must be set up. This is achieved by broadcasting a Data Frame with either the complete Object, or the location of the object so that the subscriber can retrieve that object. In the case of very large objects, the latter is the recommended approach.
- A unique ID must be assigned to every Object published over a Transport. It is recommended that this UID is a GUID, but this is not a requirement. The UID can either be assigned by the Publisher, or retrieved if and when the Object is persisted into Storage. If the Publisher assigns a UID, it is the responsibility of the Publisher to ensure the UID is unique. In the case that an Object is persisted into Storage, the URI itself could be used as the UID, or alternatively a UID returned in the response after storing the object in the Storage can be used. In this case, it is the responsibility of the Storage to ensure the UID is unique for each Object stored. When an Object Storage Plug-in is used, the plug-in will persist the object and always return an opaque URI and UID. There is no requirement that the UID and URI are the same; this is left up to the Object Storage Plug-in.
- The first Data Frame that is published for an Object contains the `uid` assigned in the previous step, the `serial` attribute `0`, and either a `dataUri` attribute with the Object Storage Location (allowing any Subscriber to retrieve the Object) or a `data` element containing the complete Object. Given all Transports have different constraints, it is at the discretion of the Publisher as to when an Object should be persisted in Storage or published in the `data` attribute.
- The `ver` attribute (representing the version of the last published object) may optionally be included with the value `0`. However this is unnecessary as a missing value is considered to be `0`. (An example setup frame is sketched after this list.)
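For illustration, here is a minimal sketch of a setup frame in TypeScript, assuming JSON encoding on the Transport; the `uid`, payload and storage URL are hypothetical, not part of this specification:

```typescript
// Hypothetical initial Data Frame carrying the complete Object inline
const setupFrame = {
  uid: "6b0d4f6e-1c2a-4e52-9f1b-3d8c2a7e9b10", // Publisher-assigned GUID (example)
  serial: 0,                                   // the first frame for an Object is serial 0
  data: { symbol: "AAPL", bid: 170.12, ask: 170.15 } // the complete Object, published inline
};

// Alternative for large Objects: persist to Storage first and publish its location
const setupFrameViaStorage = {
  uid: "6b0d4f6e-1c2a-4e52-9f1b-3d8c2a7e9b10",
  serial: 0,
  dataUri: "https://storage.example.com/objects/6b0d4f6e" // hypothetical Object Storage Location
};
```

The `ver` attribute is omitted in both cases, as a missing value is treated as `0`.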
For every update to an Object, the Publisher will generate a delta that allows all subscribers to keep their copy of the object in sync with the Publisher's Object. As this specification is intentionally portable and can use any Delta Algorithm, the Publisher may generate the delta itself from the old and new Objects, or simply publish an already generated delta provided by the caller. Some Delta Algorithms may in fact support only one of these two models.
- Each Data Frame update contains a `uid` for the Object, and is assigned a new `serial` number which is one whole number higher than the previous `serial` for that Object UID. The `serial` number is an increment-only counter.
- Every Data Frame must either contain a `delta` attribute with the delta, based on the chosen Delta Algorithm, to be applied to the Object at the previous `serial`, or it must contain a `data` Object or `dataUri` with a URI to the Object that will completely replace the previous Object. It is at the discretion of the Publisher to decide when to publish a complete Object vs a delta; however, in many cases it is better to publish the entire Object again when the delta is greater in size than the Object itself, or when the cumulative size of deltas since the last complete Object is greater.
- When a delta is published by specifying a `delta` attribute only, the Data Frame must contain the attribute `alg` with a Delta Algorithm code. See the list of reserved codes below; custom algorithms are prefixed with `X-` such as `X-crdt`. Additionally, the `ver` number must be incremented by one whole number from the last Data Frame. When the Data Frame is missing a `ver` attribute, the assumed value is `0`, and thus the first `delta` applied to a published Object is always `1`. Unlike `serial`, the version attribute is reset to zero whenever a new complete data object is published.
- When a `data` or `dataUri` attribute is included, the `ver` attribute (representing the version of the last published object) may optionally be included with the value `0`. However this is unnecessary as a missing value is considered to be `0`.
- It is never valid to provide both a `delta` and a `data` or `dataUri` attribute.
- A reserved attribute `checksum` may optionally be specified when the `delta` attribute is present. It may contain an object with at least the attributes `val` and `type`. Subscribers who want additional assurance that the resulting Object after applying a delta is exactly the same as the Publisher's view can elect to compare the checksum of the modified Object with the Publisher's checksum. The `val` attribute contains the checksum string value, and the `type` attribute describes how the checksum is generated, such as `MD5`. However, given the `MD5` checksum needs to be applied to an encoded object, the `type` attribute may take the form `utf-8/MD5`, which indicates that the object is first serialized to a `utf-8` string and then hashed with `MD5`. Other than the defined attributes, the mechanism for generating and validating checksums is outside of the scope of this specification. A checksum is generally considered unnecessary, as the combination of a `uid`, a `serial` and a deterministic delta algorithm ensures the Publisher and all Subscribers will always have exactly the same objects unless there are platform incompatibilities or bugs in the delta libraries themselves.
- The Data Frame may include a `meta` attribute. The Publisher can use this attribute to pass any arbitrary data to Subscribers. (An example update frame is sketched after this list.)
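Continuing the hypothetical example above, a sketch of a subsequent update frame using the reserved `jp` (JSON Patch) algorithm code might look as follows:

```typescript
// Hypothetical update Data Frame: a JSON Patch delta against the Object at serial 0
const updateFrame = {
  uid: "6b0d4f6e-1c2a-4e52-9f1b-3d8c2a7e9b10",
  serial: 1, // one whole number higher than the previous serial
  ver: 1,    // first delta after a complete Object (a missing ver is treated as 0)
  alg: "jp", // reserved code for JSON Patch
  delta: [{ op: "replace", path: "/bid", value: 170.14 }]
};
```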
When a Publisher wishes to inform all Subscribers that the Object will no longer be updated and is now frozen / immutable, a Data Frame can be published as follows:
- The freeze Data Frame contains the `uid` for the Object. It is assigned a new `serial` number which is one whole number higher than the previous `serial` for that Object UID, and a new `ver` version which is one whole number higher than the previous `ver` for that Object UID. If the previous Data Frame's `ver` attribute was omitted, then the `ver` attribute of the new Data Frame must be `1`.
- The Data Frame additionally includes a `frozen` attribute with the value `true`. (An example freeze frame is sketched after this list.)
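Continuing the same hypothetical example, a freeze frame might look as follows:

```typescript
// Hypothetical freeze Data Frame: the Object becomes immutable
const freezeFrame = {
  uid: "6b0d4f6e-1c2a-4e52-9f1b-3d8c2a7e9b10",
  serial: 2,   // one higher than the previous serial
  ver: 2,      // one higher than the previous ver
  frozen: true // no further updates will follow for this Object
};
```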
Subscribers consume Data Frames and construct the Object. A Subscriber will typically perform the setup either in response to the first Data Frame provided to it, or pre-emptively when the Transport provides a mechanism for the Subscriber to retrieve the last published Data Frame.
When the Subscriber Library is initialized, it may optionally be configured with a history plug-in that, using a generic interface, allows the Subscriber to request previous Data Frames from the Transport History service.
The process is as follows:
- If a Subscriber wishes to pre-emptively construct the Object, the Subscriber will need a means to obtain the last published Data Frame or Object. Given the means to do this will probably be via a Publisher-exposed API (such as REST), a bespoke API available in the Transport itself, or via the history plug-in, it is difficult to define how this should be done in a portable way. As such, this specification makes no attempt to define how this feature should work. Subscribers that have no means to obtain the last published Data Frame or Object must wait for the first Data Frame to be delivered before the Object can be constructed.
- Once the last published Data Frame is received, the Subscriber will need to do the following to construct the Object:
  - If the Data Frame includes a `data` or `dataUri` attribute, then the entire Object is available immediately by inspecting the `data` attribute or loading the Object from the `dataUri`. At this point no further work is necessary and the Object Setup is complete.
  - If a `historyUri` attribute is present, then a request should be made to retrieve all historical Data Frames in a single request.
  - Otherwise, it is up to the Subscriber to obtain the historical Data Frames using the Transport History. There is no defined means in this specification to obtain history as this is likely to be Transport specific; however, if a `historyUri` is not present in each Data Frame, then the Subscriber must be configured with a history plug-in to ensure it can retrieve previous Data Frames as necessary. The history plug-in is provided the Data Frame in order to retrieve the Data Frame history.
  - The `historyUri` and the history plug-in must return an array of Data Frames. It is the Library's responsibility to prepare the Data Frames by sorting them in descending order by `serial`, removing any duplicates, and discarding any Data Frames with a serial greater than or equal to the last published Data Frame's serial. The Library should then iterate through the Data Frames, ensuring the `ver` and `serial` attributes decrease by one, until a Data Frame with an empty `ver` or `ver` `0` is reached. This Data Frame contains the last published complete Object.
  - The Library obtains the complete Object Data Frame and applies all Data Frame deltas in ascending `serial` order until the Object is up to date with the last published Data Frame. (This process is sketched in code after this list.)
The Data Frame is a data structure that is published on the Transport for the initial Object setup, and then for subsequent updates. It is not invalid to include additional arbitrary attributes in the Data Frame, however this may lead to unexpected behavior as this specification evolves and new attributes are added that may conflict. We recommend additional attributes are included in the `meta` attribute. If a new attribute is included in the root Data Frame object and it forms part of a proposed specification to include it as a documented supported attribute, we recommend prefixing the field with `X-` as a convention to indicate that this `X-` prefix may later be dropped.
The following algorithms used to generate Deltas have reserved codes:

- `md` - Myers Diff
- `jp` - JSON Patch
It is invalid to use these algorithm codes unless the target Delta Algorithm is being used.
If an algorithm other than the aforementioned well-known algorithms is being used, then the code must start with the characters `X-`. This ensures that as new reserved codes are added, there will never be a conflict with existing implementations.
Please note the following conventions in the API definition:
- Methods have parentheses, such as `method()`
- Attributes never have parentheses and always include a return type, such as `attribute: String`
- Where a method relies on underlying IO, and in some languages may therefore idiomatically use a callback style, Promise or Channel etc., the return type appears as `=> io ReturnType`, such as `loadFromUrl(url: String) => io Response`. When the method does not rely on IO, it is simply presented with `-> ReturnType`, such as `sum(a: Int, b: Int) -> Int`
- The interfaces will typically be defined within a namespace such as `SDSP`, however the way to address this varies immensely between languages, so it is left to the developer implementing the library to do this in an idiomatic way to avoid naming conflicts
- Constructors are not defined in this specification for the `struct` type, as it is assumed the caller can construct the `struct` in an idiomatic way for the language
- Where constructors are not defined for the `interface` type, this is mostly intentional as the implementation of that `interface` will be transport, platform or language dependent, and as such should be defined by the developer
- The `Int` type is generally considered to be a 64-bit `Int`, but will in all but very few cases work with a 32-bit `Int`. On 32-bit platforms, it is up to the developer to consider the upper limit of the `serial` attribute
- The `URI` type may not be supported in all platforms. If not, a `String` type is compatible
```
struct DataFrame:
  uid: String
  serial: Int
  ver: Int
  delta: Object
  alg: String
  data: Object
  dataUri: URI
  historyUri: URI
  checksum: { val: String, type: String }
  meta: Object
```
```
interface Library:
  constructor()
  publish
```
```
interface ObjectStorage:
  store(object: Object) => io ObjectStorageLocation

struct ObjectStorageLocation:
  uri: URI | String // URI preferred if supported in the language
  uid: String
```
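To illustrate the plug-in shape, here is a toy in-memory Object Storage Plug-in in TypeScript, assuming the `=> io` convention maps to a Promise; the class name and URL scheme are hypothetical, not part of this specification:

```typescript
import { randomUUID } from "crypto"; // Node.js built-in

interface ObjectStorageLocation {
  uri: string; // a URI type is preferred where the language supports one
  uid: string;
}

// Toy in-memory plug-in; a real plug-in would persist to S3, GCS or similar
class InMemoryObjectStorage {
  private objects = new Map<string, unknown>();

  async store(object: unknown): Promise<ObjectStorageLocation> {
    const uid = randomUUID(); // Storage-assigned GUID guarantees uniqueness
    this.objects.set(uid, object);
    // Hypothetical URL scheme; a real plug-in returns an Internet-accessible URI
    return { uri: `https://storage.example.com/objects/${uid}`, uid };
  }
}
```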
- Duplicates should be avoided. Whilst a serial numbering system means it is not overly complex to remove duplicates, Transports that ensure there are no duplicates help to maintain the stated goal of "minimal complexity"
- Ordering is important. If messages arrive out of order, the subscriber can attempt to re-order them, but this adds complexity and the risk that deltas cannot be applied because the subscriber waits indefinitely for an earlier delta. Transports that provide reliable ordering are important in helping to maintain the stated goal of "minimal complexity"
- The Transport should offer a Transport History feature allowing previously published Data Frames to be retrieved. If this is not possible, it is the responsibility of the Publisher or any middleware republishing the Data Frames to provide a `historyUri` attribute to retrieve all Data Frames prior to the recently received one, i.e. all Data Frames from serial `0` to the current `serial` less `1`.
- To improve the portability of each `DataFrame`, it is recommended that a `historyUri` URI attribute is included in each `DataFrame` (this can be appended by the Publisher or potentially by a middleware responsible for publishing) that:
  - Is a URL accessible over the Internet
  - Does not require authentication, as the URL itself should be secure (i.e. contain a cryptographically secure embedded token or GUID)
  - As far as practical is a short URL, to reduce the overhead of including a `historyUri` attribute in each frame
  - Returns an array of all recent `DataFrame` objects until `ver` `0`
  - When possible, includes the initial published Object instead of a `dataUri`, to minimize the number of requests needed for a Subscriber to construct the Object. (An example response is sketched after this list.)
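For illustration, a `historyUri` endpoint satisfying these recommendations might return something like the following (hypothetical values, newest frame first):

```typescript
// Hypothetical historyUri response: all frames back to the last complete Object,
// with the initial Object inlined to avoid a further dataUri request
const historyResponse = [
  {
    uid: "6b0d4f6e-1c2a-4e52-9f1b-3d8c2a7e9b10",
    serial: 1,
    ver: 1,
    alg: "jp",
    delta: [{ op: "replace", path: "/bid", value: 170.14 }]
  },
  {
    uid: "6b0d4f6e-1c2a-4e52-9f1b-3d8c2a7e9b10",
    serial: 0, // ver omitted: this frame contains the complete Object
    data: { symbol: "AAPL", bid: 170.12, ask: 170.15 }
  }
];
```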
- The `DataFrame` data structure contains attribute names that could arguably be considered verbose, given every frame, when using JSON encoding, includes the entire attribute name values. For example, an empty delta frame may contain the following JSON: `{"uid":"UID","serial":1,"ver":1,"delta":"{}","alg":"jp"}`. 20 of its 56 bytes are used for the attribute names alone (although this is a contrived example). If frame size is important, we recommend using a Transport that provides efficient encoding for pre-defined data structures, such as Google's ProtoBuf or Apache Avro
- The `data` attribute of the `DataFrame` supports any type of Object or language primitives such as a `String` or binary blob. We consider the encoding and decoding of these objects on the transport outside of the scope of this specification, as it is assumed the transport itself will provide suitable encoding and decoding mechanisms. For example, if the `data` attribute is a binary object and the transport uses JSON encoding, then it is expected the publishing client would encode the `DataFrame` using, for example, Base64 encoding, and the subscribing client could detect the encoding and decode the `DataFrame` before providing it to the SDSP library.
- The Transport History must guarantee that a request is always up to date with the most recently published Data Frame. For example, if a Data Frame with serial `5` is published, any Subscriber receiving that Data Frame must be able to issue a Transport History request immediately and expect to receive serial `5` and lower.
- TBD
- Due to the choice of using an increment-only serial number system for each Data Frame, there is currently no means in the protocol for objects to diverge or fork.
- If two publishers simultaneously publish an update, they will generate conflicting serial numbers. Managing these conflicts due to concurrency issues is outside of the scope of this specification. It is required that the Publisher ensures only one update is applied at a time, which can easily be achieved with locks or a single thread publishing updates.
- If a Subscriber receives a faulty Data Frame according to the protocol, or is unable to apply a delta, there is no mechanism for it to recover from this situation. Given the goals of this protocol are to be decoupled, we believe that the requirement to trigger a re-sync or re-publish of the entire object is outside of the scope of this protocol specification.
- Encoding and decoding issues are outside of the scope of this specification. It is assumed the Transport is responsible for all encoding and decoding issues and the Library consumes and publishes decoded objects.
- Encryption is not considered in this specification. It is assumed the Transport is responsible for encryption and decryption.
- The Object Storage Location should ideally be accessible freely over the Internet. If security is a concern, the URI could contain a cryptographically secure GUID or embedded token.
- The Object Storage Location and History Service should use a TTL that reasonably exceeds the maximum expected duration between a Data Frame being published and consumed.
- When an Object is stored in the Object Storage Location, one may naturally assume a hash of the underlying binary/object could be used as the UID of that object. This is fraught with problems and should be avoided. Instead we strongly recommend a GUID is assigned, which guarantees uniqueness. Assume for example you have two separate Publishers starting with an empty JSON object such as `{}`. If a SHA-256 hash (checksum) is generated using the string representation, the UID would be `44136FA355B3678A1146AD16F7E8649E94FB4FC21FE77E8310C060F61CAAFF8A`. If the two publishers then published the initial Data Frame, both would have serial number 0 and UID `44136FA355B3678A1146AD16F7E8649E94FB4FC21FE77E8310C060F61CAAFF8A`. Yet if both publishers then modify the object and publish diverging deltas, both new Data Frames would have the serial number 1 and the same UID. Any subscriber consuming both sets of Data Frames would now have conflicting deltas to apply.
- Whilst the protocol supports multiple Objects being published concurrently over the same transport, in many cases this may add unnecessary complexity. For example, if a pub/sub channel is being used to publish updates for multiple different objects, subscribers for one object will have to consume Data Frames for objects they will simply discard. Additionally, it is plausible that users may mistakenly assume only one object exists per pub/sub channel and thus rely on the last received Data Frame to provide insight into the Object associated with a channel. Assuming a Transport History request is then made for the last X Data Frames, where X equates to the `serial`, it is plausible that other Data Frame updates will be returned in that request, resulting in insufficient Data Frames to construct the Object. All of these issues are surmountable, but for the goal of "minimal complexity", we recommend only publishing one object concurrently on a single Transport.
This project is currently sponsored and maintained by Ably, the realtime data distribution platform. The intention is not for any organization to own this open protocol standard, but instead for it to be jointly developed and maintained by the community and other leaders in the realtime space. The spec and website are hosted in a GitHub account that does not belong to any commercial organization.
There are two ways to get involved:
- As a contributor: simply raise an issue or submit a pull request. We'll do our best to incorporate your feedback and contributions.
- As a sponsor: sponsors are not commercial sponsors, but simply people or organizations wishing to participate and commit to ongoing development and support of this spec. Until we have a more formal inbox, please just contact the current sponsor Ably about becoming a sponsor. Everyone who can contribute constructively is welcome.
- Swarm Replicated Object Notation (RON) - an interesting data serialization format for distributed synchronization. The project is still in its infancy, but could be valuable in time as an efficient Delta Algorithm
Licensed under the Apache License, Version 2.0. Refer to LICENSE for the license terms.