QNetd protocol

Jan Friesse edited this page Oct 26, 2020 · 3 revisions

The QNetd protocol is a simple binary protocol. It is a TCP-based stream of messages. Each message has the following format:

 0                   1                   2                   3                   4
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Message type          |                       Length of message                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                          Options ...                                          |
|                                               .                                               |
|                                               .                                               |
|                                               .                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Textually:

  • 2 bytes - Type of message
  • 4 bytes - Length of message data
  • Length of message data - Message data consisting of TLVs
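
The framing above can be sketched in a few lines of Python. This is a minimal sketch, not the qnetd implementation: the names `pack_message` and `unpack_message` are hypothetical, and big-endian (network) byte order for the header fields is an assumption, since this page states the byte order explicitly only for the sequential number option.

```python
import struct

# Assumption: header fields use network byte order (big-endian).
HEADER = struct.Struct("!HI")  # 2-byte message type, 4-byte data length


def pack_message(msg_type: int, data: bytes = b"") -> bytes:
    """Prefix already-encoded TLV data with the 6-byte message header."""
    return HEADER.pack(msg_type, len(data)) + data


def unpack_message(buf: bytes):
    """Return (msg_type, data, rest), or None if buf holds no full message yet."""
    if len(buf) < HEADER.size:
        return None
    msg_type, length = HEADER.unpack_from(buf)
    end = HEADER.size + length
    if len(buf) < end:
        return None
    return msg_type, buf[HEADER.size:end], buf[end:]
```

Because the transport is a TCP stream, a reader would typically append received bytes to a buffer and call `unpack_message` repeatedly until it returns `None`.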

The following message types are defined:

  • PreInit, type 0

    Initiated by client. Minimal init performed before the full Init (3), which is intentionally executed only after the TLS session is started. Must contain cluster name (1). Can contain sequential number (0) option.

  • PreInit Reply, type 1

    Reply to PreInit message (0). Must contain TLS Supported (2) option and TLS client cert required (3) option. When received by server, server error message (5) is sent.

  • StartTLS, type 2

    Init TLS (initiated by client). Right after this message, both client and server must start the TLS handshake. After the handshake is finished, normal processing continues.

  • Init, type 3

    Real init after TLS is started. Initiated by client. Must contain node id (9) option. Should contain list of messages (4) and options (5) supported by client. Must contain decision algorithm (11), heartbeat timeout (12), ring id (13) and tie breaker (21) options. Server must reply with Init reply message (4).

  • Init reply, type 4

    Reply to Init message (3). If client sent list of messages (4) or options (5), returned reply contains list of messages (4) and/or options (5) supported by server. Must contain maximum size of message processed by server (7) and maximum size of message sent by server (8). Must contain list of algorithms (10) supported by server. Must contain reply error code (6) option. When received by server, server error message (5) is sent.

  • Server error, type 5

    Server reply to malformed client message. Must contain reply error code (6) option.

  • Set option, type 6

    Set connection option(s). Currently supported options are heartbeat timeout (12) and keep active partition tie breaker (23). Server must reply with Set option reply message (7).

  • Set option reply, type 7

    Returns all supported connection options requested in the Set option (6) message. The set of supported options is the same as for the Set option (6) message.

  • Echo request, type 8

    Used mainly for heartbeat. Can contain every option and all options are returned back unchanged in echo reply message (9).

  • Echo reply, type 9

    Reply to echo request message (8). All options sent in Echo request (8) are returned unchanged.

  • Node list, type 10

    Inform server about node list stored in configuration file or membership layer. It must contain sequential number (0) and node list type (18). Currently 4 node list types are used:

    • 0 - Sent as initial configuration from configuration file - contains current config version (14). Node info (17) has filled node id (9) option and may have filled data center id (15) option.
    • 1 - Sent when configuration is changed from configuration file - contains current config version (14). Node info (17) has filled node id (9) option and may have filled data center id (15) options.
    • 2 - Sent when node is added/removed. This is the only type where client should WAIT for answer and vote from server before it continues processing (in corosync this is the sync phase). Must contain ring id (13) option. Node info (17) has filled only node id (9) option.
    • 3 - Sent when node is added/removed or quorate state is changed. This is sent after client decided quorate state and it's primarily informative (server returns VOTE_NO_CHANGE for vote (19) option), but if server decides to change quorate state, any other value of vote (19) option is allowed. It must contain quorate (20) option. Node info (17) has filled node id (9) and node state (16) options.

    It must contain at least one node info (17) option. Node info (17) option can be present multiple times, one entry for each node.

  • Node list reply, type 11

    Reply to node list (10). Must contain sequential number (0), vote (19), node list type (18) and ring id (13) options. Ring id is either copied from the Node list (10) message if Node list type (18) was set to 2, or the stored value is used.

  • Ask for vote, type 12

    Sent by client. Must contain sequential number (0) option.

  • Ask for vote reply, type 13

    Reply to Ask for vote (12). Must contain sequential number (0), vote (19) and ring id (13) options.

  • Vote info, type 14

    Sent by server to inform client about changed vote. Must contain sequential number (0), vote (19) and ring id (13) options.

  • Vote info reply, type 15

    Sent by client as answer to Vote info (14) message. Must contain sequential number (0) option.

  • Heuristics changed, type 16

    Sent by client to inform server about change in heuristics. Must contain sequential number (0), and heuristics (22).

  • Heuristics changed reply, type 17

    Reply to Heuristics changed (16). Must contain sequential number (0), copy of heuristics (22), vote (19) and ring id (13) options.
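
As a compact summary of the list above, the message types and their request/reply pairing can be written down as a lookup table. This is only a sketch: the Python identifiers (`MessageType`, `EXPECTED_REPLY`, `expected_reply`) are hypothetical names chosen for illustration, while the numeric values are the ones listed on this page.

```python
from enum import IntEnum


class MessageType(IntEnum):
    PREINIT = 0
    PREINIT_REPLY = 1
    STARTTLS = 2
    INIT = 3
    INIT_REPLY = 4
    SERVER_ERROR = 5
    SET_OPTION = 6
    SET_OPTION_REPLY = 7
    ECHO_REQUEST = 8
    ECHO_REPLY = 9
    NODE_LIST = 10
    NODE_LIST_REPLY = 11
    ASK_FOR_VOTE = 12
    ASK_FOR_VOTE_REPLY = 13
    VOTE_INFO = 14
    VOTE_INFO_REPLY = 15
    HEURISTICS_CHANGED = 16
    HEURISTICS_CHANGED_REPLY = 17


# Request -> reply pairs as described above. StartTLS has no reply message
# (it is followed directly by the TLS handshake), and Server error is an
# error answer rather than the regular reply to any one request.
EXPECTED_REPLY = {
    MessageType.PREINIT: MessageType.PREINIT_REPLY,
    MessageType.INIT: MessageType.INIT_REPLY,
    MessageType.SET_OPTION: MessageType.SET_OPTION_REPLY,
    MessageType.ECHO_REQUEST: MessageType.ECHO_REPLY,
    MessageType.NODE_LIST: MessageType.NODE_LIST_REPLY,
    MessageType.ASK_FOR_VOTE: MessageType.ASK_FOR_VOTE_REPLY,
    MessageType.VOTE_INFO: MessageType.VOTE_INFO_REPLY,
    MessageType.HEURISTICS_CHANGED: MessageType.HEURISTICS_CHANGED_REPLY,
}


def expected_reply(request: MessageType):
    """Return the regular reply type for a request, or None if there is none."""
    return EXPECTED_REPLY.get(request)
```

Note that Vote info (14) is the one pair in this table where the server sends the request and the client sends the reply.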

Each option is stored in TLV format:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             Type              |           Length              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Value                             |
|                               .                               |
|                               .                               |
|                               .                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Textually:

  • 2 bytes - Type of option
  • 2 bytes - Length of value field
  • Length of value field (depends on Type of option) - Value field
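
The TLV layout above could be encoded and walked roughly as follows. This is a sketch under assumptions: `pack_tlv` and `iter_tlvs` are hypothetical helper names, and big-endian byte order for the type and length fields is assumed rather than stated on this page.

```python
import struct

# Assumption: option type and length are big-endian (network order).
TLV_HEADER = struct.Struct("!HH")  # 2-byte option type, 2-byte value length


def pack_tlv(opt_type: int, value: bytes) -> bytes:
    """Encode one option as type, length and value."""
    if len(value) > 0xFFFF:
        raise ValueError("value does not fit the 2-byte length field")
    return TLV_HEADER.pack(opt_type, len(value)) + value


def iter_tlvs(data: bytes):
    """Yield (opt_type, value) pairs from the data part of a message."""
    offset = 0
    while offset < len(data):
        if len(data) - offset < TLV_HEADER.size:
            raise ValueError("truncated TLV header")
        opt_type, length = TLV_HEADER.unpack_from(data, offset)
        offset += TLV_HEADER.size
        if len(data) - offset < length:
            raise ValueError("truncated TLV value")
        yield opt_type, data[offset:offset + length]
        offset += length
```

For example, `pack_tlv(0, struct.pack("!I", 7))` would produce a sequential number option with value 7, and `list(iter_tlvs(data))` returns the options of a message in the order they were sent.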

The following options are defined:

  • 0 - Sequential number. 4-byte number in network byte order. It is always sent back unchanged in the reply.
  • 1 - Cluster name, variable length - Name of cluster (trailing zero byte is not included)
  • 2 - TLS Supported, 1 byte, 0 = TLS not supported by server, 1 = TLS Supported by server, 2 = TLS required by server
  • 3 - TLS Client cert required, 1 byte, 0 = client cert not required, 1 = client cert required
  • 4 - Supported messages - variable length - list of supported messages
  • 5 - Supported options - variable length - list of supported options
  • 6 - Reply error code - 2 bytes
  • 7 - Server maximum request message size - 4 bytes - Maximum size of message accepted by server
  • 8 - Server maximum reply message size - 4 bytes - Maximum size of message sent by server back to client
  • 9 - Node id - 4 bytes
  • 10 - Supported algorithms - variable length - List (array of 2 bytes values) of decision algorithms supported by server
  • 11 - Algorithm - Set algorithm (see Decision algorithms section) to use. Default is algorithm 0.
  • 12 - Heartbeat timeout - 4 bytes - Interval in milliseconds within which client must send some message (usually echo request) before client is considered dead. Valid values are 0 (heartbeat disabled, the default) or a value in the range <1000 - 200000>.
  • 13 - Ring id - 12 bytes - Current ring id of corosync. Consists of 4 bytes long leader nodeid and 8 bytes long seq number.
  • 14 - Config version - 8 bytes - Current config file version
  • 15 - Data center id - 4 bytes
  • 16 - Node state - 1 byte - 1 = member, 2 = dead, 3 = leaving
  • 17 - Node info - variable length - compound of node id (9), data center id (15) and node state (16). Only node id (9) is required. All items are in standard TLV format.
  • 18 - Node list type - 1 byte - 0 = initial configuration (config file), 1 = changed configuration (config file), 2 = changed node list, 3 = quorum list (more informative node list with quorate field). These values match the node list types described for the Node list (10) message.
  • 19 - Vote - 1 byte - 1 (VOTE_ACK) = node has a vote, 2 (VOTE_NACK) = node doesn't have a vote, 3 (VOTE_ASK_LATER) = daemon hasn't decided yet and client should ask later, 4 (VOTE_WAIT_FOR_REPLY) = daemon hasn't decided yet, but it will inform client, so client should wait for reply, 5 (VOTE_NO_CHANGE) = used mainly with informative node list types (18) like 3.
  • 20 - Quorate - 1 byte - 0 = Partition is not quorate, 1 = partition is quorate
  • 21 - Tie breaker - 5 bytes - Compound of 1 byte type and 4 bytes node id. Type can be 1 = lowest, 2 = highest and 3 = specific node id. 4 bytes of node id is used only for type 3.
  • 22 - Heuristics - 1 byte - 1 (pass) = heuristics passed, 2 (fail) = heuristics failed
  • 23 - Keep active partition tie breaker - 1 byte - 0 = Do not use keep active partition tie breaker (disabled), 1 = Do use keep active partition tie breaker (enabled)
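
Some of the options above have compound values. The sketch below shows how ring id (13), tie breaker (21) and node info (17) might be built from the field sizes given in the list. The helper names are hypothetical, and big-endian byte order for every multi-byte field is an assumption.

```python
import struct


def tlv(opt_type: int, value: bytes) -> bytes:
    # 2-byte type, 2-byte length, then the value (big-endian is assumed).
    return struct.pack("!HH", opt_type, len(value)) + value


def ring_id(leader_node_id: int, seq: int) -> bytes:
    """Option 13: 4-byte leader node id followed by 8-byte seq number."""
    return tlv(13, struct.pack("!IQ", leader_node_id, seq))


def tie_breaker(mode: int, node_id: int = 0) -> bytes:
    """Option 21: 1-byte type plus 4-byte node id (used only for type 3)."""
    if mode not in (1, 2, 3):
        raise ValueError("tie breaker type must be 1, 2 or 3")
    return tlv(21, struct.pack("!BI", mode, node_id))


def node_info(node_id: int, data_center_id=None, node_state=None) -> bytes:
    """Option 17: nested TLVs; only node id (9) is required."""
    inner = tlv(9, struct.pack("!I", node_id))
    if data_center_id is not None:
        inner += tlv(15, struct.pack("!I", data_center_id))
    if node_state is not None:
        inner += tlv(16, struct.pack("!B", node_state))
    return tlv(17, inner)
```

The `!` prefix in the `struct` format strings disables alignment padding, so `"!IQ"` packs to exactly 12 bytes and `"!BI"` to exactly 5 bytes, matching the sizes listed for these options.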

Reply error codes:

  • 0 - No error
  • 1 - Unsupported needed message - when client sent supported messages option and it did not contain required support for a message. Not used for now.
  • 2 - Unsupported needed option - when client sent supported options option and it did not contain required support for an option. Not used for now.
  • 3 - TLS is required.
  • 4 - Unsupported message
  • 5 - Message too long
  • 6 - Preinit required
  • 7 - Message doesn't contain required option
  • 8 - Unexpected message. Message was not expected in the given context (flow). Example is when server receives preinit reply from client.
  • 9 - Can't decode message. Server wasn't able to decode message. Message was either malformed or server wasn't able to allocate enough memory.
  • 10 - Server internal error. Sent in various situations, like when server wasn't able to allocate memory, ...
  • 11 - Init required
  • 12 - Unsupported decision algorithm. Sent by server if client requested unsupported decision algorithm
  • 13 - Invalid heartbeat interval. Sent by server if client requested invalid heartbeat interval (too big or too small)
  • 14 - Unsupported decision algorithm message. Sent by server if client requested message which is not supported by selected decision algorithm.
  • 15 - Tie-breaker differs from other nodes - Sent by server if client in its init message requested tie-breaker which is different from rest of cluster where client wants to connect.
  • 16 - Algorithm differs from other nodes - Sent by server if client in its init message requested algorithm which is different from rest of cluster where client wants to connect.
  • 17 - Duplicate node id - Sent by server if client in its init message sent node id which already exists in cluster. This can also be the result of server not having found out yet that the "old" connection was closed. It's a good idea to try to reconnect later.
  • 18 - Invalid config node list - Configuration node list sent by client is not valid. It's either empty or sender is not included.
  • 19 - Invalid membership node list - Membership node list sent by client is not valid. It's either empty or sender is not included.
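
A client could map the 2-byte value of the reply error code (6) option to a symbolic name along these lines. The enum member names and `decode_reply_error` are hypothetical names for illustration, and big-endian byte order for the value is an assumption. Only the numeric codes come from the list above.

```python
import struct
from enum import IntEnum


class ReplyError(IntEnum):
    NO_ERROR = 0
    UNSUPPORTED_NEEDED_MESSAGE = 1
    UNSUPPORTED_NEEDED_OPTION = 2
    TLS_REQUIRED = 3
    UNSUPPORTED_MESSAGE = 4
    MESSAGE_TOO_LONG = 5
    PREINIT_REQUIRED = 6
    MISSING_REQUIRED_OPTION = 7
    UNEXPECTED_MESSAGE = 8
    CANT_DECODE_MESSAGE = 9
    INTERNAL_ERROR = 10
    INIT_REQUIRED = 11
    UNSUPPORTED_DECISION_ALGORITHM = 12
    INVALID_HEARTBEAT_INTERVAL = 13
    UNSUPPORTED_DECISION_ALGORITHM_MESSAGE = 14
    TIE_BREAKER_DIFFERS = 15
    ALGORITHM_DIFFERS = 16
    DUPLICATE_NODE_ID = 17
    INVALID_CONFIG_NODE_LIST = 18
    INVALID_MEMBERSHIP_NODE_LIST = 19


def decode_reply_error(value: bytes):
    """Decode the 2-byte value of option 6; unknown codes stay plain ints."""
    if len(value) != 2:
        raise ValueError("reply error code must be 2 bytes")
    (code,) = struct.unpack("!H", value)
    try:
        return ReplyError(code)
    except ValueError:
        return code
```

Returning unknown codes as plain integers, instead of raising, keeps such a client usable against a server that defines codes not listed here.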

Decision algorithms:

  • 0 - Test - Vote is given by every client who asks for it
  • 1 - FFSplit - 50:50 Split
  • 2 - 2nodeLMS - Last man standing algorithm for 2 node cluster use case
  • 3 - LMS - Last man standing