Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interoperability Error: keyhash is MD5 and can't be converted to key value #1927

Closed
trittsv opened this issue Jan 31, 2024 · 9 comments
Closed

Comments

@trittsv
Copy link
Contributor

trittsv commented Jan 31, 2024

Hey, i receive the following error using cyclonedds-cxx 0.10.4:

1706682542.741748 [10]       recv: data(application, vendor 1.1): 10149c8:3672d04f:d953daf8:80000002 #1: deserialization Navigation_Reference__Position/P_Navigation_Reference_PSM::C_Position failed (for reasons unknown)
1706682552.750641 [10]       recv: data(application, vendor 1.1): 10149c8:3672d04f:d953daf8:80000002 #2: deserialization Navigation_Reference__Position/P_Navigation_Reference_PSM::C_Position failed (keyhash is MD5 and can't be converted to key value)
1706682555.722553 [10]       recv: data(application, vendor 1.1): 10149c8:3672d04f:d953daf8:80000002 #3: deserialization Navigation_Reference__Position/P_Navigation_Reference_PSM::C_Position failed (keyhash is MD5 and can't be converted to key value)

What have gone wrong here and how to fix?
I try to receive a message from rti connext.

Using the following idl file:

   @mutable
   struct C_Position
   {
      @hashid @key P_GVA_Common::T_IdentifierType A_sourceID;
      @hashid P_GVA_Common::T_MetadataType A_metadata;
      @hashid P_LDM_Common::T_Coordinate3DType A_currentPosition;
      @hashid P_GVA_Common::T_IdentifierType A_positionUncertainty_sourceID;
      @hashid P_GVA_Common::T_IdentifierType A_heightUncertainty_sourceID;
   };
@trittsv trittsv changed the title Interoperability: keyhash is MD5 and can't be converted to key value Interoperability Error: keyhash is MD5 and can't be converted to key value Jan 31, 2024
@eboasson
Copy link
Contributor

The first line is indicative of a difference in interpretation of the serialisation format. It may be a bug we have fixed since the 0.10.x branch, but @mutable doesn't get used all that much and so it may also be that Cyclone is self-consistent but different from RTI. Being different means at least one of the two is wrong, but it can just as easily be RTI as Cyclone ... See, e.g., eclipse-cyclonedds/cyclonedds-cxx#218. The only explanation I can think of for that one is that never even tested anything ... But that was during the COVID years, so 🤷‍♀️

Thus, if this isn't master, it'd be really good to try that one first before diving into the serialised representation trying to figure it out.

The other two are (in all cases I am aware of, anyway) caused by RTI's cavalierly sending data without a payload. The DDSI spec does need to allow for this combination because it is needed for ending a coherent set in the absence of new data but nowhere (as far as I know) does it say that the payload can be omitted for a regular sample. To me, it also stands to reason that if one sends a DATA submessage to communicate a change of value or instance state, one should include the value or key ...

What they send instead is the "key hash" (which is optional to begin with1). That key hash is a funny thing. It has special serialisation rules and if the upper bound of the serialised key length exceeds 16 bytes, the key hash is the MD5 of that special serialisation. Because, well, surely x ≠ y ⇒ MD5(x) ≠ MD5(y) 2

There have been various debates among those involved in Cyclone on what the correct key hash serialisation is various cases in XTypes because the spec is so unclear. These differences even affect the length of the serialised key, and so in some cases these two interpretations also differ in whether the key hash is the MD5 of the serialised key or the (padded) serialised key itself. If it's caused by that, then it is still a problem of a different interpretation of the intended serialised representation, of course.

Fortunately, I have heard of a setting on the DataWriter in Connext that forces it to send serialised keys instead of just this key hash. You might want to look for serialize_key_with_dispose ... YMMV, they may have changed things in the last few years. If with master the first error isn't there, then they might solve your problem.

Footnotes

  1. That is, except in a specific configuration of DDS Security, where the MD5 of the key value is to be sent to allow a subscriber to maintain a RHC without being able to decrypt the data — it needs the keys for this, and the MD5 of it supposedly protects the key value ... obscurity through obscurity, anyone?

  2. I shudder to think of what happens when you do https://github.com/eclipse-cyclonedds/cyclonedds/blob/1c9bc19f94c6dd6c62b573fed335a0663cb84d9f/src/core/ddsc/tests/instance_handle.c#L135 on an implementation that trusts the key hash to distinguish between instances ... but I digress.

@trittsv
Copy link
Contributor Author

trittsv commented Feb 1, 2024

Thank you for your answer @eboasson. Yes i was using Version 0.10.4.

It was challenging to build the master 2024-01-02 cyclonedds because of some compile errors:

With some manual quick and dirty patches i was able to compile master and run it.
It turns out that with the master the exact same error messages appear.

I also have started a support case on the rti side to ask for the serialize_key_with_dispose option you mentioned.
They typically need one night to answer, so i will be able to test this in the following days.

When errors appear between two different DDS-Vendors its so hard to decide which Vendor to ask, "who is to blame" 🙈.

In the mean time @eboasson shall i provide Wireshark data of that message that i want to send from rti to cyclone?

@eboasson
Copy link
Contributor

eboasson commented Feb 1, 2024

In the mean time @eboasson shall i provide Wireshark data of that message that i want to send from rti to cyclone?

That + the IDL would be a good first step. It would also be useful to also have the serialised data that Cyclone generates for the same sample: that makes it possible to play a game of "spot the differences".

What would of course be ideal is a simple reproducer that writes that one sample that doesn't get decoded, that would automatically give us the IDL and the Cyclone output 🙂 It would also make it easier to inspect what it is actually doing. But that may be a bit much to ask for ...

The data type definition looks like a standard ("GVA" is suggestive), but sometimes contents can also be sensitive. If some of this is data that can't be shared publicly, contact me privately to see what we can do.

@trittsv
Copy link
Contributor Author

trittsv commented Feb 2, 2024

@eboasson i nailed it down. It turns out that there is a problem with optional-types.

It can be reproduced with the following IDL.

module My_Test_Module
{
    struct Optional_Test_Data
    {
        @optional long optId;
    };
};

As long as i dont put any data into optId then it works.
In the moment i put any data into the optional no data is received and the error log above shows up.
1706878364.492494 [10] recv: data(application, vendor 1.1): 101ecd4:54b26e0a:7f7df7ca:80000003 #1: deserialization Navigation_Reference__Position/My_Test_Module::Optional_Test_Data failed (for reasons unknown)

My_Test_Module::Optional_Test_Data optData;
optData.optId({4}); // it works without this line
writer.write(optData);

I attached 3 wireshark logs and the source code.

  • one with the problematic communication wireshark-log.pcapng
  • one cyclone only wireshark-cyclone-only.pcapng
  • one rti only wireshark-rti-only.pcapng

@eboasson could you please have a look into the data? 🙏🏻

2024-02-02-cyclone-rti-crash.zip

@eboasson
Copy link
Contributor

eboasson commented Feb 2, 2024

@trittsv I think the captures used the wrong network interface because the only packets are the participant discovery ones. (In the case of RTI-only it is also possible that it needs to be told to avoid shared memory and use UDP instead). Whenever this happens to me, invariably I selected the ethernet in Wireshark but the kernel routed the traffic over loopback, and selecting loopback then solves the problem.

There should be a regular "DATA", not just "DATA(p)" in there.

@trittsv
Copy link
Contributor Author

trittsv commented Feb 2, 2024

@eboasson sorry my mistake, here is hopefully the correct one (?):
rti-to-cyclone-captures.zip

At least in rti-to-cyclone there are now these 46 4.159476 192.168.56.1 192.168.56.1 RTPS 476 INFO_DST, INFO_TS, DATA(r) -> Navigation_Reference__Position, HEARTBEAT

@eboasson
Copy link
Contributor

eboasson commented Feb 2, 2024

There's a lot more now, definitely, and the Cyclone-to-Cyclone capture definitely has it:

Frame 18: 132 bytes on wire (1056 bits), 132 bytes captured (1056 bits) on interface \Device\NPF_Loopback, id 0
Null/Loopback
Internet Protocol Version 4, Src: 192.168.56.1, Dst: 192.168.56.1
User Datagram Protocol, Src Port: 50730, Dst Port: 50727
Real-Time Publish-Subscribe Wire Protocol
    Magic: RTPS
    Protocol version: 2.1
    vendorId: 01.16 (Eclipse Foundation - Cyclone DDS)
    guidPrefix: 011073565e210d90e87674ea
    Default port mapping (Based on calculated domainId. Might not be accurate): domainId=173, participantIdx=33, nature=UNICAST_USERTRAFFIC
    submessageId: INFO_TS (0x09)
        Flags: 0x01, Endianness
        octetsToNextHeader: 8
        Timestamp: Feb  2, 2024 15:57:13.809944300 UTC
    submessageId: DATA (0x15)
        Flags: 0x05, Data present, Endianness
        octetsToNextHeader: 32
        0000 0000 0000 0000 = Extra flags: 0x0000
        Octets to inline QoS: 16
        readerEntityId: ENTITYID_UNKNOWN (0x00000000)
        writerEntityId: 0x00000203 (Application-defined writer (no key): 0x000002)
        [Topic Information (from Discovery)]
        writerSeqNumber: 1
        serializedData
            encapsulation kind: CDR2_LE (0x0007)
            encapsulation options: 0x0000
            serializedData: 0100000004000000
    submessageId: HEARTBEAT (0x07)
        Flags: 0x01, Endianness
        octetsToNextHeader: 28
        readerEntityId: ENTITYID_UNKNOWN (0x00000000)
        writerEntityId: 0x00000203 (Application-defined writer (no key): 0x000002)
        [Topic Information (from Discovery)]
        firstAvailableSeqNumber: 1
        lastSeqNumber: 1
        count: 1

but in the RTI-to-Cyclone, I do see all the discovery but not the application data. If you get the "deserialization failed", then the data must have been sent (otherwise I would've assumed a type assignability check problem), but then ... where is it?

(You could try the Tracing/PacketCaptureFile option in Cyclone: I haven't tried it in ages, but it used to work — it writes a capture file that Wireshark can read.

@trittsv
Copy link
Contributor Author

trittsv commented Feb 2, 2024

@eboasson i did some more recordings with the cyclone wireshark feature, i also captured DATA from rti to rti via to PCs.
See file here: 2024-02-02-complete-capture.zip

rti to rti (works)

serializedData
    encapsulation kind: D_CDR2_LE (0x0009)
    encapsulation options: 0x0000
    serializedData: 080000000100000004000000

cyclone-to-rti: (works - data arrived in rti)

serializedData
    encapsulation kind: CDR2_LE (0x0007)
    encapsulation options: 0x0000
    serializedData: 0100000004000000

rti-to-cyclone: (fails)
No DATA field in wireshark, even when i start the two execs on two different PCs.
Thats the only thing:
image

How do i have to understand the spec https://www.omg.org/spec/DDS-XTypes/1.2/PDF 7.6.2.1.2?
D_CDR2_LE should only be used in encoding version 2 and appendable?
But the IDL:

module My_Test_Module
{
    struct Optional_Test_Data
    {
        @optional long optId;
    };
};

does not have appendable, or is assumed that a optional makes the entire thing automatically appendable?
May i am completely wrong because i read the spec for the first time.

@eboasson
Copy link
Contributor

eboasson commented Feb 3, 2024

I figured it out, it is RTI's "fault" but not in the way I had expected, and I also expect them to they disagree with this blame assignment.

You're correct about the final vs appendable detail.

  • RTI considers the type appendable and correctly uses D_XCDR2_LE.
  • Cyclone considers it final and that leads to XCDR2_LE.

The 08000000 at the beginning of the payload is the size of the struct that gets prepended because of the D_XCDR2_LE. I checked, annotating it with @appendable (or equivalently, using Cyclone's idlc option -x appendable to change its default) makes Cyclone switch to D_XCDR2_LE as well. So that's the easy bit.

That leaves us with three questions:

  1. Why does Cyclone deviate from the spec and consider it final instead of appendable?
  2. Why does it trigger a deserialization failure when Cyclone obviously can handle D_XCDR2_LE?
  3. Why do the reader and writer match, if the types have different extensibility?

Re 1: that's been described elsewhere in comments quite a few times. The reasoning is very simple: if you take a system that predates XTypes and you build half of it with a DDS implementation that doesn't support XTypes and the other half with one that does (e.g., old Cyclone and new Cyclone), you end up with a broken system. There will not be any assignability checking between the two because of the effort spent on backwards compatibility, but they will use different serialisation rules and so the "old" half will silently misinterpret the data.

In my view, that's such a blatant violation of backwards compatibility that the only option is to deviate from the spec and use a different default. Interoperation with a proper XTypes-capable DDS implementation that uses appendable as its default is safe, because the assignability rules say the two cannot communicate. (Nope, didn't double-check that Cyclone does this correctly before writing this sentence 😊😳)

Re 2: Cyclone simply follows the spec (apart from the default extensibility) and the spec spells out what encoding the writer needs to use given the type. Cyclone deliberately rejects anything with the wrong encoding, because who can trust a writer that uses the wrong encoding?

This is of course something that could be done differently. There is one deserialiser for all the encodings, so it is not difficult to do what RTI does and accept anything that can be deserialised. That would be in line with the old adage about being liberal in what one accepts — but I think the many security incidents of the past decade (or more) have shown that trying to be helpful often leads to vulnerabilities. I am not saying there is one here, just that I no longer believe in the wisdom of that adage.

Re 3: this is where it gets interesting. The "type consistency" QoS setting of the data reader doesn't have the "force type validation" flag set, so it doesn't require assignability checking and do it only if the type information is available. Cyclone sends the standardised XTypes::TypeInformation parameter 0x75 for this purpose. RTI only sends a vendor-specific one (TYPE_OBJECT_LB, 0x8021 in Wireshark).

So, there is no possibility for Cyclone of doing an assignability check and it can only check the type name (cf. the note in XTypes 1.3 7.6.3.2.2). RTI could do an assignability check, but presumably didn't because there is little reason to send the data if there are no matching subscribers. It is also possible that they did, decided it didn't match but just multicast the data anyway, regardless of whether there are any subscribers. Why one would do that, I don't know, but it could be.

In short, had RTI sent type information as spec'd instead of only doing something vendor-specific, everything would have been handled by the book, and the only spec deviation would've been the default extensibility in Cyclone, for which I think there is a valid argument. And, thus, to me number 3 is the problem that causes the deserialisation failure and it is RTI's fault.

P.S. I can't be bothered to try right now, but I consider it well-nigh certain that Cyclone + OpenDDS would've gone by the book ...

@trittsv trittsv closed this as completed Feb 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants