
Message sizes greater than around 262 kB drop out and don't get received #3053

Closed
calvertdw opened this issue Oct 28, 2022 · 14 comments

@calvertdw

calvertdw commented Oct 28, 2022

Hello there, we are having trouble with large message sizes. We've tried increasing the socket buffer sizes but it doesn't seem to have any effect.

pqos.transport().send_socket_buffer_size
pqos.transport().listen_socket_buffer_size

wqos.reliability().kind = BEST_EFFORT_RELIABILITY_QOS;
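
For context, a minimal sketch of where these settings get applied, assuming the standard Fast DDS 2.x DDS-layer API (the helper function names and the 12 MB buffer values are just examples):

#include <fastdds/dds/domain/DomainParticipant.hpp>
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/dds/publisher/qos/DataWriterQos.hpp>

using namespace eprosima::fastdds::dds;

// Participant with enlarged socket buffers (example values).
DomainParticipant* create_participant_with_big_buffers()
{
    DomainParticipantQos pqos = PARTICIPANT_QOS_DEFAULT;
    pqos.transport().send_socket_buffer_size = 12582912;    // 12 MB
    pqos.transport().listen_socket_buffer_size = 12582912;  // 12 MB
    return DomainParticipantFactory::get_instance()->create_participant(0, pqos);
}

// Writer QoS matching the snippet above.
DataWriterQos make_writer_qos()
{
    DataWriterQos wqos = DATAWRITER_QOS_DEFAULT;
    wqos.reliability().kind = BEST_EFFORT_RELIABILITY_QOS;
    return wqos;
}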

Expected behavior

Large messages get across the network. For instance, 4K video messages of about 1 MB, or colored point cloud data from a RealSense L515 at about 4 MB.

Current behavior

Messages larger than ~262 kB may get through for a second or two, but then they stop arriving.

Steps to reproduce

Modify the HelloWorldPublisher and HelloWorldSubscriber examples, adding a RawCharMessage.idl:

struct RawCharMessage
{
	sequence<char, 10000000> data;
};

Double the message data size every 1 second.

// epoch time (in seconds) of the last size increase
static std::time_t lastDataSizeIncreaseTime = -1;
static int dataSize = 1;

bool HelloWorldPublisher::publish(
        bool waitForListener)
{
    if (listener_.firstConnected_ || !waitForListener || listener_.matched_ > 0)
    {
        // Fill the sample with dataSize bytes
        data.data().clear();
        for (int i = 0; i < dataSize; i++) {
            data.data().push_back('a');
        }
        // Double the payload size at most once per second
        if ((std::time(0) - lastDataSizeIncreaseTime) >= 1) {
            dataSize = dataSize * 2;
            lastDataSizeIncreaseTime = std::time(0);
        }
        writer_->write(&data);
        return true;
    }
    return false;
}

Fast DDS version/commit

master

Platform/Architecture

Other. Please specify in Additional context section.

Transport layer

UDPv4

Additional context

Arch Linux and Fedora, up-to-date.

@calvertdw calvertdw added the triage Issue pending classification label Oct 28, 2022
@calvertdw
Author

We changed from sending over a large company network to a direct wired connection between our machines and we were able to send up to 16 MB messages.

@contr4l

contr4l commented Nov 2, 2022

We changed from sending over a large company network to a direct wired connection between our machines and we were able to send up to 16 MB messages.

Hi, I ran into a similar issue. The IDL is:

struct HelloWorld
{
	unsigned long index;
	sequence<char> message;
};

The publish code is:

bool HelloWorldPublisher::publish(
        bool waitForListener)
{   
    std::vector<char> msg;
    for (int i=0; i<500*1024; i++)
        msg.push_back('A');
    if (listener_.firstConnected_ || !waitForListener || listener_.matched_ > 0)
    {
        hello_.index(hello_.index() + 1);
        hello_.message(msg);
        writer_->write(&hello_);
        return true;
    }
    return false;
}

When the payload is large, e.g. bigger than 50 KB, the subscriber cannot receive any data, but when it is reduced to 50 or 100 bytes, it works normally.

If I change the sequence to char message[500*1024] or sequence<char, 500*1024>, it works fine as well.

So I'm not sure whether this is a problem with using an unbounded sequence in the IDL.

@calvertdw
Author

Does anyone know what comes into play here? Reliable mode doesn't seem to mitigate this issue. It seems the message transmission quality has a steep dropoff, where trying to resend only makes things worse. On what kinds of networks have you seen reliable mode be effective? Is it just on long time scales and 99.99% reliable networks?

@ds58

ds58 commented Nov 3, 2022

[quotes @contr4l's comment above in full]

I was able to replicate this exactly. The subscriber doesn't receive the data unless you specify the sequence size in the IDL. I wouldn't expect that to be normal behavior, but maybe it is?

EDIT: after further experimentation, the limit for this specific example (with an unfixed length char vector) seems to be a length of 100. A message with 101 chars in the char vector will not be received by the subscriber.

@ds58

ds58 commented Nov 3, 2022

Looks like, by default, Fast-DDS-Gen imposes this limit.

In one case, I've set my IDL to a bounded sequence:

struct RawCharMessage
{
	unsigned long index;
	sequence<char, 1000> data;
};

In the other, I've left it unbounded:

struct RawCharMessage
{
	unsigned long index;
	sequence<char> data;
};

@JLBuenoLopez
Contributor

There are two different problems being reported in this ticket. On the one hand, the one reported by the issue opener (@calvertdw), which relates to sending large data messages over lossy networks. In this case it is necessary to understand that several layers are involved:

  1. The Fast DDS library splits the large data message into fragments that fit the maximum UDP datagram size of ~65 kB.
  2. The IP layer, in turn, fragments each UDP datagram according to the network MTU. The MTU is usually 1500 bytes, so each UDP datagram is split into ~40 IP fragments.

In order to receive a sample, every one of its UDP datagrams must be received. If BEST_EFFORT is used, the UDP datagrams are sent once and only once. If the communication is RELIABLE, the UDP datagrams are resent (unless they are overwritten in the DataWriter's History, which depends on how the HistoryQosPolicy has been configured). Depending on the publication rate and the network bandwidth, this can also flood the network with data and worsen the situation instead of improving it.

@calvertdw, you may try to increase the MTU if your network hardware allows it, or to limit maxMessageSize to below the MTU to prevent IP fragmentation, leaving all fragmentation to Fast DDS.
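
A minimal sketch of that suggestion, assuming the Fast DDS 2.x custom transport API (the function name and the 1400-byte value are examples; 1400 leaves headroom for UDP/IP headers under a 1500-byte MTU):

#include <memory>

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/transport/UDPv4TransportDescriptor.h>

// Register a custom UDPv4 transport whose maxMessageSize stays below the MTU,
// so every RTPS fragment fits in a single IP frame.
eprosima::fastdds::dds::DomainParticipantQos make_small_datagram_qos()
{
    auto udp_transport = std::make_shared<eprosima::fastdds::rtps::UDPv4TransportDescriptor>();
    udp_transport->maxMessageSize = 1400;  // example value

    eprosima::fastdds::dds::DomainParticipantQos pqos;
    pqos.transport().user_transports.push_back(udp_transport);
    pqos.transport().use_builtin_transports = false;  // replace the default transports
    return pqos;
}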

The second issue reported here is a very common one (#2903, #2740, #2330...). By default, Fast DDS is configured with the PREALLOCATED_MEMORY_MODE MemoryManagementPolicy. This means that if your data type is unbounded, Fast DDS preallocates some memory for your data samples, but if a sample turns out to be larger, no more memory is allocated. @contr4l and @ds58, you should change the MemoryManagementPolicy to one that allows reallocations at run time. You have more information in the links above.
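
A minimal sketch of that change, assuming the Fast DDS 2.x C++ API (the helper name is hypothetical; DYNAMIC_RESERVE_MEMORY_MODE would also work):

#include <fastdds/dds/publisher/qos/DataWriterQos.hpp>
#include <fastdds/dds/subscriber/qos/DataReaderQos.hpp>

// Let the endpoint histories reallocate when a sample of an unbounded type
// turns out to be bigger than the preallocated size.
void allow_reallocations(
        eprosima::fastdds::dds::DataWriterQos& wqos,
        eprosima::fastdds::dds::DataReaderQos& rqos)
{
    wqos.endpoint().history_memory_policy =
            eprosima::fastrtps::rtps::PREALLOCATED_WITH_REALLOC_MEMORY_MODE;
    rqos.endpoint().history_memory_policy =
            eprosima::fastrtps::rtps::PREALLOCATED_WITH_REALLOC_MEMORY_MODE;
}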

@EduPonz

EduPonz commented Nov 4, 2022

2. The IP layer, in turn, fragments each UDP datagram according to the network MTU. The MTU is usually 1500 bytes, so each UDP datagram is split into ~40 IP fragments.

To further elaborate on this: if your network experiences IP frame drops at a rate of 1 in every 40, no UDP datagram can be reconstructed upon reception, and consequently no UDP datagrams are ever handed over to Fast DDS. There is nothing Fast DDS can do in this situation, since it is a reliability problem in the lower layers, which can be caused by a myriad of reasons. However, as @JLBuenoLopez-eProsima points out, by setting maxMessageSize to something smaller than the MTU you make your UDP datagrams fit into single IP frames, so in the scenario I proposed Fast DDS will receive 39 out of every 40 of those UDP datagrams containing very small RTPS data fragments, and as a consequence there would only be resends for the 1 missing frame out of every 40.
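
To put rough numbers on this, a back-of-the-envelope sketch (assuming independent IP frame losses, which is an idealization):

#include <cmath>
#include <cstdio>

int main()
{
    // A sample fragmented into n IP frames is only delivered if all n frames arrive.
    const double loss = 1.0 / 40.0;  // 1 dropped IP frame out of every 40
    const int frames[] = {1, 10, 40};  // 1: datagram under the MTU; ~40: a 64 kB datagram
    for (int n : frames)
    {
        std::printf("n = %2d IP frames -> delivery probability %.1f%%\n",
                n, 100.0 * std::pow(1.0 - loss, n));
    }
    return 0;
}

Under these assumptions a single-frame datagram gets through about 97.5% of the time, while a 40-fragment datagram only makes it through roughly a third of the time, which is consistent with large samples rarely arriving.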

@calvertdw
Author

Thank you guys, that's extremely helpful.

you may try to increase the MTU if your network hardware allows it or limiting the maxMessageSize under the MTU to prevent IP fragmentation, leaving all fragmentation to Fast DDS.

in the scenario I proposed, 39 out of every 40 of those UDP datagrams containing very small RTPS data fragments and as a consequence there would only be resends for that 1 missing frame every 40.

These are both critical pieces of information to know when using Fast-DDS. Could we add them to the documentation on large data rates?
https://fast-dds.docs.eprosima.com/en/latest/fastdds/use_cases/large_data/large_data.html

problem of the reliability on lower layers, which can be caused by a myriad of reasons.

Is there some database or discussion of these possible reasons somewhere? Anywhere on the internet; it doesn't have to be the Fast-DDS documentation. It would be nice to have at least some kind of list of common reasons to reference.

you may try to increase the MTU

From https://en.wikipedia.org/wiki/Maximum_transmission_unit:

Larger MTU is associated with reduced overhead. Smaller MTU values can reduce network delay.

It seems as though changing the MTU doesn't address the issue of reliability, so I think we would opt for merely reducing maxMessageSize to 1500.

@MiguelCompany MiguelCompany removed the triage Issue pending classification label Jan 31, 2023
@qpc001

qpc001 commented Mar 12, 2023

The original question was never fixed. Even after doing this:

  1. setting maxMessageSize
  2. increasing the socket buffer sizes

none of it helps with sending and subscribing to large data like a 20000 * 20000 uint8_t array. The data is lost on the network or somewhere...

@calvertdw
Author

@qpc001 That is correct. We solved our problem only by reducing our message sizes. For us, this meant no longer sending point clouds and switching to JPEG- or PNG-compressed depth images.

I proposed some solutions above that should be addressed.

@Mario-DL
Member

According to our CONTRIBUTING.md guidelines, I am closing this issue for now. Please, feel free to reopen it if necessary.

@calvertdw
Author

I found some relevant advice now in the documentation. If we run into this again, we'll try these steps and reopen if it doesn't work. Thanks!
https://fast-dds.docs.eprosima.com/en/latest/fastdds/use_cases/large_data/large_data.html#example-sending-a-large-file

@calvertdw
Author

This issue should really stay open. We still don't have a working solution and can't reliably send messages over ~262 kB.

@JLBuenoLopez
Contributor

Hi @calvertdw

I can reopen it and move it to the Support discussion forum. As already explained, this is not a proper issue or bug in the library, at least for the moment. Our CI checks that messages larger than 262 kB are sent, so this can be caused by several other factors besides the Fast DDS library itself, for instance the network architecture, misconfigured QoS, or mismatched expectations...

You might consider changing to a different transport like TCP and/or modifying the discovery mechanism, as DDS relies on multicast for discovery and this is troublesome over Wi-Fi connections. The Discovery Server mechanism might be the way to go.
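
A minimal sketch of the TCP transport change, assuming the Fast DDS 2.x C++ API (the function name and port 5100 are hypothetical; the connecting participant additionally needs this machine's address in its initial peers list):

#include <memory>

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/transport/TCPv4TransportDescriptor.h>

// Replace the built-in UDP transports with TCPv4 on the participant that accepts connections.
eprosima::fastdds::dds::DomainParticipantQos make_tcp_server_qos()
{
    auto tcp_transport = std::make_shared<eprosima::fastdds::rtps::TCPv4TransportDescriptor>();
    tcp_transport->add_listener_port(5100);

    eprosima::fastdds::dds::DomainParticipantQos pqos;
    pqos.transport().user_transports.push_back(tcp_transport);
    pqos.transport().use_builtin_transports = false;
    return pqos;
}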

Finally, eProsima offers architecture studies for users who are struggling to make Fast DDS work with their specific use case. You might consider contacting eProsima's commercial support team for more information.

@eProsima eProsima locked and limited conversation to collaborators Sep 29, 2023
@JLBuenoLopez JLBuenoLopez converted this issue into discussion #3892 Sep 29, 2023

