CPU overloads during the PDP phase with multiple Domain Participants using Simple Discovery #5519

MMarcus95 · 2024-12-19T18:20:12Z

Is there an already existing issue for this?

I have searched the existing issues

Expected behavior

CPU consumption should not grow so sharply with the number of spawned domain participants.

Current behavior

A CPU overload happens when spawning several domain participants.

Steps to reproduce

I'm spawning several domain participants in different threads, using Simple Discovery as the discovery mechanism, with the following code (170 domain participants in this case):

#include <fastdds/dds/domain/DomainParticipant.hpp>
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/domain/DomainParticipantListener.hpp>
#include <fastdds/rtps/transport/UDPv4TransportDescriptor.hpp>

#include <chrono>
#include <memory>
#include <stdexcept>
#include <stop_token>
#include <string>
#include <thread>
#include <vector>


eprosima::fastdds::dds::DomainParticipant* create_participant(const std::string& name){
    // Configure participant QoS
    eprosima::fastdds::dds::DomainParticipantQos participant_qos;
    // Use simple discovery
    participant_qos.wire_protocol().builtin.discovery_config.discoveryProtocol = eprosima::fastdds::rtps::DiscoveryProtocol::SIMPLE;
    // Configure discovery settings
    participant_qos.wire_protocol().builtin.discovery_config.leaseDuration = eprosima::fastdds::dds::Duration_t(3, 1);
    participant_qos.wire_protocol().builtin.discovery_config.leaseDuration_announcementperiod = eprosima::fastdds::dds::Duration_t(1, 2);
    // Raise the number of port mutation tries used when a port is already taken (default is 100u)
    participant_qos.wire_protocol().builtin.mutation_tries = 250u;
    // Set participant name
    participant_qos.name(name);
    // Use only UDPv4 transport
    auto udp_transport = std::make_shared<eprosima::fastdds::rtps::UDPv4TransportDescriptor>();
    participant_qos.transport().user_transports.push_back(udp_transport);
    participant_qos.transport().use_builtin_transports = false;
    // Create the participant
    eprosima::fastdds::dds::DomainParticipant *participant = eprosima::fastdds::dds::DomainParticipantFactory::get_instance()->create_participant(
        0,
        participant_qos,
        nullptr,
        eprosima::fastdds::dds::StatusMask::none()
    );
    if (!participant)
        throw std::runtime_error("Error: could not create participant");

    return participant;
}

void ddsparticipant_thread(std::stop_token st, const std::string name)
{
    // Create domain participant
    eprosima::fastdds::dds::DomainParticipant* participant = create_participant(name);
    
    while(!st.stop_requested())
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(2000));
    }
}

int main()
{
    // Number of participants to spawn
    const int num_participants = 170;

    // Spawn participants
    std::vector<std::jthread> threads;
    for (int i = 0; i < num_participants; ++i)
    {
        threads.push_back(std::jthread(ddsparticipant_thread, "participant_" + std::to_string(i)));
    }

    while (true)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }

    return 0;
}

Here is a screenshot of the CPU consumption when spawning 70 and 170 domain participants:

As a workaround, I'm already using the Discovery Server mechanism. However, some of the available Fast DDS tools, like DDS-Record-Replay or Fast-DDS-spy, do not support Discovery Server. More generally, I was surprised to see this CPU overload, so I would like to better understand why it is happening.

Fast DDS version/commit

v3.1.0

Platform/Architecture

Other. Please specify in Additional context section.

Transport layer

UDPv4

Additional context

The test is executed inside a docker image with Ubuntu Jammy Jellyfish 22.04 amd64.

The CPU is a 13th Gen Intel i7-13700H; more details follow (from the lscpu command).

XML configuration file

No response

Relevant log output

No response

Network traffic capture

No response


EugenioCollado · 2024-12-23T13:39:09Z

Hi @MMarcus95 ,

Thank you for reporting the issue. The behavior you are describing is already known, and we are actively working on a solution. The problem is due to an excess of discovery messages sent across the entire network (including to participants that are already matched) whenever a new participant spawns. This leads to an exponential increase in CPU usage with each additional participant.
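To get a feel for why re-announcing to already-matched participants is so costly, here is a minimal toy model, not Fast DDS code. It assumes participants join one at a time and counts announcement deliveries under two hypothetical policies: existing participants re-announce to every known peer, or they answer only the newcomer. The function names and both policies are illustrative assumptions, and the model ignores periodic announcements, multicast and the endpoint discovery phase.

```cpp
#include <cstdint>

// Toy model only: counts announcement deliveries, not real Fast DDS traffic.
// Assumption: participants join one at a time, and every join makes each
// already-running participant send its own announcement in response.

// Hypothetical policy A: each existing participant re-announces to every
// peer it knows, including the ones it has already matched.
constexpr std::uint64_t deliveries_reannounce_to_all(std::uint64_t n)
{
    std::uint64_t total = 0;
    for (std::uint64_t k = 0; k < n; ++k)  // k = participants already running
    {
        total += k;      // the newcomer's announcement reaches the k existing peers
        total += k * k;  // each of the k peers re-announces to k targets (k - 1 old peers + newcomer)
    }
    return total;
}

// Hypothetical policy B: each existing participant answers only the newcomer.
constexpr std::uint64_t deliveries_reply_to_newcomer(std::uint64_t n)
{
    std::uint64_t total = 0;
    for (std::uint64_t k = 0; k < n; ++k)
    {
        total += k;  // the newcomer's announcement reaches the k existing peers
        total += k;  // each of the k peers sends one reply to the newcomer
    }
    return total;
}
```

Under these assumptions, 170 participants produce 1,637,610 deliveries with the first policy and 28,730 with the second; for 70 participants the figures are 114,310 and 4,830. Real traffic depends on transports and timing, so treat these numbers only as an indication of how quickly the gap widens.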

We are currently testing the fix to address this inefficiency, and once it has been successfully validated it will be included in the next release.

In the meantime, using a tool such as the Discovery Server is an effective workaround, as it significantly reduces the amount of discovery traffic. Additionally, as a general recommendation, try to minimize the number of participants, as each participant inherently consumes resources due to its associated threads (check out the following table). Is there any particular reason for having this many participants in your setup?

Thank you for your patience and stay tuned for the upcoming release.

MMarcus95 · 2025-01-10T15:37:26Z

Hi @EugenioCollado,

thank you for the detailed explanation. Here at the Dynamic Legged Systems (DLS) lab we are developing a distributed, modular software framework for controlling robots (see A Practical Real-Time Distributed Software Framework for Mobile Robots).

In the current implementation we need several domain participants to address different aspects of the framework's behavior. Despite the domain separation, the CPU overload issue arose and we switched to Discovery Server. However, we would like to have a more distributed approach, using Simple Discovery. We could try to reduce the number of domain participants in the future, but the CPU overload could still arise when using multiple robots.

That said, I'm really eager to see this fixed in the next releases!

Thanks for all your effort.

EugenioCollado · 2025-01-13T07:24:32Z

Thank you for sharing details about your project at the DLS lab; it sounds fascinating! If you have any further needs, questions, or would like to be informed when the fix is ready, please don’t hesitate to reach out via email to [email protected].

Thank you for your valuable input, and we wish you continued success with your framework development!

MMarcus95 added the "triage" (Issue pending classification) label on Dec 19, 2024

EugenioCollado added the "in progress" (Issue or PR which is being reviewed) label and removed the "triage" (Issue pending classification) label on Dec 23, 2024
