
[BUG] Some data is lost during transmission. #1

Open
XLzed opened this issue Oct 18, 2022 · 13 comments

@XLzed

XLzed commented Oct 18, 2022

Describe the bug
Some data is lost during transmission, which causes an exception in the gRPC HTTP/2 deframer, and the Netty benchmark example hangs while waiting for all data.

Steps to Reproduce

  • grpc command: ./build/example/install/hadronio/bin/hadronio grpc benchmark -m 10000 -rs 10000 -as 10000 -r 0.0.0.0
  • netty command: ./build/example/install/hadronio/bin/hadronio netty benchmark throughput -s -l 100000 -m 1000

Additional info

  • grpc exceptions
    • Stream x does not exist
    • Frame of type 0 must be associated with a stream.
    • INTERNAL: Encountered end-of-stream mid-frame
    • Frame length: x exceeds maximum: y
  • netty benchmark throughput hangs
    (screenshot of the hanging benchmark omitted)
@fruhland
Contributor

Can you please provide some information on your test system? Especially, which type of network interconnect are you using (Ethernet, InfiniBand, etc.)?
The only error I recognize is "Stream x does not exist" from gRPC, but for me, it only occurs on a specific system and the benchmarks work fine on other systems.

@XLzed
Author

XLzed commented Oct 18, 2022

> Can you please provide some information on your test system? Especially, which type of network interconnect are you using (Ethernet, InfiniBand, etc.)? The only error I recognize is "Stream x does not exist" from gRPC, but for me, it only occurs on a specific system and the benchmarks work fine on other systems.

I tested it locally and the machine has no RDMA device, so the examples run over TCP only (I also set UCX_TLS=tcp).

System Info

  • Linux version 4.19.95-17 (root@runner-857a6918-project-16016-concurrent-0) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC))
  • openjdk 11.0.16 2022-07-19
  • OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu118.04)
  • OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu118.04, mixed mode, sharing)
  • UCX version: 1.13.1
  • ucx_info: ucx_info.log

Sequence Number Test

I also added an additional seqNumber at the head of each message for debugging, and found that some messages are lost or not retrieved correctly. The logs contain lines like: [WRN][HadronioSocketChannel] recv sequence number error, required [159], but get [290]

  • command: ./build/example/install/hadronio/bin/hadronio netty benchmark throughput -s -l 1000 -m 100000
    client.log server.log
  • command: ./build/example/install/hadronio/bin/hadronio grpc benchmark -m 100 -rs 10000 -as 10000 -s
    grpc-client.log grpc-server.log
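The debugging approach described above (prepending a sequence number to each message and checking it on receive) can be sketched roughly as follows. This is an illustrative reconstruction, not hadroNIO's actual code; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the sequence-number check described above.
public class SeqNumberCheck {
    private int nextExpected = 0;

    // Sender side: prepend a 4-byte sequence number to the payload.
    public static ByteBuffer frame(int seq, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(seq).put(payload).flip();
        return buf;
    }

    // Receiver side: return the payload, warning if a message was
    // lost or reordered (like "required [159], but get [290]").
    public byte[] check(ByteBuffer buf) {
        int seq = buf.getInt();
        if (seq != nextExpected) {
            System.err.printf("[WRN] recv sequence number error, required [%d], but get [%d]%n",
                    nextExpected, seq);
        }
        nextExpected = seq + 1;
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        return payload;
    }
}
```

A check like this pinpoints reordering without changing the transport itself, which is why it surfaces the warning even though no bytes are actually dropped on the wire.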

I also tested between two machines that support RoCEv2, and the exception occurred there as well. Some information about the RDMA test environment:

  • Ethernet controller: Mellanox Technologies MT28850
  • MLNX_OFED_LINUX-5.4-3.4.0.0
  • rdma-core v35.4

I can use UCX and ibverbs to communicate directly, so maybe the logic of tag_send/recv or of the RingBuffer causes this problem?

@XLzed
Author

XLzed commented Oct 19, 2022

If I force sendTaggedMessage to be blocking, the examples work fine.

//      final boolean completed = endpoint.sendTaggedMessage(sendBuffer.memoryAddress() + index, messageLength, tag, true, blocking);
        final boolean completed = endpoint.sendTaggedMessage(sendBuffer.memoryAddress() + index, messageLength, tag, true, true);
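Forcing the send to be blocking effectively serializes completions: the next send is not issued until the previous request has finished, so completion callbacks cannot fire out of order. A minimal sketch of that idea, with illustrative interfaces (these are not the actual hadroNIO/JUCX types):

```java
// Hypothetical sketch: a blocking send drives UCX progress until the
// request completes before the caller can issue the next send, so
// completions happen strictly in send order.
interface Request { boolean isCompleted(); }
interface Worker { void progress(); }

final class BlockingSend {
    static void awaitCompletion(Worker worker, Request request) {
        while (!request.isCompleted()) {
            worker.progress(); // drive progress until the send finishes
        }
    }
}
```

The trade-off is throughput: blocking on every send gives up the pipelining that non-blocking tagged sends are meant to provide.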

@fruhland
Contributor

Thanks for the detailed report. I will try to reproduce the issue and have a look into what's going wrong.

@XLzed
Author

XLzed commented Oct 20, 2022

It seems that tag-matching sends do not complete strictly in order. Maybe we have to deal with out-of-order completion, or use a different UCX semantic? I don't know whether the data is still received in the same order as the receive buffers were submitted when the requests don't complete in order.

@fruhland
Contributor

According to this (openucx/ucx#6370), tag-matching messages will be received in order:

> Q: If I invoke two ucp_tag_send_nb on the same ep one by one, will these two send requests be completed in the invoke order? Does it matter whether I use RC or not?
> A: They may be completed in a different order, but will be matched in the same order on the receiver.
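The distinction in that answer is subtle enough to be worth a toy model: receives are matched in post order, but the completion callbacks can fire in a different order, e.g. when a large rendezvous transfer finishes later than a small eager one posted after it. The class below is only a simulation of that timing, not UCX code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model: message i is matched in posting order (0, 1, 2, ...),
// but its callback fires at completionTime[i], so the callback order
// can differ from the matching order.
public class MatchVsCallback {
    public static List<Integer> callbackOrder(int[] completionTime) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < completionTime.length; i++) order.add(i);
        order.sort(Comparator.comparingInt(i -> completionTime[i]));
        return order;
    }
}
```

For example, if message 0 is a slow rendezvous transfer, its callback may run after those of messages 1 and 2, even though it is matched first on the receiver.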

@Yangfisher1

We encountered the same problem with "Frame of type 0 must be associated with a stream". It happened when testing a gRPC demo backed by hadroNIO, both over TCP transport (on my local Mac) and over RDMA in a RoCEv2 environment. However, when we switched to an InfiniBand cluster, everything worked well. Is the problem solved?

@Yangfisher1

> We encountered the same problem with "Frame of type 0 must be associated with a stream". It happened when testing a gRPC demo backed by hadroNIO, both over TCP transport (on my local Mac) and over RDMA in a RoCEv2 environment. However, when we switched to an InfiniBand cluster, everything worked well. Is the problem solved?

Not exactly. I found the problem might be due to the size of the RingBuffer. When I reduce the size of the data transferred by gRPC, it works well.

@Yangfisher1

I don't know how to explain this, but when I set DEFAULT_BUFFER_SLICE_LENGTH to 16K, which is the default maximum size of an HTTP/2 data frame, the problem disappeared.
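One plausible reading of this observation (an assumption, not a confirmed explanation): with the default HTTP/2 SETTINGS_MAX_FRAME_SIZE of 16384 bytes, a buffer slice of at least 16K means a frame's payload never has to be split across multiple slices, so a single out-of-order slice is less likely to land mid-frame. A back-of-the-envelope check:

```java
// Illustrative only: how many slices a frame payload spans for a
// given slice length. The real interaction with hadroNIO's
// RingBuffer may be more subtle.
public class SliceMath {
    public static int slicesNeeded(int frameLength, int sliceLength) {
        return (frameLength + sliceLength - 1) / sliceLength; // ceiling division
    }
}
```

With a 16384-byte slice, a maximum-size default frame fits in one slice; with an 8192-byte slice it spans two, giving reordering more opportunities to corrupt the deframer's state.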

@XLzed
Author

XLzed commented Aug 29, 2024

> I don't know how to explain this, but when I set DEFAULT_BUFFER_SLICE_LENGTH to 16K, which is the default maximum size of an HTTP/2 data frame, the problem disappeared.

There are some bugs: the transport protocol it implements does not guarantee that received data can be processed in order. UCX's tag-matching semantics switch between the eager and rendezvous (rndv) protocols based on message size, reducing latency by replacing multiple sends with a single RDMA read. This can delay the completion callbacks of messages using the rndv protocol, but it does not affect the order in which the receive buffers are matched. However, the library uses the execution order of the callback functions as the parsing order of the received buffers, resulting in out-of-order packets.
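Given that analysis, one way to restore ordering (a sketch under the assumptions above, not hadroNIO's actual code) is to track receives in submission order and only hand a buffer to the parser once every earlier receive has completed too, instead of parsing in callback order:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reorder queue: deliver completed receive buffers to
// the parser strictly in the order they were posted, even if their
// completion callbacks fire out of order.
public class InOrderDelivery {
    private static final class Slot {
        final int id;
        boolean completed;
        Slot(int id) { this.id = id; }
    }

    private final ArrayDeque<Slot> pending = new ArrayDeque<>();
    private final List<Slot> all = new ArrayList<>();
    private final List<Integer> delivered = new ArrayList<>();

    // Called when a receive buffer is posted (submission order).
    public int post() {
        Slot s = new Slot(all.size());
        all.add(s);
        pending.addLast(s);
        return s.id;
    }

    // Called from the (possibly out-of-order) completion callback.
    public void complete(int id) {
        all.get(id).completed = true;
        // Deliver from the head only while the head has completed.
        while (!pending.isEmpty() && pending.peekFirst().completed) {
            delivered.add(pending.removeFirst().id);
        }
    }

    public List<Integer> delivered() { return delivered; }
}
```

With this scheme, a buffer whose callback fires early is simply held back until its predecessors complete, which matches the UCX guarantee that buffers are matched in post order.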

So I switched to using JUCX directly in my use case, which also avoids the data copy of the RingBuffer, but it requires more code development.

@Yangfisher1

@XLzed Thanks! The problem seems a little tricky.

Actually, we developed a version that uses JUCX directly to transmit data in gRPC. However, it requires modifying the RPC handler code, and we wanted a transparent solution, which is what this project seemed to offer. But it looks like it's still far from directly usable 😂

@fruhland
Contributor

We tested on different setups with ConnectX-3 and ConnectX-5 cards and never encountered this problem. It seems like InfiniBand cards are not affected by this.

@Yangfisher1

@fruhland I tested the demo on an IB cluster and a RoCEv2 cluster. I think it is not the IB cards but the IB switch that prevents the problem, because the RoCEv2 cluster also used ConnectX-6 cards while the underlying transport was based on UDP rather than IB.
