Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[relay client] choose nearest relay based on latency #2952

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

mohamed-essam
Copy link
Collaborator

Describe your changes

Change picker to choose nearest relay based on connection latency.

Issue ticket number and link

#2950

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)
  • Extended the README / documentation, if necessary

@pappz
Copy link
Contributor

pappz commented Nov 26, 2024

Hi @mohamed-essam,
First of all, thank you for your PR. Your observation regarding the scheduler is a good point.

When we designed this feature we focused on the fastest connection time as possible. In a corner case, your logic will flip this requirement and block the connection more then the defined timeout time. Imagine that situation if the length of the list of servers is 8. And you have two really slow servers. Actually one of them is between 1-7 and 8. In this case, the code will block until 2*30 sec.

For your scenario, we suggest using only one domain with the Geo DNS service.

Hard to write good code for the scheduler but we can take try. I think your solution could works well because the
client.Connect() does not start new go routines so the scheduler does not trigger context switches, luckily. But if we develop something inside the Client struct that break this rule then your solution will be provide invalid latency numbers again. So in a nutshell you measure the running time of the client.Connect() and not just the network latency. And by the way I am working on a task that will start new parallel connection inside the Client code :/ . In the ideal case we should extend our network protocol, with elapsed time in the response (somehow).

But back to your PR. What do you think if after this line, you check the length of the channel and if it contains more then one items, then drain the channel and compare the elapsed times on that? And do not wait for any new connections.

@mohamed-essam
Copy link
Collaborator Author

Hello @pappz,

Thanks for the clarification of the requirement, I haven't really considered this case in my change, I think it might be possible to do something that is a mix of both, as in most cases delaying initial connection by some amount (think for example 200ms), could result in a much faster relay off the get go.

As for the Geo DNS, my problem with that solution is that it would rely only on geographical distance latency, and not take into account (for example) relay server load.

I'm working on a commit now that should address both the corner case of slow servers (by having some sort of upper limit of waiting after the first successful connection, which I'm thinking a value between 200-500ms should be sufficient to both allow cases of Go scheduling delays and weird networking to resolve), and the measurement of latency (by using httptrace inside ws.go Dial())

Please let me know what you think of that previous approach even if I haven't committed it yet 🙏

@pappz
Copy link
Contributor

pappz commented Nov 28, 2024

Measuring the time in deeper layers looks promising. Good idea. I will take a look and we will see.

I am still thinking about the length of the result channel in a 1 core system as you mentioned in the ticket. Can I ask you for a test? Could you print out the length of the result channel every time when read a result from the channel?

cr := <-resultChan
log.Debugf("resultChan len: %d", len(resultChan))

If queued with multiple results, then we do not need to wait more. In the end, maybe we do not choose the fastest server but definitely do not choose the slowest. If this approach shows unexpected results then the time window sounds a good way.

@mohamed-essam
Copy link
Collaborator Author

mohamed-essam commented Nov 29, 2024

I don't have access to a single core machine today, so as a workaround I executed netbird -F with cpuaffinity to force it into a single CPU.
I added one more debug line before receving from the channel (to see if there is sometimes a connection already ready without the processConnResults goroutine halting), and that seems to be the case:

2024-11-29T13:38:58+02:00 WARN client/picker.go:99: beforeSelect:len(resultChan)=0
2024-11-29T13:38:59+02:00 WARN client/picker.go:110: afterSelect:len(resultChan)=0
2024-11-29T13:38:59+02:00 WARN client/picker.go:99: beforeSelect:len(resultChan)=1
2024-11-29T13:38:59+02:00 WARN client/picker.go:110: afterSelect:len(resultChan)=0
2024-11-29T13:38:59+02:00 WARN client/picker.go:99: beforeSelect:len(resultChan)=1
2024-11-29T13:38:59+02:00 WARN client/picker.go:110: afterSelect:len(resultChan)=0

In this case (3 relays), there were 2 times where a connection was ready immediately after the last connection was processed.

This also seemed to be dependent on whether the CPU was loaded or not, when loaded it was more likely to be always empty (before and after select statement).

One more thing I noticed, is that when CPU was free, the order of the connections was almost always (somewhat) already sorted by latency, but when CPU was loaded, the order of connections was more likely to be the order in which they were configured in management. I'm assuming this is because the first item in the array is likely to be connected to first if the CPU load is high enough for the process to wait a bit until connecting to the next relay.

All in all, the slight time delay seems to be always choosing the lowest latency in both cases.

@mgarces mgarces linked an issue Nov 29, 2024 that may be closed by this pull request
@pappz
Copy link
Contributor

pappz commented Dec 4, 2024

Thank you for sharing your experience.
I would like to draw your attention to this work. What do you think is the net/http/httptrace usable in QUIC too? If so, I don't see any other blocking issues in this topic.

@mohamed-essam
Copy link
Collaborator Author

That's some nice work on the QUIC protocol, but given it's using the quic-go package I see no hint anywhere in it that it supports net/http/httptracer

However, it seems like it already implements its own version of it in logging.ConnectionTracer.

After some digging (due to the lack of documentation on that package), I found the equivalent methods of it we would need are

  • StartedConnection (which is called right before sending anything)
  • ChoseAlpn (which is called right after the initial handshake is done, which is the closest to TTFB in httptracer)

If the QUIC relay is merged before this PR, I could test the previous myself, if you can test the ConnectionTracer on the side and provide the difference between httptracer and ConnectionTracer values I believe it would provide a helpful insight.

Copy link

sonarqubecloud bot commented Dec 5, 2024

@mlsmaycon
Copy link
Collaborator

@mohamed-essam with the addition to QUIC proto support, we caused a merge conflict. can you check them?

@mohamed-essam
Copy link
Collaborator Author

Will try to do so over the weekend🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Relay] chosen home relay server is random
3 participants