Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add active peer probing and a cached addr book #90

Merged
merged 80 commits into from
Dec 18, 2024
Merged

Conversation

2color
Copy link
Member

@2color 2color commented Nov 27, 2024

What

This is an attempt to fix #16 by implementing #53.

Also fixes #25

How

  • New Cached Address Book
  • New Cached Router and enriches results with cached addresses when records have no addresses.
    • Implement a custom iterator for FindProviders that looks up cache, returns result with addrs if there's a cache HIT, or dispatches a FindPeer if there's a cache miss, which it then returns to the user once a result is rerturned
  • New background goroutine
    • Subscribes to identify and connectedness events and updates cached address book.
    • Runs a probe against all peers that meet probe criteria:
      • Not currently connected
      • Haven't been probed in the last threshold (1 hour)

New magic numbers

We have to start with some default. This PR introduces some magic numbers which will likely change as we get some operational data:

// The TTL to keep recently connected peers for. This should be enough time to probe
const RecentlyConnectedAddrTTL = time.Hour * 24
// Connected peers don't expire until they disconnect
const ConnectedAddrTTL = math.MaxInt64
// How long to wait since last connection before probing a peer again
const PeerProbeThreshold = time.Hour
// How often to run the probe peers function
const ProbeInterval = time.Minute * 5
// How many concurrent probes to run at once
const MaxConcurrentProbes = 20
// How many connect failures to tolerate before clearing a peer's addresses
const MaxConnectFailures = 3

Open questions

  • The only peers we probe and cache are peers that we've successfully run identify with. Peers returned without multiaddrs from FindProviders for which we have no cached multiaddrs remain unresolved.
    • Should we try to call FindPeer inside the iterator so they can be resolved? This can blocking the streaming of othe providers in the iterator.
    • Another way might be to subscribe to kad-dht query events (not 100% sure if this is possible) and add to probe loop
  • Should we probe the last connected addr or all addresses we have for a Peer?
  • When should we augment results with cached addresses? Currently, it's done only when there are no results in the FindProviders from kad-dht. The presumption there is that if you get the results from FindProviders have multiaddrs for a peer, it's up to date.
  • How do we prevent excessive memory consumption by the cached address book? The memory address book already has built in limits and clean up. However, the peers map doesn't. temp solution: I've added some instrumentation for this

@2color 2color marked this pull request as ready for review November 28, 2024 15:43
@2color 2color requested review from lidel and aschmahmann November 28, 2024 15:57
this adds metric for evaluating all addr lookups
someguy_cached_router_peer_addr_lookups{cache="unused|hit|miss",origin="providers|peers"}

I've also wired up FindPeers for completeness.
Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a first pass and dropped some suggestions inline. I also pushed with new metric (details inline).

As for Open questions, my thinking is:

  • The only peers we probe and cache are peers that we've successfully run identify with. Peers returned without multiaddrs from FindProviders for which we have no cached multiaddrs remain unresolved.
    • Should we try to call FindPeer inside the iterator so they can be resolved? This can blocking the streaming of the providers in the iterator.

Indeed, looking at someguy_cached_router_peer_addr_lookups shows we have cache miss quite often (0 addrs + cache also does not have them).

Was bit difficult to reason without some real-world input, so I've piped root CIDs hitting our staging environment, to populate the metric:

  • with CID duplicates: ssh [email protected] tail -F /var/log/nginx/access-json.log | stdbuf -o0 jq -r .request_uri | awk -F'[/&?]' '{print $3; fflush()}' | xargs -P 100 -I {cid} curl -s -o /dev/null -w "%{http_code} %{url.path}\n" "http://127.0.0.1:8190/routing/v1/providers/{cid}"
  • only unique CIDs: ssh [email protected] tail -F /var/log/nginx/access-json.log | stdbuf -o0 jq -r .request_uri | awk -F'[/&?]' '!seen[$3]++ {print $3; fflush()}' | xargs -P 100 -I {cid} curl -s -o /dev/null -w "%{http_code} %{url.path}\n" "http://127.0.0.1:8190/routing/v1/providers/{cid}"

A few minutes later http://127.0.0.1:8190/debug/metrics/prometheus shows:

# HELP someguy_cached_router_peer_addr_lookups Number of peer addr info lookups per origin and cache state
# TYPE someguy_cached_router_peer_addr_lookups counter
someguy_cached_router_peer_addr_lookups{cache="hit",origin="providers"} 1323
someguy_cached_router_peer_addr_lookups{cache="miss",origin="providers"} 6574
someguy_cached_router_peer_addr_lookups{cache="unused",origin="providers"} 7686

So yes, finding a way of decreasing miss feels useful, given how high it is.

Two ideas:

  • Lazy/easy: avoid blocking iterator by adding peers with cache misses to some queue, and then processing them asynchronously at some safe rate, populating cache in best-effort fashion. May not help first query, but all subsequent ones, over time, will get increased cache hit
  • Implement custom iterator: if peer hits cache miss, we dont return the peer, but silently moves to the next item, and puts current one at the side queue which is processed async calling findPeer. once the iterator hits the last item, we go back to items on the side queue. This way we don't slow down results with addrs, and we can wait and stream ones at the end without impacting perf of fast ones.
  • Should we probe the last connected addr or all addresses we have for a Peer?

See comment inline, iiuc host.Connect effectively probes all of known addrs, until success.
Probably good enough for now. If we need per-addr resolution, we may need ask go-libp2p for new API.

Note that vole libp2p identify <multiaddr> connects to specific multiaddr because it does not run routing and spawns a new libp2p host every time.

  • When should we augment results with cached addresses? Currently, it's done only when there are no results in the FindProviders from kad-dht. The presumption there is that if you get the results from FindProviders have multiaddrs for a peer, it's up to date.

I think current approach of hitting cache if regular routing returns no addrs is sensible.
It also makes it easier to reason about metrics like someguy_cached_router_peer_addr_lookups{origin,cache}

  • How do we prevent excessive memory consumption by the cached address book? The memory address book already has built in limits and clean up. However, the peers map doesn't. temp solution: I've added some instrumentation for this

Cap at TTL of 48h?

cached_addr_book.go Show resolved Hide resolved
cached_addr_book.go Outdated Show resolved Hide resolved
cached_addr_book.go Outdated Show resolved Hide resolved
cached_addr_book.go Outdated Show resolved Hide resolved
cached_addr_book.go Outdated Show resolved Hide resolved
cached_addr_book.go Outdated Show resolved Hide resolved
cached_addr_book.go Outdated Show resolved Hide resolved
cached_addr_book.go Outdated Show resolved Hide resolved
CHANGELOG.md Show resolved Hide resolved
server_cached_router.go Outdated Show resolved Hide resolved
@2color 2color requested a review from lidel December 11, 2024 11:46
Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is mostly ready to ship / deploy early next week, but we need to address things marked with ⚠️:

  • proactively remove peers that failed probing beyond max probing backoff
  • skip peers that have no addrs and skipped probing

details and other smaller nits inline
(apologies if I misunderstood something, did this late Thursday, but wanted to give feedback before Friday)

CHANGELOG.md Outdated Show resolved Hide resolved
server.go Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
cached_addr_book.go Outdated Show resolved Hide resolved
cached_addr_book.go Outdated Show resolved Hide resolved
cached_addr_book.go Outdated Show resolved Hide resolved
cached_addr_book.go Outdated Show resolved Hide resolved
server_cached_router.go Outdated Show resolved Hide resolved
server_cached_router.go Outdated Show resolved Hide resolved
server_cached_router.go Outdated Show resolved Hide resolved
@2color
Copy link
Member Author

2color commented Dec 17, 2024

iiuc we don't have good mechanism for "forgetting" least peers, other than reaching PeerCacheSize limit here.

  • we have a single cache for peerState success (online) and failures (offline)

  • over time the cache will be 1M of peers and most of them may be offline (due to regular churn, or network attacks - we've seen sudden spikes in random peers that then disappeared forever)

    • A good safeguard would be to explicitly remove peer from cache if failed for longer than MaxBackoffDuration (see comment below)
  • the cache will unlikely drop items often on its own with PeerCacheSize = 1_000_000, so there is no point in using 2Q, maybe switch to simpler LRU at this point

I made a couple of additions to your suggestion:

  • When we remove the the peerState for a given peer, we also remove it from the cache. This is important, because otherwise, the peer will be probed again (since we deleted the state but we iterate on the address book for the probe)
  • Increase the MaxBackoffDuration to 48 hours (from 24 hours). This is to delay when a peer is removed from address book and peerState cache. Why? Because even if a peer announces a provider record, but then goes offline, we want to know that their offline (through the peerState cache) for as long as that provider record is valid for, so that we don't dispatch FindPeer calls to that peer in the fallback router.

When the max backoff duration is reached and a connection attempt fails
we clear the cached addresses and state. Since this state is useful to
prevent unncessary attempts to dispatch a find peer we should keep it
for as long as a provider record is valid for.
@2color 2color requested a review from lidel December 17, 2024 10:25
Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @2color, lgtm, fyi added metric for tracking offline/online probe results (increase rate over time) – c1ac41b

Feel free to merge and release and deploy to https://delegated-ipfs.dev/ 🙏
We will see if any follow-up is needed once we observe metrics in prod (fine to pipe gateway cids to smoke-test there too).


Note for self / future reference:

  • The way the cache works in this PR is that we require at least one successful libp2p identify lookup for a peer to be accepted to the cache.
    • Once in cache, we periodically probe to confirm if peer is online/offline, and refresh addrs
    • If a cached peer turns offline, we note it in cache, and stop returning it as one of providers, but we keep probing cached peers (incl. offline ones) with backoff just in case they are online again. If not, we drop once they been offline longer than the Amino DHT expiration window.
  • Basing cache add on libp2p identify success means the cache does not include peerids that were returned by (dht?) router without addrs but "always failed to identify"
    • I think it is ok to leave this as-is, if we ever need to decrease cost related to those, we can start caching these dead peers somewhere in "dispatched peer lookup" code to ensure they fall under backoff logic as well.

@2color
Copy link
Member Author

2color commented Dec 18, 2024

Basing cache add on libp2p identify success means the cache does not include peerids that were returned by (dht?) router without addrs but "always failed to identify"

We only cache failure count and last failure since we don't have any addresses to cache for such peers.

  • I think it is ok to leave this as-is, if we ever need to decrease cost related to those, we can start caching these dead peers somewhere in "dispatched peer lookup" code to ensure they fall under backoff logic as well.

We already do.

When we dispatch a peer lookup, we use call FindPeer on the cachedRouter:

peersIt, err := it.router.FindPeers(ctx, *record.ID, 1)

Which records failed connections:

r.cachedAddrBook.RecordFailedConnection(pid) // record the failure used for probing/backoff purposes

Which ensures that the backoff logic applies to those peers too (ones that never successfully identiftied) for up to MaxBackoffDuration (48 hours) of failures. Only if a PeerID continues to be unreachable for 48 hours they will be removed from both address book and cache.

@2color 2color merged commit d117b28 into main Dec 18, 2024
7 checks passed
@2color 2color mentioned this pull request Dec 18, 2024
@2color
Copy link
Member Author

2color commented Dec 18, 2024

Another follow up improvement we could look into is checking the last failed connection in the peerCache before blindly augmenting with addresses from the cachedAddrBook.

As it currently stands, even if probing a peer fails, the cached addresses will still be returned, at least until MaxBackoffDuration of 48 hours is reached and the addresses are removed from cache.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Local storage for local caching purposes No multiaddrs returned from provider record lookups
3 participants