Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimistic cache may return wrong result once it meets ServFail #436

Closed
BusyJay opened this issue Feb 12, 2025 · 3 comments · May be fixed by #437
Closed

Optimistic cache may return wrong result once it meets ServFail #436

BusyJay opened this issue Feb 12, 2025 · 3 comments · May be fixed by #437
Assignees
Labels
needs investigation Should be reproduced and analysed

Comments

@BusyJay
Copy link

BusyJay commented Feb 12, 2025

Optimistic cache will return stale cache and trigger a background resolve to refresh the cache. Consider following timeline:

receive query -> resolve -> ServFail
receive query -> cache hit and return ServFail
                        |
                       -> resolve -> MXDOMAIN -> skip updating cache

If a query always returns MXDOMAIN generally, and somehow upstream returns a ServFail, it will be cached for 30 seconds and then become stale. However, when background retry sees MXDOMAIN, it will skip refresh cache, so all further request will be always responded by a stale cahched ServFail. Clients may retry ServFail and flood adguardhome with meaningless requests.

A simple fix is don't return any stale ServFail response from cache.

BusyJay added a commit to BusyJay/dnsproxy that referenced this issue Feb 12, 2025
@ainar-g ainar-g added the needs investigation Should be reproduced and analysed label Feb 12, 2025
@EugeneOne1
Copy link
Member

Hello, @BusyJay, and thanks for the report. Could you please share some additional information about the failed query? We're interested in the following details:

  • the upstream responding with NXDOMAIN;
  • the domain requested;
  • an example of a dig/nslookup command that reproduces the behaviour.

@BusyJay
Copy link
Author

BusyJay commented Feb 14, 2025

The reason why I suspected this behavior was because I saw a lot of same failed query logs in a short time (50+ per second, 1m per day) that had the same response code ServFail and all of them had green check that indicates response are from cache.

But after further analysis, I find my assumption is not what's actually happening, though the assumption is still a valid bug. The real reason why logs are identical may be due to upstream always return ServFail. And because of optimistic refresh cache in the background, so all of those response seems from cache. And the reason why so many logs in a short time may due to no backoff in client application and adguardhome caches ServFail response.

The queries that will always return ServFail are querying https record of cmcos.albatross.10086.cn.

The topology of my network is devices -> linksys router -> dnsmasq -> adguardhome. After a lot of failed requests, dnsmasq stops accepting any dns queries for a while. I'm not sure whether the bad retries come from dnsmasq, linksys router or one of the devices. For now, I will add some sleep when response code is ServFail to my fork of adguardhome to get around this problem.

@EugeneOne1
Copy link
Member

@BusyJay, hello again. As for the original problem, we haven't been able to reproduce it. The only reason why new NXDOMAIN responses may not rewrite the cached ones is the lack of SOA records defining their TTL. Such responses aren't considered cacheable at all, so current behaviour is actually a cache poisoning mitigation.

FIY, SERVFAIL responses are cached for no more than 30 seconds, as per RFC 2308 Section 7.1, and we also have an issue about stale optimistic cache records (#318), so please consider upvoting it with "👍".

We'll close the issue for now, as it doesn't appear to be related to dnsproxy. Please consider opening another issue if the real problem appears to be within dnsproxy, and please add some steps for reproducing it to the report.

@EugeneOne1 EugeneOne1 closed this as not planned Won't fix, can't repro, duplicate, stale Feb 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs investigation Should be reproduced and analysed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants