retrospective on download outage #2264
Comments
An official status page for NodeJS would likely be helpful in situations like this for a couple of reasons:
Created #2265 to look in more detail at the process of creating a status page. |
No one has been added to @nodejs/build-infra in literally years. 5 years or something like that? We need to break that choke-point. It's OK if there are only 4 people in the group, but it's got to be 4 highly-active people. Right now, my guess is that we have 1 highly-active person, 2 moderately-active people, and 1 effectively-inactive person. I would like us to think about who we might trust with this level of access. I'm surprised we haven't added @sam-github to the list, to be honest. |
This, of course, also raises the question of whether we are taking a risky approach by relying entirely on volunteers. Assuming the Foundation would be willing to pay to have someone on staff doing the most critical work, should we consider that? (We'll have to carefully define the most critical work.) |
On-call schedule? No, wait, hear me out! The idea is that we may not trust (for example) me with the keys to the kingdom, but maybe we trust me with the cell phone numbers of the people who do have those keys. And we trust me to not abuse this and to only notify people in a real crisis. Maybe we can have someone on-call to figure out who is in a not-terrible time zone and try to get them on the phone if we have an outage like this one. I'm not sold on this being a great idea, but I'm trying to get all the ideas out anyway. |
I think the crux of the issue with build infra is coverage: we have 1 US, 2 Europe and 1 AUS, which gives us decent coverage of time zones, but the European members are often harder to contact. Adding @sam-github would be a good start but still leaves a big void in the middle of the time zones. Obviously this is tricky due to the security concerns around the level of access this role gives. |
Having a proper system for an emergency contact makes sense, either through an on-call system of trusted volunteers (build-infra) or a paid position that comes with the responsibility of being okay with being woken up in the middle of the night if the world falls apart. |
@MattIPv4, it may not be clear from the project activity, https://github.com/nodejs/build/graphs/contributors?from=2020-01-08&to=2020-04-06&type=c (is there an equivalent view for comments on issues and PRs? I wish there was, but it wouldn't substantially change that view), but the top 3 folks at that link are developers who primarily maintain Jenkins and CI, and the 4th (Rod) is an expert, but only 1 person. We don't lack a system for making emergency contacts, IMO; we lack people to contact. There has been talk of getting paid infra support in, and this might change that conversation. Cloudflare experts are few and far between in the Node.js project, or at least the number of them volunteering to join the build-wg is approximately zero. |
Totally makes sense, I intended for my comment to imply that if an on-call system is used, part of that is ensuring there are folks on the other end to call that can actually help. |
I should also say that I’m happy to help in any way I can (eg. getting #2265 going), though I would say I’m not an expert on infra. More than happy to help connect y’all with folks at DO or Cf if you think there’s anything specific we can help with. :) |
Talking about choke points is fine as long as you can prove that something would be different if we added more people, and in this case nothing would be different. Even if I was awake in the middle of the night watching this play out I have no idea what I would have done to "rectify" it. At the moment it looks like a Cloudflare problem that's out of our control. I know the feeling of impotence is frustrating, but just splashing out more access to more people doesn't necessarily improve the situation. We had this same conversation when we had a disk-full problem ("I can log in to nodejs.org but can't do X"), but even if more people had access, it's not clear that those people would know what to do to fix the problem. There's (1) access but there's also (2) knowledge of the systems involved and (3) understanding of how it all fits together and how to not break things even further. There's a lot to break and there's a lot at stake. Let's not try and make this look simpler than it is, you're just going to make it harder for those of us who carry the bulk of the burden because the blame gets focused like a laser -- "but it's so simple, you're at fault because you [haven't done X | did Y]". |
My assessment so far is written up @ #2263 (comment) Reproducing the most interesting part of that: What can we do better next time? 🤷♂ that graph looks to me like it's mostly out of our control and this is a CF hiccup that they haven't acknowledged on their status page. I'll do some more digging but I'm exhausting all but my most outlandish theories. This does bring into focus our edge cache invalidation process which hasn't evolved since we've pushed to full CF caching of our downloads. There's a lot of room for improvement there and options for replacing the process with an entirely better setup. It'll just take very careful work. |
I wouldn't say "nothing would be different". Having someone authoritative saying things every few hours along the lines of "We're aware of the problem and actively looking into it" and/or "Looks like a Cloudflare problem that's out of our control, but we'll provide more information when we have it" is a huge help, even if the outage lasts exactly as long as it would otherwise. These things are squishy, but I'd say that the actual outage was only 20% of the problem. A squishy 80% of the problem was not being able to communicate anything meaningful to end users. That looks bad, increases anxiety, and prompts FUD in places like reddit and Twitter. I appreciate what you're saying, which (if I'm understanding correctly) is that finding qualified people and (perhaps especially) taking the necessary time and effort to do knowledge transfer is vastly harder than my original comments imply. And then, of course, there's also the general issue of trust. It's not enough for someone to be qualified and trained. They need to be trusted as well. We've talked about how to improve our bus-factor more than once in the past, so perhaps step 1 is for me to go back through old email threads and meeting notes before saying much more about it. |
Cloudflare have responded to the support request that I think @mhdawson submitted. It's a fairly generic response saying that everything seems fine on their end, but does include this:
... continued with copypasta and links to help articles. I don't think there's anything here that would explain such an anomaly. But, as a first course of action out of this, we probably need to do a systematic review of all of the path configurations in Cloudflare and in nginx that control expiry of our resources. Our current practices mean that we can say with a high degree of certainty which resources in |
|
I've been investigating all the paths that we have to requesting a cache purge from Cloudflare to see if there's a way we could have triggered an ongoing purge,purge,purge,purge. I think I've found one, but I can't use it to fully explain that graph. Background The way we do a purge is that we have a protected script on our main server that can receive a signal from a queue that there's a request to purge the cache. If it finds that signal it makes the API call. This queue can be modified by almost any user on the server, so we have various other scripts that add requests to the queue. We do this so we can keep the API call and its keys protected and don't have to expose those to whatever script & user might need to make that call. The queue is just a temp file that we The processes that can trigger a purge are:
Any of these can trigger a full cache purge. (Yes this is not optimal, but making it optimal is going to take very careful work). Late last year, in an attempt to improve security, I introduced a cron job on the server that changes files in /download/ to So, now with old /download/ files being locked in place, they can't be overwritten by anyone unless they have root access. But, it still is possible to queue assets for auto-promotion (and manual-promotion!) that belong to old releases. e.g. someone could rebuild v10.0.0 today and try and promote that. Thankfully that's not possible, however the scripts don't cleanly handle this case. The main promote script @ https://github.com/nodejs/build/blob/050fb9733697acb3dd5bb53891c2818e2c335bab/ansible/www-standalone/tools/promote/_promote.sh has an in-built assumption that it can both copy the files from staging to dist and then remove from staging. It's run as user The bug Here's what happened:
Results This pattern could explain a purge,purge,purge,purge but it can't explain the specific window of time where we were experiencing trouble. I don't know how to get from this bug to that behaviour, but it's a partial explanation at least. Fix I've fixed the immediate problem for now by removing those newly staged v10.20.0-rc.1 files and have rerun promotion. It ran fine, and we've now got rc.2 and rc.3, which were lingering in staging waiting for attention, plus some nightlies and v8-canary files. My first pass at addressing this problem is to make the promotion process more robust. See changes specifically @ #2269 in https://github.com/nodejs/build/pull/2269/files#diff-03699a6e81ab58dc8fa9a218932c0588 which do a These changes are live on the server but review would be appreciated. This doesn't fully address the problems here. It's still going to error if it has to promote new files for old builds and that will lock up all promotion for that release type, but presumably that'll be noticed by someone who will complain about their builds not appearing. What needs to be done is a more robust "oh, this is not possible, I'm going to skip this one and log and/or email an error and continue to the next one". |
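As an illustration of the "skip this one, log an error and continue to the next one" behaviour described above, here is a minimal Python sketch. It is not the project's `_promote.sh` (which is a shell script); the function name `promote_all`, the one-directory-per-build layout, and the choice to treat any `OSError` as "cannot promote" are all assumptions made for the example.

```python
from __future__ import annotations

import logging
import shutil
from pathlib import Path

log = logging.getLogger("promote")


def promote_all(staging: Path, dist: Path) -> tuple[list[str], list[str]]:
    """Promote each staged build directory; a failure skips that build only."""
    promoted: list[str] = []
    skipped: list[str] = []
    for src in sorted(p for p in staging.iterdir() if p.is_dir()):
        dest = dist / src.name
        try:
            # Copy first, then clear staging. If the destination is locked
            # (for example, read-only files from an old release), the copy
            # raises an OSError and the staged files are left in place.
            shutil.copytree(src, dest, dirs_exist_ok=True)
            shutil.rmtree(src)
        except OSError as exc:
            # Hypothetical policy: record the problem and move on instead
            # of aborting every other promotion of this release type.
            log.error("cannot promote %s: %s", src.name, exc)
            skipped.append(src.name)
            continue
        promoted.append(src.name)
    return promoted, skipped
```

The returned `skipped` list is where an email or other alert could be hooked in, so that a stuck build is reported rather than silently blocking the queue.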
To me, it seems like this work is something that really should be done, so that all the versioned files (eg
What's the timeline there? Would this have immediately started on the 3rd? Was it late in the day? Could it be that there's been a lower cache rate all weekend but it wasn't until Monday morning when Europe woke up that load increased enough to trip things up? |
Yes, I'm not going to argue that, but it's "work" and has to be "done" by someone with enough access and know-how about how all of these systems work. Let's please put that aside; it's really not helpful to dwell on here aside from noting that it needs to be done. I, or someone else, will open an issue to explore the options ahead here, including an audit of what we have in place now.
Immediately I believe. The dates on those I don't have access to older data from the Cloudflare API that would show us that timeframe (see graph above). We do have all logs stored in GCP, but they're massive and filtering through them ourselves, calculating cache % around that time would be a massive job. I don't think that's likely to get done.
Yes, very good point, day of the week is probably a key component. Our traffic patterns are US week-day heavy (very). Not so much Europe. (See how increasingly jagged our metrics have become over time due to this effect: https://nodejs.org/metrics/). So it could be as simple as this putting stress on our primary server but not enough to push it over the edge until Monday US rolls around and boom we hit our same old limits. Maybe that's enough to explain it. Aside: https://nodejs.org/metrics/ is interesting if you look at the far right of the graphs - they're still running on the server logs but don't account for Cloudflare logs anymore (WIP), but they show a big spike as traffic is slipping through from Cloudflare. That demonstrates the effect in my graph above with more and more traffic getting bypassed. |
Hm, I would probably have expected this issue to surface on Friday then, as the US wakes up and goes to work. I think it's probably still likely that this is the (only) source of the issue and it just took a while to reach the breaking point. |
It seems the bottom line is, this is a simple service managed by volunteers, and it's difficult and potentially unreasonable to run this service with the availability/reliability people (wrongly or rightly) expect with those volunteers. I've seen mention of introducing someone paid to manage infra, but has there been any consideration of not hosting these files internally/manually, but rather having an agreement with an established provider? The obvious one to me would be GitHub; after all, they built one of the tools which was heavily affected, and they're charging for it. They're already talking now about implementing a cache, but maybe a mirror, or fully taking over, would make more sense. I've no idea how possible or complex that would be to do, but wanted to suggest it, and at the very least I think it'd be worth getting open comms if there isn't already. |
I can confirm that's what happened
Thank you for digging in @rvagg |
Shared this on Slack earlier -- I've gone through the whole of nodejs.org/download/ and categorised what's immutable & can be set to a very long cache life, and what needs to have a short life. https://docs.google.com/spreadsheets/d/1a4ZOeHxuAXejcp0q5n753FVs0OLMx-p8Ers1aMEwFN4/edit#gid=0 I'm hoping that this will help enable the changes needed to set much longer caching on assets and to avoid purging the entire cache for each new asset promotion. |
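To make the immutable versus short-lived split concrete, here is a hedged Python sketch of a path classifier that returns a `Cache-Control` value. The rules and TTL values below are illustrative assumptions, not taken from the spreadsheet: files inside a versioned directory are treated as immutable, while directory listings, `index.*` files and anything under a `latest` alias get a short life.

```python
import re

LONG_TTL = 60 * 60 * 24 * 365  # one year; assumed value
SHORT_TTL = 60 * 5             # five minutes; assumed value

# Matches a versioned directory segment such as /v10.20.0/ or
# /v14.0.0-nightly20200408c849f2d4f8/ anywhere in the path.
_VERSION_DIR = re.compile(r"/v\d+\.\d+\.\d+[^/]*/")


def cache_control(path: str) -> str:
    """Return a Cache-Control header value for a request path."""
    name = path.rsplit("/", 1)[-1]
    mutable = (
        path.endswith("/")                # directory listings change
        or name.startswith("index.")      # index.json, index.tab, ...
        or "/latest" in path              # "latest" aliases get re-pointed
        or not _VERSION_DIR.search(path)  # anything not under a version
    )
    if mutable:
        return f"public, max-age={SHORT_TTL}"
    return f"public, max-age={LONG_TTL}, immutable"
```

For example, `cache_control("/download/release/index.json")` falls into the short-lived group, while a tarball under `/download/release/v10.20.0/` gets the long-lived value. A real policy would need the exceptions recorded in the spreadsheet.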
Still trying to make this timeline fit with the events I laid out above and I can't, quite. This graph shows bytes downloaded off the main server. It's lumped per day so a bit hard to pull apart but you can see our normal weekly pattern and then the spike really hits on the 6th. Before that it seems quite normal. A trigger in the middle of the 3rd doesn't match that graph, nor the Cloudflare data previously posted. I think we know the main areas for improvement, and I feel fairly confident I've uncovered a bug that needs to be properly fixed (for now it's patched and is workable but not ideal going forward). But I'm not confident asserting that a particular sequence led to what we saw go down during this incident. |
Another thing that doesn't seem explainable by the bug on our side is that the incident resolved itself without our intervention. |
Something we probably could have done during this was to put a banner on the website. |
If StatusPage is implemented, they provide a nice little widget that could be added to the site, which when there is an incident on the status page shows a popup on the website on every page where the widget script is included. |
Interestingly, it has been known since October 2019 that GitHub Actions has a fragile dependency on Node.js download availability. The same was known for Azure in the same timeframe, presumably from the same incident. I found these as top links while trying to find out more about which downstream consumers are so impacted by these outages, expecting to find issues from the last couple of days, not last year. |
7-day and 30-day graphs from the primary www backend from DO just now. Anomaly is clear, but we've pushed bandwidth and load way down since taking the measures to slow down possible purging rate and plug possible holes. CPU usage in the 30 days dropping in April is a bit weird, no idea what that might be. |
Interesting now to see how often we purge cdn now that we have logging in place:
Plain "promote" are nightlies, v8-canaries and a few RCs flowing in. The ones with "resha_release" are release builds, so that would be 10.20.0 and the 12.x that went out this week. "build-site" are triggered either by updates to nodejs.org (release blog entries added and maybe other tweaks by that team) and auto checks that the site was built after the last release was made. Still too often though. We're not maximising our Cloudflare potential. Re the "promote" entries which are the bulk of these, I think that's mainly caused by the different speeds at which builds get pumped out. If you look at a nightly, e.g. https://nodejs.org/download/nightly/v14.0.0-nightly20200408c849f2d4f8/ and scan the timestamps, you can see that they'd be dropping in to staging over the course of ~1.5h, and we now have the cron job checking at 30 minute intervals (#2268). v8-canary builds are even worse, spanning ~2.5h, so potentially 5 purges while they dribble in each day. A possible solution to this might be for auto-promote jobs to have a "debounce" mechanism that checks for last-drop time and waits for last+30m (or some number) before taking action rather than just taking action when something's there. It might get frustrating for people building RCs and test builds, but they're relatively infrequent and maybe we can special-case those. |
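The "debounce" idea above (wait until last-drop + 30 minutes before acting) could be sketched as follows. This is a Python illustration only; `should_promote`, its parameters, and the way drop times are gathered are hypothetical and not part of the existing promotion scripts.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

QUIET_PERIOD = timedelta(minutes=30)  # the "last+30m" window; tunable


def should_promote(
    drop_times: Iterable[datetime],
    now: datetime,
    quiet: timedelta = QUIET_PERIOD,
) -> bool:
    """Return True only once staging has been quiet for `quiet`.

    `drop_times` are the times at which files landed in staging, for
    example their modification times (an assumption for this sketch).
    """
    last_drop: Optional[datetime] = max(drop_times, default=None)
    if last_drop is None:
        return False  # nothing staged, so nothing to promote or purge
    return now - last_drop >= quiet
```

With a cron job checking every 30 minutes, a nightly whose files dribble in over ~1.5h would then be promoted, and the cache purged, once after the final file lands instead of on each run. RCs and test builds could bypass the check by passing a zero `quiet` value.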
My completely honest $0.02 would be that the whole idea of purging that cache on promotion should be scrapped. The only things changing are the index files, everything else is new so there isn't even a cache to clear there. It seems the entire nodejs.org cache gets purged just so that a couple of index files get updated a bit sooner? I personally think that there should just be a slightly shorter cache life set on those, with a really long cache life on everything else. Sure, they might not be completely up-to-date every minute of every day, but they'll update themselves once the cache life set on them invalidates. Edit: Certain promotions (anything going into latest), as well as site builds still need to purge. However, these should really use targeted cache purges with a list of specific assets given, so that the entire cache for everything else that remained unchanged isn't affected. |
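A targeted purge as suggested above amounts to sending Cloudflare a list of specific URLs rather than a purge-everything request. The Python sketch below only builds the JSON request bodies and makes no network call. It assumes the `files` field of Cloudflare's zone `purge_cache` endpoint; the per-request batch size of 30 is also an assumption that should be checked against Cloudflare's current documentation.

```python
import json
from typing import Iterator, Sequence

BASE_URL = "https://nodejs.org"
MAX_FILES_PER_CALL = 30  # assumed per-request limit; verify before use


def purge_payloads(
    paths: Sequence[str], batch: int = MAX_FILES_PER_CALL
) -> Iterator[str]:
    """Yield JSON bodies that purge only the given paths, in batches."""
    # dict.fromkeys de-duplicates while preserving the original order.
    urls = [BASE_URL + p for p in dict.fromkeys(paths)]
    for i in range(0, len(urls), batch):
        yield json.dumps({"files": urls[i:i + batch]})
```

A promotion into `latest` would pass just the paths it changed (for instance the index files and the affected `latest` entries), leaving every other cached asset untouched.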
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made. |
For picking up once the current issue is addressed and behind us, a retrospective to figure out how to make it better next time would be a good thing, I think.
Refs: nodejs/node#32683