A proposal to make bbstorage highly available #230
Comments
I'd like to call out that I think this is a big shortcoming of Bazel. Why can't it provide proper UI events when such failures occur?
I think that an approach like this is completely unacceptable for any production workloads. It means that if a storage node is reaching its performance ceiling and health checks start to fail with some random probability, the backends selected by clients also become random. So "newly started builds will always succeed" does not actually hold. In fact, a policy like this may even lead to cascading failures: if clients start to use the wrong shard, cache hits will drop, causing more new data to be written. This increases load, which in turn increases the probability of health check failures. A better solution would be to have some kind of centralized process that does the health checking and subsequently exposes which shards to use. But this raises the question: isn't this exactly what Kubernetes already does for us? I don't think it should be the goal of the Buildbarn project to reinvent features like that. Just set up health checks on the Kubernetes side properly, and you can already achieve high availability.
Note that this is already possible today, and there are many people that operate setups like these. The reason we don't do it as part of the example deployment is:
But users are free to set this up themselves.
I am fine with getting this feature added.
I don't think we should add these.
Those cases were actually considered. I didn't say what this health check actually does in my proposal, so it may have caused some confusion. Let me add something here. It seems to me that Buildbarn is built on the assumption that the underlying storage system is fault tolerant. But for many out there, the disks/OS/rack switch can fail at any time and stay down for a prolonged period. Not everyone is working with stable physical infrastructure. "Health check" may not be the precise term; "availability check" is. It checks for infrastructure failures, specifically the gRPC UNAVAILABLE error. Such an error occurs when k8s evicts pods for the many reasons I mentioned above. In those cases, one or many shards can be accidentally stopped, and k8s can't reschedule the statefulset pods for a long time. The check doesn't care about high load at all; that's not what this design is for. Since you mentioned k8s can do this: what's the solution for this scenario? For N shards writing to local NVMe, deployed as statefulsets with PVs pinned to individual hosts, when one host is rebooted in the middle of the night and subsequently can't come back, what kind of k8s automation can make bbstorage available for Bazel?
Using local NVMe is indeed the way to go. But instead of statefulsets, create n deployments. By using deployments, you're essentially telling Kubernetes that it's safe to launch replacements, even if there is no guarantee that the previously created pod has been terminated/removed. If you do that, Kubernetes is capable of keeping your storage pods healthy at all times. Also, get rid of pinning to individual hosts. Instead, add mirroring to your setup and disable LocalBlobAccess persistency. If you do that, you can always safely restart/upgrade half of your fleet without causing recently written data to get lost. So every time you push changes to production, you only push them to either the A or B nodes of your storage cluster. This will cause those nodes to lose their data, but that doesn't matter, as the data can be replicated back when accessed. So if you use sharding and mirroring, you'll end up having 2n deployments, each having …
We initially used this mirroring setup too, but with a statefulset. We didn't use it for long because we were fighting high disk latency and the replication of extremely large blobs. The Helm chart isn't that nice for us to maintain either. Also, we're really tight on storage and machine CPU/memory, so we eventually chose a non-mirroring solution. I do like the way you use k8s deployments to lift that restriction, though; it's a clear tradeoff between extra replication and availability. I am happy to keep this change in-house, since there is at least one solution to this availability issue available to the public. I'll just use the gRPC round-robin policy.
I am proposing a way to make sharded storage highly available. This proposal has been deployed in an environment with tens of shards, and the availability of bbstorage has increased as expected.
Context
We've seen two types of production reliability issues in the Buildbarn design:
Note: issue 2 isn't directly related to HA, but as we're trying a different architecture, the lack of an LB policy makes the architecture I am proposing here less useful.
HA sharded storage
Buildbarn currently doesn't have a way to handle backend failures: when one backend fails, the action fails, which triggers a Bazel retry. The retry makes a build look like it's "hanging" until the backend connects to the frontend again. In other words, the availability of Buildbarn depends on every single member of the shard set.
Here, I am proposing a way to make sharded storage highly available: introduce a health-checked sharding strategy, where a shard is automatically disabled (or, to use the blobstore term, "drained") when it's considered unavailable. The shard selection algorithm will then automatically and deterministically pick the next available shard.
(Note that the goal here is not zero downtime, but to let the system recover automatically from a short period of downtime.)
For those who need a bit more context on the current sharding mechanism: the key to understanding how sharding is done is https://github.com/buildbarn/bb-storage/blob/master/pkg/blobstore/sharding/sharding_blob_access.go#L35-L56, where:
Because the random integer sequence is stable, when some shards are "drained", the next available shard will always be picked deterministically.
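To make this concrete, here is a minimal Go sketch of the idea, not the actual bb-storage code linked above: shard weights are omitted, and the `Shard` type and `pickShard` function are illustrative names. The digest alone seeds the random sequence, so every client walks the same candidate order and agrees on which shard takes over for a drained one.

```go
package sharding

import (
	"hash/fnv"
	"math/rand"
)

// Shard is a stand-in for a storage backend; a nil entry means the shard
// is drained (unavailable).
type Shard interface{}

// pickShard deterministically maps a blob digest to a shard index,
// skipping drained (nil) entries. It returns -1 if every shard is drained.
func pickShard(digest []byte, shards []Shard) int {
	anyAvailable := false
	for _, s := range shards {
		if s != nil {
			anyAvailable = true
			break
		}
	}
	if !anyAvailable {
		return -1
	}

	// Seed a PRNG with a hash of the digest only. Every client computes
	// the same seed, so every client generates the same stable sequence
	// of candidate indices.
	h := fnv.New64a()
	h.Write(digest)
	r := rand.New(rand.NewSource(int64(h.Sum64())))
	for {
		i := r.Intn(len(shards))
		if shards[i] != nil {
			return i
		}
		// Drained shard: continue along the stable sequence to the next
		// available candidate.
	}
}
```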
The new sharding mechanism
The new mechanism changes the last step: it performs a periodic health check (on the order of single-digit seconds) and marks unavailable shards as "drained", effectively making them nil in the shard list, so shard selection picks the next available shards (temporary shards).
When the unavailable shards come back, the shard selection algorithm resolves to the original shard number again and stops sending traffic to the temporary ones.
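A rough sketch of how the availability check could drive that drain/restore cycle, reusing the `Shard` type and `pickShard` function from the previous sketch; the `probe` callback is a hypothetical hook that fails only on infrastructure errors such as gRPC UNAVAILABLE, not on high load:

```go
package sharding

import (
	"context"
	"sync"
	"time"
)

// healthCheckedShards keeps a drained/available view of the shard list.
type healthCheckedShards struct {
	mu       sync.RWMutex
	backends []Shard // the configured backends, never modified
	active   []Shard // backends[i], or nil while shard i is drained
}

// run probes every backend at the given interval and updates the active
// list accordingly.
func (h *healthCheckedShards) run(ctx context.Context, interval time.Duration,
	probe func(context.Context, Shard) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for i, b := range h.backends {
				err := probe(ctx, b)
				h.mu.Lock()
				if err != nil {
					h.active[i] = nil // drain: pickShard skips this entry
				} else {
					h.active[i] = b // shard is back: restore original routing
				}
				h.mu.Unlock()
			}
		}
	}
}

// snapshot returns the current shard view to pass to pickShard.
func (h *healthCheckedShards) snapshot() []Shard {
	h.mu.RLock()
	defer h.mu.RUnlock()
	out := make([]Shard, len(h.active))
	copy(out, h.active)
	return out
}
```

Because each client runs the same check independently and uses the same deterministic selection, their views converge within roughly one probe interval of each other, which is why the goal is automatic recovery from short downtime rather than zero downtime.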
The new mechanism is simple but effective, though it's not without drawbacks.
Issue: reaching consensus among routing clients
The sharding algorithm runs in the clients (frontends, workers, etc.). When we have hundreds of such clients, how do we reach consensus among them about which shards are gone?
It could be done with consensus algorithms at the cost of additional complexity, but in its simplest form we don't try to reach consensus in our production setup. The goal is not zero downtime, but to recover automatically from a short period of downtime. A few builds will fail, but newly started builds will always succeed (with some cache misses).
In production, what we observe is that:
Architectural Improvement to help reduce inconsistency when storage goes offline
Instead of the traditional deployment where every frontend and worker connects to every shard, the setup below can simplify ops (adding/draining/removing shards):
becomes
This deployment brings many operational benefits that I won't go into in detail. The architecture allows a small number of frontends to handle a large amount of traffic via load balancing, and the smaller the frontend groups, the quicker consensus can form.
To use the new architecture, the gRPC connections from workers to frontends must use L7 load balancing (instead of, say, Kubernetes' L4 load balancing).
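As an illustration of what that means on the client side (a sketch only; the headless-service name and port are placeholders, and a production setup would use real credentials), gRPC's built-in DNS resolver plus the round_robin policy gives per-RPC (L7) balancing across all frontends in a group:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// "dns:///" resolves all addresses behind the (placeholder) headless
	// service, and round_robin spreads individual RPCs across them (L7),
	// instead of pinning all traffic to a single connection (L4).
	conn, err := grpc.Dial(
		"dns:///bb-frontend.example.svc.cluster.local:8980",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("failed to dial frontends: %v", err)
	}
	defer conn.Close()
	// conn can now back the CAS/ByteStream clients used by workers.
}
```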
Upcoming changes to enable HA storage
This HA storage has been running internally for a while, and it works for small/medium-scale Buildbarn deployments.
If this design is interesting to others, I'd like to send out pull requests to upstream these features:
Add a health_check option in the sharding configuration; when health checking is enabled, use the new blob access.