
A proposal to make bbstorage highly available #230

Open · lijunsong opened this issue Dec 9, 2024 · 4 comments

@lijunsong (Contributor)

I am proposing a way to make sharded storage highly available. This proposal has been deployed in an environment with tens of shards, and the availability of bbstorage has increased as expected.

Context

We've seen two types of production reliability issues in the buildbarn design:

  1. buildbarn's storage is a single point of failure even though it is sharded; all storage shards must be online to serve traffic.
  2. storage sharding and replication rely on client-side routing, but the gRPC configuration doesn't support changing the LB policy. For simple deployments, the lack of an LB policy is not an issue, but in a more customized architecture, the gRPC throughput from frontends/workers to storage is often throttled, causing slow downloads.

Note: issue 2 isn't directly related to HA, but as we try out different architectures, the lack of an LB policy makes the architecture I am proposing here less useful.

HA sharded storage

Buildbarn currently doesn't have a way to handle backend failures: when one backend fails, the action fails, which triggers a Bazel retry. The retry makes a build look like it's "hanging" until the backend connects to the frontend again. In other words, the availability of buildbarn depends on every single member of the shards.

Here, I am proposing a way to make sharded storage highly available: introduce a health-checked sharding strategy in which a shard is automatically disabled (or, to use the blobstore term, "drained") when it is considered "unavailable." The shard selection algorithm then automatically and stably picks the next available shard.

(Note that the goal here is not zero downtime, but to let the system recover automatically from a very short period of downtime.)

For those who need a bit more context on the current sharding mechanism: the key to understanding how sharding is done is https://github.com/buildbarn/bb-storage/blob/master/pkg/blobstore/sharding/sharding_blob_access.go#L35-L56, where:

  1. a blob digest is used to generate a sequence of random integers (each integer is random, but the sequence is stable for a given digest)
  2. each integer is used to index into a list of cumulative shard weights (https://github.com/buildbarn/bb-storage/blob/master/pkg/blobstore/sharding/weighted_shard_permuter.go)
  3. a shard is selected by testing whether the shard at that index is nil. A nil shard is marked drained, so selection continues with the next random integer.
    Because the random integer sequence is stable, when some shards are "drained", the next available shard is always picked stably (see the sketch right after this list).
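To make the selection loop concrete, here is a rough sketch in Go. It is not the actual bb-storage code (the real implementation lives in the files linked above); all names below are made up for illustration.

```go
// Illustrative sketch only, not the bb-storage implementation. It shows why
// shard selection is stable: the random sequence is seeded by the blob
// digest, so every client walks the same sequence and lands on the same
// fallback when some entries are drained (represented here as "").
package sharding

import "math/rand"

func pickShard(digestHash uint64, cumulativeWeights []uint32, shards []string) string {
	// Seeding with the digest hash makes the integer sequence deterministic
	// per blob, and identical across all clients.
	r := rand.New(rand.NewSource(int64(digestHash)))
	total := cumulativeWeights[len(cumulativeWeights)-1]
	for {
		// Draw the next value of the stable sequence and map it onto the
		// cumulative weight table to obtain a shard index.
		v := uint32(r.Int63()) % total
		index := 0
		for cumulativeWeights[index] <= v {
			index++
		}
		// An empty entry means the shard is drained: keep walking the same
		// sequence, so the next available shard is always picked stably.
		if shards[index] != "" {
			return shards[index]
		}
		// (If every shard were drained this loop would not terminate; the
		// sketch omits that guard for brevity.)
	}
}
```

For example, with three equally weighted shards and shard2 drained, every client deterministically falls back to the same remaining shard for a given digest.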

The new sharding mechanism

The new mechanism changes the last step: a periodic health check (with a period of single-digit seconds) marks unavailable shards as "drained", effectively making them nil in the shard list, so shard selection picks the next available shards (temporary shards).

When the unavailable shards come back, the shard selection algorithm finds the original shard numbers again and stops sending traffic to the temporary ones.
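A minimal sketch of what such an availability check could look like, assuming a background goroutine that probes each shard every few seconds and publishes a drained flag for the selection loop to consult. None of these types exist in bb-storage today; everything here is hypothetical.

```go
// Hypothetical availability checker for the proposed mechanism. It probes
// each shard periodically and records which shards are currently "drained",
// which the shard selection loop consults instead of a static nil entry.
package sharding

import (
	"context"
	"sync/atomic"
	"time"
)

type healthCheckedShards struct {
	addresses []string
	// drained[i] is true when shard i failed its last availability probe.
	drained []atomic.Bool
	// probe is the availability check, e.g. a gRPC health/capabilities call.
	probe func(ctx context.Context, address string) error
}

// run performs the periodic availability check (single-digit seconds).
func (h *healthCheckedShards) run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for i, addr := range h.addresses {
				probeCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
				err := h.probe(probeCtx, addr)
				cancel()
				// Mark the shard drained on failure; clear the flag once the
				// probe succeeds again, so traffic returns to the original shard.
				h.drained[i].Store(err != nil)
			}
		}
	}
}

// isDrained is what shard selection calls in place of the nil check.
func (h *healthCheckedShards) isDrained(index int) bool {
	return h.drained[index].Load()
}
```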

The new mechanism is simple but effective, though it is not without drawbacks.

Issue: Make consensus in routing clients

The sharding algorithm runs in the clients (frontends, workers, etc.). When we have hundreds of such clients, how do we reach consensus among them about which shards are gone?

It could be done with consensus algorithms at the cost of additional complexity, but in its simplest form, we don't try to reach consensus in our production setup. The goal is not zero downtime, but automatic recovery from a very short period of downtime. A few builds will fail, but newly started builds will always succeed (with some cache misses).
In production, what we observe is that:

  1. When any storage goes offline, gRPC is notified almost immediately, so within a few seconds all clients are effectively reading from/writing to the new shard. Cache misses do happen in this case.
  2. When storage comes back online, gRPC's exponential connection backoff can lead to an inconsistent view of shard availability among clients. When a precondition fails, Bazel sometimes retries and sometimes fails. IMO, the eventually consistent view is still better than the stuck builds we had before. (A sketch of one mitigation follows this list.)
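Regarding observation 2: one knob that bounds how long that inconsistent window can last is the client-side reconnect backoff. A sketch with grpc-go follows; the values are illustrative, not the ones we run in production.

```go
package storageclient

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
	"google.golang.org/grpc/credentials/insecure"
)

// dialShard caps the reconnect backoff so that, once a shard comes back,
// every client re-discovers it within MaxDelay rather than minutes later.
func dialShard(address string) (*grpc.ClientConn, error) {
	return grpc.Dial(address,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff: backoff.Config{
				BaseDelay:  1 * time.Second,
				Multiplier: 1.6,
				Jitter:     0.2,
				MaxDelay:   10 * time.Second,
			},
			MinConnectTimeout: 5 * time.Second,
		}),
	)
}
```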

Architectural Improvement to help reduce inconsistency when storage goes offline

Instead of the traditional deployment where every frontend and worker connects to every shard, the setup below can simplify ops (adding/draining/removing shards):

  1. Put shards behind a group of frontends (internal frontend)
  2. Let the internal frontend do the routing work
  3. Workers and customer-facing frontends connect to the internal frontend.
              ┌─────────────► shard1        
              │                             
      worker  ├──────────────►shard2        
              │                 .           
              └──────────────►shardN        

becomes

                    ┌─────────────► shard1        
                    │                             
  workers──►frontend├──────────────►shard2        
                    │                 .           
                    └──────────────►shardN  

This deployment brings many operational benefits that I won't go into in detail. The architecture allows a small number of frontends to handle a large amount of traffic via load balancing. The smaller the frontend group, the quicker consensus can form.

To use the new architecture, the gRPC connections from workers to the frontends must use L7 load balancing (instead of, say, the Kubernetes L4 load balancing).
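For reference, this is roughly what enabling client-side round-robin looks like with grpc-go, assuming the worker-to-frontend target resolves to multiple addresses (e.g. a headless Kubernetes service through the dns resolver). The hostname in the comment is made up.

```go
package storageclient

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialFrontends spreads individual RPCs across all resolved frontend
// addresses (L7 load balancing) instead of pinning all traffic to a single
// L4 connection.
func dialFrontends(target string) (*grpc.ClientConn, error) {
	// target could be something like
	// "dns:///bb-frontend-internal.example.svc.cluster.local:8980" (hypothetical).
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
}
```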

Upcoming changes to enable HA storage

This HA storage has been running internally for a while, and it works for small/medium-scale buildbarn deployments.

If this design is interesting to others, I'd like to send out pull requests to upstream these features:

  1. Enable gRPC round-robin policy via configuration
  2. Create a new blob access that checks the availability of each storage shard
  3. Add a health_check field to the sharding configuration; when health checking is enabled, use the new blob access.
@EdSchouten (Member) commented Dec 27, 2024

The retry makes a build look like it's "hanging" until the backend connects to the frontend again.

I'd like to call out that I think this is a big shortcoming of Bazel. Why can't it provide proper UI events when such failures occur?

Issue: Make consensus in routing clients

The sharding algorithm runs in the clients (frontends, workers, etc.). When we have hundreds of such clients, how do we reach consensus among them about which shards are gone?

It could be done with consensus algorithms at the cost of additional complexity, but in its simplest form, we don't try to reach consensus in our production setup. The goal is not zero downtime, but automatic recovery from a very short period of downtime. A few builds will fail, but newly started builds will always succeed (with some cache misses). In production, what we observe is that:

[...]

I think that an approach like this is completely unacceptable for any production workload. It means that if a storage node is reaching its performance ceiling and health checks start to fail with some random probability, the backends selected by clients also become random. So "newly started builds will always succeed" does not actually hold.

In fact, a policy like this may even lead to cascading failures: if clients start to use the wrong shard, cache hits will drop, causing more new data to be written. This increases load, which increases the probability of health check failures.

A better solution would be to have some kind of centralized process that does the health checking, and subsequently exposes which shards to use. But this raises the question: isn't this exactly what Kubernetes already does for us? I don't think it should be the goal of the Buildbarn project to reinvent features like that.

Just set up health checks on the Kubernetes side properly, and you can already achieve high availability.

Architectural Improvement to help reduce inconsistency when storage goes offline

Instead of the traditional deployment where every frontend and worker connects to every shard, the setup below can simplify ops (adding/draining/removing shards):

  1. Put shards behind a group of frontends (internal frontend)
  2. Let the internal frontend do the routing work
  3. Workers and customer-facing frontends connect to the internal frontend.

Note that this is already possible today, and many people operate setups like this. The reasons we don't do it as part of the example deployment are:

  • There is a slight performance overhead, as there is additional proxying of traffic.
  • The authentication policy on the frontends becomes a bit more complex, because the frontends need to accept both client and worker traffic.

But users are free to set this up themselves.

  1. Enable gRPC round-robin policy via configuration

I am fine with getting this feature added.

  1. Create a new blob access that checks the availability of each storage shard
  2. Add a health_check field to the sharding configuration; when health checking is enabled, use the new blob access.

I don't think we should add these.

@lijunsong (Contributor, Author)

Those concerns were actually considered. I didn't explain what this health check actually does in my proposal, so it may have caused some confusion.

Let me add something here.

It seems to me that buildbarn is built on the assumption that the underlying storage system is fault tolerant. But for many of us, disks, the OS, or a rack switch can fail at any time and for a prolonged period. Not everyone is working with stable physical infrastructure.

"Health check" may not be the precise term; "availability check" is. This check detects infrastructure failures, specifically the gRPC UNAVAILABLE error. Such an error occurs when k8s evicts pods for the many reasons I mentioned above. In these cases, one or more shards can be accidentally stopped, and k8s may not be able to reschedule the statefulset pods for a long time.

The check doesn’t care about high load at all. That’s not what this design is for.

Since you mentioned k8s can do this: what's the solution for this scenario? For N shards writing to local NVMe, deployed as statefulsets with PVs pinned to individual hosts, when one host is rebooted in the middle of the night and subsequently can't come back, what kind of k8s automation can make bbstorage available for Bazel?

@EdSchouten (Member) commented Dec 27, 2024

Using local NVMe is indeed the way to go. But instead of stateful sets, create n deployments. By using deployments, you're essentially telling Kubernetes that it's safe to launch replacements, even if there is no guarantee that the previously created pod has been terminated/removed. If you do that, Kubernetes is capable of keeping your storage pods healthy at all times.

Also get rid of pinning to individual hosts. Instead, add mirroring to your setup and disable LocalBlobAccess persistency. If you do that, you can always safely restart/upgrade half of your fleet without causing recently written data to get lost. So every time you push changes to production, you only push them to either the A or B nodes of your storage cluster. This will cause those nodes to lose their data, but that doesn't matter, as the data can be replicated back when accessed.

So if you use sharding and mirroring, you'll end up with 2n deployments, each having replicas: 1. You then have 2n services, each routing traffic to a single pod. Finally, you list the 2n hostnames of those services in your blobstore configuration.

@lijunsong (Contributor, Author)

We initially used this mirroring setup too, but with a statefulset.

We didn't use it for long because we were fighting high disk latency and the replication of extremely large blobs. The Helm chart isn't that nice for us to maintain either. Also, we're really tight on storage and machine CPU/memory, so we eventually chose a non-mirroring solution. I do like the way you use k8s deployments to lift the restriction, though: a clear tradeoff between extra replication and availability.

I am happy to keep this change in-house, since there is at least one public solution to this availability issue. I'll just do the gRPC round-robin policy.
