Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Handling failure to wait for tablet types specified with --tablet_types_to_wait #17412

Open
ejortegau opened this issue Dec 19, 2024 · 7 comments
Labels

Comments

@ejortegau
Copy link
Contributor

ejortegau commented Dec 19, 2024

Question

Introduction

This is an RFC to discuss the behavior of vtgate's --tablet_types_to_wait flag. It starts by describing the current behavior, and moves towards discussing an issue we have experienced with it. Finally it describes proposed changes to address the issue.

The intention of the RFC is to gain information on whether other community members share concerns about or have experienced the issue, and whether they agree with the proposal to address it; or whether they have alternate proposals.

Current behavior.

When vtgate is started with --tablet_types_to_wait, during its Init() (here) a tablet gateway struct is created and then, a call to TabletGateway.WaitForTablets() is called on that struct. If the underlying work done by this function fails with a context deadline exceeded error, a warning is logged but the error is cleared. This means that vtgate's Init() is unaware that WaitForTablets() failed to find healthy tablets of the right types for all keyspaces/shards, and it proceeds to work normally.

Under normal circumstances, this is not a problem, because retrieving the list of Targets is fast. However, under some circumstances we have seen that this can fail. Particularly, during overload of the underlying topology service, the calls to it take too long. If the whole process exceeds the time specified with --gateway_initial_tablet_timeout, TabletGateway.WaitForTablets() to hit a context deadline exceeded. This is handled on its defer function to simply log a warning and clear the error. As a result, the vtgate's Init() is unaware that TabletGateway.WaitForTablets() actually failed to find healthy tablets of the right types for all keyspaces/shards, and it continues to start-up normally.

Notice we saw the above behavior during a topo overload, but there might be other situations leading to context deadline exceeded and therefore the same beavior (e.g. network issues).

Issues of the current behavior:

As describe above, if a vtgate fails to get all healthy tablets for all targets, it still joins service after waiting for --gateway_initial_tablet_timeout. As soon as such a vtgate receives a query for one of the shard-tablet types it has not yet gotten healthy tablets for, the query errors with something like Execute: target: <keyapace>.<shard>.<tablet_type>: no healthy tablet available for 'keyspace:"<keyspace>" shard:"<shard>" tablet_type:<tablet_type>', causing client-application visible errors. The issues persist until the vtgate eventually manages to get a healthy tablet of the right type - or is taken out of service.

This is only an issue during vtgate initialization, but not for already running vtgates.

Proposal

Adjusted proposal

During vtgate initialization, if a failure to fetch vttablets for all targets takes place and the error is a timeout/context deadline exceeded, vtgate retries until it succeeds. This would be implemented as some sort of retry loop around WaitForTablets in vtgate's Init.

Any other initialization errors will continue to be treated as they are today.

Original proposal below

Our proposal is that a vtgate that fails to get healthy tablets of the right types for keyspace/shards that are known to have tablets should not join service. This could be implemented in a number of ways, but we should be careful to distinghish two different scenarios:

  • A keyspace/shard has no tablets (e.g. the shard exists in the topology, but no tablets exist for it).
  • A keyspace/shard has tablets but the vtgate has not been able to get healthy ones during it's initialization.

In the first scenario, vtgate should be able to join service. A keyspace/shard with no tablets can be the result of a decommissioned keyspace/shard, for which all vttablets were removed but the topo record for the keyspace/shard was not deleted. In this case, joining service despite the failure to get healthy tablets does not lead to an issue (or at least, not any issue that was not already present on any pre-existing vtgates).

In the second scenario, vtgate should not start serving until it manages to get healthy tablets - even if that means hanging out forever. Otherwise, any queries it gets for the targets it's missing will fail.

For that, we propose that vtgates determine whether they need to wait for targets of a particular keyspace/shard by looking at an attribute in the topo record of the shard (let's temporarily call it has_tablets, but can be called something else). When a shard is created, the attribute will be set to false. It will only be set to true when a vttablet process is started up for that particular keyspace/shard. It will also be set to false when issuing DeleteTablet for the last tablet in the keyspace/shard.

With the above, we would suggest that vtgate's init waiting for tablets works as follows:

  1. Fetch the targets, filtering out the ones whose shards have has_tablets set to false.
  2. TabletGateway.WaitForTablets() would not clear the context deadline error so that vtgate's init knows when it failed to get all targets. If there are concerns with this behavior change, it could be controlled via a new, opt-in flag.

We look forward to your input.

@ejortegau ejortegau added Type: RFC Request For Comment Component: VTGate labels Dec 19, 2024
@deepthi
Copy link
Member

deepthi commented Dec 29, 2024

We'll need to think about the best way to handle this, but for starters I just want to note that if the topo is overloaded and GetTablets or GetTablet times out, there is no guarantee that GetShard won't timeout and land you in the same place.

@timvaillancourt
Copy link
Contributor

timvaillancourt commented Dec 29, 2024

For that, we propose that vtgates determine whether they need to wait for targets of a particular keyspace/shard by looking at an attribute in the topo record of the shard (let's temporarily call it has_tablets, but can be called something else).

As @deepthi alluded, this topo call to fetch the proposed boolean could also take a long time. I assume this call would "wait forever" with context.Background()?

When a shard is created, the attribute will be set to false. It will only be set to true when a vttablet process is started up for that particular keyspace/shard.

Do we need a new "field"? I was thinking we could check for len(resp.Tablets) == 0 using GetTablets. Or is the "empty shard" scenario you mention one where there ARE tablet records in the topo @ejortegau (but they are stopped)? Some more detail on what state is in the topo in the scenario might be helpful

@mattlord
Copy link
Contributor

@ejortegau is it not the case that you really want to (optionally) consider errors here to be fatal? https://github.com/slackhq/vitess/blob/ef248b36d2642777bd0b50003c202d53811b3622/go/vt/vtgate/tabletgateway.go#L171-L183

If you're having trouble communicating with the topo server — that being the underlying issue here AFAICT — then I don't see how topo server side changes are helpful. The topo server is the source of truth and if we cannot communicate with it to bootstrap the vtgate's cluster state / view of the world then it probably does make sense to abort (at least as an option) as I think that there are some assumptions in the component / code base that we're starting from an accurate view of the cluster and then modifying it as we go based on topo server watches and health check responses.

@ejortegau
Copy link
Contributor Author

Hi, all:

Thanks for your input. Please find below my comments/answers:

if the topo is overloaded and GetTablets or GetTablet times out, there is no guarantee that GetShard won't timeout and land you in the same place

That is a fair point. I guess I should clarify a bit more what specific failure mode was: we had topo overload in the cell topo clusters, but not in the global one, so calls to GetShard did work for us. Nevertheless, your point stands. However, if the proposed change was made (particularly, the part about not clearing the timeout error), a timeout to GetShard (which takes place inside srvtopo.FindAllTargetsAndKeyspaces(), called by TabletGateway.WaitForTablets()) would end up being propagated to Init and therefore would prevent the vtgate from joining service - which IMHO it should if it failed to get all relevant shards.

As @deepthi alluded, this topo call to fetch the proposed boolean could also take a long time. I assume this call would "wait forever" with context.Background()?

No, it would not need to wait forever. As mentioned on my reply to her above, a timeout there should also simply fail vtgate initialization imho.

you really want to (optionally) consider errors here to be fatal? https://github.com/slackhq/vitess/blob/ef248b36d2642777bd0b50003c202d53811b3622/go/vt/vtgate/tabletgateway.go#L171-L183

That's at least part of it, yes, hence why the proposal above states the following:

TabletGateway.WaitForTablets() would not clear the context deadline error so that vtgate's init knows when it failed to get all targets. If there are concerns with this behavior change, it could be controlled via a new, opt-in flag.

The rest of the proposal (extra attribute) is meant to safely do that differentiating created but empty shards from non empty shards for which we failed to wait for targets.

If you're having trouble communicating with the topo server — that being the underlying issue here AFAICT — then I don't see how topo server side changes are helpful

See my reply to @deepthi above, but in short: a failure to read the shard (and therefore the newly proposed attribute) would also lead to vtgate init failure.

The topo server is the source of truth and if we cannot communicate with it to bootstrap the vtgate's cluster state / view of the world then it probably does make sense to abort

This is 100% my view as well, hence why the proposal is to have vtgate initialization fail if things are messed up.

@deepthi
Copy link
Member

deepthi commented Jan 10, 2025

@ejortegau can you rework the proposal based on all the discussion? I think the main thing that needs to change is that there is no need to introduce a new topo field at all. If vtgate discovers 0 tablets without any timeout errors, it need not wait on tablets for that keyspace. If tablet discovery fails with a timeout, vtgate should keep retrying and wait until that succeeds.

@ejortegau
Copy link
Contributor Author

Hi, @deepthi . It has been updated. Please let me know if it looks good. Thanks!

@deepthi deepthi changed the title RFC: Handling failure to wait for tablet typs specified with --tablet_types_to_wait RFC: Handling failure to wait for tablet types specified with --tablet_types_to_wait Jan 23, 2025
@deepthi
Copy link
Member

deepthi commented Jan 23, 2025

@ejortegau lgtm. Please feel free to go ahead with the implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants