RFC: Handling failure to wait for tablet types specified with --tablet_types_to_wait
#17412
Comments
We'll need to think about the best way to handle this, but for starters I just want to note that if the topo is overloaded and |
As @deepthi alluded, this topo call to fetch the proposed boolean could also take a long time. I assume this call would "wait forever" with
Do we need a new "field"? I was thinking we could check for |
@ejortegau is it not the case that you really want to (optionally) consider errors here to be fatal? https://github.com/slackhq/vitess/blob/ef248b36d2642777bd0b50003c202d53811b3622/go/vt/vtgate/tabletgateway.go#L171-L183 If you're having trouble communicating with the topo server — that being the underlying issue here AFAICT — then I don't see how topo server side changes are helpful. The topo server is the source of truth and if we cannot communicate with it to bootstrap the vtgate's cluster state / view of the world then it probably does make sense to abort (at least as an option) as I think that there are some assumptions in the component / code base that we're starting from an accurate view of the cluster and then modifying it as we go based on topo server watches and health check responses. |
Hi, all: Thanks for your input. Please find below my comments/answers:
That is a fair point. I guess I should clarify a bit more what specific failure mode was: we had topo overload in the cell topo clusters, but not in the global one, so calls to
No, it would not need to wait forever. As mentioned in my reply to her above, a timeout there should also simply fail vtgate initialization, imho.
That's at least part of it, yes, hence why the proposal above states the following:
The rest of the proposal (the extra attribute) is meant to do that safely, by differentiating created-but-empty shards from non-empty shards for which we failed to wait for targets.
See my reply to @deepthi above, but in short: a failure to read the shard (and therefore the newly proposed attribute) would also lead to vtgate init failure.
This is 100% my view as well, hence why the proposal is to have vtgate initialization fail if things are messed up. |
@ejortegau can you rework the proposal based on all the discussion? I think the main thing that needs to change is that there is no need to introduce a new topo field at all. If vtgate discovers 0 tablets without any timeout errors, it need not wait on tablets for that keyspace. If tablet discovery fails with a timeout, vtgate should keep retrying and wait until that succeeds. |
Hi, @deepthi . It has been updated. Please let me know if it looks good. Thanks! |
@ejortegau lgtm. Please feel free to go ahead with the implementation. |
Question
Introduction
This is an RFC to discuss the behavior of `vtgate`'s `--tablet_types_to_wait` flag. It starts by describing the current behavior, and moves towards discussing an issue we have experienced with it. Finally, it describes proposed changes to address the issue.

The intention of the RFC is to gain information on whether other community members share concerns about or have experienced the issue, and whether they agree with the proposal to address it, or whether they have alternate proposals.
Current behavior.
When `vtgate` is started with `--tablet_types_to_wait`, a tablet gateway struct is created during its `Init()`, and then `TabletGateway.WaitForTablets()` is called on that struct. If the underlying work done by this function fails with a context deadline exceeded error, a warning is logged but the error is cleared. This means that vtgate's `Init()` is unaware that `WaitForTablets()` failed to find healthy tablets of the right types for all keyspaces/shards, and it proceeds to work normally.

Under normal circumstances, this is not a problem, because retrieving the list of Targets is fast. However, we have seen that this can fail under some circumstances. In particular, when the underlying topology service is overloaded, calls to it take too long. If the whole process exceeds the time specified with `--gateway_initial_tablet_timeout`, `TabletGateway.WaitForTablets()` hits a context deadline exceeded error. Its `defer` function handles this by simply logging a warning and clearing the error. As a result, the `vtgate`'s `Init()` is unaware that `TabletGateway.WaitForTablets()` actually failed to find healthy tablets of the right types for all keyspaces/shards, and it continues to start up normally.

Note that we saw the above behavior during a topo overload, but other situations (e.g. network issues) might also lead to context deadline exceeded and therefore to the same behavior.
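To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the Vitess code: `waitForAllTargets`, `waitForTablets`, and `simulateInit` are hypothetical stand-ins, and the real function takes different arguments. It only illustrates how a deferred function that clears a deadline error hides the failure from the caller.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// waitForAllTargets is a hypothetical stand-in for the underlying wait. It
// blocks until every target has a healthy tablet (ready is closed) or until
// ctx expires.
func waitForAllTargets(ctx context.Context, ready <-chan struct{}) error {
	select {
	case <-ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// waitForTablets mimics the behavior described above: a deferred function
// downgrades a deadline-exceeded error to a warning and clears it.
func waitForTablets(ctx context.Context, timeout time.Duration, ready <-chan struct{}) (err error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	defer func() {
		if errors.Is(err, context.DeadlineExceeded) {
			log.Printf("timeout waiting for healthy tablets, may be in degraded mode")
			err = nil // the caller can no longer tell that the wait failed
		}
	}()

	return waitForAllTargets(ctx, ready)
}

// simulateInit plays the role of vtgate's Init(): it only sees the error
// value that waitForTablets returns.
func simulateInit() error {
	never := make(chan struct{}) // simulates an overloaded topo: never ready
	return waitForTablets(context.Background(), 50*time.Millisecond, never)
}

func main() {
	fmt.Println("error seen by Init:", simulateInit()) // prints "error seen by Init: <nil>"
}
```

Even though no target ever became ready, the caller receives a nil error, which is why initialization proceeds as if the wait had succeeded.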
Issues of the current behavior:
As described above, if a `vtgate` fails to get all healthy tablets for all targets, it still joins service after waiting for `--gateway_initial_tablet_timeout`. As soon as such a `vtgate` receives a query for one of the shard-tablet types it has not yet gotten healthy tablets for, the query errors with something like `Execute: target: <keyspace>.<shard>.<tablet_type>: no healthy tablet available for 'keyspace:"<keyspace>" shard:"<shard>" tablet_type:<tablet_type>'`, causing client-application visible errors. The issues persist until the `vtgate` eventually manages to get a healthy tablet of the right type, or is taken out of service.

This is only an issue during vtgate initialization, not for already running vtgates.
Proposal
Adjusted proposal
During `vtgate` initialization, if fetching `vttablet`s for all targets fails and the error is a timeout/context deadline exceeded, `vtgate` retries until it succeeds. This would be implemented as some sort of retry loop around WaitForTablets in vtgate's Init.

Any other initialization errors will continue to be treated as they are today.
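As a rough illustration of what such a retry loop could look like, here is a self-contained sketch. It is only a sketch under stated assumptions: `waitFunc`, `waitWithRetry`, `demo`, the per-attempt timeout, and the backoff are all hypothetical, and it assumes a variant of the wait function that returns the deadline error instead of clearing it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitFunc is a hypothetical stand-in for a WaitForTablets variant that
// returns the deadline error instead of clearing it.
type waitFunc func(ctx context.Context) error

// waitWithRetry keeps calling wait until it succeeds. Only timeouts are
// retried; any other error is returned unchanged.
func waitWithRetry(ctx context.Context, wait waitFunc, attemptTimeout, backoff time.Duration) error {
	for {
		attemptCtx, cancel := context.WithTimeout(ctx, attemptTimeout)
		err := wait(attemptCtx)
		cancel()

		if err == nil {
			return nil
		}
		if !errors.Is(err, context.DeadlineExceeded) {
			return err // non-timeout errors are not retried
		}

		// Pause before the next attempt, but stop if the parent context
		// has ended (e.g. the process is shutting down).
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
}

// demo simulates a topo that is too slow for the first two attempts and
// responds on the third.
func demo() (attempts int, err error) {
	wait := func(ctx context.Context) error {
		attempts++
		if attempts < 3 {
			<-ctx.Done() // block until the attempt times out
			return ctx.Err()
		}
		return nil
	}
	err = waitWithRetry(context.Background(), wait, 20*time.Millisecond, 10*time.Millisecond)
	return attempts, err
}

func main() {
	attempts, err := demo()
	fmt.Printf("attempts=%d err=%v\n", attempts, err) // prints "attempts=3 err=<nil>"
}
```

The parent context is checked between attempts so that the loop can still be interrupted; whether vtgate should bound the total wait or retry indefinitely is a design choice left open here.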
Original proposal below
Our proposal is that a `vtgate` that fails to get healthy tablets of the right types for keyspace/shards that are known to have tablets should not join service. This could be implemented in a number of ways, but we should be careful to distinguish two different scenarios:

1. A keyspace/shard has no tablets (e.g. the shard exists in the topology, but no tablets exist for it).
2. A keyspace/shard has tablets but the `vtgate` has not been able to get healthy ones during its initialization.

In the first scenario, `vtgate` should be able to join service. A keyspace/shard with no tablets can be the result of a decommissioned keyspace/shard, for which all `vttablet`s were removed but the topo record for the keyspace/shard was not deleted. In this case, joining service despite the failure to get healthy tablets does not lead to an issue (or at least, not any issue that was not already present on any pre-existing `vtgate`s).

In the second scenario, `vtgate` should not start serving until it manages to get healthy tablets, even if that means hanging out forever. Otherwise, any queries it gets for the targets it's missing will fail.

For that, we propose that `vtgate`s determine whether they need to wait for targets of a particular keyspace/shard by looking at an attribute in the topo record of the shard (let's temporarily call it `has_tablets`, but it can be called something else). When a shard is created, the attribute will be set to `false`. It will only be set to true when a `vttablet` process is started up for that particular keyspace/shard. It will also be set to `false` when issuing `DeleteTablet` for the last tablet in the keyspace/shard.

With the above, we would suggest that vtgate's init waiting for tablets works as follows:

1. Fetch the targets, filtering out the ones whose shards have `has_tablets` set to false.
2. `TabletGateway.WaitForTablets()` would not clear the context deadline error so that vtgate's init knows when it failed to get all targets. If there are concerns with this behavior change, it could be controlled via a new, opt-in flag.

We look forward to your input.