
During add index on DXF, SSTs are ingested into L0, which triggers ServerIsBusy and the subtask keeps failing/retrying #58807

Open · D3Hunter opened this issue Jan 8, 2025 · 4 comments
Labels
component/ddl This issue is related to DDL of TiDB. type/bug The issue is confirmed as a bug.

Comments

D3Hunter (Contributor) commented Jan 8, 2025

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

I tried to reproduce it locally but cannot reproduce it right now. The environment is:

  • 3 PD nodes (8c16g); 3 TiDB nodes (8c16g, each with a 50G disk, which is small but enough to hold the target index of around 8G); 3 TiKV nodes (8c32g)
  • the table has about 200M rows, with a structure similar to this
create table t(
    a bigint,
    b bigint,
    c bigint,
    d bigint,
    e varchar(32),
    -- ... more columns ... 41 columns in total, most are int/decimal types, some are varchar with a small length
    primary key(a, b, c),
    key(a, d, b)
)
  • add the index with alter table t add index idx_ade(a, d, e); (a simplified, runnable sketch of this schema and DDL follows at the end of this list)
  • the DXF task has 3 subtasks; the second runs faster, so it ingests first (and is the only one ingesting at that point, since we use a distributed lock). When this subtask starts ingesting, it seems all files are ingested directly into L0, like below, so it triggers TiKV's flow control and reports ServerIsBusy "too many sst files are ingesting"
    • before the ingest we can see TiKV already has some files at L0, and from the TiKV log their range spans from a small table-id to a large one, including the table-id of the target table, so it overlaps with the index KV range too. Not sure if this is related
      [screenshot: TiKV L0 files before the ingest]
  • after TiKV compaction, the number of files in L0 decreases, so the second subtask continues and succeeds later
    [screenshot: L0 file count dropping after compaction, second subtask resuming]
  • the first subtask succeeds without hitting ServerIsBusy.
  • the third subtask is the last to ingest, and it keeps failing with ServerIsBusy "too many sst files are ingesting"; from TiKV monitoring, it keeps ingesting into L0. After the retries of the local backend are used up, DXF starts retrying the whole subtask, but this never ends and keeps reporting the error above. The issue of unlimited DXF retries is recorded in add retry limit to DXF #58814
    [screenshot: TiKV monitoring showing the third subtask repeatedly ingesting into L0]
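
For convenience, below is a simplified, runnable sketch of the schema and DDL from the steps above. It is an assumption that keeps only the key columns shown there (the real table has 41 columns and about 200M rows), so row widths and data distribution will not match the original exactly.

create table t(
    a bigint,
    b bigint,
    c bigint,
    d bigint,
    e varchar(32),
    -- the remaining ~36 columns of the real table are omitted in this sketch
    primary key(a, b, c),
    key(a, d, b)
);
-- the DDL that kicks off the DXF add-index task
alter table t add index idx_ade(a, d, e);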

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiDB version? (Required)

8.1.1

D3Hunter added the type/bug and component/ddl labels on Jan 8, 2025
lance6716 (Contributor) commented

There's a TiKV log that displays every SST file's key range. Maybe we can see some strange SST range in it.

lance6716 (Contributor) commented

Also, there's an "ingestion pick level" panel in the metrics of "TiKV Details" - "RocksDB - kv"; we can check whether it first picks L6 and then gradually goes to L5, L4, ..., L0.

[screenshot: "ingestion pick level" panel]

D3Hunter (Contributor, Author) commented Jan 9, 2025

I reproduced it locally with Lightning, without setting the TiKV import-mode; add-index behaves the same.

  • create the tables below, so that the table-id of t < t1 < t2
create table t(id int primary key auto_increment,a varchar(100),b varchar(100),c bigint,d varchar(1024));
create table t1(id int primary key auto_increment,a varchar(100),b varchar(100),c bigint,d varchar(1024));
create table t2(id int primary key auto_increment,a varchar(100),b varchar(100),c bigint,d varchar(1024));
  • insert into t and t2 in turn and fill in a lot of data, until TiKV flushes its memtable into L0; then we have an L0 file whose range spans from the table-id of t to that of t2, which overlaps with t1's range (see the sketch after this list)
  • import into t1 with a lot of region jobs (at least 11, to trigger TiKV flow control)
  • we see:
[2025/01/09 11:24:55.561 +08:00] [WARN] [region_job.go:659] ["meet error and handle the job later"] ["job stage"=wrote] [error="[Lightning:KV:ServerIsBusy]too many sst files are ingesting"] [region="{ID=478,startKey=7480000000000000FF6A5F728000000000FF0325C00000000000FA,endKey=7480000000000000FF6A5F728000000000FF0339DA0000000000FA,epoch=\"conf_ver:5 version:125 \",peers=\"id:479 store_id:1 ,id:480 store_id:4 ,id:481 store_id:5 \"}"] [start=74800000000000006A5F728000000000033560] [end=74800000000000006A5F7280000000000339DA]
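
As a rough sketch of the data-filling step (the REPEAT values and the doubling pattern are illustrative assumptions, not the exact statements used), the idea is to keep writing rows of t and t2 into the same TiKV memtable until it is flushed to L0:

-- seed one row in each of t and t2
insert into t (a, b, c, d) values (repeat('x', 100), repeat('y', 100), 1, repeat('z', 1024));
insert into t2 (a, b, c, d) values (repeat('x', 100), repeat('y', 100), 1, repeat('z', 1024));
-- alternate between t and t2, doubling each table per round, so both tables
-- keep writing into the same memtable
insert into t (a, b, c, d) select a, b, c, d from t;
insert into t2 (a, b, c, d) select a, b, c, d from t2;
-- repeat the two statements above until TiKV flushes the memtable to L0; the
-- resulting L0 SST then spans the key range from t's table-id to t2's,
-- overlapping t1's range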

D3Hunter (Contributor, Author) commented Jan 9, 2025

> Also, there's an "ingestion pick level" panel in the metrics of "TiKV Details" - "RocksDB - kv"; we can check whether it first picks L6 and then gradually goes to L5, L4, ..., L0.

For the case described in the issue, this panel had no data at the time the second subtask first reported ServerIsBusy, but it did have data after a while.
[screenshot: "ingestion pick level" panel for this case]
