rocksdb using all allocated cpus due to contention on block cache #13191
I was watching https://www.youtube.com/watch?v=Tp9jO5rt7HU and it seems I may be running into the case it describes. That video is from a year ago, so I'm not sure whether things have changed since. In my case, the block cache is set to 256 MB, which I assume is considered "small". The DB size is around 1.5 GB, but that's compressed and my KVs compress very well; uncompressed it would be much, much larger (there are 10M+ KVs in the DB). A few questions:
@pdillinger what are your thoughts on this?
I don't see any significant block cache indicators on the flame graph. I can't zoom into the names that are cut short. This looks more like a workload of excessive skipped internal keys (e.g. skipping over tombstones to find something). Are you using What makes you think block cache?
yes.
I'm actually confused right now. The initial problem started with #13120, where something outside of my control deletes all the files on disk that are indexed in rocksdb. Then a cleanup service that is meant to remove orphaned indexes runs and deletes pretty much the entire db, since it notices the indexes no longer point to anything on disk. This typically happens when not all services running on the disk are online, in the sense that nothing is writing to rocksdb but there are many reads (the prefix lookups). However, sometimes the rest of the services are enabled, but rocksdb has already reached a bad state and continues like that for a while.

I also learned that the TTL compaction I have configured won't trigger on its own; it relies on flushes/writes to trigger the compaction. So the advice from the other GitHub issue I linked is to have my cleanup service issue a manual compaction after it's done removing all the keys, to avoid the case we are hitting here. For example, here is a recent occurrence:
Based on the graphs, you can see we don't have any accumulation of tombstones during the time we had the CPU spike. The only thing that spikes with the CPU is the block cache related metrics, and that's the only reason I suspect the block cache, even though the flamegraph itself looks like it should be tombstones. Also, as I said in the other linked ticket, sometimes waiting a few hours/days fixes it; other times I give up on it recovering and restart the service/rocksdb, which also fixes it.

As for how I'm getting the number of tombstones, etc.: I call https://github.com/zaidoon1/rust-rocksdb/blob/f22014c5f102744c8420d26d6ded90f340fb909c/src/db.rs#L2326-L2327, with tombstones = num_deletions and live keys = num_entries, and I sum that across all live files. I assume that's accurate.
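The aggregation described above (summing `num_deletions` and `num_entries` over all live files) can be sketched as follows. This is a minimal illustration, not the rust-rocksdb API: `LiveFileStats` and `summarize` are hypothetical names, and the file names and counts are made up.

```python
from dataclasses import dataclass


@dataclass
class LiveFileStats:
    """Hypothetical stand-in for the per-file metadata of one live SST file."""
    name: str
    num_entries: int    # entry count reported for the file
    num_deletions: int  # deletion entries (tombstones) reported for the file


def summarize(files):
    """Sum tombstones and entries across all live files and compute their ratio."""
    tombstones = sum(f.num_deletions for f in files)
    entries = sum(f.num_entries for f in files)
    ratio = tombstones / entries if entries else 0.0
    return tombstones, entries, ratio


# Made-up numbers, purely for illustration.
files = [
    LiveFileStats("000012.sst", num_entries=1000, num_deletions=50),
    LiveFileStats("000015.sst", num_entries=500, num_deletions=400),
]
tombstones, entries, ratio = summarize(files)
print(f"tombstones={tombstones} entries={entries} ratio={ratio:.2f}")
```

Note that a sum like this only covers what is recorded for SST files; deletes that have not been flushed yet would not appear in it.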
Ok, I think I know exactly what is happening here:
The problem we have here is that while we got rid of the tombstones in SST files, we DID NOT get rid of all the tombstones in the memtables. This also explains why I see the block cache metrics spiking when this issue happens since the data being read exists in memory and not SST files. I think I need to:
Other things I can try:
some things from the rocksdb side:
@pdillinger @cbi42 since you both were helping me with this. Does my thought process make sense? And is my solution the best I can do here or is there something I haven't considered?
Manual compaction does a flush first if the memtable overlaps with the range being compacted. To confirm the CPU is going to skipping tombstones, you can track these perf context counters: rocksdb/include/rocksdb/perf_context.h Lines 139 to 147 in 18cecb9
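To illustrate why skipped tombstones can dominate CPU for a "does this prefix exist" lookup, here is a toy model in Python. It is only a sketch under simplifying assumptions: a single sorted list stands in for the merged view of memtables and SST files, there are no sequence numbers or bloom filters, and `prefix_exists` and `TOMBSTONE` are hypothetical names rather than RocksDB APIs. The `skipped` count plays the role of a skipped-key counter.

```python
import bisect

TOMBSTONE = object()  # marker for a deleted key in this toy model


def prefix_exists(keys, values, prefix):
    """Return (found, skipped).

    found:   whether a live (non-deleted) key starting with `prefix` exists.
    skipped: how many tombstones were stepped over before we could tell.
    `keys` must be sorted; `values[i]` belongs to `keys[i]`.
    """
    i = bisect.bisect_left(keys, prefix)  # "seek" to the first key >= prefix
    skipped = 0
    while i < len(keys) and keys[i].startswith(prefix):
        if values[i] is not TOMBSTONE:
            return True, skipped
        skipped += 1  # deleted entry: keep scanning
        i += 1
    return False, skipped


# 1000 deleted keys under "user1/", one live key under "user2/".
keys = [f"user1/{n:04d}" for n in range(1000)] + ["user2/0000"]
values = [TOMBSTONE] * 1000 + ["v"]

print(prefix_exists(keys, values, "user1/"))  # scans every tombstone
print(prefix_exists(keys, values, "user2/"))  # finds a live key immediately

# After the tombstones are dropped (what a compaction achieves), the same
# lookup has nothing left to skip.
print(prefix_exists(["user2/0000"], ["v"], "user1/"))
```

In this model a lookup on a fully deleted prefix costs work proportional to the number of tombstones under it, which is the pattern the counters above are meant to expose.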
@cbi42 just to make sure i'm not doing this wrong, when I run manual compaction, I call the following:
My understanding is that this includes everything in the db, so I'm manually compacting the entire CF. Is that correct?
Yes, that's correct.
I've added more metrics and I have a clear picture of what's going on here, at least for the most recent issue that we saw:
Graphs: CPU usage; cronjob running; tombstones/entries in memtables; tombstones in SST files; manual compaction running at the end of the cleanup job, which fixes everything.

So given that my issue happens while the cronjob is deleting stuff and the main issue seems to be tombstones in memtables, I was initially thinking of just using https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#trigger-compaction-on-deletes, but based on its description that appears to be useful only when SST files have lots of tombstones in a given range, not memtables. Given that in my case it could be either the SST files or the memtables that are full of tombstones, my thinking is that I should also configure the memtable prefix bloom.

Of course, I can update the cronjob to trigger manual compactions more often instead of only at the end of the cleanup process (say every 2K deleted keys or something like that), but I'm working on getting rid of the cronjob service altogether and having the service that deletes things from disk also issue deletes to rocksdb directly, so that we don't need to crawl the entire db to delete a few entries (in the happy path, when we are not deleting the entire db).

@cbi42 do you think that configuring the compact-on-deletion factor (to handle tombstones in SST files) + memtable prefix bloom (to handle tombstones in memtables) will do what I want/handle this case, or is there a gotcha that I'm not thinking about?
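For intuition about the compact-on-deletes trigger mentioned above, here is a simplified Python model. It assumes the trigger behaves roughly like "mark the file if at least D tombstones fall within some window of N consecutive entries"; the real collector's bookkeeping may differ, and `needs_compaction`, `window_size` and `deletion_trigger` are hypothetical names chosen for this sketch.

```python
from collections import deque


def needs_compaction(entries, window_size, deletion_trigger):
    """Return True if any window of `window_size` consecutive entries
    contains at least `deletion_trigger` tombstones.

    `entries` is an iterable of booleans in key order; True means tombstone.
    """
    window = deque()
    deletions = 0
    for is_tombstone in entries:
        window.append(is_tombstone)
        deletions += is_tombstone
        if len(window) > window_size:
            deletions -= window.popleft()  # slide the window forward
        if deletions >= deletion_trigger:
            return True
    return False


# A dense run of 50 tombstones trips a 40-in-64 trigger...
dense = [False] * 100 + [True] * 50 + [False] * 100
print(needs_compaction(dense, window_size=64, deletion_trigger=40))

# ...while the same file size with tombstones spread out (1 in 10) does not.
spread = [i % 10 == 0 for i in range(250)]
print(needs_compaction(spread, window_size=64, deletion_trigger=40))
```

A check of this kind looks at entries as they sit in a file, which matches the concern above that it would not help with tombstones still held in memtables.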
Looking at the code, I saw: Lines 835 to 837 in 44b741e
graphs:
flamegraph:
workload:
heavy prefix lookups (thousands per second) to check if a key prefix exists in the db
writes at a much much lower rate, around 200 RPS
Db size on disk: less than 2GB
rocksdb settings:
using prefix extractors + auto hyper clock cache + running rocksdb 9.7.4.
rocksdb options.txt
This is an extension of #13081, where I saw the same issue and blamed it on LRU, so I switched to auto hyper clock cache and ran some tests that did not seem to reproduce the issue, but that does not appear to hold here.
It is very possible that many lookups are using the same prefix/looking up the same key. Would this cause contention for hyper clock cache? Is there something that I can tweak/tune? Maybe the "auto" hyper clock cache is the problem and I need to manually tweak some things?