rocksdb using all allocated cpus due to contention on block cache #13191

Open
zaidoon1 opened this issue Dec 6, 2024 · 9 comments

Comments

@zaidoon1
Contributor

zaidoon1 commented Dec 6, 2024

graphs:

Screenshot 2024-12-06 at 1 19 00 AM

flamegraph:

flamegraph

workload:

heavy prefix lookups (thousands per second) to check if a key prefix exists in the db

writes at a much much lower rate, around 200 RPS

Db size on disk: less than 2GB

rocksdb settings:

using prefix extractors + auto hyper clock cache + running rocksdb 9.7.4.

rocksdb options.txt

This is an extension of #13081, where I saw the same issue and blamed it on LRU. I switched to auto hyper clock cache and ran some tests, which seemed not to reproduce the issue, but that doesn't appear to be the case here.

It is very possible that many lookups are using the same prefix/looking up the same key. Would this cause contention for hyper clock cache? Is there something that I can tweak/tune? Maybe the "auto" hyper clock cache is the problem and I need to manually tweak some things?

@zaidoon1
Contributor Author

zaidoon1 commented Dec 6, 2024

I was watching https://www.youtube.com/watch?v=Tp9jO5rt7HU and it seems I may be running into this case:

Screenshot 2024-12-06 at 2 21 41 AM

That video is from 1 year ago, I'm not sure if things have changed since.

In my case, my block cache is set to 256 MB, which I assume is considered "small". The db size is around 1.5 GB, but that's compressed and my kvs compress very well; uncompressed it would be much larger (there are 10M+ kvs in the db).

a few questions:

  1. Is there a metric we can track that shows whether hyper clock cache is hitting this case of looking for things to evict? It's not showing up in the flamegraph, but maybe it's not meant to show up?
  2. If it is indeed this case (or assuming it is), what is the solution/workaround?

@pdillinger what are your thoughts on this?

@pdillinger
Contributor

I don't see any significant block cache indicators on the flame graph. I can't zoom into the names that are cut short. This looks more like a workload of excessive skipped internal keys (e.g. skipping over tombstones to find something). Are you using prefix_same_as_start or an iterate_upper_bound for your prefix queries? You don't want to be scanning to the next non-empty prefix just to discover the prefix you are interested in is empty.
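To illustrate the point about scanning past the prefix, here is a toy model, not RocksDB code. A `std::map` stands in for the sorted keyspace, and `PrefixExists` is a hypothetical helper. With `bounded` set (loosely what `prefix_same_as_start` or an `iterate_upper_bound` is meant to give you), the scan stops as soon as it leaves the prefix; without it, the scan steps over every tombstone until it reaches a live key.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Toy stand-in for a sorted keyspace: key -> true if the entry is a tombstone.
using ToyDb = std::map<std::string, bool>;

struct ScanResult {
  bool found;           // a live key with the requested prefix exists
  std::size_t skipped;  // tombstones stepped over before the answer was known
};

// Hypothetical helper for illustration only; this is not a RocksDB API.
ScanResult PrefixExists(const ToyDb& db, const std::string& prefix,
                        bool bounded) {
  assert(!prefix.empty());
  ScanResult result{false, 0};
  for (auto it = db.lower_bound(prefix); it != db.end(); ++it) {
    const bool in_prefix = it->first.compare(0, prefix.size(), prefix) == 0;
    if (bounded && !in_prefix) break;  // left the prefix: nothing to find
    if (it->second) {                  // tombstone: skip it and keep going
      ++result.skipped;
      continue;
    }
    result.found = in_prefix;          // first live key decides the answer
    break;
  }
  return result;
}
```

For an empty prefix followed by a run of deleted keys, the unbounded scan pays for every tombstone in the run, while the bounded scan returns immediately.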

What makes you think block cache?

@zaidoon1
Contributor Author

zaidoon1 commented Dec 12, 2024

> Are you using prefix_same_as_start

yes.

> What makes you think block cache?

I'm actually confused right now. The initial problem started with #13120, where something outside of my control deletes all files on disk that are being indexed in rocksdb. Then a clean up service that is meant to remove orphaned indexes runs and deletes pretty much the entire db, as it notices the indexes don't point to anything on disk.

This typically happens when not all services running on the disk are online, in the sense that nothing is writing to rocksdb but there are many reads (the prefix lookups). However, sometimes the rest of the services are enabled but rocksdb has reached a bad state and continues like that for a while. I also learned that the TTL compaction I have won't trigger automatically; it relies on flushes/writes to trigger the compaction. So the advice from the other GitHub issue I linked is to have my clean up service issue manual compactions after it's done removing all the keys, to avoid the case we are hitting here.

for example, here is a recent occurrence:

  1. something misbehaves and deletes everything on disk
Screenshot 2024-12-11 at 10 11 49 PM
  2. clean up service runs to delete orphaned indexes
Screenshot 2024-12-11 at 10 18 51 PM
  3. after each run, clean up service issues a manual compaction
  4. here are other graphs showing different rocksdb stats at the time
Screenshot 2024-12-11 at 10 28 24 PM Screenshot 2024-12-11 at 10 28 53 PM

Based on the graphs, you can see we don't have any accumulation of tombstones during the time we had the cpu spike. The only thing that spikes with the cpu is block cache related metrics, and that's the only reason I'm suspecting block cache, even though from the flamegraph itself it looks like it should be tombstones. Also, as I said in the other linked ticket, sometimes waiting a few hours/days will fix it; other times I give up on it recovering and restart the service/rocksdb, which also fixes it.

As for how I'm getting the number of tombstones, etc., I call https://github.com/zaidoon1/rust-rocksdb/blob/f22014c5f102744c8420d26d6ded90f340fb909c/src/db.rs#L2326-L2327.

tombstones = num_deletions, live keys = num_entries and I sum that from all live files. I assume that's accurate
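The aggregation described above can be sketched as follows. `FileStats` is a hypothetical struct that mirrors only the two per-file fields mentioned here; it is not the RocksDB metadata type. Two assumptions to verify: if `num_entries` counts every entry in a file, tombstones included (which is my understanding of the table properties), then live entries are closer to `num_entries - num_deletions`; and live-file metadata describes SST files only, so tombstones still sitting in memtables would not be counted.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical mirror of the two per-file counters used in this thread.
struct FileStats {
  std::uint64_t num_entries;    // assumed to include tombstones
  std::uint64_t num_deletions;  // tombstones in the file
};

struct Totals {
  std::uint64_t entries = 0;
  std::uint64_t deletions = 0;

  // Live entries under the assumption that num_entries includes tombstones.
  std::uint64_t Live() const {
    return entries >= deletions ? entries - deletions : 0;
  }
};

// Sums the counters over every live file.
Totals SumLiveFiles(const std::vector<FileStats>& files) {
  Totals totals;
  for (const FileStats& f : files) {
    totals.entries += f.num_entries;
    totals.deletions += f.num_deletions;
  }
  return totals;
}
```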

@zaidoon1
Contributor Author

zaidoon1 commented Dec 23, 2024

Ok, I think I know exactly what is happening here:

  1. writes to rocksdb stop because the service writing to rocksdb is shutdown
  2. reads to rocksdb continue at a rate of thousands per second doing prefix seeks
  3. some rare condition is triggered and the files that are being indexed by rocksdb are deleted (still trying to figure out why but that's outside of rocksdb so not our concern here)
  4. the cronjob service that makes sure rocksdb is in sync with what is on disk runs. It asks the service that has rocksdb open in read/write mode to create a checkpoint, opens the checkpoint in read-only mode, and then deletes pretty much the entire db with DELETE calls to the read/write service, filling the db with tombstones.
  5. the cronjob service finishes processing all the indexes and issues a manual compaction to remove all the tombstones, as suggested in "compaction not running or running very slowly when entire db is deleted?" (#13120)

The problem we have here is that while we got rid of the tombstones in SST files, we DID NOT get rid of all the tombstones in the memtables. This also explains why I see the block cache metrics spiking when this issue happens since the data being read exists in memory and not SST files. I think I need to:

  1. run a manual flush after deleting all the records, to flush the memtables to SST files, and then run a manual compaction to get rid of all the tombstones in the SST files
  2. potentially also enable memtable_prefix_bloom?

Other things I can try:

  1. periodically flush memtables instead of waiting for the cronjob service to do it so I don't keep around lots of tombstones and degrade read performance?

some things from the rocksdb side:

  1. does it make sense to offer a "periodic flush seconds" similar to the "periodic compaction seconds"/db ttl to periodically flush memtables or something like the compaction deletion factory but for memtables instead of SST files?

@pdillinger @cbi42 since you both were helping me with this. Does my thought process make sense? And is my solution the best I can do here or is there something I haven't considered?

@cbi42
Member

cbi42 commented Dec 23, 2024

Manual compaction does a flush first if memtable overlaps with the range being compacted.

To confirm CPU is from skipping tombstones, you can track these perf context counters:

uint64_t internal_key_skipped_count;
// Total number of deletes and single deletes skipped over during iteration
// When calling Next(), Seek() or SeekToFirst(), after previous position
// before calling Next(), the seek key in Seek() or the beginning for
// SeekToFirst(), there may be one or more deleted keys before the next valid
// key. Every deleted key is counted once. We don't recount here if there are
// still older updates invalidated by the tombstones.
//
uint64_t internal_delete_skipped_count;

@zaidoon1
Contributor Author

@cbi42 just to make sure i'm not doing this wrong, when I run manual compaction, I call the following:

rocksdb_compact_range_cf(db,cf,NULL,0,NULL,0)

My understanding is this includes everything in the db so i'm manually compacting the entire cf. is that correct?

@cbi42
Member

cbi42 commented Dec 23, 2024

Yes that's correct.

@zaidoon1
Contributor Author

zaidoon1 commented Jan 7, 2025

I've added more metrics and I have a clear picture of what's going on here, at least for the most recent issue that we saw:

  1. cronjob runs and starts deleting a lot of kvs
  2. as the cronjob is running and deleting things, tombstones start filling up
  3. even though I run a manual compaction at the end of the clean up job, it doesn't matter: while the clean up job is running and tombstones are accumulating, prefix lookups keep happening and have to deal with the many tombstones, thus maxing out the cpu.

cpu usage:

Screenshot 2025-01-07 at 10 10 46 AM

cronjob running:

Screenshot 2025-01-07 at 10 10 55 AM

tombstones/entries in memtables:

Screenshot 2025-01-07 at 10 10 39 AM

tombstones for SST files:

Screenshot 2025-01-07 at 10 30 39 AM

manual compaction running at the end of the clean up job that fixes everything:
Screenshot 2025-01-07 at 10 32 08 AM

So given that my issue happens while the cronjob is deleting stuff and the main issue seems to be tombstones in memtables, I was initially thinking of just using https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#trigger-compaction-on-deletes, but based on its description that appears to be useful only when SST files, not memtables, have lots of tombstones in a given range. Given that in my case it could be either the SST files or the memtables that are full of tombstones, my thinking is that I should also configure memtable_prefix_bloom. I expect 90% of prefix lookups to be negative in normal operation, and in the special case where the entire db is being deleted I expect nothing to be found at all.
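For reference, the idea behind triggering compaction on deletes can be modeled in a few lines. This is a simplified sketch of the rule as I read it from the tuning guide (mark a file when some window of consecutive entries contains at least a given number of deletions), not the actual collector; `NeedsCompaction`, `window`, and `trigger` are names made up for this example, and the real implementation may track the window differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// entries[i] is true if the i-th entry of one file (in key order) is a
// tombstone. Returns true if any run of `window` consecutive entries holds
// at least `trigger` tombstones. Hypothetical model, not a RocksDB API.
bool NeedsCompaction(const std::vector<bool>& entries, std::size_t window,
                     std::size_t trigger) {
  assert(window > 0 && trigger > 0);
  std::size_t deletions_in_window = 0;
  for (std::size_t i = 0; i < entries.size(); ++i) {
    if (entries[i]) ++deletions_in_window;
    // Drop the entry that just slid out of the window ending at i.
    if (i >= window && entries[i - window]) --deletions_in_window;
    if (deletions_in_window >= trigger) return true;
  }
  return false;
}
```

Because the rule looks at entries as they are written to a file, it says nothing about tombstones that are still only in a memtable, which matches the limitation noted above.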

Of course, I can update the cronjob to trigger manual compactions more often instead of only at the end of the clean up process (say every 2K deleted keys or something like that), but I'm working on getting rid of the cronjob service altogether and having the service that deletes things from disk also issue deletes to rocksdb directly, so that we don't need to crawl the entire db to delete a few entries (in the happy path when we are not deleting the entire db).
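The "compact every N deleted keys" variant could look roughly like this. `DeleteBatcher` is a hypothetical helper, and the callback is a placeholder for whatever actually issues the manual compaction; nothing here calls RocksDB.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical helper: invokes `compact` once per `every` deletes, plus once
// more at the end if any deletes are still unaccounted for.
class DeleteBatcher {
 public:
  DeleteBatcher(std::size_t every, std::function<void()> compact)
      : every_(every), compact_(std::move(compact)) {
    assert(every_ > 0);
  }

  // Call once for each key the clean up job deletes.
  void OnDelete() {
    if (++pending_ >= every_) {
      compact_();
      pending_ = 0;
    }
  }

  // Call when the clean up job finishes, to cover the tail of the run.
  void Finish() {
    if (pending_ > 0) {
      compact_();
      pending_ = 0;
    }
  }

 private:
  std::size_t every_;
  std::function<void()> compact_;
  std::size_t pending_ = 0;
};
```

With `every` set to 2000, a run that deletes 4500 keys would trigger two compactions along the way and a third from `Finish()`.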

@cbi42 do you think that configuring compact on deletion factory (to handle tombstones in SST files) + memtable prefix bloom (to handle tombstones in memtables) will do what I want/handle this case, or is there a gotcha that I'm not thinking about?

@zaidoon1
Contributor Author

zaidoon1 commented Jan 9, 2025

looking at the code, I saw:

rocksdb/db/memtable.h

Lines 835 to 837 in 44b741e

// max range deletions in a memtable, before automatic flushing, 0 for
// unlimited.
uint32_t memtable_max_range_deletions_ = 0;
but this is for setting a max on range deletes in a memtable before forcing a flush, not "regular" deletes. I can introduce a new feature that does the same for regular deletes if you think it's a good feature. Thoughts?
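To make the proposal concrete, here is a toy model of the counting rule only. Everything in it is hypothetical, including the name `max_point_deletions`; it borrows the "0 for unlimited" convention from the comment quoted above and assumes the flush would be requested once the count reaches the limit.

```cpp
#include <cstdint>

// Hypothetical model of a per-memtable limit on point deletions. It does not
// reflect RocksDB's memtable implementation or any existing option.
class ToyMemtable {
 public:
  explicit ToyMemtable(std::uint32_t max_point_deletions)
      : max_point_deletions_(max_point_deletions) {}

  void AddPut() { ++entries_; }

  void AddDelete() {
    ++entries_;
    ++point_deletions_;
  }

  // True once the number of point deletions reaches the limit.
  // A limit of 0 means unlimited, so it never requests a flush.
  bool ShouldFlush() const {
    return max_point_deletions_ > 0 &&
           point_deletions_ >= max_point_deletions_;
  }

  std::uint64_t entries() const { return entries_; }

 private:
  std::uint32_t max_point_deletions_;
  std::uint64_t entries_ = 0;
  std::uint64_t point_deletions_ = 0;
};
```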
