Request for Valkey to support field level expire/TTL for hash, set and sorted set #640
Comments
Historically this was never added due to the amount of complexity. If that still holds true, I think we should evaluate if this fits more as a separate command/modules to incorporate element level TTL for a key. Note: Tair* has implemented some of these features in a module. |
Sounds like ideally the expiry logic should be embedded into the dict itself [and then it can easily be reused by all dict users]. Have we evaluated such an option [instead of the additional expiry dict]? |
@zvi-code I documented a bit of that here: #169 (comment). Embedding the TTL into the dict itself is OK and allows you to do lazy expiration very fast. Active expiration is a lot more painful though, since you need some way to actively go find those values which only somewhat works. |
Thanks @madolson for sharing, I'll take a look. I think with reasonable assumptions it could be easier. The assumptions are about how tight we want to be on eviction time. At least intuitively, I feel it's perfectly OK to be accurate at a multi-second granularity [have grouping of some kind at a seconds granularity]. But I'll post my thoughts on the other thread, it looks very interesting! |
I have been investigating this problem for the last few weeks. I would like to share some initial thoughts, which I plan to PoC before deciding on the best approach for implementation. First, I would like to list the important tenets for the solution:
Tenets
New API
At this point we will only focus on introducing the new expiry API for hashes. I think once we do that, it will be more straightforward to replicate the mechanisms for SETs and Sorted Sets.
The new Key type
Currently a key is always an SDS in Valkey. Although it is possible to match a key with external metadata (e.g. TTL) by mapping the key to the relevant metadata, holding that mapping incurs extra memory utilization. Some dictionaries use objects with embedded keys, where the metadata can be set as part of the object. However, that would require every dictionary which needs TTL support to use objects with embedded keys, which might significantly complicate existing code paths as well as require extra memory in order to hold the object metadata.
hkey memory layout
There are 2 options for placing the hkey metadata.
Option 1 - place metadata in front of the sds
We can place the metadata in front of the sds. This may prove better in terms of memory access, since the metadata is likely to be loaded in the same cache line as the start of the string. The downside is having to relocate the sds (or memmove it) whenever we add or remove metadata blocks, but we assume this is not as frequent as metadata access.
Option 2 - place metadata at the end of the sds
We can place the metadata at the end of the sds. Since keys are immutable, there is no risk of having to frequently relocate the sds. In fact, for large key strings (larger than 64 bytes sds allocation size) there is a better chance we will not have to reallocate the sds when we set or remove the TTL of a hash item, since it might already fit in the jemalloc bin. The problem with this memory ordering is that if the hkey has to support extra metadata in order to be kept in some secondary index (see the next section - items expiration tracking), it might require many dereferences from the sds header to the metadata, which can span more than one L1 cache line.
The Key API
We can start with a minimal API and extend it in the future. We can also decide to extend the metadata in the future, when we would like to use the key in multiple indexes which require placing index-specific metadata in each key.
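To make Option 1 a bit more concrete, here is a minimal sketch of a key whose TTL metadata sits directly in front of the string bytes. It is only an illustration: the names (`hkey`, `hkeyNew`, `hkeyGetExpiry`, `hkeySetExpiry`, `hkeyIsExpired`, `hkeyFree`, `HKEY_NO_TTL`) are hypothetical, it uses a plain C string instead of a real sds, and it always reserves the TTL slot rather than adding or removing metadata blocks on demand.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: [int64 expiry_ms][string bytes ... '\0'].
 * The handle points at the string bytes, so existing code that only
 * reads the string keeps working; the metadata lives just before it. */
typedef char *hkey;

#define HKEY_NO_TTL ((int64_t)-1) /* assumed sentinel for "no expiry" */

/* Allocate a key with room for the TTL in front of the string. */
hkey hkeyNew(const char *field, int64_t expiry_ms) {
    assert(field != NULL);
    size_t len = strlen(field);
    char *base = malloc(sizeof(int64_t) + len + 1);
    if (base == NULL) return NULL;
    memcpy(base, &expiry_ms, sizeof(int64_t));
    memcpy(base + sizeof(int64_t), field, len + 1);
    return base + sizeof(int64_t);
}

/* Read the TTL stored right before the string (memcpy avoids
 * any alignment assumptions about the handle). */
int64_t hkeyGetExpiry(const hkey key) {
    int64_t expiry_ms;
    memcpy(&expiry_ms, key - sizeof(int64_t), sizeof(int64_t));
    return expiry_ms;
}

/* Overwrite the TTL in place; no reallocation is needed here because
 * this sketch always reserves the metadata slot. */
void hkeySetExpiry(hkey key, int64_t expiry_ms) {
    memcpy(key - sizeof(int64_t), &expiry_ms, sizeof(int64_t));
}

/* Lazy-expiration check: a key with a TTL is expired once now >= expiry. */
int hkeyIsExpired(const hkey key, int64_t now_ms) {
    int64_t expiry_ms = hkeyGetExpiry(key);
    return expiry_ms != HKEY_NO_TTL && expiry_ms <= now_ms;
}

/* Free the whole allocation, starting from the metadata. */
void hkeyFree(hkey key) {
    if (key != NULL) free(key - sizeof(int64_t));
}
```

Because the handle still points at the string bytes, a lookup can compare the key as a normal string and read the TTL from the bytes just before it, which is the cache-locality argument made for Option 1.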
Handling ListPack representation
Users can configure hash type objects to use a memory-efficient representation based on listpacks. Although we could add TTL to the listpack representation, this might greatly complicate the usage when some items have a TTL and others do not. Using an external data structure to manage the TTL for listpack elements which were assigned a TTL is possible, but might reduce the memory efficiency of this representation. For the sake of simplicity, we will force conversion from listpack to the hashtable representation as soon as the first item TTL is set. This way we will not impose a memory efficiency degradation on users who do not set TTL on hash object elements.
Items expiration tracking
In order to manage active expiration, we have to keep a record of all items with TTL and have the ability to iterate over them and reclaim them from time to time. For general keys, we use a separate expiry hashtable for each kvstore element in order to store keys with TTL. During cron cycles (triggered roughly server.hz times per second) we also use a predefined, configurable heuristic in order to bound the cron run time spent scanning the expiry dictionary.
Option 1 - Expiration Radix tree
This option was also considered by @antirez (link) in order to catalog items with their expiration time. The technique makes use of the existing RAX data type in order to record 16-byte “strings” representing the TTL and the object pointer value:
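As a rough sketch of what such a 16-byte “string” could look like: the expiry time goes first, in big-endian byte order, so that the byte-wise ordering of the radix tree matches chronological order, and the pointer value follows to keep entries with the same expiry time distinct. The function names below are hypothetical and nothing here calls the real RAX API; it only shows the encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXPIRY_KEY_LEN 16 /* 8 bytes of expiry time + 8 bytes of pointer */

/* Store a 64-bit value most significant byte first, so that comparing
 * two encoded values with memcmp() gives the same order as comparing
 * the numbers themselves. */
static void putBigEndian64(unsigned char *dst, uint64_t v) {
    for (int i = 7; i >= 0; i--) {
        dst[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}

static uint64_t getBigEndian64(const unsigned char *src) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) v = (v << 8) | src[i];
    return v;
}

/* Build the 16-byte key: expiry time first, then the item pointer. */
void encodeExpiryKey(unsigned char *buf, uint64_t expiry_ms, const void *item) {
    putBigEndian64(buf, expiry_ms);
    putBigEndian64(buf + 8, (uint64_t)(uintptr_t)item);
}

/* Recover the expiry time and the item pointer from a 16-byte key. */
void decodeExpiryKey(const unsigned char *buf, uint64_t *expiry_ms, void **item) {
    *expiry_ms = getBigEndian64(buf);
    *item = (void *)(uintptr_t)getBigEndian64(buf + 8);
}
```

With this encoding, walking the tree from its smallest key visits items in order of expiration, so an active-expiry pass can stop at the first key whose time is still in the future.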
Actually, this way of managing objects with a timeout is already used in Valkey to manage blocked clients! (link)
The good:
The bad:
Although we could optimize this structure to be more performant and memory efficient by adopting an alternative implementation (e.g. Judy arrays or other proposed implementations), this work might still not yield better memory efficiency and would require a long and complicated implementation.
Option 2 - Deterministic Skip Lists
Since we are introducing object expiration support for hash objects, we cannot allow sorting all items by their TTL, since this would imply logarithmic complexity on HashTable operations. We could, however, attempt to achieve approximately constant complexity when modifying Objects. This is done by bounding the number of items stored in a sorted data structure. In order to manage Objects in a sorted data structure, we would first need to extend the object metadata:
Skip lists have several disadvantages though. They require extra space to manage pointers between the middle layer nodes, and even though they offer logarithmic complexity for all operations (search, insertion, deletion), they usually involve a lot of memory dereferencing, which can turn out to be expensive. A deterministic skip list helps us maintain a bounded memory overhead per Object (16 bytes of metadata overhead + 8 / p, where p is the allowed gap). For example, if we allow p = 8 we get a memory overhead of ~17 bytes, and if we allow p = 4 we get a memory overhead of ~18 bytes, etc.
How do we limit the number of objects?
As mentioned before, we cannot allow logarithmic complexity. This means we will have to keep the list small enough (say 512 elements). But what happens when we have more than 512 elements with expiry data?
Option 3 (Recommended) Hash Time Ranges
When sorting the data is not possible, we can impose semi-sorting on the hash elements by assigning them to buckets. Each bucket will correspond to a specific time range, so that all keys with a TTL in the bucket's time range will be placed in that bucket. The bucket time ranges should be carefully considered though.
If we choose high-resolution buckets, we risk losing memory efficiency, since many buckets will hold very few elements to share the bucket metadata. If we choose low-resolution buckets, we might fail to reclaim memory for a long period of time.
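The bucket assignment itself can be sketched in a few lines. This is only an illustration under assumed parameters: a single fixed resolution of one second (`BUCKET_RESOLUTION_MS`) and hypothetical helper names, not the layered scheme a real implementation would probably need.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed resolution: every bucket covers one second of expiry times.
 * A smaller value means more, emptier buckets; a larger value means
 * expired items may wait longer before the bucket can be reclaimed. */
#define BUCKET_RESOLUTION_MS 1000

/* Map an absolute expiry time to the index of the bucket covering it. */
uint64_t bucketIndex(uint64_t expiry_ms) {
    return expiry_ms / BUCKET_RESOLUTION_MS;
}

/* First expiry time (inclusive) covered by the bucket. */
uint64_t bucketStartMs(uint64_t index) {
    return index * BUCKET_RESOLUTION_MS;
}

/* End of the bucket's time range (exclusive). */
uint64_t bucketEndMs(uint64_t index) {
    return (index + 1) * BUCKET_RESOLUTION_MS;
}

/* A bucket can be reclaimed as a whole once its entire time range is
 * in the past, i.e. every item in it has an expiry time <= now_ms. */
int bucketFullyExpired(uint64_t index, uint64_t now_ms) {
    return bucketEndMs(index) <= now_ms;
}
```

Items inside one bucket are not sorted relative to each other, which is what keeps insertion and removal at constant cost; the price is that a bucket is only guaranteed to be fully expired once its whole range has passed.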
The main issue with timer wheels is the cascading operation, which might be costly since it requires iterating over all the items in the cascaded bucket. It is important to note that the higher the level resolution gets, the less frequently this cascading operation needs to take place, and it can also increase the probability that these items will be lazily deleted, reducing the need to actively iterate over and expire them.
Hash Timer buckets
The bucket structure
In order to allow fast insertions and deletions of elements, we will make use of the new hashtable structure.
What is the required load factor?
When using a bucket structure, it might be possible to maintain the elements using an intrusive doubly linked list, where the prev and next pointers are embedded in the hkey metadata. However, this would require an overhead of 24 bytes per hkey with TTL, which could be optimized further. In order to maintain good memory efficiency, we will need to extend the allowed hashtable load factor. For example, the following expected memory overhead estimation was done based on a Poisson distribution of items: The following chart is a statistical evaluation of the expected extra metadata cost per hash item given a specific load factor:
These are the current items I plan to focus my PoC on. Some other issues which will require addressing:
|
Support for this commands may come in the future[1]. But it will take some time, so for now it's better to drop them. This is a breaking change for 6.1. Close #78 [1]: valkey-io/valkey#640 Signed-off-by: Salvatore Mesoraca <[email protected]>
Shall we also support the EXPIREMEMBER syntax from KeyDB? That was the first implementation of this feature, so I think we can respect the original syntax and also provide an improved migration path for KeyDB users. |
Support for this commands may come in the future[1]. But it will take some time, so for now it's better to drop them. This is a breaking change for 6.1. Close valkey-io#78 [1]: valkey-io/valkey#640 Signed-off-by: Salvatore Mesoraca <[email protected]> Signed-off-by: Raphaël Vinot <[email protected]>
@ranshid I've now read your proposal above about the HKEY intrusive sds abstraction. Sds metadata sounds good. Probably put it in front of the string to avoid extra memory lookups. If the key is very large, we can make space for the TTL in advance to avoid the need to reallocate it. I did that in the keyspace robj already, in #1186.
I think this is slightly irrelevant and a bit wrong. The hashtable doesn't require keys to be embedded in anything. We could just use a Each hashtable application is slightly different though:
|
I think a combination of all options is the better solution. |
We've got several apps using KeyDB. One question - will empty sets be automatically deleted (i.e. after the last member expires)? I'm pretty sure that's how KeyDB does it. Otherwise we'd need to implement some kind of manual cleanup which would add unnecessary complication. |
The problem/use-case that the feature addresses
My customer has a use case where each element within a Hash/Set/Sorted Set data type would have a different TTL. Right now, this is done by manually writing client-side logic or a Lua script. It would be great if Valkey could support element-level TTL, so that each element expires independently.
Description of the feature
Valkey would allow clients to specify a different TTL value at the element level.
Alternatives you've considered
None.
Additional information
None.