Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request for Valkey to support field level expire/TTL for hash, set and sorted set #640

Open
xingbowang opened this issue Jun 12, 2024 · 9 comments
Assignees
Labels
client-changes-needed Client changes are required for this feature enhancement New feature or request

Comments

@xingbowang
Copy link

The problem/use-case that the feature addresses
My customer has a use case where each element within a Hash/Set/Sorted Set data type would have different TTL. Right now, it is done by manually write client side logic or lua script. It would be great if Valkey could support element level TTL, so the element get expired independently.

Description of the feature

Valkey would allow client to specific different TTL value at element level.

Alternatives you've considered

None.

Additional information

None.

@hpatro
Copy link
Collaborator

hpatro commented Jun 12, 2024

Historically this was never added due to the amount of complexity. If that still holds true, I think we should evaluate if this fits more as a separate command/modules to incorporate element level TTL for a key.

Note: Tair* has implemented some of these features in a module.

@zvi-code
Copy link
Contributor

sounds like ideally the expiry logic should be embedded into the dict itself [and than can be easily reused for all dict users], have we evaluated such an option? [instead of the additional expiry dict]
we could add expiry existence bitmap on the dict buckets [bit for bucket] to have optimized iterator over volatile keys, at some small cost at rehashing, or even ttl range bitmap to group similar ttl keys

@madolson
Copy link
Member

@zvi-code I documented a bit of that here: #169 (comment). Embedding the TTL into the dict itself is OK and allows you to do lazy expiration very fast. Active expiration is a lot more painful though, since you need some way to actively go find those values which only somewhat works.

@zvi-code
Copy link
Contributor

zvi-code commented Jun 20, 2024

Thanks @madolson for sharing, I'll take a look. I think with resonable assumptions it could be easier. The assumptions are about how tight do we want to be on eviction time, at least intuitively I feel, it's perfectly ok to be accurate at a multi second granularity[have grouping of some kind at a seconds granularity]. But I'll post my thoughts on the other thread, it looks very interesting!

@ranshid
Copy link
Member

ranshid commented Nov 25, 2024

I have been working on investigation of the problem for the last few weeks.

I would like to provide some initial thoughts I came up with for which I plan to PoC and decide on the best approuch for implementation.

First I would like to list the important tenets for the solution:

Tenets

  • memory efficiency - At the highest priority we target a minimal metadata overhead used in order to store and manage items TTL. While the optimal overhead to maintain item TTL is 8 bytes (could be less if we allow keeping offsets from the existing epoch diff time), we understand that maintaining active expiry logic will require use of more bytes for metadata. We will make our top priority effort to minimize this overhead.
  • Latency - After memory efficiency considerations we will require a solution which provides low latency for hash operations. Hash objects are expected to operate in O(1) for all single access operations (get, set, delete etc...) and we will not break this promise even for items with expiry.
  • CPU efficiency - After latency we priorities system CPU efficiency. For example we would like to avoid high CPU utilization caused by need to perform null active expiry checks during cron runs.
  • Copherency - We would like the reported system state to match the logical state as much as possible. For example the reported number of keys per DB is expected to match the number of keys which are NOT logically expired.

New API

At this point we will only focus on introducing the new expiry API for hashes. I think once we do that it will be more trivial work to produce and duplicate the mechanisms for SETs and SortedSets.
The proposed API will be very much identical to the Redis provided API (Redis 7.4). This is intentionally proposed in order to avoid breaking client applications already opted to use hash items TTL.

The new Key type

Currently a key is always an SDS in Valkey. Although it is possible to match a key with external metadata (eg TTL) by mapping the key to the relevant metadata, it will incur extra memory utilization to hold the mapping. Some dictionaries use objects with embedded keys were the metadata can be set as part of the object. However that would require every dictionary which needs TTL support to use objects with embedded keys and might significantly complicate existing code paths as well as require extra memory in order hold the object metadata.
We propose to introduce a new key type - HKEY which will use an intrusive sds abstraction.
So the type is simply defined as:

typedef sds hkey

hkey memory layout

There are 2 options to place the hkey metadata.

Option 1 - place metadata in front of the sds

image (4)

We can place the metadata in front of the sds. This can be proved to be better in terms of memory access since the metadata might be loaded with the entire cacheline. This has the downside of having to relocate the sds (or memmove it) whenever we add or remove metadata blocks, but we assume this is not as frequent as metadata access.
This also means the total metadata size might need to be kept aligned so to keep the sds address (the address of the start of the string) odd in order to support different hacks we did to minimize memory consumption.

Option 2 - place metadata at the end of the sds

image (5)

We can place the metadata at the end of the sds. Since keys are immutable, there is no risk of having to frequently relocate the sds. In fact for large key strings (larger than 64 bytes sds allocation size) There is a better chance we will not have to rellocate the sds when we set or remove TTL from hash item, since if might already fit in the jemalloc bin. The problem with this memory ordering is that in case the hkey will have to support extra metadata in order to be kept in some secondarty index (see the next section - items expiration tracking), it might require many dereferrences from sds header to metadata which can exceed the L1 cache line size.

The Key API

We can start at a minimal API and extend it in the future. We can also decide to extend the metadata in the future when we would like to use the key in multiple indexes which require to place index specific metadata in each key.
initially, a key is created the same way as a regular sds (unless we know it’s ttl at creation time).
The key will provide API in order to:

  1. hkey hkey_create(sds k);
    This will create be trivial function which will basically just return the sds cast as a key type.

  2. hkey hkey_set_ttl(hkey k, long long ttl);
    Set TTL metadata to the key. In case the key already has metadata this will override the existing metadata value. Adding and removing metadata MIGHT require reallocating the sds.
    We can avoid reallocating memory by checking the sds allocation size and match it to the string size.
    In case we have enough room for placing the metadata we can add the metadata in place (depending on the chosen memory layout).
    In order to understand if a specific key has ttl or not we will make use of the extra 5 available bits in the sds header flags. Since the sds type 5 (used for strings up to 32 bytes) is making use of ALL flags bits, we will have to convert the sds to type 8 when placing ttl on the specific key.

  3. hkey hkey_unset_ttl(hkey k, long long ttl);
    This will remove ttl data from the key. In case we can shrink the key we will reallocate the sds to a smaller sds in order to save memory (while also changing to type 5 if needed).

  4. long long hkey_get_ttl(hkey k);
    This will return the key assigned ttl (in case one exists). In case no TTL is assigned to the key, the returned value will be -1.
    Checking for ttl presence is rather easy. We first check the sds type (in case it is type 5 we know it has no ttl)
    In case of other sds types, we will validate the ttl (metadata) bit is set in the flags.

  5. void *hkey_get_value(hkey k);
    This will return the pointer to the value associated with that key (in case one exists).

  6. void *hkey_set_value(hkey k);
    This will set the pointer to the value associated with that key.

Handling ListPack representation

User can configure hash type objects to use memory efficient representation using listpacks. Although we could add TTL to the listpack representation this might greatly complicate the usage when some items do not have TTL and others does. Using an external data structure to manage the TTL for listpack elements which were assigned TTL is possible but might reduce the memeory efficiency of this representation. In the sake of simplicity we will force listpack conversion to hashtable representation as soon as the first item TTL is set. This way we will not enforce memory efficiency degradation for users not setting TTL on hash object elements.

Items expiration tracking

In order to manage active expiration, we have to keep a record of all items with TTL and have the ability to iterate and reclaim them from time to time. For general keys, we use a separate expiry hashtable for each kvstore element in order to store keys with TTL. During cron cycles (triggered roughly every server.hz) we also use a predefined configurable heuristic in order to bound the cron run time scanning the expiry dictionary.
recall that our first tenet was targeting memory efficiency - we would like our cron to avoid stalling evicting expiry items for a long periods of time. Also recall the third tenet is targeting CPU efficiency. we would like to minimize cron cycles spent iterating over items which were not expired.

Option 1 - Expiration Radix tree

This option was also considered by @antirez (link) in order to catalog items with their expiration time. The technique makes use of the existing RAX data type in order to record 16 bytes “strings” representing the TTL and the object pointer value:

Example
image (6)

  • The first 8 bytes represent the time since the Unix epoch in milliseconds.

    • Assume the current timestamp is 1700236800123 milliseconds since the epoch.
    • This value is represented in hexadecimal: 0000018C77B5A403.
  • The next 8 bytes represent a hypothetical memory address.

    • Assume the memory address is 0x7FFDFAD4C920.
    • This value in hexadecimal is: 7FFDFAD4C9200000 (padded or truncated to 8 bytes as needed).

image (7)

Actually this way of managing objects with timeout is already used in Valkey to manage blocked clients! (link)

The good:

  • Battle tested existing implementation.
  • Fast delivery as this implementation is almost trivial.
  • Provides O(1) insertions and deletions with bounded trie hight of 16 bytes.

The bad:

  • poor memory efficiency. benchmarking this solution over both small and large number of items indicate up to 52 bytes per Object (including the TTL embedded in the hkey object)
  • Performance impact - although the trie hight is bounded, when encapsulating large number of items the trie is expected to hold many internal nodes. this will require up to 16 dereferences between different nodes allocated to support the trie structure which can lead to impact on performance on the regular path (when performing hash inserts, deleted and TTL updates) for items holding TTL.

Although we can think of optimizing this structure to be more performant and memory efficient by adopting alternative implementation (eg judy arrays or other proposed implementations) this work might still not yield better memory efficiency and will require long and complicated implementation.

Option 2 - Deterministic Skip Lists

Since we are introducing object expiration support for hash objects, we cannot allow sorting all items by their TTL since this would imply logarithmic complexity on HashTable operations. We could, however, attempt to achieve approximate constant complexity when modifying Objects. This is done by bounding the number of items stored in a sorted data structure.

In order to manage Objects in a sorted data structure, we would first need to extend the object metadata:
image (8)
Note that we added the Next pointer to the object metadata (now each object overhead is 16 bytes when expiration time is set). We will manage the objects in a data structure known as a deterministic skip-list. Unlike regular skip-list which uses random choosing for the size of the “fat-keys”, deterministic skip list is always reconstructed to support the optimal list (or tree) structure. For example deterministic skip lists are isomorphism to the Btree family of trees . Specifically B+ trees can usually be implemented using a deterministic skip list.

image (9)

Skip lists have several disadvantages though. They require extra space to manage pointers between the middle layer nodes and even though they offer logarithmic complexity for all operations (search, insertion, deletion) they usually involve lots of memory dereferencing which can turn out to be expensive.
One option to reduce the memory overhead of maintaining the middle nodes, is to make use of the optimal list structure and manage it in a single array. Following this idea (link) we always preallocate a certain amount of memory per fast lane based on a hypothetical maximum t of keys. As long as n < t, all inserts can be managed inside the data structure:
image (10)

This helps us maintain a bounded memory overhead per Object (16 bytes Metadata overhead + 8 / p, where p is the allowed gap). For example if we allow p = 8, we will get memory overhead of ~17 Bytes and if we allow p = 4 we will have a memory overhead of ~18 bytes etc...
This implementation also allows future optimization which makes use of AVX/SIMD instructions for fast lookups in the fast lane (internal nodes).

How do we limit the number of objects?

As mentioned before, we cannot allow logarithmic complexity. this means we will have to keep the list small enough (say 512 elements). But what happens when we have more than 512 elements with expiry data?
For that we will manage The RAX data structure mentioned before. Whenever a skip list has grown over the maximum amount of Objects we will split it to 2 lists. SkipLists are great in supporting range split and merges. We can locate the median value in a logarithmic time and since the list is already sorted, splitting it mainly requires allocating and memmoving the inner nodes (which is indeed a linear operation, but it is done only on rare cases where we need to split the list). The two resulting lists are mapped by a RAX structure which always holds the minimal TTL and pointer value in each of the lists and points to the matching skiplist.

Option 3 (Recommended) Hash Time Ranges

When sorting the data is not possible, we can think of a way to impose semi-sorting on the hash elements by assigning them to buckets. Each bucket will correspond to a specific time range, so that all keys with TTL in the specific bucket time range will we placed in that bucket. The bucket time ranges should be carefully considered though. If we choose high resuolution buckets, we might risk loosing memory efficiency due to many buckets holding very few elements sharing the bucket metadata. If we choose low resolution bucket sizes we might miss reclaiming memory for a long period of time.
Heirerical Timer Wheelsis a tequnique presented back at 1987 to manage large number of timer operations. Since then it was adopted by many applications (including Apache Kafka and the linux kernel up to some point) to manage scheduled tasks.
The idea behind timer wheels is to keep events closer in time in a higher resolution while events which are scheduled longer in the future can be kept in lower resolution. The scan is always performed on the highest resolution layer. whenever we complete scanning a full cycle of a lower resolution scan we “cascade down” the items from the higher resolution layer to the higher layer. The benefits of this design is that

  1. We can bound the number of low resolution buckets so that even if we have less elements per bucket they will not live for a long time (up to the total range of the higher resolution layer)
  2. The higher we extend the buckets resolution, we increase the probability of placing many items in the same bucket.
  3. When scanning items in the highest resolution layer, we only iterate over items which are expiered.

The main issue with timer wheels is the cascading operation which might be costly since it will require to iterate over all the items in the cascaded bucket. It is important to note that the higher the level resolution gets, the less frquent this cascading operation needs to take place and it can also increase the probability that these items will be lazily deleted reducing the need to actively iterate and expire them.
I will not get into the details but there are many ways we can optimize the cascading process. For example we can improve memory access by batch prefetching the items. We can also track the size, mean and variance in each bucket and take cascading decitions based on them.

Hash Timer buckets

image (11)

The bucket structure

In order to allow fast insertions and deletions of elements, we will make use of the new hashtable stracture.
Each bucket will be pointed by a RAX entry indicating the start time of the specific time range. The elements inside the bucket will be kept in a hashtable structure. In order to manage and save memeory of very small buckets (eg less then X itemes where X=13) we can manage the items pointers in a listpack.

image (12)

What is the required load factor?

When using a bucket structure, It might be possible to maintain the elements using an intrussive doubly link list, where the prev and next pointers are imbedded in the hkey metadata. However this will require overhead of 24 bytes per hkey with TTL which might be optimized better. In order to make sure we maintain good memory efficiency we will need to make sure to extend the allowed hashtable loadfactor.

For example the following expected memory overhead estimation was done based on poisson distribution of items:
image (13)

The following chart is a statistic evaluation of the expected extra metadata cost per hash item given a specific load factor:
image (14)

These are the current items I plan to focus my PoC on.

Some other issues which will require addressing:

  • Active expiry cycle and credits - The current expiry cron is assigning CPU cycles for expire key scaning according to the ratio of expired to sampled elements, so when many elements were scanned but few expired it assumes less to-be-expired elements. I think we can present somewhat better huristic which nakes use of the Binary Index Tree (BIT) so that the cron will always have the ability to perform raw evaluation of the number of keys already expired.

  • Defrag of Items with keys
    Since in our proposed design (all alternative) we never need to search the expired elements by the key name we will limit the key data saved to the 8 byte pointer address. however this will probably require carefull handling of defrag which will always remove and insert the key again to the TTl tracking data structure.

  • Writable Replica keys expire
    TODO, but the current plan is to avoid managing active expiry for writable replicas.

aiven-sal added a commit to valkey-io/valkey-py that referenced this issue Nov 27, 2024
Support for this commands may come in the future[1].
But it will take some time, so for now it's better
to drop them.

This is a breaking change for 6.1.

Close #78

[1]: valkey-io/valkey#640

Signed-off-by: Salvatore Mesoraca <[email protected]>
@zuiderkwast
Copy link
Contributor

Shall we also support the EXPIREMEMBER syntax from KeyDB? That was the first implementation of this feature so I think we can respect the original syntax and also provide improved migration path for keydb users.

Rafiot pushed a commit to Rafiot/valkey-py that referenced this issue Dec 3, 2024
Support for this commands may come in the future[1].
But it will take some time, so for now it's better
to drop them.

This is a breaking change for 6.1.

Close valkey-io#78

[1]: valkey-io/valkey#640

Signed-off-by: Salvatore Mesoraca <[email protected]>
Signed-off-by: Raphaël Vinot <[email protected]>
@zuiderkwast
Copy link
Contributor

@ranshid I've read your proposal above about I've read about the HKEY, intrusive sds abstraction, now. Sds metadata sounds good. Probably put in the front of the string to avoid extra memory lookups. If the key is very large, we can make space for the TTL in advance to avoid the need to reallocate it. I did that in the keyspace robj already, in #1186.

Some dictionaries use objects with embedded keys were the metadata can be set as part of the object. However that would require every dictionary which needs TTL support to use objects with embedded keys and might significantly complicate existing code paths as well as require extra memory in order hold the object metadata.

I think this is slightly irrelevant and a bit wrong. The hashtable doesn't require keys to be embedded in anything. We could just use a {key, value} struct as an entry for a hash field-value if we wanted. I want to replace the dict with hashtable for sets (#1094), sorted sets (#1096) and hashes (#1095) anyway, regardless of TTL. It doesn't significantly complicate the code. Only the keyspace is complex.

Each hashtable application is slightly different though:

  • Sets use sds as the element. If we want TTL for set elements, the sds metadata sounds like a very good idea. Alternatively, a wrapper struct with an embedded sds string, which is very similar.

  • Sorted set has a skiplist node as a natural container that contains everything about an element. It's the natural entry when swapping dict for hashtable. I can see three possibilities here:

    1. put TTL in the element (ele) sds metadata
    2. put the entire skiplist node in the sds metadata
    3. optional TTL in the skiplist node.

    Which one is better?

  • Hashes have field and value that are often both quite small. It would be very good to be able to have both embedded in the same cache line. In your suggestion above, I only see ttl and value-pointer in the field's sds metadata. Could we embed the entire value contents in the sds metadata of the field? Or should we use another wrapper struct where we can embed field, value and ttl?

  • Keyspace uses robj with embedded key, ttl and optionally also the value (EMBSTR). Would it make sense to embed the entire robj header in the key's sds metadata?

@ranshid
Copy link
Member

ranshid commented Dec 8, 2024

Sorted set has a skiplist node as a natural container that contains everything about an element. It's the natural entry when swapping dict for hashtable. I can see three possibilities here:

put TTL in the element (ele) sds metadata
put the entire skiplist node in the sds metadata
optional TTL in the skiplist node.

I think a combination of all options is the better solution.
We can use the suggested hkey to embed all the skiplist +TTl +score data in the case of zset and just TTL for sets etc...
In my current in progress PoC I allocate all the data including a skiplist fatkey data (for the skiplist) at the end or start of the sds since it is immutable. The hkey itself is basically a skiplist node since it can always access the score, levels and optionally TTL.

@binaryfire
Copy link

Shall we also support the EXPIREMEMBER syntax from KeyDB? That was the first implementation of this feature so I think we can respect the original syntax and also provide improved migration path for keydb users.

We've got several apps using KeyDB. EXPIREMEMBER support would make Valkey a drop-in replacement for KeyDB, which would be great.

One question - will empty sets be automatically deleted (i.e. after the last member expires)? I'm pretty sure that's how KeyDB does it. Otherwise we'd need to implement some kind of manual cleanup which would add unnecessary complication.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
client-changes-needed Client changes are required for this feature enhancement New feature or request
Projects
Status: Idea
Development

No branches or pull requests

7 participants