
add parameter for limiting maximum AOF files size on disk #1425

Closed
wants to merge 101 commits

Conversation

kronwerk
Contributor

One may want to limit AOF disk usage not by the disk's real size but by some tailored value; this PR makes that possible.
Possible use cases:

  1. filling the disk up with AOF to its real limit may break something in the user's environment (logs, watchdogs, anything) - setting a lower value prevents that completely;
  2. setting this new limit to some low value may let the "disk filling up" problem resolve itself automatically for users who don't care much about disk cost:
  • intense writes may outpace the automatic AOF rewrite under some user settings at some moment;
  • the AOF then reaches the configured limit, which is lower than the full disk (e.g. 50-60% of disk);
  • eventually the automatic AOF rewrite kicks in once more and finally fixes that.
    On a 100% full disk (as happens today) that automatic recovery is impossible because there is no space for temporary AOF rewrite files (0 bytes, in fact) - one has to grow the disk under a running database, which makes some people a bit nervous, as I've noticed, and in some cases isn't convenient or even possible.
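For illustration, a minimal sketch of how this could look in valkey.conf, assuming the directive is named `aof-max-size` as in the commit messages in this PR (0 would keep the current behavior of being limited only by real disk capacity):

```
# cap total AOF size at roughly half of a 100 GB data disk
aof-max-size 50gb
```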

Fixes #540

Signed-off-by: kronwerk <[email protected]>

improved aof-max-size tests

Signed-off-by: kronwerk <[email protected]>
@kronwerk kronwerk marked this pull request as draft December 11, 2024 14:02
Signed-off-by: kronwerk <[email protected]>

codecov bot commented Dec 11, 2024

Codecov Report

Attention: Patch coverage is 84.09091% with 427 lines in your changes missing coverage. Please review.

Project coverage is 70.86%. Comparing base (1acf7f7) to head (1f3dd5c).
Report is 100 commits behind head on unstable.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/module.c | 5.21% | 109 Missing ⚠️ |
| src/valkey-benchmark.c | 2.80% | 104 Missing ⚠️ |
| src/networking.c | 91.46% | 36 Missing ⚠️ |
| src/scripting_engine.c | 74.77% | 28 Missing ⚠️ |
| src/io_threads.c | 0.00% | 24 Missing ⚠️ |
| src/cluster_legacy.c | 90.41% | 14 Missing ⚠️ |
| src/rdb.c | 79.71% | 14 Missing ⚠️ |
| src/module.h | 0.00% | 13 Missing ⚠️ |
| src/defrag.c | 89.32% | 11 Missing ⚠️ |
| src/replication.c | 95.26% | 10 Missing ⚠️ |
| ... and 17 more | | |
Additional details and impacted files
```
@@             Coverage Diff              @@
##           unstable    #1425      +/-   ##
============================================
+ Coverage     70.71%   70.86%   +0.15%
============================================
  Files           119      121       +2
  Lines         64652    65180     +528
============================================
+ Hits          45717    46189     +472
- Misses        18935    18991      +56
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/adlist.c | 76.92% <100.00%> (+5.56%) | ⬆️ |
| src/allocator_defrag.c | 100.00% <100.00%> (ø) | |
| src/cluster.c | 89.21% <100.00%> (+0.70%) | ⬆️ |
| src/commands.def | 100.00% <ø> (ø) | |
| src/connection.h | 87.50% <ø> (-0.15%) | ⬇️ |
| src/eval.c | 57.06% <100.00%> (+0.04%) | ⬆️ |
| src/evict.c | 97.71% <100.00%> (ø) | |
| src/geo.c | 93.58% <100.00%> (ø) | |
| src/geohash_helper.c | 99.15% <ø> (ø) | |
| src/hyperloglog.c | 92.23% <ø> (ø) | |
| ... and 49 more | | |

... and 5 files with indirect coverage changes

@kronwerk kronwerk marked this pull request as ready for review December 11, 2024 17:57
src/aof.c Outdated
```c
ssize_t nwritten = 0, totwritten = 0, nonewritten = -1;

if (aof_max_size && (unsigned long long)aof_current_size >= aof_max_size) {
    errno = ENOSPC;
```
Member

The error message:
`Error writing to the AOF file: <ENOSPC>`
is misleading if the issue is not actual disk space but hitting the configured maximum file size. Consider logging a more explicit message, and using EFBIG instead of ENOSPC for clarity.

Contributor Author

I thought about something like that, but haven't found a better error code than ENOSPC. The one you offered is documented as follows (https://man7.org/linux/man-pages/man2/open.2.html):

> EOVERFLOW: pathname refers to a regular file that is too large to be opened. The usual scenario here is that an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 tried to open a file whose size exceeds (1<<31)-1 bytes; see also O_LARGEFILE above. This is the error specified by POSIX.1; before Linux 2.6.24, Linux gave the error EFBIG for this case.

That seems more misleading to me (imho) - we're not opening a file, we're writing to it, and we are genuinely out of space (even though the limit is not the real disk limit).

We could define our own errno value, a new one, to distinguish this from the real disk limit and build a more descriptive message from it higher up the call stack - but wouldn't that be overkill for such an issue?

Member

POSIX error codes aren't tied to specific system/library calls and can be applied broadly. In the example above, it was in the context of open. Applications can use their own interpretations; this is an example:
https://github.com/valkey-io/valkey/blob/unstable/src/module.c#L5427

I'm fine with using either error code, but there should be a custom error message to avoid misleading users into thinking it's a real disk-space issue if ENOSPC is used.
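A minimal sketch of that suggestion, assuming the PR's surrounding aof.c context (the `server.aof_max_size` field name and the log wording here are illustrative, not the PR's actual code):

```c
/* Illustrative sketch: distinguish the configured cap from real disk
 * exhaustion with EFBIG plus an explicit log line. */
if (server.aof_max_size && (unsigned long long)server.aof_current_size >= server.aof_max_size) {
    serverLog(LL_WARNING, "AOF reached the configured aof-max-size limit; "
                          "refusing further writes (not a disk-full condition).");
    errno = EFBIG;
    return nonewritten;
}
```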

@madolson madolson requested a review from soloestoy December 17, 2024 18:06
@madolson madolson added the major-decision-pending Major decision pending by TSC team label Dec 17, 2024
@kronwerk kronwerk marked this pull request as draft January 13, 2025 08:31
JimB123 and others added 19 commits January 27, 2025 18:35
Addresses valkey-io#1393

Changes:
* During AOF loading or long running script, this allows defrag to be
initiated.
* The AOF defrag test was corrected to eliminate the wait period and
rely on non-timer invocations.
* Logic for "overage" time in defrag was changed. It previously
accumulated "underage", leading to large latencies in extreme tests with
very high CPU percentages: after several simple stages completed during
infrequent blocked processing, a large cycle time would be experienced.

Signed-off-by: Jim Brunner <[email protected]>
…ency when handshake timedout (valkey-io#1307)

In some cases, when meeting a new node, if the handshake times out, we
can end up with an inconsistent view of the cluster where the new node
knows about all the nodes in the cluster, but the cluster does not know
about this new node (or vice versa).
To detect this inconsistency, we now check if a node has an outbound
link but no inbound link, in this case it probably means this node does
not know us. In this case we (re-)send a MEET packet to this node to do
a new handshake with it.
If we receive a MEET packet from a known node, we disconnect the
outbound link to force a reconnect and sending of a PING packet so that
the other node recognizes the link as belonging to us. This prevents
cases where a node could send MEET packets in a loop because it thinks
the other node does not have an inbound link.

This fixes the bug described in valkey-io#1251.

---------

Signed-off-by: Pierre Turin <[email protected]>
This can happen when scripts are running for long period of time and the server attempts to defrag it in the whileBlockedCron.

Signed-off-by: Ran Shidlansik <[email protected]>
…ccess_key` (valkey-io#1363)

## Summary
This PR fixes valkey-io#1346 where we can get rid of the long term credentials by
using OpenID Connect. OpenID Connect (OIDC) allows your GitHub Actions
workflows to access resources in Amazon Web Services (AWS), without
needing to store the AWS credentials as long-lived GitHub secrets.

---------

Signed-off-by: vudiep411 <[email protected]>
…#1429)

This change makes the binary build on the target ubuntu version.

This PR also deprecates Ubuntu 18; the versions Valkey will now be built for are:

- X86:
  - Ubuntu 20
  - Ubuntu 22
  - Ubuntu 24
- ARM:
  - Ubuntu 20
  - Ubuntu 22

Removed ARM Ubuntu 24, as the action we are using for ARM builds does not
support Ubuntu 24.

---------

Signed-off-by: Roshan Khatri <[email protected]>
…key-io#1430)

This update addresses several issues in defrag:
1. In the defrag redesign
(valkey-io#1242), a bug was introduced
where `server.cronloops` was no longer being incremented in the
`whileBlockedCron()`. This resulted in some memory statistics not being
updated while blocked.
2. In the test case for AOF loading, we were seeing errors due to defrag
latencies. However, running the math, the latencies are justified given
the extremely high CPU target of the testcase. Adjusted the expected
latency check to allow longer latencies for this case where defrag is
undergoing starvation while AOF loading is in progress.
3. A "stage" is passed a "target". For the main dictionary and expires,
we were passing in a `kvstore*`. However, on flushall or swapdb, the
pointer may change. It's safer and more stable to use an index for the
DB (a DBID). Then if the pointer changes, we can detect the change, and
simply abort the stage. (If there's still fragmentation to deal with,
we'll pick it up again on the next cycle.)
4. We always start a new stage on a new defrag cycle. This gives the new
stage time to run, and prevents latency issues for certain stages which
don't operate incrementally. However, often several stages will require
almost no work, and this will leave a chunk of our CPU allotment unused.
This is mainly an issue in starvation situations (like AOF loading or
LUA script) - where defrag is running infrequently, with a large
duty-cycle. This change allows a new stage to be initiated if we still
have a standard duty-cycle remaining. (This can happen during starvation
situations where the planned duty cycle is larger than the standard
cycle. Most likely this isn't a concern for real scenarios, but it was
observed in testing.)
5. Minor comment correction in `server.h`

Signed-off-by: Jim Brunner <[email protected]>
Fixes four cases where `stringmatchlen` could overrun the pattern if it
is not terminated with NUL.

These commits are cherry-picked from my
[fork](https://github.com/thaliaarchi/antirez-stringmatch) which
extracts `stringmatch` as a library and compares it to other projects by
antirez which use the same matcher.
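The overrun class being fixed has this general shape in escape handling (a compilable sketch of the safe pattern, not the exact patch):

```c
/* Sketch: consume the backslash and the escaped byte only when the
 * remaining length allows it, so a pattern lacking a trailing NUL is
 * never read past its end. */
static void consume_escape(const char **pattern, int *patternLen) {
    if (*patternLen >= 2 && **pattern == '\\') {
        (*pattern)++;    /* skip the backslash...                 */
        (*patternLen)--; /* ...and treat the next byte literally. */
    }
}
```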

Signed-off-by: Thalia Archibald <[email protected]>
The CI job was introduced in valkey-io#1363, we should skip it in forks.

Signed-off-by: Binbin <[email protected]>
Introduced in valkey-io#1363, the file name does not match.

Signed-off-by: Binbin <[email protected]>
We deprecate the usage of classic malloc and free, but under certain
circumstances they might get imported from intrinsics. The original
thought is we should just override malloc and free to use zmalloc and
zfree, but I think we should continue to deprecate it to avoid
accidental imports of allocations.

Closes valkey-io#1434.

---------

Signed-off-by: Madelyn Olson <[email protected]>
The creation of fragmentation is delayed when we use lazy-free. You can
induce some of the active-defrag tests to fail by artificially adding a
delay in the lazyfree process, similar to the issues seen in valkey-io#1433 and
issues like
https://github.com/valkey-io/valkey/actions/runs/12267010712/job/34226304803#step:7:6538.
The solution is to always do sync free during tests.

Might close valkey-io#1433.

Signed-off-by: Madelyn Olson <[email protected]>
The new `hashtable` provides faster lookups and uses less memory than
`dict`.

A TCL test case "SRANDMEMBER with a dict containing long chain" is
deleted because it's covered by a hashtable unit test
"test_random_entry_with_long_chain", which is already present.

This change also moves some logic from dismissMemory (object.c) to
zmadvise_dontneed (zmalloc.c), so the hashtable implementation which
needs the dismiss functionality doesn't need to depend on object.c and
server.h.

This PR follows valkey-io#1186.

---------

Signed-off-by: Rain Valentine <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
…alkey-io#1356)

This is a follow-up of valkey-io#1305; we have now decided to apply the same
change to automatic failover as well, that is, to move forward with
removing it for both automatic and manual failovers.

Quote from Ping during the review:
Note that we already debounce transient primary failures with node
timeout, ensuring failover is only triggered after sustained outages.
Election timing is naturally staggered by replica spacing, making the
likelihood of simultaneous elections from replicas of the same shard
very low. The one-vote-per-epoch rule further throttles retries and
ensures orderly elections. On top of that, quorum-based primary failure
confirmation, cluster-state convergence, and slot ownership validation
are all built into the process.

Quote from Madelyn during the review:
It's against the specific primary. It's to prevent double failovers.
If a primary just took over, we don't want someone else to try to
take over and want to give the new primary some amount of time to take over.
I have not seen this issue though, it might have been over-optimizing?
The double failure mode, where a node fails and then another node fails
within the node timeout, also doesn't seem that common either though.

So the conclusion is that we all agreed to remove it completely,
it will make the code a lot simpler. And if there is other specific
edge cases we are missing, we will fix it in other way.

See discussion valkey-io#1305 for more information.

Signed-off-by: Binbin <[email protected]>
…o#1436)

After valkey-io#1307 got merged, we noticed an assertion failure in setClusterNodeToInboundClusterLink:
```
=== ASSERTION FAILED ===
==> '!link->node' is not true
```

In valkey-io#778, we call setClusterNodeToInboundClusterLink to attach the node to the link
during MEET processing, so if we receive another MEET packet in a short time, while the
node is still in the handshake state, we hit this assert and crash the server.

If the link is bound to a node, the node is in the handshake state, and we receive a
MEET packet, it may be that the sender sent multiple MEET packets, so here we drop the
MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert happens when
the other node sends a MEET packet because it detects that there is no inbound link:
this node creates a new node in HANDSHAKE state (with a random node name) and responds
with a PONG; the other node receives the PONG and removes the CLUSTER_NODE_MEET flag.
This node is supposed to open an outbound connection to the other node in the next cron
cycle, but before this happens, the other node re-sends a MEET on the same link because
it still detects no inbound connection.

Note that in getNodeFromLinkAndMsg, a node in the handshake state has a random name and
is not truly "known", so we don't know the sender. Dropping the MEET packet prevents us
from creating a random node, avoids incorrect link binding, and avoids a duplicate MEET
packet disrupting the handshake.

Signed-off-by: Binbin <[email protected]>
We set the secret as `AWS_S3_TEST_BUCKET` for the test bucket, and I
missed it in the initial review.

Signed-off-by: Roshan Khatri <[email protected]>
Avoid tmpfs as fadvise(FADV_DONTNEED) has no effect on memory-backed
filesystems.

Fixes valkey-io#897
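For context, the advisory call at issue is the standard POSIX one; a minimal sketch of the pattern (illustrative, not Valkey's actual wrapper):

```c
#include <fcntl.h>

/* Hint the kernel to drop cached pages for a file. On a disk-backed
 * filesystem this releases page cache; on tmpfs the pages are the
 * backing store itself, so the hint is effectively a no-op. */
static int drop_page_cache(int fd) {
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```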

---------

Signed-off-by: Ran Shidlansik <[email protected]>
Signed-off-by: ranshid <[email protected]>
Co-authored-by: ranshid <[email protected]>
Co-authored-by: Ran Shidlansik <[email protected]>
…hen allocator is not jemalloc (valkey-io#1303)

Introduce compile time option to force activedefrag to run even when
jemalloc is not used as the allocator.
This is in order to be able to run tests with defrag enabled
while using memory instrumentation tools.

fixes: valkey-io#1241

---------

Signed-off-by: ranshid <[email protected]>
Signed-off-by: Ran Shidlansik <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Signed-off-by: ranshid <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
Upgrade `typos` and fix corresponding typos

---------

Signed-off-by: Viktor Szépe <[email protected]>
Asan now supports making sure you are passing in the correct pointer
type, which seems useful but we can't support it since we pass in an
incorrect pointer in several places. This is most commonly done with
generic free functions, where we simply cast it to the correct type.

It's not a lot of code to clean up, so it seems appropriate to cleanup
instead of disabling the check.

---------

Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
enjoy-binbin and others added 27 commits January 27, 2025 18:35
… the FAIL (valkey-io#1191)

Imagine we have a cluster, for example a three-shard cluster,
if shard 1 doing a CLUSTER RESET HARD, it will change the node
name, and then other nodes will mark it as NOADR since the node
name received by PONG has changed.

In the eyes of other nodes, there is one working primary node
left but with no address, and in this case, the address report
in MOVED will be invalid and will confuse the clients. And in
the same time, the replica will not failover since its primary
is not in the FAIL state. And the cluster looks OK to everyone.

This leaves a cluster that appears OK, but with no coverage for
shard 1, obviously we should do something like CLUSTER FORGET
to remove the node and fix the cluster before using it.

But the point in here, we can mark the NOADDR node as FAIL to
advance the cluster state. If a node is NOADDR means it does
not have a valid address, so we won't reconnect it, we won't
send PING, we won't gossip it, it seems reasonable to mark it
as FAIL.

Signed-off-by: Binbin <[email protected]>
When the cluster changes, we need to persist the cluster configuration,
and these file IO operations may cause latency.

Signed-off-by: Binbin <[email protected]>
The commands used in the valkey-cli tests are not important for the reply schema
validation. Skip them to avoid the problem of tests hanging. This has
failed lately in the daily job:

```
[TIMEOUT]: clients state report follows.
sock55fedcc19be0 => (IN PROGRESS) valkey-cli pubsub mode with single standard channel subscription
Killing still running Valkey server 33357
```

These test cases use a special valkey-cli command, `:get pubsub`, which
is internal to valkey-cli rather than a Valkey server command. This
command hangs when compiled with logreqres enabled. The easy solution is
to skip these tests in this setup.

The test cases were introduced in valkey-io#1432.

Signed-off-by: Viktor Söderqvist <[email protected]>
After valkey-io#1545 disabled some tests for reply schema validation, we now have
another issue that ECHO is not covered.

```
WARNING! The following commands were not hit at all:
  echo
ERROR! at least one command was not hit by the tests
```

This patch adds a test case for ECHO in the unit/other test suite. I
haven't checked if there are more commands that aren't covered.

Signed-off-by: Viktor Söderqvist <[email protected]>
This PR replaces dict with the new hashtable data structure in the HASH
datatype. There is a new struct for hashtable items, which contains a
pointer to the value sds string and the embedded key sds string. These
values were previously stored in dictEntry. The struct is kept opaque so
we can easily add small-value embedding or other optimizations in the
future.
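A sketch of the shape being described, purely illustrative since the real struct is kept opaque (the struct and field names here are assumptions):

```c
/* Illustrative only: one allocation holds the value pointer plus the
 * key sds (header and bytes) embedded at the tail. */
struct hashTypeEntry {
    sds value;                /* pointer to the value sds string   */
    unsigned char key_data[]; /* embedded key sds, header included */
};
```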

closes valkey-io#1095

---------

Signed-off-by: Rain Valentine <[email protected]>
In some cases unix groups could have whitespace and/or `\` in them.
One example is my workstation. It's a MacOS in an Active Directory
domain. So my user has group `LD\Domain Users`.
Running `make test` on `unstable` and `8.0` branches fails with:

I'm not sure if we need to fix this in 8.0. But it seems that it should
be fixed in unstable.

Signed-off-by: secwall <[email protected]>
`sds` is a typedef of `char *`.

`const sds` means `char * const`, i.e. a const-pointer to non-const
content.

More often, you would want `const char *`, i.e. a pointer to
const-content. Until now, it's not possible to express that. This PR
adds `const_sds` which is a pointer to const-content sds.

To get a const-pointer to const-content sds, you can use `const
const_sds`.
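A sketch of the resulting typedefs and what each `const` position means (the `sds` typedef is the existing one from sds.h):

```c
typedef char *sds;             /* existing: mutable content            */
typedef const char *const_sds; /* new: pointer to const content        */

void a(const sds s);       /* char *const: pointer fixed, *s mutable   */
void b(const_sds s);       /* const char *: content is read-only       */
void c(const const_sds s); /* const char *const: both are read-only    */
```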

In this PR, some uses of `const sds` are replaced by `const_sds`. We can
use it more later.

Fixes valkey-io#1542

---------

Signed-off-by: Viktor Söderqvist <[email protected]>
Add `paused_actions` and `paused_timeout_milliseconds` for INFO Clients
to inform users about if clients are paused.

---------

Signed-off-by: zhaozhao.zz <[email protected]>
Adds filter options to CLIENT LIST:

    * USER <username>
      Return clients authenticated by <username>.
    * ADDR <ip:port>
      Return clients connected from the specified address.
    * LADDR <ip:port>
      Return clients connected to the specified local address.
    * SKIPME (YES|NO)
      Exclude the current client from the list (default: no).
    * MAXAGE <maxage>
      Only list connections older than the specified age.

Modifies the ID filter to CLIENT KILL to allow multiple IDs

    * ID <client-id> [<client-id>...]
      Kill connections by client ids.


This makes CLIENT LIST and CLIENT KILL accept the same options.

For backward compatibility, the default value for SKIPME is NO for
CLIENT LIST and YES for CLIENT KILL.

The MAXAGE option comes from CLIENT KILL, where it *keeps* clients up to the
given max age and kills the older ones. This logic becomes weird for
CLIENT LIST, but is kept for similarity with CLIENT KILL, for the use case
of first testing manually using CLIENT LIST, and then running CLIENT
KILL with the same filters.

The `ID client-id [client-id ...]` no longer needs to be the last
filter. The parsing logic determines if an argument is an ID or not
based on whether it can be parsed as an integer or not.
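For example, sessions using the new filters might look like this (addresses and IDs are illustrative):

```
127.0.0.1:6379> CLIENT LIST USER default MAXAGE 3600
127.0.0.1:6379> CLIENT LIST LADDR 127.0.0.1:6379 SKIPME yes
127.0.0.1:6379> CLIENT KILL ID 11 24 37
```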

Partly addresses: valkey-io#668

---------

Signed-off-by: Sarthak Aggarwal <[email protected]>
Just like the spell-check workflow, we should allow triggering it
on push events, so that fork repos can notice formatting problems
well before submitting a PR.
Signed-off-by: Binbin <[email protected]>
…d other commands(valkey-io#1517)

Some commands that use unix-time, such as `EXPIREAT` and `SET EXAT`, should include the deleted keys in the `expired_keys` statistics if the specified time has already expired, and notifications should be sent in the same manner as for expirations.
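For example (an illustrative session; `1` is a unix timestamp far in the past):

```
127.0.0.1:6379> SET key val EXAT 1
OK
127.0.0.1:6379> EXISTS key
(integer) 0
```

The key is gone right away, now counted in `expired_keys` rather than as an explicit delete, and an "expired" keyspace notification fires.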

---------

Signed-off-by: Ray Cao <[email protected]>
…1312)

This commit creates a new compilation unit for the scripting engine code
by extracting the existing code from the functions unit.
We're doing this refactor to prepare the code for running the `EVAL`
command using different scripting engines.

This PR has a module API change: we changed the type of error messages
returned by the callback
`ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc` to be a
`ValkeyModuleString` (aka `robj`).
This PR also fixes valkey-io#1470.

---------

Signed-off-by: Ricardo Dias <[email protected]>
…1563)

This issue affected only two message types (CLUSTERMSG_TYPE_PUBLISH and CLUSTERMSG_TYPE_PUBLISHSHARD) because they used a light message header, which caused the CLUSTER INFO stats to miss sent/received message information for those types.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
Signed-off-by: Harkrishn Patro <[email protected]>
Co-authored-by: Binbin <[email protected]>
When processing a cluster bus PING extension, there is a memory leak
when adding a new key to the `nodes_black_list` dict. We now make sure
to free the key `sds` if the dict did not take ownership of it.

Signed-off-by: Pierre Turin <[email protected]>
Update comments and log message in `cluster_legacy.c`.

Follow-up from valkey-io#1441.

Signed-off-by: Pierre Turin <[email protected]>
Co-authored-by: Ping Xie <[email protected]>
Co-authored-by: Binbin <[email protected]>
…#1584)

The test case checks for expire-cycle in LATENCY LATEST, but with the
new hash table, the expiry-cycle is too fast to be logged by latency
monitor. Lower the latency monitor threshold to make it more likely to
be logged.

Fixes valkey-io#1580

---------

Signed-off-by: Viktor Söderqvist <[email protected]>
…#1586)

Change the format of the dual-channel replication logs so that it will
not conflict with existing log formats like modules.

Fixes: valkey-io#1509

Signed-off-by: Ran Shidlansik <[email protected]>
The desync regression test was created as a regression test for the
following bug: if we embed NUL termination inside an inline/multi-bulk
message, we will not be able to perform strchr to identify the
newline (\n) or carriage return (\r) in the client query buffer.
This can cause (for example) a replica reading the primary stream to keep
filling its query buffer endlessly, consuming more and more memory.

In order to handle the above risk, a check was added to verify that the
inline bulk and multi-bulk sizes do not exceed 64K bytes in the
query buffer. A test was placed in order to verify this.

This PR introduces the following fixes to the desync regression test:
1. Fix the sent payload to flush 1024-byte blocks of 'A's instead of
'payload', which was sent by mistake.
2. Make sure that the connection is correctly terminated by the server on
protocol error after exceeding the 64K limit, rather than well beyond it.
3. Add another test which also verifies nested bulk with embedded NUL
termination (this was not verified before).

fixes valkey-io#1583


NOTE: Although it is possible to change the use of strchr to a "safer"
utility (e.g. memchr) which will not stop scanning at the first occurrence
of '\0', we still want to protect against excessive usage of the
query buffer and also preserve the current behavior(?). We will look
into improving this in a follow-up issue.
Signed-off-by: ranshid <[email protected]>
remove socket nonblocking and simplify the validation

fixes valkey-io#1592

Signed-off-by: ranshid <[email protected]>
This includes a way to run two versions of the server from the TCL test
framework. It's a preparation to add more cross-version tests. The
runtest script accepts a new parameter

    ./runtest --other-server-path path/to/valkey-server

and a new tag "needs:other-server" for test cases and start_server.
Tests with this tag are automatically skipped if `--other-server-path`
is not provided.

This PR adds it in a CI job with Valkey 7.2.7 by downloading a binary
release.

Fixes valkey-io#76

---------

Signed-off-by: Viktor Söderqvist <[email protected]>
This PR builds upon the [previous entry prefetching
optimization](valkey-io#1501) to further
enhance performance by implementing value prefetching for hashtable
iterators.

## Implementation
Modified `hashtableInitIterator` to accept a new flags parameter,
allowing control over iterator behavior.
Implemented conditional value prefetching within `hashtableNext` based
on the new `HASHTABLE_ITER_PREFETCH_VALUES` flag.
When the flag is set, hashtableNext now calls `prefetchBucketValues` at
the start of each new bucket, preemptively loading the values of filled
entries into the CPU cache.
The actual prefetching of values is performed using type-specific
callback functions implemented in `server.c`:
- For `robj`, the `hashtableObjectPrefetchValue` callback is used to
prefetch the value if it is not embedded.

This implementation is specifically focused on main database iterations
at this stage. Applying it to hashtables that hold other object types
should not be problematic, but its performance benefits for those cases
will need to be proven through testing and benchmarking.
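A sketch of the call pattern described above (signatures abbreviated from this description; the names `ht` and `process` are placeholders):

```c
/* Illustrative iteration with value prefetching enabled. */
hashtableIterator iter;
hashtableInitIterator(&iter, ht, HASHTABLE_ITER_PREFETCH_VALUES);
void *entry;
while (hashtableNext(&iter, &entry)) {
    /* Entering a new bucket triggered prefetchBucketValues(), so the
     * value dereferenced here is likely already in the CPU cache. */
    process(entry);
}
hashtableResetIterator(&iter);
```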

## Performance

### Setup
- 64-core Graviton 3 Amazon EC2 instance.
- 50 million keys with different value sizes.
- Valkey server running over a RAM file system.
- CRC checksum and compression off.

### Action
- SAVE command.

### Results
The durations of the SAVE command below were taken from the INFO ALL command.
```
+--------------------+------------------+------------------+ 
| Prefetching        | Value size (byte)| Time (seconds)   | 
+--------------------+------------------+------------------+ 
| No                 | 100              | 20.112279        | 
| Yes                | 100              | 12.758519        | 
| No                 | 40               | 16.945366        | 
| Yes                | 40               | 10.902022        |
| No                 | 20               | 9.817000         | 
| Yes                | 20               | 9.626821         |
| No                 | 10               | 9.71510          | 
| Yes                | 10               | 9.510565         |
+--------------------+------------------+------------------+
```
The results largely align with our expectations, showing significant
improvements for larger values (100 bytes and 40 bytes) that are stored
outside the robj. For smaller values (20 bytes and 10 bytes) that are
embedded within the robj, we see almost no improvement, which is as
expected.

However, the small improvement observed even for these embedded values
is somewhat surprising. Given that we are not actively prefetching these
embedded values, this minor performance gain was not anticipated.

perf record on save command **without** value prefetching:
```
                --99.98%--rdbSaveDb
                          |          
                          |--91.38%--rdbSaveKeyValuePair
                          |          |          
                          |          |--42.72%--rdbSaveRawString
                          |          |          |          
                          |          |          |--26.69%--rdbWriteRaw
                          |          |          |          |          
                          |          |          |           --25.75%--rioFileWrite.lto_priv.0
                          |          |          |          
                          |          |           --15.41%--rdbSaveLen
                          |          |                     |          
                          |          |                     |--7.58%--rdbWriteRaw
                          |          |                     |          |          
                          |          |                     |           --7.08%--rioFileWrite.lto_priv.0
                          |          |                     |                     |          
                          |          |                     |                      --6.54%--_IO_fwrite
                          |          |                     |                                         
                          |          |                     |          
                          |          |                      --7.42%--rdbWriteRaw.constprop.1
                          |          |                                |          
                          |          |                                 --7.18%--rioFileWrite.lto_priv.0
                          |          |                                           |          
                          |          |                                            --6.73%--_IO_fwrite
                          |          |                                                            
                          |          |          
                          |          |--40.44%--rdbSaveStringObject
                          |          |          
                          |           --7.62%--rdbSaveObjectType
                          |                     |          
                          |                      --7.39%--rdbWriteRaw.constprop.1
                          |                                |          
                          |                                 --7.04%--rioFileWrite.lto_priv.0
                          |                                           |          
                          |                                            --6.59%--_IO_fwrite
                          |                                                               
                          |          
                           --7.33%--hashtableNext.constprop.1
                                     |          
                                      --6.28%--prefetchNextBucketEntries.lto_priv.0
```
perf record on save command **with** value prefetching:
```
               rdbSaveRio
               |          
                --99.93%--rdbSaveDb
                          |          
                          |--79.81%--rdbSaveKeyValuePair
                          |          |          
                          |          |--66.79%--rdbSaveRawString
                          |          |          |          
                          |          |          |--42.31%--rdbWriteRaw
                          |          |          |          |          
                          |          |          |           --40.74%--rioFileWrite.lto_priv.0
                          |          |          |          
                          |          |           --23.37%--rdbSaveLen
                          |          |                     |          
                          |          |                     |--11.78%--rdbWriteRaw
                          |          |                     |          |          
                          |          |                     |           --11.03%--rioFileWrite.lto_priv.0
                          |          |                     |                     |          
                          |          |                     |                      --10.30%--_IO_fwrite
                          |          |                     |                                |          
                          |          |                     |          
                          |          |                      --10.98%--rdbWriteRaw.constprop.1
                          |          |                                |          
                          |          |                                 --10.44%--rioFileWrite.lto_priv.0
                          |          |                                           |          
                          |          |                                            --9.74%--_IO_fwrite
                          |          |                                                      |          
                          |          |          
                          |          |--11.33%--rdbSaveObjectType
                          |          |          |          
                          |          |           --10.96%--rdbWriteRaw.constprop.1
                          |          |                     |          
                          |          |                      --10.51%--rioFileWrite.lto_priv.0
                          |          |                                |          
                          |          |                                 --9.75%--_IO_fwrite
                          |          |                                           |          
                          |          |          
                          |           --0.77%--rdbSaveStringObject
                          |          
                           --18.39%--hashtableNext
                                     |          
                                     |--10.04%--hashtableObjectPrefetchValue
                                     |
                                      --6.06%--prefetchNextBucketEntries        

```
Conclusions:

The prefetching strategy appears to be working as intended, shifting the
performance bottleneck from data access to I/O operations.
The significant reduction in rdbSaveStringObject time suggests that
string objects (which are the values) are being accessed more
efficiently.

Signed-off-by: NadavGigi <[email protected]>
…alkey-io#1294)

As discussed in PR valkey-io#336.

We have different types of resources, like CPU, memory, and network. The
`slowlog` can only record commands that consume lots of CPU during the
processing phase (it doesn't include read/write network time), but it
cannot record commands that consume lots of memory or network bandwidth.
For example:

1. Running "SET key value(10 megabytes)" would not be recorded in the
slowlog, since when processing it the SET command only inserts the
value's pointer into the db dict. But that command consumes huge memory in
the query buffer and bandwidth from the network. In this case, just 1000
TPS can cause 10GB/s of network flow.
2. Running "GET key" where the key's value length is 10 megabytes: the
GET command can consume huge memory in the output buffer and bandwidth to
the network.

This PR introduces a new command `COMMANDLOG`, to log commands that
consume significant network bandwidth, including both input and output.
Users can retrieve the results using `COMMANDLOG get <count>
large-request` and `COMMANDLOG get <count> large-reply`, all subcommands
for `COMMANDLOG` are:

* `COMMANDLOG HELP`
* `COMMANDLOG GET <count> <slow|large-request|large-reply>`
* `COMMANDLOG LEN <slow|large-request|large-reply>`
* `COMMANDLOG RESET <slow|large-request|large-reply>`

And the slowlog is also incorporated into the commandlog.

For each of these three types, additional configs have been added for
control:

* `commandlog-request-larger-than` and
`commandlog-large-request-max-len` represent the threshold for large
requests(the unit is Bytes) and the maximum number of commands that can
be recorded.
* `commandlog-reply-larger-than` and `commandlog-large-reply-max-len`
represent the threshold for large replies(the unit is Bytes) and the
maximum number of commands that can be recorded.
* `commandlog-execution-slower-than` and
`commandlog-slow-execution-max-len` represent the threshold for slow
executions(the unit is microseconds) and the maximum number of commands
that can be recorded.
* Additionally, `slowlog-log-slower-than` and `slowlog-max-len` are now
set as aliases for these two new configs.
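Putting it together, an illustrative session (the threshold value is arbitrary):

```
127.0.0.1:6379> CONFIG SET commandlog-request-larger-than 1048576
OK
127.0.0.1:6379> COMMANDLOG GET 10 large-request
... (entries for commands whose requests exceeded 1 MB)
127.0.0.1:6379> COMMANDLOG RESET large-request
OK
```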

---------

Signed-off-by: zhaozhao.zz <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
Co-authored-by: Ping Xie <[email protected]>
Fixes reply-schema-validator test job which needs coverage for all
commands.

Failing job:
https://github.com/valkey-io/valkey/actions/runs/12969591890/job/36173810824

Signed-off-by: Viktor Söderqvist <[email protected]>
Fixes intermittent failures in the hashtable random fairness unit test
when running with the `--accurate` flag.

https://github.com/valkey-io/valkey/actions/runs/12969591890/job/36173815884#step:10:105

The test case picks a random element out of 400, repeated 1M times, and
then checks that 60% of the elements are picked within 3 standard
deviations from the number of times they're expected to be picked. In
this test run (with `--accurate`), the expected number is 2500 and the
standard deviation is 50, which is only 2% of the expected value. This
makes the check too strict and makes the test flaky.
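For reference, these numbers follow from the binomial standard deviation with n = 10^6 picks and per-element probability p = 1/400:

$$\sigma = \sqrt{n\,p\,(1-p)} = \sqrt{10^6 \cdot \tfrac{1}{400} \cdot \tfrac{399}{400}} \approx 49.9 \approx 50, \qquad \frac{\sigma}{n\,p} = \frac{50}{2500} = 2\%$$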

As an alternative, we allow 80% of the elements to be picked within 10%
of the expected number. With this alternative condition, we can also
raise the check for the non-edge case from 60% to 80% of the elements to
be within 3 standard deviations. (With fewer repetitions, 3 standard
deviations is greater than 10% of the expected value, so this new
condition only affects the `--accurate` test run.)

Additional change: Set a random seed to the hash function in the test
suite. Until now, we only seeded the random number generator.

Signed-off-by: Viktor Söderqvist <[email protected]>
Labels
major-decision-pending Major decision pending by TSC team
Development

Successfully merging this pull request may close these issues.

[NEW] Limit maximum size on disk of AOF files. Avoid disk full, long load times.