-
Notifications
You must be signed in to change notification settings - Fork 268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
filecache: use os.CreateTemp to create files #2088
Conversation
Signed-off-by: Achille Roussel <[email protected]>
Signed-off-by: Achille Roussel <[email protected]>
I submitted a second commit after realizing that we could get rid of the in-memory synchronization in the file cache and rely exclusively on the atomicity guarantees offered by the file system. Besides the code simplification, relying on the file system also offers synchronization that works when multiple processes share the same cache directory. |
so, the assumption essentially is A) reads the file, validates the header in other words, if this is correct we should be able to reproduce the failure on CI by crafting a cache entry with correct metadata but random or zeroed data in the body In a quick experiment I commented out lines 206-210 in: wazero/internal/engine/compiler/engine_cache.go Lines 201 to 217 in 99c057b
i.e. the cache now always returns a right-sized buffer with whatever garbage the trick is similar for wazevo in wazero/internal/engine/wazevo/engine_cache.go Lines 225 to 229 in 330be2b
and the resulting failures look compatible to me trace for `compiler`
trace for wazevo
I think it might be worth hardening the file structure on disk adding a "trailer" after the body; i.e. the file structure would become (more or less):
notice that wazevo data comes with a trailing "sourceMap" but this is skipped unless a specific flag in Fn_HEAD the TRAILER might contain either a checksum of BODY or just a repetition of the LEN (found already in EXEC_HEADER). The second is cheaper because it's enough to check LEN against the LEN found in EXEC_HEADER, but it's a bit less safe. (I am not implying this should be part of the PR, just reasoning out loud) |
Hardening of the file format sounds great to me, I would add the checksum even if it's more compute, the in-memory cache should amortize that cost quite effectively, something like CRC would probably be good enough? That being said, this PR would still fix the problem as-is since it makes the insertion of files in the cache an atomic operation, so there is never a conditions where opening a file would cause the truncation of an existing one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to add a checksum, Castagnoli has hardware support in hash/crc32
.
But this PR should also fix the race, as long as no one corrupts the cache on purpose.
Signed-off-by: Achille Roussel <[email protected]>
Signed-off-by: Achille Roussel <[email protected]>
This might be a fix for #2039
Since #1816, the file cache appended
.tmp
to the file path name to avoid leaving incomplete files in the cache when an error occurs. However, this is the definition ofos.Create
:File creation does not include the
O_EXCL
flag, which means that when the file already exists, it is opened and truncated.This PR changes the file cache to use
os.CreateTemp
instead, which guarantees that the file name will be unique by injecting a random string in the file name and opening the file withO_EXCL
to prevent racing calls that would accidentally use the same name from both suceeding:In #2039, we observe errors where the memory mapping containing the code was getting corrupted somehow. We first assumed that it was caused by a lifecycle issue on the compiled modules that hold these memory segments. The hypothesis I'm making with this change is that concurrent calls to
os.Create
for the same compiled module race each other; when the second call occurs, it truncates the content that was being written by the first call. After that, the program cannot rely on the content of the file anymore.I don't have a test to verify the hypothesis, but regardless, I believe this change is useful to make because it addresses a potential race condition that will manifest sooner or later.