-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(clp): Add the write path for single-file archives. #646
base: main
Are you sure you want to change the base?
Conversation
Warning Rate limit exceeded@davemarco has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 0 minutes and 55 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
WalkthroughThis pull request introduces support for single-file archives in the CLP (Compressed Log Processing) system. The changes encompass the addition of new source files, modifications to command-line argument parsing, and updates to archive writing mechanisms. Users can now create archives as a single file instead of multiple files, with new configuration options and supporting infrastructure to manage this archiving method. Changes
Possibly Related PRs
Suggested Reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (12)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (5)
19-20
: Consider externalizing the read block size
The constantcReadBlockSize
is set to 4096. For different environments or performance optimizations, externalizing this block size into a configuration or a compile-time parameter might improve flexibility.
133-147
: Reassess large archive handling
A single-file archive exceedingcFileSizeWarningThreshold
triggers a warning, but it might be beneficial to add user guidance or a more detailed strategy for dealing with very large archives (e.g. automatically switching to multi-file).
179-190
: Potentially make version dynamic
Withinwrite_archive_header
, thecArchiveVersion
is hard-coded. If changes are expected in future releases, consider introducing a mechanism to set the version at build time or to read it from project configuration.
192-195
: Minor inefficiency with repeated.str()
calls
Repetitively callingpacked_metadata.str()
can lead to unnecessary string object creation. While not critical for smaller metadata, consider assigningpacked_metadata.str()
once to a local variable for more efficiency.
197-213
: Graceful recovery on partial reads
The loop for reading from the file and writing toarchive_writer
is straightforward and robust. For future improvements, consider adding logging or partial recovery in case of transient errors (e.g. a network file) instead of throwing an exception outright.components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
23-24
: Check necessity of newly added includes.Please confirm that both
Defs.hpp
andwriter.hpp
are required here. If these headers are no longer needed, consider removing them to reduce compilation overhead.
249-252
: Verify multi–file archive cleanup.When single–file archive mode is enabled, the archive transitions to a single–file format at closure. Ensure that the multi–file artefacts are either cleaned up or that the user is aware they remain. Inadvertent leftover files could confuse users.
341-342
: Revisit dictionary–driven splitting logic.Currently, splitting only occurs if
false == m_use_single_file_archive
. Consider whether large dictionary sizes warrant splitting in single–file archive mode too.components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (2)
50-55
: Assess return type for better clarity.Returning a
std::stringstream
fromcreate_single_file_archive_metadata
is straightforward, but consider a custom struct or type alias for readability and future extension.
65-69
: Recover from partial writes.
write_single_file_archive
can remove the existing multi–file archive. Think about potential rollback or error–handling strategies if the write fails partway.components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1)
16-21
: Confirm versioning approach.Your major/minor/patch shift bits are standard. Confirm that the product’s versioning policy aligns with these values if future releases require increments.
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
115-120
: Evolution of metadata.The newly introduced fields (
m_variable_encoding_methods_version
,m_variables_schema_version
, andm_compression_type
) address single–file archival requirements. Consider placing them in a derived metadata class if multi–file archives remain unaffected.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
components/core/src/clp/clp/CMakeLists.txt
(1 hunks)components/core/src/clp/clp/CommandLineArguments.cpp
(1 hunks)components/core/src/clp/clp/CommandLineArguments.hpp
(3 hunks)components/core/src/clp/clp/FileCompressor.cpp
(3 hunks)components/core/src/clp/clp/compression.cpp
(3 hunks)components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
(3 hunks)components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp
(1 hunks)components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
(1 hunks)components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
(1 hunks)components/core/src/clp/streaming_archive/writer/Archive.cpp
(5 hunks)components/core/src/clp/streaming_archive/writer/Archive.hpp
(4 hunks)
🧰 Additional context used
📓 Path-based instructions (10)
components/core/src/clp/clp/CommandLineArguments.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/clp/FileCompressor.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/clp/CommandLineArguments.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/clp/compression.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/writer/Archive.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (1)
components/core/src/clp/clp/FileCompressor.cpp (2)
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/clp/FileCompressor.hpp:58-78
Timestamp: 2024-11-10T16:46:53.300Z
Learning: When reviewing legacy code refactors, avoid suggesting changes that would extend the scope of the PR.
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/clp/FileCompressor.cpp:189-220
Timestamp: 2024-11-10T16:46:58.543Z
Learning: Ensure that before flagging functions like `parse_and_encode` for throwing exceptions while declared with `noexcept`, verify that the function is actually declared with `noexcept` to avoid false positives.
🔇 Additional comments (32)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (3)
100-112
: Validate file existence before attempting to get its size
While the exception handling is robust, it may be helpful to add an explicit check if the file exists before calling file_size
. This would provide a clearer error message if the file is missing and avoid filesystem errors.
114-147
: Confirm directory existence when retrieving file info
When assembling the file_infos
, consider verifying that the directory holding segments actually exists (e.g. segment_dir_path
). If it does not, an early error message might clarify that no segments were found.
256-285
: Ensure user awareness before deleting original archive
std::filesystem::remove_all(multi_file_archive_path)
irreversibly deletes the multi-file archive after writing the single-file archive. For safety, either confirm user intent or allow an option to retain the source.
Would you like a script to confirm the presence of any leftover files before full deletion?
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
16-16
: Confirm correct spdlog header include.
Typically the common header is <spdlog/spdlog.h>
rather than <spdlog.h>
. Please verify that this is intentional and that the correct symbols are available.
62-62
: Initialization looks fine.
Assigning user_config.use_single_file_archive
to m_use_single_file_archive
supports the new single–file archive feature as intended. No issues found.
662-679
: Handle edge cases for segment IDs and metadata.
- If
m_next_segment_id
is 0,m_next_segment_id - 1
becomes negative, potentially leading to unexpected behaviour inget_segment_ids()
. - Consider checking for empty or invalid
segment_ids
before proceeding. - Ensure that partial writes or exceptions during
write_single_file_archive
are either rolled back or leave a consistent state.
components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (2)
16-34
: Custom exception design appears consistent.
The OperationFailed
class neatly extends TraceableException
and provides a clear error message.
40-41
: Validate last_segment_id usage.
Please ensure that calling get_segment_ids(last_segment_id)
with zero or negative values is handled gracefully in the implementation.
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (6)
23-27
: Basic definitions look acceptable.
The magic number, file extension, and file–size warning threshold appear suitable for single–file archives.
28-35
: Static file handling looks straightforward.
cStaticArchiveFileNames
is a helpful container for known archive files. No issues observed here.
37-43
: Packed struct alignment caution.
__attribute__((packed))
on SingleFileArchiveHeader
can cause cross–platform alignment mismatches. Confirm that your usage environment supports it consistently.
45-49
: FileInfo struct.
No concerns: the usage of MSGPACK_DEFINE_MAP
is consistent with message pack patterns.
51-72
: MultiFileArchiveMetadata structure is appropriate.
The fields here match multi–file archiving logic. Good usage of MSGPACK_DEFINE_MAP
.
74-79
: SingleFileArchiveMetadata structure appropriateness.
Combining archive_files
, archive_metadata
, and num_segments
is logical for single–file mode. Nicely integrated with msgpack.
components/core/src/clp/clp/CommandLineArguments.hpp (3)
26-26
: Default boolean initialisation.
m_single_file_archive(false)
is clear and consistent with the default behaviour of multi–file archives.
49-50
: Accessor method is straightforward.
get_use_single_file_archive()
properly reflects the new member variable. No issues found.
98-98
: Member variable integration.
m_single_file_archive
fits in seamlessly with existing arguments. No contradictions observed.
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (3)
10-10
: Added include for encoding methods.
Including encoding_methods.hpp
is logical, given usage of ffi::cVariableEncodingMethodsVersion
further in the file.
13-14
: New compression type constant.
cCompressionTypeZstd = "ZSTD";
is a welcome addition, clarifying the default compression type used.
86-91
: Accessors for new metadata fields.
Providing methods to retrieve variable encoding and schema versions, along with compression type, aligns well with single–file archive needs.
components/core/src/clp/clp/compression.cpp (3)
110-110
: No issues found with the addition of the single-file-archive configuration.
This line properly forwards the command-line argument into the archive writer’s configuration.
139-140
: Logical check for archive splitting is correct.
The code correctly checks whether the dictionary size threshold is reached and if single-file mode is disabled. This ensures that splitting only occurs under the intended conditions.
168-169
: Consistent archive splitting logic for grouped files.
These lines mirror the logic above and maintain consistent behaviour for grouped file compression.
components/core/src/clp/streaming_archive/writer/Archive.hpp (3)
51-51
: New boolean flag introduced.
Adding the flag use_single_file_archive
to UserConfig
is clear and self-explanatory, facilitating better configurability of the archiving process.
197-197
: Getter method aligns with the coding guidelines.
This accessor cleanly exposes the new flag, supporting usage in external logic.
285-289
: Implementation detail for single-file archive creation.
The create_single_file_archive()
method is a suitable extension point, though it’s not fully shown here. It should be well-documented and tested to ensure it handles any corner cases (e.g. partially written archives).
components/core/src/clp/clp/CommandLineArguments.cpp (1)
376-379
: New command-line option successfully introduced.
The addition of --single-file-archive
provides enhanced flexibility for output formats. Ensure that the help text and documentation clearly describe any potential performance or storage trade-offs.
components/core/src/clp/clp/FileCompressor.cpp (3)
246-247
: Single-file check prevents increments of archive splits.
This usage respects the new flag to avoid unnecessary splitting, preserving consistent behaviour with single-file archiving.
341-342
: Conditional logic consistent with the single-file feature.
Retaining the archival split only if false == …
ensures that the newly introduced flag remains orthogonal to existing compression workflows.
542-543
: Proper check before splitting IR streams.
The logic parallels other file-splitting checks, keeping single-file intentions intact across different compression paths.
components/core/src/clp/clp/CMakeLists.txt (2)
111-113
: LGTM! New source files follow project conventions.
The new single-file archive implementation files are logically organized and follow the project's established patterns for source organization.
Line range hint 1-150
: Consider making GlobalMetadataDB optional for single-file archives.
Based on the PR objectives, GlobalMetadataDB might not be required when writing single-file archives. Consider making it conditionally included based on the archive type to avoid unnecessary dependencies.
Let's verify the GlobalMetadataDB usage in the single-file archive implementation:
✅ Verification successful
Based on the examination of the single file archive implementation in the codebase, particularly in single_file_archive/writer.hpp
and writer.cpp
, I can now provide a final response:
GlobalMetadataDB is not used in single-file archive operations
The single-file archive implementation is completely independent of GlobalMetadataDB. It operates by:
- Creating a self-contained archive with its own metadata structure (
SingleFileArchiveMetadata
) - Directly copying files from the multi-file archive
- Managing file offsets and archive structure internally
The GlobalMetadataDB dependency can be safely made optional for single-file archives as they:
- Store all necessary metadata within the single file
- Don't require external database operations
- Handle their own file management and metadata tracking
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check GlobalMetadataDB usage in single-file archive code
# Check for GlobalMetadataDB usage in new SFA files
rg -A 5 "GlobalMetadataDB" "../streaming_archive/single_file_archive/"
# Check for any conditional usage patterns
ast-grep --pattern 'if ($condition) {
$$$
GlobalMetadataDB
$$$
}'
Length of output: 224
Script:
#!/bin/bash
# Let's try searching in the entire codebase
# Check GlobalMetadataDB usage in single file archive related code
rg -A 5 "GlobalMetadataDB" --type cpp
# Check writer implementation specifically
rg -A 10 "single_file_archive.*writer" --type cpp
# Look for single file archive implementation
fd "single.*archive"
Length of output: 48285
Script:
#!/bin/bash
# Let's check the single file archive writer implementation
rg -A 10 "class.*writer" components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
# Check if GlobalMetadataDB is used in single file archive implementation
rg -A 5 "GlobalMetadataDB" components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
# Check Archive constructor to see how single file archive mode is handled
rg -A 10 "Archive::Archive.*Config" components/core/src/clp/streaming_archive/writer/Archive.cpp
Length of output: 298
Script:
#!/bin/bash
# Check the content of single file archive writer files
cat components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
cat components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
# Check Archive constructor implementation
cat components/core/src/clp/streaming_archive/writer/Archive.cpp | rg -A 20 "Archive::Archive"
Length of output: 12781
The archive version was changed to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (9)
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (3)
15-21
: Use consistent naming convention for version constants
The naming “cArchiveMajorVersion”, “cArchiveMinorVersion” and “cArchivePatchVersion” is informative, but ensure consistent naming style (e.g., PascalCase or snake_case) if there’s a global standard.
23-27
: Remove magic number duplication
The.clp
extension and thecUnstructuredSfaMagicNumber
might be repeated in other parts of the code. If so, consider centralizing them in a single constants file to avoid potential inconsistencies.
76-81
: Validate memory usage
Storing an entire vector ofFileInfo
might risk memory issues if the number of archived files grows very large in single-file mode. Consider a streaming approach for extremely large archives.components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (3)
19-20
: Enhance documentation for cReadBlockSize
Elaborate on the rationale for using 4096. This helps maintainers decide whether to tune it for performance.
137-145
: Large file warning
This warning is valuable. Consider adding a user-configurable threshold to accommodate different environment needs.
215-231
: Possible parallelization
The loop writing each segment’s file content can be parallelized if the order doesn’t matter for performance gains.components/core/src/clp/clp/compression.cpp (2)
139-141
: Maintain DRY
This condition repeats in lines 169-171. Factor out an inline function or macro if duplication grows.
139-141
: Prefer “false == expr” check
Per coding guidelines, replace&& false == archive_writer.get_use_single_file_archive()
with&& (false == archive_writer.get_use_single_file_archive())
.- && false == archive_writer.get_use_single_file_archive()) + && (false == archive_writer.get_use_single_file_archive()))Also applies to: 169-171
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
61-61
: Boolean naming
m_use_single_file_archive
is clear, but consider naming itm_is_single_file_archive
orm_single_file_mode
for consistent convention.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
components/core/src/clp/clp/FileCompressor.cpp
(3 hunks)components/core/src/clp/clp/compression.cpp
(3 hunks)components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
(3 hunks)components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp
(1 hunks)components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
(1 hunks)components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
(1 hunks)components/core/src/clp/streaming_archive/writer/Archive.cpp
(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- components/core/src/clp/clp/FileCompressor.cpp
- components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
- components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
🧰 Additional context used
📓 Path-based instructions (4)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/clp/compression.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
🔇 Additional comments (9)
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (3)
19-21
: Confirm the semantic version scheme
You are using a bitwise composition of major/minor/patch. Confirm that it remains compatible with your internal versioning constraints discussed in the PR comments.
47-51
: Verify no external references to field names
The FileInfo
struct serializes fields as “n” and “o”. Ensure there are no external references expecting different property names.
53-73
: Serialization definitions
The fields in MultiFileArchiveMetadata
are properly included in MSGPACK_DEFINE_MAP
. Everything looks coherent and consistent with the usage.
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (4)
100-112
: Potential concurrency considerations
update_offset
aggregates sizes by reading the filesystem. If future concurrency is introduced, ensure atomic or thread-safe operations around offset calculations.
117-147
: Good usage of sentinel
The sentinel FileInfo{"", offset}
is a handy approach for communicating total file size.
149-177
: Ensure consistent metadata
Data from ArchiveMetadata
is mirrored here. If new metadata fields are added in the future, they must be carefully synchronized.
197-213
: Handle partial read cases
In a multi-threaded or partial read scenario, validate that your loop handles truncated reads or concurrency issues from external processes.
components/core/src/clp/clp/compression.cpp (1)
110-110
: Check code style
The line updates the user config with archive_user_config.use_single_file_archive = command_line_args.get_use_single_file_archive();
. Everything looks consistent.
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
248-250
: Single-File archiving
Calling create_single_file_archive()
in the close()
method is logical. Ensure that any exceptions raised do not leave the archive in an inconsistent state.
void Archive::create_single_file_archive() { | ||
std::filesystem::path multi_file_archive_path = m_path; | ||
|
||
auto segment_ids | ||
= clp::streaming_archive::single_file_archive::get_segment_ids(m_next_segment_id - 1); | ||
|
||
if (false == m_local_metadata.has_value()) { | ||
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); | ||
} | ||
|
||
auto& multi_file_archive_metadata = m_local_metadata.value(); | ||
auto packed_metadata | ||
= clp::streaming_archive::single_file_archive::create_single_file_archive_metadata( | ||
multi_file_archive_metadata, | ||
multi_file_archive_path, | ||
segment_ids | ||
); | ||
|
||
clp::streaming_archive::single_file_archive::write_single_file_archive( | ||
multi_file_archive_path, | ||
packed_metadata, | ||
segment_ids | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider cleanup ordering
You remove the multi-file archive only after writing the single-file archive. Double-check that partial failures mid-operation do not remove the source prematurely.
sorry closing was by accident... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (4)
19-21
: Consider documenting the rationale for the block size value.Adding a comment explaining why 4096 was chosen as the block size would help future maintainers understand if this value needs adjustment.
namespace { -constexpr size_t cReadBlockSize = 4096; +// 4096 bytes is a common page size and provides good balance between memory usage and I/O performance +constexpr size_t cReadBlockSize = 4096;
195-211
: Enhance error handling with more context in error messages.The error handling could be improved by including more context when throwing exceptions.
} if (ErrorCode_Success != error_code) { - throw OperationFailed(error_code, __FILENAME__, __LINE__); + throw OperationFailed( + error_code, + __FILENAME__, + __LINE__, + fmt::format("Failed to read from file '{}': {}", + file_path.string(), + get_error_message(error_code)) + ); }
265-268
: Improve error messages for file operations.The error handling should include more context about the operation that failed.
if (std::filesystem::exists(single_file_archive_path)) { - throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); + throw OperationFailed( + ErrorCode_Failure, + __FILENAME__, + __LINE__, + fmt::format("Single-file archive '{}' already exists", + single_file_archive_path.string()) + ); } // ... (other code remains the same) try { std::filesystem::remove_all(multi_file_archive_path); } catch (std::filesystem::filesystem_error& e) { - throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); + throw OperationFailed( + ErrorCode_Failure, + __FILENAME__, + __LINE__, + fmt::format("Failed to clean up multi-file archive '{}': {}", + multi_file_archive_path.string(), + e.what()) + ); }Also applies to: 279-284
255-284
: Consider architectural improvements for better operability.Two suggestions for improving the implementation:
- Add logging for successful operations to aid in debugging and monitoring.
- Consider making the cleanup of the multi-file archive configurable, as there might be cases where keeping both versions is desired (e.g., for verification or backup purposes).
Example implementation:
void write_single_file_archive( std::filesystem::path const& multi_file_archive_path, std::stringstream const& packed_metadata, std::vector<std::string> const& segment_ids, bool cleanup_multi_file_archive = true // New parameter ) { SPDLOG_INFO("Creating single-file archive from '{}'", multi_file_archive_path.string()); // ... (existing implementation) ... archive_writer.close(); SPDLOG_INFO("Successfully created single-file archive '{}'", single_file_archive_path.string()); if (cleanup_multi_file_archive) { SPDLOG_INFO("Cleaning up multi-file archive '{}'", multi_file_archive_path.string()); try { std::filesystem::remove_all(multi_file_archive_path); SPDLOG_INFO("Successfully cleaned up multi-file archive"); } catch (std::filesystem::filesystem_error& e) { // ... (existing error handling) ... } } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
🔇 Additional comments (1)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
1-18
: LGTM! Well-organized includes and namespace declaration.
The includes are properly grouped and all necessary headers are present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (5)
101-113
: Enhance error handling with more specific information.The error handling could be improved by including the file path in the error message and preserving the original error code.
Consider this enhancement:
} catch (std::filesystem::filesystem_error const& e) { throw OperationFailed( ErrorCode_Failure, __FILENAME__, __LINE__, - fmt::format("Failed to get file size: {}", e.what()) + fmt::format("Failed to get file size for '{}': {} (error code: {})", + file_path.string(), e.what(), e.code().value()) ); }
138-145
: Enhance warning message with size information.The warning message could be more helpful by including the actual archive size for comparison.
Consider this enhancement:
if (offset > cFileSizeWarningThreshold) { SPDLOG_WARN( - "Single file archive size exceeded {}. " + "Single file archive size ({}) exceeded threshold ({}). " "The single-file archive format is not intended for large archives, " " consider using multi-file archive format instead.", + offset, cFileSizeWarningThreshold ); }
196-212
: Consider adding progress reporting for large file operations.For better observability during large file operations, consider adding progress reporting. Also, the error handling could be more specific.
Consider these enhancements:
auto write_archive_file(std::filesystem::path const& file_path, FileWriter& archive_writer) -> void { FileReader reader(file_path.string()); + auto total_size = std::filesystem::file_size(file_path); + uint64_t bytes_processed = 0; std::array<char, cReadBlockSize> read_buffer{}; while (true) { size_t num_bytes_read{}; ErrorCode const error_code = reader.try_read(read_buffer.data(), cReadBlockSize, num_bytes_read); if (ErrorCode_EndOfFile == error_code) { break; } if (ErrorCode_Success != error_code) { - throw OperationFailed(error_code, __FILENAME__, __LINE__); + throw OperationFailed( + error_code, + __FILENAME__, + __LINE__, + fmt::format("Failed to read from file: {}", file_path.string()) + ); } archive_writer.write(read_buffer.data(), num_bytes_read); + bytes_processed += num_bytes_read; + if (total_size > cFileSizeWarningThreshold) { + SPDLOG_DEBUG( + "Processing file {}: {:.1f}% ({}/{} bytes)", + file_path.filename().string(), + (bytes_processed * 100.0) / total_size, + bytes_processed, + total_size + ); + } } }
266-268
: Improve error handling with specific error messages.The error handling for file existence check and archive removal could be more informative.
Consider these enhancements:
if (std::filesystem::exists(single_file_archive_path)) { - throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); + throw OperationFailed( + ErrorCode_Failure, + __FILENAME__, + __LINE__, + fmt::format("Single-file archive already exists: {}", + single_file_archive_path.string()) + ); } // ... (other code) try { std::filesystem::remove_all(multi_file_archive_path); } catch (std::filesystem::filesystem_error& e) { - throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); + throw OperationFailed( + ErrorCode_Failure, + __FILENAME__, + __LINE__, + fmt::format("Failed to remove multi-file archive at '{}': {}", + multi_file_archive_path.string(), e.what()) + ); }Also applies to: 280-284
233-245
: Optimize vector allocation and add range validation.Consider pre-allocating the vector and adding a reasonable range check for segment IDs.
Consider these enhancements:
auto get_segment_ids(segment_id_t last_segment_id) -> std::vector<std::string> { std::vector<std::string> segment_ids; if (last_segment_id < 0) { return segment_ids; } + if (last_segment_id > cMaxSegmentId) { + throw OperationFailed( + ErrorCode_Failure, + __FILENAME__, + __LINE__, + fmt::format("Segment ID {} exceeds maximum allowed value {}", + last_segment_id, cMaxSegmentId) + ); + } + + segment_ids.reserve(last_segment_id + 1); for (size_t i = 0; i <= last_segment_id; ++i) { segment_ids.emplace_back(std::to_string(i)); } return segment_ids; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
(1 hunks)components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
🔇 Additional comments (1)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
1-17
: LGTM! Well-organized includes and namespace setup.
The includes are properly organized into system headers, third-party libraries, and project-specific headers. The organization follows good C++ practices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (6)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (4)
135-135
: Document the sentinel file info entryThe empty filename sentinel is used to mark the total size of all files. Consider adding a comment explaining why an empty filename is used as a sentinel and how it should be handled by the reader.
197-213
: Enhance error handling in file readingConsider adding more specific error information when throwing
OperationFailed
. The current implementation loses the original error code context.- throw OperationFailed(error_code, __FILENAME__, __LINE__); + throw OperationFailed( + error_code, + __FILENAME__, + __LINE__, + fmt::format("Failed to read from file: {}", file_path.string()) + );
234-246
: Improve type consistency and clarity in segment ID generationThe function mixes signed and unsigned types, and the negative check could be more explicit.
-auto get_segment_ids(segment_id_t last_segment_id) -> std::vector<std::string> { +auto get_segment_ids(segment_id_t const last_segment_id) -> std::vector<std::string> { std::vector<std::string> segment_ids; - if (last_segment_id < 0) { + if (last_segment_id < 0) { + SPDLOG_WARN("Negative segment ID provided: {}", last_segment_id); return segment_ids; } - for (size_t i = 0; i <= last_segment_id; ++i) { + auto const last_id = static_cast<size_t>(last_segment_id); + for (size_t i = 0; i <= last_id; ++i) { segment_ids.emplace_back(std::to_string(i)); }
1-287
: Overall implementation is well-structured and robustThe implementation demonstrates good separation of concerns, proper error handling, and clear documentation. The code is organized logically with helper functions in an anonymous namespace and a clean public interface.
Consider adding the following improvements in future iterations:
- Add unit tests to verify the archive format compatibility
- Consider implementing a dry-run mode to validate the archive structure before writing
- Add progress reporting for large archives
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (2)
7-7
: Consider standardizing header file extensionsThe codebase mixes
.h
and.hpp
extensions. Consider standardizing to.hpp
for consistency, as it's more specific to C++ headers.
47-51
: Use descriptive field names in FileInfoThe
FileInfo
structure uses abbreviated field names (n
,o
) which reduce code readability.Consider using more descriptive names:
struct FileInfo { - std::string n; - uint64_t o; - MSGPACK_DEFINE_MAP(n, o); + std::string name; + uint64_t offset; + MSGPACK_DEFINE_MAP(name, offset); };
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp
(1 hunks)components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
🔇 Additional comments (5)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
1-18
: LGTM! Well-organized includes
The includes are properly organized into logical groups (standard library, third-party, and project-specific) with clear separation between groups.
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (4)
13-21
: Well-structured version encoding
The version encoding aligns with the semantic versioning approach discussed in the PR comments, using bit shifting to combine major (0), minor (1), and patch (1) versions into a single uint32_t
.
38-45
: Document the purpose of unused fields and packed attribute
The SingleFileArchiveHeader
structure:
- Contains 6 unused fields without explanation
- Uses
__attribute__((packed))
which might cause alignment issues on some architectures
Please document:
- The purpose of reserving unused fields
- Why packing is necessary
- Any alignment requirements for different architectures
27-27
: Verify file size warning threshold
The 100MB threshold (cFileSizeWarningThreshold
) seems arbitrary. Please confirm if this value aligns with the private branch implementation and document the rationale.
✅ Verification successful
Let me gather more context about the usage of this threshold.
The file size warning threshold is correctly implemented and documented
The 100MB threshold is used to warn users when they're creating single-file archives that are too large. The implementation shows clear rationale in the warning message: single-file archives are not split and can consume excessive memory during decompression. Users are advised to use the multi-file archive format instead for large archives.
The warning message in writer.cpp
provides sufficient documentation:
"Single file archive size exceeded {}. "
"The single-file archive format is not intended for large archives, "
"consider using multi-file archive format instead."
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for similar thresholds in the codebase
rg -i "size.*threshold|threshold.*size" --type cpp
Length of output: 1094
Script:
#!/bin/bash
# Check the usage context of the warning threshold
rg -B 5 -A 5 "cFileSizeWarningThreshold" --type cpp
# Look for any documentation or comments about file size limits
rg -i "file.*size.*limit|size.*limit.*file" --type cpp
Length of output: 3399
53-81
: Verify metadata compatibility with private branch
The PR objectives mention differences in metadata formats between branches. Please confirm:
- All required fields from the private branch are included
- Data type changes are backward compatible
- The order of fields in
MSGPACK_DEFINE_MAP
maintains compatibility
✅ Verification successful
The previous ast-grep pattern might have failed due to syntax. Let's try different approaches to find metadata structures and their msgpack definitions in the codebase. This will help us verify the compatibility concerns.
Let's gather more information about version compatibility and metadata structures from the streaming archive implementation.
Based on the gathered information, I can now provide a final response about the metadata compatibility:
Metadata structures appear compatible with existing implementation
The metadata structures in the PR maintain compatibility with the existing codebase:
- All required fields are present and match the existing implementation in
ArchiveMetadata.hpp
- Data types are consistent with the current usage:
archive_format_version_t
for version trackingstd::string
for version identifiers and compression typeuint64_t
for size fields
- The order of fields in
MSGPACK_DEFINE_MAP
matches the struct field declarations, maintaining proper serialization - The version constants (
cArchiveFormatVersion
,cVariableEncodingMethodsVersion
,cVariablesSchemaVersion
) are properly referenced
The changes appear to be a refactoring that preserves the existing metadata format while organizing it into more structured types (MultiFileArchiveMetadata
and SingleFileArchiveMetadata
).
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for similar metadata structures in the codebase
ast-grep --pattern 'struct $_ {
$$$
MSGPACK_DEFINE_MAP($$$);
}'
Length of output: 69
Script:
#!/bin/bash
# Search for metadata structures using ripgrep
rg -U "struct.*Metadata.*\{[\s\S]*?MSGPACK_DEFINE" -A 20
Length of output: 5375
Script:
#!/bin/bash
# Search for any MSGPACK_DEFINE usage
rg "MSGPACK_DEFINE" -A 5
Length of output: 3013
Script:
#!/bin/bash
# Look for any files containing 'Metadata' in their name
fd "Metadata" --type f
Length of output: 1265
Script:
#!/bin/bash
# Let's check the content of ArchiveMetadata files for any additional metadata structures
rg -U "struct.*\{[\s\S]*?\}" components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
Length of output: 92
Script:
#!/bin/bash
# Check for any version-related constants or definitions
rg "version" -i components/core/src/clp/streaming_archive/
Length of output: 5135
try { | ||
std::filesystem::remove_all(multi_file_archive_path); | ||
} catch (std::filesystem::filesystem_error& e) { | ||
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Improve error handling for cleanup operation
- The error message is not propagated when file removal fails
- Consider verifying if the archive was written successfully before removing the original files
archive_writer.close();
+ // Verify the archive was written successfully
+ if (!std::filesystem::exists(single_file_archive_path) ||
+ std::filesystem::file_size(single_file_archive_path) == 0) {
+ throw OperationFailed(
+ ErrorCode_Failure,
+ __FILENAME__,
+ __LINE__,
+ "Failed to write single-file archive"
+ );
+ }
try {
std::filesystem::remove_all(multi_file_archive_path);
} catch (std::filesystem::filesystem_error& e) {
- throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+ throw OperationFailed(
+ ErrorCode_Failure,
+ __FILENAME__,
+ __LINE__,
+ fmt::format("Failed to remove original archive: {}", e.what())
+ );
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
try { | |
std::filesystem::remove_all(multi_file_archive_path); | |
} catch (std::filesystem::filesystem_error& e) { | |
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); | |
} | |
archive_writer.close(); | |
// Verify the archive was written successfully | |
if (!std::filesystem::exists(single_file_archive_path) || | |
std::filesystem::file_size(single_file_archive_path) == 0) { | |
throw OperationFailed( | |
ErrorCode_Failure, | |
__FILENAME__, | |
__LINE__, | |
"Failed to write single-file archive" | |
); | |
} | |
try { | |
std::filesystem::remove_all(multi_file_archive_path); | |
} catch (std::filesystem::filesystem_error& e) { | |
throw OperationFailed( | |
ErrorCode_Failure, | |
__FILENAME__, | |
__LINE__, | |
fmt::format("Failed to remove original archive: {}", e.what()) | |
); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (5)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
341-343
: Consider refactoring the nested conditions for better readability.The condition for file splitting now includes two nested checks that could be simplified.
Consider this alternative structure:
- if (get_data_size_of_dictionaries() >= m_target_data_size_of_dicts - && false == m_use_single_file_archive) - { + bool should_split = get_data_size_of_dictionaries() >= m_target_data_size_of_dicts + && !m_use_single_file_archive; + if (should_split) {
669-671
: Add error context to the metadata validation.The error message could be more descriptive about why the metadata is missing.
- throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); + throw OperationFailed( + ErrorCode_Failure, + __FILENAME__, + __LINE__, + "Local metadata is missing when creating single-file archive" + );
681-685
: Consider adding progress logging for large archives.For large archives, it would be helpful to log the progress of the single-file archive creation process.
+ SPDLOG_INFO("Creating single-file archive at {}", multi_file_archive_path.string()); clp::streaming_archive::single_file_archive::write_single_file_archive( multi_file_archive_path, packed_metadata, segment_ids ); + SPDLOG_INFO("Successfully created single-file archive");components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (2)
24-24
: Consider documenting the rationale for the block size valueAdding a comment explaining why 4096 was chosen as the block size would help future maintainers understand if this value needs adjustment.
-constexpr size_t cReadBlockSize = 4096; +// 4KB block size chosen to match common page size for optimal I/O performance +constexpr size_t cReadBlockSize = 4096;
199-215
: Enhance error handling with descriptive messagesThe error handling could be more informative by including the file path and specific error details in the exception message.
} if (ErrorCode_Success != error_code) { - throw OperationFailed(error_code, __FILENAME__, __LINE__); + throw OperationFailed( + error_code, + __FILENAME__, + __LINE__, + fmt::format("Failed to read from file '{}': {}", + file_path.string(), + get_error_message(error_code)) + ); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
components/core/CMakeLists.txt
(1 hunks)components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
(1 hunks)components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
(1 hunks)components/core/src/clp/streaming_archive/writer/Archive.cpp
(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
🧰 Additional context used
📓 Path-based instructions (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
🔇 Additional comments (5)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
16-24
: LGTM! Header includes and dependencies look good.
The addition of the single-file archive writer header and its dependencies is properly organized.
663-686
: Verify cleanup order in single-file archive creation.
The function creates a single-file archive after closing the multi-file archive. This ordering needs careful consideration for partial failures.
Let's verify the cleanup handling:
#!/bin/bash
# Search for error handling patterns around archive creation
rg -A 5 "create_single_file_archive|write_single_file_archive" --type cpp
# Look for cleanup/rollback mechanisms
ast-grep --pattern 'try {
$$$
create_single_file_archive();
$$$
} catch ($_) {
$$$
}'
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (3)
1-21
: LGTM! Well-organized includes and namespace structure.
The includes are properly grouped and contain all necessary headers for the implementation.
283-287
: Improve error handling for cleanup operation
This section needs better error handling as identified in a previous review.
236-248
: Add validation for segment ID range
Consider adding an upper bound check for last_segment_id
to prevent excessive memory allocation for very large segment IDs.
auto get_segment_ids(segment_id_t last_segment_id) -> std::vector<std::string> {
std::vector<std::string> segment_ids;
if (last_segment_id < 0) {
return segment_ids;
}
+ // Prevent excessive memory allocation
+ if (last_segment_id > cMaxSegmentId) {
+ throw OperationFailed(
+ ErrorCode_InvalidArgument,
+ __FILENAME__,
+ __LINE__,
+ fmt::format("Segment ID {} exceeds maximum allowed value {}",
+ last_segment_id, cMaxSegmentId)
+ );
+ }
+
for (size_t i = 0; i <= last_segment_id; ++i) {
src/clp/streaming_archive/single_file_archive/Defs.hpp | ||
src/clp/streaming_archive/single_file_archive/writer.cpp | ||
src/clp/streaming_archive/single_file_archive/writer.hpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Missing test coverage for the new single-file archive implementation
The new single-file archive files have been properly added to the CMake configuration, but I don't see any corresponding test files for this functionality. While there are several test files in the codebase:
- There's only one test file that mentions "Archive":
test-clp_s-end_to_end.cpp
- None of the existing test files appear to specifically cover the new single-file archive implementation
Recommended actions:
- Create unit tests for the new single-file archive writer implementation
- Consider adding tests in these areas:
- Basic writer functionality in
writer.cpp
- Edge cases and error handling
- Integration with the existing archive system
- Basic writer functionality in
🔗 Analysis chain
LGTM! The new single-file archive source files are properly integrated.
The files are correctly placed in a dedicated directory and follow the project's naming conventions.
Let's verify the test coverage for these new files:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for corresponding test files and test coverage for the new SFA implementation.
# Look for test files related to single-file archive
rg -l "single_file_archive" tests/
# Check for test cases using the new writer
rg -A 5 "TEST.*single.*file.*archive" tests/
Length of output: 175
Script:
#!/bin/bash
# Let's first find the test directory location
fd -t d "test" .
# Then search for single file archive related files and tests
rg -l "single_file_archive" .
# Look for test cases specifically
rg -A 5 "TEST.*single.*file.*archive" .
Length of output: 1270
Script:
#!/bin/bash
# Let's check the test files in the core component
fd -t f ".*test.*\.(cpp|hpp)$" components/core/tests/
# Look for any test files that might be testing the archive functionality
rg -l "Archive" components/core/tests/
# Check if there are any test files that include "writer" in their name or content
rg -l "writer" components/core/tests/
Length of output: 1828
High level comment: let's use auto func() -> [return type] {} for all methods. There are still a few methods using the old style signature |
To your questions:
As a quick conclusion without thinking about details, I don't think we should. GlobalMetadataDB is used by clp and clp package to get the list of available archives. Without GlobalMetadataDB, we have to check the archive directory to see what archives are available, but this is not feasible if the archive is written to S3. It remains a question that if we need all fields in the table for a single archive, but let's keep the GlobalMetadataDB for now. Let me know if you disagree.
@LinZhihao-723 can you give some input on this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First post a small batch of review. Still omw to review the core changes
@@ -243,7 +243,9 @@ void FileCompressor::parse_and_encode_with_heuristic( | |||
|
|||
// Parse content from file | |||
while (m_message_parser.parse_next_message(true, reader, m_parsed_message)) { | |||
if (archive_writer.get_data_size_of_dictionaries() >= target_data_size_of_dicts) { | |||
if (archive_writer.get_data_size_of_dictionaries() >= target_data_size_of_dicts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thinking, would it be cleaner if we add a new method with name like "should_split" to archive_writer, and embed this if logic into the method.
Now the same if statements have been duplicated at multiple places, which is inefficient and error prone since one change requires you to update multiple places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will wait on making this change until I get feedback from kirk. I may want to remove no split functionality even though it is done in private branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good
@@ -135,7 +136,9 @@ bool compress( | |||
); | |||
} | |||
for (auto it = files_to_compress.cbegin(); it != files_to_compress.cend(); ++it) { | |||
if (archive_writer.get_data_size_of_dictionaries() >= target_data_size_of_dictionaries) { | |||
if (archive_writer.get_data_size_of_dictionaries() >= target_data_size_of_dictionaries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto for this
std::vector<FileInfo> const& file_infos, | ||
std::vector<std::string> const& segment_ids | ||
) -> std::stringstream { | ||
MultiFileArchiveMetadata archive_metadata{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why MultiFileArchiveMetadata is under pack_single_file_archive_metadata?
I feel the naming is a bit confusing and maybe can be improved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree the naming is confusing. I added the name MultiFileArchiveMetadata to try and make less confusing but clearly didn't help lol.
The single file archive metadata section is confusing in that is not just the original archives metadata file. It is a superset. It includes the original archive metadata, but 2 other components (the num segments, and file info section). I added the name multi-file archive metadata to reference the metadata of the original archive (i.e. without the num segments and the file info section).
Another wrinkle, is that the metadata was chosen to be written in MsgPack format for sfa (i dont know why but I'm sure there was a reason). As a result, we don't just write the file, like we do with other archive files, it need to be serialized to MsgPack first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seriliazing to MsgPack first makes sense, but still we should find better naming for the variables.
I think ArchiveMetadata would be a better name compared to MultiFileArchiveMetadata.
Another possible way is to make BaseArchiveMetadata class, and let the SFA and "Normal Archive" (need a better name for sure) to extend on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note there is already a ArchiveMetadata Class. Technically it is in a different namespace, so we could use the same name, but that might be confusing. This struct just takes the variables from ArchiveMetadata class that are actually written to disk and wraps them in a struct with the appropriate variable names (without "m_") for msgpack serialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaving as MultiFileArchiveMetadata per the reason i mentioned. Unless I hear otherwise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kirkrodrigues any naming suggestion?
I think what we want to refer to is the "Original format archive that consists of multiple files"
@@ -330,7 +338,9 @@ void Archive::write_msg_using_schema(LogEventView const& log_view) { | |||
m_old_ts_pattern = timestamp_pattern; | |||
} | |||
} | |||
if (get_data_size_of_dictionaries() >= m_target_data_size_of_dicts) { | |||
if (get_data_size_of_dictionaries() >= m_target_data_size_of_dicts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any plan on this?
This is interesting. I wasn't really thinking about the sfa would interact with clp-package. Perhaps they have already considered this when designing for clp-s. For now we can leave it in, as it is not essential for this PR or the reader PR. I am still wary of generating two files, as purpose of single-file archive is that it's a single file. Perhaps kirk will have more context when time allows. |
components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
Outdated
Show resolved
Hide resolved
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Show resolved
Hide resolved
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp
Outdated
Show resolved
Hide resolved
There are two configurations for global metadataDB. When using a "local file" as metadataDB, clp will generate one file per execution, regardless of the number of archives. so say you compress 10000 files, clp may generate 10 sfas, but only 1 global MetadataDB. When using package, the global MetadataDB will be a MySQL database running at the backend. In this way, no file will be generated, and CLP directly writes to the MySQL database. CLP knows it is a remote database if we pass in a db-config in the cmdline. I am 100% sure we need to update global MetadataDB when using with package. For non-package case, global metadataDB can still be useful since it contains time_range info of each archives, which can be used as a top level filter. I think we should agree on the format of global metadata for SFA, then decide how it should be stored. |
This is good info and good points. I'm now okay keeping the global metadataDB as I think it makes sense to keep compatibility with clp package. Note I think we will still have a requirement to decode archives from the private branch (this is the real long term goal of this work is to open sfa in the log viewer) . The private branch will not have a global MetadataDB. Maybe we treat reading a "lonely archive" (archive without metadataDB) as an exceptional path, and write code specifically for that, but in the general case we produce the global MetadataDB. |
Yeah, we might need to have a special path for "lonely archive". It should be able to reuse most of our exisitng code though. maybe we should create a separate PR for the work though. That said, we should have a concrete plan for the other PR so we don't have to revert changes we made in this PR later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another small batch of reviews
components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
Outdated
Show resolved
Hide resolved
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Nitpick comments (10)
components/core/src/clp/streaming_archive/writer/Archive.hpp (1)
51-51
: Document the new configuration option.The
use_single_file_archive
field lacks documentation in the UserConfig struct's documentation block. Please update the documentation to describe this option's purpose and impact.components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (4)
111-123
: Consider reserving vector capacity for performance.For large
next_segment_id
values, pre-allocating the vector's capacity would avoid multiple reallocations:auto get_segment_ids(segment_id_t next_segment_id) -> std::vector<std::string> { std::vector<std::string> segment_ids; + segment_ids.reserve(next_segment_id);
157-159
: Document the sentinel's purpose more explicitly.The empty filename sentinel is used to mark the total archive size, but this implementation detail should be documented more clearly.
- // Add sentinel indicating total size of all files. + // Add sentinel with empty filename to store the total size of all files. + // This sentinel is used during decompression to verify archive integrity. files.emplace_back(FileInfo{"", offset});
220-236
: Consider adding progress tracking for large files.For large files, it would be helpful to track and log the copy progress to provide feedback during long operations.
auto write_archive_file(std::filesystem::path const& file_path, FileWriter& archive_writer) -> void { FileReader reader(file_path.string()); + const auto total_size = std::filesystem::file_size(file_path); + size_t bytes_copied = 0; std::array<char, cReadBlockSize> read_buffer{}; while (true) { size_t num_bytes_read{}; ErrorCode const error_code = reader.try_read(read_buffer.data(), cReadBlockSize, num_bytes_read); if (ErrorCode_EndOfFile == error_code) { break; } if (ErrorCode_Success != error_code) { throw OperationFailed(error_code, __FILENAME__, __LINE__); } archive_writer.write(read_buffer.data(), num_bytes_read); + bytes_copied += num_bytes_read; + if (bytes_copied % (100 * cReadBlockSize) == 0) { + SPDLOG_DEBUG("Copying {}: {:.1f}% complete", + file_path.filename().string(), + (100.0 * bytes_copied) / total_size); + } } }
160-169
: Consider implementing archive splitting for large files.While the warning about large archives is helpful, consider implementing an automatic splitting mechanism for archives exceeding the size threshold. This would prevent excessive memory consumption during decompression.
components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (2)
17-34
: Consider optimizing the exception class.While the implementation is good, consider these improvements:
- Make
m_message
const since it's not modified after construction- Pass the message parameter by const reference to avoid unnecessary string copies
class OperationFailed : public TraceableException { public: // Constructors OperationFailed( ErrorCode error_code, char const* const filename, int line_number, - std::string message = "streaming_archive::single_file_archive operation failed" + std::string const& message = "streaming_archive::single_file_archive operation failed" ) : TraceableException{error_code, filename, line_number}, m_message{std::move(message)} {} // Methods [[nodiscard]] auto what() const noexcept -> char const* override { return m_message.c_str(); } private: - std::string m_message; + std::string const m_message; };
36-49
: Enhance function declaration and documentation.Consider these improvements:
- Add [[nodiscard]] attribute since the function can throw an exception
- Pass path parameter by value since std::filesystem::path is optimized for pass-by-value
- Add parameter descriptions in the documentation
/** * Writes header, metadata and archive files in single-file format then * removes existing multi-file archive. * - * @param multi_file_archive_metadata - * @param multi_file_archive_path - * @param next_segment_id + * @param multi_file_archive_metadata The metadata from the multi-file archive + * @param multi_file_archive_path The path to the multi-file archive + * @param next_segment_id The next available segment ID * @throws OperationFailed if single-file archive path already exists. */ -auto write_single_file_archive( +[[nodiscard]] auto write_single_file_archive( ArchiveMetadata const& multi_file_archive_metadata, - std::filesystem::path const& multi_file_archive_path, + std::filesystem::path multi_file_archive_path, segment_id_t next_segment_id ) -> void;components/core/src/clp/clp/compression.cpp (1)
169-171
: Consider extracting duplicated condition.This condition is duplicated from lines 139-141. Consider extracting it into a helper function to improve maintainability.
+auto should_split_archive( + size_t data_size_of_dictionaries, + size_t target_data_size_of_dictionaries, + bool use_single_file_archive +) -> bool { + return data_size_of_dictionaries >= target_data_size_of_dictionaries + && false == use_single_file_archive; +} + // In the main code: -if (archive_writer.get_data_size_of_dictionaries() >= target_data_size_of_dictionaries - && false == archive_writer.get_use_single_file_archive()) +if (should_split_archive( + archive_writer.get_data_size_of_dictionaries(), + target_data_size_of_dictionaries, + archive_writer.get_use_single_file_archive()))components/core/src/clp/clp/CommandLineArguments.cpp (1)
376-379
: Consider enhancing the command-line option.The option description could be more detailed to help users understand:
- What a single-file archive is
- When to use it
- Any implications or trade-offs
- "single-file-archive", + "single-file-archive,s", po::bool_switch(&m_single_file_archive), - "Output archive as a single-file archive" + "Output archive as a single-file archive. This combines all archive files into " + "one file, which may be easier to transfer but cannot be split later."components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
249-251
: Consider enhancing error handling.The single-file archive creation is correctly placed after all metadata is written, but consider:
- Adding error handling for filesystem operations
- Logging the start and completion of the conversion
if (m_use_single_file_archive) { + SPDLOG_INFO("Converting to single-file archive: {}", m_path); create_single_file_archive(); + SPDLOG_INFO("Successfully converted to single-file archive: {}", m_path); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
components/core/src/clp/clp/CommandLineArguments.cpp
(1 hunks)components/core/src/clp/clp/CommandLineArguments.hpp
(3 hunks)components/core/src/clp/clp/compression.cpp
(3 hunks)components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
(1 hunks)components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
(1 hunks)components/core/src/clp/streaming_archive/writer/Archive.cpp
(5 hunks)components/core/src/clp/streaming_archive/writer/Archive.hpp
(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp/clp/CommandLineArguments.hpp
🧰 Additional context used
📓 Path-based instructions (6)
components/core/src/clp/clp/CommandLineArguments.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/clp/compression.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/writer/Archive.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
🪛 GitHub Actions: clp-lint
components/core/src/clp/streaming_archive/writer/Archive.hpp
[error] 197-197: Code formatting violation: incorrect formatting detected by clang-format in get_use_single_file_archive method
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
[error] 40-40: Code formatting violation: incorrect formatting detected by clang-format
[error] 125-125: Code formatting violation: incorrect formatting detected by clang-format
[error] 261-261: Code formatting violation: incorrect formatting detected by clang-format
[error] 277-282: Code formatting violation: incorrect formatting detected by clang-format in multiple lines including segment_ids declaration and packed_metadata initialization
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: build (macos-latest)
🔇 Additional comments (9)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
285-288
: 🛠️ Refactor suggestionEnhance method documentation and consider error handling.
The
create_single_file_archive()
method's documentation should:
- Detail the steps involved in the conversion process
- Document potential error conditions
- Specify the cleanup process for the existing multi-file archive
- Document any requirements or preconditions
Consider adding error handling for scenarios where:
- The single-file archive creation fails but the multi-file archive is already removed
- Insufficient disk space for temporary operations
Line range hint
1-352
: Verify archive versioning scheme.Based on the PR objectives discussion, the archive versioning scheme should be updated to use semantic versioning with
uint32_t
format as:(major_version << 24) | (minor_version << 16) | (patch_version)
. This change isn't reflected in the current implementation.Run this script to locate version-related code:
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (4)
1-26
: Consider renaming the module to "packer" for clarity.The file contains a collection of helper methods rather than a class, making "packer" potentially more appropriate than "writer".
27-110
: Well-structured function declarations with comprehensive documentation.The use of modern C++ features (trailing return types, [[nodiscard]]) and thorough Doxygen documentation enhances code quality and maintainability.
🧰 Tools
🪛 GitHub Actions: clp-lint
[error] 40-40: Code formatting violation: incorrect formatting detected by clang-format
179-190
: Consider renaming MultiFileArchiveMetadata to ArchiveMetadata for clarity.The current naming might be confusing as it's used within the single-file archive context.
291-295
: Improve error handling for cleanup operation.The error message is not propagated when file removal fails.
components/core/src/clp/streaming_archive/single_file_archive/writer.hpp (1)
1-14
: LGTM! Well-organized includes.The includes are logically grouped and all necessary for the functionality.
components/core/src/clp/clp/compression.cpp (1)
139-141
: LGTM! Logical condition for archive splitting.The condition correctly prevents archive splitting when single-file archive is enabled.
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
62-62
: LGTM! Proper initialization.The member variable is correctly initialized from the user configuration.
@@ -341,6 +349,7 @@ class Archive { | |||
GlobalMetadataDB* m_global_metadata_db; | |||
|
|||
bool m_print_archive_stats_progress; | |||
bool m_use_single_file_archive; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initialize the new member variable in the constructor.
The m_use_single_file_archive
member variable should be initialized in the constructor to prevent undefined behaviour.
Archive()
: m_segments_dir_fd(-1),
m_compression_level(0),
m_global_metadata_db(nullptr),
m_old_ts_pattern(nullptr),
- m_schema_file_path() {}
+ m_schema_file_path(),
+ m_use_single_file_archive(false) {}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
bool m_use_single_file_archive; | |
Archive() | |
: m_segments_dir_fd(-1), | |
m_compression_level(0), | |
m_global_metadata_db(nullptr), | |
m_old_ts_pattern(nullptr), | |
m_schema_file_path(), | |
m_use_single_file_archive(false) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with rabbit, please provide a default initialization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@haiqi96 Thank you for confirming. I've noted this preference for future reviews.
✏️ Learnings added
Learnt from: haiqi96
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.446Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally I didn't do it since it wasn't done for other members. For now I added a default initializer not in constructor but where declared
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
@@ -193,6 +194,8 @@ | |||
return m_logtype_dict.get_data_size() + m_var_dict.get_data_size(); | |||
} | |||
|
|||
[[nodiscard]] auto get_use_single_file_archive() const -> bool { return m_use_single_file_archive; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the code formatting violation.
The clang-format tool has detected incorrect formatting. Please run clang-format to fix the formatting:
- [[nodiscard]] auto get_use_single_file_archive() const -> bool { return m_use_single_file_archive; }
+ [[nodiscard]] auto get_use_single_file_archive() const -> bool {
+ return m_use_single_file_archive;
+ }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
[[nodiscard]] auto get_use_single_file_archive() const -> bool { return m_use_single_file_archive; } | |
[[nodiscard]] auto get_use_single_file_archive() const -> bool { | |
return m_use_single_file_archive; | |
} |
🧰 Tools
🪛 GitHub Actions: clp-lint
[error] 197-197: Code formatting violation: incorrect formatting detected by clang-format in get_use_single_file_archive method
std::memcpy(&header.magic, cUnstructuredSfaMagicNumber.data(), sizeof(header.magic)); | ||
archive_writer.write(reinterpret_cast<char const*>(&header), sizeof(header)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add size validation for magic number copy.
The memcpy operation assumes the magic number size matches the header field size. Add a static_assert to ensure this assumption holds:
+ static_assert(sizeof(header.magic) == cUnstructuredSfaMagicNumber.size(),
+ "Magic number size mismatch");
std::memcpy(&header.magic, cUnstructuredSfaMagicNumber.data(), sizeof(header.magic));
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
std::memcpy(&header.magic, cUnstructuredSfaMagicNumber.data(), sizeof(header.magic)); | |
archive_writer.write(reinterpret_cast<char const*>(&header), sizeof(header)); | |
static_assert(sizeof(header.magic) == cUnstructuredSfaMagicNumber.size(), | |
"Magic number size mismatch"); | |
std::memcpy(&header.magic, cUnstructuredSfaMagicNumber.data(), sizeof(header.magic)); | |
archive_writer.write(reinterpret_cast<char const*>(&header), sizeof(header)); |
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Outdated
Show resolved
Hide resolved
void Archive::create_single_file_archive() { | ||
if (false == m_local_metadata.has_value()) { | ||
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); | ||
} | ||
|
||
auto const& multi_file_archive_metadata = m_local_metadata.value(); | ||
|
||
clp::streaming_archive::single_file_archive::write_single_file_archive( | ||
multi_file_archive_metadata, | ||
m_path, | ||
m_next_segment_id | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Enhance single-file archive creation.
Consider these improvements:
- Add error handling for filesystem operations
- Add cleanup on failure
- Add progress indication for large archives
- Add validation of the created single-file archive
void Archive::create_single_file_archive() {
if (false == m_local_metadata.has_value()) {
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
}
auto const& multi_file_archive_metadata = m_local_metadata.value();
+ auto single_file_archive_path = m_path + ".sfa";
+
+ try {
+ SPDLOG_INFO("Creating single-file archive: {}", single_file_archive_path);
clp::streaming_archive::single_file_archive::write_single_file_archive(
multi_file_archive_metadata,
- m_path,
+ single_file_archive_path,
m_next_segment_id
);
+
+ // Validate the created archive
+ if (!validate_single_file_archive(single_file_archive_path)) {
+ throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+ }
+
+ // Rename to final path only after validation
+ std::filesystem::rename(single_file_archive_path, m_path);
+
+ SPDLOG_INFO("Successfully created single-file archive: {}", m_path);
+ } catch (const std::exception& e) {
+ // Clean up on failure
+ std::filesystem::remove(single_file_archive_path);
+ throw;
+ }
}
Committable suggestion skipped: line range outside the PR's diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (2)
213-214
:⚠️ Potential issueAdd size validation for magic number copy.
The memcpy operation assumes the magic number size matches the header field size.
291-295
:⚠️ Potential issueImprove error handling for cleanup operation.
The error message is not propagated when file removal fails.
🧹 Nitpick comments (3)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (3)
164-171
: Consider using a more descriptive warning message.The warning message could be more specific about the actual size threshold.
if (offset > cFileSizeWarningThreshold) { SPDLOG_WARN( - "Single file archive size exceeded {}. " + "Single file archive size exceeded {} bytes ({}GB). " "The single-file archive format is not intended for large archives, " " consider using multi-file archive format instead.", - cFileSizeWarningThreshold + cFileSizeWarningThreshold, + cFileSizeWarningThreshold / (1024 * 1024 * 1024) ); }
226-238
: Consider using std::copy_n for better performance.The current implementation reads and writes in small chunks, which could be optimized.
- std::array<char, cReadBlockSize> read_buffer{}; - while (true) { - size_t num_bytes_read{}; - ErrorCode const error_code - = reader.try_read(read_buffer.data(), cReadBlockSize, num_bytes_read); - if (ErrorCode_EndOfFile == error_code) { - break; - } - if (ErrorCode_Success != error_code) { - throw OperationFailed(error_code, __FILENAME__, __LINE__); - } - archive_writer.write(read_buffer.data(), num_bytes_read); - } + auto file_size = std::filesystem::file_size(file_path); + std::vector<char> buffer(file_size); + size_t bytes_read{}; + auto error_code = reader.try_read(buffer.data(), file_size, bytes_read); + if (error_code != ErrorCode_Success && error_code != ErrorCode_EndOfFile) { + throw OperationFailed(error_code, __FILENAME__, __LINE__); + } + archive_writer.write(buffer.data(), bytes_read);
269-271
: Improve error message for existing file.The error message when the single-file archive already exists could be more descriptive.
if (std::filesystem::exists(single_file_archive_path)) { - throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); + throw OperationFailed( + ErrorCode_Failure, + __FILENAME__, + __LINE__, + fmt::format("Single-file archive already exists: {}", single_file_archive_path.string()) + ); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
(1 hunks)components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
(1 hunks)components/core/src/clp/streaming_archive/writer/Archive.hpp
(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- components/core/src/clp/streaming_archive/writer/Archive.hpp
- components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (1)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
181-192
: Consider renaming MultiFileArchiveMetadata.Based on previous discussions, this struct name is confusing and could be renamed to ArchiveMetadata or BaseArchiveMetadata.
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Show resolved
Hide resolved
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Outdated
Show resolved
Hide resolved
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Outdated
Show resolved
Hide resolved
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Outdated
Show resolved
Hide resolved
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Outdated
Show resolved
Hide resolved
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Outdated
Show resolved
Hide resolved
archive_writer.write(packed_metadata.str().data(), packed_metadata.str().size()); | ||
} | ||
|
||
auto write_archive_file(std::filesystem::path const& file_path, FileWriter& archive_writer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a function specfic for "archive_file". Any chance we can put it into one of the utils.cpp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure i'll look for somewhere to put. I'll see if I can also replace clp-s function with the util function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, If no other file is using it, I'm not sure it makes sense to add it to the generic utils. Sure the idea of copying a file into another file is generic, but not sure where else it is used in codebase. I looked at using this in clp-s sfa code, but they use their own clp-s fileReader class... So basically i'm saying since it isnt used anywhere else, i think it should stay here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't entirely agree with you, but sure let's leave the code as it is.
// TODO: The following fields are used in single-file archive; however, they are not | ||
// currently part of multi-file archive metadata. Modifying multi-file archive metadata | ||
// disk format is potentially a breaking change and not currently required. | ||
std::string m_variable_encoding_methods_version{ffi::cVariableEncodingMethodsVersion}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess a high level question is. how are those constants related to multi-file-archive?
Technically, since they are not used by anything else, they can also be directly defined under struct "single_archive_metadata"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally these fields are written to multi-file archive metadata. However, since this is actually changing multi-file archives, I didn't want to make the change until we here from @kirkrodrigues. If there is no plan to add these to the multi-file metadata, perhaps we can move them into single file archive files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I didn't fully answer your question, but I asked kirk about these fields earlier and supposedly they are also relevant to multi file archive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked over your comments. Will work on implementing them later today
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Outdated
Show resolved
Hide resolved
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Outdated
Show resolved
Hide resolved
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Outdated
Show resolved
Hide resolved
archive_writer.write(packed_metadata.str().data(), packed_metadata.str().size()); | ||
} | ||
|
||
auto write_archive_file(std::filesystem::path const& file_path, FileWriter& archive_writer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure i'll look for somewhere to put. I'll see if I can also replace clp-s function with the util function
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Outdated
Show resolved
Hide resolved
// TODO: The following fields are used in single-file archive; however, they are not | ||
// currently part of multi-file archive metadata. Modifying multi-file archive metadata | ||
// disk format is potentially a breaking change and not currently required. | ||
std::string m_variable_encoding_methods_version{ffi::cVariableEncodingMethodsVersion}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally these fields are written to multi-file archive metadata. However, since this is actually changing multi-file archives, I didn't want to make the change until we here from @kirkrodrigues. If there is no plan to add these to the multi-file metadata, perhaps we can move them into single file archive files
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (2)
106-119
:⚠️ Potential issueAdd overflow protection to prevent archive corruption.
The function should check for potential overflow when adding file sizes to the offset.
Apply this diff to add overflow protection:
auto get_file_size_and_update_offset(std::filesystem::path const& file_path, uint64_t& offset) -> void { try { auto size = std::filesystem::file_size(file_path); + if (size > std::numeric_limits<uint64_t>::max() - offset) { + throw OperationFailed( + ErrorCode_Failure, + __FILENAME__, + __LINE__, + "File size would cause offset overflow" + ); + } offset += size; } catch (std::filesystem::filesystem_error const& e) {
234-269
:⚠️ Potential issueImprove error handling for cleanup operation.
The implementation should verify successful archive creation before removing the original files.
Apply this diff to improve error handling:
archive_writer.close(); + // Verify the archive was written successfully + if (!std::filesystem::exists(single_file_archive_path) || + std::filesystem::file_size(single_file_archive_path) == 0) { + throw OperationFailed( + ErrorCode_Failure, + __FILENAME__, + __LINE__, + "Failed to write single-file archive" + ); + } try { std::filesystem::remove_all(multi_file_archive_path); } catch (std::filesystem::filesystem_error& e) { - throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); + throw OperationFailed( + ErrorCode_Failure, + __FILENAME__, + __LINE__, + fmt::format("Failed to remove original archive: {}", e.what()) + ); }
🧹 Nitpick comments (6)
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (5)
7-7
: Consider standardizing header file extensions.The codebase mixes
.h
and.hpp
extensions. For consistency, consider standardizing on.hpp
for C++ headers.-#include "../../Defs.h" +#include "../../Defs.hpp"
28-28
: Document the rationale for the file size warning threshold.Consider adding a comment explaining why 100MB was chosen as the warning threshold and what actions are triggered when exceeded.
39-39
: Document the purpose of the unused array.Add a comment explaining why we reserve 6 unused slots in the header and their intended future use.
41-46
: Document the binary layout requirements for SingleFileArchiveHeader.Add a comment explaining:
- Why the struct needs to be packed
- The expected alignment and padding requirements
- The total size in bytes
55-60
: Consider adding size estimation methods to SingleFileArchiveMetadata.Since this structure represents the metadata section whose size needs to be known in advance (for
metadata_size
in the header), consider adding methods to estimate its serialized size.components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
136-141
: Document the compatibility considerations for future reference.The TODO comment explains why these fields are not part of multi-file archive metadata. Consider adding this information to the project documentation to help future maintainers understand the design decision.
Additionally, since these fields are relevant to both archive types (as mentioned in past discussions), consider creating a tracking issue for eventually adding them to multi-file archive metadata.
Would you like me to help create a documentation PR or tracking issue for this?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
(4 hunks)components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp
(1 hunks)components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
(1 hunks)components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp/streaming_archive/single_file_archive/writer.hpp
🧰 Additional context used
📓 Path-based instructions (3)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: filter-relevant-changes
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (12)
components/core/src/clp/streaming_archive/single_file_archive/Defs.hpp (1)
16-22
: LGTM! Versioning implementation aligns with semantic versioning requirements.The version encoding follows the discussed scheme:
(major_version << 24) | (minor_version << 16) | (patch_version)
.components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (3)
7-7
: LGTM! Well-structured includes and constant definition.The new include is properly placed, and the compression type constant follows C++ best practices using
constexpr
andstring_view
.Also applies to: 11-11, 14-16
87-98
: LGTM! Well-implemented getter methods following modern C++ practices.The methods make good use of:
[[nodiscard]]
attribute- Trailing return types
- Const references for efficiency
108-120
: Verify the impact of excluded fields in MsgPack serialization.Some member variables are not included in the serialization map:
m_dynamic_uncompressed_size
m_dynamic_compressed_size
m_creation_idx
m_creator_id_len
Please verify that excluding these fields won't affect compatibility between formats.
✅ Verification successful
Excluded fields are correctly omitted from MsgPack serialization
The excluded fields serve different purposes:
- Dynamic sizes are runtime-only values used for temporary calculations
- Creator ID length and creation index are specific to the multi-file archive format, as noted in the code comments
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Search for usage of excluded fields in the codebase # Check for usage of dynamic sizes in archive reading/writing rg -A 5 "m_dynamic_(un)?compressed_size" --type cpp # Check for usage of creation index and creator ID length rg -A 5 "m_creation_idx|m_creator_id_len" --type cppLength of output: 17287
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp (8)
1-22
: LGTM! Well-organized includes and namespace structure.The includes are properly organized and grouped by their sources (standard library, third-party, and project headers). The namespace choice appropriately reflects the module's purpose.
24-105
: LGTM! Well-documented function declarations with clear error handling.The function declarations are well-documented with clear purpose, parameters, and error handling expectations. The use of [[nodiscard]] attribute is appropriate for functions returning values.
121-156
: LGTM! Well-implemented file info collection with appropriate size warning.The function correctly builds the file info list and includes a helpful warning message for large archives that guides users towards using the multi-file archive format instead.
158-173
: Consider using ArchiveMetadata directly instead of intermediate struct.The conversion between ArchiveMetadata and MultiFileArchiveMetadata appears redundant. Consider refactoring to use ArchiveMetadata directly with MsgPack's key renaming feature (available since v2.1.0).
175-190
: LGTM! Proper header writing with size validation.The implementation correctly writes the header and includes appropriate size validation for the magic number.
192-195
: LGTM! Clean metadata writing implementation.The function correctly writes the packed metadata to the archive.
197-213
: LGTM! Robust file copying with proper error handling.The implementation correctly handles file copying with appropriate block size and error checking. While the function could potentially be moved to utils, its current location is justified as it's specific to archive operations.
215-231
: LGTM! Correct implementation of sequential file writing.The function properly handles writing both static archive files and segment files in the correct order.
I made requested changes and retested. I managed to remove MultiFileArchiveMetadata struct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
51-51
: Add documentation for the new configuration option.The
use_single_file_archive
member lacks documentation in the UserConfig struct. Please add documentation to maintain consistency with other members and improve API usability.bool print_archive_stats_progress; + /** Whether to write the archive in single-file format */ bool use_single_file_archive;
287-290
: Enhance method documentation with throw specifications.The method documentation should specify potential exceptions that could be thrown during file I/O operations. This is particularly important as this method handles archive creation and file removal operations.
/** * Writes archive to disk in single-file format then removes existing multi-file archive. + * @throw Archive::OperationFailed if writing the single-file archive fails + * @throw FileWriter::OperationFailed if file operations fail */ auto create_single_file_archive() -> void;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp/streaming_archive/writer/Archive.hpp
(4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp/streaming_archive/writer/Archive.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (1)
components/core/src/clp/streaming_archive/writer/Archive.hpp (1)
Learnt from: haiqi96
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (2)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
197-199
: LGTM! Well-implemented getter method.The implementation follows modern C++ best practices with appropriate use of [[nodiscard]], const-qualification, and trailing return type syntax.
354-354
: Move initialization to constructor for consistency.Based on the class's initialization style and previous feedback, member variables should be initialized in the constructor rather than using in-class initializers.
- bool m_use_single_file_archive{false}; + bool m_use_single_file_archive; Archive() : m_segments_dir_fd(-1), m_compression_level(0), m_global_metadata_db(nullptr), m_old_ts_pattern(nullptr), - m_schema_file_path() {} + m_schema_file_path(), + m_use_single_file_archive(false) {}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully this is the last batch of comments
components/core/src/clp/streaming_archive/single_file_archive/writer.cpp
Show resolved
Hide resolved
@@ -330,7 +338,9 @@ void Archive::write_msg_using_schema(LogEventView const& log_view) { | |||
m_old_ts_pattern = timestamp_pattern; | |||
} | |||
} | |||
if (get_data_size_of_dictionaries() >= m_target_data_size_of_dicts) { | |||
if (get_data_size_of_dictionaries() >= m_target_data_size_of_dicts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any plan on this?
archive_writer.write(packed_metadata.str().data(), packed_metadata.str().size()); | ||
} | ||
|
||
auto write_archive_file(std::filesystem::path const& file_path, FileWriter& archive_writer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't entirely agree with you, but sure let's leave the code as it is.
/** | ||
* Iterates over files in the multi-file archive copying their contents to the single-file archive. | ||
* Skips metadata file since already written in `write_archive_metadata`. | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/** | |
* Iterates over files in the multi-file archive copying their contents to the single-file archive. | |
* Skips metadata file since already written in `write_archive_metadata`. | |
* | |
/** | |
* Iterates over files in the multi-file archive and copies their contents to the single-file archive. | |
* Skips metadata file, which should be handled by `write_archive_metadata` method. | |
* |
auto write_single_file_archive( | ||
ArchiveMetadata const& multi_file_archive_metadata, | ||
std::filesystem::path const& multi_file_archive_path, | ||
segment_id_t next_segment_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about "last_segment_id" as suggested in my previous review.
@@ -135,7 +136,9 @@ bool compress( | |||
); | |||
} | |||
for (auto it = files_to_compress.cbegin(); it != files_to_compress.cend(); ++it) { | |||
if (archive_writer.get_data_size_of_dictionaries() >= target_data_size_of_dictionaries) { | |||
if (archive_writer.get_data_size_of_dictionaries() >= target_data_size_of_dictionaries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto for this
try { | ||
std::filesystem::remove_all(multi_file_archive_path); | ||
} catch (std::filesystem::filesystem_error& e) { | ||
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try to see if an exception is throw here, what will user see?
There are cases where the exception is caught and rethrown by some upper level caller, and as a result it's really hard to root cause the issue just by looking at the exception text. Want to avoid this case.
@@ -650,6 +660,20 @@ void Archive::update_metadata() { | |||
} | |||
} | |||
|
|||
void Archive::create_single_file_archive() { | |||
if (false == m_local_metadata.has_value()) { | |||
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto for exception
* | ||
* @param multi_file_archive_metadata | ||
* @param multi_file_archive_path | ||
* @param next_segment_id ID of the next segment to be created in the archive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder to update the docstring here for next_segment_id -> last_segment_id
/** | ||
* Combines file section metadata, multi-file archive metadata, and the number of segments into | ||
* single-file archive metadata. Once combined, serializes the metadata into MsgPack format. | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/** | |
* Combines file section metadata, multi-file archive metadata, and the number of segments into | |
* single-file archive metadata. Once combined, serializes the metadata into MsgPack format. | |
* | |
/** | |
* Combines file section metadata, multi-file archive metadata, and the number of segments into | |
* single-file archive metadata and serializes the metadata into MsgPack format. | |
* |
* Writes header, metadata and archive files in single-file format then | ||
* removes existing multi-file archive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Writes header, metadata and archive files in single-file format then | |
* removes existing multi-file archive. | |
* Writes the header, metadata and archive files into a single-file format then | |
* removes deletes the multi-file archive. |
Description
Reimplementation of unstructured single-file-archive writer from private branch into open source. Reader will be next after this PR is merged.
Open source sfa implementation is "backwards compatible" with private branch (i.e. open source can read archives compressed by private branch, and private branch can read archives compressed by open source. "Backwards compatible" is in quotes as archive metadata and metadataDB have diverged in open source and private branch. However, if the metadata and metadataDB are slightly altered to conform, then the format is backwards compatible.
In addition to the review, I have one high level question and one minor question.
The remainder of the description goes into details about implementation and differences.
High Level Question
Minor Question
Code structure
I decided to implement single file archive writer as a set of functions rather than a class. The single-file archive writer shares common private member variables with the regular archive writer, and I felt it didn't make sense to have a new class with copied members.
Nonetheless, I moved the single-file archive writer methods into their own file/namespace to prevent further bloating of writer/Archive.cpp. To start reviewing, I would look at the main entry point which is create_single_file_archive() function in writer/Archive.cpp .
Implementation differences vs. private branch
I decided to use MsgPack libraries directly instead of nlohmann/json MsgPack integration so the code is more similar to #563. Therefore, someone familiar with clp-s sfa implementation, should be more familiar with clp sfa implementation. As a result, there is no single-file metadata class like private branch, and just a MsgPack struct. Nonetheless, during testing, the msgpack libraries and nlohmann/json were both able to read each others serialized metadata.
Differences in multi-file metadata format
There was a small change (but i could be mistaken myself) in private branch where the archive version (in metadata not header) was changed from uint16_t to uint32_t. I decided for now to leave it as uint16_t in open source, and as a result, the metadata formats are not compatible. There are a multitude of options here. @LinZhihao-723 let us know your thoughts on this.
Differences in Metadata DB
The open source has a new field "BeginMessageIx" which is not in private branch. The private branch has "UtcOffsets" which is not in open source. As a result, clp will throw an error when opening database during decompression.
Javascript sfa reader
Note that despite the differences in metadata format and metadataDB, I believe a javascript sfa reader in log viewer would be able to read both open source and private branch without code modifications. Likely, javascript is less sensitive to integer type differences, and will not throw an error if the DBfield is missing as long as its not used.
Validation performed
I ported reader code from private branch "as is" (most code identical with a few minor changes) to open source. Effectively it is private branch reader without "UtcOffsets" MetadataDB field. I then modified open source writer to use uint32_t for archive version in metadata.
Next, I compressed Loghub/Zookeeper(one segment) and Loghub/HDFS2(multiple segments) with open source sfa writer, and was able to decompress with "private branch reader". I tested decompress files were the same with diff.
Summary by CodeRabbit
Release Notes
New Features
--single-file-archive
command-line option.Improvements
Technical Enhancements