Basic metadata #357
Generally I like the PR; a few things are missing or inconsistent:
- unknown chunks are missing from the report
- it is very difficult to link Tasks to the Chunks they originate from (I can probably figure out from the path, but a more explicit reference would be better)
- I would also assign an ID to each task, so linking subtasks and tasks would be much easier, instead of just using the path again
- Error reports should be linked to the chunk/file somehow. Right now it is difficult to see which chunk or file generated a report; usually the path can be figured out from the error text and matched to the corresponding chunk/file, but explicit linking would be better (e.g. ExtractCommandFailedReport, but it could be any other error report as well)
- `--json-report` currently overwrites the file if it already exists; I would only do that if -f is configured, otherwise I would error out
- in the Magic report I would store the magic_mime as well (though this could go to a separate PR)
- in the StatReport for symlinks I would store the symlink target as well (this could be a separate PR as well)
Nice job!
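The overwrite point about `--json-report` above could be handled roughly as in the sketch below. It is only an illustration: `open_report_for_writing` and its `force` parameter are hypothetical names, not functions from this PR, and it assumes the value of `-f` is already available as a boolean.

```python
from pathlib import Path


def open_report_for_writing(report_path: Path, force: bool = False):
    """Open the report file, refusing to replace an existing one unless forced.

    Hypothetical helper: `force` stands in for the value of the -f option.
    """
    # Mode "x" raises FileExistsError if the file already exists,
    # while mode "w" silently truncates it.
    mode = "w" if force else "x"
    return report_path.open(mode, encoding="utf-8")
```

Using exclusive-create mode leaves the existence check to the operating system, so there is no gap between checking for the file and opening it.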
Just a conversation starter around the UX of the CLI: For me it'd be more straightforward to just call the option |
In full agreement with vlaci here. A single |
Thanks for the reviews!
The latter is WIP, though might work. |
I have yet to fully understand the last two tests. Also, one of the last two fails on my machine as well (the output order is different).
I am thinking about how it could be structured in another way.
This object holds an association of `Task`s and their respective `TaskResult`s (and, of course, `Report`s). With this change we can reconstruct which report belongs to which task, and which subtasks come from which tasks, after processing is finished. This is a useful basis for metadata support, as any additional information can later be added as reports.
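A minimal sketch of such an association is shown below. It uses plain standard-library dataclasses, and every field name (`path`, `depth`, `kind`) as well as the `add_report`/`add_subtask` helpers are assumptions made for illustration; the classes in the PR may look different.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class Task:
    path: Path  # hypothetical field: the file this task processes
    depth: int  # hypothetical field: nesting level of the extraction


@dataclass
class Report:
    kind: str  # hypothetical field: e.g. "StatReport"


@dataclass
class TaskResult:
    """Ties one task to the reports and subtasks it produced."""

    task: Task
    reports: List[Report] = field(default_factory=list)
    subtasks: List[Task] = field(default_factory=list)

    def add_report(self, report: Report) -> None:
        self.reports.append(report)

    def add_subtask(self, subtask: Task) -> None:
        self.subtasks.append(subtask)


# Usage: after processing, both links can be read back from the result.
result = TaskResult(task=Task(path=Path("firmware.bin"), depth=0))
result.add_report(Report(kind="StatReport"))
result.add_subtask(Task(path=Path("firmware.bin_extract/0-100.zip"), depth=1))
```

Because the reports and subtasks live on the result of the task that produced them, no path matching is needed to find out where they came from.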
It can only process one file, so it is a more meaningful name
The output directory structure is expected to reflect the input structure. Having a single input file and a single extract directory is simple. However, supporting directories as input has problems: subdirectories should appear in the output, but we also create directories for extraction, so it is very easy to craft an input whose output conflicts (usually an unblob output is one such input).
A pair of "surrogateescape" decode and encode calls was used to get from bytes to bytes, with an internal utf-8 representation (read more about surrogateescape in https://peps.python.org/pep-0383/). However, the utf-8 representation was not used at all. Although this simplification is not strictly necessary (the results before and after must be the same), it took some time to read the code and understand how the encoding works while handling a problem that manifested in a failed command run (#356).
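To illustrate why the decode/encode pair was a no-op, here is a small standalone sketch (the function name `roundtrip` and the sample bytes are made up for this example): with the `surrogateescape` error handler, bytes that are not valid UTF-8 are mapped to lone surrogates on decode and mapped back to the same bytes on encode.

```python
def roundtrip(raw: bytes) -> bytes:
    """Decode and re-encode with surrogateescape; returns the input unchanged."""
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


# Valid UTF-8 ("caf\u00e9") followed by two bytes that are invalid in UTF-8.
sample = b"caf\xc3\xa9-\xff\xfe.bin"
decoded = sample.decode("utf-8", errors="surrogateescape")

# The invalid bytes become lone surrogates U+DCFF and U+DCFE in the str.
assert decoded[5:7] == "\udcff\udcfe"
# Encoding with the same handler restores the original bytes exactly.
assert roundtrip(sample) == sample
```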
Quite unexpectedly, there is a kind of report that has content in bytes: ExtractCommandFailedReport, as the stdout/stderr streams are in bytes. This made the metadata reporting fail in one case, so the custom JSON encoding has been changed to produce sensible output when running into an unknown object, and never fail.
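One way to get this "sensible output, never fail" behaviour is sketched below. It is an assumption-laden illustration rather than the encoder from this PR: `LenientEncoder` is a hypothetical name, and the choice of `backslashreplace` for undecodable bytes and `repr()` for unknown objects is mine.

```python
import json
from pathlib import PurePath


class LenientEncoder(json.JSONEncoder):
    """Hypothetical encoder that falls back to text instead of raising."""

    def default(self, o):
        # default() is only called for values json cannot serialize itself.
        if isinstance(o, bytes):
            # Keep readable output; undecodable bytes show up as \xNN escapes.
            return o.decode("utf-8", errors="backslashreplace")
        if isinstance(o, PurePath):
            return str(o)
        # Last resort for unknown objects: a string, never a TypeError.
        return repr(o)


def to_json(obj) -> str:
    return json.dumps(obj, cls=LenientEncoder)
```

Note that `default()` only covers values; dictionary keys of unsupported types would still make `json.dumps` raise, so a real implementation would need to handle those separately.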
Importing from conftest caused an ImportError in test_cli.py when the second conftest.py was created. conftest.py is a pytest magic module, most commonly used for defining fixtures. The fixtures defined there are made available to tests via dependency injection in parameters, not via importing them. TestHandler was moved from tests/conftest.py to tests/test_cli.py.
Uniqueness is guaranteed inside a single run
The output the tests work with is big, and as a result the tests are more fragile than usual.
resolves #327
The mapping between the objects in the `report.json` and the objects in #327 is:
- `TaskResult` [`Task` + `StatReport` + `FileMagicReport`] <-> `FilesystemObject`
- `ChunkReport`, `UnknownChunkReport` <-> `Chunk`