Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic metadata #357

Merged
merged 27 commits into from
May 9, 2022
Merged

Basic metadata #357

merged 27 commits into from
May 9, 2022

Conversation

e3krisztian
Copy link
Contributor

@e3krisztian e3krisztian commented Apr 28, 2022

resolves #327

unblob --report report.json

The mapping between the objects in the report.json and the objects in #327 is:

  • TaskResult[Task + StatReport + FileMagicReport] <-> FilesystemObject
  • ChunkReport, UnknownChunkReport <-> Chunk

Copy link
Contributor

@martonilles martonilles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I like the PR a few missing/inconsistent things:

  • unknown chunks are missing from the report
  • it is very difficult to link Tasks to the Chunks they originate from (I can probably figure out from the path, but a more explicit reference would be better)
  • I would also assign an ID to each task, so linking subtasks and tasks would be much easier, instead of just using the path again
  • Error reports should be assigned to the chunk/file somehow, now it is difficult to see which chunk or file generated the report, usually it is possible to figure out from the error text the path and find the corresponding chunk/file, but explicit linking would be better (eg: ExtractCommandFailedReport, but could be anything other error report as well)
  • --json-report currently overwrites the file if it already exists, I would only do it if -f is configured, otherwise I would error out
  • in the Magic report I would store the magic_mime as well (though this could go to a separate PR)
  • in the StatReport for symlinks I would store the symlink target as well (this could be a separate PR as well)

Nice job!

@vlaci
Copy link
Contributor

vlaci commented Apr 29, 2022

Just a conversation starter around the UX of the CLI: For me it'd me more straightforward to just call the option --report, then if we are to have multiple output formats, we could add a --report-format=json option for example. It'd have the downside that it is harder to support outputting multiple report formats concurrently but I don't really see a case to handle multiple formats for a single run like unblob --report-json foo.json --report-csv foo.csv anyway.

@qkaiser
Copy link
Contributor

qkaiser commented Apr 29, 2022

In full agreement with vlaci here. A single --report option is sufficient, we can add output format specification later if users actually request it, and if so we should emit a single type of report per run (JSON, CSV, XML).

@e3krisztian e3krisztian force-pushed the basic-metadata branch 2 times, most recently from c8b6079 to f5e4bce Compare April 29, 2022 15:04
@e3krisztian
Copy link
Contributor Author

Thanks for the reviews!
I have made the following changes:

  • --json-report -> --report (but the output will be json)
  • the report file is not overwritten unless --force is also specified
  • introduce some id to link reports and chunks and tasks

The latter is WIP, though might work.

@e3krisztian e3krisztian force-pushed the basic-metadata branch 4 times, most recently from 1cde579 to d1a2bff Compare May 4, 2022 13:06
@e3krisztian e3krisztian requested a review from martonilles May 4, 2022 15:02
@e3krisztian e3krisztian force-pushed the basic-metadata branch 5 times, most recently from cc8e856 to e8d3adb Compare May 5, 2022 18:21
@qkaiser qkaiser added this to the v2.0 - metadata extraction milestone May 6, 2022
@qkaiser qkaiser added the enhancement New feature or request label May 6, 2022
Copy link
Contributor

@vlaci vlaci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have yet to fully understand the last two tests. Also, the last two fail one on my machine as well (the output order is different).

I am thinking about how it could be structured in an other way

@e3krisztian e3krisztian force-pushed the basic-metadata branch 2 times, most recently from 3a6dc7b to 342d983 Compare May 9, 2022 13:29
László Vaskó and others added 22 commits May 9, 2022 15:31
This object holds an association of `Task`s and their respective
`TaskResult`s (and of course `Reports`). With this change we are able
to reconstruct what report is for what task, and also what subtasks
are coming from what tasks after the processing is finished.

This is a useful basis for metadata support, as any additional
information can later be added as reports.
It can only process one file, so it is a more meaningful name
The output directory structure is expected to reflect the input structure.

Having a single input file and a single extract directory is simple.

However supporting directories as input has problems:
subdirectories should be there in the output, but we also create
directories for extraction, thus it is very easy to craft an input that
has a conflicting output (usually an unblob output is one such input).
A pair of "surrogateescape" decode and encode was used to get
from byte to bytes, with an internal utf-8 representation.
(read more about surrogateescape in https://peps.python.org/pep-0383/)
However the utf-8 representation was not used at all.

Although this simplification is not strictly necessary
(the results before and after must be the same) it took
some time reading it and understanding how the encoding works
during handling a problem that manifested in a failed command run
(#356)
Quite unexpectedly, there is a kind of report, that has content in
bytes: ExtractCommandFailedReport, as the stdout/stderr streams are
in bytes.

It has made the metadata reporting fail in one case, so the custom
JSON encoding has been changed to make sensible output when running
into an unknown object, but never fail.
Importing from conftest caused an ImportError in test_cli.py, when the
second conftest.py is created.

conftest.py is a pytest magic module, it is most commonly used
for defining fixtures. The fixtures defined there are made available
to tests via dependency injection in parameters, not via importing them.

TestHandler was moved to tests/test_cli.py from tests/conftest.py
Uniqueness is guaranteed inside a single run
@e3krisztian e3krisztian force-pushed the basic-metadata branch 2 times, most recently from 12afc4e to 61a12d0 Compare May 9, 2022 14:38
The output the tests work with is big, and as a result the tests are
more fragile than usual.
@e3krisztian e3krisztian merged commit cb9f6cd into main May 9, 2022
@e3krisztian e3krisztian deleted the basic-metadata branch May 9, 2022 16:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Basic metadata on chunks & extracted files
4 participants