Basic metadata #357

e3krisztian · 2022-04-28T11:49:08Z

resolves #327

unblob --report report.json

The mapping between the objects in the report.json and the objects in #327 is:

TaskResult[Task + StatReport + FileMagicReport] <-> FilesystemObject
ChunkReport, UnknownChunkReport <-> Chunk

martonilles

Generally I like the PR. A few missing or inconsistent things:

unknown chunks are missing from the report
it is very difficult to link Tasks to the Chunks they originate from (I can probably figure out from the path, but a more explicit reference would be better)
I would also assign an ID to each task, so linking subtasks and tasks would be much easier, instead of just using the path again
Error reports should be assigned to the chunk/file somehow. Right now it is difficult to see which chunk or file generated a report; usually the path can be figured out from the error text and matched to the corresponding chunk/file, but explicit linking would be better (e.g. ExtractCommandFailedReport, but it could be any other error report as well)
--json-report currently overwrites the file if it already exists; I would only do that if -f is given, and error out otherwise
in the Magic report I would store the magic_mime as well (though this could go to a separate PR)
in the StatReport for symlinks I would store the symlink target as well (this could be a separate PR as well)

Nice job!

vlaci · 2022-04-29T09:13:58Z

Just a conversation starter around the UX of the CLI: for me it'd be more straightforward to just call the option --report; then, if we are to have multiple output formats, we could add a --report-format=json option, for example. The downside is that it is harder to support outputting multiple report formats concurrently, but I don't really see a case for handling multiple formats in a single run, like unblob --report-json foo.json --report-csv foo.csv, anyway.

qkaiser · 2022-04-29T09:25:47Z

In full agreement with vlaci here. A single --report option is sufficient; we can add output format specification later if users actually request it, and if so we should emit a single type of report per run (JSON, CSV, XML).
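
The option layout discussed above could look roughly like this. It is a minimal sketch using the standard library's argparse, not unblob's actual CLI code; `build_parser` is an illustrative name, and `--report-format` is the hypothetical option from the discussion, restricted here to JSON.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser only; option names follow the discussion above.
    parser = argparse.ArgumentParser(prog="unblob")
    parser.add_argument("file", type=Path, help="input file to process")
    parser.add_argument(
        "--report", type=Path, default=None, help="write a report to this path"
    )
    # Hypothetical: exactly one report format per run, defaulting to JSON.
    parser.add_argument("--report-format", choices=["json"], default="json")
    return parser


args = build_parser().parse_args(["firmware.bin", "--report", "report.json"])
print(args.report, args.report_format)
```

Keeping the format as a separate option means a new format only extends `choices`, at the cost of one report per run.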

e3krisztian · 2022-04-29T16:50:02Z

Thanks for the reviews!
I have made the following changes:

--json-report -> --report (but the output will be json)
the report file is not overwritten unless --force is also specified
introduce some id to link reports and chunks and tasks

The last one is WIP, though it might work.
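
The overwrite rule from the list above can be sketched as follows. This is an assumption about the behavior, not the code in the PR; `write_report` and its `force` parameter are hypothetical names. Opening the file in exclusive-creation mode makes the existence check and the write a single step.

```python
import json
import tempfile
from pathlib import Path


def write_report(path: Path, data: dict, force: bool = False) -> None:
    # "x" fails with FileExistsError if the file is already there;
    # "w" truncates and overwrites, which is only allowed when forced.
    mode = "w" if force else "x"
    with path.open(mode) as report_file:
        json.dump(data, report_file, indent=2)


with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.json"
    write_report(report, {"tasks": []})
    try:
        write_report(report, {"tasks": []})
        overwritten = True
    except FileExistsError:
        overwritten = False
    write_report(report, {"tasks": [1]}, force=True)
    forced_content = json.loads(report.read_text())
```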

unblob/processing.py

unblob/cli.py

unblob/identifiers.py

tests/test_report.py

unblob/models.py

unblob/processing.py

tests/test_report.py

vlaci

I have yet to fully understand the last two tests. Also, the last two fail on my machine as well (the output order is different).

I am thinking about how it could be structured in another way

tests/test_report.py

This object holds an association of `Task`s and their respective `TaskResult`s (and of course `Reports`). With this change we are able to reconstruct what report is for what task, and also what subtasks are coming from what tasks after the processing is finished. This is a useful basis for metadata support, as any additional information can later be added as reports.
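
To illustrate the kind of association described above, here is a minimal sketch. `Task` and `TaskResult` are names from the PR, but every field shown (`task_id`, `path`, `reports`, `subtasks`) and the helper `children_by_parent` are assumptions made for the example, not the real definitions in unblob/models.py.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    task_id: str  # hypothetical field name
    path: str


@dataclass
class TaskResult:
    task: Task  # the task this result belongs to
    reports: List[dict] = field(default_factory=list)
    subtasks: List[Task] = field(default_factory=list)


def children_by_parent(results: List[TaskResult]) -> Dict[str, List[str]]:
    # Rebuild the task tree after processing, using the results alone.
    tree: Dict[str, List[str]] = {}
    for result in results:
        for sub in result.subtasks:
            tree.setdefault(result.task.task_id, []).append(sub.task_id)
    return tree


root = Task("1", "firmware.bin")
child = Task("2", "firmware.bin_extract/rootfs")
results = [TaskResult(root, subtasks=[child]), TaskResult(child)]
tree = children_by_parent(results)
```

Because each result carries its own task, no path matching is needed to tell which report or subtask belongs where.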

The output directory structure is expected to reflect the input structure. Having a single input file and a single extract directory is simple. However, supporting directories as input has problems: subdirectories should be present in the output, but we also create directories for extraction, so it is very easy to craft an input that has a conflicting output (an unblob output is usually one such input).

A pair of "surrogateescape" decode and encode calls was used to get from bytes to bytes, with an internal utf-8 representation (read more about surrogateescape in https://peps.python.org/pep-0383/). However, the utf-8 representation was not used at all. Although this simplification is not strictly necessary (the results before and after must be the same), it took some time to read the code and understand how the encoding works while handling a problem that manifested in a failed command run (#356).
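
For readers unfamiliar with the error handler, this small example shows why such a decode/encode pair is a no-op on the bytes. The sample value is made up for illustration.

```python
raw = b"caf\xe9/\xff.bin"  # made-up name that is not valid UTF-8

# Decoding with surrogateescape maps each undecodable byte to a lone
# surrogate code point in the range U+DC80..U+DCFF ...
text = raw.decode("utf-8", errors="surrogateescape")

# ... and encoding with the same handler turns them back into the
# original bytes, so the pair round-trips losslessly.
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

If the intermediate `str` is never used, the pair can be dropped without changing the result.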

Quite unexpectedly, there is a kind of report that has content in bytes: ExtractCommandFailedReport, as the stdout/stderr streams are in bytes. It made the metadata reporting fail in one case, so the custom JSON encoding has been changed to produce sensible output when running into an unknown object, and to never fail.
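
One way to get an encoder that never fails is sketched below. This is an assumption about the approach, not the encoder from the PR; `LenientEncoder` is a hypothetical name, and the choice of `backslashreplace` and `repr` as fallbacks is illustrative.

```python
import json


class LenientEncoder(json.JSONEncoder):
    def default(self, obj):
        # Called only for objects the stock encoder cannot serialize.
        if isinstance(obj, bytes):
            # Keep readable text, escape bytes that are not valid UTF-8.
            return obj.decode("utf-8", errors="backslashreplace")
        # Fall back to a readable placeholder instead of raising TypeError.
        return repr(obj)


report = {"stderr": b"error: \xff", "handler": object}
encoded = json.dumps(report, cls=LenientEncoder)
decoded = json.loads(encoded)
```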

Importing from conftest caused an ImportError in test_cli.py when the second conftest.py was created. conftest.py is a pytest magic module, most commonly used for defining fixtures. The fixtures defined there are made available to tests via dependency injection through parameters, not by importing them. TestHandler was moved from tests/conftest.py to tests/test_cli.py.

e3krisztian requested review from janos-gonye, martonilles and vlaci April 28, 2022 12:04

e3krisztian force-pushed the basic-metadata branch from 288446f to a6bf6a5 Compare April 28, 2022 12:20

martonilles requested changes Apr 28, 2022

e3krisztian force-pushed the basic-metadata branch 2 times, most recently from c8b6079 to f5e4bce Compare April 29, 2022 15:04

e3krisztian force-pushed the basic-metadata branch 4 times, most recently from 1cde579 to d1a2bff Compare May 4, 2022 13:06

e3krisztian requested a review from martonilles May 4, 2022 15:02

vlaci reviewed May 5, 2022

e3krisztian force-pushed the basic-metadata branch 5 times, most recently from cc8e856 to e8d3adb Compare May 5, 2022 18:21

qkaiser added this to the v2.0 - metadata extraction milestone May 6, 2022

qkaiser assigned e3krisztian May 6, 2022

qkaiser added the enhancement New feature or request label May 6, 2022

vlaci mentioned this pull request May 6, 2022

Rework chunk creation and processing workflow #369

Open

e3krisztian force-pushed the basic-metadata branch from e8d3adb to 587c130 Compare May 6, 2022 17:14

vlaci reviewed May 9, 2022

e3krisztian force-pushed the basic-metadata branch 2 times, most recently from 3a6dc7b to 342d983 Compare May 9, 2022 13:29

TaskResult: Add task

1c6c930

László Vaskó and others added 22 commits May 9, 2022 15:31

Report: make it possible to add non-error related reports

3b1676c

processing: store some metadata in reports

e8e5ba4

processing: remove multiple input files support

517bb77

processing: process_files -> process_file

f1c2620

It can only process one file, so it is a more meaningful name

remove unused function get_existing_extract_dirs

72178f4

refactor/rename variables to better differentiate extraction and carve directory

1f6c0b2

processing: _process_one_file function is no longer needed

5d2a082

write metadata result

0abb454

CLI: optionally create JSON report file

829d750

CLI: also get help with -h

ab97550

Report recognized chunks

ebf1f28

refactor: get_extract_dir_for_input -> ExtractionConfig.get_extract_dir_for

9c4b792

existing JSON report output file prevents run without --force

60e3d6e

Generate unique identifiers

b17742b

Uniqueness is guaranteed inside a single run

Add id-s to chunks, and chunk_id-s to tasks

1601c99

feat(metadata): report unknown chunks

6cff605

github: run tests verbosely

ded0daf

e3krisztian force-pushed the basic-metadata branch 2 times, most recently from 12afc4e to 61a12d0 Compare May 9, 2022 14:38

Add tests for the processing report

5a7ada4

The output the tests work with is big, and as a result the tests are more fragile than usual.

e3krisztian force-pushed the basic-metadata branch from 61a12d0 to 5a7ada4 Compare May 9, 2022 14:56

vlaci approved these changes May 9, 2022

martonilles approved these changes May 9, 2022

e3krisztian merged commit cb9f6cd into main May 9, 2022

e3krisztian deleted the basic-metadata branch May 9, 2022 16:17
