Be able to only emit information about Chunks #15
We need a clear definition of what constitutes metadata information and how deep we go without extraction.

**High-level metadata w/ in-memory matching**

Here we should report high-level metadata on chunks, down to per-chunk metadata.
**In-memory chunk objects**

This means we need to feed back objects to unblob that represent the carved-out chunk in memory, without writing the chunk to disk. We need a way to block unblob from seeking before the start of the chunk or after its end. Then, for each identified chunk, we need to be able to extract it in memory, creating an in-memory file object for each file and directory within the archive.

Do we really have the ability to represent all the files within an archive in memory as they are extracted, and to feed them to unblob recursively? If we plan to do this, we must steer away from using external tools for extraction. From my understanding, doing everything in memory means that we cannot rely on external extractors such as subprocess calls to 7z. Thus, we not only need to identify start and end offsets, but also implement the extraction ourselves or rely on external Python libraries to do so. Those may contain bugs, and they may be unmaintained.

At this point, I would strongly suggest that we keep using external extractors and focus engineering efforts on supporting as many formats as possible while building strong foundations (automated testing, linting, clean code). We will still be able to emit metadata files, although at the expense of extracting to disk. Once we reach the point of reliably supporting all the file formats we initially wanted to support, we can add in-memory extraction to each format.

An unmaintained project that looks like good inspiration for this: https://github.com/barneygale/pathlab

**Output format**

I propose that we rely on JSON to represent metadata:
{
  "files": [
    {
      "path": "/tmp/fruits.img",
      "gid": 1000,
      "uid": 1000,
      "perms": 777,
      "size": 3200,
      "modification_date": 123456789,
      "access_date": 123456789,
      "creation_date": 123456789,
      "chunks": [
        {
          "start_offset": 0,
          "end_offset": 1000,
          "type": "zip",
          "entropy": 0.14,
          "tags": [],
          "files": [
            {
              "path": "XXXXX",
              "chunks": []
            }
          ]
        },
        {
          "start_offset": 1001,
          "end_offset": 2000,
          "type": "unknown",
          "entropy": 0.99,
          "tags": ["ENCRYPTED"]
        },
        {
          "start_offset": 2001,
          "end_offset": 3200,
          "type": "tar",
          "entropy": 0.23,
          "tags": [],
          "files": []
        }
      ]
    }
  ]
}
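As a sketch, the proposed schema could be modeled with Python dataclasses and serialized with the standard `json` module. The class and field names below are illustrative only, not unblob's actual API:

```python
import json
from dataclasses import dataclass, field, asdict


# Hypothetical types mirroring the proposed JSON schema (illustrative names).
@dataclass
class Chunk:
    start_offset: int
    end_offset: int
    type: str
    entropy: float
    tags: list = field(default_factory=list)
    files: list = field(default_factory=list)  # nested file reports, if any


@dataclass
class FileReport:
    path: str
    chunks: list = field(default_factory=list)


report = FileReport(
    path="/tmp/fruits.img",
    chunks=[
        Chunk(0, 1000, "zip", 0.14),
        Chunk(1001, 2000, "unknown", 0.99, tags=["ENCRYPTED"]),
        Chunk(2001, 3200, "tar", 0.23),
    ],
)

# asdict() recursively converts nested dataclasses, so the whole tree
# serializes in one call.
output = json.dumps({"files": [asdict(report)]}, indent=2)
print(output)
```

One upside of this shape is that nesting is uniform: a chunk's `files` list holds the same structure as the top-level `files` list, so deeper layers slot in without schema changes.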
The use-case is a quick "peek" into a file: see what's inside the first layer without having to extract a 100 GB file, because we don't know what's in the other layers. We don't need to do it recursively; just emit some kind of information without extracting anything.
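One way to support this kind of non-extracting peek is a read-only wrapper that clamps all seeks and reads to a chunk's byte range, so format handlers can inspect a chunk without ever touching bytes outside it. This is only a sketch under that assumption, not unblob's implementation; the `ChunkReader` name and behavior are hypothetical:

```python
import io


class ChunkReader(io.RawIOBase):
    """Illustrative read-only view over one chunk of a larger file.

    Seeks and reads are clamped to [start, end), so a caller can never
    escape the chunk's boundaries.
    """

    def __init__(self, fileobj, start, end):
        self._f = fileobj
        self._start = start
        self._end = end
        self._f.seek(start)

    def readable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            target = self._start + offset
        elif whence == io.SEEK_CUR:
            target = self._f.tell() + offset
        else:  # io.SEEK_END: relative to the chunk's end, not the file's
            target = self._end + offset
        # Clamp to the chunk boundaries.
        target = max(self._start, min(target, self._end))
        self._f.seek(target)
        return target - self._start  # position relative to the chunk

    def read(self, size=-1):
        remaining = self._end - self._f.tell()
        if size < 0 or size > remaining:
            size = remaining
        return self._f.read(size)


# Usage: carve out the 11-byte middle chunk of an in-memory "file".
raw = io.BytesIO(b"HEADERzip-payloadTRAILER")
chunk = ChunkReader(raw, start=6, end=17)
payload = chunk.read()  # only the bytes inside the chunk
```

A wrapper like this keeps the first-layer handlers oblivious: they see a plain file object whose bounds happen to be the chunk.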
We agreed not to implement the second strategy we originally planned, because we want to reach feature parity with binwalk as soon as possible. Metadata extraction will still be needed, so I renamed this issue to reflect that.
This is probably a duplicate of #16.
Show metadata about found Chunks.