Be able to only emit information about Chunks #15

Closed
kissgyorgy opened this issue Nov 22, 2021 · 4 comments

Comments

@kissgyorgy
Contributor

kissgyorgy commented Nov 22, 2021

Show metadata about found Chunks.

@qkaiser
Contributor

qkaiser commented Nov 23, 2021

We need a clear definition of what constitutes metadata and how deep we go without extraction.

high level metadata w/ in-memory matching

Here we should report high-level metadata on chunks, up to the recursion level, without extracting files to disk.

chunk metadata

  • chunk identified format (unknown, zip, tar, etc)
  • chunk start offset
  • chunk end offset
  • chunk entropy
  • chunk tags (packed, obfuscated, encrypted), based on heuristics
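A minimal sketch of how such a chunk record could be filled in. The `ChunkMetadata` class, the `HIGH_ENTROPY` tag, and the 0.9 threshold are all hypothetical and not taken from unblob; entropy is computed as Shannon entropy over byte values divided by 8 so that it lands between 0 and 1, and `end_offset` is treated as exclusive here, which the thread does not specify.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ChunkMetadata:
    """Hypothetical record mirroring the fields listed above."""
    format: str          # "unknown", "zip", "tar", ...
    start_offset: int
    end_offset: int      # assumed exclusive in this sketch
    entropy: float       # normalized to the range 0..1
    tags: list = field(default_factory=list)


def normalized_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, divided by 8 bits."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    bits = sum(-(n / total) * math.log2(n / total) for n in counts.values())
    return bits / 8.0


def describe_chunk(blob: bytes, start: int, end: int, fmt: str = "unknown") -> ChunkMetadata:
    entropy = normalized_entropy(blob[start:end])
    # Assumed heuristic: very high entropy hints at packed or encrypted data,
    # but entropy alone cannot tell the two apart.
    tags = ["HIGH_ENTROPY"] if entropy > 0.9 else []
    return ChunkMetadata(fmt, start, end, entropy, tags)
```

A real tagger would need more than one signal to separate "packed" from "encrypted".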

in-memory chunk objects

This means that we need to feed objects back to unblob that represent the carved-out chunk in memory, without writing the chunk to disk. We need to implement a way to block unblob from seeking before the start of the chunk or past its end.
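One way to picture the seek restriction is a read-only window over the underlying file. The sketch below is an assumption about how it could look, not unblob's implementation; `ChunkView` is a hypothetical name, and the window is the half-open range `[start, end)`.

```python
import io


class ChunkView(io.RawIOBase):
    """Read-only window over [start, end) of an underlying seekable file."""

    def __init__(self, base, start: int, end: int):
        super().__init__()
        self._base = base
        self._start = start
        self._end = end
        self._pos = 0  # position relative to the chunk start

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        size = self._end - self._start
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self._pos + offset
        elif whence == io.SEEK_END:
            target = size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        # Refuse any position outside the chunk boundaries.
        if not 0 <= target <= size:
            raise ValueError("seek outside of chunk boundaries")
        self._pos = target
        return self._pos

    def readinto(self, buffer) -> int:
        remaining = self._end - self._start - self._pos
        count = min(len(buffer), remaining)
        if count <= 0:
            return 0  # end of chunk reached
        self._base.seek(self._start + self._pos)
        data = self._base.read(count)
        buffer[:len(data)] = data
        self._pos += len(data)
        return len(data)
```

Because offsets are relative to the chunk start, a handler reading from the view sees the chunk as if it were a standalone file.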

Then, for each identified chunk, we need to be able to extract it in memory, creating an in-memory file object for each file and directory within the archive. Do we really have the ability to represent all the files within an archive in memory as they are extracted, and to feed them to unblob recursively? If we plan to do this, we must steer away from using external tools for extraction.

From my understanding, doing everything in memory means that we cannot rely on external extractors such as subprocess calls to 7z. Thus, we not only need to identify start and end offsets, but also implement the extraction ourselves or rely on external Python libraries to do so. Those libraries may contain bugs or be unmaintained. At this point, I would strongly suggest that we keep using external extractors and focus engineering efforts on supporting as many formats as possible while building strong foundations (automated testing, linting, clean code). We will still be able to emit metadata files, although at the expense of extracting to disk. Once we reliably support all the file formats we initially wanted to support, we can add in-memory extraction for each format.
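To make the trade-off concrete, here is roughly what in-memory extraction looks like for one format when a Python library is available. It uses the standard library's `zipfile` as a stand-in; `read_zip_in_memory` is a hypothetical helper, and every other format would need its own equivalent, which is the cost described above.

```python
import io
import zipfile


def read_zip_in_memory(chunk: bytes) -> dict:
    """Return {member name: content} for a ZIP chunk without touching disk."""
    members = {}
    with zipfile.ZipFile(io.BytesIO(chunk)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content
            # Each member is fully decompressed into memory here.
            members[info.filename] = archive.read(info)
    return members
```

Each returned value could itself be scanned again for nested chunks, at the price of holding all decompressed content in memory.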

Unmaintained project that looks like good inspiration for this: https://github.com/barneygale/pathlab

output format

I propose that we rely on JSON format to represent metadata.

  • offsets MUST be represented in decimal notation.
  • UID/GID MUST be represented in decimal notation
  • permission MUST be represented in octal notation
  • size MUST be expressed in bytes using decimal notation
  • entropy MUST be represented as a ratio between 0 and 1 inclusive
  • entropy MUST be calculated over ranges of TO_BE_DEFINED bytes
  • dates MUST be represented as timestamps in milliseconds
  • dates MUST default to -1 if unknown/unavailable
{
  "files": [
    {
      "path": "/tmp/fruits.img",
      "gid": 1000,
      "uid": 1000,
      "perms": 777,
      "size": 3200,
      "modification_date": 123456789,
      "access_date": 123456789,
      "creation_date": 123456789,
      "chunks": [
        {
          "start_offset": 0,
          "end_offset": 1000,
          "type": "zip",
          "entropy": 0.14,
          "tags": [],
          "files": [
            {
              "path": "XXXXX",
              "chunks": []
            }
          ]
        },
        {
          "start_offset": 1001,
          "end_offset": 2000,
          "type": "unknown",
          "entropy": 0.99,
          "tags": ["ENCRYPTED"]
        },
        {
          "start_offset": 2001,
          "end_offset": 3200,
          "type": "tar",
          "entropy": 0.23,
          "tags": [],
          "files": []
        }
      ]
    }
  ]
}
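The formatting rules above could be applied to a file on disk along these lines. This is a sketch under assumptions: `file_metadata` is a hypothetical helper, permissions are emitted as an octal string because JSON has no octal number literal (the example above uses a bare number instead), and `creation_date` falls back to -1 because `os.stat` does not expose a creation time portably.

```python
import json
import os
import stat


def file_metadata(path: str) -> dict:
    """Collect per-file metadata following the proposed output rules."""
    st = os.stat(path)
    return {
        "path": path,
        "gid": st.st_gid,                                # decimal
        "uid": st.st_uid,                                # decimal
        "perms": format(stat.S_IMODE(st.st_mode), "o"),  # octal, as a string
        "size": st.st_size,                              # bytes, decimal
        "modification_date": st.st_mtime_ns // 1_000_000,  # milliseconds
        "access_date": st.st_atime_ns // 1_000_000,        # milliseconds
        "creation_date": -1,                             # unknown/unavailable
        "chunks": [],
    }


def to_report(paths: list) -> str:
    """Serialize the metadata of several files as one JSON document."""
    return json.dumps({"files": [file_metadata(p) for p in paths]}, indent=2)
```

Emitting permissions as a string avoids the ambiguity of a JSON number such as 777, which a consumer would otherwise read as decimal.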

@kissgyorgy
Contributor Author

kissgyorgy commented Nov 23, 2021

The use case is a quick "peek" into a file: see what's inside the first layer without having to extract a 100 GB file, because we don't know what's in the other layers. We don't need to do it recursively.

Just spit out some kind of information without extracting anything.
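A first-layer "peek" could be as simple as scanning for known magic bytes and reporting where they occur. The sketch below is only an illustration: `SIGNATURES` and `peek` are hypothetical names, only two well-known signatures are listed, and a bare magic match gives no end offset and can produce false positives.

```python
# Magic bytes for two common formats: ZIP local file header and gzip (deflate).
SIGNATURES = {
    "zip": b"PK\x03\x04",
    "gzip": b"\x1f\x8b\x08",
}


def peek(data: bytes) -> list:
    """Report candidate chunk starts in the first layer, without extracting."""
    hits = []
    for name, magic in SIGNATURES.items():
        pos = data.find(magic)
        while pos != -1:
            hits.append({"type": name, "start_offset": pos})
            pos = data.find(magic, pos + 1)
    # Order candidates by where they appear in the file.
    return sorted(hits, key=lambda hit: hit["start_offset"])
```

For a very large file, the same scan would have to run over a memory-mapped or chunked read rather than a single `bytes` object.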

@kissgyorgy changed the title from "Be able to only emit information about Chunks, not extract it" to "Be able to only emit information about Chunks" on Dec 6, 2021
@kissgyorgy
Contributor Author

We agreed not to implement the second strategy we originally planned, because we want to reach feature parity with binwalk as soon as possible. The metadata extraction will still be needed, so I renamed this issue to reflect that.

@martonilles
Contributor

This is probably a duplicate of #16

@martonilles removed this from the v2.0 - metadata extraction milestone on Apr 5, 2022
vlaci pushed a commit that referenced this issue Feb 1, 2025