Be able to only emit information about Chunks #15

kissgyorgy · 2021-11-22T12:25:04Z

Show metadata about found Chunks.

qkaiser · 2021-11-23T09:24:41Z

We need a clear definition of what constitutes metadata and how deep we go without extraction.

high level metadata w/ in-memory matching

Here we should report high-level metadata on chunks, up to the recursion level, without extracting files to disk.

chunk metadata

chunk identified format (unknown, zip, tar, etc)
chunk start offset
chunk end offset
chunk entropy
chunk tags (packed, obfuscated, encrypted), based on heuristics
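As an illustration of the entropy field, here is a minimal sketch that computes Shannon entropy over a byte buffer and scales it to the 0 to 1 ratio proposed under the output format. The function name `normalized_entropy` is hypothetical and not part of unblob; the window size is still to be defined, so the caller decides which slice of the chunk to pass in.

```python
import math
from collections import Counter


def normalized_entropy(data: bytes) -> float:
    """Shannon entropy of `data`, scaled from bits per byte (0..8) to 0..1.

    Hypothetical helper for illustration; an empty buffer is reported as 0.0.
    """
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    bits = sum(-(n / total) * math.log2(n / total) for n in counts.values())
    return bits / 8.0


if __name__ == "__main__":
    print(normalized_entropy(b"\x00" * 1024))     # constant data, lowest entropy
    print(normalized_entropy(bytes(range(256))))  # uniform bytes, highest entropy
```

Tags such as "encrypted" or "packed" could then be derived from thresholds on this value, but the thresholds themselves would need to be agreed on separately.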

in-memory chunk objects

This means that we need to feed objects back to unblob that represent the carved-out chunk in memory, without writing the chunk to disk. We need a way to prevent unblob from seeking before the start of the chunk or past its end.
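One way to restrict seeking could look like the sketch below: a read-only file object that exposes only a window of an underlying seekable file. `ChunkView` is a hypothetical name, not an existing unblob class, and the sketch assumes `end` is exclusive and that out-of-range seeks are clamped rather than rejected.

```python
import io


class ChunkView(io.RawIOBase):
    """Read-only window over [start, end) of an underlying seekable file.

    Hypothetical helper. Positions are relative to the chunk start, and
    seeks are clamped so the caller can never leave the chunk.
    """

    def __init__(self, file, start: int, end: int):
        super().__init__()
        if not 0 <= start <= end:
            raise ValueError("invalid chunk bounds")
        self._file = file
        self._start = start
        self._size = end - start
        self._pos = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self._pos + offset
        elif whence == io.SEEK_END:
            target = self._size + offset
        else:
            raise ValueError(f"unsupported whence: {whence}")
        # Clamp to the chunk boundaries instead of raising.
        self._pos = max(0, min(target, self._size))
        return self._pos

    def readinto(self, buffer) -> int:
        remaining = self._size - self._pos
        if remaining <= 0:
            return 0  # end of chunk behaves like end of file
        self._file.seek(self._start + self._pos)
        data = self._file.read(min(len(buffer), remaining))
        buffer[: len(data)] = data
        self._pos += len(data)
        return len(data)
```

For example, `ChunkView(io.BytesIO(b"0123456789"), 2, 6).read()` yields only the four bytes inside the window. The view shares the underlying file's position, so it is not safe to use from several threads without extra locking.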

Then, for each identified chunk, we need to be able to extract it in memory, creating an in-memory file object for each file and directory within the archive. Can we really represent all the files within an archive in memory as they are extracted and feed them to unblob recursively? If we plan to do this, we must steer away from using external tools for extraction.
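To make the question concrete, here is a minimal sketch of in-memory extraction for a single format, using only Python's standard `zipfile` and `io` modules. The function name `extract_zip_in_memory` is hypothetical, and the sketch assumes the whole chunk and all of its members fit in RAM, which is exactly the limitation raised above.

```python
import io
import zipfile


def extract_zip_in_memory(chunk: bytes) -> dict[str, io.BytesIO]:
    """Map each regular file in a ZIP chunk to an in-memory file object.

    Hypothetical helper: nothing is written to disk, directories are skipped.
    """
    members = {}
    with zipfile.ZipFile(io.BytesIO(chunk)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Decompress the member fully into memory.
            members[info.filename] = io.BytesIO(archive.read(info))
    return members
```

Each returned `io.BytesIO` could in turn be scanned for chunks, which is the recursive step. Formats without a pure-Python reader would each need their own equivalent.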

From my understanding, doing everything in memory means that we cannot rely on external extractors such as subprocess calls to 7z. Thus, we not only need to identify start and end offsets, but also implement the extraction ourselves or rely on external Python libraries to do so. Those libraries may contain bugs or be unmaintained. At this point, I would strongly suggest that we keep using external extractors and focus engineering efforts on supporting as many formats as possible while building strong foundations (automated testing, linting, clean code). We will still be able to emit metadata files, although at the expense of extracting to disk. Once we reliably support all the file formats we initially wanted to support, we can add in-memory extraction to each format.

Unmaintained project that looks like good inspiration for this: https://github.com/barneygale/pathlab

output format

I propose that we rely on JSON format to represent metadata.

offsets MUST be represented in decimal notation
UID/GID MUST be represented in decimal notation
permissions MUST be represented in octal notation
size MUST be expressed in bytes using decimal notation
entropy MUST be represented as a ratio between 0 and 1 inclusive
entropy MUST be calculated over ranges of TO_BE_DEFINED bytes
dates MUST be represented as timestamps in milliseconds
dates MUST default to -1 if unknown/unavailable

{
"files":[
{
  "path": "/tmp/fruits.img",
  "gid": 1000,
  "uid": 1000,
  "perms": 777,
  "size": 3200,
  "modification_date": 123456789,
  "access_date": 123456789,
  "creation_date": 123456789,
  "chunks": [
    {
      "start_offset": 0,
      "end_offset": 1000,
      "type": "zip",
      "entropy": 0.14,
      "tags": [],
      "files": [
        {
          "path": "XXXXX",
          "chunks": []
        }
      ]
    },
    {
      "start_offset": 1001,
      "end_offset": 2000,
      "type": "unknown",
      "entropy": 0.99,
      "tags":["ENCRYPTED"]
    },
    {
      "start_offset": 2001,
      "end_offset": 3200,
      "type": "tar",
      "entropy": 0.23,
      "tags":[],
      "files": []
    }
  ]
}
]
}
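The file-level fields in the example above could be filled in from `os.stat`, as in the following sketch. `file_metadata` is a hypothetical helper, not unblob code. It assumes permissions are written as octal digits in a JSON number, mirroring the `"perms": 777` in the example, and it falls back to -1 for the creation date on platforms that do not expose one.

```python
import json
import os
import stat


def file_metadata(path: str) -> dict:
    """Build the file-level part of the proposed metadata record.

    Hypothetical helper following the rules listed above: decimal UID/GID
    and size, octal permissions, dates as millisecond timestamps, -1 when
    a date is unavailable.
    """
    st = os.stat(path)
    # st_birthtime only exists on some platforms (an assumption to verify).
    birth = getattr(st, "st_birthtime", None)
    return {
        "path": path,
        "gid": st.st_gid,
        "uid": st.st_uid,
        # Octal digits stored as a number, e.g. mode 0o755 becomes 755.
        "perms": int(format(stat.S_IMODE(st.st_mode), "o")),
        "size": st.st_size,
        "modification_date": st.st_mtime_ns // 1_000_000,
        "access_date": st.st_atime_ns // 1_000_000,
        "creation_date": int(birth * 1000) if birth is not None else -1,
        "chunks": [],  # to be filled in by chunk identification
    }


if __name__ == "__main__":
    print(json.dumps({"files": [file_metadata(os.getcwd())]}, indent=2))
```

Storing octal digits in a JSON number is ambiguous for consumers that read it as decimal, so a string such as "0777" may be worth considering when the format is finalized.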

kissgyorgy · 2021-11-23T10:14:18Z

The use case is a quick "peek" into a file: see what's inside the first layer without having to extract a 100 GB file, because we don't know what is in the other layers. We don't need to do it recursively.

Just spit out some kind of information without extracting anything.

kissgyorgy · 2021-12-06T13:17:54Z

We agreed not to implement the second strategy we originally planned, because we want to reach feature parity with binwalk as soon as possible. Metadata extraction will still be needed, so I renamed this issue to reflect that.

martonilles · 2022-03-22T12:55:24Z

This is probably a duplicate of #16


kissgyorgy added this to the v2.0 - more in depth extraction milestone Nov 24, 2021

kukovecz mentioned this issue Nov 26, 2021

Recognize all file big chunks #59

Merged

kissgyorgy changed the title ~~Be able to only emit information about Chunks, not extract it~~ Be able to only emit information about Chunks Dec 6, 2021

martonilles removed this from the v2.0 - metadata extraction milestone Apr 5, 2022

kissgyorgy closed this as completed Jan 24, 2023

vlaci pushed a commit that referenced this issue Feb 1, 2025

Merge pull request #15 from onekey-sec/update_flake_lock_action

885493c

Update flake.lock
