Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for Bitnami cataloguer #3341

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

juan131
Copy link

@juan131 juan131 commented Oct 16, 2024

Description

This PR adds supports for a new Bitnami cataloguer so Syft is able to recognize Bitnami SBOMs and properly detect the packages available on Bitnami images.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (please discuss with the team first; Syft is 1.0 software and we won't accept breaking changes without going to 2.0)
  • Documentation (updates the documentation)
  • Chore (improve the developer experience, fix a test flake, etc, without changing the visible behavior of Syft)
  • Performance (make Syft run faster or use less memory, without changing visible behavior much)

Checklist:

  • I have added unit tests that cover changed behavior
  • I have tested my code in common scenarios and confirmed there are no regressions
  • I have added comments to my code, particularly in hard-to-understand sections

willmurphyscode and others added 6 commits October 16, 2024 16:04
Bitnami images have spdx SBOMs at predictable paths, and Syft could more
accurately identify the software in these images by scanning those
SBOMs. Start work on this by forking the sbom-cataloger as a new
bitnami-cataloger.

Signed-off-by: Will Murphy <[email protected]>
@github-actions github-actions bot added the json-schema Changes the json schema label Oct 17, 2024
@juan131 juan131 marked this pull request as ready for review October 17, 2024 10:26
@juan131
Copy link
Author

juan131 commented Oct 17, 2024

cc @willmurphyscode

@wagoodman wagoodman self-assigned this Nov 12, 2024
@willmurphyscode
Copy link
Contributor

Hi all! I just wanted to check in and see if there's anything I can do to help here.

My understanding is that the current state of this PR is that @juan131 is going to implement the changes discussed in #3065 (comment). Is that everyone else's understanding? Is there anything you all need from me? Thanks!

@wagoodman wagoodman removed their assignment Dec 17, 2024
@juan131
Copy link
Author

juan131 commented Jan 15, 2025

Yes @willmurphyscode! That's the current status

Sorry for the long delay. I was on marriage leave almost all December and then Christmas came, so I've been out for a while. Let me resume the work where I left it

@willmurphyscode
Copy link
Contributor

Hi @juan131! I just testing this, but I'm seeing something surprising:

❯ SYFT_PACKAGE_EXCLUDE_BINARY_OVERLAP_BY_OWNERSHIP=true go run ./cmd/syft -q bitnami/postgresql:14 | rg -e NAME -e postgres
NAME                    VERSION                  TYPE           
postgresql              14.15                    binary          
postgresql              14.15.0-9                bitnami         


❯ SYFT_PACKAGE_EXCLUDE_BINARY_OVERLAP_BY_OWNERSHIP=false go run ./cmd/syft -q bitnami/postgresql:14 | rg -e NAME -e postgres
NAME                    VERSION                  TYPE           
postgresql              14.15                    binary          
postgresql              14.15.0-9                bitnami         

I expected those two outputs to be different, namely that in the first output, only the bitnami postgresql package would appear.

Would you mind taking a look at this? Does your understanding of what's meant to happen match mine? (Also while you're pushing changes, there are a couple conflicts with main and the JSON schema moved again.)

@juan131
Copy link
Author

juan131 commented Jan 24, 2025

@willmurphyscode as I reported at #3065 (comment), the problem seems to be that binary packages doen't implement the OwnedFiles interface.

@willmurphyscode
Copy link
Contributor

Hi @juan131! Thanks for that comment.

We've pushed a partial fix that needs some modification before it can be merged. More on that below.

@wagoodman and I did some digging on this branch to understand why SYFT_PACKAGE_EXCLUDE_BINARY_OVERLAP_BY_OWNERSHIP=true wasn't causing deduplication. It turns out that binaryPkg does not need to implement FileOwner, but what does need to happen is that the bitnamiPkg.OwnedFiles needs to return the full path to the binary.

At https://github.com/anchore/syft/blob/main/internal/relationship/by_file_ownership.go#L87-L131, we are adding file overlap relationships in the situation where the parent (which must be a file owner) and the child (which does not need to be a file owner) if the parent owns the path that is evidence of the child. (Even non-file-owner packages have a location that Syft uses as the evidence of the package.

The reason why the deduplication wasn't occurring is that on the bitnami postgres package, only /opt/bitnami/postgresql, was returned, but file ownership expects files, not directories, so /opt/bitnami/postgresql/bin/postgresql need to be returned in order for the deduplication to occur.

On this branch, we added a commit that simply makes the packages from the Bitnami SPDX JSON own everything under, for example /opt/bitnami/postgresql. This works, but it makes every package in the SDPX JSON own every path, so we need some way of subsetting down which things own which paths, or maybe we just make the "main one" (in this case postgres) own all the paths, and the others don't own any. Does your SPDX SBOM give us a good way to assign file ownership more correctly than what we're doing? The best idea we've had so far is to discover a "main" package by looking at the documentDescribes field on the SDPX SBOM, and make that main package own all the files, and make none of the other packages own any files. How does that sounds to you all?

I'm happy to sync on this next week, since I know there's been some back and forth and we're all eager to get this in.

@juan131
Copy link
Author

juan131 commented Jan 27, 2025

@willmurphyscode thanks so much for digging on the issue and pushing the partial fix! 😊

... but file ownership expects files, not directories, so /opt/bitnami/postgresql/bin/postgresql need to be returned in order for the deduplication to occur.
On this branch, we added a commit that simply makes the packages from the Bitnami SPDX JSON own everything under, for example /opt/bitnami/postgresql

I can see that SPDX files are also reported as files belonging to the Bitnami package. I guess, we'll have to improve the glob expression to avoid that, right? I mean, it makes little sense to consider the SPDX (e.g. /opt/bitnami/postgresql/.spdx-postgresql.spdx) part of the package.

we need some way of subsetting down which things own which paths, or maybe we just make the "main one" (in this case postgres) own all the paths, and the others don't own any. Does your SPDX SBOM give us a good way to assign file ownership more correctly than what we're doing?

I see what you mean but, unfortunately, our SPDX files do not help on identifying what files under the directory belongs to a specific package. For instance, taking as an example the .spdx-apache.spdx file included on this diff, we have packages like the one below (modsecurity2) and it's not possible to identify what files under /opt/bitnami/apache` are part of it:

        {
            "SPDXID": "SPDXRef-modsecurity2",
            "name": "modsecurity2",
            "versionInfo": "2.9.7",
            "downloadLocation": "https://github.com/SpiderLabs/ModSecurity/releases/download/v2.9.7/modsecurity-2.9.7.tar.gz",
            "licenseConcluded": "Apache-2.0",
            "licenseDeclared": "Apache-2.0",
            "filesAnalyzed": false,
            "externalRefs": [
                {
                    "referenceCategory": "SECURITY",
                    "referenceType": "cpe23Type",
                    "referenceLocator": "cpe:2.3:*:trustwave:modsecurity:2.9.7:*:*:*:*:*:*:*"
                },
                {
                    "referenceCategory": "PACKAGE-MANAGER",
                    "referenceType": "purl",
                    "referenceLocator": "pkg:bitnami/[email protected]?arch=arm64&distro=debian-12"
                }
            ],
            "copyrightText": "NOASSERTION"
        }

The best idea we've had so far is to discover a "main" package by looking at the documentDescribes field on the SDPX SBOM, and make that main package own all the files, and make none of the other packages own any files. How does that sounds to you all?

I don't think we have other alternatives. My concern here is: would that imply that we're unable to detect duplicates on these other secondary packages?

@willmurphyscode
Copy link
Contributor

I don't think we have other alternatives. My concern here is: would that imply that we're unable to detect duplicates on these other secondary packages?

It would make it so that other packages do not deduplicate. This will probably be most annoying for binaries that Syft has a binary classifier for, and for files that are scannable in their own right (e.g. JARs, binaries built by Go, binaries that have ELF metadata notes identifying the package they belong to, probably some others).

I did see that some files are listed in the Bitnami SPDX. In the Postgres image we were looking at, these looked like they were JARs, for example /opt/bitnami/postgresql/share/pljava/pljava-1.6.8.jar listed in the SPDX as indirectly owned by pljava:

    {
      "spdxElementId": "SPDXRef-pljava",
      "relatedSpdxElement": "SPDXRef-Package-4afd881b870cf8ab",
      "relationshipType": "CONTAINS"
    },
... snip ...
    {
      "spdxElementId": "SPDXRef-Package-4afd881b870cf8ab",
      "relatedSpdxElement": "SPDXRef-File-28d7473df5d8af04-pljava",
      "relationshipType": "CONTAINS"
    },

I'm not sure what the intermediate package is telling me. The PURL for the overall pljava package is pkg:bitnami/[email protected]?arch=arm64&distro=debian-12 and the PURL for the intermediate one ispkg:maven/org.postgresql/[email protected].

So the relationship seems to look like this:

Postgresql (main package) CONTAINS -> pljava CONTAINS -> pljava (weird intermediate package) CONTAINS -> a pljava JAR file

It seems like maybe the best path forward is:

  1. For files explicitly declared in SPDX JSON, follow ownership tree upwards and make them owned by the lowest package that contains them and isn't the main package.
  2. For the "main" (i.e. "documentDescribes") SPDX package, make it own all files under /opt/bitnami/PRODUCT_NAME that are not owned by another package because of step 1.

Does that seem like a reasonable approach @wagoodman @juan131 ?

It seems like the PROs are:

  • good deduplication on the main package
  • capturing what little file ownership info is in the SPDX document accurately
    and the CONs are:
  • not being able to deduplicate some sub-packages, for example if we had a binary classifier for psql CLI, it would be reported (probably at slightly different versions) by both bitnami cataloger and binary classifier cataloger.

@juan131
Copy link
Author

juan131 commented Jan 30, 2025

@willmurphyscode your approach makes sense to me, I'll implement it!!

By the way, do you agree on adapting the glob expression to avoid SPDX files themselves be listed as part of the "main" package?

I'm not sure what the intermediate package is telling me. The PURL for the overall pljava package is pkg:bitnami/[email protected]?arch=arm64&distro=debian-12 and the PURL for the intermediate one is pkg:maven/org.postgresql/[email protected]

Let me figure out why we're including that "intermediate" package you mentioned with some colleague.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
json-schema Changes the json schema
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Support Bitnami embedded SBOMs
3 participants