Implement resilient and memory sparing mirror process #279

Open · wants to merge 22 commits into main
Conversation

EugenMayer (Contributor) commented Feb 17, 2025

I would love to work on a patch to improve the situation:

Separate the process into two parts:

  1. The full fetch / year updates: use the current approach with 120-day slices. This heavily reduces memory usage and increases reliability. New: each year file is checked for its c-time; if it is not older than 8 days and its size is > 0, it is skipped. Thus we only update the year files every 8 days (folding in all the entries we had so far only picked up via the modified file). One can force-update the years by passing --forceUpdate (off by default).
  2. The modified-file update: re-implement it as a fetch across all years, but restricted to a "recently changed" slice covering the most recent 8 days or, if the modified file already exists, everything since its last c-time. This task now always fetches those >= 8 days in 1 API call, for all years, and replaces the current modified file. We can do this as an all-years fetch since the expected number of items, compared to the full fetch, is a magnitude lower. The current 8-day slice is 24666 CVEs.

These processes run one after the other.
With this, we combine the positive effects of both: reduced memory hogging and speedy regular modified updates.
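
For illustration, a minimal Java sketch of this two-phase flow, under my own assumptions: the method names (fetchYearInSlices, fetchModifiedSince), file names and the use of the file's last-modified time instead of c-time are placeholders, not the actual CLI code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.time.Year;

public class MirrorSketch {

    static final Duration YEAR_FILE_MAX_AGE = Duration.ofDays(8);

    /** Phase 1: refresh the per-year files, skipping anything fresh enough. */
    static void updateYearFiles(Path cacheDir, boolean forceUpdate) throws IOException {
        for (int year = 2002; year <= Year.now().getValue(); year++) {
            Path yearFile = cacheDir.resolve("nvdcve-" + year + ".json.gz");
            if (!forceUpdate && isFresh(yearFile)) {
                continue; // younger than 8 days and non-empty: the modified pass covers the gap
            }
            fetchYearInSlices(year, yearFile); // e.g. 120-day pubStartDate/pubEndDate windows
        }
    }

    /** Phase 2: one all-years request for everything modified recently. */
    static void updateModifiedFile(Path cacheDir) throws IOException {
        Path modifiedFile = cacheDir.resolve("nvdcve-modified.json.gz");
        Instant since = Files.exists(modifiedFile)
                ? Files.getLastModifiedTime(modifiedFile).toInstant()
                : Instant.now().minus(Duration.ofDays(8));
        fetchModifiedSince(since, modifiedFile); // single API call covering all years
    }

    static boolean isFresh(Path file) throws IOException {
        // the PR talks about c-time; plain Java only exposes last-modified time portably
        return Files.exists(file)
                && Files.size(file) > 0
                && Files.getLastModifiedTime(file).toInstant()
                        .isAfter(Instant.now().minus(YEAR_FILE_MAX_AGE));
    }

    static void fetchYearInSlices(int year, Path target) { /* placeholder */ }

    static void fetchModifiedSince(Instant since, Path target) { /* placeholder */ }
}
```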

@jeremylong what are your thoughts?

Stats:

Initial population:

  • the years are outdated, so we fetch 8 days for each year, meaning 4 (slices) x 23 years = 92 API requests fetching the most recent 8 days for each year
  • 1 API call for the modified file
    = 93 API calls

After the first cache population, let's say I run it the next day:

  • all years are skipped (zero API requests)
  • the modified file is fetched, covering 24 hours of recent changes, in 1 API request
    = 1 API request

After 9 days have passed:

  • the years are outdated, so we fetch 8 days for each year, meaning 4 (slices) x 23 years = 92 API requests fetching the most recent 8 days for each year
  • 1 API call for the modified file
    = 93 API calls

Thus the more expensive API calls are only triggered every 8 days.
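
For reference, the request-count arithmetic above written out; the "4 slices per year" and "23 years" figures are taken from the stats as stated, not derived here.

```java
public class RequestCountMath {
    public static void main(String[] args) {
        int slicesPerYear = 4;   // figure quoted in the stats above
        int years = 23;
        int modifiedCalls = 1;

        int fullRefresh = slicesPerYear * years + modifiedCalls; // 4 * 23 + 1 = 93
        int freshCacheRun = modifiedCalls;                       // all years skipped, only modified

        System.out.println("initial population / after 9 days: " + fullRefresh + " API calls");
        System.out.println("run on the next day:               " + freshCacheRun + " API call");
    }
}
```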

Further options:

  • implement an optimized "year update" process where we no longer fetch per year but across all years, limited to, let's say, the last 4 days (if the cache is not older) - this would reduce the "after 9 days passed" run to 2 API calls with slightly more memory usage. But I think this optimization might just not be worth it.

EugenMayer and others added 5 commits February 10, 2025 06:29
jeremylong#259)

* Reimplement caching to use a per-year scope
* Add forgiveness if a year fails, continue to the next one
* Reimplement how the lastUpdated date is used and stored per year
* Add lockfile
* fix: preserve modified entries if year fails
* polish docs, add exit code

---------

Co-authored-by: Jeremy Long <[email protected]>
* fix: bugfix in yearly cache update

* fix: ensure sorted

* fix: updated modified date in properties
* feat: retry nvd cache on failure

* fix: updated per peer review
EugenMayer (Contributor, Author) commented Feb 17, 2025

I found out that the slice size is something that makes the NVD API crash. For example:

https://services.nvd.nist.gov/rest/json/cves/2.0?pubStartDate=2021-01-01T00%3A00%3A00Z&pubEndDate=2021-04-28T00%3A00%3A00Z&resultsPerPage=200&startIndex=0

is a 115-day slice that consistently crashes.

https://services.nvd.nist.gov/rest/json/cves/2.0?pubStartDate=2021-01-01T00%3A00%3A00Z&pubEndDate=2021-03-28T00%3A00%3A00Z&resultsPerPage=200&startIndex=0

So 30 days less works. Also, using it to 'cross the 04/28' boundary works:

https://services.nvd.nist.gov/rest/json/cves/2.0?pubStartDate=2021-03-01T00%3A00%3A00Z&pubEndDate=2021-04-28T00%3A00%3A00Z&resultsPerPage=200&startIndex=0

So the API, maybe under load, decides to simply not handle requests over a specific size. This crash has been consistent for 6 hours.

I thus implemented a new CLI parameter, maxDaysOfYearRange.
By default it is 115 (for the good days); I tried 90 and could finally finish my mirror sync.

This is most probably one of those things that make the current main implementation so fragile: depending on the API load status, the API will not return more than X items and fails with a 503 if a request would still return that many items. 'Fetch them all' is just not something that works as expected, neither for memory nor for the API.
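
To make the maxDaysOfYearRange idea concrete, here is an illustrative Java snippet (my own sketch, not the PR's code) that cuts a year into pubStartDate/pubEndDate windows of at most that many days, producing request URLs in the same shape as the ones quoted above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class YearSlicer {

    static final String API = "https://services.nvd.nist.gov/rest/json/cves/2.0";
    static final DateTimeFormatter ISO = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'00:00:00'Z'");

    static List<String> sliceUrls(int year, int maxDaysOfYearRange, int resultsPerPage) {
        List<String> urls = new ArrayList<>();
        LocalDate start = LocalDate.of(year, 1, 1);
        LocalDate yearEnd = LocalDate.of(year + 1, 1, 1);
        while (start.isBefore(yearEnd)) {
            LocalDate end = start.plusDays(maxDaysOfYearRange);
            if (end.isAfter(yearEnd)) {
                end = yearEnd; // last window of the year is shorter
            }
            urls.add(API
                    + "?pubStartDate=" + URLEncoder.encode(ISO.format(start), StandardCharsets.UTF_8)
                    + "&pubEndDate=" + URLEncoder.encode(ISO.format(end), StandardCharsets.UTF_8)
                    + "&resultsPerPage=" + resultsPerPage
                    + "&startIndex=0"); // further result pages would bump startIndex
            start = end;
        }
        return urls;
    }

    public static void main(String[] args) {
        // with maxDaysOfYearRange=90 a year splits into 5 windows instead of 4 at 115 days
        sliceUrls(2021, 90, 200).forEach(System.out::println);
    }
}
```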

jeremylong (Owner) commented:
NVD is likely under load right now because of the error and users not upgrading and instead just trying again. It should level off soon.

EugenMayer (Contributor, Author) commented Feb 17, 2025

This is not an exception at all. It has been the case for a long, long time (weeks) - it was the motivation for my work in the first place. And on several days around the first implementation the API was down regularly.

This happens day in, day out IMHO.

jeremylong (Owner) commented:
If you are running into this daily, you are doing something wrong. Once you build the cache, don't throw it away. For instance, https://github.com/dependency-check/DependencyCheck_Builder/actions/workflows/cache.yml has been running for a long time and has only had a few occasions where it ran into issues.

EugenMayer changed the base branch from scratch/memory-usage-per-year to main on February 18, 2025 07:41
EugenMayer (Contributor, Author) commented Feb 18, 2025

The current status of this PR, aka 'API resilience and memory', finally works with the current ODC integrations like Gradle.

Features are:

  • reduced memory by fetching year-wise [memory]
  • years are saved one by one, so continuing with the last year that failed is possible [resilience]
  • if one year fails, the next try will continue from that year instead of restarting from scratch [resilience]
  • the process will not re-fetch a year if the cache is not older than 8 days (unless --forceUpdate has been used). The 8-day 'diff' is fetched via the modified-changes process [resilience]
  • modified changes are not re-fetched if they are not older than 1 hour (unless --forceUpdate has been used) [resilience]
  • docker: restart mirroring if the process fails. Restarting is debounced (1 min * retryAttempt); see the sketch after this list [resilience]
  • docker: you can define how often the restart should happen using MAX_MIRROR_RETRIES [resilience]
  • you can modify the max-days-fetched slice using maxDaysOfYearRange - the NVD API blocks requests with larger values earlier, so reducing it can help with 'getting through' [resilience]
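
The docker restart/debounce behaviour is a wrapper around the mirror run. Purely as an illustration of the back-off idea (not the actual entrypoint; the default of 5 retries and the runMirror() helper are my own placeholders), it boils down to something like:

```java
import java.time.Duration;

public class RetryLoop {

    public static void main(String[] args) throws InterruptedException {
        // MAX_MIRROR_RETRIES caps the number of attempts; default here is an assumption
        int maxRetries = Integer.parseInt(System.getenv().getOrDefault("MAX_MIRROR_RETRIES", "5"));
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (runMirror()) {
                return; // success, nothing more to do
            }
            // debounce grows linearly with the attempt number (1 min, 2 min, 3 min, ...)
            Thread.sleep(Duration.ofMinutes(attempt).toMillis());
        }
        System.exit(1); // give up after MAX_MIRROR_RETRIES failed attempts
    }

    static boolean runMirror() {
        // placeholder for invoking the mirror process and checking its exit code
        return false;
    }
}
```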

Drawbacks:

  • It uses more API calls (since we are doing smaller chunks to improve resilience/progress and the memory footprint), which can prolong the process. AFAICS a full fetch takes about 30 minutes longer (since one API request takes about 30-60s on NVD's side), an incremental 'recent-modified' run is as fast as before, and the 8-day year-diff refresh takes approx. 10 minutes. Effectively though, on a full fetch, the current process of gigantic fetches plus 'succeed all or fail all' often does not finish at all, since the NVD API throws random 503/403 errors more often than not (the 12 retries do not help; they happen way too fast while the error persists), restarting the entire process.

IMHO this approach finishes more reliably and deterministically, even with NVD API errors, while it can be slower than the current process if the NVD API does not fail at all. For me, neither from the dev machine nor from our DC k8s cluster have I ever seen a run finish without NVD failing to answer (even with the 12 retries). So IMHO, betting on that horse does not make sense for me.

You can try it yourself using ghcr.io/eugenmayer/vulnz:7.3.0-1 or the chart, or build the CLI version yourself from this branch.

EugenMayer changed the title from "Optimize modified cache rebuild api calls" to "Implement resilient and memory sparing mirror process" on Feb 20, 2025
EugenMayer (Contributor, Author) commented:
@jeremylong am I right to assume that you are not considering merging this PR?
