
Reimplement caching to use a per-year scope fixes #258 #259

Open
wants to merge 3 commits into main

Conversation

@EugenMayer (Contributor) commented Jan 27, 2025

Reimplement NVD caching to reduce the memory footprint and improve performance - fixes #258

  • stop holding all entries for all years in memory at the same time; instead, load entries for one year only, save them to a file, free the memory, and continue with the next year
  • use publish-date range fetching to ensure we fetch all entries for one year
  • ensure our date range respects the API limit (max 120 days) and fetch in chunks so we can cover an entire year (see the sketch after this list)
  • refactor the loop to improve readability
  • reduce how often we traverse, dedupe, or sort the lists
  • ensure publicationDate filters can only be applied once (as should have been the case for all filters)
  • implement a lockfile so mirror/cron processes cannot run at the same time
  • be forgiving if fetching a year fails: continue with the next year (a later run can fill the gap)
  • use withLastModifiedFilter on a per-year basis, depending on each year's last update. cacheProperties.lastModifiedDate is no longer used for that at all; we now use the lastModified date of the cache file as the date marker instead of a calculated value. This is simpler and ensures the last write time is what we respect.
  • fix a bug with the validation crontab in the Docker image
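
To illustrate the 120-day chunking, here is a minimal Java sketch (the names YearWindows, windowsForYear and PUB_RANGE_MAX_DAYS are illustrative, not the PR's actual identifiers) of splitting a year into publish-date windows the NVD API will accept:

import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.List;

final class YearWindows {
    // NVD API limit for the pubStartDate/pubEndDate range
    private static final int PUB_RANGE_MAX_DAYS = 120;

    static List<ZonedDateTime[]> windowsForYear(int year) {
        ZonedDateTime start = ZonedDateTime.of(year, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC);
        ZonedDateTime yearEnd = start.plusYears(1);
        List<ZonedDateTime[]> windows = new ArrayList<>();
        while (start.isBefore(yearEnd)) {
            ZonedDateTime end = start.plusDays(PUB_RANGE_MAX_DAYS);
            if (end.isAfter(yearEnd)) {
                end = yearEnd;
            }
            windows.add(new ZonedDateTime[] {start, end});
            start = end;
        }
        return windows; // a full year yields 4 windows of at most 120 days each
    }
}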

This PR reduces the memory load by an order of magnitude: we now hold at most 'all entries of one year' in memory. The current process holds 'all entries of all years' and thus uses more and more memory with each passing year.

Stats

The maximum memory usage with the new implementation is about 560 MB, while the old implementation did not finish even with a 7 GB limit.

If someone wants to test it, use the Docker image:

export APIKEY=123
docker run --name vulnz -e NVD_API_KEY=$APIKEY -e DELAY=100 ghcr.io/eugenmayer/vulnz:v7.3.0-9

@EugenMayer EugenMayer changed the title Reimplement caching to use a per-year scope Reimplement caching to use a per-year scope fixes #258 Jan 27, 2025
@DanielRaapDev left a comment


Some code suggestions.

@EugenMayer (Contributor, Author)

@DanielRaapDev thanks for the review. I added the continue feature; sorry, you were quicker with the review - I was not quite finished yet. Hope you do not mind.

@EugenMayer EugenMayer force-pushed the task/efficient-nvd-cache branch 2 times, most recently from 557e324 to 7a42496 on January 27, 2025 20:46
- Add forgiveness if a year fails, continue to the next one
- Reimplement how the lastUpdated date is used and stored per year
- Add lockfile
@EugenMayer EugenMayer force-pushed the task/efficient-nvd-cache branch from 08b12b4 to 2714129 on January 27, 2025 23:51
@EugenMayer (Contributor, Author)

EugenMayer commented Jan 28, 2025

@jeremylong any chance to get feedback from your side? I know this is not a compact PR, but there was no way to keep this approach smaller.

Currently, the mirror builds the files just fine, and re-updating based on lastModified per year works fine too, but running the Gradle task to update fails without any hint. It downloads the JSON archives and then just fails:

> Task :core:dependencyCheckUpdate
Caching disabled for task ':core:dependencyCheckUpdate' because:
  Caching has not been enabled for the task
Task ':core:dependencyCheckUpdate' is not up-to-date because:
  Task has not declared any outputs despite executing actions.
Setting default auxiliaries to "ODC"
setting defaultCompositeCacheAttributes to [ useLateral = true, useRemote = true, useDisk = true, maxObjs = 0, maxSpoolPerRun = -1, diskUsagePattern = UPDATE, spoolChunkSize = 2 ]
setting defaultElementAttributes to [ IS_LATERAL = false, IS_SPOOL = true, IS_REMOTE = false, IS_ETERNAL = false, MaxLifeSeconds = 86400, IdleTime = 1800, CreateTime = 1738046646090, LastAccessTime = 1738046646090, getTimeToLiveSeconds() = 86400, createTime = 1738046646090 ]
initialized MemoryCache for CENTRAL
Constructed cache with name [CENTRAL] and cache attributes [ useLateral = true, useRemote = true, useDisk = true, maxObjs = 0, maxSpoolPerRun = -1, diskUsagePattern = UPDATE, spoolChunkSize = 2 ]
No cache event logger defined for auxiliary [jcs.auxiliary.ODC]
Using standard serializer [org.apache.commons.jcs3.utils.serialization.StandardSerializer@88fa4d] for auxiliary [jcs.auxiliary.ODC]
Region [CENTRAL] : Set maxKeySize to: "1,000,000"
Region [CENTRAL] : Cache file root directory: /home/em/.gradle/dependency-check-data/11.0/cache
Region [CENTRAL] : Indexed Disk Cache is alive.
initialized MemoryCache for POM
Constructed cache with name [POM] and cache attributes [ useLateral = true, useRemote = true, useDisk = true, maxObjs = 0, maxSpoolPerRun = -1, diskUsagePattern = UPDATE, spoolChunkSize = 2 ]
No cache event logger defined for auxiliary [jcs.auxiliary.ODC]
Using standard serializer [org.apache.commons.jcs3.utils.serialization.StandardSerializer@26b3929d] for auxiliary [jcs.auxiliary.ODC]
Region [POM] : Set maxKeySize to: "1,000,000"
Region [POM] : Cache file root directory: /home/em/.gradle/dependency-check-data/11.0/cache
Region [POM] : Indexed Disk Cache is alive.
initialized MemoryCache for NODEAUDIT
Constructed cache with name [NODEAUDIT] and cache attributes [ useLateral = true, useRemote = true, useDisk = true, maxObjs = 0, maxSpoolPerRun = -1, diskUsagePattern = UPDATE, spoolChunkSize = 2 ]
No cache event logger defined for auxiliary [jcs.auxiliary.ODC]
Using standard serializer [org.apache.commons.jcs3.utils.serialization.StandardSerializer@3c718cde] for auxiliary [jcs.auxiliary.ODC]
Region [NODEAUDIT] : Set maxKeySize to: "1,000,000"
Region [NODEAUDIT] : Cache file root directory: /home/em/.gradle/dependency-check-data/11.0/cache
Region [NODEAUDIT] : Indexed Disk Cache is alive.
Parsed regions [CENTRAL, POM, NODEAUDIT]
Finished configuration in 18 ms.
Checking for updates
NVD API Cache requires several updates; this could take a couple of minutes.

Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2008.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2007.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2006.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2005.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2004.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2003.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2025.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2002.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2024.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2023.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2022.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2021.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2020.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2019.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2018.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2017.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2016.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2015.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2014.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-2013.json.gz
Download Started for NVD Cache - https://nvd-mirror.<censored>/nvdcve-modified.json.gz

> Task :core:dependencyCheckUpdate FAILED

[Incubating] Problems report is available at: file:///home/em/Development/projects/dw/core/build/reports/problems/problems-report.html

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':core:dependencyCheckUpdate'.
> temporal

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --debug option to get more log output.
> Get more help at https://help.gradle.org.

Deprecated Gradle features were used in this build, making it incompatible with Gradle 9.0.

You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.

For more on this, please refer to https://docs.gradle.org/8.12/userguide/command_line_interface.html#sec:command_line_warnings in the Gradle documentation.

BUILD FAILED in 38s
1 actionable task: 1 executed

IMHO I assume io.github.jeremylong.vulnz.cli.commands.CveCommand#createRecentlyChangedCacheFile, which creates nvdcve-modified.json.gz and the corresponding .meta file, could somehow be the cause - the way the file was implemented before was fairly magical, and it was rather hard to extract the logic behind it - I might have made a mistake there.

This is what my current cache folder looks like:

-rw-r--r--    1 mirror   mirror        1165 Jan 28 06:38 cache.properties
-rw-r--r--    1 mirror   mirror      769456 Jan 28 06:01 nvdcve-2002.json.gz
-rw-r--r--    1 mirror   mirror         152 Jan 28 06:01 nvdcve-2002.meta
-rw-r--r--    1 mirror   mirror      639892 Jan 28 06:01 nvdcve-2003.json.gz
-rw-r--r--    1 mirror   mirror         152 Jan 28 06:01 nvdcve-2003.meta
-rw-r--r--    1 mirror   mirror     1173114 Jan 28 06:01 nvdcve-2004.json.gz
-rw-r--r--    1 mirror   mirror         152 Jan 28 06:01 nvdcve-2004.meta
-rw-r--r--    1 mirror   mirror     2148085 Jan 28 06:01 nvdcve-2005.json.gz
-rw-r--r--    1 mirror   mirror         154 Jan 28 06:01 nvdcve-2005.meta
-rw-r--r--    1 mirror   mirror     2951160 Jan 28 06:02 nvdcve-2006.json.gz
-rw-r--r--    1 mirror   mirror         153 Jan 28 06:02 nvdcve-2006.meta
-rw-r--r--    1 mirror   mirror     3240103 Jan 28 06:02 nvdcve-2007.json.gz
-rw-r--r--    1 mirror   mirror         154 Jan 28 06:02 nvdcve-2007.meta
-rw-r--r--    1 mirror   mirror     3236516 Jan 28 06:02 nvdcve-2008.json.gz
-rw-r--r--    1 mirror   mirror         154 Jan 28 06:02 nvdcve-2008.meta
-rw-r--r--    1 mirror   mirror     4277943 Jan 28 06:02 nvdcve-2009.json.gz
-rw-r--r--    1 mirror   mirror         154 Jan 28 06:02 nvdcve-2009.meta
-rw-r--r--    1 mirror   mirror     3544284 Jan 28 06:03 nvdcve-2010.json.gz
-rw-r--r--    1 mirror   mirror         154 Jan 28 06:03 nvdcve-2010.meta
-rw-r--r--    1 mirror   mirror     3547111 Jan 28 06:03 nvdcve-2011.json.gz
-rw-r--r--    1 mirror   mirror         154 Jan 28 06:03 nvdcve-2011.meta
-rw-r--r--    1 mirror   mirror     3880228 Jan 28 06:03 nvdcve-2012.json.gz
-rw-r--r--    1 mirror   mirror         154 Jan 28 06:03 nvdcve-2012.meta
-rw-r--r--    1 mirror   mirror     4470300 Jan 28 06:04 nvdcve-2013.json.gz
-rw-r--r--    1 mirror   mirror         143 Jan 28 06:04 nvdcve-2013.meta
-rw-r--r--    1 mirror   mirror     4226698 Jan 28 06:05 nvdcve-2014.json.gz
-rw-r--r--    1 mirror   mirror         154 Jan 28 06:05 nvdcve-2014.meta
-rw-r--r--    1 mirror   mirror     2857702 Jan 28 06:06 nvdcve-2015.json.gz
-rw-r--r--    1 mirror   mirror         143 Jan 28 06:06 nvdcve-2015.meta
-rw-r--r--    1 mirror   mirror     3427774 Jan 28 06:07 nvdcve-2016.json.gz
-rw-r--r--    1 mirror   mirror         143 Jan 28 06:07 nvdcve-2016.meta
-rw-r--r--    1 mirror   mirror     6535173 Jan 28 06:11 nvdcve-2017.json.gz
-rw-r--r--    1 mirror   mirror         143 Jan 28 06:11 nvdcve-2017.meta
-rw-r--r--    1 mirror   mirror     7339653 Jan 28 06:15 nvdcve-2018.json.gz
-rw-r--r--    1 mirror   mirror         143 Jan 28 06:15 nvdcve-2018.meta
-rw-r--r--    1 mirror   mirror     8110987 Jan 28 06:18 nvdcve-2019.json.gz
-rw-r--r--    1 mirror   mirror         141 Jan 28 06:18 nvdcve-2019.meta
-rw-r--r--    1 mirror   mirror     9187846 Jan 28 06:22 nvdcve-2020.json.gz
-rw-r--r--    1 mirror   mirror         143 Jan 28 06:22 nvdcve-2020.meta
-rw-r--r--    1 mirror   mirror    12715910 Jan 28 06:25 nvdcve-2021.json.gz
-rw-r--r--    1 mirror   mirror         145 Jan 28 06:25 nvdcve-2021.meta
-rw-r--r--    1 mirror   mirror    13404740 Jan 28 06:29 nvdcve-2022.json.gz
-rw-r--r--    1 mirror   mirror         145 Jan 28 06:29 nvdcve-2022.meta
-rw-r--r--    1 mirror   mirror    15715163 Jan 28 06:33 nvdcve-2023.json.gz
-rw-r--r--    1 mirror   mirror         145 Jan 28 06:33 nvdcve-2023.meta
-rw-r--r--    1 mirror   mirror    18557076 Jan 28 06:37 nvdcve-2024.json.gz
-rw-r--r--    1 mirror   mirror         144 Jan 28 06:37 nvdcve-2024.meta
-rw-r--r--    1 mirror   mirror     1044726 Jan 28 06:38 nvdcve-2025.json.gz
-rw-r--r--    1 mirror   mirror         141 Jan 28 06:38 nvdcve-2025.meta
-rw-r--r--    1 mirror   mirror     1312431 Jan 28 06:38 nvdcve-modified.json.gz
-rw-r--r--    1 mirror   mirror         154 Jan 28 06:38 nvdcve-modified.meta

with cache.properties looking like this

cat cache.properties 
#Tue Jan 28 06:38:01 UTC 2025
lastModifiedDate.2009=2025-01-28T06\:02\:46Z
prefix=nvdcve-
lastModifiedDate.2002=2025-01-28T06\:01\:01Z
lastModifiedDate.2024=2025-01-27T18\:15\:36Z
lastModifiedDate.2023=2025-01-27T21\:53\:06Z
lastModifiedDate.2004=2025-01-28T06\:01\:31Z
lastModifiedDate.2003=2025-01-28T06\:01\:17Z
lastModifiedDate.2025=2025-01-28T06\:15\:31Z
lastModifiedDate.2006=2025-01-28T06\:02\:00Z
lastModifiedDate.2005=2025-01-28T06\:01\:43Z
lastModifiedDate.2008=2025-01-28T06\:02\:27Z
lastModifiedDate.2007=2025-01-28T06\:02\:12Z
lastModifiedDate.2011=2025-01-28T06\:03\:32Z
lastModifiedDate.2010=2025-01-28T06\:03\:17Z
lastModifiedDate=2025-01-28T06\:15\:31Z
lastModifiedDate.2013=2025-01-14T19\:29\:55Z
lastModifiedDate.2012=2025-01-28T06\:03\:50Z
lastModifiedDate.2015=2025-01-23T17\:42\:32Z
lastModifiedDate.2014=2025-01-28T06\:05\:05Z
lastModifiedDate.2017=2025-01-14T19\:29\:55Z
lastModifiedDate.2016=2025-01-23T17\:55\:06Z
lastModifiedDate.2019=2024-11-21T04\:52\:04Z
lastModifiedDate.2018=2025-01-27T18\:15\:16Z
lastModifiedDate.2020=2025-01-15T21\:15\:09Z
lastModifiedDate.2022=2024-11-27T16\:07\:37Z
lastModifiedDate.2021=2024-12-24T02\:00\:01Z

and nvdcve-modified.meta

cat nvdcve-modified.meta 
lastModifiedDate:2025-01-28T06:38:01.700273731Z[UTC]
size:10098677
gzSize:1312431
sha256:6b30da7a2a2d1c705226c9dc4e3f3c11f0481025e96e200eca223967854ac25f

You can try the cache out yourself using

export NVD_API_KEY=123
docker run --name vulnz -e NVD_API_KEY=$NVD_API_KEY -e DELAY=100 ghcr.io/eugenmayer/vulnz:v7.3.0-9

@jeremylong (Owner)

Unless I'm missing something - the PR will cause issues for the NVD API as it pulls the entire set of data every time it runs? You've removed the bit of code that only pulls the recently modified CVEs.

My thought on fixing the problem is to write the data to disk in batches instead of at the end. It would mean more disk I/O - the same file could be read and written multiple times during the execution - but it would avoid having the entire dataset in memory at one point in time.

@jeremylong (Owner) left a comment


Overall - this will help improve the performance. However, it appears to be putting in requests for updates that aren't required. In fact, if only a small batch of updates needs to be performed we are calling the API once for every year instead of just once to collect less than 1000 entries.

@EugenMayer (Contributor, Author)

Overall - this will help improve the performance. However, it appears to be putting in requests for updates that aren't required. In fact, if only a small batch of updates needs to be performed we are calling the API once for every year instead of just once to collect less than 1000 entries.

Thanks for reviewing!

LastUpdated

I think we might have a misunderstanding here - on either side. Let me explain my idea; please correct it, and hopefully we will be on the same page after a while :)

The old implementation used a centralized 'lastUpdated' date. That was correct for that implementation, since it was not forgiving - either all years succeed or none. This in itself is a huge issue with the NVD: they have so many service outages, connection issues and backend timeouts that it regularly happens that a year or two cannot be fetched. Restarting the entire fetch across all years, which takes about 1-1.5 hours to complete, just asks for the same trouble to reappear, only on different years that time.
So IMHO, skipping a year if it fails and continuing with the others is critical.
Currently, a single issue will cause a complete re-fetch the next time.

That said, we end up having lastUpdated dates per year, no longer for the entire cache. The date is derived from the last write time of the year's cache file (json.gz). Based on this date, each year is individually re-fetched using this last-updated date as a filter (see the sketch after the list below). This is the reason why

  • the centralized filter definition for the build has been removed
  • for every year, we remove the old lastUpdated filters on the API
  • and we set a specific lastUpdated filter for each year

This works as expected in all of my cases.
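
A hedged sketch of that per-year marker (not the PR's exact code; the helper name lastModifiedMarkerForYear is an assumption) - a method like this inside the command class reads the last write time of the year's cache file and uses it as the start of the lastModified filter for that year:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Returns the marker for withLastModifiedFilter, or null when no cache file exists yet.
static ZonedDateTime lastModifiedMarkerForYear(Path yearCacheFile) throws IOException {
    if (!Files.isRegularFile(yearCacheFile)) {
        return null; // no cache yet -> fetch the full year without a lastModified filter
    }
    // the file's last write time is the marker, instead of a calculated value
    return ZonedDateTime.ofInstant(
            Files.getLastModifiedTime(yearCacheFile).toInstant(), ZoneOffset.UTC);
}

// caller, roughly: apiBuilder.withLastModifiedFilter(marker, ZonedDateTime.now(ZoneOffset.UTC));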

Additional requests due to per-year fetching

There are 3 extra requests per year with the new implementation, assuming we already have a proper cache and the intent is to update based on lastUpdated. The reason is that the publish-date filter has a max range of 120 days by API limits, thus we need 4 requests instead of only 1 to fetch all updates for one year. It is ridiculous API design on the NVD side though. The old implementation did not provide any publication filter, thus fetching all 'lastUpdatedSince' entries for all years in one go. This is judged 'not costly'. Doing the same, but asking for 'only those in 2025', is then judged 'too costly, please use a max of 120 days as range'. Well, it is what it is, we can't change that.

So currently we fetch 23 years overall, and with 3 extra requests per year we end up with 69 more requests in comparison. Is this significant? Time-wise, yes it is. Currently the API takes about 15-30 seconds to respond, even for empty responses: e.g. 'get me all updates that are newer than Jan 28 2025, within the first 120 days of 2025' will return no updates, yet the request still takes just as long. This is just their API; it is slow and very fragile.

So in the end, worst-case scenario at 30 s per request, we end up with roughly an extra 30 minutes. This is significant IMHO, but not a huge bummer for a mirror that updates nightly. The mirror just needs to finish the task reliably, and that got improved a lot. I would love to reduce the number of requests here, but the API just does not allow it - and the speed it 'responds' with is ridiculous at best.

That said, a real use case from yesterday:

Yesterday, I wanted to fetch the cache using the old implementation to have reference data to compare with. I was not able to finish this even once over the entire day. Nor could I finish it the day before (which was one motivation to start re-implementing it).

With the new implementation, I re-fetched the entire cache several times that day without major issues. I had to rerun it sometimes, but since it just fills in the missing years, an outage rarely blocked the run entirely.

So to sum it up, it might be worth the sacrifice of time to end up with a more reliable and less resource-hungry process.

As for resource usage, the old approach is purely additive: next year, with an average of 160k entries per year, we might need another 400 MB just for that (or even within this year alone). IMHO I do not think that this 'purely additive' approach is sustainable.


Sorry for the long reply; hopefully this explains the idea/POV and the motivation.

@jeremylong (Owner)

Thanks for the explanation. Let me think about this more. My more naive implementation is #260

@EugenMayer (Contributor, Author)

IMHO, I do not think that the current implementation will hold for too much longer. E.g. I expect the NVD to fix their API, blocking the current 'get any number of vulnerabilities that have been changed within the last 120 days'. That request will sooner or later require some scoping, or the 120-day limitation on pubDate will be removed (since it's nonsense when you have a max page limit anyway).

@EugenMayer (Contributor, Author)

EugenMayer commented Jan 29, 2025

How should we continue here, @jeremylong? I see a couple of open things:

  • decide if we use the new per-year approach and thus the 'per-year lastUpdated' implementation. This will then decide how I should handle your review comments (which, AFAICS, only touch this very field and might become obsolete).
  • I still have the issue that the Gradle task that fetches the index fails, as described in my earlier comment on #259.

What is your current thinking? Did I miss something, or do you have some kind of timeframe in mind for moving on with this? Thanks for your time!

@jeremylong (Owner)

We are likely going with your approach - as you've made a lot of really great points. It will be a few days before I have enough free time to do additional testing.

@EugenMayer (Contributor, Author)

EugenMayer commented Jan 29, 2025

Great and thanks!

I would love to help out with the testing; I was not intending to just throw it over the fence and let you do the grinding. Right now, I'm failing to understand how the Gradle plugin fetches the cache (its 'protocol'), or at least how to make it verbose, so I can understand why it fails.

Another way would be to create the cache files using the old and new clients and then diff them. That is something I am trying to do, but I simply fail to ever complete the cache index build using the old client. As soon as I manage to do that, I will be able to compare the data, validate its correctness, and hopefully find the cause of the Gradle plugin issue.

Thanks for being open-minded and taking part in the process. Very much appreciated.

@jeremylong (Owner) left a comment


Sorry about the delay - I've been away. After another review I've had a couple of thoughts/concerns. Overall I'm happy with the approach.

*/
private Integer processRequest(NvdCveClientBuilder apiBuilder, CacheProperties properties) {
// will hold all entries that have been changed within the last 7 days across all years
List<DefCveItem> recentlyChangedEntries = new ArrayList<>();
@jeremylong (Owner)

If the modified CVE cache file exists - it should be loaded here instead of a new empty array. Otherwise, the cache could end up losing modified entries, breaking the update process for things like ODC. If a year fails to process it will not be included in the recently modified.

Something like this (note that this was written in GH not an IDE and has not been tested):

List<DefCveItem> recentlyChangedEntries = new ArrayList<>();
Path cacheFilePath = buildCacheTargetFileForYear(properties, "modified");
if (Files.isRegularFile(cacheFilePath)) {
    CvesNvdPojo existing = loadExistingCacheAndConfigureApi(apiBuilder, cacheFilePath);
    existing.getVulnerabilities().stream()
                        .filter(cve -> ChronoUnit.DAYS.between(cve.getCve().getLastModified(), ZonedDateTime.now()) <= 7)
                        .forEach(cve -> recentlyChangedEntries.add(cve));
}

@EugenMayer (Contributor, Author)

I'm not really sure I understand.

There is

                CvesNvdPojo existingCacheData = loadExistingCacheAndConfigureApi(apiBuilder, currentYear,
                        cacheFilePath);

which always pre-loads the existing cache, if it exists, before fetching (and overriding) the result with new items from the API.

                CvesNvdPojo existingCacheData = loadExistingCacheAndConfigureApi(apiBuilder, currentYear,
                        cacheFilePath);
                CvesNvdPojo cvesForYear = aggregateCvesForYear(currentYear, apiBuilder);

                // merge old with new data. It is intended to add old items to the new ones, thus we keep newly fetched
                // items with the same ID from the API, not overriding from old cache
                if (existingCacheData != null) {
                    LOG.info("INFO Fetched #{} updated entries for already existing local cache with #{} entries",
                            cvesForYear.vulnerabilities.size(), existingCacheData.vulnerabilities.size());
                    cvesForYear.vulnerabilities.addAll(existingCacheData.vulnerabilities);
                } else {
                    LOG.info("INFO Fetched #{} new entries for year {}", cvesForYear.vulnerabilities.size(),
                            currentYear);
                }

Did I misunderstand your point somehow?

@jeremylong (Owner)

I'm specifically talking about the "modified" bucket - which is not in the 2002-2025 range. If while cycling through the years there is a failure, any modified entries for that year will be lost. We can do slightly better if we preload the "modified" list.

I just realized that this has the potential to increase memory usage - so maybe collect all the modified entries and, at the end, read in the entries on disk before writing them out again.
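
A minimal sketch of that 'merge before write' idea (not code from this PR; the helpers readModifiedCacheFromDisk and writeModifiedCache and the getCve().getId() accessors are assumptions used for illustration; java.util imports omitted) - merge the on-disk modified entries with the ones collected in memory just before writing, so entries from a failed year are not dropped:

// collected in memory while cycling through the years
List<DefCveItem> recentlyChangedEntries = new ArrayList<>();
// ... years are processed here ...

// just before writing nvdcve-modified.json.gz: merge with what is already on disk
Map<String, DefCveItem> merged = new LinkedHashMap<>();
for (DefCveItem item : readModifiedCacheFromDisk(properties)) {   // hypothetical helper
    merged.put(item.getCve().getId(), item);
}
for (DefCveItem item : recentlyChangedEntries) {
    merged.put(item.getCve().getId(), item);                      // freshly fetched wins
}
writeModifiedCache(properties, new ArrayList<>(merged.values())); // hypothetical helper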

@EugenMayer (Contributor, Author)

Thank you for clarifying. Could you explain semantically what this modified bucket is for and how it should be used? I'm asking because this is the only part of the code where I struggled to understand the usage, value and intention.

Maybe understanding it will help me get the right perspective and understand your intention. I currently assume I struggle because I'm missing this background information. Thanks!

@jeremylong (Owner)

from https://nvd.nist.gov/vuln/data-feeds#JSON_FEED

How to Keep Up-to-Date with the NVD Data Using the Traditional Feeds
The vulnerability feeds provide CVE® data organized by the first four digits of a CVE® identifier (except for the 2002 feeds which include vulnerabilities prior to and including "CVE-2002-"). If you are locally mirroring NVD data, either the APIs or the data feeds may be used to stay synchronized. After performing a one-time import of the complete data set using the compressed JSON vulnerability feeds, the "modified" feeds should be used to keep up-to-date.

Each feed is updated nightly, but only if the content of that feed has changed. For example, the 2004 feeds will be updated only if there is an addition or modification to any vulnerability with a starting CVE® identifier of "CVE-2004-". The "recent" feeds are a list of recently published vulnerabilities and the "modified" feeds are a list of recently published and modified vulnerabilities. The "recent" and "modified" feeds only contain vulnerabilities changed within the previous eight days.

So my mistake is that I've been using 7 days; we should bump that to 8 days.

The majority of clients using the data feed generated by vulnz are ODC clients. Once the local database is created it will only request the modified JSON unless you go beyond a 7 day window since the last update.
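
For illustration only (this is not ODC's actual code; the method name is made up), the kind of check a client performs when deciding whether the modified feed is sufficient or a full per-year re-download is needed:

import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

static boolean modifiedFeedIsSufficient(ZonedDateTime lastLocalUpdate) {
    // the NVD "modified" feed only covers the previous eight days, so anything
    // older than that window forces re-downloading the per-year files
    return ChronoUnit.DAYS.between(lastLocalUpdate, ZonedDateTime.now()) <= 7;
}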

}
trap remove_lockfile SIGHUP SIGINT SIGQUIT SIGABRT SIGALRM SIGTERM SIGTSTP

java $JAVA_OPT -jar /usr/local/bin/vulnz cve $DELAY_ARG $DEBUG_ARG $MAX_RETRY_ARG $MAX_RECORDS_PER_PAGE_ARG $CONTINUE_ARG --cache --directory /usr/local/apache2/htdocs
@jeremylong (Owner)

Can we look for an exit code and if it failed for a single year - try kicking it off again automatically?

@EugenMayer (Contributor, Author)

I like the idea generally, but IMHO we should try this in a separate PR - what do you think?

@jeremylong (Owner)

I'm fine with a separate PR. We just need to consider outputting the correct exit code that we can look for here.


Successfully merging this pull request may close these issues.

vulnz mirror - not working reliably - improve memory and performance