Reimplement caching to use a per-year scope fixes #258 #259
base: main
Conversation
Some code suggestions.
@DanielRaapDev thanks for the review. I added the continue feature. Sorry, you were quicker with the review - I was not quite finished yet. Hope you do not mind.
Force-pushed from 557e324 to 7a42496:

- Add forgiveness if a year fails, continue to the next one
- Reimplement how the lastUpdated date is used and stored per year
- Add lockfile

Force-pushed from 08b12b4 to 2714129.
@jeremylong any chance to get feedback from your side? I know this is not a compact PR, but there was no way to keep this approach smaller. Currently, the mirror builds the files just fine, and re-updating based on lastModified per year works fine, but running the gradle task to update fails without any hint. It downloads the json archives and then just fails.
This is how my current cache folder looks like:

```
-rw-r--r-- 1 mirror mirror 1165 Jan 28 06:38 cache.properties
-rw-r--r-- 1 mirror mirror 769456 Jan 28 06:01 nvdcve-2002.json.gz
-rw-r--r-- 1 mirror mirror 152 Jan 28 06:01 nvdcve-2002.meta
-rw-r--r-- 1 mirror mirror 639892 Jan 28 06:01 nvdcve-2003.json.gz
-rw-r--r-- 1 mirror mirror 152 Jan 28 06:01 nvdcve-2003.meta
-rw-r--r-- 1 mirror mirror 1173114 Jan 28 06:01 nvdcve-2004.json.gz
-rw-r--r-- 1 mirror mirror 152 Jan 28 06:01 nvdcve-2004.meta
-rw-r--r-- 1 mirror mirror 2148085 Jan 28 06:01 nvdcve-2005.json.gz
-rw-r--r-- 1 mirror mirror 154 Jan 28 06:01 nvdcve-2005.meta
-rw-r--r-- 1 mirror mirror 2951160 Jan 28 06:02 nvdcve-2006.json.gz
-rw-r--r-- 1 mirror mirror 153 Jan 28 06:02 nvdcve-2006.meta
-rw-r--r-- 1 mirror mirror 3240103 Jan 28 06:02 nvdcve-2007.json.gz
-rw-r--r-- 1 mirror mirror 154 Jan 28 06:02 nvdcve-2007.meta
-rw-r--r-- 1 mirror mirror 3236516 Jan 28 06:02 nvdcve-2008.json.gz
-rw-r--r-- 1 mirror mirror 154 Jan 28 06:02 nvdcve-2008.meta
-rw-r--r-- 1 mirror mirror 4277943 Jan 28 06:02 nvdcve-2009.json.gz
-rw-r--r-- 1 mirror mirror 154 Jan 28 06:02 nvdcve-2009.meta
-rw-r--r-- 1 mirror mirror 3544284 Jan 28 06:03 nvdcve-2010.json.gz
-rw-r--r-- 1 mirror mirror 154 Jan 28 06:03 nvdcve-2010.meta
-rw-r--r-- 1 mirror mirror 3547111 Jan 28 06:03 nvdcve-2011.json.gz
-rw-r--r-- 1 mirror mirror 154 Jan 28 06:03 nvdcve-2011.meta
-rw-r--r-- 1 mirror mirror 3880228 Jan 28 06:03 nvdcve-2012.json.gz
-rw-r--r-- 1 mirror mirror 154 Jan 28 06:03 nvdcve-2012.meta
-rw-r--r-- 1 mirror mirror 4470300 Jan 28 06:04 nvdcve-2013.json.gz
-rw-r--r-- 1 mirror mirror 143 Jan 28 06:04 nvdcve-2013.meta
-rw-r--r-- 1 mirror mirror 4226698 Jan 28 06:05 nvdcve-2014.json.gz
-rw-r--r-- 1 mirror mirror 154 Jan 28 06:05 nvdcve-2014.meta
-rw-r--r-- 1 mirror mirror 2857702 Jan 28 06:06 nvdcve-2015.json.gz
-rw-r--r-- 1 mirror mirror 143 Jan 28 06:06 nvdcve-2015.meta
-rw-r--r-- 1 mirror mirror 3427774 Jan 28 06:07 nvdcve-2016.json.gz
-rw-r--r-- 1 mirror mirror 143 Jan 28 06:07 nvdcve-2016.meta
-rw-r--r-- 1 mirror mirror 6535173 Jan 28 06:11 nvdcve-2017.json.gz
-rw-r--r-- 1 mirror mirror 143 Jan 28 06:11 nvdcve-2017.meta
-rw-r--r-- 1 mirror mirror 7339653 Jan 28 06:15 nvdcve-2018.json.gz
-rw-r--r-- 1 mirror mirror 143 Jan 28 06:15 nvdcve-2018.meta
-rw-r--r-- 1 mirror mirror 8110987 Jan 28 06:18 nvdcve-2019.json.gz
-rw-r--r-- 1 mirror mirror 141 Jan 28 06:18 nvdcve-2019.meta
-rw-r--r-- 1 mirror mirror 9187846 Jan 28 06:22 nvdcve-2020.json.gz
-rw-r--r-- 1 mirror mirror 143 Jan 28 06:22 nvdcve-2020.meta
-rw-r--r-- 1 mirror mirror 12715910 Jan 28 06:25 nvdcve-2021.json.gz
-rw-r--r-- 1 mirror mirror 145 Jan 28 06:25 nvdcve-2021.meta
-rw-r--r-- 1 mirror mirror 13404740 Jan 28 06:29 nvdcve-2022.json.gz
-rw-r--r-- 1 mirror mirror 145 Jan 28 06:29 nvdcve-2022.meta
-rw-r--r-- 1 mirror mirror 15715163 Jan 28 06:33 nvdcve-2023.json.gz
-rw-r--r-- 1 mirror mirror 145 Jan 28 06:33 nvdcve-2023.meta
-rw-r--r-- 1 mirror mirror 18557076 Jan 28 06:37 nvdcve-2024.json.gz
-rw-r--r-- 1 mirror mirror 144 Jan 28 06:37 nvdcve-2024.meta
-rw-r--r-- 1 mirror mirror 1044726 Jan 28 06:38 nvdcve-2025.json.gz
-rw-r--r-- 1 mirror mirror 141 Jan 28 06:38 nvdcve-2025.meta
-rw-r--r-- 1 mirror mirror 1312431 Jan 28 06:38 nvdcve-modified.json.gz
-rw-r--r-- 1 mirror mirror 154 Jan 28 06:38 nvdcve-modified.meta
```

with cache.properties looking like this:

You can try the cache out yourself using:

```bash
export NVD_API_KEY=123
docker run --name vulnz -e NVD_API_KEY=$NVD_API_KEY -e DELAY=100 ghcr.io/eugenmayer/vulnz:v7.3.0-9
```
Unless I'm missing something, this PR will cause issues for the NVD API, as it pulls the entire dataset every time it runs? You've removed the bit of code that only pulls the recently modified CVEs. My thought on fixing the problem is to write the data to disk in batches instead of at the end. It will be more disk I/O - as the same file could be read and written multiple times during the execution - but it would save us from having the entire dataset in memory at one point in time.
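A rough sketch of that batched-write idea (illustration only, not an actual implementation; it assumes the client is paged through with hasNext()/next(), and mergeBatchIntoCacheFiles is a made-up helper):

```java
// Sketch: flush results to disk in batches while paging through the API,
// instead of holding the entire dataset in memory until the very end.
try (NvdCveClient client = apiBuilder.build()) {
    List<DefCveItem> batch = new ArrayList<>();
    while (client.hasNext()) {
        batch.addAll(client.next());
        if (batch.size() >= 10_000) {                        // arbitrary batch size
            mergeBatchIntoCacheFiles(properties, batch);     // made-up helper: read, merge and rewrite the affected year files
            batch.clear();                                   // release memory before fetching further pages
        }
    }
    mergeBatchIntoCacheFiles(properties, batch);             // flush the remainder
}
```

The same year file may be read and rewritten several times per run, which is exactly the extra disk I/O mentioned above.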
Overall - this will help improve the performance. However, it appears to be issuing requests for updates that aren't required. In fact, if only a small batch of updates needs to be performed, we are calling the API once for every year instead of just once to collect fewer than 1000 entries.
Thanks for reviewing!

LastUpdated

I think we might have a misunderstanding here - on either side. Let me explain my idea; please correct it and hopefully we will be on the same page after a while :)

The old implementation used a centralized lastUpdated date. That was correct for that implementation, since it was not forgiving - either all years succeed or none. This itself is a huge issue with NVD. They have so many service outages, connection issues and backend timeouts that it regularly happens that a year or two cannot be fetched. Restarting the entire fetch of all years, which takes about 1-1.5 hours to complete, just asks for the same trouble to reappear, only on different years that time.

That said, we end up having lastUpdated dates per year, no longer for the entire cache. The date is derived from the last write time of the year's cache file (json.gz), so effectively the file's write timestamp. Based on this date, the year is individually re-fetched using this last-updated date as a filter. This is the reason why
This works as expected in all of my cases.

Additional requests due to per-year fetching

There are 3 extra requests per year with the new implementation, assuming a proper cache already exists and the intent is to update based on lastUpdated. The reason is that the publishDate filter has a max range of 120 days by API limits, so we need to do 4 requests instead of only 1 to fetch all updates for one year (see the sketch at the end of this comment). It is ridiculous API design on NVD's side though. The old implementation did not provide any publication filter, thus fetching all 'lastUpdatedSince' entries for all years in one go. That is judged 'not costly'. Doing the same, but only for those in 2025, is then judged 'too costly, please use a max of 120 days as range'. Well, it is what it is, we can't change that.

So currently we fetch 23 years overall, which means 69 more requests in comparison. Is this significant? Time-wise, yes it is. Currently the API takes about 15-30 seconds to respond, even for empty responses; e.g. 'Get me all updates that are newer than Jan 28 2025, within the first 120 days of 2025' will return no updates, yet the request takes just as long. This is just their API; it is slow and very fragile. So in the end, worst-case scenario at 30s per request, we end up with an extra 30 minutes. This is significant IMHO, but not a huge bummer for a mirror that updates nightly. The mirror just needs to finish the task reliably, and that got improved a lot. I would love to reduce the number of requests here, but the API just does not allow it - and the speed it 'responds' with is ridiculous at best.

That said, a real use case from yesterday: I wanted to fetch the cache using the old implementation to have reference data to compare with. I was not able to finish this once over the entire day. Neither could I finish it the day before (which was one motivation to start re-implementing it). With the new implementation, I re-fetched the entire cache several times that day without bigger issues. I had to rerun it sometimes, but since it just fills up the years, an outage rarely blocked the run entirely.

So to sum it up, it might be worth the sacrifice of time to end up with a more reliable and less resource-hungry process. Regarding the resources: the existing approach is purely additive, so next year, with an average of 160k entries per year, we might need another 400MB just for that (or even within this year alone). IMHO I do not think that this 'purely additive' approach is sustainable.

Sorry for the long reply; hopefully this explains the idea/POV and motivation.
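To illustrate where the 4 requests per year come from, here is a minimal sketch of the window splitting (not the PR's actual code; it assumes the builder exposes withPublishedDateFilter(start, end) alongside the withLastModifiedFilter(start, end) used elsewhere):

```java
// Sketch: the pubDate filter only allows a 120-day range, so a full year has to be
// covered by ceil(365 / 120) = 4 windows, i.e. 4 requests instead of 1.
static void fetchUpdatesForYear(NvdCveClientBuilder apiBuilder, int year, ZonedDateTime lastWrite) {
    ZonedDateTime windowStart = ZonedDateTime.of(year, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC);
    ZonedDateTime endOfYear = windowStart.plusYears(1);
    while (windowStart.isBefore(endOfYear)) {
        ZonedDateTime windowEnd = windowStart.plusDays(120);
        if (windowEnd.isAfter(endOfYear)) {
            windowEnd = endOfYear;
        }
        apiBuilder.withPublishedDateFilter(windowStart, windowEnd);                       // assumed builder method
        apiBuilder.withLastModifiedFilter(lastWrite, ZonedDateTime.now(ZoneOffset.UTC));  // only changed entries
        // ... build the client and page through the (often empty) results for this window ...
        windowStart = windowEnd;
    }
}
```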
Thanks for the explanation. Let me think about this more. My more naive implementation is #260
IMHO, I do not think that the current implementation will hold for much longer. E.g. I expect NVD to fix their API, blocking the current 'get any number of vulns that have been changed since the given date' kind of request. That request will sooner or later require some scoping, or the 120-day limitation on pubDate will be removed (since it's nonsense when you have a max page limit anyway).
How should we continue here @jeremylong? I see a couple of open things.
What is your current idea? Did I miss something, or have you settled on some kind of time frame for going on with this? Thanks for your time!
We are likely going with your approach - as you've made a lot of really great points. It will be a few days before I have enough free time to do additional testing.
Great, and thanks! I would love to help out with the testing; I was not intending to just throw it over the fence and let you do the grinding. Right now, I'm failing to understand the gradle plugin's 'protocol' for fetching the cache, or at least how to make it verbose so I understand why it fails. Another way would be to create the cache files using the old and new clients and then diff them. That is something I am trying to do, but I simply fail to ever complete the cache index build using the old client. As soon as I manage that, I will be able to compare the data, validate its correctness, and hopefully find the cause of the gradle plugin issue. Thanks for being open-minded and taking part in the process. Very appreciated!
Sorry about the delay - I've been away. After another review I've had a couple of thoughts/concerns. Overall I'm happy with the approach.
```java
 */
private Integer processRequest(NvdCveClientBuilder apiBuilder, CacheProperties properties) {
    // will hold all entries that have been changed within the last 7 days across all years
    List<DefCveItem> recentlyChangedEntries = new ArrayList<>();
```
If the modified CVE cache file exists - it should be loaded here instead of a new empty array. Otherwise, the cache could end up losing modified entries, breaking the update process for things like ODC. If a year fails to process it will not be included in the recently modified.
Something like this (note that this was written in GH not an IDE and has not been tested):
```java
List<DefCveItem> recentlyChangedEntries = new ArrayList<>();
Path cacheFilePath = buildCacheTargetFileForYear(properties, "modified");
if (Files.isRegularFile(cacheFilePath)) {
    CvesNvdPojo existing = loadExistingCacheAndConfigureApi(apiBuilder, cacheFilePath);
    // pre-load the on-disk "modified" bucket, keeping only entries changed within the last 7 days
    existing.getVulnerabilities().stream()
            .filter(cve -> ChronoUnit.DAYS.between(cve.getCve().getLastModified(), ZonedDateTime.now()) <= 7)
            .forEach(cve -> recentlyChangedEntries.add(cve));
}
```
I'm not really sure I understand.
There is
```java
CvesNvdPojo existingCacheData = loadExistingCacheAndConfigureApi(apiBuilder, currentYear,
        cacheFilePath);
```
which always pre-loads the existing cache, if it exists, before fetching (and overriding) the result with new items from the API.
```java
CvesNvdPojo existingCacheData = loadExistingCacheAndConfigureApi(apiBuilder, currentYear,
        cacheFilePath);
CvesNvdPojo cvesForYear = aggregateCvesForYear(currentYear, apiBuilder);
// merge old with new data. It is intended to add old items to the new ones, thus we keep newly fetched
// items with the same ID from the API, not overriding from old cache
if (existingCacheData != null) {
    LOG.info("INFO Fetched #{} updated entries for already existing local cache with #{} entries",
            cvesForYear.vulnerabilities.size(), existingCacheData.vulnerabilities.size());
    cvesForYear.vulnerabilities.addAll(existingCacheData.vulnerabilities);
} else {
    LOG.info("INFO Fetched #{} new entries for year {}", cvesForYear.vulnerabilities.size(),
            currentYear);
}
```
Did I misunderstand your point somehow?
I'm specifically talking about the "modified" bucket - which is not in the 2002-2025 range. If while cycling through the years there is a failure, any modified entries for that year will be lost. We can do slightly better if we preload the "modified" list.
I just realized that this has the potential to cause an increase in memory usage - so maybe collect all the modified entries and, at the end, read in the entries on disk before writing them out again.
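Roughly like this, I suppose (untested sketch, not actual PR code; loadModifiedCache/writeModifiedCache and the exact POJO accessors are assumptions):

```java
// Sketch: collect the modified entries in memory while cycling through the years,
// then merge them with the on-disk "modified" bucket once at the very end, so a
// failed year does not drop entries that are already in the cache.
Map<String, DefCveItem> merged = new LinkedHashMap<>();
for (DefCveItem item : loadModifiedCache(properties)) {      // assumed helper: read the existing modified bucket
    merged.put(item.getCve().getId(), item);
}
for (DefCveItem item : recentlyChangedEntries) {             // freshly fetched entries win over cached ones
    merged.put(item.getCve().getId(), item);
}
// keep only the window the "modified" bucket is supposed to cover (7 days in the current code)
ZonedDateTime cutoff = ZonedDateTime.now(ZoneOffset.UTC).minusDays(7);
merged.values().removeIf(item -> item.getCve().getLastModified().isBefore(cutoff));
writeModifiedCache(properties, new ArrayList<>(merged.values()));   // assumed helper
```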
Thank you for clarifying. Could you explain, semantically, what this modified bucket is for and how it should be used? I'm asking because this is the only part of the code where I struggled to understand the usage, value and intention.
Maybe understanding it will help me get the right perspective and understand your intention. I currently assume I struggle because I am missing this background information. Thanks!
from https://nvd.nist.gov/vuln/data-feeds#JSON_FEED
How to Keep Up-to-Date with the NVD Data Using the Traditional Feeds
The vulnerability feeds provide CVE® data organized by the first four digits of a CVE® identifier (except for the 2002 feeds which include vulnerabilities prior to and including "CVE-2002-"). If you are locally mirroring NVD data, either the APIs or the data feeds may be used to stay synchronized. After performing a one-time import of the complete data set using the compressed JSON vulnerability feeds, the "modified" feeds should be used to keep up-to-date. Each feed is updated nightly, but only if the content of that feed has changed. For example, the 2004 feeds will be updated only if there is an addition or modification to any vulnerability with a starting CVE® identifier of "CVE-2004-". The "recent" feeds are a list of recently published vulnerabilities and the "modified" feeds are a list of recently published and modified vulnerabilities. The "recent" and "modified" feeds only contain vulnerabilities changed within the previous eight days.
So my mistake is that I've been using 7 days; we should bump that to 8 days.
The majority of clients using the data feed generated by vulnz are ODC clients. Once the local database is created it will only request the modified JSON unless you go beyond a 7 day window since the last update.
```bash
}
trap remove_lockfile SIGHUP SIGINT SIGQUIT SIGABRT SIGALRM SIGTERM SIGTSTP

java $JAVA_OPT -jar /usr/local/bin/vulnz cve $DELAY_ARG $DEBUG_ARG $MAX_RETRY_ARG $MAX_RECORDS_PER_PAGE_ARG $CONTINUE_ARG --cache --directory /usr/local/apache2/htdocs
```
Can we look for an exit code and, if it failed for a single year, try kicking it off again automatically?
I like the idea generally, but IMHO we should try this in a separate PR - what do you think?
I'm fine with a separate PR. We just need to consider outputting the correct exit code that we can look for here.
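For the follow-up PR, something along these lines could work in CveCommand (a sketch under assumptions - picocli maps a returned Integer to the process exit code, but processAllYears, continueOnError and the concrete codes are made up here):

```java
// Sketch: distinguish "everything updated" from "some years failed but processing continued",
// so the entrypoint script can decide whether to kick the run off again.
@Override
public Integer call() {
    int failedYears = processAllYears();   // made-up helper: returns how many years could not be fetched
    if (failedYears == 0) {
        return 0;  // full success
    }
    if (continueOnError) {
        return 2;  // partial success: cache written, but some years still need a retry
    }
    return 1;      // hard failure
}
```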
Reimplement NVD caching to reduce the memory footprint and improve performance - fixes #258
withLastModifiedFilter is now applied on a per-year basis, depending on that year's last update. cacheProperties.lastModifiedDate is no longer used for that at all. We now use the lastModified date of the cache file as the date marker instead of a calculated value. Simpler, and it ensures the last write time is what we respect.

This PR reduces the memory load by a magnitude: we hold at most 'all entries of a year' in memory. The current process holds 'all entries of all years' and thus it will use more and more memory with each passing year.
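The core loop boils down to roughly this (a simplified sketch, not the literal PR code; buildCacheTargetFileForYear and writeCacheForYear are assumed helper names):

```java
// Sketch: process one year at a time so that at most one year's entries live in memory,
// and derive the per-year lastModified filter from the cache file's own timestamp.
for (int year = 2002; year <= ZonedDateTime.now(ZoneOffset.UTC).getYear(); year++) {
    try {
        Path yearFile = buildCacheTargetFileForYear(properties, String.valueOf(year)); // assumed helper
        if (Files.isRegularFile(yearFile)) {
            ZonedDateTime lastWrite = ZonedDateTime
                    .ofInstant(Files.getLastModifiedTime(yearFile).toInstant(), ZoneOffset.UTC);
            apiBuilder.withLastModifiedFilter(lastWrite, ZonedDateTime.now(ZoneOffset.UTC));
        }
        CvesNvdPojo cvesForYear = aggregateCvesForYear(year, apiBuilder);  // fetch just this year
        writeCacheForYear(properties, year, cvesForYear);                  // assumed helper: writes nvdcve-<year>.json.gz + .meta
        // cvesForYear goes out of scope here, so memory is reclaimed before the next year starts
    } catch (Exception e) {
        LOG.warn("Year {} failed - continuing with the next one", year, e);
    }
}
```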
Stats

The current max-memory usage with the new implementation is about 560MB, while the old implementation did not finish with a limit of 7GB.

If someone wants to test it, use the docker image