Would it be a good idea to use a filesystem-backed persistent cache to minimize API usage? #72

Open · zbalkan opened this issue Mar 11, 2024 · 3 comments
Labels: enhancement (New feature or request)


zbalkan (Contributor) commented Mar 11, 2024

I used this solution in my wtfis-Wazuh integration and it works smoothly.

import json
import os
from typing import Optional

import diskcache

def __query_with_cache(target: str, config: Config, cache_dir: str = './') -> Optional[dict]:

    # Skip lookups for private IP addresses
    if is_private(target=target):
        __debug(f"The target IP is in private range: {target}")
        return None

    # Create the cache directory if it does not exist
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir, 0o700)

    __debug("Opening cache")
    with diskcache.Cache(directory=cache_dir) as cache:

        # Enable stats if not enabled on the first run
        cache.stats(enable=True)
        # Evict expired items first
        cache.expire()

        __debug("Checking cache")
        cache_result: Optional[str] = cache.get(target)  # type: ignore

        if cache_result:
            __debug("Found the value in cache")
            return dict(json.loads(cache_result))

        __debug("Cache miss. Querying APIs...")

        # Initiate resolver
        resolver = Resolver(target, config)

        # Fetch data from the APIs
        resolver.fetch()

        # Get the combined result as a dict
        export = resolver.export()

        if export:
            __debug("Adding the response to cache")
            cache.add(target, json.dumps(export, sort_keys=True))
            return export

        return None

Some context to make the code above understandable: for that integration I had to turn wtfis into a library that outputs JSON results, so the external script can just call the library methods. I first stripped away all the UI-related code, then created a wrapper class called Resolver, which includes the generate_entity_handler method. Then I added a fetch and an export method as the main interface to the library.

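As a rough sketch of that wrapper's shape (the attributes and method bodies here are assumptions; only the Resolver name and the fetch/export interface come from the description above):

from typing import Optional

class Resolver:
    # Hypothetical skeleton of the wrapper described above
    def __init__(self, target: str, config: Config) -> None:
        self.target = target
        self.config = config
        self._results: Optional[dict] = None  # populated by fetch()

    def fetch(self) -> None:
        # Query the configured data sources; internally this wraps
        # generate_entity_handler (details assumed)
        ...

    def export(self) -> Optional[dict]:
        # Return everything fetched as a JSON-serializable dict
        return self._results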

In wtfis you used environment variables stored in the .env.wtfis file. To integrate smoothly, I first created a class called Config that I can pass to Resolver. One can use any method to construct this Config instance. In my case, Wazuh starts the Python script through a bash script and passes arguments along, so I read the arguments, create the Config instance, and pass it to the Resolver along with the target IP or domain name.

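A minimal sketch of that flow, assuming argparse and made-up field names (the real Config mirrors whatever .env.wtfis holds):

import argparse
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Config:
    # Hypothetical fields; the actual ones mirror the .env.wtfis settings
    vt_api_key: str
    shodan_api_key: str = ''

def config_from_args() -> Tuple[Config, str]:
    parser = argparse.ArgumentParser()
    parser.add_argument('target')  # IP address or domain name to look up
    parser.add_argument('--vt-api-key', required=True)
    parser.add_argument('--shodan-api-key', default='')
    args = parser.parse_args()
    return Config(args.vt_api_key, args.shodan_api_key), args.target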

These two methods are the interface of the wtfis library. Everything else was moved under wtfis.internal.

The code above then reads the SQLite-backed cache. I am using the default cache settings, but it is possible to customize the parameters, choose a different eviction strategy, and set a shorter cache lifetime.
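
For example (a sketch using diskcache's documented settings; the specific values are arbitrary), one could pick an LRU eviction policy, cap the cache size, and give entries a 24-hour lifetime:

import diskcache

cache = diskcache.Cache(directory='./cache',
                        eviction_policy='least-recently-used',  # default is least-recently-stored
                        size_limit=2**28)  # cap the cache at 256 MB

# Per-item TTL: cache.expire() later evicts entries older than this
cache.set('example.com', '{"resolutions": []}', expire=60 * 60 * 24)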

The idea is to minimize API usage; it may help in the long term.

pirxthepilot (Owner) commented

Hey @zbalkan, this is awesome! Really cool how you were able to repurpose wtfis! :)

I think this is a good idea, but my concern is mostly the additional overhead in maintaining this feature. Some questions that come to mind:

  • Does this require a major refactor to accommodate the caching feature?
  • Can this be an optional feature (at least for now) that the user needs to explicitly enable? Ideally it would also be modular, so that a bug in caching won't affect wtfis as long as caching is turned off.
  • How much additional work (if any) is needed every time we add a new data source, compared with having no caching feature?
  • How easy or hard is it to write unit tests?

Thanks!

zbalkan (Contributor) commented Mar 22, 2024

  • It will be a major refactor. Not too hard, but more difficult than my current solution: because I export the JSON at the end, I have a single point to read and write. In the existing wtfis code, the UI is involved in many places, so there would have to be multiple points that read from and write to the cache. Still, it's doable.
  • I suggest keeping caching optional, too.
  • Each client would need a method to check the cache. On a miss, the client runs as it does right now, and the response is then written back to the cache. It should not be much work. To make the cache accessible, I am considering a singleton, so that each client has access via a standard interface (see the sketch after this list).
  • Testing is hard but not too complex. Let's see. I can start working on a POC in mid-April.
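
A minimal sketch of the singleton idea, assuming diskcache stays as the backend; get_cache and query_api are illustrative names, not an agreed design:

import json
from functools import lru_cache
from typing import Optional

import diskcache

@lru_cache(maxsize=1)
def get_cache(cache_dir: str = './cache') -> diskcache.Cache:
    # lru_cache turns this into a process-wide singleton: every client
    # calling get_cache() gets the same Cache instance
    return diskcache.Cache(directory=cache_dir)

# Per-client pattern: check the cache first, query the API on a miss,
# then write the response back
def lookup(key: str) -> Optional[dict]:
    cache = get_cache()
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    result = query_api(key)  # hypothetical client-specific API call
    if result is not None:
        cache.set(key, json.dumps(result, sort_keys=True))
    return result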

pirxthepilot (Owner) commented

Thanks @zbalkan!

pirxthepilot added the enhancement label Mar 23, 2024