Would it be a good idea to use a filesystem-backed persistent cache to minimize API usage? #72

Open · zbalkan opened this issue Mar 11, 2024 · 3 comments
Labels: enhancement (New feature or request)


zbalkan (Contributor) commented Mar 11, 2024

I used this solution in my wtfis-Wazuh integration and it works smoothly.

import json
import os
from typing import Optional

import diskcache

def __query_with_cache(target: str, config: Config, cache_dir: str = './') -> Optional[dict]:

    # Skip lookups for private IP addresses
    if is_private(target=target):
        __debug(f"The target IP is in private range: {target}")
        return None

    # Create the cache directory if it does not exist
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir, 0o700)

    __debug("Opening cache")
    with diskcache.Cache(directory=cache_dir) as cache:

        # Enable stats if not enabled on the first run
        cache.stats(enable=True)
        # Evict expired items first
        cache.expire()

        __debug("Checking cache")
        cache_result: Optional[str] = cache.get(target)  # type: ignore

        if cache_result:
            __debug("Found the value in cache")
            return dict(json.loads(cache_result))

        __debug("Cache miss. Querying APIs...")

        # Initiate resolver
        resolver = Resolver(target, config)

        # Fetch data from the APIs
        resolver.fetch()

        # Get the combined result as a dict
        export = resolver.export()

        if export:
            __debug("Adding the response to cache")
            cache.add(target, json.dumps(export, sort_keys=True))
            return export

        return None

Some context to make the code above understandable: for that integration I had to turn wtfis into a library that outputs JSON results, so the external script can just call the library methods. I first stripped away all the UI-related code, then created a wrapper class called Resolver, which includes the generate_entity_handler method. Then I added a fetch and an export method as the main interface to the library.

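As a rough sketch of that wrapper's shape (the attributes and method bodies here are assumptions; only the Resolver name and the fetch/export interface come from the description above):

from typing import Optional

class Resolver:
    # Hypothetical skeleton of the wrapper described above
    def __init__(self, target: str, config: Config) -> None:
        self.target = target
        self.config = config
        self._results: Optional[dict] = None  # populated by fetch()

    def fetch(self) -> None:
        # Query the configured data sources; internally this wraps
        # generate_entity_handler (details assumed)
        ...

    def export(self) -> Optional[dict]:
        # Return everything fetched as a JSON-serializable dict
        return self._results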

In wtfis you used environment variables stored in the .env.wtfis file. To integrate smoothly, I first created a class called Config that I can pass to Resolver. One can use any method to construct this Config instance. In my case, Wazuh starts the Python script through a bash script and passes arguments along, so I read the arguments, create the Config instance, and pass it to the Resolver along with the target IP or domain name.

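A minimal sketch of that flow, assuming argparse and made-up field names (the real Config mirrors whatever .env.wtfis holds):

import argparse
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Config:
    # Hypothetical fields; the actual ones mirror the .env.wtfis settings
    vt_api_key: str
    shodan_api_key: str = ''

def config_from_args() -> Tuple[Config, str]:
    parser = argparse.ArgumentParser()
    parser.add_argument('target')  # IP address or domain name to look up
    parser.add_argument('--vt-api-key', required=True)
    parser.add_argument('--shodan-api-key', default='')
    args = parser.parse_args()
    return Config(args.vt_api_key, args.shodan_api_key), args.target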

These two methods are the interface of the wtfis library. Everything else was moved under wtfis.internal.

The code above then reads the SQLite-backed cache. I am using the default cache settings, but it is possible to customize the parameters, choose a different eviction strategy, and set a shorter cache lifetime.
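
For example (a sketch using diskcache's documented settings; the specific values are arbitrary), one could pick an LRU eviction policy, cap the cache size, and give entries a 24-hour lifetime:

import diskcache

cache = diskcache.Cache(directory='./cache',
                        eviction_policy='least-recently-used',  # default is least-recently-stored
                        size_limit=2**28)  # cap the cache at 256 MB

# Per-item TTL: cache.expire() later evicts entries older than this
cache.set('example.com', '{"resolutions": []}', expire=60 * 60 * 24)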

The idea is to minimize API usage; it may help in the long term.

pirxthepilot (Owner) commented

Hey @zbalkan, this is awesome! Really cool how you were able to repurpose wtfis! :)

I think this is a good idea, but my concern is mostly the additional overhead in maintaining this feature. Some questions that come to mind:

  • Does this require a major refactor to accommodate the caching feature?
  • Can this be an optional feature (at least for now) that the user needs to explicitly enable? Ideally it would also be modular, so that a bug in caching won't affect wtfis as long as caching is turned off.
  • How much additional work (if any) is needed every time we add a new data source, compared with having no caching feature?
  • How easy or hard is it to write unit tests?

Thanks!

zbalkan (Contributor) commented Mar 22, 2024

  • It will be a major refactor. Not too hard, but more difficult than my current solution: because I export the JSON at the end, I have a single point to read and write. In the existing wtfis code, the UI is involved in many places, so there would have to be multiple points that read from and write to the cache. Still, it's doable.
  • I suggest keeping caching optional, too.
  • Each client would need a method to check the cache. On a miss, the client runs as it does right now, and the response is then written back to the cache. It should not be much work. To make the cache accessible, I am considering a singleton, so that each client has access via a standard interface (see the sketch after this list).
  • Testing is hard but not too complex. Let's see. I can start working on a POC in mid-April.
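
A minimal sketch of the singleton idea, assuming diskcache stays as the backend; get_cache and query_api are illustrative names, not an agreed design:

import json
from functools import lru_cache
from typing import Optional

import diskcache

@lru_cache(maxsize=1)
def get_cache(cache_dir: str = './cache') -> diskcache.Cache:
    # lru_cache turns this into a process-wide singleton: every client
    # calling get_cache() gets the same Cache instance
    return diskcache.Cache(directory=cache_dir)

# Per-client pattern: check the cache first, query the API on a miss,
# then write the response back
def lookup(key: str) -> Optional[dict]:
    cache = get_cache()
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    result = query_api(key)  # hypothetical client-specific API call
    if result is not None:
        cache.set(key, json.dumps(result, sort_keys=True))
    return result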

pirxthepilot (Owner) commented

Thanks @zbalkan!

pirxthepilot added the enhancement label Mar 23, 2024