Disk space usage #100

Open
mcg opened this issue Jul 14, 2021 · 1 comment
Comments

@mcg

mcg commented Jul 14, 2021

I'm building out a POC for possibly using gonymizer. It appears that a dump/process/upload run requires three times the disk space: storage for the dump, then for the intermediary partial files, and then for the resultant file as the partials are combined.

Is this correct, and is there any way to avoid using this much storage?

@junkert
Collaborator

junkert commented Sep 21, 2023

There definitely is, but it would require a major refactor to the application.

The project was built against a smaller (< 100GB) database, so we did not build space constraints into the design of this application. One of our objectives was to anonymize the database only through files (outside the DB) and then make it easy to copy them wherever we liked (laptop, staging, etc.).

A common design for anonymization that you will see elsewhere is to anonymize the data inside the database where the real data lives, dump only the temporary anonymized tables to a file, and then drop those temporary tables afterwards. This method, however, creates load and can take up 2x the space on disk, and could severely impact database performance on the main host (depending on hardware). It is possible to offload this extra load and space to a replica instead.

One way I think we could improve disk space usage, at the cost of some CPU, is to add an option to compress all input and output files as they are written to disk and decompress them every time we read from disk. I feel this could be an improvement without having to redesign the application.

How big is the database you are looking to anonymize?
