Disk space usage #100

mcg · 2021-07-14T13:54:05Z

I am building out a POC for possibly using gonymizer. It appears that a dump/process/upload run requires three times the disk space: storage for the dump, then for the intermediate partial files, and then for the resulting file as the partials are combined.

Is this correct, and is there any way to avoid using this much storage?

junkert · 2023-09-21T23:16:50Z

There definitely is, but it would require a major refactor to the application.

The project was built against a smaller (< 100GB) database, so we did not design the application around disk space constraints. One of our objectives was to anonymize the database only through files (outside the DB) and then make it easy to copy the result wherever we liked (laptop, staging, etc.).

A common design for anonymization that you will see elsewhere is to anonymize the data inside the database where the real data lives, dump only the temporary anonymized tables to a file, and then drop those temporary tables. This method, however, creates load on the main host and can take up 2x the space on its disk, which could severely impact database performance (depending on hardware). It is possible to offload this extra load and space to a replica instead.

One way I think we could improve disk space usage, at the cost of some CPU, is to add an option to compress all input and output files as they are written to disk and to decompress them every time we read from disk. I feel this could be an improvement without having to redesign the application.

How big is the database you are looking to anonymize?
