I am building out a POC for possibly using gonymizer. It appears that a dump/process/upload run requires three times the disk space: storage for the dump, then for the intermediate partial files, and then for the resultant file as the partials are combined.
Is this correct, and is there any way to avoid using this much storage?
There definitely is, but it would require a major refactor to the application.
The project was built against a smaller (< 100GB) database, so we did not build space constraints into the design of this application. One of our objectives was to anonymize the database only through files (outside the DB) and then make it easy to copy wherever we liked (laptop, staging, etc.).
A common design for anonymization that you will see elsewhere is to anonymize data inside the database where the real data exists, dump only the temporary anonymized tables to a file, and finally remove those temporary tables. This method, however, creates load and can take up 2x the disk space, and it could severely impact database performance on the main host (depending on hardware). It is possible to offload this increase in load and space to a replica instead.
One way I think we could improve disk space usage, at the cost of some CPU, is to add an option to compress all input and output files as they are written to disk and decompress them every time we read from disk. I feel this could be an improvement without having to redesign the application.
How big is the database you are looking to anonymize?