Repozo incremental recover #403
Conversation
@vpelletier, maybe you wish to review, as you made the original work? @perrinjerome, you're the most recent contributor to repozo, could you review this PR? Thanks,
Can you also add a change log entry?
I only looked at the diff this time; I will try actually running repozo with this patch later.
Hello,

I would like to explain my reasoning about the new behavior in more detail, as a change of default can be surprising.

A good practice with a backup/recovery plan is to check that the backed-up data can be restored on a remote service. That's why we recover the Delta.fs every day: to check that the latest .deltafs increment (which is the only newly backed-up file each day, as the other .deltafs files and the original .fs are already synchronised on the remote site) is valid.

From this observation: when we import the new increment, we already have the recovered Delta.fs from the previous day, so it seems a waste of resources to delete it and rebuild it from scratch. If we could simply recover the new increment onto the existing Delta.fs, then its checksum would be verified, proving its validity once and for all. And we don't need to re-check its validity every day, as data corruption is most likely to happen during the write process or the network copy.

Also, I believe the time saved by not restoring a full Data.fs is welcome, as it decreases the time-to-recovery in case the disaster recovery plan is activated, or simply allows creating backups more often, to decrease the quantity of data lost in a production incident.

Please feel free to ask me more questions.

Regards,
Nicolas
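As a rough illustration of the appending idea described in the comment above (a sketch only, not the actual repozo code; the helper name and arguments are made up), extending an already recovered file with one new increment while computing its checksum on the fly could look like this:

import hashlib


def append_increment(recovered_path, deltafs_path):
    """Hypothetical sketch: extend an already recovered Data.fs / Delta.fs
    by appending a single new .deltafs increment, instead of rebuilding the
    whole file from the full backup.

    Returns the md5 of the appended bytes, so it can be compared with the
    checksum recorded for that increment in the repository's .dat index.
    """
    md5sum = hashlib.md5()
    with open(recovered_path, 'ab') as out, open(deltafs_path, 'rb') as inc:
        while True:
            data = inc.read(1 << 20)  # copy in 1 MiB blocks
            if not data:
                break
            md5sum.update(data)
            out.write(data)
    return md5sum.hexdigest()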
I have added an entry, but I'm not sure about the wording.
Overall I like this a lot. I've two small fixes to suggest.
src/ZODB/scripts/repozo.py (outdated)

    log('Target file smaller than full backup, '
        'falling back to a full recover.')
    return do_full_recover(options, repofiles)
check_startpos = int(previous_chunk[1])
About checking the already-restored file, which is a new concept in this tool (and the cornerstone of this new feature): should it be under control of --with-verify? Should only the last chunk be checked when --quick is also provided?
IMHO repozo should always check the MD5, and only of the last chunk, except when the main action is --verify (which would then be the only case needing a full-output check).
This is the kind of implementation decision I was free to make as long as my implementation was separate, but which becomes harder to decide once both tools are merged.
The way I see it, --with-verify should work the same in the full and incremental implementations, which means it checks the size and md5sum of each .deltafs after having copied it. So in incremental mode, it only verifies the new increments.
I kept the verification of the last chunk in incremental mode, as it's pretty inexpensive, and it follows the spirit of the fallback to a full recover if something is wrong. As incremental mode is the new default, I like that it works in an opportunistic way: if it is sure it can be faster and correct, it works incrementally; if not, it works the same as before.
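To make that behavior concrete, here is a minimal sketch of verifying the newly copied increments against the repository metadata; the helper name and the (filename, start, end, md5) tuple layout are assumptions for illustration, not the actual repozo code.

import hashlib


def verify_new_increments(output_path, new_chunks):
    """Hypothetical sketch: after new .deltafs increments have been appended
    to the recovered file, re-read only the newly written byte ranges and
    compare their size and md5 with what the repository's .dat file records.

    ``new_chunks`` is assumed to be a list of (filename, start, end, md5)
    tuples describing the increments that were just copied.
    """
    with open(output_path, 'rb') as fp:
        for filename, start, end, expected_md5 in new_chunks:
            fp.seek(start)
            data = fp.read(end - start)
            if len(data) != end - start:
                raise RuntimeError('size mismatch for %s' % filename)
            if hashlib.md5(data).hexdigest() != expected_md5:
                raise RuntimeError('md5 mismatch for %s' % filename)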
If I read correctly, when doing a backup, the logic to decide whether an incremental backup is possible depends on --quick:
- with --quick, first compare sizes, and if sizes match, compare checksums for the last increment
- without --quick, compare full checksums

If we don't like the current logic to decide whether an incremental restore is possible (compare sizes, and if sizes match, compare checksums for the last restored increment), I feel it would make sense to base the logic on --quick as well, because this is very "symmetric". That said, the current approach of verifying the checksum of the last increment seems fine (see the sketch after this comment).
During a restore, if --with-verify is passed, with this patch we verify only what is restored (i.e. everything during a full recover, and only the increments during an incremental one). This also seems good to me.
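As a rough sketch of the symmetry discussed above, the restore-side decision based on --quick could look like the following; the helper is hypothetical and assumes (filename, start, end, md5) entries taken from the .dat file, so this is not the actual repozo implementation.

import hashlib
import os


def can_recover_incrementally(output_path, restored_chunks, quick=False):
    """Hypothetical sketch: decide whether the existing recovered file can be
    extended incrementally, mirroring the --quick logic used for backups.

    ``restored_chunks`` is assumed to be the list of (filename, start, end,
    md5) entries already restored, taken from the repository's .dat file.
    """
    expected_size = restored_chunks[-1][2]  # end offset of the last chunk
    if os.path.getsize(output_path) < expected_size:
        return False  # smaller than the full backup: fall back to full recover
    # --quick: only check the last restored increment; default: check them all
    to_check = restored_chunks[-1:] if quick else restored_chunks
    with open(output_path, 'rb') as fp:
        for filename, start, end, expected_md5 in to_check:
            fp.seek(start)
            if hashlib.md5(fp.read(end - start)).hexdigest() != expected_md5:
                return False  # mismatch (e.g. after a pack): full recover
    return True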
As both your reasonings converge, I have implemented @perrinjerome's suggestion about the --quick parameter. The first implementation, where only the last chunk is checked, becomes the behavior with --quick. The default behavior is now to verify all chunks already restored. See df61185
src/ZODB/scripts/repozo.py (outdated)

with open(options.output, 'r+b') as outfp:
    outfp.seek(0, 2)
    initial_length = outfp.tell()
with open(datfile) as fp:
Doesn't reopening the index risk disobeying the find_files logic? Especially the --date option.
I don't understand the question. find_files does not work with the index file; it only lists (and sorts) files in the repository directory, and works with file names only. recover_repofiles (a sub-call of do_incremental_recover) creates a dict from the data in the index file, and does not care about the position in the file.
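For context, a minimal sketch of building such a dict from the repository's .dat file is shown below; the one-line-per-increment format (filename, start offset, end offset, md5) is assumed from the snippets above rather than quoted from repozo, and the helper name is hypothetical.

def read_dat_file(datfile):
    """Hypothetical sketch: map each repository file name to its
    (start, end, md5) entry as recorded in the repository's .dat file.

    Assumes one whitespace-separated line per increment:
        <repofile> <startpos> <endpos> <md5sum>
    """
    chunks = {}
    with open(datfile) as fp:
        for line in fp:
            filename, startpos, endpos, md5sum = line.split()
            chunks[filename] = (int(startpos), int(endpos), md5sum)
    return chunks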
Sorry for the long list of "fixup!" commits, I didn't think it would get that long...

To implement the feedback received in this MR, and to prevent an error on Windows because the same file was opened twice, I had to deeply rework the function. I have added more assertions in acadc7a, as well as a new step where I delete an old, already recovered .deltafs, to prove the correctness of the code and that it doesn't silently fall back to the full-recovery mode. I hope it will help you trust the rewriting.
I tried it and it really seems good.
I have a suggestion about the output during restore. If I have a repository with these files:
backups/2024-10-29-14-16-48.fs
backups/2024-10-29-14-17-29.deltafs
backups/2024-10-29-14-17-44.deltafs
and I have a file recovered up to 2024-10-29-14-17-29.deltafs, then when I run the restore a second time, only 2024-10-29-14-17-44.deltafs should be restored incrementally, but the output is confusing:
$ repozo -v --recover --repository backups/ -o recovered/mydata.fs
looking for files between last full backup and 2024-10-29-14-17-46...
files needed to recover state as of 2024-10-29-14-17-46:
backups/2024-10-29-14-16-48.fs
backups/2024-10-29-14-17-29.deltafs
backups/2024-10-29-14-17-44.deltafs
Recovering (incrementally) file to recovered/mydata.fs
Recovered 181 bytes, md5: 1cc0425a2866a20eed96571e1cdedc71
Restoring index file backups/2024-10-29-14-17-44.index to recovered/mydata.fs.index
"files needed to recover state ..." is slightly incorrect: these are the files needed for a full recover; for an incremental recover, only backups/2024-10-29-14-17-44.deltafs is needed.
What we could do easily is also list the files that will actually be restored, with a patch like this:
diff --git a/src/ZODB/scripts/repozo.py b/src/ZODB/scripts/repozo.py
index 21e38289..b3e86a36 100755
--- a/src/ZODB/scripts/repozo.py
+++ b/src/ZODB/scripts/repozo.py
@@ -801,6 +801,9 @@ def do_incremental_recover(options, repofiles):
     assert first_file_to_restore > 0, (
         first_file_to_restore, options.repository, fn, filename, repofiles)
+    log('remaining files needed to recover incrementally:')
+    for f in repofiles[first_file_to_restore:]:
+        log('\t%s', f)
     temporary_output_file = options.output + '.part'
     os.rename(options.output, temporary_output_file)
     with open(temporary_output_file, 'r+b') as outfp:
the output would become:
$ repozo -v --recover --repository backups/ -o recovered/mydata.fs
looking for files between last full backup and 2024-10-29-14-17-46...
files needed to recover state as of 2024-10-29-14-17-46:
backups/2024-10-29-14-16-48.fs
backups/2024-10-29-14-17-29.deltafs
backups/2024-10-29-14-17-44.deltafs
Recovering (incrementally) file to recovered/mydata.fs
remaining files needed to recover incrementally:
backups/2024-10-29-14-17-44.deltafs
Recovered 181 bytes, md5: 1cc0425a2866a20eed96571e1cdedc71
Restoring index file backups/2024-10-29-14-17-44.index to recovered/mydata.fs.index
We could maybe do better by changing find_files, but this looks good enough.
…on to the implementation of the incremental recover
Which allows recovering a ZODB FileStorage by appending only the missing chunks onto the latest recovered file, instead of always recovering from zero. Based on the work of @vpelletier (incpozo).
The default behavior becomes to check the integrity of the existing recovered Data.fs before incrementally restoring onto it. The --quick option allows verifying only the latest chunk previously recovered into the Data.fs before restoring incrementally. This saves many reads.
As find_files logs all the files needed to recover the state from zero, add a log to show which files will effectively be used in the case of an incremental recovery.
Hello,

Sorry, it's 2 months later, and I just found out about "pending" comments on GitHub, and how to submit them using the hidden button (it seems I'm not the only one to have been confused: https://github.com/orgs/community/discussions/10369).

So a summary is much needed, I believe... I have implemented everyone's recommendations. During this PR, a big change was the inclusion of a --quick flag for the incremental recovery. In my first implementation, the code would only check the latest chunk of the already recovered Data.fs, and then decide whether to go for an incremental recovery (the new default) or a full recovery. Based on @perrinjerome's and @vpelletier's advice, the default is now to check all the known chunks from the .dat file against the already recovered Data.fs, instead of only the latest. The "check only the latest" behavior still exists, behind the --quick flag.

I have squashed all my commits and I believe this PR is ready to merge.

I'm sorry again if I gave the impression of abandoning this PR and ignoring your comments. This is not the case; it is due to my clumsiness with the GitHub interface...

Regards,
I tried the current version of the patch again and it seems fine. @mgedmin, can you take another look, now that what you raised has been addressed?
Like the --backup mode of repozo, add an incremental feature for the recovery of a backup to a Data.fs, which allows appending only the latest backup increment(s) (.deltafs) to a previously recovered Data.fs, instead of trashing it and restarting the recovery process from scratch.
This feature becomes the new default behavior, but the new --full flag allows falling back to the previous behavior.
A few checks are done while trying to recover incrementally (e.g. on the size, or on the latest increment's checksum), and the code automatically falls back to the full-recovery mode if they fail. This would happen, for example, if the production data has been packed after the previous recovery.
The need for such a feature arose from our own production use, where we create delta backups of a file storage every day, send them to a stand-by server, and rebuild the ZODB there (also every day). When the ZODB is bigger than 1 TB, the full recovery can take several hours, whereas the incremental recovery takes only a few minutes (often even less).