-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add file difference check function #202
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR!
Unfortunately, ETag on S3 is not always MD5 of the file contents.
ETag can be used to check the resource is unchanged, but can't be used to check two resources have same contents, in general.
https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
ETag
The entity tag represents a specific version of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below:
- Objects created through the AWS Management Console or by the PUT Object, POST Object, or Copy operation:
- Objects encrypted by SSE-S3 or plaintext have ETags that are an MD5 digest of their data.
- Objects encrypted by SSE-C or SSE-KMS have ETags that are not an MD5 digest of their object data.
- Objects created by either the Multipart Upload or Part Copy operation have ETags that are not MD5 digests, regardless of the method of encryption.
Ref: definitions of ETag: https://datatracker.ietf.org/doc/html/rfc7232#section-2.3
@at-wat If you have any good ideas, I would appreciate it if you could reply to your comments 🙏 |
Sorry for the very late commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Gompei Sorry for the huge delay!
The updated algorithm seems check file size and timestamp and I think it's basically same as what the current implementation does at:
Lines 544 to 564 in baf6e27
for sourceInfo := range sourceFileChan { | |
destInfo, ok := destFiles[sourceInfo.name] | |
// source is necessary to sync if | |
// 1. The dest doesn't exist | |
// 2. The dest doesn't have the same size as the source | |
// 3. The dest is older than the source | |
if !ok || sourceInfo.size != destInfo.size || sourceInfo.lastModified.After(destInfo.lastModified) { | |
c <- &fileOp{fileInfo: sourceInfo} | |
} | |
if ok { | |
destInfo.existsInSource = true | |
} | |
} | |
if del { | |
for _, destInfo := range destFiles { | |
if !destInfo.existsInSource { | |
// The source doesn't exist | |
c <- &fileOp{fileInfo: destInfo, op: opDelete} | |
} | |
} | |
} |
Is there differences from the current implementation?
Hello:wave:
This PR implements a function to check the difference between source and destination files based on the files present in the source.
With this function, the following process is possible.