
fix: fix single object cases for sync #740

Merged
merged 5 commits into peak:master from tarikozyurtt:sync-object-fail-tests on Jul 12, 2024

Conversation

tarikozyurtt
Contributor

@tarikozyurtt commented on Jul 10, 2024

The problem is mainly caused by the `compareObjects` function inside `command/sync.go`, where `s5cmd` compares the source and destination paths and extracts the files present only in the source or only in the destination (treating an object's name together with its prefixes, so nested folders are accounted for), along with the common objects. If both source and destination are non-objects, such as a wildcard expression, a prefix, or a bucket, the relative paths obtained with `src.URL.Relative()` are compatible and comparable. In that case no problem arises, at least within the scope of this issue.

However, when a single object is selected as the source, it is never assigned a relative path via `func (u *url.URL) SetRelative(base *url.URL)`, so `src.URL.Relative()` returns its absolute path.

Say the source file has an absolute path of `folder/foo.txt`. The algorithm compares `folder/foo.txt` against the contents of `s3://bucket/remoteFolder/`, effectively looking for an item named `s3://bucket/remoteFolder/folder/foo.txt`. Unless the searched path happens to contain such a duplicate item, the files never match.
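
To make the mismatch concrete, here is a minimal, self-contained Go sketch of a `compareObjects`-style diff keyed by relative paths. This is not s5cmd's actual code; the `onlySource` helper and the literal keys are illustrative assumptions.

```go
package main

import "fmt"

// onlySource returns the keys present in the source listing but not in
// the destination listing; objects are matched purely by key equality.
func onlySource(srcKeys, dstKeys []string) []string {
	dst := make(map[string]struct{}, len(dstKeys))
	for _, k := range dstKeys {
		dst[k] = struct{}{}
	}
	var missing []string
	for _, k := range srcKeys {
		if _, ok := dst[k]; !ok {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// Single-object source: SetRelative was never called, so the key is
	// effectively the absolute path "folder/foo.txt".
	srcKeys := []string{"folder/foo.txt"}
	// The destination prefix s3://bucket/remoteFolder/ already holds the
	// file from a previous run, keyed relative to that prefix.
	dstKeys := []string{"foo.txt"}

	// "folder/foo.txt" != "foo.txt", so the object is reported as
	// source-only and re-uploaded on every sync.
	fmt.Println(onlySource(srcKeys, dstKeys)) // [folder/foo.txt]
}
```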

While copying files, `s5cmd` does not use relative paths, so `foo.txt` is written to the intended path in the remote. However, because the comparison never finds a match, the object is copied again on every sync operation.

The problem is solved by using the source object's name as its path. With this change, the algorithm simply looks for a matching file named `foo.txt` in the destination, as intended.
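
A hedged sketch of the idea behind the fix follows; the `syncKey` helper and the `singleObject` flag are hypothetical names, not the exact change made in `command/sync.go`. When the source is a single object rather than a wildcard, prefix, or bucket, it is compared by its name instead of its full path.

```go
package main

import (
	"fmt"
	"path"
)

// syncKey picks the key used when diffing source against destination.
// For a single-object source, the object's base name is used, because
// that is also the name it is written under at the destination.
// (path.Base is used here since the example keys are slash-separated.)
func syncKey(srcPath string, singleObject bool) string {
	if singleObject {
		return path.Base(srcPath) // "folder/foo.txt" -> "foo.txt"
	}
	// Wildcard, prefix, or bucket sources already yield proper relative
	// paths via SetRelative, so they are left untouched.
	return srcPath
}

func main() {
	fmt.Println(syncKey("folder/foo.txt", true))  // foo.txt
	fmt.Println(syncKey("folder/foo.txt", false)) // folder/foo.txt
}
```

With a name-based key, the source's `folder/foo.txt` and the already-synced `foo.txt` under the destination prefix compare equal, so the object is skipped on subsequent runs.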

This PR also adds new test cases for the sync command. Previously, the tests did not cover the case where the source is a single object under a prefix, as opposed to an object directly in a bucket or to multiple objects matched by a wildcard or prefix expression.

If an object sits directly under `s3://bucket/`, its relative path is the same as its absolute path, so the comparison matches and the file is not re-copied; this is why the existing tests did not catch the bug. The new test cases cover all of these scenarios.
Resolves: #676.

@tarikozyurtt requested a review from a team as a code owner on July 10, 2024 at 15:31
@tarikozyurtt requested review from ilkinulas and seruman and removed the request for a team on July 10, 2024 at 15:31
@tarikozyurtt changed the title from "fix: fix nested directory cases for object sync" to "fix: fix single object cases for object sync" on Jul 11, 2024
@tarikozyurtt changed the title from "fix: fix single object cases for object sync" to "fix: fix single object cases for sync" on Jul 11, 2024
Member

@seruman left a comment


thanks for the fix.

all my reviews are on the style side.

e2e/sync_test.go (outdated, resolved)
e2e/sync_test.go (outdated, resolved)
e2e/sync_test.go (outdated, resolved)
e2e/sync_test.go (outdated, resolved)
e2e/sync_test.go (outdated, resolved)
command/sync.go (outdated, resolved)
@tarikozyurtt
Contributor Author

Fixed styles and naming.

@ilkinulas merged commit 787bc88 into peak:master on Jul 12, 2024
13 checks passed
tarikozyurtt added a commit to tarikozyurtt/s5cmd that referenced this pull request Jul 12, 2024
@tarikozyurtt deleted the sync-object-fail-tests branch on July 26, 2024 at 19:54
Development

Successfully merging this pull request may close these issues.

sync command copies object everytime
3 participants