Criteria for choosing the file to be kept #4

Open
danilobellini opened this issue Jun 2, 2016 · 0 comments

Today rmdupe is hardcoded to use stat -c%Z for sorting files (curage and otherage variables). But that gives us a timestamp that is changed even when a file is just renamed.

AFAIK, touch can't set the "last status change" timestamp to a chosen value, only the access and modification timestamps. Moving, renaming, or rsyncing data (with rsync -av) from one medium to another also changes that status timestamp, so I'd argue it doesn't really show which file is older: a copy can look older than an original that was renamed or moved afterwards.
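To make the rename problem concrete, here is a small sketch (not taken from rmdupe itself) that renames a scratch file and prints both timestamps before and after. It assumes GNU coreutils stat on Linux; the file names and the temporary directory are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch: compare %Y (data modification) and %Z (status change) across a rename.
# Assumes GNU coreutils stat; all paths below are throwaway examples.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

echo data > "$workdir/original"
mtime_before=$(stat -c %Y "$workdir/original")
ctime_before=$(stat -c %Z "$workdir/original")

sleep 2   # %Y and %Z print whole seconds, so wait long enough to see a change
mv "$workdir/original" "$workdir/renamed"

mtime_after=$(stat -c %Y "$workdir/renamed")
ctime_after=$(stat -c %Z "$workdir/renamed")

echo "mtime (%Y): $mtime_before -> $mtime_after"
echo "ctime (%Z): $ctime_before -> $ctime_after"
```

On the Linux filesystems I have tried, the %Y value stays the same while the %Z value moves forward, which is the behaviour this issue is about.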

Here I changed these 2 lines to use stat -c%Y instead. That gets the last data modification timestamp, which I think should be the default, or at least the first thing checked in such a comparison. Copying with cp doesn't copy the data modification timestamp; the copy gets the copying time unless cp's --preserve option is used. So that timestamp tells the original apart from its later copies, and touch -m lets me control things a little bit.
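The cp behaviour is easy to check by hand. The sketch below assumes GNU cp, touch and stat; the date and the file names are arbitrary examples, not anything rmdupe uses.

```shell
#!/usr/bin/env bash
# Sketch: a plain cp gets a fresh mtime, cp --preserve=timestamps keeps it.
# Assumes GNU coreutils; paths and the chosen date are illustrative only.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

echo data > "$workdir/original"
# touch -m changes only the modification time; -d picks an explicit date.
touch -m -d '2001-02-03 04:05:06 UTC' "$workdir/original"

cp "$workdir/original" "$workdir/plain_copy"
cp --preserve=timestamps "$workdir/original" "$workdir/kept_copy"

for name in original plain_copy kept_copy; do
    printf '%-12s %s\n' "$name" "$(stat -c %Y "$workdir/$name")"
done
```

With %Y as the criterion, plain_copy sorts as newer than original, so the original is the one kept by default.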

I think there should be more than one comparison level for finding which file should be kept when duplicates are found. Also, there should be an option to change the criterion in every comparison level. "Older" would be just the default criterion (BTW, that would keep the rmdupe --help description). The comparison order might be different for every level, so I think --old should become --inv, with the description "inverts the ascending/descending order for every comparison criterion (remove oldest duplicates instead of newest when used with the defaults)".

If the "data modification" timestamps are identical, there should be a second level comparison. On such a level, the "status change" timestamp makes some sense, but I still think using other stuff like the file name would be better. The last comparison levels can use weird information like the inode number followed by the device id, just to avoid randomness. In any case, each n-th level comparison would call one of:

  • stat -c%n file name
  • stat -c%u file owner ID
  • stat -c%U file owner name
  • stat -c%g file group ID
  • stat -c%G file group name
  • stat -c%i inode number
  • stat -c%d device number
  • stat -c%W file birth timestamp (is this one useful for anyone?)
  • stat -c%X last access timestamp
  • stat -c%Y last modification timestamp
  • stat -c%Z last status change timestamp
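As a sketch of how one comparison level could fetch its value, the hypothetical helper below (criterion_value is my name for it, not something in rmdupe) maps a single letter from the list above to a stat call and rejects anything else. It assumes GNU coreutils stat, where %W prints 0 when the birth time is unknown.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one criterion value for a file.
#   $1 = one letter from nuUgGidWXYZ (the stat formats listed above)
#   $2 = path to the file
criterion_value() {
    local key=$1 path=$2
    case $key in
        n|u|U|g|G|i|d|W|X|Y|Z)
            stat -c "%$key" -- "$path"
            ;;
        *)
            printf 'unknown criterion: %s\n' "$key" >&2
            return 1
            ;;
    esac
}
```

For example, criterion_value Y some_file prints the same number as stat -c%Y some_file.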

These are the ones that IMHO make sense as a comparison level. There are other criteria that make sense (e.g. the name lengths), but most cases that need a specific comparison scheme don't need anything beyond the criteria above. The only thing still missing is whether the comparison should keep the file with the smallest or the biggest value on each criterion. A solution would be (1) sorting, (2) keeping only the first entry of the sorting result, and (3) allowing a "reversed sorting" for each criterion by using an extra suffix.
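The sort / keep-first / reverse idea can be sketched for a single numeric criterion like this. keep_by_sort is a hypothetical name, the "r" suffix is the reversal marker suggested in this issue, and the sketch assumes GNU stat and sort and paths without tabs or newlines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: among the given paths, print the one to keep.
#   $1 = numeric criterion letter (e.g. Y), optionally followed by "r"
#   remaining arguments = candidate duplicate files
keep_by_sort() {
    local spec=$1
    shift
    local key=${spec:0:1} order=n path
    if [[ ${spec:1} == r ]]; then
        order=nr                      # reversed: biggest value sorts first
    fi
    for path in "$@"; do
        # One "value<TAB>path" line per candidate.
        printf '%s\t%s\n' "$(stat -c "%$key" -- "$path")" "$path"
    done |
        sort -s -t $'\t' -k "1,1$order" |   # (1) sort on the value column
        head -n1 |                           # (2) keep only the first entry
        cut -f2-                             # drop the value, leave the path
}
```

Here keep_by_sort Y a b prints whichever of a and b has the older modification time, and keep_by_sort Yr a b prints the newer one.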

The stat format parameter that changes above is always a single char in nuUgGidWXYZ, and for choosing the criteria order these would be enough. As every criterion could be reversed, there are 2 solutions I thought of: (1) using r as a suffix (luckily, %r and %R aren't stat format parameters), or (2) using a or d to denote ascending or descending order for sorting, telling the user that the first file after sorting is the only one kept. With (2) every criterion would need exactly 2 chars and "d" has 2 different meanings; with (1) a criterion can have 1 or 2 chars. Examples with both ideas:

# Last modification, name, inode, device
rmdupe --sort Ynid [...]
rmdupe --sort Yanaiada [...]

# Last modification, name reversed, inode, device
rmdupe --sort Ynrid [...]
rmdupe --sort Yandiada [...]
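For the "r suffix" form, splitting the parameter into criteria is a short loop. This is only a sketch: parse_sort_spec is a hypothetical function name, and --sort itself is just the option proposed here, not something rmdupe already has.

```shell
#!/usr/bin/env bash
# Hypothetical parser for the "r suffix" form: "Ynrid" -> Y, nr, i, d.
# Prints one criterion per line; fails on letters outside nuUgGidWXYZ.
parse_sort_spec() {
    local spec=$1 i=0 key
    while (( i < ${#spec} )); do
        key=${spec:i:1}
        if [[ $key != [nuUgGidWXYZ] ]]; then
            printf 'bad criterion: %s\n' "$key" >&2
            return 1
        fi
        i=$(( i + 1 ))
        # An "r" right after a criterion letter reverses that criterion.
        if [[ ${spec:i:1} == r ]]; then
            key+=r
            i=$(( i + 1 ))
        fi
        printf '%s\n' "$key"
    done
}
```

The two-char a/d form would be even simpler to split (always two chars per criterion), at the cost of the double meaning of "d".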

I prefer (1) for the parameter. I think that Ynid should be the default criteria. Also, id/iada should always be appended to the parameter no matter what was given (i.e., inode and device numbers as the last criteria, just to avoid randomness). The name --sort is just an idea that makes sense to me, as I'm thinking of it as taking the head -n1 from the sort command result, with sort -r as the reverse and an implicit sort -n for numbers. That said, the comparison on each ASCII criterion (file/owner/group name) would be [[ $name1 < $name2 ]] instead of (( curage < otherage )), so there's no need to actually call sort.
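Putting the levels together without calling sort could look roughly like the sketch below. prefer_first and its argument layout are hypothetical; it assumes GNU stat, takes criteria already split into separate words (each optionally ending in "r"), and always appends i and d as the last levels. String comparison with [[ < ]] follows the current locale.

```shell
#!/usr/bin/env bash
# Hypothetical comparator: returns 0 when file $1 should be kept over file $2.
#   $1, $2 = the two duplicate files
#   remaining arguments = criteria in order, e.g. Y n  or  Y nr
prefer_first() {
    local a=$1 b=$2
    shift 2
    local crit key va vb first_wins
    for crit in "$@" i d; do              # inode and device always come last
        key=${crit:0:1}
        va=$(stat -c "%$key" -- "$a")
        vb=$(stat -c "%$key" -- "$b")
        [[ $va == "$vb" ]] && continue    # tie: go to the next level
        first_wins=1
        if [[ $key == [nUG] ]]; then
            [[ $va < $vb ]] && first_wins=0    # names: string comparison
        else
            (( va < vb )) && first_wins=0      # everything else: numeric
        fi
        if [[ ${crit:1} == r ]]; then
            first_wins=$(( 1 - first_wins ))   # reversed criterion
        fi
        return "$first_wins"
    done
    return 0   # identical on every level (e.g. hard links): keep the first
}
```

With the proposed default, prefer_first file1 file2 Y n succeeds when file1 is the one to keep, so the caller would remove file2.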
