Criteria for choosing the file to be kept #4

danilobellini · 2016-06-02T01:06:55Z

Today rmdupe is hardcoded to use stat -c%Z for sorting files (the curage and otherage variables). But that timestamp changes even when a file is just renamed.

AFAIK, touch can't set the "last status change" timestamp to a chosen value, only the access and modification timestamps. But moving, renaming or rsyncing data (with rsync -av) from one medium to another also changes that status timestamp, so I'd argue that it doesn't really show which file is older. A copy could end up looking older than a renamed/moved original.
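A small way to check this locally, as a sketch only: it assumes GNU stat, and the temporary directory and file names are made up for the demo. The modification timestamp (%Y) should survive the rename; whether the status change timestamp (%Z) moves depends on the filesystem, though on common Linux filesystems it is expected to.

```shell
#!/usr/bin/env bash
# Sketch: compare %Y (data modification) and %Z (status change) around a rename.
# Assumes GNU stat; demo_dir and the file names are hypothetical.
demo_dir=$(mktemp -d)
printf 'some data\n' > "$demo_dir/original.txt"

mtime_before=$(stat -c%Y "$demo_dir/original.txt")
ctime_before=$(stat -c%Z "$demo_dir/original.txt")

sleep 1   # let the clock move past the one-second resolution of %Y / %Z
mv "$demo_dir/original.txt" "$demo_dir/renamed.txt"

mtime_after=$(stat -c%Y "$demo_dir/renamed.txt")
ctime_after=$(stat -c%Z "$demo_dir/renamed.txt")

echo "mtime (%Y): $mtime_before -> $mtime_after"
echo "ctime (%Z): $ctime_before -> $ctime_after"

rm -rf "$demo_dir"
```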

Here I changed these 2 lines to use stat -c%Y instead. That gets the last data modification timestamp, which I think should be the default, or at least the first thing to be looked at in such a comparison. Copying with cp doesn't copy the data modification timestamp; it uses the copying time unless its --preserve option is used. That timestamp shows which files are really the duplicate copies, and a touch -m allows me to control things a little bit.
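To illustrate the cp behaviour described above, here is a minimal sketch assuming GNU coreutils (cp --preserve=timestamps, touch -d with an @epoch value). The file names and the epoch value 1000000000 are arbitrary choices for the demo.

```shell
#!/usr/bin/env bash
# Sketch: how cp treats the data modification timestamp (%Y).
# Assumes GNU coreutils; demo_dir and the file names are hypothetical.
demo_dir=$(mktemp -d)
printf 'payload\n' > "$demo_dir/original"

# Backdate only the modification time (touch -m) to an arbitrary epoch value.
touch -m -d '@1000000000' "$demo_dir/original"

cp "$demo_dir/original" "$demo_dir/plain_copy"                           # gets the copying time
cp --preserve=timestamps "$demo_dir/original" "$demo_dir/preserved_copy" # keeps the original mtime

orig_mtime=$(stat -c%Y "$demo_dir/original")
plain_mtime=$(stat -c%Y "$demo_dir/plain_copy")
preserved_mtime=$(stat -c%Y "$demo_dir/preserved_copy")

echo "original:       $orig_mtime"
echo "plain copy:     $plain_mtime"
echo "preserved copy: $preserved_mtime"

rm -rf "$demo_dir"
```

With %Y as the criterion, the plain copy sorts as newer than the original, while a preserved copy ties with it and needs a further comparison level.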

I think there should be more than one comparison level for finding which file should be kept when duplicates are found. Also, there should be an option to change the criterion at every comparison level. "Older" would be just the default criterion (BTW, that would keep the rmdupe --help description). The comparison order might be different for every level, so I think that --old should become --inv with the description "inverts the ascending/descending order for every comparison criterion (remove oldest duplicates instead of newest when used with the defaults)".

If the "data modification" timestamps are identical, there should be a second level comparison. On such a level, the "status change" timestamp makes some sense, but I still think using other stuff like the file name would be better. The last comparison levels can use weird information like the inode number followed by the device id, just to avoid randomness. In any case, these n-th level comparisons would be calling:

stat -c%n    file name
stat -c%u    file owner ID
stat -c%U    file owner name
stat -c%g    file group ID
stat -c%G    file group name
stat -c%i    inode number
stat -c%d    device number
stat -c%W    file birth timestamp (is this one useful for anyone?)
stat -c%X    last access timestamp
stat -c%Y    last modification timestamp
stat -c%Z    last status change timestamp

These are the ones that IMHO make sense as a comparison level. There are other criteria that make sense (e.g. the name lengths), but most cases that need a specific comparison scheme don't need anything beyond the criteria above. The only thing still missing is whether the comparison should keep the file with the smallest or the biggest value on each criterion. A solution would be (1) sorting, (2) keeping only the first entry of the sorting result, and (3) allowing a "reversed sorting" for each criterion by using an extra suffix.
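The three steps above could be sketched for a single criterion roughly as follows. pick_kept is a hypothetical helper, not something rmdupe has today, and it only handles the %Y criterion; it assumes GNU stat and sort, and file names without newlines.

```shell
#!/usr/bin/env bash
# pick_kept is a hypothetical helper (not part of rmdupe).
# Usage: pick_kept [-r] FILE...
# Prints the one file to keep: the first after sorting by modification time (%Y).
# With -r the sorting is reversed, so the newest file is kept instead of the oldest.
# Assumption: file names contain no newline characters.
pick_kept() {
  local key_opts=n                 # numeric, ascending: oldest first
  if [[ $1 == -r ]]; then
    key_opts=nr                    # numeric, descending: newest first
    shift
  fi
  stat -c '%Y %n' -- "$@" |        # one "timestamp name" line per file
    sort -s -k "1,1$key_opts" |    # (1) sort on the timestamp only; -s keeps ties in argument order
    head -n1 |                     # (2) keep only the first entry
    cut -d' ' -f2-                 # drop the timestamp, leaving the file name
}
```

For example, pick_kept a.txt b.txt would print whichever of the two has the smaller %Y value, and pick_kept -r a.txt b.txt the other one (step 3).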

The only part of the stat format that varies above is a single char from nuUgGidWXYZ, and those chars would be enough for choosing the criteria order. As every criterion could be reversed, there are two solutions I thought of: (1) using r as a suffix (luckily, %r and %R aren't stat format parameters), or (2) using a or d to denote ascending or descending order for sorting, telling the user that the first file after sorting is the only one kept. On (2) every criterion would need exactly 2 chars and "d" has 2 different meanings; on (1) a criterion can have 1 or 2 chars. Examples with both ideas:

# Last modification, name, inode, device
rmdupe --sort Ynid [...]
rmdupe --sort Yanaiada [...]

# Last modification, name reversed, inode, device
rmdupe --sort Ynrid [...]
rmdupe --sort Yandiada [...]
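For illustration, a spec in the style of idea (1) could be split into criteria roughly like this. parse_sort_spec is a hypothetical name and the --sort option itself is only a proposal here, so treat this as a sketch of the syntax rather than of anything rmdupe does.

```shell
#!/usr/bin/env bash
# parse_sort_spec is a hypothetical helper illustrating idea (1).
# It prints one token per line: a stat format letter from nuUgGidWXYZ,
# followed by "r" when that criterion is to be reversed.
parse_sort_spec() {
  local spec=$1 letters=nuUgGidWXYZ
  local i=0 ch token
  while (( i < ${#spec} )); do
    ch=${spec:i:1}
    if [[ $letters != *"$ch"* ]]; then
      echo "unknown criterion: $ch" >&2
      return 1
    fi
    token=$ch
    i=$((i + 1))
    # An "r" right after a criterion letter marks it as reversed.
    if [[ ${spec:i:1} == r ]]; then
      token+=r
      i=$((i + 1))
    fi
    echo "$token"
  done
}

parse_sort_spec Ynrid
```

The call at the end splits Ynrid into the four criteria Y, nr, i and d, one per line.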

I prefer (1) for the parameter. I think that Ynid should be the default criteria. Also, id/iada should always be appended as the parameter suffix no matter what was given (i.e., inode and device numbers as the last criteria, just to avoid randomness). The --sort name is just an idea that makes sense to me, as I would be thinking of getting the head -n1 of the sort command result, with sort -r as the reverse and an implicit sort -n for numbers. Yet the comparison on each ASCII criterion (file/owner/group name) would be [[ $name1 < $name2 ]] instead of (( curage < otherage )), so there's no need to call sort.
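A rough sketch of such a sort-free, multi-level comparison with the proposed default Ynid order. keeps_first is a hypothetical function name, reversal is left out for brevity, and GNU stat is assumed. Note that [[ < ]] compares strings using the current locale's collation.

```shell
#!/usr/bin/env bash
# keeps_first is a hypothetical helper (not part of rmdupe).
# Usage: keeps_first FILE1 FILE2
# Returns 0 if FILE1 should be kept over FILE2 under the criteria Y, n, i, d
# (each ascending), and 1 otherwise.
keeps_first() {
  local a=$1 b=$2 crit va vb
  for crit in Y n i d; do
    va=$(stat -c"%$crit" -- "$a")
    vb=$(stat -c"%$crit" -- "$b")
    if [[ $crit == [nUG] ]]; then
      # Textual criteria (file/owner/group name): string comparison.
      [[ $va < $vb ]] && return 0
      [[ $va > $vb ]] && return 1
    else
      # Numeric criteria (timestamps, IDs, inode, device): arithmetic comparison.
      (( va < vb )) && return 0
      (( va > vb )) && return 1
    fi
    # Equal on this level: fall through to the next criterion.
  done
  return 0   # identical on every level (e.g. the same file given twice)
}
```

Since %n prints the name exactly as it was passed in, comparing paths from different directories compares the whole path strings, which may or may not be what a user expects from a "name" criterion.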
