v0.0.7 wrap up #47

Merged: 10 commits, Dec 20, 2023
5 changes: 5 additions & 0 deletions .gitignore
@@ -29,6 +29,11 @@ go.work
/.code
/dist

# output files
smash
smash.exe
smash.out

# temporary files
report.json
analysis.json
Binary file added docs/artefacts/smash-v0.0.7-demo.gif
3 changes: 2 additions & 1 deletion docs/demos.md
@@ -3,10 +3,11 @@
Pre-recorded demos of `smash` in action.

# Examples
* [v0.0.7-hdd-photos](https://vhs.charm.sh/vhs-4OwN0BJfb3F3CTzGJCFHcs.gif) - `smash`ing an old portable USB HDD of photos & removing duplicates.
* [v0.0.5-hdd-photos](https://vhs.charm.sh/vhs-7B6XHxXq8VPvZ6AY9FpGIc.gif) - `smash`ing an old portable USB HDD of photos with excluded directories.

# Versions

* [v0.0.7](https://vhs.charm.sh/vhs-5uZbZAvk8Y6eq4dihLppbk.gif) - powered by [VHS](https://vhs.charm.sh)
* [v0.0.5](https://vhs.charm.sh/vhs-1zSMi9vYpmh0DivoB4E6g4.gif) - powered by [VHS](https://vhs.charm.sh)
* [v0.0.4](https://vhs.charm.sh/vhs-tgMXNRqo7UovLRd5iSlgF.gif) - powered by [VHS](https://vhs.charm.sh)
* [v0.0.3](https://vhs.charm.sh/vhs-1T6pqQivwvPAmudnDpwVQP.gif) - powered by [VHS](https://vhs.charm.sh)
17 changes: 16 additions & 1 deletion docs/vhs/demo-photos-hdd.tape
@@ -13,7 +13,22 @@ Set WindowBar Colorful
Set Theme "TokyoNight"

# smash Linux/drivers
Type "./smash /media/thushan/smash/photos/ --exclude-dir=_uploaded,sort,tmp,events"
Type "./smash /media/thushan/smash/photos/ --exclude-dir=sort,tmp,events -o report.json"
Sleep 500ms
Enter
Sleep 30s
Type "clear"
Sleep 500ms
Enter
Type `jq '.analysis.dupes[]|[.location,.path,.filename]|join("/")' report.json`
Sleep 500ms
Enter
Sleep 5s
Type "clear"
Sleep 500ms
Enter
Type `jq '.analysis.dupes[]|[.location,.path,.filename]|join("/")' report.json | xargs ls -lh`
Sleep 500ms
Enter
Sleep 5s

29 changes: 28 additions & 1 deletion docs/vhs/demo.tape
@@ -13,7 +13,34 @@ Set WindowBar Colorful
Set Theme "JetBrains Darcula"

# smash Linux/drivers
Type "./smash ~/linux/drivers"
Type "./smash ~/linux/drivers --exclude-dir=git -o report.json"
Sleep 500ms
Enter
Sleep 10s
Type "clear"
Sleep 500ms
Enter
Type `jq '.analysis.dupes[]|[.location,.path,.filename]|join("/")' report.json | xargs wc -l`
Sleep 500ms
Enter
Sleep 5s
Type `jq '.analysis.dupes[]|[.location,.path,.filename]|join("/")' report.json | xargs rm`
Sleep 500ms
Enter
Sleep 5s
Type "cd ~/linux/drivers"
Sleep 500ms
Enter
Sleep 2s
Type "git status -s"
Sleep 500ms
Enter
Sleep 3s
Type "git reset --hard"
Sleep 500ms
Enter
Sleep 3s
Type "git status -s"
Sleep 500ms
Enter
Sleep 5s
4 changes: 4 additions & 0 deletions internal/smash/configuration.go
@@ -39,6 +39,10 @@ func (app *App) printConfiguration() {
theme.Println(b.Sprint("Algorithm: "), theme.ColourConfig(algorithms.Algorithm(f.Algorithm)))
theme.Println(b.Sprint("Locations: "), theme.ColourConfig(strings.Join(app.Locations, ", ")))

if !f.HideOutput && f.OutputFile != "" {
theme.Println(b.Sprint("Output: "), theme.ColourConfig(f.OutputFile), "(json)")
}

if len(f.ExcludeDir) > 0 || len(f.ExcludeFile) > 0 {
theme.StyleBold.Println("Excluded")
if len(f.ExcludeDir) > 0 {
41 changes: 30 additions & 11 deletions internal/smash/export.go
@@ -41,18 +41,23 @@ type ReportTopFilesSummary struct {
}
type ReportFiles struct {
Fails []ReportFailSummary `json:"fails"`
Empty []ReportFileSummary `json:"empty"`
Empty []ReportFileBaseSummary `json:"empty"`
Dupes []ReportDuplicateSummary `json:"dupes"`
}

type ReportFailSummary struct {
Filename string `json:"filename"`
Error string `json:"error"`
}
type ReportFileSummary struct {

type ReportFileBaseSummary struct {
Filename string `json:"filename"`
Location string `json:"location"`
Path string `json:"path"`
}

type ReportFileSummary struct {
ReportFileBaseSummary
Hash string `json:"hash"`
Size uint64 `json:"size"`
FullHash bool `json:"fullHash"`
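The hunk above moves the three location fields into `ReportFileBaseSummary` and embeds that struct in `ReportFileSummary`. A minimal sketch of what this means for the JSON output, assuming only the fields visible in this hunk (the real struct may carry more; `toJSON` and the sample values are hypothetical helpers for illustration): with `encoding/json`, an untagged embedded struct has its fields promoted, so the full summary should still serialise as a flat object, while the base struct alone yields just the three location keys.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Field subset copied from the hunk above; the real types may have more fields.
type ReportFileBaseSummary struct {
	Filename string `json:"filename"`
	Location string `json:"location"`
	Path     string `json:"path"`
}

type ReportFileSummary struct {
	ReportFileBaseSummary // embedded without a tag, so its fields are promoted
	Hash     string `json:"hash"`
	Size     uint64 `json:"size"`
	FullHash bool   `json:"fullHash"`
}

// toJSON is a hypothetical helper, not part of smash.
func toJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// Sample values are made up for illustration.
	full := ReportFileSummary{
		ReportFileBaseSummary: ReportFileBaseSummary{
			Filename: "a.jpg",
			Location: "/media/photos",
			Path:     "2019",
		},
		Hash:     "abc123",
		Size:     42,
		FullHash: true,
	}

	// Flat object: the embedded fields appear alongside hash, size and fullHash.
	fmt.Println(toJSON(full))

	// Base only: the shape used for entries under "empty" after this change.
	fmt.Println(toJSON(full.ReportFileBaseSummary))
}
```

This is why the `jq` filters in the tapes can read `.location`, `.path` and `.filename` directly from each entry without descending into a nested object.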
@@ -116,19 +121,23 @@ func getHostName() string {
if host, err := os.Hostname(); err == nil {
return host
}
return "Unknown"
return "Classified"
}

func summariseRunAnalysis(session *AppSession) ReportFiles {

fails := summariseSmashFails(session.Fails)
empty := summariseEmptyFiles(session.Empty.Files)
dupes := transformDupes(session.Dupes)

return ReportFiles{
Fails: summariseSmashedFails(session.Fails),
Empty: summariseSmashedFiles(session.Empty.Files),
Dupes: transformDupes(session.Dupes),
Fails: fails,
Empty: empty,
Dupes: dupes,
}
}

func summariseSmashedFails(fails *xsync.MapOf[string, error]) []ReportFailSummary {
func summariseSmashFails(fails *xsync.MapOf[string, error]) []ReportFailSummary {
summary := make([]ReportFailSummary, fails.Size())
var index = 0
fails.Range(func(key string, value error) bool {
@@ -147,16 +156,24 @@ func transformDupes(duplicates *xsync.MapOf[string, *DuplicateFiles]) []ReportDu
var index = 0
duplicates.Range(func(hash string, dupe *DuplicateFiles) bool {
root := dupe.Files[0]
rest := dupe.Files[1:]
dupes[index] = ReportDuplicateSummary{
ReportFileSummary: summariseSmashedFile(root),
Duplicates: summariseSmashedFiles(dupe.Files),
Duplicates: summariseSmashedFiles(rest),
}
index++
return true
})
return dupes
}

func summariseEmptyFiles(files []File) []ReportFileBaseSummary {
summary := make([]ReportFileBaseSummary, len(files))
for i, file := range files {
summary[i] = summariseSmashedFile(file).ReportFileBaseSummary
}
return summary
}
func summariseSmashedFiles(files []File) []ReportFileSummary {
summary := make([]ReportFileSummary, len(files))
for i, file := range files {
@@ -166,9 +183,11 @@ func summariseSmashedFiles(files []File) []ReportFileSummary {
}
func summariseSmashedFile(file File) ReportFileSummary {
return ReportFileSummary{
Filename: file.Filename,
Location: file.Location,
Path: filepath.Dir(file.Path),
ReportFileBaseSummary: ReportFileBaseSummary{
Filename: file.Filename,
Location: file.Location,
Path: filepath.Dir(file.Path),
},
Hash: file.Hash,
Size: file.FileSize,
FullHash: file.FullHash,
Expand Down
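The `transformDupes` change above stops passing the whole `dupe.Files` slice as the duplicate list and passes `dupe.Files[1:]` instead, so the first file is reported once as the root rather than also appearing among its own duplicates. A minimal sketch of that split, using a hypothetical `group` type with plain string paths in place of smash's `DuplicateFiles` and `File`; the empty-slice guard is an addition for the sketch and is not in the diff, which indexes `Files[0]` directly:

```go
package main

import "fmt"

// group is a hypothetical stand-in for a set of files that share one hash.
type group struct {
	Files []string
}

// splitGroup returns the first file as the root and the remaining files as
// its duplicates. ok is false when the group is empty (guard added here).
func splitGroup(g group) (root string, rest []string, ok bool) {
	if len(g.Files) == 0 {
		return "", nil, false
	}
	return g.Files[0], g.Files[1:], true
}

func main() {
	// Sample paths are made up for illustration.
	g := group{Files: []string{"2019/a.jpg", "backup/a.jpg", "tmp/a-copy.jpg"}}

	root, rest, ok := splitGroup(g)
	if !ok {
		fmt.Println("empty group")
		return
	}
	fmt.Println("root:", root)
	fmt.Println("duplicates:", rest)
}
```

Note that `g.Files[1:]` shares its backing array with `g.Files`; that is fine for read-only reporting as sketched here.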
35 changes: 24 additions & 11 deletions readme.md
@@ -13,14 +13,13 @@ Tool to `smash` through to find duplicate files efficiently by slicing a file (o
and computing a hash using a fast non-cryptographic algorithm such as [xxhash](https://xxhash.com/) or [murmur3](https://en.wikipedia.org/wiki/MurmurHash).

<p align="center">
<img src="https://vhs.charm.sh/vhs-1zSMi9vYpmh0DivoB4E6g4.gif" alt="Made with VHS"><br/>
<img src="https://vhs.charm.sh/vhs-5uZbZAvk8Y6eq4dihLppbk.gif" alt="Made with VHS"><br/>
<sub>
<sup>Find duplicates in the <a href="https://github.com/torvalds/linux">linux/drivers</a> source tree with <code>smash</code> (see our <a href="docs/demos.md">🍿 other demos</a>). Made with <a href="https://vhs.charm.sh" target="_blank">vhs</a>!</sup>
</sub>
</p>

`smash` has a read-only view of the underlying filesystem and only reports duplicates - currently, we do not remove
duplicates and instead leave that for you to do via the output. We also don't support symlinks or NT Junction Points (Windows Symlinks) and ignore them.
`smash` has a read-only view of the underlying filesystem, outputs empty and duplicate files into a json report that you can use a tool like [jq](https://github.com/jqlang/jq) to operate on. See examples below or [this vhs tape](https://vhs.charm.sh/vhs-4OwN0BJfb3F3CTzGJCFHcs.gif).

The name comes from a prototype tool called SmartHash (written many years ago in C/ASM that's now lost in source &
too hard to modernise) which operated on a similar concept (with CRC32 then later MD5).
@@ -48,25 +47,26 @@ $ go install github.com/thushan/smash@latest
```
Usage:
smash [flags] [locations-to-smash]

Flags:
--algorithm algorithm Algorithm to use to hash files. Supported: xxhash, murmur3, md5, sha512, sha256 (full list, see readme) (default xxhash)
--base strings Base directories to use for comparison Eg. --base=/c/dos,/c/dos/run/,/run/dos/run
--disable-slicing Disable slicing & hash the full file instead
--disable-autotext Disable detecting text-files to opt for a full hash for those
--disable-meta Disable storing of meta-data to improve hashing mismatches
--disable-slicing Disable slicing & hash the full file instead
--exclude-dir strings Directories to exclude separated by comma Eg. --exclude-dir=.git,.idea
--exclude-file strings Files to exclude separated by comma Eg. --exclude-file=.gitignore,*.csv
-h, --help help for smash
--ignore-empty Ignore empty/zero byte files (default true)
--ignore-hidden Ignore hidden files & folders Eg. files/folders starting with '.' (default true)
--ignore-system Ignore system files & folders Eg. '$MFT', '.Trash' (default true)
-p, --max-threads int Maximum threads to utilise (default 16)
-w, --max-workers int Maximum workers to utilise when smashing (default 8)
-w, --max-workers int Maximum workers to utilise when smashing (default 16)
--nerd-stats Show nerd stats
--no-output Disable report output
--no-progress Disable progress updates
--no-top-list Hides top x duplicates list
-o, --output-file string Export as JSON
-o, --output-file string Export analysis as JSON (generated automatically otherwise)
--profile Enable Go Profiler - see localhost:1984/debug/pprof
--progress-update int Update progress every x seconds (default 5)
--show-duplicates Show full list of duplicates
@@ -84,18 +84,30 @@ Examples are given in Unix format, but apply to Windows as well.

### Basic

To check for duplicates in a single path (Eg. `~/media/photos`)
To check for duplicates in a single path (Eg. `~/media/photos`) & output report to `report.json`

```bash
$ ./smash ~/media/photos
$ ./smash ~/media/photos -o report.json
```

You can then look at `report.json` with [jq](https://github.com/jqlang/jq) to check duplicates:

```bash
$ jq '.analysis.dupes[]|[.location,.path,.filename]|join("/")' report.json | xargs wc -l
```
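
If `jq` is not available, the same path joining can be sketched in Go. This is an illustration only, not part of `smash`: the `report` and `entry` types below are hypothetical and model just the keys the `jq` filter touches (`analysis.dupes[]` with `location`, `path` and `filename`), and the sample JSON is made up rather than real output. In practice the bytes would come from something like `os.ReadFile("report.json")`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// entry models only the three keys used by the jq filter.
type entry struct {
	Filename string `json:"filename"`
	Location string `json:"location"`
	Path     string `json:"path"`
}

// report models only the part of the document the filter reads;
// every other key in the report is ignored by json.Unmarshal.
type report struct {
	Analysis struct {
		Dupes []entry `json:"dupes"`
	} `json:"analysis"`
}

// dupePaths joins location, path and filename with "/" for each top-level
// entry under analysis.dupes, mirroring [.location,.path,.filename]|join("/").
func dupePaths(data []byte) ([]string, error) {
	var r report
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, err
	}
	paths := make([]string, 0, len(r.Analysis.Dupes))
	for _, d := range r.Analysis.Dupes {
		paths = append(paths, strings.Join([]string{d.Location, d.Path, d.Filename}, "/"))
	}
	return paths, nil
}

func main() {
	// Made-up sample shaped like the keys above; not real smash output.
	sample := []byte(`{"analysis":{"dupes":[
		{"location":"/media/photos","path":"2019","filename":"a.jpg","hash":"abc123"}
	]}}`)

	paths, err := dupePaths(sample)
	if err != nil {
		fmt.Println("could not parse report:", err)
		return
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

`strings.Join` is used instead of `filepath.Join` so the result matches the `jq` expression character for character, without any path cleaning.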

### Show Empty Files

By default, `smash` ignores empty files but can report on them with the `--ignore-empty=false` argument:

```bash
$ ./smash ~/media/photos --ignore-empty=false
$ ./smash ~/media/photos --ignore-empty=false -o report.json
```

You can then look at `report.json` with [jq](https://github.com/jqlang/jq) to check empty files:

```bash
$ jq '.analysis.empty[]|[.location,.path,.filename]|join("/")' report.json | xargs wc -l
```

### Show Top 50 Duplicates
@@ -155,6 +167,7 @@ $ ./smash --algorithm:murmur3 ~/media/photos

This project was possible thanks to the following projects or folks.

* [@jqlang/jq](https://github.com/jqlang/jq) - without `jq` we'd be a bit lost!
* [@wader/fq](https://github.com/wader/fq) - countless nights of inspecting binary blobs!
* [@cespare/xxhash](https://github.com/cespare/xxhash) - xxhash implementation
* [@spaolacci/murmur3](https://github.com/spaolacci/murmur3) - murmur3 implementation
@@ -164,7 +177,7 @@ This project was possible thanks to the following projects or folks.
* [@golangci/golangci-lint](https://github.com/golangci/golangci-lint) - Go Linter
* [@dkorunic/betteralign](https://github.com/dkorunic/betteralign) - Go alignment checker

Testers - MarkB, JarredT, BenW, DencilW, JayT, ASV, TimW, RyanW, WilliamH
Testers - MarkB, JarredT, BenW, DencilW, JayT, ASV, TimW, RyanW, WilliamH, SpencerB, EmadA, ChrisE, AngelaB

# Licence
