-
-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Duplicate Finder #1107
base: dev
Are you sure you want to change the base?
Conversation
a941899
to
de28e84
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for this! Looks delightful.
I'll give this a more indepth review later, but I don't think the API should have any "batch" kind of endpoints if possible - Single/Atomic operations are easier to code on the server-side and clients can just loop a bunch of calls..
If the toasts are annoying I believe it'd be useful to add an optional parameter to the JS function for API calls to disable showing them on a successful operation.
Please add ignore option to the archive. For example, if you want to preserve duplicate archives with different translation styles. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for having taken so long to get to reviewing this - It's not a lot of lines but due to being a brand new feature I wanted to give it a detailed review.
I think the overall logic and minion job part is solid, most of my comments are on cleaning up the UI.
@@ -107,6 +107,26 @@ sub regen_thumbnails { | |||
); | |||
} | |||
|
|||
# Queue the find_duplicates Minion job. | |||
sub find_duplicates { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need this API endpoint -- I believe you could just use the existing /api/minion/find_duplicates/queue
by passing the threshold in the args
array parameter.
# Go through the archives in the content directory and build the template at the end. | ||
sub index { | ||
my $self = shift; | ||
my $redis = $self->LRR_CONF->get_redis; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use get_redis_config
so it doesn't pollute the ID list, imo.
true, | ||
(d) => { | ||
$(".find-duplicates").prop("disabled", false); | ||
LRR.toast({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a toast, I'd just refresh the window so that the newly found duplicates appear.
(While making sure that the refreshed URL doesn't contain delete=1
so it doesn't insta-delete the dupes we just found!)
* Sends a POST request to queue a find_duplicates job, | ||
* detecting archive duplicates based on their thumbnail hashes. | ||
*/ | ||
Server.findDuplicates = function () { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function probably belongs in duplicates.js?
<h2 class="ih" style="text-align:center">Duplicates</h2> | ||
<p>Found [% duplicates.size %] duplicate groups</p> | ||
|
||
[% IF userlogged %] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already require is_logged_in
in Routing, so checking for userlogged
is unnecessary here
<body> | ||
<div class='ido' style='text-align:center; overflow-x:auto;'> | ||
<h2 class="ih" style="text-align:center">Duplicates</h2> | ||
<p>Found [% duplicates.size %] duplicate groups</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be some explanatory text here that details how the feature works.
("Found X duplicate groups" might be good to hide if there's no duplicate results? Not necessary though imo)
<button type="button" class="stdbtn find-duplicates">Find Duplicates</button> | ||
<button type="button" class="stdbtn clear-duplicates">Clear Duplicates</button> | ||
</div> | ||
<div class="select-btn-group"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd hide this div if there are no duplicate groups found/available.
<div class="thumbnail-wrapper"> | ||
<a href="/reader?id=[% archive.arcid %]" title="[% archive.title %]"> | ||
<img class="thumbnail" src="/api/archives/[% archive.arcid %]/thumbnail" width="100"/> | ||
</a> | ||
<div class="thumbnail-popover"> | ||
<img src="/api/archives/[% archive.arcid %]/thumbnail" /> | ||
</div> | ||
</div> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd use the same tooltip mechanism as the main index here instead:
<div class="thumbnail-wrapper"> | |
<a href="/reader?id=[% archive.arcid %]" title="[% archive.title %]"> | |
<img class="thumbnail" src="/api/archives/[% archive.arcid %]/thumbnail" width="100"/> | |
</a> | |
<div class="thumbnail-popover"> | |
<img src="/api/archives/[% archive.arcid %]/thumbnail" /> | |
</div> | |
</div> | |
<div class="thumbnail-wrapper"> | |
<a onmouseover="IndexTable.buildImageTooltip(this)" href="${new LRR.apiURL('/reader?id=[% archive.arcid %]')}" title="[% archive.title %]"> | |
<img class="thumbnail" src="${new LRR.apiURL(`/api/archives/[% archive.arcid %]/thumbnail`)}" width="100"/> | |
</a> | |
<div class="caption" style="display: none;"> | |
<img style="height:300px" src="${new LRR.apiURL('/api/archives/${data.arcid}/thumbnail')}" | |
onerror="this.src='${new LRR.apiURL('/img/noThumb.png')}'"> | |
</div> | |
</div> |
|
||
foreach my $id (@$_) { | ||
# Skip if this ID has already been processed in another thread | ||
next if $visited{$id}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you've used split_workload_by_cpu
, normally each process/thread should have its own unique set of IDs.
I'm not sure you need visited as a result?
<script src="[% c.url_for("/js/common.js?$version") %]" type="text/JAVASCRIPT"></script> | ||
<script src="[% c.url_for("/js/server.js?$version") %]" type="text/JAVASCRIPT"></script> | ||
<script src="[% c.url_for("/js/duplicates.js?$version") %]" type="text/JAVASCRIPT"></script> | ||
<style> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm not a fan of having special CSS here, this will mess up custom themes.
Ideally I think everything here is doable with the base lrr.css and re-using Index bits for the tags and thumbnail popups.
It's fine to add a few extra classes to lrr.css if you'd need specific sizing.
Implement a feature to detect and delete duplicate archives in the library based on the hamming distance between thumbnail hashes (#338).
The minion job is parallelized and is way faster than the initial plugin, for my library (27k archives, 730GB) a full run took about 1h.
Additionally, I improved upon the script instead of only return duplicate pairs (e.g. we have 3 duplicate archives: id1 -> id2, id1 -> id3, id1 -> id3, now it returns one group [id1, id2, id3]).
view in
/duplicates
, it is currently not connected from anywhere else. Should be considered if it should be linked to from navbar or settings. Furthermore, the CSS is defined inline in the HTML, maybe it should be separated into separate file?I could not figure out how to do the tooltip for the tags like it is on index page. Would be great to see the actual tags if you hover over the tag count. I would need some help there.
I implemented different ways to delete the duplicates:
A batch delete endpoint in the API would be great, the toasts are a bit annoying if you delete some archives in batch.
Tested on my library, detected and deleted 776 duplicate groups.
Would appreciate any feedback.