Feature request: filter duplicate images #244
Labels
enhancement
New feature or request
Comments
Good request. Could you provide an album ID to use as a test case?
I thought about it: perhaps it would also work to run the duplicate check after everything has finished downloading, compare the results against the user's threshold, and then decide whether to physically delete the files.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import os
    import hashlib
    from collections import defaultdict


    def calculate_md5(file_path):
        """Compute the MD5 hash of a file."""
        hash_md5 = hashlib.md5()
        with open(file_path, "rb") as f:
            # Read in 4 KiB chunks so large files are never loaded whole
            for chunk in iter(lambda: f.read(4096), b""):
                hash_md5.update(chunk)
        return hash_md5.hexdigest()


    def find_duplicate_files(root_folder):
        """Recursively read every file under a folder and count MD5 occurrences."""
        md5_dict = defaultdict(list)
        for root, _, files in os.walk(root_folder):
            for file in files:
                file_path = os.path.join(root, file)
                file_md5 = calculate_md5(file_path)
                md5_dict[file_md5].append(file_path)
        # Print the files whose MD5 occurs two or more times
        for md5, paths in md5_dict.items():
            if len(paths) >= 2:
                print(f"MD5: {md5} occurrences: {len(paths)}")
                for path in paths:
                    print(f"  {path}")


    if __name__ == '__main__':
        dir_path = r"G:\Nexon\20240521\故鄉的那些女人"
        find_duplicate_files(dir_path)

Execution result:
JM212707
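That was my first thought too: run the check after everything has been downloaded. That way the implementation of this feature can be completely independent of jmcomic.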
hect0x7 added a commit that referenced this issue on May 24, 2024
Implementation example + test code

Test environment: the latest dev branch of jmcomic.

    from jmcomic import *


    # Plugin definition
    class DeleteDuplicatedFilesPlugin(JmOptionPlugin):
        plugin_key = 'delete_duplicated_files'

        def calculate_md5(self, file_path):
            """Compute the MD5 hash of a file."""
            import hashlib
            hash_md5 = hashlib.md5()
            with open(file_path, "rb") as f:
                for chunk in iter(lambda: f.read(4096), b""):
                    hash_md5.update(chunk)
            return hash_md5.hexdigest()

        def find_duplicate_files(self, root_folder):
            """Recursively read every file under a folder and group paths by MD5."""
            import os
            from collections import defaultdict
            md5_dict = defaultdict(list)
            for root, _, files in os.walk(root_folder):
                for file in files:
                    file_path = os.path.join(root, file)
                    file_md5 = self.calculate_md5(file_path)
                    md5_dict[file_md5].append(file_path)
            return md5_dict

        def invoke(self,
                   album=None,
                   downloader=None,
                   limit=2,
                   delete_if_exceed_limit=True,
                   **kwargs,
                   ) -> None:
            if album is None:
                return

            # Get the root directory the album was downloaded into
            # This method was newly added on the latest dev branch
            root_folder = self.option.dir_rule.decide_album_root_dir(album)
            md5_dict = self.find_duplicate_files(root_folder)

            # Print the files whose MD5 occurs at least `limit` times
            for md5, paths in md5_dict.items():
                if len(paths) >= limit:
                    print(f"MD5: {md5} occurrences: {len(paths)}")
                    for path in paths:
                        print(f"  {path}")
                    # Check the configured parameter to decide whether to delete
                    if delete_if_exceed_limit:
                        self.do_delete(paths)

        def do_delete(self, paths):
            """
            Reuse the parent class's deletion method
            """
            self.delete_original_file = True
            # Delete the files
            self.execute_deletion(paths)


    # Register the plugin manually
    JmModuleConfig.register_plugin(DeleteDuplicatedFilesPlugin)
    op = create_option_by_env()
    op.download_album(123)

Option configuration:

    plugins:
      after_album: # Invoke the plugin each time an album finishes downloading
        - plugin: delete_duplicated_files
          kwargs:
            # Threshold on the number of times an MD5 occurs
            limit: 1
            # Whether to delete files whose MD5 occurs >= limit times
            # With limit: 1, the effect is that all files are deleted
            delete_if_exceed_limit: true
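To make the `limit` semantics concrete, here is a minimal sketch, independent of jmcomic, of how the `>= limit` comparison selects files. The helper name `select_paths_to_delete` and the sample paths are hypothetical; only the comparison mirrors the plugin above.

```python
from collections import defaultdict


def select_paths_to_delete(md5_dict, limit):
    """Return every path whose MD5 occurs at least `limit` times.

    Hypothetical helper: it mirrors the `len(paths) >= limit` check in the
    plugin above, which selects all copies, including the first one.
    """
    selected = []
    for paths in md5_dict.values():
        if len(paths) >= limit:
            selected.extend(paths)
    return selected


# Made-up sample: one image shared by three chapters, plus two unique pages.
md5_dict = defaultdict(list)
md5_dict["aaa"] += ["ch1/001.jpg", "ch2/001.jpg", "ch3/001.jpg"]
md5_dict["bbb"] += ["ch1/002.jpg"]
md5_dict["ccc"] += ["ch2/002.jpg"]

only_repeats = select_paths_to_delete(md5_dict, limit=2)  # the three shared copies
everything = select_paths_to_delete(md5_dict, limit=1)    # all five files
```

With `limit=2` only the repeated image is selected, while `limit=1` matches every file, which is why the configuration comment warns that it deletes everything.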
Merged
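Tested, and it works. A threshold of 2 is completely fine for my own use. If a default value is going to be set, I would suggest starting at 3; a threshold of 2 does not feel appropriate as a default.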
hect0x7 added a commit that referenced this issue on May 27, 2024
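Many comics have duplicate image pages at the beginning and end of every chapter.
I would like the downloader to compute the MD5 of each page during download and count how many times each MD5 value occurs. Once a count exceeds a threshold (say 5, i.e. more than 5 occurrences), the duplicate images would be filtered out of every chapter downloaded after that point.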
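The counting scheme described above could be sketched roughly as follows. This is only an illustration, not jmcomic code: the class `DuplicatePageFilter` and its method `should_keep` are hypothetical names, and the sketch assumes the first `threshold` copies of a page are kept and later copies are dropped. It also assumes the page bytes are already in memory, so the filter decides whether a page is written to disk rather than whether it is fetched.

```python
import hashlib
from collections import Counter


class DuplicatePageFilter:
    """Hypothetical helper that counts image MD5s as pages arrive."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.counts = Counter()

    def should_keep(self, image_bytes):
        """Return False once this exact content has been kept `threshold` times."""
        digest = hashlib.md5(image_bytes).hexdigest()
        if self.counts[digest] >= self.threshold:
            return False
        self.counts[digest] += 1
        return True


# Made-up page contents standing in for downloaded image bytes.
page_filter = DuplicatePageFilter(threshold=2)
pages = [b"cover", b"page-1", b"cover", b"page-2", b"cover", b"cover"]
kept = [p for p in pages if page_filter.should_keep(p)]
# kept == [b"cover", b"page-1", b"cover", b"page-2"]
```

Because the decision is made page by page, the copies seen before the threshold is reached stay on disk; removing those as well would need a second pass after the download finishes.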
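Suppose a comic has 200 chapters with 50 images each, for 10,000 images in total. If computing the MD5 adds at most 50 ms per image, the extra time is 500 seconds, which is acceptable.
Ideally, the first 5 chapters reveal 2 images (one at the head, one at the tail) that each repeat 5 times, so the remaining 195 chapters can skip 195*2 images.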
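The estimate above can be checked with a few lines of arithmetic. The variable names are illustrative, and the figures are the ones assumed in the request, not measurements.

```python
chapters = 200
images_per_chapter = 50
md5_cost_ms = 50  # assumed worst-case MD5 overhead per image, in milliseconds

total_images = chapters * images_per_chapter
extra_seconds = total_images * md5_cost_ms / 1000

# Two repeated pages (head and tail) are identified within the first 5 chapters,
# so every later chapter can skip those two pages.
warmup_chapters = 5
repeated_pages_per_chapter = 2
skipped_images = (chapters - warmup_chapters) * repeated_pages_per_chapter

print(total_images)    # 10000
print(extra_seconds)   # 500.0
print(skipped_images)  # 390
```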
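Why this matters: Korean manhwa chapters usually open with at least 3 to 5 pages that repeat the previous chapter, and swiping past them with a thumb is tiring. With the threshold adjustable by the user to "greater than or equal to 2", the recap at the start of each chapter can be removed, which improves the reading experience.
I hope this request can be accepted and implemented as a plugin.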