Prepare and launch text-reuse detection with Passim #5

Open
7 of 9 tasks
piconti opened this issue May 2, 2024 · 2 comments

@piconti
Member

piconti commented May 2, 2024

Text-reuse detection with Passim version 1 takes place in multiple steps:

Some of these steps have already been completed; others are in progress.
An important step is adapting much of the code that previously ran with dask-kubernetes so that it runs on runai.
This means creating a Docker image and scripts that perform each of the steps.

@piconti
Member Author

piconti commented May 31, 2024

It is not clear exactly which command should have been used for the actual detection.
Two text-reuse detections were launched, with different configurations:

1st configuration (similar to the one for boilerplate):
SPARK_SUBMIT_ARGS='--master local[25] --driver-memory 200G --executor-memory 200G --conf spark.local.dir=/scratch/piconti/impresso/spark-tmp/' passim -w 4 --schema-path="/home/piconti/impresso-passim/sample_data/passim.schema" --fields 'date(date) as day' --pairwise --output-format json --filterpairs 'day < day2 AND datediff(day2,day) < 32 AND gid = gid2 AND uid <> uid2' "/scratch/piconti/impresso/text_reuse/rebuilt_data/*.jsonl.bz2" "/scratch/piconti/impresso/text_reuse/passim_output/"

2nd configuration launched (matches the one from 2019 currently in the S3):
SPARK_SUBMIT_ARGS='--master local[30] --driver-memory 200G --executor-memory 200G --conf spark.local.dir=/scratch/piconti/impresso/spark-tmp/' passim --schema-path="/home/piconti/impresso-passim/sample_data/passim.schema" --output-format json --filterpairs 'gid < gid2' "/scratch/piconti/impresso/text_reuse/rebuilt_data/*.jsonl.bz2" "/scratch/piconti/impresso/text_reuse/passim_output_conf2/"
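
To make the difference between the two `--filterpairs` expressions concrete, here is a minimal Python sketch that mirrors each SQL predicate on a pair of documents. It is an illustration only, not how Passim evaluates the filter: the dictionaries and their values are hypothetical, and the only thing taken from the commands above is the field names (`day`, `gid`, `uid`) and the conditions themselves. In Spark SQL, `datediff(day2, day)` is the number of days from `day` to `day2`.

```python
from datetime import date


def keep_pair_conf1(a, b):
    """Mirror of: day < day2 AND datediff(day2,day) < 32 AND gid = gid2 AND uid <> uid2."""
    return (
        a["day"] < b["day"]
        and (b["day"] - a["day"]).days < 32
        and a["gid"] == b["gid"]
        and a["uid"] != b["uid"]
    )


def keep_pair_conf2(a, b):
    """Mirror of: gid < gid2."""
    return a["gid"] < b["gid"]


# Hypothetical documents, for illustration only.
doc_a = {"uid": 1, "gid": 10, "day": date(1900, 1, 1)}
doc_b = {"uid": 2, "gid": 10, "day": date(1900, 1, 15)}
doc_c = {"uid": 3, "gid": 20, "day": date(1900, 3, 1)}

print(keep_pair_conf1(doc_a, doc_b), keep_pair_conf2(doc_a, doc_b))
print(keep_pair_conf1(doc_a, doc_c), keep_pair_conf2(doc_a, doc_c))
```

Under these assumptions, the first expression only keeps pairs that share a `gid` and lie less than 32 days apart, while the second only keeps pairs whose `gid` values differ (each unordered pair once), so the two runs compare disjoint sets of document pairs.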

The results of both will be postprocessed and compared to see which most closely matches the previous results and/or gives the best output.
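
One possible way to run such a comparison, sketched here under stated assumptions: treat each run as a set of co-clustered document pairs and measure their Jaccard overlap. The record fields `cluster` and `id` are assumed names for the cluster identifier and document identifier in the postprocessed output, and the sample records are made up; this is not the postprocessing actually used in the project.

```python
from collections import defaultdict
from itertools import combinations


def co_clustered_pairs(records):
    """Return the set of (id, id) pairs that share a cluster in one run."""
    clusters = defaultdict(set)
    for rec in records:
        clusters[rec["cluster"]].add(rec["id"])
    pairs = set()
    for members in clusters.values():
        # Sorting makes each unordered pair appear in one canonical order.
        pairs.update(combinations(sorted(members), 2))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical records standing in for the two runs' outputs.
run1 = [
    {"cluster": 1, "id": "a"},
    {"cluster": 1, "id": "b"},
    {"cluster": 1, "id": "c"},
]
run2 = [
    {"cluster": 7, "id": "a"},
    {"cluster": 7, "id": "b"},
    {"cluster": 8, "id": "c"},
    {"cluster": 8, "id": "d"},
]

overlap = jaccard(co_clustered_pairs(run1), co_clustered_pairs(run2))
print(overlap)  # 0.25: one shared pair out of four distinct pairs
```

Cluster identifiers are not compared directly, since they need not be stable across runs. Enumerating pairs grows quadratically with cluster size, so on the full data this would need to run in a distributed setting or on a sample.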

@piconti
Member Author

piconti commented Nov 5, 2024

The Python version of Passim has not been tried yet. It could be explored for the next release, where a different approach will probably be needed because of the very large amount of new data coming in.
