Prepare and launch text-reuse detection with Passim #5

piconti · 2024-05-02T08:49:31Z

Text-reuse detection with Passim version 1 takes place in multiple steps:

Rebuild all canonical data in the passim-rebuilt format (rebuilt adapted for passim's formatting needs)
Install all necessary tools based on the Programming Historian (PH) tutorial
Launch the boilerplate detection on all data which contains OLR (article-level segmentation)
Process this first output to create the pb.pkl dataframe containing the ids of all boilerplate articles
Remove the identified boilerplate articles from the considered data, and prepare for the actual text-reuse detection
Launch the text-reuse detection process
Post-process the output (another issue: Postprocess text-reuse and ensure results coincide with previous versions #6)
Also experiment with the updated Python version of Passim to compare results (another issue: Detect text-reuse with Python version of Passim and compare #7)
Document the updated process (another issue: Document updated text-reuse detection process #8)

Some of these steps are already done; others are in progress.
An important part of the work is adapting much of the code that previously ran with dask-kubernetes so that it runs on runai.
This means creating a Docker image and scripts that perform each of the steps.
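
As an illustration of the two boilerplate steps listed above (building the list of boilerplate ids, then removing those articles), here is a minimal sketch. It assumes the boilerplate run wrote JSON-lines files whose records carry an `id` field. The function names, the field name and the toy records are hypothetical. The standard `pickle` module stands in for the pb.pkl dataframe so that the sketch has no dependencies.

```python
import json
import pickle
import tempfile
from pathlib import Path


def collect_ids(output_dir, id_field="id"):
    """Return the sorted distinct values of `id_field` across JSON-lines files."""
    ids = set()
    for part in sorted(Path(output_dir).glob("*.json")):
        with part.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    ids.add(json.loads(line)[id_field])
    return sorted(ids)


def drop_flagged(documents, flagged_ids, id_field="id"):
    """Keep only the documents whose id was not flagged as boilerplate."""
    flagged = set(flagged_ids)
    return [doc for doc in documents if doc[id_field] not in flagged]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Toy stand-in for the boilerplate run's output (hypothetical records).
        part = Path(tmp) / "part-00000.json"
        part.write_text('{"id": "a-1"}\n{"id": "a-2"}\n{"id": "a-1"}\n',
                        encoding="utf-8")
        ids = collect_ids(tmp)

        # Persist the ids; the real pipeline stores a dataframe in pb.pkl.
        with (Path(tmp) / "pb.pkl").open("wb") as handle:
            pickle.dump(ids, handle)

        # Remove the flagged articles before the actual detection run.
        docs = [{"id": "a-1"}, {"id": "b-7"}]
        print(drop_flagged(docs, ids))
```

With pandas available, the same list could be wrapped in a DataFrame and written with `DataFrame.to_pickle`.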

piconti · 2024-05-31T07:25:29Z

It is not clear exactly which command should have been used for the actual detection.
Two text-reuse detections were launched, with different configurations:

1st configuration (similar to the one for boilerplate):
SPARK_SUBMIT_ARGS='--master local[25] --driver-memory 200G --executor-memory 200G --conf spark.local.dir=/scratch/piconti/impresso/spark-tmp/' passim -w 4 --schema-path="/home/piconti/impresso-passim/sample_data/passim.schema" --fields 'date(date) as day' --pairwise --output-format json --filterpairs 'day < day2 AND datediff(day2,day) < 32 AND gid = gid2 AND uid <> uid2' "/scratch/piconti/impresso/text_reuse/rebuilt_data/*.jsonl.bz2" "/scratch/piconti/impresso/text_reuse/passim_output/"

2nd configuration (matches the one from 2019 currently in S3):
SPARK_SUBMIT_ARGS='--master local[30] --driver-memory 200G --executor-memory 200G --conf spark.local.dir=/scratch/piconti/impresso/spark-tmp/' passim --schema-path="/home/piconti/impresso-passim/sample_data/passim.schema" --output-format json --filterpairs 'gid < gid2' "/scratch/piconti/impresso/text_reuse/rebuilt_data/*.jsonl.bz2" "/scratch/piconti/impresso/text_reuse/passim_output_conf2/"

The results of both runs will be post-processed and compared to see which one matches the previous results most closely and/or gives the best output.
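
To make the difference between the two --filterpairs expressions concrete, here is a small sketch that mirrors each predicate in plain Python. It assumes that `gid` identifies the group a document belongs to (in Passim, the series by default) and that `uid` identifies the document itself. The function names and the toy values are hypothetical, and `(day2 - day).days` plays the role of Spark SQL's `datediff(day2, day)`.

```python
from datetime import date


def keep_pair_conf1(day, gid, uid, day2, gid2, uid2):
    """Mirror of: day < day2 AND datediff(day2,day) < 32 AND gid = gid2 AND uid <> uid2."""
    return (
        day < day2
        and (day2 - day).days < 32  # second document less than 32 days later
        and gid == gid2             # same group only
        and uid != uid2             # never pair a document with itself
    )


def keep_pair_conf2(gid, gid2):
    """Mirror of: gid < gid2 (pairs across different groups, each kept once)."""
    return gid < gid2


if __name__ == "__main__":
    # Toy documents: a and b share a group, c belongs to another one.
    a = {"day": date(1900, 1, 1), "gid": 10, "uid": 1}
    b = {"day": date(1900, 1, 20), "gid": 10, "uid": 2}
    c = {"day": date(1900, 1, 20), "gid": 42, "uid": 3}

    for name, (x, y) in {"a-b": (a, b), "a-c": (a, c)}.items():
        conf1 = keep_pair_conf1(x["day"], x["gid"], x["uid"],
                                y["day"], y["gid"], y["uid"])
        conf2 = keep_pair_conf2(x["gid"], y["gid"])
        print(name, "conf1:", conf1, "conf2:", conf2)
```

Under these assumptions, the first configuration only compares documents within the same group that are less than 32 days apart, while the second only compares documents from different groups, so the two runs consider disjoint sets of candidate pairs.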

piconti · 2024-11-05T09:21:44Z

The Python version of Passim has not been tried yet. This could be done for the next release, where another approach will probably need to be devised because of the very large amount of new data coming in.

e-maud assigned piconti May 2, 2024
