When reading a dataset with `DocumentDataset.read_parquet(..., blocksize=???, files_per_partition=None)` and running fuzzy dedup with `protocol=ucx` and the false-positive check enabled, we hit an error during the `shuffle_docs_on_buckets` -> `_batched_merge_and_write` step:
```
Stage3 (FalsePostiveCheck): Shuffle docs   0%| | 0/1 [00:00<?, ?it/s]
Started processing bucket-map partitions 0 through 1 of 1. Using 4 text partitions.
2025-01-02 08:31:21,288 - distributed.worker - ERROR - Compute Failed
Key:       ('read_parquet-fused-assign-7d4479cf1a375160a1452f529c7dfcef', 1)
State:     executing
Task:      <Task ('read_parquet-fused-assign-7d4479cf1a375160a1452f529c7dfcef', 1) _execute_subgraph(...)>
Exception: "ValueError('Cannot align indices with non-unique values')"
Traceback:
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask_expr/_expr.py", line 1849, in assign
    df[name] = val
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/dataframe.py", line 1445, in __setitem__
    self.insert(self._num_columns, arg, value)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/dataframe.py", line 3329, in insert
    return self._insert(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/dataframe.py", line 3403, in _insert
    value = value._align_to_index(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/indexed_frame.py", line 3739, in _align_to_index
    raise ValueError("Cannot align indices with non-unique values")

2025-01-02 08:31:21,351 - distributed.worker - ERROR - Compute Failed
Key:       getitem-de6fe9f32b5dc94114977026f9696781
State:     executing
Task:      <Task 'getitem-de6fe9f32b5dc94114977026f9696781' getitem(...)>
Exception: 'KeyError(0)'
Traceback: ''
  0%| | 0/1 [00:01<?, ?it/s]

Traceback (most recent call last):
  File "/benchmark/nemo-curator/scripts/run_curator_with_logs.py", line 1127, in main
    run_curation_pipeline(
  File "/benchmark/nemo-curator/scripts/run_curator_with_logs.py", line 969, in run_curation_pipeline
    curation_steps(dataset)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/meta.py", line 22, in __call__
    dataset = module(dataset)
  File "/benchmark/nemo-curator/scripts/pipeline_utils.py", line 115, in wrapped
    return func(
  File "/benchmark/nemo-curator/scripts/run_curator_with_logs.py", line 404, in fuzzy_dedup
    duplicates = fuzzy_dup(dataset=dataset)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/fuzzy_dedup.py", line 673, in __call__
    self.jaccard_shuffle.shuffle_docs_on_buckets(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/fuzzy_dedup.py", line 1166, in shuffle_docs_on_buckets
    self._batched_merge_and_write(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modules/fuzzy_dedup.py", line 1309, in _batched_merge_and_write
    written_files = written_files.compute()
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask_expr/_collection.py", line 480, in compute
    return DaskMethodsMixin.compute(out, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/base.py", line 372, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/base.py", line 660, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/client.py", line 2427, in _gather
    raise exception.with_traceback(traceback)
KeyError: 0
```
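For context (this is my illustration, not taken from the failing pipeline): the `ValueError` above is raised by cudf when a column assignment has to align a Series whose index contains duplicate labels against the target frame's index. Plain pandas rejects the same shape of assignment, so the failure mode can be reproduced in miniature on CPU; the frame, values, and column names below are hypothetical:

```python
import pandas as pd

# A frame with a unique index, and a value Series whose index has duplicate
# labels -- mirroring the partition state that cudf's _align_to_index rejects.
df = pd.DataFrame({"a": [1, 2, 3]})
s = pd.Series([10, 20, 30], index=[0, 0, 1])

try:
    df["b"] = s  # alignment against a duplicated index raises ValueError
except ValueError as e:
    print(f"assignment failed: {e}")

# Dropping the non-unique index before assigning sidesteps alignment entirely.
df["b"] = s.reset_index(drop=True)
print(df["b"].tolist())  # [10, 20, 30]
```

This only demonstrates the underlying alignment rule; whether resetting the partition index is a valid fix inside `_batched_merge_and_write` is for the maintainers to judge.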
Environment