ASO via Rucio
Rationale and "white paper" writeup are in the Rucio for CRAB page
The following is a proposal:
- CRAB output is eventually registered in a DBS dataset = Rucio container which can contain data from multiple CRAB tasks
- We map that dataset to one container, that's moot
- We can also create one container for each task, to help keep track of output from specific tasks (to be decided as we finalize the code and see how useful this is). Each such container will have multiple datasets to be mapped to DBS blocks (sticking with the usual 100 files/block)
- Those (Rucio) datasets will be added to both the top container and the task container
- There will be one rule per task container, i.e. one rule per task. This allows easy monitoring.
- then we can use Rucio to do the bookkeeping across successive iterations of the script:
- all actions like register replicas need to be idempotent
- `task_process/transfers/last_transfer.txt` is updated at the end of the script
- at each iteration the status of the task container can be retrieved from Rucio, and the current dataset defined for inserting new files
- ASO progress is monitored by checking the task rule. When replica is OK, transfersdb for that file is updated with transfer status and dbs-block name (the block name i.e. the rucio dataset name can be retrieved from the file name)
- when a dataset is fully replicated the tm_block_complete flag is set to True in transfersdb for at least one of the files in that dataset
- Internal bookkeeping:
- write one file per dataset with the list of LFNs and keep track of datasets being replicated or all done. When a dataset is all done ⇒ put info in transfersdb
- a json file {totFiles:30, inProgress:10, Replicated:20, Failed:0, lfns:[{lfn:LFN, status: OK|Fail|Repl}...]}
- script reads the dataset.json file, updates status of replicas, saves file back. When all replicas are done could e.g. move to different directory
- transfersdb for each LFN must be updated as soon as replica is at destination
- update tm_publish flag in transfersdb to 1 only when block is complete, so publisher does not lock/acquire files which it can’t publish.
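
The per-dataset bookkeeping step described above could be sketched as follows. This is only an illustration under stated assumptions: the function name `update_dataset_file` and its `replica_states` argument (a mapping LFN → status, e.g. built from whatever Rucio reports) are hypothetical, the per-file key is assumed to be `lfn`, and `Repl` is assumed to mean "still replicating". The file is rewritten through a temporary file so that running the script again after a crash is harmless.

```python
import json
from pathlib import Path


def update_dataset_file(path, replica_states):
    """Refresh one per-dataset bookkeeping file; safe to call repeatedly.

    replica_states is a hypothetical input: a dict mapping LFN to one of
    "OK", "Fail" or "Repl". LFNs missing from it keep their old status.
    Returns True when every file in the dataset is replicated.
    """
    path = Path(path)
    info = json.loads(path.read_text())

    for entry in info["lfns"]:
        entry["status"] = replica_states.get(entry["lfn"], entry["status"])

    # Recompute the counters from the per-file list instead of incrementing
    # them, so repeating the same update never double counts.
    statuses = [entry["status"] for entry in info["lfns"]]
    info["totFiles"] = len(statuses)
    info["Replicated"] = statuses.count("OK")
    info["Failed"] = statuses.count("Fail")
    info["inProgress"] = statuses.count("Repl")

    # Write to a temporary file and rename it over the original, so an
    # interrupted run cannot leave a half-written json behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(info))
    tmp.replace(path)

    return info["Replicated"] == info["totFiles"]
```

When the function returns True the caller could move the file to a "done" directory and update transfersdb, as outlined in the list above.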
- adapt StageoutCheck.py : #6172 #5978
- no change to jobs, job wrapper, cmscp. But fallback direct stageout will be disabled.
- same PostJob for Rucio or FTS stageout (small changes from the no-Rucio era)
- in task_process replace the call to `FTS_Transfer.py` with `RUCIO_Transfers.py`
- Publisher needs quite a few changes : #7223
- REST needs changes (store additional info like Rucio dataset names)
- RUCIO_Transfers.py (RT.py) creates one Rucio container (DBS dataset) for the task if needed
- like before, new tasks with same config can add to existing dataset
- RT.py creates a new Rucio dataset every 100 files (current config. for DBS blocks in Publisher)
- may need to port here the code which makes smaller datasets if the number of lumis per file is too large
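
A minimal sketch of the "new dataset every 100 files" rule, assuming datasets are tracked simply as ordered lists of LFNs with the last one being the current, still open dataset. The function name `assign_to_datasets` and the in-memory representation are hypothetical; no Rucio call is made here. LFNs that were already assigned are skipped, so repeating an iteration does not duplicate files.

```python
def assign_to_datasets(new_lfns, datasets, max_files=100):
    """Append new LFNs to the current dataset, opening a new one when full.

    datasets is a list of lists of LFNs in creation order (hypothetical
    representation); the last list is the dataset currently being filled.
    The list is modified in place and also returned.
    """
    already_assigned = {lfn for dataset in datasets for lfn in dataset}

    for lfn in new_lfns:
        if lfn in already_assigned:
            # Seen in a previous iteration: nothing to do (idempotent).
            continue
        if not datasets or len(datasets[-1]) >= max_files:
            # Current dataset is full (or none exists yet): open a new one.
            datasets.append([])
        datasets[-1].append(lfn)
        already_assigned.add(lfn)

    return datasets
```

The `max_files` parameter is where a smaller limit could be passed for tasks with many lumis per file.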
- RT.py creates Rucio rules as needed to move files from /store/temp/user on running site to /store/user on destination storage
- a single rule is better for the time the user wants to look things up
- RT.py keeps monitoring rules to check for transfer completion (like now for FTS).
- RT.py removes (declares bad) the replica in `/store/temp` and gfal-removes the physical file once the transfer is done
- RT.py stores dataset info in transfersdb for each file, so Publisher will know which DBS blocks to create
- for this create 2 new columns: `publication_block` (dataset name) and `block_complete` (NULL/'YES'/'NO'; likely `NO` is never needed)
- Publisher will keep the current way of starting from all files with publication status = `NEW` and setting them to `ACQUIRED`. Then it lists the datasets for all ACQUIRED files, finds out the complete ones and publishes them.
- Question ❓ How does Publisher know when a dataset/block is complete? Blocks must be published in one go; it is not possible to add files to a block afterwards.
- solution 1: Publisher uses RucioClient to check for datasets which have been fully replicated at the storage destination RSE
- solution 2: RT.py does all the talking with Rucio and somehow flags datasets for Publisher. RT.py decides when a dataset is done, not Rucio. A file replica may appear at destination "while" RT decides that it waited long enough and detaches that did from the dataset !
- How does RT.py communicate this to Publisher ?
- 2.a) a new table in CRAB DB ? clean, but more work, maybe overkill ?
- 2.b) RT.py adds a flag to "some" files in the dataset . Publisher assumes dataset to be ready for publication as long as it finds the flag set in at least one file in the dataset. Ugly
- if RT.py dies between detecting dataset done and marking files in transfersdb, need to recover when it restarts
Solution 1 is less work, Solution 2 is one less dependency for Publisher and also looks more robust: RT.py is in control of closing a dataset.
PLAN: go for 2.b) and see how it fares
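
How Publisher might apply option 2.b) can be sketched as below. This is an assumption-laden illustration, not the Publisher code: rows are represented as plain dicts, the keys `publication_block` and `block_complete` follow the column names proposed above, the `lfn` key and the function name `blocks_ready_for_publication` are hypothetical. A block counts as ready as soon as at least one of its files carries the flag.

```python
from collections import defaultdict


def blocks_ready_for_publication(rows):
    """Return {block name: [LFNs]} for blocks flagged as complete.

    rows is an iterable of dicts standing in for transfersdb rows of
    ACQUIRED files (hypothetical representation).
    """
    files_by_block = defaultdict(list)
    complete_blocks = set()

    for row in rows:
        block = row.get("publication_block")
        if block is None:
            # File not yet assigned to a dataset/block: cannot be published.
            continue
        files_by_block[block].append(row["lfn"])
        # Option 2.b): the flag on any single file marks the whole block.
        if row.get("block_complete") == "YES":
            complete_blocks.add(block)

    return {block: files_by_block[block] for block in complete_blocks}
```

Files belonging to blocks that are not returned would simply stay untouched until a later Publisher iteration finds the flag.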
- Problem: datasets are defined before files are transferred; if one file gets stuck, the whole dataset can't be published
- solution 1: bite the bullet. If half of the task output is left out because of a handful of missing files from running at some ill-fated site, that's life. Resubmission will hopefully fix it.
- solution 2: detach the LFN (did) from the dataset, so that the dataset will be completed (:crossed_fingers: TBC). When the job is eventually resubmitted the output will be handled by RT as a new file (TBC) and inserted in a new dataset. This is similar to the current flow with FTS ASO, where the output of failed jobs is published when a resubmission eventually succeeds, even if it means creating one block with just one file.
- Note detached LFN still is a registered file with its checksum etc. so inserting the new file from the resubmitted job will still require the steps indicated below.
- PLAN: execute following actions in order
- do nothing - this implements solution 1.
- test detach_did and if all OK implement. This way only the problematic files stay unpublished.
- implement resubmissions (when permissions are granted by Rucio devs) so fully implement solution 2. and recover current flow.
- job retries/resubmissions for failed jobs are no problem, since this all happens before ASO is called in
- in a way, Rucio will try forever to satisfy rules: ASO "will never give up and may never end"
- in practice nobody wants to wait forever. Also the transfer source may disappear (file lost, site down).
- we cannot simply resubmit the job and create another file with the same name: Rucio enforces that replicas of the same file are registered with the same checksum
- two-step solution
- now: extend the ASO waiting time in PostJob from 24h to 7 days. This gives time for user/ops to do something about stuck rules (even something as naive as freeing quota at destination) and to somehow prevent new resubmissions (:question: :thinking: ). May need to extend the time for which we run task process after condor is all done.
- next: make it possible to stick with the current flow: the replica in `/store/temp/user` is declared bad and removed from the rule as a way to signal "stop trying", and the job is resubmitted, which will insert a new replica so that things start again.
- ❓ who will "time out"? RT.py or PostJob? 🤔 What if something goes wrong in RT.py and only PostJob can save the day? 🤔
- implementing this requires some changes to Rucio authz part, which hopefully are now agreed on and in the pipeline