Automatic Splitting: Concept
To reduce undue load on the scheduler of CRAB3, user splitting should be improved by having CRAB3 split tasks into jobs taking only a single user parameter into account: the desired runtime. For this purpose, task processing is going to be split into three stages, each corresponding to one or more HTCondor DAGs:
- A probe stage, where one or several splitting probes are used to estimate the event throughput of the user-provided `pset` and arguments.
- Next is the processing stage, or conventional stage for all old splitting modes. Here, all jobs will process the dataset based on the splitting determined by the probe(s).
- With a strict runtime limit enforced by CMSSW, jobs of the processing stage will sometimes not finish in time and will produce incomplete output. To fully process a dataset, the tail stage analyses the framework job reports (FWJRs) of the processing jobs and submits shorter tail jobs to finish the processing of the dataset.
Adding automatic splitting started with the processing flow of a task in CRAB3 that is common to all splitting mechanisms so far:
The part on the schedd was replaced by a DAG with a single probe job, which has an unfilled SubDAG specified as its child node. When the splitting probe finishes, the PostJob of the probe reads the framework job report, calculates the event throughput and uses EventAwareLumiBased splitting to split the task into jobs that are estimated to take a user-specified amount of time.
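The throughput estimate made by the probe's PostJob can be illustrated with a minimal sketch. It is not the CRAB3 implementation: `ProbeReport`, its fields and `events_per_job` are hypothetical names, and the sketch assumes that the framework job report yields a processed-event count and a wall-clock time.

```python
from dataclasses import dataclass


@dataclass
class ProbeReport:
    """Hypothetical summary of the numbers read from a probe's job report."""
    events_processed: int
    wall_seconds: float


def events_per_job(report: ProbeReport, target_runtime_s: float) -> int:
    """Estimate how many events fit into one job of the requested runtime."""
    if report.events_processed <= 0 or report.wall_seconds <= 0:
        raise ValueError("probe did not process any events")
    throughput = report.events_processed / report.wall_seconds  # events per second
    # Always ask for at least one event per job.
    return max(1, int(throughput * target_runtime_s))


# A probe that processed 1200 events in 10 minutes, with an 8 hour target runtime.
probe = ProbeReport(events_processed=1200, wall_seconds=600.0)
print(events_per_job(probe, target_runtime_s=8 * 3600))
```

The resulting number would then serve as the events-per-job target handed to the EventAwareLumiBased splitting.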
With optional tail jobs, splitting adds a 50% buffer to the timing estimate when splitting the task and configures `cmsRun` to stop processing new luminosity sections after the user-specified amount of runtime has passed. Each processing job has a child SubDAG for tail jobs, which is filled by its PostJob, akin to how the probe's PostJob fills the processing SubDAG, but with only the unprocessed luminosity sections taken into account for the splitting.
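Finding the luminosity sections left over for the tail jobs amounts to a set difference between what a processing job was assigned and what it reports as processed. The sketch below is only an illustration: the function name and the `{run: [lumis]}` dictionaries are assumptions, not the format CRAB3 uses internally.

```python
def unprocessed_lumis(assigned, processed):
    """Return {run: sorted lumis} that were assigned but not processed."""
    remaining = {}
    for run, lumis in assigned.items():
        left = set(lumis) - set(processed.get(run, ()))
        if left:
            remaining[run] = sorted(left)
    return remaining


# Lumis assigned to a processing job, and those it finished before the time limit.
assigned = {1: [1, 2, 3, 4], 2: [10, 11]}
processed = {1: [1, 2], 2: [10, 11]}
print(unprocessed_lumis(assigned, processed))
```

Only the remainder returned here would be passed to the splitting for the tail jobs.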
Main targets of this implementation:
- Move the schedd load from a majority of sub-1-hour jobs to longer-running jobs.
- Give the users a better handle on the splitting. Most splitting parameters are currently just guessed and therefore sub-optimal.
Problems of the 2016 setup:
- A single probe job gives poor splitting estimates for jobs that skim heavily or for data with inhomogeneous events.
- One tail DAG per processing job increases the load on the schedd unreasonably, since each DAG requires 16 MB of memory. A task with 1,000 processing jobs, for example, would need about 16 GB for its tail DAGs alone.
The plan is to increase the number of probes to 5 and to add only a small number (fewer than 5) of tail DAGs:
Here, the tail DAGs are equivalent to what has been run so far with DAGMan: a bag of jobs, each with a PreJob and a PostJob. The probe DAG, on the other hand, contains one SubDAG in addition to the bag of probe jobs. The PreJob of this SubDAG defers until all probe jobs have run, then runs the splitting on the dataset and fills the main processing DAG. In the main processing DAG, several similar steps are chained together: a first tail DAG is filled when, e.g., 50% of the processing jobs have run. It has a "child" tail DAG, for which the PreJob defers execution until all processing jobs have run.
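The two decisions described above, combining several probes into one estimate and deferring a stage until enough upstream jobs have run, can be sketched as follows. Both functions are hypothetical. In particular, pooling all probes into a single events-per-second figure is an assumption; a median or another robust estimate may suit inhomogeneous data better.

```python
def combined_throughput(probes):
    """Pool (events, seconds) pairs from several probes into events per second."""
    total_events = sum(events for events, _ in probes)
    total_seconds = sum(seconds for _, seconds in probes)
    if total_seconds <= 0:
        raise ValueError("probes report no runtime")
    return total_events / total_seconds


def stage_ready(finished, total, required_fraction=1.0):
    """Return True once enough upstream jobs have finished.

    A PreJob would defer and check again later while this returns False.
    """
    if total <= 0:
        raise ValueError("stage has no upstream jobs")
    return finished / total >= required_fraction


# Three probes given as (events processed, wall-clock seconds).
print(combined_throughput([(1000, 500.0), (200, 400.0), (600, 300.0)]))

# First tail DAG: fill once 50% of the processing jobs have run.
print(stage_ready(finished=50, total=100, required_fraction=0.5))

# Child tail DAG: wait until all processing jobs have run.
print(stage_ready(finished=99, total=100))
```

The same `stage_ready` check with the default fraction of 1.0 would cover the SubDAG PreJob that waits for all probe jobs before splitting.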