Recovery of history data after downtimes #60
So, it used to be that HTCondor didn't index records accordingly and would continue to scan through old history files even when it was impossible to find additional records. That is, if you said "return all records from the last 5 minutes" without providing a limit on how many jobs to return, it would search through the entire history database on the schedd side. Now, based on CMS's complaints, upstream updated the schedd and Python bindings to avoid this situation. IIRC, you provide a "last processed time" (or maybe job ID?) and it'll stop scanning history files once it reaches that point. Could you do some digging and figure out what the minimal HTCondor version is that supports this? Once we can confirm all our schedds are patched, we can do this and increase the number of records we can recover post-failure.
Probably worth noting that we get the remote HTCondor version as part of the schedd ad in the collector. If we have a significant number of schedds that haven't upgraded, we can parse the version to determine the remote capabilities and take the old / new code path accordingly.
Ok, digging a bit, what we want is the …

Of the 63 schedds we're currently querying, only 14 are running version 8.7. All the others are still at version 8.6. I'll see if I can come up with some code that would work for 8.7.
I don't fully trust how the code recovers history data after it has not run for a few hours (e.g. when the VM is down). For example, the VM feeding es-cms was down for a few hours after a reboot on Sunday afternoon (October 7th), and the script was only restarted on Monday afternoon. It recovered some, but not all, of the data:
https://es-cms.cern.ch/kibana/goto/14b8189cfdd5119db8dc25405fa4a9f7
Looking at the code, I suspect this:
https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/history.py#L54
where we specify a limit of 10'000 jobs per query (per schedd). Depending on which 10'000 jobs this retrieves, the `last_completion` time will be set such that older jobs are never recovered.

@bbockelm can you clarify which jobs are returned when a limit is passed to `schedd.history`? Should we increase that number?