large delay in file injection #9543
Comments
@vlimant Thanks for reporting it. I am looking at the issue now. May come back with further questions later, though.
I still do not have the reason why it happens, but I would like to add some more info regarding the consequences. The workflow in question is an ACDC, and the original request is this one: Checking the number of events in the request against the numbers (both events and files) in the output datasets, there is some difference:
Obviously DBS and PhEDEx agree on the number of files, while CouchDB shows some discrepancy. Then querying CouchDB about the ACDC clearly shows that the ACDC covered the event deficit and that the missing file is there:
Obviously, due to the delayed injection of the files (after the workflow has been completed), CouchDB is not updating the information about the output. But at least the requested number of events is present.
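(As a reference for reproducing the comparison above, the following is a minimal sketch assuming the cmsweb DBSReader and PhEDEx data-service endpoints; the dataset name and certificate paths are placeholders, and the endpoint paths and JSON field names are written from memory, so they may need adjusting.)

```python
# Minimal sketch: compare file/event counts between DBS and PhEDEx for one
# dataset. DATASET and the certificate paths are placeholders; endpoint
# paths and JSON field names are assumptions and may need adjusting.
import requests

DATASET = "/Primary/ProcessedEra-v1/TIER"   # placeholder
CERT = ("usercert.pem", "userkey.pem")      # grid certificate, assumed

# DBS reader: block summaries carry file and event counts
dbs = requests.get(
    "https://cmsweb.cern.ch/dbs/prod/global/DBSReader/blocksummaries",
    params={"dataset": DATASET}, cert=CERT, verify=False).json()
dbs_files = sum(int(row["num_file"]) for row in dbs)
dbs_events = sum(int(row["num_event"]) for row in dbs)

# PhEDEx data service: blockreplicas lists the blocks/files PhEDEx knows of
phedex = requests.get(
    "https://cmsweb.cern.ch/phedex/datasvc/json/prod/blockreplicas",
    params={"dataset": DATASET}, cert=CERT, verify=False).json()
phedex_files = sum(int(blk["files"]) for blk in phedex["phedex"]["block"])

print("DBS:    %d files, %d events" % (dbs_files, dbs_events))
print("PhEDEx: %d files" % phedex_files)
```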
I suggest we follow this up in this original issue: which is related to the asynchronous completion of workflows, where the workflow goes to completed while files are still left to be injected into DBS/PhEDEx. Closing this issue as a duplicate. Todor, from your report, files are fully available in PhEDEx and DBS. I think Sharad tagged us on a couple of JIRA tickets in the last few days, reporting what Jean-Roch reported privately to me on Slack. When you have a chance, could you please have a look at one of those tickets and check the DBS3Upload component logs, just to make sure DBS is working properly and injections are happening without much delay? Please report on the other thread and/or the JIRA issues.
@amaltaro I think this issue has value on its own, as we are not talking about the hard development work of synchronizing completed status with all files being injected (#8148), but rather the fact that DBSUploader is lagging more than 2 days behind in injecting all files. I'll let you guys reopen it to address the actual problem here, which has been impacting production over the last several weeks. 2 days is already a very long grace period IMO, given that it's only about making POSTs to DBS and PhEDEx.
The 2-day grace period is defined here: https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/checkor.py#L1051
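(For context, the grace-period check referenced above boils down to tolerating a DBS/PhEDEx mismatch for a fixed window after workflow completion; the snippet below is an illustration only, not the actual checkor.py logic, and all names in it are made up.)

```python
# Illustration only, NOT the actual Unified/checkor.py code: the grace period
# amounts to tolerating a DBS/PhEDEx file-count mismatch for a fixed window
# after the workflow reached "completed", and flagging it only afterwards.
import time

GRACE_PERIOD = 2 * 24 * 3600   # two days, in seconds

def injection_overdue(completed_at, dbs_files, phedex_files, now=None):
    """Return True if DBS is still missing files after the grace period."""
    now = now if now is not None else time.time()
    if dbs_files >= phedex_files:
        return False                          # DBS has caught up
    return (now - completed_at) > GRACE_PERIOD
```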
https://its.cern.ch/jira/browse/CMSCOMPPR-11418 has a > 1 week delay in injection, if you want to look at something urgent @todor-ivanov
Looking at this exact one now.
In the DBSUploader component log I find warnings like this one:
This block in particular was retried a few lines below, but for many other blocks I do not see a retry to inject them:
The aforementioned dataset still has some discrepancy between the blocks injected into PhEDEx and those injected into DBS, but the discrepancy is for some other blocks, not for this one:
The situation is similar for all the blocks I find with a proxy WARNING in the logs. In the code here [1] I find the place where the proxy error is caught and turned into a WARNING, and the relevant block is added to a checklist. I am now searching for the place where this checklist is used, to see whether the injections for the blocks in this list are supposed to be retried.
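(To make the pattern being described concrete: the sketch below shows the general "warn and add to a checklist for retry" shape; it is not the actual DBS3Upload/DBSUploadPoller code and every name in it is hypothetical.)

```python
# Hypothetical sketch of the "warn and put the block on a checklist" pattern
# described above; this is NOT the real DBS3Upload/DBSUploadPoller code and
# all names are made up.
import logging

logger = logging.getLogger("DBSUploadSketch")

def upload_blocks(blocks, inject):
    """Try to inject each block; return the ones left for the next cycle."""
    retry_checklist = []
    for block in blocks:
        try:
            inject(block)                 # e.g. a bulkblocks POST to DBS
        except Exception as exc:          # proxy error, 502, timeout, ...
            logger.warning("Injection of %s failed (%s); queued for retry",
                           block, exc)
            retry_checklist.append(block)
    return retry_checklist
```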
@todor-ivanov
The block was already in DBS. What was the problem? Here is the DBS monitoring page, and it is in a healthy status: Here is the DBS prod writer monitoring, and it looked fine except for some high load this morning, FNAL time.
Hi @yuyiguo, thanks for taking a look. So I took a bad example before. But here is another block which is missing in DBS:
See the comparison between DBS and PhEDEx regarding this dataset:
Please find a list of proxy errors from the agents' logs attached below. Plenty of them are retried, I can see that now. But for some blocks we are actually seeing quite significant delays until they make it to DBS.
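(For completeness, the per-block DBS-vs-PhEDEx comparison mentioned above can be scripted roughly as follows; the dataset name, certificate paths, endpoint paths and field names are assumptions.)

```python
# Sketch of the per-block comparison: blocks that PhEDEx already has but DBS
# does not. DATASET, certificate paths, endpoint paths and field names are
# assumptions.
import requests

DATASET = "/Primary/ProcessedEra-v1/TIER"   # placeholder
CERT = ("usercert.pem", "userkey.pem")      # grid certificate, assumed

dbs = requests.get(
    "https://cmsweb.cern.ch/dbs/prod/global/DBSReader/blocks",
    params={"dataset": DATASET}, cert=CERT, verify=False).json()
dbs_blocks = {row["block_name"] for row in dbs}

phedex = requests.get(
    "https://cmsweb.cern.ch/phedex/datasvc/json/prod/blockreplicas",
    params={"dataset": DATASET}, cert=CERT, verify=False).json()
phedex_blocks = {blk["name"] for blk in phedex["phedex"]["block"]}

for name in sorted(phedex_blocks - dbs_blocks):
    print("missing in DBS:", name)
```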
@todor-ivanov
Thank you @yuyiguo for checking it. If the situation on the DBS side is calm, then I am in a very hard situation. In case you are still willing to dig further in DBS, you may find all the information I can provide in the file attached to my previous message. The timestamps are there (please notice that they are well spread over the last two months, only because that is how many logs we have). The error code is 502 again, the same as in the problem with the lost dataset parentage information - this usually comes from the frontend indeed, but only because it simply cannot forward the request to the backend (either due to timeouts from the backend or some other communication issue). Machine names are also there. Block ids that have been tried are also there - most of them succeeded after N retries, but some of them never make it to DBS. I am currently trying to understand that exact part - whether the retries for some of the blocks really stop at some point. But this part of the code has not been touched lately, so I am skeptical.
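(If a hard retry cap existed somewhere, it would look roughly like the hypothetical sketch below and would explain blocks that silently stop being retried; this is an illustration only, not WMCore code.)

```python
# Purely illustrative, not WMCore code: a capped retry loop like this would
# produce exactly the behaviour in question, i.e. a block hitting repeated
# 502s stops being retried and never reaches DBS.
import time

MAX_RETRIES = 5          # assumed cap, for illustration only
BACKOFF_SECONDS = 60

def inject_with_retries(block, inject):
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            inject(block)
            return True
        except Exception as exc:          # e.g. HTTP 502 from the frontend
            print("attempt %d for %s failed: %s" % (attempt, block, exc))
            time.sleep(BACKOFF_SECONDS * attempt)
    return False                          # gave up: block never reaches DBS
```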
@todor-ivanov
@yuyiguo the times are all in CERN/Geneva time.
Sorry @yuyiguo, I need to correct myself here:
The correct statement should be:
@todor-ivanov
I only got eight days of log files available to me. They are from Feb 14 to 21. Here is what I found in one of the log files. The message matched node vocms0250/188.184.86.128 and the time is close to 2020-02-17 15:45.
`INFO:cherrypy.access:[17/Feb/2020:12:26:11] vocms0163.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 400 Bad Request [data: 2613442 in 255 out 133515382 us ] [auth: OK "/DC=c`
`vocms0766/dbs/DBSGlobalWriter-20200216-vocms0766.log:INFO:cherrypy.access:[16/Feb/2020:17:28:03] vocms0766.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 400 Bad Request [data: 2611039 in 255 out 360011 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]`
The above requests were all from CN wmagent/vocms0280 and they were not big blocks. They all failed with "400 Bad Request". How was the data generated? How does WMAgent call the bulkblocks API? This is a very heavy API that inserts everything regarding a block into DBS; this includes files, lumi sections, parentage and so on. In one second (17/Feb/2020 12:13:08), on only one of the DBS servers, WMAgent sent SIX blocks to insert into DBS. The agents may be competing for resources among themselves. I would suggest you stretch out the uploading a bit. This may actually speed things up, because each block finishes in one upload instead of needing retries.
Let me know if you need more info from me. So far, I see DBS has been doing well. The errors were all "400 Bad Request". You may want to check the input data.
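(A minimal sketch of the "stretch the uploading" suggestion above: push blocks one at a time with a pause in between, rather than several bulkblocks calls in the same second. It assumes the DBS3 client is available; the pause value is arbitrary and the block dumps are whatever the agent has already prepared.)

```python
# Sketch of the "stretch the uploading" suggestion: push blocks one at a time
# with a pause in between instead of firing several bulkblocks calls in the
# same second. Assumes the DBS3 client is installed; the pause is arbitrary.
import time
from dbs.apis.dbsClient import DbsApi

dbs_writer = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSWriter")

def paced_upload(block_dumps, pause=1.0):
    """Insert prepared block dumps sequentially, spreading the load on DBS."""
    for dump in block_dumps:
        dbs_writer.insertBulkBlock(dump)   # the heavy bulkblocks call
        time.sleep(pause)                  # give the server room to breathe
```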
Hi @yuyiguo, thanks a lot. What you provide here is quite exhaustive and helpful. Just to confirm, when you say:
Do you mean that the situation is similar to the one with this heavy API here [1] too? This indeed may explain why we have the injection retries, and the more retries we have, the slower we get. Of course we may expect to reach some saturation point, but such a situation could explain what P&R sees as files being present in PhEDEx but delayed in DBS. [1]
Impact of the bug
Workflows are ending up in completed status with missing files/statistics. I talked about this with @amaltaro, who apparently did not have time to put it on anyone's radar before leaving.
Describe the bug
Workflows are getting to "completed" while files are still being injected.
How to reproduce it
https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC0_task_EXO-RunIIFall17wmLHEGS-02448__v1_T_200211_194626_1071
https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_EXO-RunIIFall17wmLHEGS-02448
and many others in "filemismatch" under https://cms-unified.web.cern.ch/cms-unified//assistance.html are likely affected
Expected behavior
#8148 for example