Investigate and fix source of errors in WMArchive #359
Comments
@vkuznet |
Yuyi, you can find the relevant information here: https://monit-grafana.cern.ch/d/u_qOeVqZk/wmarchive-monit?orgId=11 and https://monit-grafana.cern.ch/d/wma-service/wmarchive-service?orgId=11 The first one contains the latency plot.
Valentin, which plots show the WMArchive-to-AMQ connection duration or disconnection rate?
Yuyi, I pointed to the existing dashboard, but it does not show the duration of the AMQ connection; someone should add this to the code. That said, it is trivial to see from the WMArchive logs (vocms750:/cephfs/product/wma-logs/):
So, a connection did not last more than a minute: the logs show an entry every time WMArchive sends data, and the timestamps show that we usually have a few log entries within a minute.
FWIW, I can confirm that the problem is still present. I still see an abnormal number of warnings coming from the |
I updated the WMArchive configuration to use a 1-minute threshold for recv/send timeouts on the production and testbed clusters (FYI: @arooshap, @muhammadimranfarooqi). Apart from that as I explained earlier is no longer allocated to development on services outside of WM area and further development efforts should be addressed via @klannon
Our CMSWEB operator has reported on the Mattermost channel an increased number of errors in the WMArchive data flow. They may be related to reports from CERN IT (Lionel), who requested a change to the heartbeat configuration. I post Lionel's suggestion and my response below:
# from Lionel
# VK response
# Lionel feedback
So, we should work out the optimal heartbeat rate settings based on the provided STOMP documentation.
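As a starting point for that check, here is a minimal sketch of how the STOMP 1.1/1.2 specification negotiates heartbeats from the `heart-beat` headers of the CONNECT and CONNECTED frames. The function names (`parse_heartbeat`, `negotiate_heartbeat`) and the sample values are hypothetical and are not taken from the WMArchive code or its configuration; the sketch only illustrates the rule in the spec, so actual values still need to be agreed with CERN IT.

```python
from typing import Tuple


def parse_heartbeat(header: str) -> Tuple[int, int]:
    """Parse a STOMP 'heart-beat' header value such as '10000,30000'.

    The first number is the smallest interval (ms) at which the sender can
    emit heartbeats (0 = cannot send); the second is the interval (ms) at
    which it would like to receive them (0 = does not want any).
    """
    first, second = header.split(",")
    can_send, wants_recv = int(first), int(second)
    if can_send < 0 or wants_recv < 0:
        raise ValueError("heart-beat values must be non-negative")
    return can_send, wants_recv


def negotiate_heartbeat(client_header: str, server_header: str) -> Tuple[int, int]:
    """Return (client_to_server_ms, server_to_client_ms); 0 means disabled."""
    cx, cy = parse_heartbeat(client_header)  # from the CONNECT frame
    sx, sy = parse_heartbeat(server_header)  # from the CONNECTED frame
    # Per the spec: a direction is disabled if either side puts 0 in the
    # relevant slot; otherwise the interval is the larger of the two values.
    client_to_server = 0 if cx == 0 or sy == 0 else max(cx, sy)
    server_to_client = 0 if sx == 0 or cy == 0 else max(sx, cy)
    return client_to_server, server_to_client


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(negotiate_heartbeat("10000,30000", "5000,20000"))
```

Note that the negotiated interval is the larger of the two requested values, so lowering the client setting alone has no effect if the broker asks for a longer interval. The spec also recommends that receivers allow an error margin on top of the negotiated interval before treating the connection as dead.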