Metrics endpoint timeout #15

Open
rpeternella opened this issue Jan 29, 2020 · 8 comments

Comments

@rpeternella

I've deployed the new plugin to replace airflow-exporter on our Airflow server, but for some reason I can't make it work. I've checked the dependencies (airflow, prometheus_client) and everything is satisfied.
The only thing I'm able to see is that the gunicorn webserver processes time out at some point:

[2020-01-29 16:02:19 +0000] [32145] [INFO] Handling signal: ttou
[2020-01-29 16:02:19 +0000] [805] [INFO] Worker exiting (pid: 805)
[2020-01-29 16:02:49 +0000] [32145] [INFO] Handling signal: ttin
[2020-01-29 16:02:49 +0000] [5615] [INFO] Booting worker with pid: 5615

Also, trying to curl it from a client times out, with no additional information. Is this a known bug, and is there a workaround/fix for it? Thanks!

@abhishekray07

Could you give a bit more information about the version you are running?

@rpeternella
Author

Hi Abhishek,

I'm using the following versions:
airflow-prometheus-exporter==1.0.7
apache-airflow==1.10.5
prometheus-client==0.7.1

Also, canary_dag is created and running properly.

@abhishekray07

I have not been able to reproduce this. Are there any logs or stacktraces when you try to access the metrics endpoint?

@rpeternella
Author

Hi Abhishek,

After some more digging, I was able to find the issue. It turns out the timeout is due to the webserver waiting too long for a response from the database.

The code fails in this part:

def get_task_state_info():
    """Number of task instances with particular state."""
    with session_scope(Session) as session:
        task_status_query = (
            session.query(
                TaskInstance.dag_id,
                TaskInstance.task_id,
                TaskInstance.state,
                func.count(TaskInstance.dag_id).label("value"),
            )
            .group_by(
                TaskInstance.dag_id,
                TaskInstance.task_id,
                TaskInstance.state,
            )
            .subquery()
        )

Basically, due to the lack of filtering on the Airflow task_instance table, the query was pulling too much data and the webserver could not handle it (we have been running this Airflow instance for 4+ years, with hundreds of DAGs). I've manually changed the code on my side to pull only the last 14 days of data:

.filter(text("execution_date > NOW() - interval '14 days'"),)

It's not a good solution at all, but maybe it would make sense to add a parameter that makes the lookback window configurable; a rough sketch of what that could look like is below.

For now I'll just deploy the changed version on my side, since it fixed our issue.
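
For reference, a minimal sketch of how such a parameter could look. The environment variable name is my own invention (the plugin has no such setting today), and session_scope/Session are the same helpers the plugin already uses around its queries:

import os

from airflow.models import TaskInstance
from sqlalchemy import func, text

# Hypothetical knob; the released plugin does not expose this.
LOOKBACK_DAYS = int(os.environ.get("AIRFLOW_EXPORTER_LOOKBACK_DAYS", "14"))


def get_task_state_info():
    """Number of task instances per (dag_id, task_id, state), bounded by a lookback window."""
    with session_scope(Session) as session:
        return (
            session.query(
                TaskInstance.dag_id,
                TaskInstance.task_id,
                TaskInstance.state,
                func.count(TaskInstance.dag_id).label("value"),
            )
            # Bound the scan so years of old task_instance rows don't blow up the query.
            .filter(text("execution_date > NOW() - interval '%d days'" % LOOKBACK_DAYS))
            .group_by(TaskInstance.dag_id, TaskInstance.task_id, TaskInstance.state)
            .all()
        )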

@deshraj

deshraj commented Mar 26, 2020

Running into the same issue as mentioned above. It would be great if this could be resolved soon. Thanks!

@popovpa

popovpa commented Nov 12, 2020

Same issue here:

[2020-11-12 16:09:58,644] {{security.py:328}} INFO - Cleaning faulty perms
[2020-11-12 16:09:59 +0000] [27] [INFO] Handling signal: ttou
[2020-11-12 16:09:59 +0000] [1550] [INFO] Worker exiting (pid: 1550)
[2020-11-12 16:10:29 +0000] [27] [INFO] Handling signal: ttin
[2020-11-12 16:10:29 +0000] [2836] [INFO] Booting worker with pid: 2836
[2020-11-12 16:10:30,923] {{manager.py:545}} WARNING - Refused to delete permission view, assoc with role exists RoleModelView.Copy Role Admin
[2020-11-12 16:10:31,922] {{init.py:51}} INFO - Using executor CeleryExecutor
[2020-11-12 16:10:32,177] {{manager.py:545}} WARNING - Refused to delete permission view, assoc with role exists Airflow.can_refresh_all Admin
[2020-11-12 16:10:33,649] {{security.py:475}} INFO - Start syncing user roles.
[2020-11-12 16:10:37,856] {{security.py:385}} INFO - Fetching a set of all permission, view_menu from FAB meta-table
[2020-11-12 16:10:42,497] {{security.py:328}} INFO - Cleaning faulty perms
[2020-11-12 16:10:42 +0000] [27] [INFO] Handling signal: ttou
[2020-11-12 16:10:42 +0000] [1551] [INFO] Worker exiting (pid: 1551)

@rpeternella
Author

To help those still suffering from this: we deployed a much more stable solution using https://github.com/wrouesnel/postgres_exporter alongside our Airflow instance. Since we run Airflow on Postgres anyway, we can get all the metrics via SQL queries defined in a config.yml.

Scraping takes at most 30s for everything, so it has been much more stable in our case.
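
For reference, a rough sketch of what one of those custom queries looks like in postgres_exporter's queries file (the metric name and the 14-day window are my own choices; the columns come from Airflow's task_instance table):

# queries.yaml, passed to postgres_exporter via --extend.query-path
airflow_task_status:
  query: >
    SELECT dag_id, task_id, state, COUNT(*) AS value
    FROM task_instance
    WHERE execution_date > NOW() - interval '14 days'
    GROUP BY dag_id, task_id, state
  metrics:
    - dag_id:
        usage: "LABEL"
        description: "DAG id"
    - task_id:
        usage: "LABEL"
        description: "Task id"
    - state:
        usage: "LABEL"
        description: "Task instance state"
    - value:
        usage: "GAUGE"
        description: "Number of task instances per dag/task/state in the last 14 days"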

@jonathonbattista

Any updates here? The exporter is basically worthless once the SQL DB reaches a certain size. Queries take too long and the container uses too much memory.
