Metrics endpoint timeout #15

Open
rpeternella opened this issue Jan 29, 2020 · 8 comments

Comments

@rpeternella

I've deployed the new plugin to replace airflow-exporter on our Airflow server, but for some reason I can't make it work. I've checked the dependencies (airflow, prometheus_client) and everything is satisfied.
The only thing I'm able to see is that the gunicorn webserver processes time out at some point:

[2020-01-29 16:02:19 +0000] [32145] [INFO] Handling signal: ttou
[2020-01-29 16:02:19 +0000] [805] [INFO] Worker exiting (pid: 805)
[2020-01-29 16:02:49 +0000] [32145] [INFO] Handling signal: ttin
[2020-01-29 16:02:49 +0000] [5615] [INFO] Booting worker with pid: 5615

Also, trying to curl it from a client times out, with no additional information. Is this a known bug, and is there a workaround/fix for it? Thanks!

@abhishekray07

Could you give a bit more information about the version you are running?

@rpeternella
Author

Hi Abhishek,

I'm using the following versions:
airflow-prometheus-exporter==1.0.7
apache-airflow==1.10.5
prometheus-client==0.7.1

Also, canary_dag is created and running properly.

@abhishekray07

I have not been able to reproduce this. Are there any logs or stacktraces when you try to access the metrics endpoint?

@rpeternella
Author

Hi Abhishek,

After some more digging, I was able to find the issue. It turns out the timeout is due to the webserver waiting too long for a response from the database.

The code fails in this part:

def get_task_state_info():
    """Number of task instances with particular state."""
    with session_scope(Session) as session:
        task_status_query = (
            session.query(
                TaskInstance.dag_id,
                TaskInstance.task_id,
                TaskInstance.state,
                func.count(TaskInstance.dag_id).label("value"),
            )
            .group_by(
                TaskInstance.dag_id,
                TaskInstance.task_id,
                TaskInstance.state,
            )
            .subquery()
        )

Basically, due to the lack of filtering on the Airflow task_instance table, the query was pulling too much data and the webserver could not handle it (we have been running this Airflow instance for 4+ years, with hundreds of DAGs). I've manually changed the code on my side to pull only the last 14 days of data:

.filter(text("execution_date > NOW() - interval '14 days'"),)

It's not a good solution at all, but maybe it would make sense to add a parameter that makes the lookback window configurable; a rough sketch of what that could look like is below.

For now I'll just deploy the changed version on my side, since it fixed our issue.
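
For reference, a minimal sketch of how such a parameter could look. The environment variable name is my own invention (the plugin has no such setting today), and session_scope/Session are the same helpers the plugin already uses around its queries:

import os

from airflow.models import TaskInstance
from sqlalchemy import func, text

# Hypothetical knob; the released plugin does not expose this.
LOOKBACK_DAYS = int(os.environ.get("AIRFLOW_EXPORTER_LOOKBACK_DAYS", "14"))


def get_task_state_info():
    """Number of task instances per (dag_id, task_id, state), bounded by a lookback window."""
    with session_scope(Session) as session:
        return (
            session.query(
                TaskInstance.dag_id,
                TaskInstance.task_id,
                TaskInstance.state,
                func.count(TaskInstance.dag_id).label("value"),
            )
            # Bound the scan so years of old task_instance rows don't blow up the query.
            .filter(text("execution_date > NOW() - interval '%d days'" % LOOKBACK_DAYS))
            .group_by(TaskInstance.dag_id, TaskInstance.task_id, TaskInstance.state)
            .all()
        )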

@deshraj

deshraj commented Mar 26, 2020

Running into the same issue as mentioned above. It would be great if this could be resolved soon. Thanks!

@popovpa

popovpa commented Nov 12, 2020

Same issue here:

[2020-11-12 16:09:58,644] {{security.py:328}} INFO - Cleaning faulty perms
[2020-11-12 16:09:59 +0000] [27] [INFO] Handling signal: ttou
[2020-11-12 16:09:59 +0000] [1550] [INFO] Worker exiting (pid: 1550)
[2020-11-12 16:10:29 +0000] [27] [INFO] Handling signal: ttin
[2020-11-12 16:10:29 +0000] [2836] [INFO] Booting worker with pid: 2836
[2020-11-12 16:10:30,923] {{manager.py:545}} WARNING - Refused to delete permission view, assoc with role exists RoleModelView.Copy Role Admin
[2020-11-12 16:10:31,922] {{init.py:51}} INFO - Using executor CeleryExecutor
[2020-11-12 16:10:32,177] {{manager.py:545}} WARNING - Refused to delete permission view, assoc with role exists Airflow.can_refresh_all Admin
[2020-11-12 16:10:33,649] {{security.py:475}} INFO - Start syncing user roles.
[2020-11-12 16:10:37,856] {{security.py:385}} INFO - Fetching a set of all permission, view_menu from FAB meta-table
[2020-11-12 16:10:42,497] {{security.py:328}} INFO - Cleaning faulty perms
[2020-11-12 16:10:42 +0000] [27] [INFO] Handling signal: ttou
[2020-11-12 16:10:42 +0000] [1551] [INFO] Worker exiting (pid: 1551)

@rpeternella
Author

To help those still suffering from this: we deployed a much more stable solution using https://github.com/wrouesnel/postgres_exporter alongside our Airflow instance. Since we run Airflow on Postgres anyway, we can get all the metrics via SQL queries defined in a config.yml.

Scraping takes at most 30s for everything, so it has been much more stable in our case.
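
For reference, a rough sketch of what one of those custom queries looks like in postgres_exporter's queries file (the metric name and the 14-day window are my own choices; the columns come from Airflow's task_instance table):

# queries.yaml, passed to postgres_exporter via --extend.query-path
airflow_task_status:
  query: >
    SELECT dag_id, task_id, state, COUNT(*) AS value
    FROM task_instance
    WHERE execution_date > NOW() - interval '14 days'
    GROUP BY dag_id, task_id, state
  metrics:
    - dag_id:
        usage: "LABEL"
        description: "DAG id"
    - task_id:
        usage: "LABEL"
        description: "Task id"
    - state:
        usage: "LABEL"
        description: "Task instance state"
    - value:
        usage: "GAUGE"
        description: "Number of task instances per dag/task/state in the last 14 days"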

@jonathonbattista

Any updates here? The exporter is basically worthless once the SQL DB reaches a certain size. Queries take too long and the container uses too much memory.
