
[Bug] Dataset Events not publishing when AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS enabled #1363

Open
nishant-gupta-sh opened this issue Dec 4, 2024 · 5 comments
Labels
area:config Related to configuration, like YAML files, environment variables, or executor configuration
area:datasets Related to the Airflow datasets feature/module
bug Something isn't working
stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed
triage-needed Items need to be reviewed / assigned to milestone

Comments

nishant-gupta-sh commented Dec 4, 2024

Astronomer Cosmos Version

12.1.1

dbt-core version

1.8.7

Versions of dbt adapters

No response

LoadMode

AUTOMATIC

ExecutionMode

AZURE_CONTAINER_INSTANCE

InvocationMode

None

airflow version

2.10.2

Operating System

Astronomer Deployed

If you think it's a UI issue, what browsers are you seeing the problem on?

No response

Deployment

Astronomer

Deployment details

No response

What happened?

When AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS is set to True, any task of ours that previously produced dataset events and was not in a task group stops producing them. I suspect something is wrong with the following function:

def get_dataset_alias_name(dag: DAG | None, task_group: TaskGroup | None, task_id: str) -> str:

Relevant log output

How to reproduce

When AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS is set to true and we use the DbtRunLocalOperator or DbtSnapshotOperator, dataset events are no longer published. Dataset events are still published, however, when using DbtTaskGroup.

All operators published dataset events successfully when AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS was set to False.
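For illustration, a minimal sketch of the standalone-operator case described above; the project path, profile, and connection ID are placeholders, not taken from the report:

```python
# Hypothetical reproduction sketch: a DbtRunLocalOperator used outside any TaskGroup,
# which per the report stops emitting dataset events once
# AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS=True is set in the deployment environment.
# The project path, profile, and connection ID below are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from cosmos import ProfileConfig
from cosmos.operators.local import DbtRunLocalOperator
from cosmos.profiles import PostgresUserPasswordProfileMapping

profile_config = ProfileConfig(
    profile_name="example_profile",
    target_name="dev",
    profile_mapping=PostgresUserPasswordProfileMapping(
        conn_id="example_postgres_conn",  # placeholder Airflow connection ID
        profile_args={"schema": "public"},
    ),
)

with DAG(dag_id="standalone_dbt_run", start_date=datetime(2024, 12, 1), schedule=None):
    # Not wrapped in a TaskGroup: this is the case reported to stop publishing dataset events.
    dbt_run = DbtRunLocalOperator(
        task_id="dbt_run",
        project_dir="/usr/local/airflow/dbt/example_project",  # placeholder path
        profile_config=profile_config,
    )
```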

Anything else :)?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Contact Details

[email protected]

nishant-gupta-sh added the bug (Something isn't working) and triage-needed (Items need to be reviewed / assigned to milestone) labels on Dec 4, 2024

dosubot bot commented Dec 4, 2024

Hey, @nishant-gupta-sh! 👋 I'm here to help you with your bug. Let's squash it together!

The issue you're experiencing seems to be related to how dataset aliases are handled when AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS is set to True. Specifically, tasks not in a TaskGroup stop producing dataset events, while those in a TaskGroup continue to do so. This suggests a potential problem in the logic for managing dataset aliases for tasks outside of TaskGroups.

The dataset.py file determines whether a task is part of a TaskGroup by checking if the task_group parameter is not None. If task_group is provided, it uses task_group.dag_id and task_group.group_id to construct the dataset alias name. If task_group is None, it defaults to using the dag parameter to determine the DAG ID [1].

The AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS configuration controls the use of DatasetAlias in Cosmos. By default, it is set to True, enabling the use of DatasetAlias to support dynamic dataset creation during task execution. However, if you encounter issues like sqlalchemy.orm.exc.FlushError errors, you can set this configuration to False to avoid these errors [2].

Given this, the problem might be in how the dataset alias is constructed or retrieved for tasks not in a TaskGroup when the alias feature is enabled. You might want to review the logic in the dataset.py file related to alias handling for tasks outside of TaskGroups.
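For reference, a sketch of the branching described above; this is a reconstruction for illustration, not a copy of the actual Cosmos source, and the separator used to join the identifiers is an assumption:

```python
# Illustrative reconstruction of the described behaviour (not the actual Cosmos source):
# when a task_group is supplied, its dag_id and group_id feed the alias name;
# otherwise the DAG's dag_id is used. The "__" separator is an assumption.
from __future__ import annotations

from airflow.models.dag import DAG
from airflow.utils.task_group import TaskGroup


def get_dataset_alias_name(dag: DAG | None, task_group: TaskGroup | None, task_id: str) -> str:
    dag_id = None
    task_group_id = None

    if task_group:
        dag_id = task_group.dag_id
        task_group_id = task_group.group_id
    elif dag:
        dag_id = dag.dag_id

    # Join whichever identifiers are available, ending with the task_id.
    identifiers = [part for part in (dag_id, task_group_id, task_id) if part is not None]
    return "__".join(identifiers)
```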

To continue talking to Dosu, mention @dosu.

dosubot added the area:config (Related to configuration, like YAML files, environment variables, or executor configuration) and area:datasets (Related to the Airflow datasets feature/module) labels on Dec 4, 2024
tatiana (Collaborator) commented Dec 11, 2024

Hi @nishant-gupta-sh, which version of Cosmos are you using? You mentioned 12.1.1 in the ticket description, but the last release of Cosmos was 1.7.1.

Did you recently change from executing Cosmos using ExecutionMode.LOCAL to ExecutionMode.AZURE_CONTAINER_INSTANCE? That could explain the problem. Currently, Cosmos only supports emitting datasets when using ExecutionMode.LOCAL and ExecutionMode.VIRTUALENV:
https://github.com/astronomer/astronomer-cosmos/blob/main/cosmos/operators/local.py

I'll update our docs to make this more evident: https://astronomer.github.io/astronomer-cosmos/configuration/scheduling.html.
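For reference, the execution mode is normally selected through ExecutionConfig; a minimal sketch, assuming placeholder dag_id, project path, and profiles.yml path:

```python
# Sketch of selecting the execution mode in Cosmos. Per the comment above, only
# ExecutionMode.LOCAL and ExecutionMode.VIRTUALENV currently emit datasets.
# The dag_id, project path, and profiles.yml path are placeholders.
from datetime import datetime

from cosmos import DbtDag, ExecutionConfig, ProfileConfig, ProjectConfig
from cosmos.constants import ExecutionMode

example_dag = DbtDag(
    dag_id="example_dbt_dag",
    start_date=datetime(2024, 12, 1),
    schedule=None,
    project_config=ProjectConfig("/usr/local/airflow/dbt/example_project"),
    profile_config=ProfileConfig(
        profile_name="example_profile",
        target_name="dev",
        profiles_yml_filepath="/usr/local/airflow/dbt/example_project/profiles.yml",
    ),
    # Dataset emission is tied to the local/virtualenv execution modes.
    execution_config=ExecutionConfig(execution_mode=ExecutionMode.LOCAL),
)
```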

nishant-gupta-sh (Author) commented Dec 11, 2024

Hi Tatiana, apologies, we're using 1.7.1 for Cosmos and the CeleryExecutor for the ExecutionMode.
Astronomer manages our Airflow deployment, and the only thing that affects whether the dataset events are emitted is the AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS parameter.

tatiana (Collaborator) commented Dec 17, 2024

Hi @nishant-gupta-sh, thanks for your reply and for clarifying the Cosmos version.

In the ticket description, you mentioned using Cosmos AZURE_CONTAINER_INSTANCE. Are you using Cosmos ExecutionMode.AZURE_CONTAINER_INSTANCE? The other part of your description mentions DbtRunLocalOperator or DbtSnapshotOperator, which gives the impression the issue is happening when using ExecutionMode.LOCAL.

Why do you believe the issue may be in get_dataset_alias_name? Were there any error messages in the scheduler?

This is how the function get_dataset_alias_name is being invoked by the local operators:

DatasetAlias(name=get_dataset_alias_name(dag_id, task_group_id, task_id))

Previously, we had validated emitting datasets with Dataset Alias using DbtDag and DbtTaskGroup. However, we did not test it with an individual operator. My understanding is that the issue happens when using AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS=True and a standalone operator, such as DbtRunLocalOperator.

Please, can you share a small example DAG illustrating the problem you're facing? Something along the lines of the examples we have, so that we can reproduce the problem:
https://github.com/astronomer/astronomer-cosmos/tree/main/dev/dags
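For contrast with the standalone-operator sketch earlier in this thread, a sketch of the DbtTaskGroup variant that reportedly still publishes dataset events; the names and paths are placeholders:

```python
# Sketch of the DbtTaskGroup case, which reportedly keeps publishing dataset events
# with AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS=True. Names and paths are placeholders.
from datetime import datetime

from airflow import DAG
from cosmos import DbtTaskGroup, ProfileConfig, ProjectConfig

profile_config = ProfileConfig(
    profile_name="example_profile",
    target_name="dev",
    profiles_yml_filepath="/usr/local/airflow/dbt/example_project/profiles.yml",  # placeholder
)

with DAG(dag_id="dbt_task_group_case", start_date=datetime(2024, 12, 1), schedule=None):
    dbt_models = DbtTaskGroup(
        group_id="dbt_models",
        project_config=ProjectConfig("/usr/local/airflow/dbt/example_project"),  # placeholder
        profile_config=profile_config,
    )
```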


This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale (Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed) label on Jan 17, 2025