dlt version
1.5.0
Describe the problem
When using Airflow dates for incremental loading, the resulting SQL query uses UTC dates. In Snowflake at least, comparing a timestamp without timezone to a timestamp with timezone ignores the zone completely. In the UTC zone, you get at most 1x your DAG frequency of lag, which is normal.
In our case however (CET), we get a lag equal to our DAG frequency plus whatever UTC offset we are at depending on the time of year. Ideally we would use zone-aware cursor columns, but unfortunately many vendors do not use zoned timestamps at all. It would be even worse for users on the other side of UTC, as the UTC ranges would effectively lie in the future and return nothing at all.
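To illustrate with an hourly DAG and a +02:00 offset (CEST): the UTC window is compared against naive local wall-clock values, so the freshest rows always miss the window. A minimal sketch with pendulum:

import pendulum

# The run at 12:00 local time receives this UTC data interval from Airflow:
data_interval_start = pendulum.datetime(2024, 6, 1, 9, 0, tz="UTC")
data_interval_end = pendulum.datetime(2024, 6, 1, 10, 0, tz="UTC")

# The source stores naive local timestamps, and Snowflake ignores the zone when
# comparing TIMESTAMP_NTZ to TIMESTAMP_TZ, so only the wall-clock values are compared.
row_written_at_1130_local = pendulum.naive(2024, 6, 1, 11, 30)

selected = data_interval_start.naive() <= row_written_at_1130_local < data_interval_end.naive()
print(selected)  # False: the row is only picked up once the UTC window reaches 11:00,
                 # i.e. one DAG frequency plus the 2h offset later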
We tried many things: adding timezone information to all Airflow dates and to the pyarrow backend kwargs, but it didn't help. We then modified the _join_external_scheduler function to convert the Airflow dates to our timezone, but then the rows got filtered out, because the cursor column is treated as UTC in the pyarrow backend and falls outside the zoned range.
The only thing that worked was to keep the dates in the UTC zone but apply our current UTC offset to them. There is probably a much better way to do this, but here is a crude fix:
from typing import Optional

import pendulum
from airflow.operators.python import get_current_context

from dlt.common import logger
from dlt.extract.incremental import Incremental

# The zone our naive cursor columns are actually written in (CET in our case).
TZ = "CET"


def add_zone_offset(utc_date: pendulum.DateTime, tz: str) -> pendulum.DateTime:
    # Shift a UTC datetime by the current UTC offset of `tz`, keeping it zoned as UTC.
    date = utc_date.in_tz(tz)
    return utc_date + date.utcoffset()


class NaiveIncremental(Incremental):
    def _join_external_scheduler(self) -> None:
        def _ensure_airflow_end_date(
            start_date: pendulum.DateTime, end_date: pendulum.DateTime
        ) -> Optional[pendulum.DateTime]:
            now = add_zone_offset(pendulum.now(), TZ)
            if end_date is None or end_date > now or start_date == end_date:
                return now
            return end_date

        context = get_current_context()
        # Keep the range in UTC but shifted by the current offset: the wall-clock digits
        # then match the naive cursor values both in the SQL filter and in dlt's
        # start/end out-of-range checks.
        start_date = add_zone_offset(context["data_interval_start"], TZ)
        end_date = _ensure_airflow_end_date(
            start_date, add_zone_offset(context["data_interval_end"], TZ)
        )
        self.initial_value = start_date
        if end_date is not None:
            self.end_value = end_date
        else:
            self.end_value = None
        logger.info(
            f"Found Airflow scheduler: initial value: {self.initial_value} from"
            f" data_interval_start {context['data_interval_start']}, end value:"
            f" {self.end_value} from data_interval_end {context['data_interval_end']}"
        )
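For context, such a subclass could be attached to a resource roughly like this (a sketch with illustrative names, assuming the sql_database source; allow_external_schedulers needs to be enabled so that dlt joins the Airflow data interval):

import dlt
from dlt.sources.sql_database import sql_table

orders = sql_table(
    table="orders",
    # allow_external_schedulers makes dlt pick up the Airflow data interval as the range
    incremental=NaiveIncremental("updated_at", allow_external_schedulers=True),
)

pipeline = dlt.pipeline(pipeline_name="orders", destination="snowflake")
pipeline.run(orders)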
Expected behavior
dlt should have a way to specify a zone for naive cursor columns.
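Purely for illustration, such an option could look like this (the naive_timezone parameter is hypothetical and does not exist in dlt today):

import dlt

# Hypothetical API, for illustration only: naive_timezone would tell dlt which zone
# naive cursor values are written in, so that UTC Airflow ranges and the out-of-range
# checks are interpreted consistently.
updated_at = dlt.sources.incremental(
    "updated_at",
    allow_external_schedulers=True,
    naive_timezone="CET",  # hypothetical parameter
)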
Steps to reproduce
Use Airflow ranges on a naive cursor column that is not in UTC.
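A minimal sketch of such a setup (illustrative resource and helper names):

import dlt

@dlt.resource
def events(
    updated_at=dlt.sources.incremental("updated_at", allow_external_schedulers=True)
):
    # rows carry naive CET timestamps; dlt fills last_value/end_value from the UTC
    # data_interval_start / data_interval_end of the current Airflow run
    yield from fetch_rows_since(updated_at.last_value, updated_at.end_value)  # hypothetical helper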
Operating system
Linux
Runtime environment
Google Cloud Composer
Python version
3.11
dlt data source
SQLAlchemy (Snowflake)
dlt destination
Snowflake
Other deployment details
No response
Additional information
No response
maybe you are better off switching the default Airflow timezone to "your" timezone, since you already have naive datetimes...
I'm just making sure whether we should fix dlt or maybe just improve our docs
@rudolfix Correct, the situation is as you described. Our DAGs already use dates in our timezone, and the Airflow default timezone is set accordingly, but the Airflow context is in UTC regardless, and so are the loaded arrow tables (even when passing tz in the backend kwargs).
Initially I tried simply converting the UTC Airflow ranges to our timezone. It worked as intended on the database side, but the loaded rows were then instantly discarded by dlt's start_out_of_range and end_out_of_range logic, since the naive dates are considered UTC and thus fall out of the zoned range. The current workaround was the next best thing.
Maybe some kind of timezone compensation could be integrated into the incremental logic. Converting zones would be the most intuitive, but I don't think it would behave the same across all backends (depending on how they treat naive dates), so using an offset like I currently do may be the most universal fix.
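To make this concrete, a small sketch (assuming Europe/Paris for CET/CEST, i.e. UTC+2 in summer) of why the converted range passes the SQL filter but not dlt's in-memory checks:

import pendulum

# The Airflow range converted to local time: correct wall-clock digits for the SQL filter,
# since Snowflake ignores the zone when comparing against the naive cursor column.
start = pendulum.datetime(2024, 6, 1, 11, 0, tz="Europe/Paris")  # == 09:00 UTC
end = pendulum.datetime(2024, 6, 1, 12, 0, tz="Europe/Paris")    # == 10:00 UTC

# The pyarrow backend reads the naive cursor value and treats it as UTC:
# 11:30 UTC == 13:30 local, which is past `end`, so the row is discarded
# by the end_out_of_range logic even though Snowflake returned it.
row = pendulum.datetime(2024, 6, 1, 11, 30, tz="UTC")
print(start <= row < end)  # False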