All notable changes to this project will be documented in this file.
This project adheres to Semantic Versioning.
- pathway.xpacks.llm.splitter.TokenCountSplitter.
- Introducing new methods for strict conversion of `pw.Json` to desired types within a UDF body:
  - `as_int()`
  - `as_float()`
  - `as_str()`
  - `as_bool()`
  - `as_list()`
  - `as_dict()`
- Added `table.col.dt.utc_from_timestamp` method: creates `DateTimeUtc` from timestamps represented as `int`s or `float`s.
- Enhanced the `table.col.dt.timestamp` method with a new `unit` argument to specify the unit of the returned timestamp.
- Introduced an experimental xpack with a Microsoft SharePoint input connector.
- Index operator (`[]`) can now be directly applied to `pw.Json` within UDFs to access elements of JSON objects, arrays, and strings.
- Enhanced the `table.col.dt.from_timestamp` method to create `DateTimeNaive` from timestamps represented as `int`s or `float`s.
- Deprecated not specifying the `unit` argument of the `table.col.dt.timestamp` method.
- `KNNIndex` now supports returning computed distances.
- Added support for cosine similarity in `KNNIndex`.
- The `offset` argument of `pw.stdlib.temporal.sliding` and `pw.stdlib.temporal.tumbling` is deprecated. Use `origin` instead, as it represents a point in time, not a duration.
- Sliding window now works correctly with UTC Datetimes.
- Temporal column in `asof_join` no longer has to be named `t`.
- `asof_join` includes rows with equal times for all values of the `direction` parameter.
- Fixed an issue with `pw.io.gdrive.read`: shared folders support is now working seamlessly.
- Added Table.split() method for splitting table based on an expression into two tables.
- Columns with datatype duration can now be multiplied and divided by floats.
- Columns with datatype duration now support both true and floor division (`/` and `//`) by integers.
- Pathway is better at typing if_else expressions when optional types are involved.
- `table.flatten()` operator now supports Json array.
- Buffers (used to delay outputs, configured via delay in `common_behavior`) now flush the data when the computation is finished. The effect of this change can be seen when run in bounded (batch / multi-revision) mode.
- `pw.io.subscribe()` takes an additional argument `on_time_end`: the callback function to be called on each closed time of computation.
- `pw.io.subscribe()` is now a single-worker operator, guaranteeing that `on_end` is triggered at most once.
- `KNNIndex` now supports metadata filtering. Each query can specify its own filter in the JMESPath format.
- Resolved an optimization bug causing `pw.iterate` to malfunction when handling columns effectively pointing to the same data.
- Pathway now keeps better track of the `array` column type: it is able to keep track of Array dtype and number of dimensions, wherever applicable.
- Fixed issues with standalone panel+Bokeh dashboards to ensure optimal functionality and performance.
- A method `weekday` has been added to the `dt` namespace, that can be called on column expressions containing datetime data. This method returns an integer that represents the day of the week.
- EXPERIMENTAL: Methods `show` and `plot` on Tables, providing visualizations of data using HoloViz Panel.
- Added support for `instance` parameter to `groupby`, `join`, `windowby` and temporal join methods.
- `pw.PersistenceMode.UDF_CACHING` persistence mode enabling automatic caching of `AsyncTransformer` invocations.
- Methods `round` and `floor` on columns with datetimes now accept the duration argument as a string.
- `pw.debug.compute_and_print` and `pw.debug.compute_and_print_update_stream` have a new argument `n_rows` that limits the number of rows printed.
- `pw.debug.table_to_pandas` has a new argument `include_id` (by default `True`). If set to `False`, creates a new index for the Pandas DataFrame, rather than using the keys of the Pathway Table.
- The `shard` argument of the `windowby` function is now deprecated and `instance` should be used.
- Special column name `_pw_shard` is now deprecated, and `_pw_instance` should be used.
- `pw.ReplayMode` now can be accessed as `pw.PersistenceMode`, while the `SPEEDRUN` and `REALTIME` variants are now accessible as `SPEEDRUN_REPLAY` and `REALTIME_REPLAY`.
- EXPERIMENTAL: `pw.io.gdrive.read` has a new argument `with_metadata` (by default `False`). If set to `True`, adds a `_metadata` column containing file metadata to the resulting table.
- Methods `get_nearest_items` and `get_nearest_items_asof_now` of `KNNIndex` allow specifying `k` (number of returned elements) separately in each query.
- Added the ability to create custom reducers using the `pw.reducers.udf_reducer` decorator. Use `pw.BaseCustomAccumulator` as a base class for creating accumulators. Decorating an accumulator returns a reducer following custom logic.
- A function `pw.debug.compute_and_print_update_stream` that computes and prints the update stream of the table.
- SQLite input connector (`pw.io.sqlite`).
- `pw.debug.parse_to_table` is now deprecated, `pw.debug.table_from_markdown` should be used instead.
- `pw.schema_from_csv` now has `quote` and `double_quote_escapes` arguments.
- Schema returned from `pw.schema_from_csv` will have quotes removed from column names, so it will now work properly with `pw.io.csv.read`.
- Experimental Google Drive input connector.
- Stateful deduplication function (`pw.stateful.deduplicate`) allowing alerting on significant changes.
- The ability to split data into batches in `pw.debug.table_from_markdown` and `pw.debug.table_from_pandas`.
- class `Behavior`, a superclass of all behavior classes.
- class `ExactlyOnceBehavior` indicating we want to create a `CommonBehavior` that results in each window producing exactly one output (shifted in time by an optional `shift` parameter).
- function `exactly_once_behavior` creating an instance of `ExactlyOnceBehavior`.
- BREAKING: `WindowBehavior` is now called `CommonBehavior`, as it can be also used with interval joins.
- BREAKING: `window_behavior` is now called `common_behavior`, as it can be also used with interval joins.
- Deprecating parameter `keep_queries` in `pw.io.http.rest_connector`. Now `delete_completed_queries` with an opposite meaning should be used instead. The default is still `delete_completed_queries=True` (equivalent to `keep_queries=False`) but it will soon be required to be set explicitly.
- A flag `with_metadata` for the filesystem-based connectors to attach the source file metadata to the table entries.
- Methods `pw.debug.table_from_list_of_batches` and `pw.debug.table_from_list_of_batches_by_workers` for creating tables with defined data being inserted over time.
- BREAKING: `pw.debug.table_from_pandas` and `pw.debug.table_from_markdown` now will create tables in the streaming mode, instead of static, if the given table definition contains a `_time` column.
- BREAKING: Renamed the parameter `keep_queries` in `pw.io.http.rest_connector` to `delete_queries` with the opposite meaning. It changes the default behavior: it was `keep_queries=False`, now it is `delete_queries=False`.
- A method `get_nearest_items_asof_now` in `KNNIndex` that allows getting nearest neighbors without updating old queries in the future.
- A method `asof_now_join` in `Table` to join rows from the left side of the join with the right side of the join at their processing time. Past rows from the left side are not used when new data appears on the right side.
- `interval_join` now supports forgetting old entries. The configuration can be passed using `behavior` parameter of `interval_join` method.
- Decorator `@table_transformer` for marking that functions take Tables as arguments.
- Namespace for all columns `Table.C.*`.
- Output connectors now provide logs about the number of entries written and time taken.
- Filesystem connectors now support reading whole files as rows.
- Command line option for `pathway spawn` to record data and `pathway replay` command to replay data.
- `select` operates only on consistent states.
- `Schema` method `typehints` that returns dict of mypy-compatible typehints.
- Support for JSON parsing from CSV sources.
- `restrict` method in `Table` to restrict table universe to the universe of the other table.
- Better support for postgresql types in the output connector.
- BREAKING: renamed `Table` method `dtypes` to `typehints`. It now returns a `dict` of mypy-compatible typehints.
- BREAKING: `Schema.__getitem__` returns a data class `ColumnSchema` containing all related information on a particular column.
- BREAKING: `tuple` reducer used after `intervals_over` window now sorts values by time.
- BREAKING: expressions used in `select`, `filter`, `flatten`, `with_columns`, `with_id`, `with_id_from` have to have the same universe as the table. Earlier it was possible to use an expression from a superset of a table universe. To use expressions from wider universes, one can use `restrict` on the expression source table.
- BREAKING: `pw.universes.promise_are_equal(t1, t2)` no longer allows using references from `t1` and `t2` in a single expression. To change the universe of a table, use `with_universe_of`.
- BREAKING: `ix` and `ix_ref` are temporarily broken inside joins (both temporal and ordinary).
- `select`, `filter`, `concat` keep columns as a single stream. The work for other operators is ongoing.
- Optional types other than string correctly output to PostgreSQL.
- Support for messages compressed with zstd in the Kafka connector.
- Support for JSON data format, including `pw.Json` type.
- Methods `as_int()`, `as_float()`, `as_str()`, `as_bool()` to convert values from `Json`.
- New argument `skip_nones` for `tuple` and `sorted_tuple` reducers.
- New argument `is_outer` for `intervals_over` window.
- `pw.schema_from_dict` and `pw.schema_from_csv` for generating schema based, respectively, on provided definition as a dictionary and CSV file with sample data.
- `generate_class` method in `Schema` class for generating schema class code.
- Method `get()` and `[]` to support accessing elements in Jsons.
- Function `pw.assert_table_has_schema` for writing asserts that check whether a given table has the same schema as the one given as an argument.
- BREAKING: `ix` and `ix_ref` operations are now standalone transformations of `pw.Table` into `pw.Table`. Most of the usages remain the same, but sometimes the user needs to provide a context (e.g. when using them inside `join` or `groupby` operations). `ix` and `ix_ref` are temporarily broken inside temporal joins.
- Fixed a bug where new-style optional types (e.g. `int | None`) were translated to `Any` dtype.
- Incompatible `beartype` version is now excluded from dependencies.
- Module `pathway.dt` to construct and manipulate DTypes.
- New argument `keep_queries` in `pw.io.http.rest_connector`.
- Internal representation of DTypes. Inputting types is compatible backwards.
- Temporal functions now accept arguments of mixed types (ints and floats). For example, `pw.temporal.interval` can use ints while columns it interacts with are floats.
- Single-element arrays are now treated as arrays, not as scalars.
- `to_string()` method on datetimes always prints 9 fractional digits.
- `%f` format code in `strptime()` parses fractional part of a second correctly regardless of the number of digits.
- `Table.cast_to_types()` function that can perform `pathway.cast` on multiple columns.
- `intervals_over` window, which allows getting data temporally close to given times.
- `demo.replay_csv_with_time` function that can replay a CSV file following the timestamps of a given column.
- Static data is now copied to ensure immutability.
- Improved error tracing mechanism to work with any type of error.
- `tuple` reducer, that returns a tuple with values.
- `ndarray` reducer, that returns an array with values.
- `numpy` arrays of `int32`, `uint32` and `float32` are now converted to their 64-bit variants instead of tuples.
- KNNIndex interface to take columns as inputs.
- Reducers now check types of their arguments.
- Fixed delayed reporting of output connector errors.
- Python objects are now freed more often, reducing peak memory usage.
- `@` (matrix multiplication) operator.
- Python version 3.10 or later is now required.
- Type checking is now more strict.
- Immediately forget queries in REST connector.
- Make type annotations mandatory in `Schema`.
- Fixed IDs coming from CSV source.
- Fixed indices of dataframes from pandas transformer.