
fix(typing): Resolve all mypy & pyright errors for _arrow #2007

Open
wants to merge 42 commits into main

Conversation

dangotbanned
Member

@dangotbanned dangotbanned commented Feb 13, 2025

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@dangotbanned
Member Author

@MarcoGorelli Almost out of the rabbit hole on this!

I've found some more places where (#1657 (comment)) would be pretty helpful:

def _from_native_series(
    self: ArrowSeries[Any],
    series: pa.Array[_ScalarT_co]
    | pa.ChunkedArray[Any]
    | pa.ChunkedArray[_ScalarT_co]
    | pa.Array[Any],
) -> ArrowSeries[_ScalarT_co]:
    return ArrowSeries(
        chunked_array(series),
        name=self._name,
        backend_version=self._backend_version,
        version=self._version,
    )

@classmethod
def _from_iterable(
    cls: type[Self],
    data: Iterable[_ScalarT_co],
    name: str,
    *,
    backend_version: tuple[int, ...],
    version: Version,
) -> ArrowSeries[_ScalarT_co]:
    return cls(
        chunked_array([data]),
        name=name,
        backend_version=backend_version,
        version=version,
    )

def __narwhals_namespace__(self: Self) -> ArrowNamespace:
    from narwhals._arrow.namespace import ArrowNamespace

    return ArrowNamespace(
        backend_version=self._backend_version, version=self._version
    )

def cast(self: Self, dtype: DType) -> ArrowSeries[Any]:
    ser = self._native_series
    data_type = narwhals_to_native_dtype(dtype, self._version)
    return self._from_native_series(pc.cast(ser, data_type))

Essentially, anywhere a narwhals object would "change" its TypeVar, the current `self.__class__(...)` route breaks the typing.

So for `ArrowSeries[T1]`, you can't get to `ArrowSeries[T2]` without a `@classmethod` that removes `T1` from scope.
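A minimal sketch of the scoping issue, using a hypothetical `Series` class (not narwhals code): the instance-bound `T1` can't be rebound, but a method or classmethod can introduce its own fresh TypeVar.

```python
from __future__ import annotations

from typing import Callable, Generic, TypeVar

T1 = TypeVar("T1")
T2 = TypeVar("T2")


class Series(Generic[T1]):
    """Hypothetical stand-in for ArrowSeries; not narwhals code."""

    def __init__(self, values: list[T1]) -> None:
        self.values = values

    def map(self, fn: Callable[[T1], T2]) -> Series[T2]:
        # Using self.__class__(...) here would pin the result to T1;
        # constructing via the class directly lets T2 bind instead.
        return Series([fn(v) for v in self.values])

    @classmethod
    def _from_iterable(cls, values: list[T2]) -> Series[T2]:
        # A classmethod has no instance-bound T1 in scope, so it can
        # mint a Series of any element type.
        return Series(values)


ints = Series([1, 2, 3])
strs = ints.map(str)  # inferred as Series[str], not Series[int]
```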


My brain has fully melted working on this, hope the above made sense 🫠
If not, see the scoping rules for type variables: (https://typing.readthedocs.io/en/latest/spec/generics.html#scoping-rules-for-type-variables)

@MarcoGorelli
Member

thanks for working on this

is it necessary to make ArrowSeries generic in the PyArrow type? would it work to just keep that out for now?

@dangotbanned
Member Author

dangotbanned commented Feb 13, 2025

thanks for working on this

is it necessary to make ArrowSeries generic in the PyArrow type? would it work to just keep that out for now?

@MarcoGorelli 100% needed to resolve the issues I'm afraid πŸ˜”

Without the TypeVar, most of the `@overload`s end up matching `Expression`.
This was the main source of errors: since we don't appear to use `Expression` anywhere, it just introduces a huge amount of noise.


Having read through a lot of the code (though not having used pyarrow much personally), I'm curious why we've not used Expression much, or at all?

It seems to be available in our min version (https://arrow.apache.org/docs/11.0/python/generated/pyarrow.dataset.Expression.html)
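For illustration, a toy analogue of the overload-matching problem described above (`Array` and `Expression` here are hypothetical stand-ins, not the pyarrow types):

```python
from __future__ import annotations

from typing import Any, TypeVar, overload


class Array:
    """Stand-in for a typed pyarrow array."""

    def __init__(self, data: list[int]) -> None:
        self.data = data


class Expression:
    """Stand-in for pyarrow.dataset.Expression."""


A = TypeVar("A", bound=Array)


@overload
def add(x: A, y: A) -> A: ...
@overload
def add(x: Expression, y: Any) -> Expression: ...
def add(x: Any, y: Any) -> Any:
    if isinstance(x, Array) and isinstance(y, Array):
        return Array([a + b for a, b in zip(x.data, y.data)])
    return Expression()


# With precisely-typed arguments the first overload matches and the
# result stays an Array; with an Any-typed argument, a checker may
# select the Expression overload instead, and every chained call
# afterwards then reports an error.
out = add(Array([1, 2]), Array([3, 4]))
```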


For some context of the kinds of errors, (#1961 (comment))


@dangotbanned
Member Author

dangotbanned commented Feb 13, 2025

Started at like 150-300 errors

Down to 32 errors 😁!!!!!!!!!!!
https://github.com/narwhals-dev/narwhals/actions/runs/13313871223/job/37182957093?pr=2007

Update 1

Now 23 errors (https://github.com/narwhals-dev/narwhals/actions/runs/13315369207/job/37187931923?pr=2007)

Update 2

21 errors (https://github.com/narwhals-dev/narwhals/actions/runs/13315749955/job/37189182650?pr=2007)

Update 3

Now with 17 errors (https://github.com/narwhals-dev/narwhals/actions/runs/13317234868/job/37194138357?pr=2007)
And a runtime issue for 3.9 😒

Update 4

Down to 10 errors (https://github.com/narwhals-dev/narwhals/actions/runs/13328559038/job/37227189197?pr=2007)

All remaining errors are in tests/**, finished fixing everything for narwhals/**

Update 5

Only 1 error left (https://github.com/narwhals-dev/narwhals/actions/runs/13328999125/job/37228573119?pr=2007)

Update 6

mypy is satisfied πŸ˜… https://github.com/narwhals-dev/narwhals/actions/runs/13329410417/job/37229803694?pr=2007

Comment on lines +37 to +48
Incomplete: TypeAlias = Any # pragma: no cover
"""
Marker for working code that fails on the stubs.

Common issues:
- Annotated for `Array`, but not `ChunkedArray`
- Relies on typing information that the stubs don't provide statically
- Missing attributes
- Incorrect return types
- Inconsistent use of generic/concrete types
- `_clone_signature` used on signatures that are not identical
"""
Member Author

I've been sprinkling these in with a comment when all else fails, e.g:

def diff(self: ArrowSeries[_NumericOrTemporalT]) -> ArrowSeries[_NumericOrTemporalT]:
    # NOTE: stub only permits `ChunkedArray[TemporalScalar]`
    # (https://github.com/zen-xu/pyarrow-stubs/blob/d97063876720e6a5edda7eb15f4efe07c31b8296/pyarrow-stubs/compute.pyi#L145-L148)
    diff: Incomplete = pc.pairwise_diff
    return self._from_native_series(diff(self._native_series.combine_chunks()))

If the stub issues get resolved in the future, this will be a lot easier to fix than if we had used `Any` directly

`pyright` doesn't need this; `mypy` infers this as `str`, which is too wide:

> narwhals/_arrow/namespace.py:372: error: No overload variant of "binary_join_element_wise" matches argument types "Generator[ChunkedArray[StringScalar], None, None]", "str"  [call-overload]
> narwhals/_arrow/namespace.py:372: note: Possible overload variants:
Comment on lines +1156 to +1157
# NOTE: stubs leave unannotated
if_else: Incomplete = pc.if_else
Member Author

Comment on lines +1183 to +1186
# empty bin intervals should have a 0 count
counts_coalesce = cast(
    "pa.Array[Any]",
    pc.coalesce(cast("pa.Array[Any]", counts.column("counts")), lit(0)),
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- def _hist_from_bin_count(
-     bin_count: int,
- ) -> tuple[Sequence[int], Sequence[int | float], Sequence[int | float]]:
+ def _hist_from_bin_count(bin_count: int):  # type: ignore[no-untyped-def]  # noqa: ANN202
Member Author

@dangotbanned dangotbanned Feb 14, 2025

The whole of `ArrowSeries.hist` is too complex to bother typing the nested inner methods' return types; falling back to inference is fine.

Kinda concerned though that bin_left is returned and never used?

if bins is not None:
    if len(bins) < 2:
        counts, bin_left, bin_right = [], [], []
    else:
        counts, bin_left, bin_right = _hist_from_bins(bins)
elif bin_count is not None:
    if bin_count == 0:
        counts, bin_left, bin_right = [], [], []
    else:
        counts, bin_left, bin_right = _hist_from_bin_count(bin_count)
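For context, a rough, hypothetical sketch of what a `_hist_from_bin_count`-style helper computes (equal-width bins over the data range; this is not the narwhals implementation):

```python
from __future__ import annotations


def hist_from_bin_count_sketch(
    values: list[float], bin_count: int
) -> tuple[list[int], list[float], list[float]]:
    """Return (counts, bin_left, bin_right) for equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_count
    edges = [lo + i * width for i in range(bin_count + 1)]
    counts = [0] * bin_count
    for v in values:
        # clamp the maximum value into the last bin
        i = min(int((v - lo) / width), bin_count - 1)
        counts[i] += 1
    return counts, edges[:-1], edges[1:]


counts, bin_left, bin_right = hist_from_bin_count_sketch([1, 2, 3, 4], 2)
```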

@dangotbanned dangotbanned changed the title fix(DRAFT): Resolve all mypy & pyright errors for _arrow fix(typing): Resolve all mypy & pyright errors for _arrow Feb 14, 2025
@dangotbanned dangotbanned marked this pull request as ready for review February 14, 2025 20:13
@dangotbanned
Member Author

is it necessary to make ArrowSeries generic in the PyArrow type? would it work to just keep that out for now?

@MarcoGorelli had a bit of an interesting development with this.
So, regardless of how accurate `pyarrow-stubs` is, by annotating the `ArrowSeries` methods we can at least chain those together with accurate static feedback.

I haven't gone too deep into that, but have made a start with:

Comment on lines 95 to 97
if TYPE_CHECKING:
    return pa.repeat(None, n).cast(series._type)
return pa.nulls(n, series._type)
Member

same comment, can we type: ignore and report upstream if there's a stubs error?

Member Author

@dangotbanned dangotbanned Feb 15, 2025

@MarcoGorelli would you be okay with (adef6a0)?

I prefer that, since we can easily find all refs like:

I would've added as a suggestion, but the import was outside the range

Member Author

Some more context in (#2007 (comment))

Part of #2007 (comment)

I'm expecting this to report in CI if not available in some version

The previous fix was to resolve `pd.Series` not annotated as accepting `pa.ChunkedArray`
Comment on lines +275 to +279
@property
def _type(self: ArrowSeries[pa.Scalar[DataTypeT_co]]) -> DataTypeT_co:
    if TYPE_CHECKING:
        return self._native_series[0].type
    return self._native_series.type
Member Author

I don't think there's a way to get this working without the TYPE_CHECKING block.

I'm using this in a few places to resolve ChunkedArray erasing its type property:

https://github.com/zen-xu/pyarrow-stubs/blob/d97063876720e6a5edda7eb15f4efe07c31b8296/pyarrow-stubs/__lib_pxi/table.pyi#L59-L63

def type(self) -> DataType: ...

The type is preserved for Array, so here I'm stealing the generic from that type property:

https://github.com/zen-xu/pyarrow-stubs/blob/d97063876720e6a5edda7eb15f4efe07c31b8296/pyarrow-stubs/__lib_pxi/array.pyi#L1058-L1070

def type(self: Array[Scalar[_DataType_CoT]]) -> _DataType_CoT: ...
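The same trick in isolation, using a hypothetical `Box` class: the `TYPE_CHECKING` branch shows the checker an expression with a precise inferred type, while runtime takes the real path whose stub-level type would be too wide.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Generic, TypeVar

T = TypeVar("T")


class Box(Generic[T]):
    """Hypothetical container; not narwhals or pyarrow code."""

    def __init__(self, items: list[T]) -> None:
        self._items = items

    @property
    def first(self) -> T:
        if TYPE_CHECKING:
            # the checker infers T from indexing list[T]
            return self._items[0]
        # runtime uses a different path returning the same value
        return next(iter(self._items))


b = Box([1, 2, 3])
```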

Comment on lines 95 to +97
def maybe_extract_py_scalar(value: Any, return_py_scalar: bool) -> Any:  # noqa: FBT001
    if TYPE_CHECKING:
        return value.as_py()
Member Author

There are some similarities between this one and (#2007 (comment))

Part of this is recreating a subset of the @overload(s) in
https://github.com/zen-xu/pyarrow-stubs/blob/d97063876720e6a5edda7eb15f4efe07c31b8296/pyarrow-stubs/__lib_pxi/scalar.pyi#L62-L105

But I'm also needing to lie, since .as_py() isn't available in all versions.

For a lot of the cases where `maybe_extract_py_scalar` is used, this avoids needing a `[no-any-return]` ignore, since we have `pa.Scalar[_BasicDataType[_AsPyType]]` provided by #2007 (comment)
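A self-contained sketch of the overload pattern being recreated here, with `Scalar` as a hypothetical stand-in for `pa.Scalar` (overloading on the `Literal` flag gives callers a precise return type instead of `Any`):

```python
from __future__ import annotations

from typing import Any, Literal, overload


class Scalar:
    """Hypothetical stand-in for a pyarrow scalar."""

    def __init__(self, value: int) -> None:
        self._value = value

    def as_py(self) -> int:
        return self._value


@overload
def maybe_extract(value: Scalar, return_py_scalar: Literal[True]) -> int: ...
@overload
def maybe_extract(value: Scalar, return_py_scalar: Literal[False]) -> Scalar: ...
def maybe_extract(value: Scalar, return_py_scalar: bool) -> Any:
    # Callers passing True get the Python value type statically,
    # rather than Any plus a [no-any-return] ignore at each site.
    return value.as_py() if return_py_scalar else value
```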

Member

@MarcoGorelli MarcoGorelli left a comment

first, thanks a tonne for your efforts, really appreciate it

second, I think this might be doing too much - mainly I'm concerned about there being both logic changes (e.g. to `to_pandas`) and typing changes. can we keep them to separate PRs? I'm concerned about missing things with too-large PRs


Successfully merging this pull request may close these issues.

CI: get mypy passing with pyarrow-stubs installed
2 participants