
fix(typing): Resolve all mypy & pyright errors for _arrow #2007

Open
wants to merge 42 commits into main

Conversation

dangotbanned
Member

@dangotbanned dangotbanned commented Feb 13, 2025

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@dangotbanned
Member Author

@MarcoGorelli Almost out of the rabbit hole on this!

I've found some more places where (#1657 (comment)) would be pretty helpful:

def _from_native_series(
    self: ArrowSeries[Any],
    series: pa.Array[_ScalarT_co]
    | pa.ChunkedArray[Any]
    | pa.ChunkedArray[_ScalarT_co]
    | pa.Array[Any],
) -> ArrowSeries[_ScalarT_co]:
    return ArrowSeries(
        chunked_array(series),
        name=self._name,
        backend_version=self._backend_version,
        version=self._version,
    )

@classmethod
def _from_iterable(
    cls: type[Self],
    data: Iterable[_ScalarT_co],
    name: str,
    *,
    backend_version: tuple[int, ...],
    version: Version,
) -> ArrowSeries[_ScalarT_co]:
    return cls(
        chunked_array([data]),
        name=name,
        backend_version=backend_version,
        version=version,
    )

def __narwhals_namespace__(self: Self) -> ArrowNamespace:
    from narwhals._arrow.namespace import ArrowNamespace

    return ArrowNamespace(
        backend_version=self._backend_version, version=self._version
    )

def cast(self: Self, dtype: DType) -> ArrowSeries[Any]:
    ser = self._native_series
    data_type = narwhals_to_native_dtype(dtype, self._version)
    return self._from_native_series(pc.cast(ser, data_type))

Essentially, anywhere a narwhals object would "change" its TypeVar, the current `self.__class__(...)` route breaks the typing.

So for `ArrowSeries[T1]`, you can't get to `ArrowSeries[T2]` without a `@classmethod` that removes `T1` from scope.
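A minimal sketch of the scoping issue, using a hypothetical `Series` class (not narwhals code): the instance-bound `T1` can't be rebound, but a method or classmethod can introduce its own fresh TypeVar.

```python
from __future__ import annotations

from typing import Callable, Generic, TypeVar

T1 = TypeVar("T1")
T2 = TypeVar("T2")


class Series(Generic[T1]):
    """Hypothetical stand-in for ArrowSeries; not narwhals code."""

    def __init__(self, values: list[T1]) -> None:
        self.values = values

    def map(self, fn: Callable[[T1], T2]) -> Series[T2]:
        # Using self.__class__(...) here would pin the result to T1;
        # constructing via the class directly lets T2 bind instead.
        return Series([fn(v) for v in self.values])

    @classmethod
    def _from_iterable(cls, values: list[T2]) -> Series[T2]:
        # A classmethod has no instance-bound T1 in scope, so it can
        # mint a Series of any element type.
        return Series(values)


ints = Series([1, 2, 3])
strs = ints.map(str)  # inferred as Series[str], not Series[int]
```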


My brain has fully melted working on this, hope the above made sense 🫠
If not, see the scoping rules for type variables: (https://typing.readthedocs.io/en/latest/spec/generics.html#scoping-rules-for-type-variables)

@MarcoGorelli
Member

thanks for working on this

is it necessary to make ArrowSeries generic in the PyArrow type? would it work to just keep that out for now?

@dangotbanned
Member Author

dangotbanned commented Feb 13, 2025

thanks for working on this

is it necessary to make ArrowSeries generic in the PyArrow type? would it work to just keep that out for now?

@MarcoGorelli 100% needed to resolve the issues I'm afraid πŸ˜”

Without the TypeVar, most of the `@overload`s end up matching `Expression`.
This was the main source of errors: since we don't appear to use `Expression` anywhere, it just introduces a huge amount of noise.


Having read through a lot of the code (though not having used pyarrow much personally), I'm curious why we've not used Expression much, or at all?

It seems to be available in our min version (https://arrow.apache.org/docs/11.0/python/generated/pyarrow.dataset.Expression.html)
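For illustration, a toy analogue of the overload-matching problem described above (`Array` and `Expression` here are hypothetical stand-ins, not the pyarrow types):

```python
from __future__ import annotations

from typing import Any, TypeVar, overload


class Array:
    """Stand-in for a typed pyarrow array."""

    def __init__(self, data: list[int]) -> None:
        self.data = data


class Expression:
    """Stand-in for pyarrow.dataset.Expression."""


A = TypeVar("A", bound=Array)


@overload
def add(x: A, y: A) -> A: ...
@overload
def add(x: Expression, y: Any) -> Expression: ...
def add(x: Any, y: Any) -> Any:
    if isinstance(x, Array) and isinstance(y, Array):
        return Array([a + b for a, b in zip(x.data, y.data)])
    return Expression()


# With precisely-typed arguments the first overload matches and the
# result stays an Array; with an Any-typed argument, a checker may
# select the Expression overload instead, and every chained call
# afterwards then reports an error.
out = add(Array([1, 2]), Array([3, 4]))
```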


For some context of the kinds of errors, (#1961 (comment))


@dangotbanned
Member Author

dangotbanned commented Feb 13, 2025

Started at like 150-300 errors

Down to 32 errors 😁!!!!!!!!!!!
https://github.com/narwhals-dev/narwhals/actions/runs/13313871223/job/37182957093?pr=2007

Update 1

Now 23 errors (https://github.com/narwhals-dev/narwhals/actions/runs/13315369207/job/37187931923?pr=2007)

Update 2

21 errors (https://github.com/narwhals-dev/narwhals/actions/runs/13315749955/job/37189182650?pr=2007)

Update 3

Now with 17 errors (https://github.com/narwhals-dev/narwhals/actions/runs/13317234868/job/37194138357?pr=2007)
And a runtime issue for 3.9 😒

Update 4

Down to 10 errors (https://github.com/narwhals-dev/narwhals/actions/runs/13328559038/job/37227189197?pr=2007)

All remaining errors are in tests/**, finished fixing everything for narwhals/**

Update 5

Only 1 error left (https://github.com/narwhals-dev/narwhals/actions/runs/13328999125/job/37228573119?pr=2007)

Update 6

mypy is satisfied πŸ˜… https://github.com/narwhals-dev/narwhals/actions/runs/13329410417/job/37229803694?pr=2007

Comment on lines +37 to +48
Incomplete: TypeAlias = Any # pragma: no cover
"""
Marker for working code that fails on the stubs.

Common issues:
- Annotated for `Array`, but not `ChunkedArray`
- Relies on typing information that the stubs don't provide statically
- Missing attributes
- Incorrect return types
- Inconsistent use of generic/concrete types
- `_clone_signature` used on signatures that are not identical
"""
Member Author

I've been sprinkling these in with a comment when all else fails, e.g:

def diff(self: ArrowSeries[_NumericOrTemporalT]) -> ArrowSeries[_NumericOrTemporalT]:
    # NOTE: stub only permits `ChunkedArray[TemporalScalar]`
    # (https://github.com/zen-xu/pyarrow-stubs/blob/d97063876720e6a5edda7eb15f4efe07c31b8296/pyarrow-stubs/compute.pyi#L145-L148)
    diff: Incomplete = pc.pairwise_diff
    return self._from_native_series(diff(self._native_series.combine_chunks()))

If the stub issues get resolved in the future, this will be a lot easier to fix than if we had used `Any` directly

`pyright` doesn't need this; `mypy` infers this as `str`, which is too wide:

> narwhals/_arrow/namespace.py:372: error: No overload variant of "binary_join_element_wise" matches argument types "Generator[ChunkedArray[StringScalar], None, None]", "str"  [call-overload]
> narwhals/_arrow/namespace.py:372: note: Possible overload variants:
Comment on lines +1156 to +1157
# NOTE: stubs leave unannotated
if_else: Incomplete = pc.if_else
Member Author

Comment on lines +1183 to +1186
# empty bin intervals should have a 0 count
counts_coalesce = cast(
    "pa.Array[Any]",
    pc.coalesce(cast("pa.Array[Any]", counts.column("counts")), lit(0)),
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- def _hist_from_bin_count(
-     bin_count: int,
- ) -> tuple[Sequence[int], Sequence[int | float], Sequence[int | float]]:
+ def _hist_from_bin_count(bin_count: int):  # type: ignore[no-untyped-def]  # noqa: ANN202
Member Author

@dangotbanned dangotbanned Feb 14, 2025

The whole of `ArrowSeries.hist` is too complex to bother typing the nested inner methods' return types; falling back to inference is fine.

Kinda concerned though that bin_left is returned and never used?

if bins is not None:
    if len(bins) < 2:
        counts, bin_left, bin_right = [], [], []
    else:
        counts, bin_left, bin_right = _hist_from_bins(bins)
elif bin_count is not None:
    if bin_count == 0:
        counts, bin_left, bin_right = [], [], []
    else:
        counts, bin_left, bin_right = _hist_from_bin_count(bin_count)
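For context, a rough, hypothetical sketch of what a `_hist_from_bin_count`-style helper computes (equal-width bins over the data range; this is not the narwhals implementation):

```python
from __future__ import annotations


def hist_from_bin_count_sketch(
    values: list[float], bin_count: int
) -> tuple[list[int], list[float], list[float]]:
    """Return (counts, bin_left, bin_right) for equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_count
    edges = [lo + i * width for i in range(bin_count + 1)]
    counts = [0] * bin_count
    for v in values:
        # clamp the maximum value into the last bin
        i = min(int((v - lo) / width), bin_count - 1)
        counts[i] += 1
    return counts, edges[:-1], edges[1:]


counts, bin_left, bin_right = hist_from_bin_count_sketch([1, 2, 3, 4], 2)
```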

@dangotbanned dangotbanned changed the title fix(DRAFT): Resolve all mypy & pyright errors for _arrow fix(typing): Resolve all mypy & pyright errors for _arrow Feb 14, 2025
@dangotbanned dangotbanned marked this pull request as ready for review February 14, 2025 20:13
@dangotbanned
Member Author

is it necessary to make ArrowSeries generic in the PyArrow type? would it work to just keep that out for now?

@MarcoGorelli had a bit of an interesting development with this.
So, regardless of how accurate `pyarrow-stubs` is, by annotating the `ArrowSeries` methods we can at least chain those together with accurate static feedback.

I haven't gone too deep into that, but have made a start with:

Comment on lines 95 to 97
if TYPE_CHECKING:
    return pa.repeat(None, n).cast(series._type)
return pa.nulls(n, series._type)
Member

same comment, can we type: ignore and report upstream if there's a stubs error?

Member Author

@dangotbanned dangotbanned Feb 15, 2025

@MarcoGorelli would you be okay with (adef6a0)?

I prefer that, since we can easily find all refs like:

I would've added as a suggestion, but the import was outside the range

Member Author

Some more context in (#2007 (comment))

Part of #2007 (comment)

I'm expecting this to report in CI if not available in some version

The previous fix was to resolve `pd.Series` not annotated as accepting `pa.ChunkedArray`
Comment on lines +275 to +279
@property
def _type(self: ArrowSeries[pa.Scalar[DataTypeT_co]]) -> DataTypeT_co:
    if TYPE_CHECKING:
        return self._native_series[0].type
    return self._native_series.type
Member Author

I don't think there's a way to get this working without the TYPE_CHECKING block.

I'm using this in a few places to resolve ChunkedArray erasing its type property:

https://github.com/zen-xu/pyarrow-stubs/blob/d97063876720e6a5edda7eb15f4efe07c31b8296/pyarrow-stubs/__lib_pxi/table.pyi#L59-L63

def type(self) -> DataType: ...

The type is preserved for Array, so here I'm stealing the generic from that type property:

https://github.com/zen-xu/pyarrow-stubs/blob/d97063876720e6a5edda7eb15f4efe07c31b8296/pyarrow-stubs/__lib_pxi/array.pyi#L1058-L1070

def type(self: Array[Scalar[_DataType_CoT]]) -> _DataType_CoT: ...
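The same trick in isolation, using a hypothetical `Box` class: the `TYPE_CHECKING` branch shows the checker an expression with a precise inferred type, while runtime takes the real path whose stub-level type would be too wide.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Generic, TypeVar

T = TypeVar("T")


class Box(Generic[T]):
    """Hypothetical container; not narwhals or pyarrow code."""

    def __init__(self, items: list[T]) -> None:
        self._items = items

    @property
    def first(self) -> T:
        if TYPE_CHECKING:
            # the checker infers T from indexing list[T]
            return self._items[0]
        # runtime uses a different path returning the same value
        return next(iter(self._items))


b = Box([1, 2, 3])
```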

Comment on lines 95 to +97
def maybe_extract_py_scalar(value: Any, return_py_scalar: bool) -> Any:  # noqa: FBT001
    if TYPE_CHECKING:
        return value.as_py()
Member Author

There are some similarities between this one and (#2007 (comment))

Part of this is recreating a subset of the @overload(s) in
https://github.com/zen-xu/pyarrow-stubs/blob/d97063876720e6a5edda7eb15f4efe07c31b8296/pyarrow-stubs/__lib_pxi/scalar.pyi#L62-L105

But I'm also needing to lie, since .as_py() isn't available in all versions.

For a lot of the cases where `maybe_extract_py_scalar` is used, this avoids needing a `[no-any-return]` ignore, since we have `pa.Scalar[_BasicDataType[_AsPyType]]` provided by #2007 (comment)
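A self-contained sketch of the overload pattern being recreated here, with `Scalar` as a hypothetical stand-in for `pa.Scalar` (overloading on the `Literal` flag gives callers a precise return type instead of `Any`):

```python
from __future__ import annotations

from typing import Any, Literal, overload


class Scalar:
    """Hypothetical stand-in for a pyarrow scalar."""

    def __init__(self, value: int) -> None:
        self._value = value

    def as_py(self) -> int:
        return self._value


@overload
def maybe_extract(value: Scalar, return_py_scalar: Literal[True]) -> int: ...
@overload
def maybe_extract(value: Scalar, return_py_scalar: Literal[False]) -> Scalar: ...
def maybe_extract(value: Scalar, return_py_scalar: bool) -> Any:
    # Callers passing True get the Python value type statically,
    # rather than Any plus a [no-any-return] ignore at each site.
    return value.as_py() if return_py_scalar else value
```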

Member

@MarcoGorelli MarcoGorelli left a comment

first, thanks a tonne for your efforts, really appreciate it

second, I think this might be doing too much - mainly I'm concerned about there being both logic changes (e.g. to `to_pandas`) and typing changes. can we keep them to separate PRs? I'm concerned about missing things with too-large PRs


Successfully merging this pull request may close these issues.

CI: get mypy passing with pyarrow-stubs installed
2 participants