[SPARK-50685][PYTHON][FOLLOW-UP] Improve Py4J performance by leveraging getattr #49412

HyukjinKwon · 2025-01-08T07:43:40Z

This PR is a follow-up of #49313 that fixes more places that were missed.

This PR fixes Core, SQL, ML, and Structured Streaming. Test code, MLlib, and DStream are not affected.

To reduce the overhead of Py4J calls.
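As a rough illustration of where the overhead comes from, here is a toy model and not Py4J itself. It assumes that each attribute lookup on a JVM view costs one round trip to the JVM, so a chained lookup costs one trip per dotted segment while a single `getattr` with the fully qualified name costs one trip. `FakeRemote` and `FakeNode` are hypothetical stand-ins invented for this sketch.

```python
class FakeRemote:
    """Hypothetical stand-in for the JVM side; counts simulated round trips."""

    def __init__(self):
        self.round_trips = 0

    def resolve(self, fqn):
        # Each resolution models one Python -> JVM round trip.
        self.round_trips += 1
        return fqn


class FakeNode:
    """Hypothetical stand-in for a JVM view / package / class proxy."""

    def __init__(self, remote, prefix=""):
        self._remote = remote
        self._prefix = prefix

    def __getattr__(self, name):
        # Only reached for names not found by normal attribute lookup.
        if name.startswith("_"):
            raise AttributeError(name)
        fqn = f"{self._prefix}.{name}" if self._prefix else name
        self._remote.resolve(fqn)
        return FakeNode(self._remote, fqn)


# Chained access: one simulated round trip per segment.
remote_chained = FakeRemote()
jvm = FakeNode(remote_chained)
chained = jvm.org.apache.spark.sql.SparkSession
chained_trips = remote_chained.round_trips

# Single getattr with the full dotted name: one simulated round trip.
remote_direct = FakeRemote()
jvm = FakeNode(remote_direct)
direct = getattr(jvm, "org.apache.spark.sql.SparkSession")
direct_trips = remote_direct.round_trips

print(chained_trips, direct_trips)
```

In this model both forms resolve to the same name, but the chained form pays for every intermediate package.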

No.

Manually tested as demonstrated in #49312

No.

zhengruifeng · 2025-01-08T11:50:24Z

To further optimize Py4J calls, does it make sense to cache the result? e.g.

import functools
from typing import Any

@functools.lru_cache(maxsize=128)
def get_jvm_attr(jvm: "JVMView", name: str) -> Any:
    return getattr(jvm, name)

HyukjinKwon · 2025-01-09T00:18:08Z

Yeah, but we should think about how it will affect GC in Python and the JVM.
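A minimal sketch of the Python side of that concern: `functools.lru_cache` keeps strong references to its arguments and results, so a cached `jvm` object stays alive until its entry is evicted or the cache is cleared. `FakeJVMView` is a hypothetical stand-in, not the real Py4J class, and whether this matters for real `JVMView` objects or for objects on the JVM side depends on Py4J internals not shown here.

```python
import functools
import gc
import weakref
from typing import Any


class FakeJVMView:
    """Hypothetical stand-in for a Py4J JVMView."""

    SparkSession = "org.apache.spark.sql.SparkSession"


@functools.lru_cache(maxsize=128)
def get_jvm_attr(jvm: Any, name: str) -> Any:
    return getattr(jvm, name)


jvm = FakeJVMView()
ref = weakref.ref(jvm)  # lets us observe whether the object is still alive

get_jvm_attr(jvm, "SparkSession")  # the cache key now references `jvm`

del jvm
gc.collect()
alive_while_cached = ref() is not None  # cache still holds the object

get_jvm_attr.cache_clear()
gc.collect()
alive_after_clear = ref() is not None  # nothing references it any more

print(alive_while_cached, alive_after_clear)
```

In this sketch the object outlives its last user-visible reference until `cache_clear()` is called, which is the kind of lifetime extension a cache would need to account for.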

HyukjinKwon · 2025-01-09T04:01:55Z

Merged to master.

github-actions bot added SQL ML PYTHON AVRO PROTOBUF labels Jan 8, 2025

HyukjinKwon force-pushed the SPARK-50685-followup branch from b486595 to ffef2f7 Compare January 8, 2025 07:44

itholic approved these changes Jan 8, 2025

dongjoon-hyun approved these changes Jan 8, 2025

zhengruifeng approved these changes Jan 8, 2025

HyukjinKwon force-pushed the SPARK-50685-followup branch from ffef2f7 to 16e672e Compare January 8, 2025 08:26

HyukjinKwon added 3 commits January 9, 2025 10:25

followup (443d314)

fixup (ab741c4)

fixup (5513d51)

HyukjinKwon force-pushed the SPARK-50685-followup branch from 953694a to 5513d51 Compare January 9, 2025 01:25

HyukjinKwon closed this in cb093e6 Jan 9, 2025