[python-package][PySpark] Expose Training and Validation Metrics #11133

ayoub317 · 2024-12-29T02:04:48Z

wbo4958 · 2025-01-02T02:57:59Z

python-package/xgboost/spark/xgboost_training_summary.py

+
+
+@dataclass
+class _XGBoostTrainingSummary:


Could this be XGBoostTrainingSummary?

wbo4958 · 2025-01-02T02:58:26Z

python-package/xgboost/spark/xgboost_training_summary.py

@@ -0,0 +1,43 @@
+"""Xgboost training summary integration submodule."""


How about renaming xgboost_training_summary.py to summary.py?

wbo4958 · 2025-01-03T04:42:22Z

Hi @ayoub317, could you add some unit tests for this feature?

ayoub317 · 2025-01-03T05:08:45Z

Hi @wbo4958,

Yes, with pleasure! Could you please provide some pointers on where I should define these tests?

I had a quick look at the test codebase, and I would assume the following:

The tests for the regressor and classifier could be placed under:
xgboost -> tests -> test_distributed -> test_with_spark -> test_spark_local.py -> in the class TestPySparkLocal

The test for the ranker could be placed under:
xgboost -> tests -> test_distributed -> test_with_spark -> test_spark_local.py -> in the class TestPySparkLocalLETOR

Does that sound like a good choice?

Thanks!

wbo4958 · 2025-01-03T06:44:46Z

The path you pasted should be fine for adding the new tests.

ayoub317 · 2025-01-05T00:08:28Z

Thanks @wbo4958! I pushed another commit adding the tests. Any feedback is welcome.

wbo4958 · 2025-01-06T06:33:49Z

python-package/xgboost/spark/core.py

@@ -1148,7 +1151,7 @@ def _train_booster(
                if dvalid is not None:
                    dval = [(dtrain, "training"), (dvalid, "validation")]
                else:
-                    dval = None
+                    dval = [(dtrain, "training")]


@trivialfis, could you check whether it is OK to enable this by default?
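
For context on why this one-line change matters: `xgboost.train` records metric histories only for the datasets passed in `evals`, and fills the `evals_result` dict in the shape `{eval_name: {metric_name: [value per round]}}`. The sketch below is a minimal illustration of how such a dict could be mapped onto a summary object. The class name `TrainingSummarySketch`, the field name `train_objective_history`, and the helper `from_evals_result` are assumptions made for illustration and are not taken from the PR; only `validation_objective_history` appears in the diff above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Shape of the dict that xgboost.train fills in through `evals_result`:
# {eval_name: {metric_name: [one value per boosting round]}}
EvalsResult = Dict[str, Dict[str, List[float]]]


@dataclass
class TrainingSummarySketch:
    """Hypothetical stand-in for the summary class discussed in this PR."""

    # Assumed field name; only validation_objective_history is shown in the diff.
    train_objective_history: Dict[str, List[float]] = field(default_factory=dict)
    validation_objective_history: Dict[str, List[float]] = field(default_factory=dict)

    @staticmethod
    def from_evals_result(evals_result: EvalsResult) -> "TrainingSummarySketch":
        # The keys are the names given in the evals list ("training", "validation").
        # A missing key simply yields an empty history.
        return TrainingSummarySketch(
            train_objective_history=dict(evals_result.get("training", {})),
            validation_objective_history=dict(evals_result.get("validation", {})),
        )


# With no validation set, the changed line still records the "training" entry.
summary = TrainingSummarySketch.from_evals_result(
    {"training": {"rmse": [0.9, 0.7, 0.6]}}
)
```

With the previous `dval = None`, no evaluation would run when there is no validation set, so there would be no training history to expose. The trade-off is that evaluating on the training data every round adds some cost, which is presumably why the default is being double-checked here.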

wbo4958 · 2025-01-06T06:38:05Z

tests/test_distributed/test_with_spark/test_xgboost_summary.py

+
+from .test_spark_local import spark as spark_local
+
+logging.getLogger("py4j").setLevel(logging.INFO)


Is this for debugging?

Yes, and since it was also set in test_spark_local.py, I kept it. Would you prefer that we remove it?

wbo4958 · 2025-01-06T06:41:56Z

tests/test_distributed/test_with_spark/test_xgboost_summary.py

@@ -0,0 +1,233 @@
+import logging


I'm wondering, if we could put the tests in this file into the existing test_spark_local.py and reuse the existing test data?

Yes, I can move them there without much effort. Let me know if you'd like me to proceed with that.
However, in my humble opinion, it's better to keep them in this separate file, and here's the rationale:
The test_spark_local.py file already exceeds 1800 lines of code, which makes it increasingly difficult to read, maintain, and navigate. As new features are added to PySpark XGBoost, this file will only continue to grow, compounding the problem.
I think refactoring the tests to organize them by key feature, rather than bundling everything under the TestPySparkLocal class, would be a better long-term approach.

If we decide to keep the tests in this file, I can either leave the examples here as they are, or, as you suggested, for better modularity and data reuse, we could import them from test_spark_local.py. Another option is to store all shared data in a separate file, allowing both test_spark_local.py and test_xgboost_summary.py to import what they need from it.

Let me know what you think; I have no strong opinion on this.

Yeah, that's a good point. Originally, I wanted to separate the tests per estimator (XGBoostClassifier/Regressor/Ranker) rather than per feature, so that the same dataset can be shared across different features.

wbo4958 · 2025-01-06T06:44:34Z

tests/test_distributed/test_with_spark/test_xgboost_summary.py

+        assert not xgb_model.training_summary.validation_objective_history
+
+    @staticmethod
+    def assert_non_empty_training_objective_history(


I'm wondering whether we could get the evaluation results from xgboost itself and the training summary from xgboost-pyspark on the same dataset, and then check that they are equal. Some tests in test_spark_local.py already do the same kind of comparison.

Yes, absolutely. Thank you for pointing this out! I tested this on a simple DataFrame locally, and the results matched perfectly. We should definitely add such tests; I'll take care of that!
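
A comparison of this kind could be sketched roughly as follows. This is a minimal, hypothetical helper and not code from the PR: `histories_match` and its `rel_tol` parameter are illustrative names. In a real test, `expected` would presumably be one entry of the `evals_result` dict filled by `xgboost.train` (for example `evals_result["training"]`), and `actual` would be the matching history from the PySpark model's `training_summary`.

```python
import math
from typing import Dict, List

# {metric_name: [one value per boosting round]}
MetricHistory = Dict[str, List[float]]


def histories_match(
    expected: MetricHistory, actual: MetricHistory, rel_tol: float = 1e-6
) -> bool:
    """Return True if both histories track the same metrics with close values."""
    # Both sides must report exactly the same set of metrics.
    if expected.keys() != actual.keys():
        return False
    for metric, expected_values in expected.items():
        actual_values = actual[metric]
        # Both sides must cover the same number of boosting rounds.
        if len(expected_values) != len(actual_values):
            return False
        # Compare round by round with a tolerance instead of exact equality,
        # since floating-point results may differ slightly between runs.
        if not all(
            math.isclose(e, a, rel_tol=rel_tol)
            for e, a in zip(expected_values, actual_values)
        ):
            return False
    return True
```

A tolerance-based check is used here on the assumption that distributed and single-node training may not agree bit for bit; if exact equality is expected, a plain `==` on the two dicts would be stricter.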

ayoub317 force-pushed the expose-metrics branch 9 times, most recently from 3d7161a to 99fc349 · December 29, 2024 20:53

[python-package][PySpark] Expose Training and Validation Metrics

25a06b8

ayoub317 force-pushed the expose-metrics branch from 99fc349 to 25a06b8 · December 29, 2024 20:55

wbo4958 reviewed Jan 2, 2025

ayoub317 force-pushed the expose-metrics branch from a1f9df2 to 469c6bb · January 2, 2025 06:40

ayoub317 force-pushed the expose-metrics branch from 469c6bb to ef56736 · January 3, 2025 21:51

Renamed xgboost_training_summary.py to summary.py and _XGBoostTrainingSummary to XGBoostTrainingSummary

984bc8e

ayoub317 force-pushed the expose-metrics branch 8 times, most recently from 5828a0c to da80def · January 4, 2025 23:30

Add tests for the PySpark XGBoost summary

3e60eec

ayoub317 force-pushed the expose-metrics branch from da80def to 3e60eec · January 5, 2025 15:22

wbo4958 reviewed Jan 6, 2025
