feat: Design of EstimatorReport (#997)

closes #834

Investigate an API for an `EstimatorReport`.
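
A rough sketch of the intended usage, based on the example added in this PR (defaults and exact signatures may differ):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from skore import EstimatorReport

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
estimator = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

reporter = EstimatorReport(
    estimator, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test
)
reporter.help()                    # rich overview of the available tools
reporter.metrics.report_metrics()  # several metrics at once, reusing cached predictions
reporter.metrics.plot.roc()        # one-liner displays built on scikit-learn's displays
```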

#### TODO

- [x] Metrics
  - [x] handle string metrics as specified in the accessor
  - [x] handle callable metrics
  - [x] handle scikit-learn scorers
  - [x] use the cache as efficiently as possible
  - [x] add testing for all of those features
- [x] allow passing a new validation set to functions instead of using the
internal validation set
  - [x] add a proper help and rich `__repr__`
- [x] Plots
  - [x] add the roc curve display
  - [x] add the precision recall curve display
  - [x] add prediction error display for regressor
  - [x] make proper testing for those displays
  - [x] add a proper `__repr__` for those displays
- [x] Documentation 
- [x] (done for the checked part) add an example to showcase all the
different features
- [x] find a way to show the accessors' documentation in the page of
`EstimatorReport`. It could be a bit tricky because they are only
defined once the instance is created.
- We need to have a look at the `series.rst` page from pandas to see how
they document this sort of pattern.
- [x] check the autocompletion: when typing `report.metrics.->tab` it
should provide the autocompletion. **edit**: using a stub file actually
works. I prefer this to putting type hints directly in the file.
- Open questions
  - [x] we use hashing to retrieve the external set.
- should we use caching for the external validation set? To make it work, we
need to compute the hash of potentially big arrays, which might be more
costly than making the model predict (see the sketch below this list).
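
A minimal sketch of that trade-off, assuming `joblib.hash` as the hashing utility (the data and estimator below are made up purely for illustration):

```python
import time

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up "external validation set", only to get an order of magnitude
X_external = np.random.rand(100_000, 50)
y_external = np.random.randint(0, 2, size=100_000)
model = LogisticRegression().fit(X_external, y_external)

start = time.time()
cache_key = joblib.hash(X_external)  # candidate cache key for the external set
print(f"hashing:    {time.time() - start:.3f}s")

start = time.time()
model.predict(X_external)
print(f"predicting: {time.time() - start:.3f}s")
# if hashing is not clearly cheaper than predicting, keying the cache on a hash
# of the data does not buy much for this estimator/dataset combination
```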

#### Notes

This PR builds upon:
- #962 to reuse the
`skore.console`
- #998 to be able to detect
clusterers in a consistent manner.
glemaitre authored Jan 10, 2025
1 parent 1e0b605 commit 1a4151a
Showing 34 changed files with 5,713 additions and 7 deletions.
385 changes: 385 additions & 0 deletions examples/model_evaluation/plot_estimator_report.py
@@ -0,0 +1,385 @@
"""
============================================
Get insights from any scikit-learn estimator
============================================
This example shows how the :class:`skore.EstimatorReport` class can be used to
quickly get insights from any scikit-learn estimator.
"""

# %%
#
# TODO: we need to describe the aim of this classification problem.
from skrub.datasets import fetch_open_payments

dataset = fetch_open_payments()
df = dataset.X
y = dataset.y

# %%
from skrub import TableReport

TableReport(df)

# %%
TableReport(y.to_frame())

# %%
# Looking at the distribution of the target, we observe that this classification
# task is quite imbalanced. This means that we have to be careful when selecting a
# set of statistical metrics to evaluate the classification performance of our
# predictive model. In addition, we see that the class labels are not encoded as the
# integers 0 or 1 but instead as the strings "allowed" and "disallowed".
#
# For our application, the label of interest is "allowed".
pos_label, neg_label = "allowed", "disallowed"

# %%
# Before training a predictive model, we need to split our dataset into a training
# and a validation set.
from skore import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42)

# %%
# TODO: we have a perfect case to show a useful feature of the `train_test_split`
# function from `skore`.
#
# Now, we need to define a predictive model. Fortunately, `skrub` provides a
# convenient function (:func:`skrub.tabular_learner`) to get a strong baseline
# predictive model with a single line of code. Since its feature engineering is
# generic, it does not include handcrafted, task-specific features, but it still
# provides a good starting point.
#
# So let's create a classifier for our task and fit it on the training set.
from skrub import tabular_learner

estimator = tabular_learner("classifier").fit(X_train, y_train)
estimator

# %%
#
# Introducing the :class:`skore.EstimatorReport` class
# ----------------------------------------------------
#
# Now, we would be interested in getting some insights from our predictive model.
# One way is to use the :class:`skore.EstimatorReport` class. This constructor will
# detect that our estimator is already fitted and will not fit it again.
from skore import EstimatorReport

reporter = EstimatorReport(
    estimator, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test
)
reporter

# %%
#
# Once the reporter is created, we get some information regarding the tools
# available to get insights from our specific model on this specific task.
#
# You can get similar information by calling the :meth:`~skore.EstimatorReport.help`
# method.
reporter.help()

# %%
#
# Be aware that you can access the help for each individual sub-accessor. For instance:
reporter.metrics.help()

# %%
reporter.metrics.plot.help()

# %%
#
# Metrics computation with aggressive caching
# -------------------------------------------
#
# At this point, we might be interested in having a first look at the statistical
# performance of our model on the validation set that we provided. We can access it
# by calling any of the metrics displayed above. Since we are greedy, we want several
# metrics at once, so we will use the
# :meth:`~skore.EstimatorReport.metrics.report_metrics` method.
import time

start = time.time()
metric_report = reporter.metrics.report_metrics(pos_label=pos_label)
end = time.time()
metric_report

# %%
print(f"Time taken to compute the metrics: {end - start:.2f} seconds")

# %%
#
# An interesting feature provided by the :class:`skore.EstimatorReport` is the
# caching mechanism. Indeed, when the dataset is large enough, computing the
# predictions of a model is not cheap anymore. For instance, on our smallish dataset,
# it took a couple of seconds to compute the metrics. The reporter caches the
# predictions, so if you compute a metric again, or an alternative metric that
# requires the same predictions, it will be faster. Let's check by requesting the
# same metrics report again.

start = time.time()
metric_report = reporter.metrics.report_metrics(pos_label=pos_label)
end = time.time()
metric_report

# %%
print(f"Time taken to compute the metrics: {end - start:.2f} seconds")

# %%
#
# Since we obtain a pandas dataframe, we can also use the plotting interface of
# pandas.
import matplotlib.pyplot as plt

ax = metric_report.T.plot.barh()
ax.set_title("Metrics report")
plt.tight_layout()

# %%
#
# Whenever a metric is computed, we check whether the predictions are available in
# the cache and reload them if they are. For instance, let's compute the log loss.

start = time.time()
log_loss = reporter.metrics.log_loss()
end = time.time()
log_loss

# %%
print(f"Time taken to compute the log loss: {end - start:.2f} seconds")

# %%
#
# We can show that, without the initial cache, it would have taken more time to
# compute the log loss.
reporter.clean_cache()

start = time.time()
log_loss = reporter.metrics.log_loss()
end = time.time()
log_loss

# %%
print(f"Time taken to compute the log loss: {end - start:.2f} seconds")

# %%
#
# By default, the metrics are computed on the test set. However, if a training set
# is provided, we can also compute the metrics by specifying the `data_source`
# parameter.
reporter.metrics.log_loss(data_source="train")

# %%
#
# If we are interested in computing the metrics on a completely new set of data, we
# can use the `data_source="X_y"` parameter. In addition, we need to provide the `X`
# and `y` parameters.

start = time.time()
metric_report = reporter.metrics.report_metrics(
    data_source="X_y", X=X_test, y=y_test, pos_label=pos_label
)
end = time.time()
metric_report

# %%
print(f"Time taken to compute the metrics: {end - start:.2f} seconds")

# %%
#
# As in the other case, we rely on the cache to avoid recomputing the predictions.
# Internally, we compute a hash of the input data to be sure that we can hit the cache
# in a consistent way.

# %%
start = time.time()
metric_report = reporter.metrics.report_metrics(
    data_source="X_y", X=X_test, y=y_test, pos_label=pos_label
)
end = time.time()
metric_report

# %%
print(f"Time taken to compute the metrics: {end - start:.2f} seconds")

# %%
#
# .. warning::
# In this last example, we rely on computing the hash of the input data. Therefore,
# there is a trade-off: the computation of the hash is not free and it might be
# faster to compute the predictions instead.
#
# Be aware that you can also benefit from the caching mechanism with your own custom
# metrics. We only expect your metric function to take `y_true` and `y_pred` as the
# first two positional arguments; it can take any other arguments after those.
# Let's see an example.


def operational_decision_cost(y_true, y_pred, amount):
    mask_true_positive = (y_true == pos_label) & (y_pred == pos_label)
    mask_true_negative = (y_true == neg_label) & (y_pred == neg_label)
    mask_false_positive = (y_true == neg_label) & (y_pred == pos_label)
    mask_false_negative = (y_true == pos_label) & (y_pred == neg_label)
    # FIXME: we need to make sense of the cost sensitive part with the right naming
    fraudulent_refuse = mask_true_positive.sum() * 50
    fraudulent_accept = -amount[mask_false_negative].sum()
    legitimate_refuse = mask_false_positive.sum() * -5
    legitimate_accept = (amount[mask_true_negative] * 0.02).sum()
    return fraudulent_refuse + fraudulent_accept + legitimate_refuse + legitimate_accept


# %%
#
# In our use case, we have an operational decision to make that translates the
# classification outcome into a cost. It translates the confusion matrix into a cost
# matrix based on an amount linked to each sample in the dataset, which is provided
# to us. Here, we randomly generate some amounts as an illustration.
import numpy as np

rng = np.random.default_rng(42)
amount = rng.integers(low=100, high=1000, size=len(y_test))
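
# %%
#
# As a purely illustrative check of the cost function above (toy labels and
# made-up amounts, one sample per confusion-matrix cell): the true positive
# contributes +50, the false negative contributes -200 (its amount), the false
# positive contributes -5, and the true negative contributes 0.02 * 400 = +8,
# i.e. a total of -147.
toy_y_true = np.array([pos_label, pos_label, neg_label, neg_label])
toy_y_pred = np.array([pos_label, neg_label, pos_label, neg_label])
toy_amount = np.array([100, 200, 300, 400])
operational_decision_cost(toy_y_true, toy_y_pred, toy_amount)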

# %%
#
# Let's first make sure that the `predict` method has been called and its result
# cached. We compute the accuracy metric, which triggers a call to `predict`.
reporter.metrics.accuracy()

# %%
#
# We can now compute the cost of our operational decision.
start = time.time()
cost = reporter.metrics.custom_metric(
    metric_function=operational_decision_cost,
    metric_name="Operational Decision Cost",
    response_method="predict",
    amount=amount,
)
end = time.time()
cost

# %%
print(f"Time taken to compute the cost: {end - start:.2f} seconds")

# %%
#
# Let's now clean the cache and see how much slower the computation is without it.
reporter.clean_cache()

# %%
start = time.time()
cost = reporter.metrics.custom_metric(
    metric_function=operational_decision_cost,
    metric_name="Operational Decision Cost",
    response_method="predict",
    amount=amount,
)
end = time.time()
cost

# %%
print(f"Time taken to compute the cost: {end - start:.2f} seconds")

# %%
#
# We observe that caching is working as expected. It is really handy because it means
# that you can compute some additional metrics without having to recompute the
# predictions.
reporter.metrics.report_metrics(
    scoring=["precision", "recall", operational_decision_cost],
    pos_label=pos_label,
    scoring_kwargs={
        "amount": amount,
        "response_method": "predict",
        "metric_name": "Operational Decision Cost",
    },
)

# %%
#
# It could happen that you are interested in providing several custom metrics which
# do not necessarily share the same parameters. In this more complex case, we
# require you to provide a scorer using the :func:`sklearn.metrics.make_scorer`
# function.
from sklearn.metrics import make_scorer, f1_score

f1_scorer = make_scorer(
    f1_score,
    response_method="predict",
    metric_name="F1 Score",
    pos_label=pos_label,
)
operational_decision_cost_scorer = make_scorer(
    operational_decision_cost,
    response_method="predict",
    metric_name="Operational Decision Cost",
    amount=amount,
)
reporter.metrics.report_metrics(scoring=[f1_scorer, operational_decision_cost_scorer])

# %%
#
# Effortless one-liner plotting
# -----------------------------
#
# The :class:`skore.EstimatorReport` class also provides a plotting interface that
# allows plotting the most common plots out of the box. As for the metrics, we only
# provide the set of plots that is meaningful for the provided estimator.
reporter.metrics.plot.help()

# %%
#
# Let's start by plotting the ROC curve for our binary classification task.
display = reporter.metrics.plot.roc(pos_label=pos_label)
plt.tight_layout()

# %%
#
# The plot functionality is built upon the scikit-learn display objects. We return
# those displays (slightly modified to improve the UI) in case you want to tweak some
# of the plot properties. You can have a quick look at the available attributes and
# methods by calling the `help` method or simply by printing the display.
display

# %%
display.help()

# %%
display.plot()
display.ax_.set_title("Example of a ROC curve")
display.figure_
plt.tight_layout()

# %%
#
# Similarly to the metrics, we aggressively use caching to avoid recomputing the
# predictions of the model. We also cache the plot display object by detecting
# whether the input parameters are the same as in the previous call. Let's
# demonstrate the kind of performance gain we can get.
start = time.time()
# we already triggered the computation of the predictions in a previous call
reporter.metrics.plot.roc(pos_label=pos_label)
plt.tight_layout()
end = time.time()

# %%
print(f"Time taken to compute the ROC curve: {end - start:.2f} seconds")

# %%
#
# Now, let's clean the cache and check if we get a slowdown.
reporter.clean_cache()

# %%
start = time.time()
reporter.metrics.plot.roc(pos_label=pos_label)
plt.tight_layout()
end = time.time()

# %%
print(f"Time taken to compute the ROC curve: {end - start:.2f} seconds")

# %%
# As expected, since we need to recompute the predictions, it takes more time.