Estimation

`mercury.monitoring.estimation`

`performance_predictor`

`PerformancePredictor(model, metric_fn, corruptions=None, percentiles=None, performance_predictor=None, param_grid=None, K_cv=5, random_state=None, store_train_data=False)`

This class allows us to estimate the performance of a model on an unlabeled dataset, for example to monitor performance on production data before the labels are available. The method is based on the paper "Learning to Validate the Predictions of Black Box Classifiers on Unseen Data". In a nutshell, the steps of the method are:

1) Apply corruptions to a held-out (labeled) dataset.
2) Obtain the percentiles of the model outputs and the performance of the model on each corrupted dataset.
3) Train a regressor to predict model performance, using the percentiles and performances obtained in step 2 as its training samples.
4) Use the trained regressor to estimate the performance on unlabeled serving data.
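The four steps can be sketched end to end with NumPy and scikit-learn alone. This is not mercury's implementation: `shift_drift` and `scale_drift` are hypothetical stand-ins for BatchDriftGenerator methods, the corruption magnitudes and the synthetic dataset are arbitrary, and the regressor features are percentiles of the positive-class probability of a binary classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical corruption functions standing in for BatchDriftGenerator methods.
def shift_drift(X, cols, amount):
    X = X.copy()
    X[:, cols] += amount  # add a constant offset to the selected columns
    return X

def scale_drift(X, cols, factor):
    X = X.copy()
    X[:, cols] *= factor  # rescale the selected columns
    return X

DRIFT_FNS = {"shift_drift": shift_drift, "scale_drift": scale_drift}
PERCENTILES = np.arange(0, 101, 5)  # 0, 5, ..., 100

def output_percentiles(model, X):
    # Regressor features: percentiles of the positive-class probability.
    return np.percentile(model.predict_proba(X)[:, 1], PERCENTILES)

# Synthetic data split into train / held-out labeled test / serving.
X, y = make_classification(n_samples=3000, n_features=5, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_test, X_serving, y_test, y_serving = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Step 1: corruptions expressed as (drift name, parameters) tuples.
corruptions = [("shift_drift", {"cols": [c], "amount": a})
               for c in range(X.shape[1]) for a in np.linspace(-3, 3, 13)]
corruptions += [("scale_drift", {"cols": [c], "factor": f})
                for c in range(X.shape[1]) for f in np.linspace(0.2, 3.0, 8)]

# Step 2: output percentiles and true accuracy for each corrupted copy of the test set.
reg_features, reg_targets = [], []
for name, params in corruptions:
    X_corrupted = DRIFT_FNS[name](X_test, **params)
    reg_features.append(output_percentiles(model, X_corrupted))
    reg_targets.append(accuracy_score(y_test, model.predict(X_corrupted)))

# Step 3: regressor mapping output percentiles -> accuracy.
regressor = RandomForestRegressor(n_estimators=15, random_state=0)
regressor.fit(np.array(reg_features), np.array(reg_targets))

# Step 4: estimate accuracy on shifted serving data without using its labels.
X_serving_shifted = shift_drift(X_serving, cols=[0], amount=1.5)
estimate = regressor.predict(output_percentiles(model, X_serving_shifted).reshape(1, -1))[0]
print(f"estimated accuracy: {estimate:.3f}")
# Comparing with the real value is only possible here because the data is synthetic.
print(f"actual accuracy:    {accuracy_score(y_serving, model.predict(X_serving_shifted)):.3f}")
```

Because the regressor only sees percentiles of model outputs, step 4 needs no labels; how good the estimate is depends on how closely the simulated corruptions resemble the real shift in the serving data.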

According to the paper, the method works well when: 1) the data suffers from covariate shift (changes in the input data distribution), and 2) we know in advance what kind of covariate shift to expect in the serving data. However, in our experiments we have found that in some situations the method still works when the data also suffers from label shift. It is also important to note that the method is not 100% accurate and cannot detect a performance drop in all cases.
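To make the distinction concrete, the toy sketch below (plain NumPy, not part of mercury) builds a covariate shift, where the input distribution P(x) moves while the labeling rule P(y | x) stays fixed, and a label shift, where the class proportions P(y) change by resampling existing rows so that P(x | y) is unchanged. The threshold rule and the 80/20 resampling are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def label_rule(x):
    # Fixed P(y | x): the label is 1 when the feature is positive.
    return (x > 0).astype(int)

# Reference data.
x_ref = rng.normal(0.0, 1.0, size=10_000)
y_ref = label_rule(x_ref)

# Covariate shift: P(x) moves (mean 0 -> 1), the labeling rule is unchanged.
x_cov = rng.normal(1.0, 1.0, size=10_000)
y_cov = label_rule(x_cov)

# Label shift: oversample positives to 80%; rows are reused, so P(x | y) is unchanged.
pos, neg = np.flatnonzero(y_ref == 1), np.flatnonzero(y_ref == 0)
idx = np.concatenate([rng.choice(pos, 8_000), rng.choice(neg, 2_000)])
x_lab, y_lab = x_ref[idx], y_ref[idx]

print(f"positive rate  reference: {y_ref.mean():.2f}  "
      f"covariate shift: {y_cov.mean():.2f}  label shift: {y_lab.mean():.2f}")
```

Note that covariate shift can also change the class proportions as a side effect; what defines it is that the relationship between inputs and labels is preserved.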

Original paper: https://ssc.io/pdf/mod0077s.pdf

Parameters:

Name	Type	Description	Default
`model`	`BaseEstimator`	The model whose performance we want to estimate.	required
`metric_fn`	`Callable`	Function that calculates the metric that we want to estimate. The function should accept the true labels as the first argument and the predictions as the second argument. For example, you can use functions from the sklearn.metrics module.	required
`corruptions`	`List[Tuple]`	Optional list of corruptions to apply to the dataset specified in the `fit()` method. If specified, it is a list of tuples where each tuple has two elements: 1) a string with the type of drift to apply, and 2) a dictionary with the parameters of that drift. For the first element you can use any method available in the mercury.monitoring.drift.drift_simulation.BatchDriftGenerator class; the second element holds the arguments of that drift method. See the tutorial of this class or the BatchDriftGenerator documentation for more details. If not specified, the corruptions are added in the `fit()` method according to the drift detected.	`None`
`percentiles`	`Union[List, array]`	np.array or list with the percentiles of the model outputs to use as features in the regressor. By default, the calculated percentiles are [0, 5, 10, ..., 95, 100].	`None`
`performance_predictor`	`BaseEstimator`	(Unfitted) model to use as the regressor. By default it is a RandomForestRegressor with n_estimators=15.	`None`
`param_grid`	`dict`	Dictionary with the hyperparameter grid used in the grid search when training the regressor. By default only the max_depth of the RandomForestRegressor is tuned.	`None`
`K_cv`	`int`	Number of folds to use in the grid-search cross-validation when training the regressor. By default 5 folds are used.	`5`
`random_state`	`int`	Random state to use in the RandomForestRegressor. By default it is None.	`None`
`store_train_data`	`bool`	Whether to store the data used to train the regressor in the attributes `X_train_regressor` and `y_train_regressor`. This can be useful for analysis when experimenting with the method. By default it is False.	`False`
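As a rough illustration of how the `percentiles`, `performance_predictor`, `param_grid`, `K_cv` and `random_state` defaults fit together, the following sketch builds the regressor with scikit-learn's GridSearchCV. The documentation only states that max_depth is tuned, so the grid values here are hypothetical, and the training rows are synthetic stand-ins for the percentile features and metric values that `fit()` would compute from corrupted data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

K_cv = 5
random_state = 42
percentiles = np.arange(0, 101, 5)  # documented default: [0, 5, ..., 95, 100]
param_grid = {"max_depth": [None, 3, 5, 10]}  # hypothetical values; only max_depth is documented as tuned

# Default regressor as described above: a RandomForestRegressor with n_estimators=15.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=15, random_state=random_state),
    param_grid=param_grid,
    cv=K_cv,
)

# Synthetic training set: each row stands for one corrupted dataset.
rng = np.random.default_rng(0)
scores = [rng.beta(2, 2 + i % 5, size=500) for i in range(60)]   # fake model output scores
X_reg = np.array([np.percentile(s, percentiles) for s in scores])  # 21 percentile features per row
y_reg = np.array([np.mean(s > 0.5) for s in scores])               # placeholder for the measured metric

search.fit(X_reg, y_reg)
print(search.best_params_)
```

Passing your own `performance_predictor` and `param_grid` would, under this reading, simply replace the estimator and the grid handed to the grid search.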

Example

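>>> from mercury.monitoring.estimation.performance_predictor import PerformancePredictor
>>> from sklearn.metrics import accuracy_score
>>> # Assumes model, X_train, y_train, df_test, df_serving, features and label are already defined
>>> model.fit(X_train, y_train)
>>> performance_predictor = PerformancePredictor(model, metric_fn=accuracy_score, random_state=42)
>>> # df_test is the labeled held-out data; df_serving is the unlabeled serving data
>>> performance_predictor.fit(X=df_test[features], y=df_test[label], X_serving=df_serving[features])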