
Model Tests

mercury.robust.model_tests

ClassificationInvarianceTest(original_samples, perturbed_samples, model=None, predict_fn=None, threshold=0.05, check_total_errors_rate=True, name=None, *args, **kwargs)

Bases: RobustModelTest

The ClassificationInvarianceTest checks that the model's prediction does not change when a label-preserving perturbation is applied to a sample.

It does so by counting the samples for which this condition does not hold and raising an error if the percentage of samples where the label is not preserved exceeds a specified threshold. The test receives the original samples and the already generated perturbed samples.

When calling run(), a FailedTestError exception is raised if the test fails. Additionally, the following attributes are populated:

- preds_original_samples: stores the predictions for the original samples
- preds_perturbed_samples: stores the predictions for the perturbed samples
- pred_is_different: stores for each sample a boolean array indicating whether the prediction for each perturbed sample differs from the prediction for the original sample
- num_failed_per_sample: stores for each sample the number of perturbations whose prediction differs from the original sample's prediction
- num_perturbed_per_sample: stores for each sample the number of perturbations
- samples_with_errors: boolean array indicating which samples contain errors
- rate_samples_with_errors: the percentage of samples with at least one perturbed sample for which the model predicted a different label
- total_rate_errors: the total percentage of perturbed samples for which the model predicted a different label
This test is based on the paper

'Beyond Accuracy: Behavioral Testing of NLP Models with CheckList'

Parameters:

- original_samples (Union[List[str], np.array], required): List or array containing the original samples.
- perturbed_samples (Union[List[List[str]], np.array], required): List or array containing the perturbed samples. Each element of the list or each row of the array contains one or several perturbed samples corresponding to the sample in the same position/index in original_samples.
- model (BaseEstimator, default None): The model being evaluated. The model must be already trained. It is assumed to have a sklearn-like compliant predict() method that works on the original_samples and perturbed_samples and returns a vector with the predictions. Alternatively, you can pass a predict_fn.
- predict_fn (Callable, default None): Function that, given the samples, returns the predicted labels. Only used if the model argument is None.
- threshold (float, default 0.05): If the percentage of samples with errors is higher than this threshold, a FailedTestError will be raised.
- check_total_errors_rate (bool, default True): Indicates what to consider as the percentage of errors. If True, each perturbed sample counts towards the rate. If False, the rate is calculated over the number of samples, independently of how many perturbations each sample has.
- name (str, default None): A name for the test. If not used, it will take the name of the class.
Example
>>> original_samples = ["sample1", "sample2"]
>>> perturbed_samples = [
...    ["perturbed_sample_1 for sample1", "perturbed_sample_2 for sample1"],
...    ["perturbed_sample_1 for sample2", "perturbed_sample_2 for sample2"]
... ]
>>> test = ClassificationInvarianceTest(
...    original_samples,
...    perturbed_samples,
...    predict_fn=my_model.predict,
...    threshold=0.1,
...    check_total_errors_rate=True,
...    name="Invariance Test"
... )
>>> test.run()
Source code in mercury/robust/model_tests.py
def __init__(
    self,
    original_samples: Union[List[str], np.array],
    perturbed_samples: Union[List[List[str]], np.array],
    model: "BaseEstimator" = None,  # noqa: F821
    predict_fn: Callable = None,
    threshold: float = 0.05,
    check_total_errors_rate: bool = True,
    name: str = None,
    *args, **kwargs
):
    super().__init__(model, name, *args, **kwargs)
    self.original_samples = original_samples
    self.perturbed_samples = perturbed_samples
    self.predict_fn = predict_fn
    self.threshold = threshold
    self.check_total_errors_rate = check_total_errors_rate
    self.num_samples = len(self.original_samples)

    # Attributes for results
    self.preds_original_samples = None
    self.preds_perturbed_samples = None
    self.pred_is_different = None
    self.num_failed_per_sample = None
    self.num_perturbed_per_sample = None
    self.samples_with_errors = None
    self.rate_samples_with_errors = None
    self.total_rate_errors = None

get_examples_failed(n_samples=5, n_perturbed=1)

Returns examples of samples that failed.

Parameters:

- n_samples (int, default 5): number of samples to recover.
- n_perturbed (int, default 1): for each sample, how many failed perturbations to recover.
Source code in mercury/robust/model_tests.py
def get_examples_failed(self, n_samples: int = 5, n_perturbed: int = 1):
    """
    Returns examples of samples that failed.

    Args:
        - n_samples (int): number of samples to recover.
        - n_perturbed (int): for each sample, how many failed perturbations to recover.
    """

    selected_failed_samples = self._select_random_failed_samples(n_samples)

    # Get perturbations that failed for each selected sample
    examples = []
    for idx_selected in selected_failed_samples:
        selected_perturbed = self._select_random_failed_perturbations(idx_selected, n_perturbed)
        for idx_perturbed in selected_perturbed:
            sample_original = self.original_samples[idx_selected]
            sample_perturbed = self.perturbed_samples[idx_selected][idx_perturbed]
            pred_original = self.preds_original_samples[idx_selected]
            pred_perturbed = self.preds_perturbed_samples[idx_selected][idx_perturbed]
            examples.append((sample_original, sample_perturbed, pred_original, pred_perturbed))

    return pd.DataFrame(examples, columns=["original", "perturbed", "pred_original", "pred_perturbed"])
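
When run() raises a FailedTestError, the result attributes described above remain available on the test instance, so the failing cases can be inspected. A minimal sketch of this workflow; the generic Exception catch is used here only to avoid assuming the exact import path of FailedTestError:

>>> try:
...     test.run()
... except Exception:  # FailedTestError is raised when the error rate exceeds the threshold
...     print(test.rate_samples_with_errors, test.total_rate_errors)
...     print(test.get_examples_failed(n_samples=3, n_perturbed=2))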

run(*args, **kwargs)

run the test

Source code in mercury/robust/model_tests.py
def run(self, *args, **kwargs):
    """run the test"""

    self._check_args()
    self.preds_original_samples = self._get_predictions(self.original_samples)
    self._check_changes_in_perturbed_samples_predictions()
    self._calculate_errors_rate()
    self._fail_test_if_high_error()

DriftMetricResistanceTest(model, X, Y, drift_type, drift_args, names_categorical=None, dataset_schema=None, eval=None, tolerance=None, task=None, name=None)

Bases: DriftPredictionsResistanceTest, TaskInferrer

This test checks the robustness of a trained model to drift in the X dataset. It uses the model to predict Y_no_drifted from the given X. Then it applies some drift to the data in X using a BatchDriftGenerator object and computes Y_drifted. It then calculates a metric using Y_true with Y_no_drifted on the one hand and Y_true with Y_drifted on the other. If these two metrics differ by more than the given tolerance value, the test fails. This test performs only one verification. If you need more than one drift check, apply multiple tests with appropriate names to simplify following up the results.

Parameters:

- model (required): The model being evaluated. The model must be already trained and will not be trained again by this test. It is assumed to have a sklearn-like compliant predict() method that works on the dataset and returns a vector that is accepted by the evaluation function.
- X (pd.DataFrame, required): A pandas dataset that can be used by the model's predict() method and whose predicted values will be used as the ground truth for the drift measurement.
- Y (np.array, required): Array with the ground truth values. It will be used to calculate the metric for the non-drifted dataset and for the drifted dataset.
- drift_type (required): The name of the method of a BatchDriftGenerator specifying the type of drift to be applied, e.g. "shift_drift", "scale_drift", ... You can check the class BatchDriftGenerator in _drift_simulation to see all available types.
- drift_args (required): A dictionary with the arguments expected by the drift method, e.g. {cols: ['a', 'b'], iqr: [1.12, 1.18]} for "scale_drift".
- names_categorical (default None): An optional list with the names of the categorical variables. If this is used, the internal BatchDriftGenerator will use a DataSchema object to fully define the variables in X as either categorical (if in the list) or continuous (otherwise). This allows selecting the columns automatically without using the cols argument in drift_args. If this parameter is not given, the DataSchema is not initially defined and either you select the columns manually by declaring a cols argument in drift_args or the BatchDriftGenerator will create a DataSchema that automatically infers the column types.
- dataset_schema (default None): Alternatively, you can provide a pre-built schema for an even higher level of control. If you use this argument, names_categorical is not used. The schema fully defines each variable as binary, categorical, discrete or continuous. If you still define the cols argument in drift_args, that selection will prevail over whatever is in dataset_schema.
- eval (default None): The evaluation function used to calculate the metric. If passed, the interface of the function must be eval_fn(y_true, y_hat). If not used, the mean absolute error will be used for regression and the accuracy for classification.
- tolerance (default None): A real value to be compared with the difference between the metric computed on the non-drifted dataset and the metric computed on the drifted dataset.
- task (default None): 'classification' or 'regression'. If not given, the test will try to infer it from Y.
- name (default None): A name for the test. If not used, it will take the name of the class.
Example
>>> testing_dataset = pd.DataFrame(...)
>>> rf = RandomForestClassifier().fit(train_data)
>>> drift_args = {'cols': 'feature_1', 'method': 'percentile', 'method_params': {'percentile': 95}}
>>> test = DriftMetricResistanceTest(
...    model = rf,
...    X = testing_dataset[features],
...    Y = testing_dataset[target],
...    drift_type = 'outliers_drift',
...    drift_args = drift_args,
...    tolerance = 0.05
... )
>>> test.run()  # The test will fail if the difference in the metric is more than 0.05
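
After run(), the computed metrics can be inspected through the metric_no_drifted, metric_drifted and metric_diff attributes, assuming run() fills them as their names suggest (they are initialised to None in the constructor shown below). A short sketch; the generic Exception catch avoids assuming the exact import path of FailedTestError:

>>> try:
...     test.run()
... except Exception:  # FailedTestError when the metric difference exceeds the tolerance
...     pass
>>> test.metric_no_drifted, test.metric_drifted, test.metric_diff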
Source code in mercury/robust/model_tests.py
def __init__(
    self,
    model,
    X: pd.DataFrame,
    Y: np.array,
    drift_type,
    drift_args,
    names_categorical=None,
    dataset_schema=None,
    eval=None,
    tolerance=None,
    task=None,
    name=None
):
    super().__init__(model, X, drift_type, drift_args, names_categorical, dataset_schema, eval, tolerance, name)
    self.Y_true = Y
    self.task = task
    self.metric_no_drifted = None
    self.metric_drifted = None
    self.metric_diff = None

DriftPredictionsResistanceTest(model, X, drift_type, drift_args, names_categorical=None, dataset_schema=None, eval=None, tolerance=0.001, name=None)

Bases: RobustModelTest

This test checks the robustness of a trained model to drift in the X dataset. It uses the model to predict Y from the given X and uses that Y as ground truth. Then it applies some drift to the data in X using a BatchDriftGenerator object and makes a new prediction drifted_Y on the drifted dataset. If Y and drifted_Y diverge by more than the given tolerance value, the test fails. This test performs only one verification. If you need more than one drift check, apply multiple tests with appropriate names to simplify following up the results.

Parameters:

- model (required): The model being evaluated. The model must be already trained and will not be trained again by this test. It is assumed to have a sklearn-like compliant predict() method that works on the dataset and returns a vector that is accepted by the evaluation function.
- X (required): A pandas dataset that can be used by the model's predict() method and whose predicted values will be used as the ground truth for the drift measurement.
- drift_type (required): The name of the method of a BatchDriftGenerator specifying the type of drift to be applied, e.g. "shift_drift", "scale_drift", ... You can check the class BatchDriftGenerator in _drift_simulation to see all available types.
- drift_args (required): A dictionary with the arguments expected by the drift method, e.g. {cols: ['a', 'b'], iqr: [1.12, 1.18]} for "scale_drift".
- names_categorical (default None): An optional list with the names of the categorical variables. If this is used, the internal BatchDriftGenerator will use a DataSchema object to fully define the variables in X as either categorical (if in the list) or continuous (otherwise). This allows selecting the columns automatically without using the cols argument in drift_args. If this parameter is not given, the DataSchema is not initially defined and either you select the columns manually by declaring a cols argument in drift_args or the BatchDriftGenerator will create a DataSchema that automatically infers the column types.
- dataset_schema (default None): Alternatively, you can provide a pre-built schema for an even higher level of control. If you use this argument, names_categorical is not used. The schema fully defines each variable as binary, categorical, discrete or continuous. If you still define the cols argument in drift_args, that selection will prevail over whatever is in dataset_schema.
- eval (default None): If given, an evaluation function that defines how "different" the predictions are. The function must accept two vectors returned by model.predict() and return some positive value that indicates the difference in the predictions and is compared with tolerance. If not given, a sum of squared differences will be used, unless the model.predict() method generates hard labels for a multiclass classification problem; in that case, the eval function counts the number of different predictions.
- tolerance (default 0.001): A real value to be compared with the result of the evaluation function. Note that the purpose of the test is to check whether the model is robust to the introduced drift. Therefore, the test fails when the result (named loss) is higher than the tolerance, meaning the model predictions change considerably with the introduced drift. When the test fails, you can see the value returned by the eval function in the FailedTestError message, displayed as loss.
- name (default None): A name for the test. If not used, it will take the name of the class.
Example
>>> from mercury.robust.model_tests import DriftPredictionsResistanceTest
>>> test = DriftPredictionsResistanceTest(
>>>     model = trained_model,
>>>     X = X,
>>>     drift_type = "shift_drift",
>>>     drift_args = {'cols': ['feature_1'], 'force': 100.},
>>>     tolerance = 5,
>>> )
>>> test.run()
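
The eval argument accepts any function that takes the two prediction vectors and returns a positive divergence to compare against tolerance. A minimal sketch, assuming model.predict() returns a numeric vector; mean_abs_diff is an illustrative name, not part of the library:

>>> import numpy as np
>>> def mean_abs_diff(y_pred, y_pred_drifted):
>>>     # positive scalar that the test compares against `tolerance`
>>>     return float(np.mean(np.abs(y_pred - y_pred_drifted)))
>>> test = DriftPredictionsResistanceTest(
>>>     model = trained_model,
>>>     X = X,
>>>     drift_type = "shift_drift",
>>>     drift_args = {'cols': ['feature_1'], 'force': 100.},
>>>     eval = mean_abs_diff,
>>>     tolerance = 0.5,
>>> )
>>> test.run()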
Source code in mercury/robust/model_tests.py
def __init__(
    self,
    model,
    X,
    drift_type,
    drift_args,
    names_categorical=None,
    dataset_schema=None,
    eval=None,
    tolerance=1e-3,
    name=None,
):
    super().__init__(model, name)

    self.model = model
    self.X = X.copy()
    self.Y = model.predict(X)

    if dataset_schema is not None:
        self.gen = BatchDriftGenerator(X=self.X, schema=dataset_schema)
    elif names_categorical is not None:
        self.gen = BatchDriftGenerator(
            X=self.X,
            schema=DataSchema().generate_manual(
                dataframe=self.X,
                categ_columns=names_categorical,
                discrete_columns=[],
                binary_columns=[],
            ),
        )
    else:
        self.gen = BatchDriftGenerator(X=self.X)

    self.fun = getattr(self.gen, drift_type, None)

    if not callable(self.fun):
        raise RuntimeError(
            "drift_type = %s must be a method of BatchDriftGenerator." % drift_type
        )

    self.drift_args = drift_args
    self.eval = eval
    self.tolerance = tolerance
    self.loss = None

run(*args, **kwargs)

Runs the test.

Source code in mercury/robust/model_tests.py
def run(self, *args, **kwargs):
    """
    Runs the test.

    Raises:
        `FailedTestError` with a descriptive message if any of the attempts fail.
    """

    super().run(*args, **kwargs)

    drifted_Y = self._calculate_drifted_Y()

    if self.eval is None:
        eval_fn = self._get_default_eval_fn()
        self.loss = eval_fn(self.Y, drifted_Y)
    else:
        self.loss = self.eval(self.Y, drifted_Y)

    if self.loss > self.tolerance:
        raise FailedTestError(
            "Test failed. Prediction loss drifted above tolerance = %.3f (loss = %.3f)"
            % (self.tolerance, self.loss)
        )

FeatureCheckerTest(model, train, target, test=None, model_fn_args=None, importance=None, eval=None, tolerance=0.001, num_tries=3, remove_num=1, name=None)

Bases: RobustModelTest

This model robustness test checks whether training the model with fewer columns in the dataframe can achieve identical results. To do so, it uses the variable importance taken from the model itself or estimated using a mercury.explainability explainer (ShuffleImportanceExplainer). It makes a small number of attempts at removing unimportant variables and "fails if it succeeds", since success implies that a smaller, and therefore more efficient, dataset should be used instead. The purpose of this test is not to find that optimal dataset; that can be achieved by removing the columns identified as unimportant and iterating.

NOTE: This class will retrain (fit) the model several times, altering the model as a side effect. It is intended as a diagnostic tool; make copies of your model before using it.

Parameters:

- model (Union[Estimator, Callable], required): The model being evaluated. The model is assumed to comply with a minimalistic sklearn-like interface. More precisely: 1. It must have a fit() method that works on the dataset and on the dataset with some columns removed. It is important that each time fit() is called the model is trained from scratch (i.e., it does not perform incremental training). 2. It must have a predict() method that works on the dataset and returns a vector that is accepted by the evaluation function. 3. If the importance argument is used, the model must have an attribute with that name containing a list of (value, column_name) tuples, which is consistent with many sklearn models. Alternatively, you can provide a function that creates a model or pipeline. This is useful for models or pipelines where removing columns from a dataframe raises an error because the pipeline expects the removed column. In this case, the interface of the function is model_fn(dataframe, model_fn_args), where dataframe is the input pandas dataframe with the features already removed and model_fn_args is a dictionary with optional parameters that can be passed to the function. Importantly, this function just needs to create the (unfitted) model instance, not to perform the training.
- model_fn_args (dict, default None): If you are using a function as the model parameter, you can use this argument to provide arguments to that function.
- train (pd.DataFrame, required): The pandas dataset used for training the model, possibly with some columns removed.
- target (str, required): The name of the target variable predicted by the model, which must be one of the columns in the train dataset.
- test (pd.DataFrame, default None): If given, a separate dataset with identical column structure used for the evaluation parts. Otherwise, the train dataset will be used instead.
- importance (str, default None): If given, the name of a property of the model that is updated by a fit() call. It must contain the importance of the columns as a list of (value, column_name) tuples. Otherwise, the importance of the variables will be estimated using a mercury.explainability ShuffleImportanceExplainer.
- eval (Callable, default None): If given, an evaluation function that defines what "identical" results are. The function must accept two vectors returned by model.predict() and return some positive value that is smaller than tolerance if "identical". Otherwise, a sum of squared differences will be used instead.
- tolerance (float, default 0.001): A real value to be compared with the result of the evaluation function. Note that the purpose of the test is to find unimportant variables. Therefore, the test fails when the result (named loss) is smaller than the tolerance, meaning the model could work "identically" well with fewer variables. When the test fails, you can see the value returned by the eval function in the FailedTestError message, displayed as loss.
- num_tries (int, default 3): The total number of column removal tries the test should do before passing. This value times remove_num must be smaller than the number of columns (target excluded).
- remove_num (int, default 1): The number of columns removed at each try.
- name (str, default None): A name for the test. If not used, it will take the name of the class.
Example
>>> from mercury.robust.model_tests import FeatureCheckerTest
>>> test = FeatureCheckerTest(
>>>     model=model,
>>>     train=df_train,
>>>     target="label_col",
>>>     test=df_test,
>>>     num_tries=len(df_train.columns)-1,
>>>     remove_num=1,
>>>     tolerance=len(df_test)*0.01
>>> )
>>> test.run()
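
When the model is a pipeline that breaks if a column it expects is missing, the documentation above suggests passing a factory callable instead of an estimator instance. A minimal sketch of that variant, assuming the test calls the factory as model_fn(dataframe, model_fn_args) and expects an unfitted estimator back; make_model and the RandomForestClassifier choice are illustrative, not prescribed by the library:

>>> from sklearn.ensemble import RandomForestClassifier
>>> def make_model(dataframe, model_fn_args=None):
>>>     # receives the dataframe with features already removed; returns an *unfitted* model
>>>     return RandomForestClassifier(**(model_fn_args or {}))
>>> test = FeatureCheckerTest(
>>>     model=make_model,
>>>     model_fn_args={"n_estimators": 200},
>>>     train=df_train,
>>>     target="label_col",
>>>     test=df_test,
>>>     num_tries=3,
>>>     remove_num=1,
>>>     tolerance=len(df_test)*0.01
>>> )
>>> test.run()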
Source code in mercury/robust/model_tests.py
def __init__(
    self,
    model: Union["Estimator", Callable],  # noqa: F821
    train: pd.DataFrame,
    target: str,
    test: pd.DataFrame = None,
    model_fn_args: dict = None,
    importance: str = None,
    eval: Callable = None,
    tolerance: float = 1e-3,
    num_tries: int = 3,
    remove_num: int = 1,
    name: str = None
):
    super().__init__(model, name)
    self.model_fn_args = model_fn_args
    self.train = train
    self.target = target
    self.test = test if test is not None else train
    self.importance = importance
    self.eval = eval
    self.remove_num = remove_num
    self.num_tries = num_tries
    self.tolerance = tolerance
    self.losses = {}

run(*args, **kwargs)

Runs the test.

Source code in mercury/robust/model_tests.py
def run(self, *args, **kwargs):
    """
    Runs the test.

    Raises:
        `FailedTestError` with a descriptive message if any of the attempts fail.
    """

    super().run(*args, **kwargs)

    N = self.remove_num * self.num_tries

    X_train = self.train.loc[:, self.train.columns != self.target]
    Y_train = self.train.loc[:, self.target]

    # Fit model with all the features
    self._fit_model(X_train, Y_train)

    X_test = self.test.loc[:, self.test.columns != self.target]
    Y_hat = self.fitted_model.predict(X_test)

    remove = self._get_least_important(N)

    if len(remove) != N:
        raise RuntimeError(
            "Wrong arguments. Not enough columns for remove_num*num_tries."
        )

    for t in range(self.num_tries):
        exclude = remove[0:self.remove_num]
        remove = remove[self.remove_num:]

        exclude.append(self.target)

        x_cols = [c for c in self.train.columns if c not in exclude]

        X_alt_train = self.train.loc[:, x_cols]

        # Fit model with the removed features
        self._fit_model(X_alt_train, Y_train)

        X_alt_test = self.test.loc[:, x_cols]
        Y_alt_hat = self.fitted_model.predict(X_alt_test)

        if self.eval is None:
            loss = sum((Y_hat - Y_alt_hat) ** 2)
        else:
            loss = self.eval(Y_hat, Y_alt_hat)
        self.losses[", ".join(exclude[:-1])] = loss

        if loss < self.tolerance:
            raise FailedTestError(
                "Test failed. A model fitted removing columns [%s] is identical within tolerance %.3f (loss = %.3f)"
                % (", ".join(exclude[:-1]), self.tolerance, loss)
            )
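
As the loop above shows, each try stores its loss in the losses attribute, keyed by the comma-separated names of the removed columns, so the per-try results can be reviewed after running the test whether or not it failed:

>>> test.losses  # maps removed column names to the loss obtained by the model refitted without them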

ModelReproducibilityTest(model, train_dataset, target, train_fn, train_params=None, eval_fn=None, eval_params=None, threshold_eval=0.0, predict_fn=None, predict_params=None, threshold_yhat=0.0, yhat_allowed_diff=0.0, test_dataset=None, name=None, *args, **kwargs)

Bases: RobustModelTest

This test checks whether the training of a model is reproducible. It does so by training the model two times and checking whether both runs give the same evaluation metric and predictions. If the difference in the evaluation metric is higher than the threshold_eval parameter, the test fails. Similarly, if the percentage of different predictions is higher than threshold_yhat, the test fails. You can perform both checks (evaluation metric and predictions) or only one of them.

Parameters:

- model (BaseEstimator or tf.keras.Model, required): Unfitted model whose reproducibility we are checking.
- train_dataset (pd.DataFrame, required): The pandas dataset used for training the model.
- target (str, required): The name of the target variable predicted by the model, which must be one column in the train dataset.
- train_fn (Callable, required): Function called to train the model. The interface of the function is train_fn(model, X, y, train_params) and it returns the fitted model.
- train_params (dict, default None): Params to use for training. It is passed as a parameter to the train_fn.
- eval_fn (Callable, default None): Function called to evaluate the model. The interface of the function is eval_fn(model, X, y, eval_params) and it returns a float. If None, the check that training two times produces the same evaluation metric won't be performed.
- eval_params (dict, default None): Params to use for evaluation. It is passed as a parameter to the eval_fn.
- threshold_eval (float, default 0.0): Difference that we are able to tolerate in the evaluation function in order to pass the test. If the difference in the evaluation metric when training the model two times is higher than this threshold, the test fails.
- predict_fn (Callable, default None): Function called to get the predictions of a dataset once the model is trained. The interface of the function is predict_fn(model, X, predict_params) and it returns the predictions.
- predict_params (dict, default None): Params to use for prediction. It is passed as a parameter to the predict_fn.
- threshold_yhat (float, default 0.0): If predict_fn is given, this is the percentage of different predictions that we can tolerate without making the test fail. A prediction from the model trained two times is considered different according to the parameter yhat_allowed_diff. The default value is 0, meaning that a single different prediction makes the test fail.
- yhat_allowed_diff (float, default 0.0): Difference that we can tolerate in order to consider that a sample has the same prediction from the two models. If the predictions of the model trained two times differ by more than yhat_allowed_diff, that prediction is considered to be different.
- test_dataset (pd.DataFrame, default None): If given, a separate dataset with identical column structure used for the evaluation parts.
- name (str, default None): A name for the test. If not used, it will take the name of the class.
Example
>>> # Define necessary functions for model training, evaluation, and getting predictions
>>> def train_model(model, X, y, train_params=None):
>>>     model.fit(X, y)
>>>     return model
>>> def eval_model(model, X, y, eval_params=None):
>>>     y_pred = model.predict(X)
>>>     return accuracy_score(y, y_pred)
>>> def get_predictions(model, X, pred_params=None):
>>>     return model.predict(X)
>>> # Create and run test
>>> from mercury.robust.model_tests import ModelReproducibilityTest
>>> test = ModelReproducibilityTest(
>>>     model = model,
>>>     train_dataset = df_train,
>>>     target = "label_col",
>>>     train_fn = train_model,
>>>     eval_fn = eval_model,
>>>     threshold_eval = 0,
>>>     predict_fn = get_predictions,
>>>     threshold_yhat = 0,
>>>     test_dataset = df_test
>>> )
>>> test.run()
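
For models whose repeated trainings may differ only by tiny numerical amounts (e.g. a regressor), the two prediction thresholds can be relaxed. A hedged sketch reusing the train_model and get_predictions helpers defined above; regressor is an illustrative, unfitted model of your choice:

>>> test = ModelReproducibilityTest(
>>>     model = regressor,
>>>     train_dataset = df_train,
>>>     target = "label_col",
>>>     train_fn = train_model,
>>>     predict_fn = get_predictions,
>>>     yhat_allowed_diff = 1e-6,  # predictions differing by no more than 1e-6 count as equal
>>>     threshold_yhat = 0.01,     # fail only if more than 1% of predictions differ
>>>     test_dataset = df_test
>>> )
>>> test.run()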
Source code in mercury/robust/model_tests.py
def __init__(
    self, model: Union["BaseEstimator", "tf.keras.Model"],  # noqa: F821
    train_dataset: "pd.DataFrame",  # noqa: F821,
    target: str,
    train_fn: Callable, train_params: dict = None,
    eval_fn: Callable = None, eval_params: dict = None, threshold_eval: float = 0.,
    predict_fn: Callable = None, predict_params: dict = None,
    threshold_yhat: float = 0., yhat_allowed_diff: float = 0.,
    test_dataset: "pd.DataFrame" = None,  # noqa: F821,
    name: str = None,
    *args, **kwargs
):
    super().__init__(model, name, *args, **kwargs)
    self.train_dataset = train_dataset
    self.train_fn = train_fn
    self.train_params = train_params
    self.target = target
    self.eval_fn = eval_fn
    self.eval_params = eval_params
    self.threshold_eval = threshold_eval
    self.predict_fn = predict_fn
    self.predict_params = predict_params
    self.threshold_yhat = threshold_yhat
    self.yhat_allowed_diff = yhat_allowed_diff
    self.test_dataset = test_dataset

run(*args, **kwargs)

Runs the test.

Source code in mercury/robust/model_tests.py
def run(self, *args, **kwargs):
    """
    Runs the test.

    Raises:
        FailedTestError if training is not reproducible
    """

    super().run(*args, **kwargs)

    X_train = self.train_dataset.loc[:, self.train_dataset.columns != self.target]
    y_train = self.train_dataset.loc[:, self.target]
    if self.test_dataset is not None:
        X_test = self.test_dataset.loc[:, self.test_dataset.columns != self.target]
        y_test = self.test_dataset.loc[:, self.target]

    # Clone the model
    model_1 = self._clone_unfitted_model(self.model)
    model_2 = self._clone_unfitted_model(self.model)

    # Train the models
    if self.train_fn is None:
        raise ValueError("You must provide a valid train_fn to train the model")
    model_1 = self.train_fn(model_1, X_train, y_train, self.train_params)
    model_2 = self.train_fn(model_2, X_train, y_train, self.train_params)

    # Check that at least one of eval_fn or predict_fn are specified
    if self.eval_fn is None and self.predict_fn is None:
        raise ValueError("At least one of eval_fn or predict_fn must be specified")

    if self.eval_fn is not None:
        # Check evaluation on train
        if self._check_diff_in_eval(model_1, model_2, X_train, y_train):
            raise FailedTestError(
                f"Eval metric different in train dataset when training two times ({round(self.eval_1,4)} vs {round(self.eval_2,4)}). "
                f"The max difference allowed is {round(self.threshold_eval, 4)} "
                f"The model is not reproducible."
            )
        # Check evaluation on test
        if (self.test_dataset is not None) and self._check_diff_in_eval(model_1, model_2, X_test, y_test):
            raise FailedTestError(
                f"Eval metric different in test dataset when training two times ({round(self.eval_1,4)} vs {round(self.eval_2,4)}). "
                f"The max difference allowed is {round(self.threshold_eval, 4)} "
                f"The model is not reproducible."
            )

    if self.predict_fn is not None:
        # Check evaluation on train
        if self._check_diff_in_predictions(model_1, model_2, X_train):
            raise FailedTestError(
                f"Percentage of different predictions in training set is {round(self.diff,4)} when training two times and the "
                f"maximum allowed is {round(self.threshold_yhat,4)} "
                f"The model is not reproducible."
            )
        # Check evaluation on test
        if (self.test_dataset is not None) and self._check_diff_in_predictions(model_1, model_2, X_test):
            raise FailedTestError(
                f"Percentage of different predictions in test set is {round(self.diff,4)} when training two times and the "
                f"maximum allowed is {round(self.threshold_yhat,4)} "
                f"The model is not reproducible."
            )

ModelSimplicityChecker(model, X_train, y_train, X_test, y_test, baseline_model=None, ignore_feats=None, task=None, eval_fn=None, predict_fn=None, threshold=None, name=None, encode_cat_feats=True, scale_num_feats=True, test_predictions=None, schema_custom_feature_map=None, dataset_schema=None, *args, **kwargs)

Bases: RobustModelTest, TaskInferrer

This test checks whether a trained model has a simple baseline that, trained on the same dataset, gives similar or better performance on a test dataset. If not specified, the baseline is a LogisticRegression model for classification tasks and a LinearRegression model for regression tasks.

Parameters:

- model (Union[BaseEstimator, Model], required): The trained model which we will compare against a baseline model.
- X_train (Union[DataFrame, np.ndarray], required): Features of the train dataset used to train the model. This same dataset will be used to train the baseline.
- y_train (Union[DataFrame, np.ndarray], required): Targets of the train dataset used to train the model. This same dataset will be used to train the baseline.
- X_test (Union[DataFrame, np.ndarray], required): Features of the test dataset which will be used to evaluate the model and the baseline.
- y_test (Union[DataFrame, np.ndarray], required): Targets of the test dataset which will be used to evaluate the model and the baseline.
- ignore_feats (List[str], default None): Features which won't be used in the baseline_model. Only use when X_train and X_test are pandas dataframes.
- baseline_model (BaseEstimator, default None): Optional model that will be used as a baseline. It doesn't have to be an sklearn model; however, it needs to implement the fit() and predict() methods. If not specified, a LogisticRegression is used in case of classification and a LinearRegression in case of regression.
- task (str, default None): Task of the dataset. It must be either 'classification' or 'regression'. If None, it will be auto-inferred from the target column.
- eval_fn (Callable[[np.ndarray], np.ndarray], default None): Function which returns a metric to compare the performance of the model against the baseline. The interface of the function is eval_fn(y_true, y_pred). Note that this test assumes that the higher the metric the better the model, therefore if you use a metric where lower means better, the eval_fn should return the negative of the metric. If not specified, the accuracy score will be used in case of classification and R2 in case of regression.
- predict_fn (Callable[[BaseEstimator], np.ndarray], default None): Custom predict function to obtain predictions from model. The interface of the function is predict_fn(model, X_test). Note that by default this is None, and in that case the test will try to obtain the predictions using the predict() method of the model. That works in many cases where you are using scikit-learn models and the predict function returns what the eval_fn is expecting. However, in some cases you might need to define a custom predict_fn. For example, when using a tf.keras model and the accuracy_score as eval_fn, the predict() method of the tf.keras model returns probabilities while the accuracy_score expects classes, so you can use predict_fn to obtain the classes. Another alternative is to pass the already computed predictions using the test_predictions parameter.
- test_predictions (np.array, default None): Array of predictions of the test set obtained by the model. If given, the test will use them instead of computing them using the predict() method of the model or the predict_fn. This might be useful when you are creating multiple tests and want to avoid computing the predictions each time.
- threshold (float, default None): The threshold to use when comparing the model and the baseline. It is used to establish the limit to consider that the baseline performs similarly to or better than the model. Concretely, if the baseline model performs worse than the model with a difference equal to or higher than this threshold, the test passes. Otherwise, if the baseline model performs better, or performs worse but with a difference lower than this threshold, the test fails.
- name (str, default None): A name for the test. If not used, it will take the name of the class.
- encode_cat_feats (bool, default True): Whether to encode categorical features as one-hot encoding. Note that if you set it to False and your dataset has string columns, the test will raise an exception since the default baseline models won't be able to deal with string columns.
- scale_num_feats (bool, default True): Whether to scale the numeric features. If True, a StandardScaler is used.
- schema_custom_feature_map (Dict[str, FeatType], default None): Internally, this test generates a DataSchema object. In case you find it makes wrong feature type assignments to your features, you can pass here a dictionary which specifies the feature type of the columns you want to fix (see the DataSchema force_types parameter for more info on this).
- dataset_schema (DataSchema, default None): Pre-built schema. This argument is complementary to schema_custom_feature_map. In case you want to manually build your own DataSchema object, you can pass it here and the test will internally use it instead of the default, automatically built one. If you provide this parameter, schema_custom_feature_map will not be used.
Example
>>> from mercury.robust.model_tests import ModelSimplicityChecker
>>> test = ModelSimplicityChecker(
>>>     model = model,
>>>     X_train = X_train,
>>>     y_train = y_train,
>>>     X_test = X_test,
>>>     y_test = y_test,
>>>     threshold = 0.02,
>>>     eval_fn = roc_auc_score
>>> )
>>> test.run()
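
As noted in the predict_fn description, a tf.keras classifier typically returns class probabilities from predict(), while metrics such as accuracy_score expect hard labels. A minimal sketch of such a predict_fn; keras_predict_classes and keras_model are illustrative names, not part of the library:

>>> import numpy as np
>>> def keras_predict_classes(model, X_test):
>>>     # turn predicted class probabilities into hard labels for the eval function
>>>     return np.argmax(model.predict(X_test), axis=1)
>>> test = ModelSimplicityChecker(
>>>     model = keras_model,
>>>     X_train = X_train,
>>>     y_train = y_train,
>>>     X_test = X_test,
>>>     y_test = y_test,
>>>     threshold = 0.02,
>>>     predict_fn = keras_predict_classes
>>> )
>>> test.run()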
Source code in mercury/robust/model_tests.py
def __init__(
    self, model: Union["BaseEstimator", "tf.keras.Model"],  # noqa: F821
    X_train: Union["pandas.DataFrame", np.ndarray],  # noqa: F821
    y_train: Union["pandas.Series", np.ndarray],  # noqa: F821
    X_test: Union["pandas.DataFrame", np.ndarray],  # noqa: F821
    y_test: Union["pandas.Series", np.ndarray],  # noqa: F821
    baseline_model: "BaseEstimator" = None,  # noqa: F821
    ignore_feats: List[str] = None,
    task: str = None,
    eval_fn: Callable[[np.ndarray], np.ndarray] = None,
    predict_fn: Callable[["BaseEstimator"], np.ndarray] = None,  # noqa: F821
    threshold: float = None,
    name: str = None,
    encode_cat_feats: bool = True,
    scale_num_feats: bool = True,
    test_predictions: np.array = None,
    schema_custom_feature_map: Dict[str, FeatType] = None,
    dataset_schema: DataSchema = None,
    *args, **kwargs
):

    super().__init__(model, name, *args, **kwargs)

    self.X_train = X_train
    self.y_train = y_train
    self.X_test = X_test
    self.y_test = y_test
    self.ignore_feats = ignore_feats if ignore_feats is not None else []

    self.task = task
    self.baseline_model = baseline_model

    self.threshold = threshold
    self.eval_fn = eval_fn
    self.predict_fn = predict_fn
    self.test_predictions = test_predictions

    self.encode_cat_feats = encode_cat_feats
    self.scale_num_feats = scale_num_feats

    self._schema_custom_feature_map = schema_custom_feature_map
    self._dataset_schema = dataset_schema

    self.metric_model = None
    self.metric_baseline_model = None

TreeCoverageTest(model, test_dataset, threshold_coverage=0.7, name=None, *args, **kwargs)

Bases: RobustModelTest

This test checks whether a given test_dataset covers a minimum percentage of all the branches of a tree. Use this in case you want to make sure no leaves are left unexplored when testing your model. In case the percentage of coverage is less than the required threshold_coverage, the test will fail.

Right now, this test only supports scikit-learn tree models, including sklearn pipelines with one tree model in one of its steps.

TODO: Add support for other frameworks such as lightgbm, catboost or xgboost.

Parameters:

- model (Union[DecisionTreeClassifier, DecisionTreeRegressor, RandomForestClassifier, RandomForestRegressor, Pipeline], required): Fitted tree-based model (or sklearn pipeline with a tree-based model) to inspect.
- test_dataset (DataFrame, required): Dataset for testing the coverage.
- threshold_coverage (float, default 0.7): The minimum percentage of branches that the test_dataset needs to cover in order to pass the test. E.g., if threshold_coverage=0.7, the test_dataset needs to cover at least 70% of the branches to pass the test.
- name (str, default None): A name for the test. If not used, it will take the name of the class.
Example
>>> testing_dataset = pd.DataFrame(...)
>>> rf = RandomForestClassifier().fit(train_data)
>>> test = TreeCoverageTest(
...    rf,
...    testing_dataset,
...    threshold_coverage=.8,
...    name="My Tree Coverage Test"
... )
>>> test.run()  # The test will fail if the obtained coverage is less than 80%
Source code in mercury/robust/model_tests.py
def __init__(
    self, model: Union["DecisionTreeClassifier", "DecisionTreeRegressor",
                       "RandomForestClassifier", "RandomForestRegressor",
                       "sklearn.pipeline.Pipeline"],  # noqa: F821
    test_dataset: "pd.DataFrame",  # noqa: F821,
    threshold_coverage: float = .7,
    name: str = None,
    *args, **kwargs
):
    super().__init__(model, name, *args, **kwargs)
    self.test_dataset = test_dataset
    self.threshold_coverage = threshold_coverage
    if isinstance(model, sklearn.pipeline.Pipeline):
        index_model = self._identify_tree_in_pipeline(model)
        self.test_dataset = self._execute_pipeline_until_step(model, index_model, self.test_dataset)
        model = model.steps[index_model][1]
    self._analyzer = self._get_analyzer(model)
    self.coverage = None

run(*args, **kwargs)

Run the test

Source code in mercury/robust/model_tests.py
def run(self, *args, **kwargs):
    """Run the test"""
    self._analyzer.analyze(self.test_dataset)
    coverage = self._analyzer.get_percent_coverage()
    self.coverage = coverage

    if coverage < self.threshold_coverage:
        raise FailedTestError(f"Achieved a coverage of {coverage} while the minimum required was {self.threshold_coverage}")