
Data Tests

mercury.robust.data_tests

CohortPerformanceTest(base_dataset, group_col, eval_fn, threshold, compare_max_diff=True, threshold_is_percentage=True, name=None, *args, **kwargs)

Bases: RobustDataTest

This test checks whether some metric performs poorly for a cohort of your data when compared with the other groups. The metric is specified with the argument eval_fn, which the user needs to define (see example), and it is computed for every group of the variable specified with group_col. When the argument compare_max_diff is True, the group with the maximum metric is compared with the group with the minimum metric; if the difference is higher than the specified threshold, the test fails. When compare_max_diff is False, the metric calculated for each group is compared against the mean of the whole dataset; if the difference between any group and the mean is higher than the threshold, the test fails. The argument threshold_is_percentage controls whether the threshold is compared with the absolute value of the difference (threshold_is_percentage=False) or with the relative (percentage) difference (threshold_is_percentage=True).

Parameters:

Name Type Description Default
base_dataset pandas.DataFrame

Dataset which will be evaluated. It must contain the specified group_col and the necessary columns to calculate the metric defined in eval_group_fn

required
group_col str

column name which contains the groups to be evaluated

required
eval_fn Callable

evaluation function which computes the metric to evaluate. It must return a float

required
threshold float

threshold to compare. If compare_max_diff is True and the difference between the maximum metric and minimum metric is higher than the threshold, then the test fails. If compare_max_diff is False and the difference between the mean metric of the dataset and the metric in any of the groups is higher than the threshold, then the test fails.

required
compare_max_diff bool

If True, then the comparison is between the group which has the max metric and the group which has the min metric. If False, then the comparison is between the mean of the whole dataset and all the other groups. Default value is True

True
threshold_is_percentage bool

If True, the computed differences are expressed as percentages, so the threshold represents a percentage. If False, the absolute values of the differences are used, so the threshold represents an absolute value. Default value is True.

True
name str

A name for the test. If not used, it will take the name of the class.

None
Example
>>> from sklearn.metrics import accuracy_score
>>> from mercury.robust.data_tests import CohortPerformanceTest
>>> def eval_acc(df):
>>>     return accuracy_score(df["y_true"], df["y_pred"])
>>> test1 = CohortPerformanceTest(
>>>     base_dataset=dataset, group_col="gender", eval_fn = eval_acc, threshold = 0.2, compare_max_diff=True
>>> )
>>> test1.run()
>>> # Check test result details
>>> test1.info()
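For reference, here is a minimal, self-contained sketch (made-up data and column names) of the same usage with compare_max_diff=True and an absolute threshold; the metric computed per group during run is kept in metric_by_group for inspection.

import pandas as pd
from sklearn.metrics import accuracy_score
from mercury.robust.data_tests import CohortPerformanceTest

# Hypothetical toy data: accuracy is perfect for "M" and zero for "F"
dataset = pd.DataFrame({
    "gender": ["M", "M", "F", "F"],
    "y_true": [1, 0, 1, 0],
    "y_pred": [1, 0, 0, 1],
})

def eval_acc(df):
    return accuracy_score(df["y_true"], df["y_pred"])

test = CohortPerformanceTest(
    base_dataset=dataset, group_col="gender", eval_fn=eval_acc,
    threshold=0.2, compare_max_diff=True, threshold_is_percentage=False,
)
try:
    test.run()
except Exception as err:  # a FailedTestError is raised when the gap exceeds the threshold
    print(test.metric_by_group)  # metric per group, as computed during run
    print(err)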
Source code in mercury/robust/data_tests.py
def __init__(
    self,
    base_dataset: "pandas.DataFrame",  # noqa: F821
    group_col: str,
    eval_fn: Callable,
    threshold: float,
    compare_max_diff: bool = True,
    threshold_is_percentage: bool = True,
    name: str = None,
    *args: Any,
    **kwargs: Any
):
    super().__init__(base_dataset, name, *args, **kwargs)
    self.group_col = group_col
    self.eval_fn = eval_fn
    self.threshold = threshold
    self.compare_max_diff = compare_max_diff
    self.threshold_is_percentage = threshold_is_percentage
    self.metric_by_group = None

run(*args, **kwargs)

Runs the test.

Source code in mercury/robust/data_tests.py
def run(self, *args, **kwargs):
    """
    Runs the test.

    Raises:
        FailedTestError if metric difference above the threshold is detected
    """
    super().run(*args, **kwargs)

    self.metric_by_group = self.base_dataset.groupby(self.group_col).apply(lambda x: self.eval_fn(x))

    if self.compare_max_diff:
        self._compare_max_diff()
    else:
        self._compare_diff_with_mean()

DriftTest(base_dataset, schema_ref=None, drift_detector_args=None, name=None, *args, **kwargs)

Bases: RobustDataTest

This test ensures the distributions of a new dataset, or a new batch of data, are not too different from a reference dataset (i.e. the training data). In other words, it checks that no feature has drifted.

If drift is detected on any of the features, it raises an exception and the test fails.

Parameters:

Name Type Description Default
base_dataset DataFrame

Dataset which will be evaluated

required
schema_ref Union[str, DataSchema]

Schema the base_dataset will be evaluated against. It can be either a DataSchema object or a string. In the case of the latter, it must be a path to a previously serialized schema.

None
drift_detector_args dict

Dictionary with arguments passed to the internal drift detectors: HistogramDistanceDrift for continuous features and Chi2Drift for the categoricals.

None
name str

A name for the test. If not used, it will take the name of the class.

None
Example
>>> from mercury.dataschema import DataSchema
>>> schema_reference = DataSchema().generate(df_train).calculate_statistics()
>>> from mercury.robust.data_tests import DriftTest
>>> test = DriftTest(df_inference, schema_reference)
>>> test.run()
>>> # Check test result details
>>> test.info()
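If the defaults are too strict or too loose, the internal detectors can be tuned through drift_detector_args; the keys used below are among the defaults listed in __init__, and the per-detector results stay available after the run. A hedged sketch reusing the df_train / df_inference frames from the example above:

from mercury.dataschema import DataSchema
from mercury.robust.data_tests import DriftTest

schema_reference = DataSchema().generate(df_train).calculate_statistics()

test = DriftTest(
    df_inference,
    schema_ref=schema_reference,
    # overrides are merged into the internal defaults shown in __init__ below
    drift_detector_args={"p_val": 0.05, "n_permutations": 200},
)
try:
    test.run()
except Exception as err:  # a FailedTestError names the drifted features
    print(err)

# Raw detector outputs, useful even when the test passes
print(test.continuous_drift_metrics)
print(test.cat_drift_metrics)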
Source code in mercury/robust/data_tests.py
def __init__(self,
             base_dataset: "pandas.DataFrame",  # noqa: F821
             schema_ref: Union[str, DataSchema] = None,
             drift_detector_args: dict = None,
             name: str = None,
             *args: Any, **kwargs: Any):
    super().__init__(base_dataset, name, *args, **kwargs)
    self.schema_ref = schema_ref

    # Default args for ALL detectors. Depending on the implementation, only the
    # necessary ones will be used.
    self._detector_args = dict(
        distance_metric="hellinger",
        correction="bonferroni",
        n_runs=3,
        test_size=0.3,
        p_val=0.01,
        n_permutations=100
    )

    if isinstance(drift_detector_args, dict):
        self._detector_args.update(drift_detector_args)

    if type(schema_ref) == str:
        self.schema_ref = DataSchema.load(schema_ref)

    self.continuous_drift_metrics = None
    self.cat_drift_metrics = None

run(*args, **kwargs)

Runs the test.

Source code in mercury/robust/data_tests.py
def run(self, *args, **kwargs):
    """
    Runs the test.

    Raises:
        FailedTestError if any drift is detected
    """
    from mercury.monitoring.drift.histogram_distance_drift_detector import HistogramDistanceDrift
    from mercury.monitoring.drift.chi2_drift_detector import Chi2Drift

    super().run(*args, **kwargs)

    schema_ref = self.schema_ref

    # Test numerical features with HistogramDistance
    featnames = schema_ref.continuous_feats + schema_ref.discrete_feats

    if len(featnames) > 0:

        src_histograms, tgt_histograms = self._get_features_histograms(featnames)

        detector_args = {
            'distance_metric': self._detector_args['distance_metric'],
            'correction': self._detector_args['correction'],
            'p_val': self._detector_args['p_val'],
            'n_permutations': self._detector_args['n_permutations']
        }

        drift_detector = HistogramDistanceDrift(
            distr_src=src_histograms,
            distr_target=tgt_histograms,
            features=featnames,
            **detector_args
        )

        self.continuous_drift_metrics = drift_detector.calculate_drift()

        if self.continuous_drift_metrics['drift_detected']:
            raise FailedTestError(f"Test failed. Drift was detected on the following features: {drift_detector.get_drifted_features()}")

    # Test categorical feats with Chi2
    featnames = schema_ref.binary_feats + schema_ref.categorical_feats
    if len(featnames) > 0:

        src_histograms, tgt_histograms = self._get_features_histograms(featnames)

        chi2_args = {
            'correction': self._detector_args['correction'],
            'p_val': self._detector_args['p_val'],
        }

        drift_detector = Chi2Drift(
            distr_src=src_histograms,
            distr_target=tgt_histograms,
            features=featnames,
            **chi2_args
        )

        self.cat_drift_metrics = drift_detector.calculate_drift()

        if self.cat_drift_metrics['drift_detected']:
            raise FailedTestError(f"Test failed. Drift was detected on the following features: {drift_detector.get_drifted_features()}")

LabelLeakingTest(base_dataset, label_name, task=None, threshold=None, metric=None, ignore_feats=None, schema_custom_feature_map=None, dataset_schema=None, handle_str_cols='error', name=None, *args, **kwargs)

Bases: RobustDataTest

This test ensures the target variable is not being leaked into the predictors (i.e. no feature has a strong relationship with the target). For this, it uses the TreeSelector class, which tries to predict the target variable from each one of the features individually. If the target is easily predicted from a particular feature, that feature is very important. By default, performance is measured by ROC-AUC for classification problems and R^2 for regression.

After the test has been executed, two attributes are made available for further inspection.

1) `importances_`: dictionary with the feature importances. The closer a
   feature's score is to 1 (the ideal value), the more important it is.
   If any feature scores above 1 - threshold, the test will fail.
2) `_selector`: Fitted TreeSelector used for the test.

Parameters:

Name Type Description Default
base_dataset DataFrame

Dataset which will be evaluated. Features must be numerical (i.e. no strings/objects)

required
label_name str

column which represents the target label. If it's multi-categorical, it must be label-encoded.

required
task str

Task of the dataset. It must be either 'classification' or 'regression'. If None is provided, it will be auto-inferred from the label_name column.

None
threshold float

Custom threshold used when deciding whether a particular feature is too important. The score compared against it is 1 minus the computed metric for that feature, so if the score obtained for a feature is below the threshold, the feature is considered "too important" and the test fails.

None
metric Callable

Metric to be used by the internal TreeSelector. If None, ROC AUC will be used for classification and R2 for regression.

None
ignore_feats List[str]

Features which will not be tested.

None
schema_custom_feature_map Dict[str, FeatType]

Internally, this test generates a DataSchema object. If you find it makes wrong feature type assignments for your features, you can pass here a dictionary which specifies the feature type of the columns you want to fix (see the DataSchema force_types parameter for more details).

None
dataset_schema DataSchema

Pre-built schema. This argument is complementary to schema_custom_feature_map. If you want to manually build your own DataSchema object, you can pass it here and the test will use it internally instead of the automatically built default. If you provide this parameter, schema_custom_feature_map will not be used.

None
handle_str_cols str

indicates how to handle string columns in the dataset. If 'error', a ValueError exception is raised when the dataset contains string columns. If 'transform', the string columns are transformed to integers using a LabelEncoder. Default value is 'error'.

'error'
name str

A name for the test. If not used, it will take the name of the class.

None
Example
>>> from mercury.robust.data_tests import LabelLeakingTest
>>> test = LabelLeakingTest(
>>>     train_df,
>>>     label_name = "y",
>>>     task = "classification",
>>>     threshold = 0.05,
>>> )
>>> test.run()
>>> # Check test result details
>>> test.info()
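Whether or not the test fails, the importances_ dictionary can be ranked afterwards to see which features come closest to the ideal value of 1. A self-contained sketch with synthetic data and a deliberately leaking column (names are hypothetical):

import numpy as np
import pandas as pd
from mercury.robust.data_tests import LabelLeakingTest

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
train_df = pd.DataFrame({
    "x1": rng.normal(size=500),                     # unrelated feature
    "leak": y + rng.normal(scale=0.01, size=500),   # almost a copy of the target
    "y": y,
})

test = LabelLeakingTest(train_df, label_name="y", task="classification", threshold=0.05)
try:
    test.run()
except Exception as err:  # a FailedTestError lists the suspicious features
    print(err)

# Scores close to 1 mean the target is almost perfectly predicted from that single feature
print(sorted(test.importances_.items(), key=lambda kv: kv[1], reverse=True))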
Source code in mercury/robust/data_tests.py
def __init__(self,
             base_dataset: "pandas.DataFrame",  # noqa: F821
             label_name: str,
             task: str = None,
             threshold: float = None,
             metric: Callable = None,
             ignore_feats: List[str] = None,
             schema_custom_feature_map: Dict[str, FeatType] = None,
             dataset_schema: DataSchema = None,
             handle_str_cols: str = "error",
             name: str = None,
             *args: Any, **kwargs: Any):
    super().__init__(base_dataset, name, *args, **kwargs)

    self.label_name = label_name
    self.task = task
    self.threshold = threshold
    self.ignore_feats = ignore_feats
    self.metric = metric
    self.importances_ = None
    self._schema_custom_feature_map = schema_custom_feature_map
    self._base_schema = dataset_schema
    self.handle_str_cols = handle_str_cols

run(*args, **kwargs)

Runs the test.

Source code in mercury/robust/data_tests.py
def run(self, *args, **kwargs):
    """
    Runs the test.

    Raises:
        FailedTestError if any of the checks fail.
    """
    from ._treeselector import TreeSelector
    from sklearn.metrics import r2_score, roc_auc_score

    super().run(*args, **kwargs)

    base_dataset = self.base_dataset

    if self.ignore_feats is not None:
        base_dataset = base_dataset.loc[:, [c for c in base_dataset.columns if c not in self.ignore_feats]]

    label_name = self.label_name
    threshold = self.threshold
    metric = self.metric

    if self._base_schema is None:
        schema = DataSchema().generate(base_dataset, force_types=self._schema_custom_feature_map)
    else:
        schema = self._base_schema

    if not self.task:
        if label_name in schema.binary_feats or label_name in schema.categorical_feats:
            self.task = 'classification'
        if label_name in schema.discrete_feats or label_name in schema.continuous_feats:
            self.task = 'regression'

    task = self.task

    if task not in ('classification', 'regression'):
        raise ValueError("Error. 'task' must be either classification or regression")

    # This threshold has been found to be working well in most tested cases for
    # R2 and AUC.
    threshold = 0.05 if not threshold else threshold
    if task == 'classification':
        metric = roc_auc_score if metric is None else metric
    else:
        metric = r2_score if metric is None else metric

    # The TreeSelector used later doesn't support string columns. Raise error or transform them
    str_cols = schema.get_features_by_type(datatype=DataType.STRING)
    if len(str_cols) > 0:
        if self.handle_str_cols.lower() == 'error':
            raise ValueError(
                "String column found. Set 'handle_str_cols' parameter to 'transform' to automatically transform "
                "string columns to integers or transform them before using this test."
            )
        elif self.handle_str_cols.lower() == 'transform':
            # We need to create a copy in this case to avoid to modify the original dataset
            base_dataset = base_dataset.copy()
            for col in str_cols:
                if col in base_dataset.columns:
                    base_dataset.loc[:, col] = LabelEncoder().fit_transform(base_dataset.loc[:, col])
        else:
            raise ValueError("Wrong value for 'handle_str_cols' parameter. Set to 'error' or 'transform'")

    # Separate features from target
    X = base_dataset.loc[:, [f for f in base_dataset.columns if f != label_name]]
    y = base_dataset.loc[:, label_name]

    selector = TreeSelector(task, metric=metric)
    selector.fit(X, y)

    # Store selector for inspection. Mainly debugging purposes
    self._selector = selector

    # We calculate our "importance" scores depending on whether we have a
    # classification or regression dataset.
    importances = selector.feature_importances

    # By default,  we ensure the best performing feature (i.e.
    # the one with highest metric) is not too close to one (the best
    # possible value for classification and regression / AUC or R^2).
    metrics = 1 - np.clip(np.array(importances['metrics']), 0, 1)
    high_importance_feats = metrics < threshold

    # Store computed importances for post-analysis
    ids = X.columns[selector.feature_importances['features']].tolist()
    self.importances_ = dict()
    for i, val in enumerate(ids):
        self.importances_[val] = 1 - metrics[i]  # Recover original metric instead of its inverse

    # If any of the features is too important the test fails.
    if high_importance_feats.any():
        idxs = selector.feature_importances['features'][high_importance_feats]
        names = X.columns[idxs].tolist()
        raise FailedTestError((
            f"Test failed because high importance features were detected: {names}. "
            "Check for possible target leaking."
        ))

LinearCombinationsTest(base_dataset, schema_custom_feature_map=None, dataset_schema=None, name=None, *args, **kwargs)

Bases: RobustDataTest

This test ensures a certain dataset doesn't have any linear combinations among its numerical columns and that no categorical variable is redundant. See the following functions for further details:

  • For numerical features: _lin_combs_in_columns.lin_combs_in_columns
  • For categorical features: _CategoryStruct.individually_redundant

In case any combination or redundancy is detected, the test fails.

Parameters:

Name Type Description Default
base_dataset DataFrame

Dataset which will be evaluated

required
schema_custom_feature_map Dict[str, FeatType]

Internally, this test generates a DataSchema object. If you find it makes wrong feature type assignments for your features, you can pass here a dictionary which specifies the feature type of the columns you want to fix (see the DataSchema force_types parameter for more details).

None
dataset_schema DataSchema

Pre-built schema. This argument is complementary to schema_custom_feature_map. If you want to manually build your own DataSchema object, you can pass it here and the test will use it internally instead of the automatically built default. If you provide this parameter, schema_custom_feature_map will not be used.

None
name str

A name for the test. If not used, it will take the name of the class.

None
Example
>>> from mercury.robust.data_tests import LinearCombinationsTest
>>> test = LinearCombinationsTest(df_train)
>>> test.run()
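As an illustration, a self-contained sketch (synthetic data, hypothetical column names) in which one numerical column is an exact linear combination of two others, so the test is expected to fail:

import numpy as np
import pandas as pd
from mercury.robust.data_tests import LinearCombinationsTest

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
df["c"] = 2 * df["a"] + 3 * df["b"]  # exact linear combination of "a" and "b"

test = LinearCombinationsTest(df)
try:
    test.run()
except Exception as err:  # a FailedTestError reports that linear combinations were encountered
    print(err)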
Source code in mercury/robust/data_tests.py
def __init__(self,
             base_dataset: "pandas.DataFrame",  # noqa: F821
             schema_custom_feature_map: Dict[str, FeatType] = None,
             dataset_schema: DataSchema = None,
             name: str = None, *args: Any, **kwargs: Any):
    super().__init__(base_dataset, name, *args, **kwargs)
    self._schema_custom_feature_map = schema_custom_feature_map
    self._dataset_schema = dataset_schema

run(*args, **kwargs)

Runs the test.

Source code in mercury/robust/data_tests.py
def run(self, *args, **kwargs):
    """
    Runs the test.

    Raises:
        FailedTestError if any of the checks fail.
    """
    from ._linalg import lin_combs_in_columns
    from ._category_struct import CategoryStruct

    super().run(*args, **kwargs)

    if self._dataset_schema is None:
        current_schema = DataSchema().generate(self.base_dataset, force_types=self._schema_custom_feature_map)
    else:
        current_schema = self._dataset_schema

    lin_combinations = None
    numeric_feats = current_schema.continuous_feats + current_schema.discrete_feats
    if len(numeric_feats) > 0:
        cont_cols = self.base_dataset.loc[:, numeric_feats].values

        lin_combinations = lin_combs_in_columns(
            np.matmul(cont_cols.T, cont_cols)  # Compress original matrix
        )

    if lin_combinations is not None:
        raise FailedTestError("Test failed. Linear combinations for continuous features were encountered.")

    individually_redundant = CategoryStruct.individually_redundant(self.base_dataset, current_schema.categorical_feats)
    if len(individually_redundant) > 0:
        raise FailedTestError((
            f"""Test failed. Any of these categorical variables is redundant: {individually_redundant}. """
            """Try deleting any of them."""
        ))

NoDuplicatesTest(dataset, ignore_feats=None, name=None, use_hash=None, *args, **kwargs)

Bases: RobustDataTest

This test checks that no duplicated samples are present in a dataframe. Duplicated samples could bias your model and/or your evaluation metrics.

Parameters:

Name Type Description Default
dataset pandas.DataFrame

Dataset in which duplicates will be looked for.

required
ignore_feats List[str]

List of columns that will be ignored when evaluating whether two samples are equal or not

None
use_hash bool

If True, it will create a hash of the rows in the dataframes in order to find duplicates. If False, it will use the pandas duplicated() method. Using hashes usually results in faster execution on bigger datasets. If None, it will pick one method or the other depending on the available memory.

None
name str

A name for the test. If not used, it will take the name of the class.

None
Example
>>> from mercury.robust.data_tests import NoDuplicatesTest
>>> test = NoDuplicatesTest(mydataframe)
>>> test.run()
>>> # Check test result details
>>> test.info()
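A minimal, self-contained sketch (synthetic data): two rows are exact copies of the first one, so the test is expected to fail and report the count.

import pandas as pd
from mercury.robust.data_tests import NoDuplicatesTest

df = pd.DataFrame({"a": [1, 1, 2, 1], "b": ["x", "x", "y", "x"]})  # rows 1 and 3 duplicate row 0

test = NoDuplicatesTest(df, use_hash=True)  # force the hash-based strategy
try:
    test.run()
except Exception as err:  # a FailedTestError with the number of duplicates
    print(err)
    df = df.drop_duplicates()  # one possible way to act on the result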
Source code in mercury/robust/data_tests.py
def __init__(
    self,
    dataset: "pandas.DataFrame",  # noqa: F821
    ignore_feats: List[str] = None,
    name: str = None,
    use_hash: bool = None,
    *args: Any, **kwargs: Any
):
    super().__init__(dataset, name, *args, **kwargs)
    self.ignore_feats = ignore_feats if ignore_feats else []
    if use_hash is None:
        use_hash = psutil.virtual_memory().percent > 50.0
    self.use_hash = use_hash
    self._num_duplicates = None

run(*args, **kwargs)

Runs the test.

Source code in mercury/robust/data_tests.py
def run(self, *args, **kwargs):
    """
    Runs the test.

    Raises:
        FailedTestError if there exist duplicated samples in the provided dataframe
    """
    dataset = self.base_dataset[[x for x in self.base_dataset.columns if x not in self.ignore_feats]]
    if not self.use_hash:
        duplicates = dataset.duplicated()
    else:
        hashes = pd.util.hash_pandas_object(dataset, index=False)
        duplicates = hashes.duplicated()

    self._num_duplicates = duplicates.sum()
    self._index_duplicates = duplicates[duplicates].index

    if self._num_duplicates > 0:
        raise FailedTestError(
            f"Your dataset has {self._num_duplicates} duplicates. Drop or inspect them via the `info` method"
        )

NoisyLabelsTest(base_dataset, label_name, threshold=None, text_col=None, preprocessor=None, ignore_feats=None, calculate_idx_issues=True, label_issues_args=None, name=None, schema_custom_feature_map=None, dataset_schema=None, *args, **kwargs)

Bases: RobustDataTest

This test checks whether the labels of a dataset contain a high level of noise. Internally, it uses the cleanlab library to obtain the samples in the dataset whose labels are considered noisy. Noisy labels can arise because a sample is incorrectly labelled, because a sample could belong to several label categories, or for some other reason. If the percentage of samples with noisy labels is higher than a specified threshold, the test fails. It can be used with tabular datasets and with text datasets; in the case of a text dataset, the argument text_col must be specified. This test only works for classification tasks.

After the test has been executed, two attributes are made available for further inspection.

1) `idx_issues_`: indices of the labels detected as issues (only available when `calculate_idx_issues` is True)
2) `rate_issues_`: percentage of labels detected containing possible issues.

IMPORTANT: If the test reports convergence problems or takes too long, you can change the model used in the algorithm (see the example below). Sometimes the default logistic regression might not converge on some datasets; in those cases, changing the solver is usually enough to solve the problem.

Parameters:

Name Type Description Default
base_dataset pandas.DataFrame

Dataset which will be evaluated.

required
label_name str

column which represents the target label.

required
threshold float

threshold to specify the percentage of noisy labels from which the test will fail. Default value is 0.4

None
text_col str

column containing the text. Only has to be specified when using a text dataset.

None
preprocessor Union[TransformerMixin, BaseEstimator]

Optional preprocessor for the features. If not specified, it will create a preprocessor with a OneHotEncoder to encode categorical features in the case of tabular data (text_col not specified), or a CountVectorizer to encode text in the case of a text dataset (text_col specified).

None
ignore_feats List[str]

features that will not be used in the algorithm to detect noisy labels.

None
calculate_idx_issues bool

whether to also calculate the indices of the samples with label issues. If True, the indices will be available in the idx_issues_ attribute. Default value is True.

True
label_issues_args dict

arguments for the algorithm that detects noisy labels. See the documentation of _label_cleaning.get_label_issues for all available arguments.

None
schema_custom_feature_map Dict[str, FeatType]

Internally, this test generates a DataSchema object. If you find it makes wrong feature type assignments for your features, you can pass here a dictionary which specifies the feature type of the columns you want to fix (see the DataSchema force_types parameter for more details).

None
dataset_schema DataSchema

Pre-built schema. This argument is complementary to schema_custom_feature_map. If you want to manually build your own DataSchema object, you can pass it here and the test will use it internally instead of the automatically built default. If you provide this parameter, schema_custom_feature_map will not be used.

None
name str

A name for the test. If not used, it will take the name of the class.

None
Example
>>> from sklearn.linear_model import LogisticRegression
>>> from mercury.robust.data_tests import NoisyLabelsTest
>>> test = NoisyLabelsTest(
>>>     base_dataset=train_df,
>>>     label_name="label_col",
>>>     text_col = "text_col,
>>>     threshold = 0.2,
>>>     label_issues_args={"clf": LogisticRegression(solver='sag')}
>>> )
>>> test.run()
>>> # Percentage of samples with possible label issues
>>> test.rate_issues_
>>> # Access to samples with possible label issues
>>> train_df.iloc[test.idx_issues_]
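Because idx_issues_ may be either a boolean mask or an array of row positions (the run method handles both), a post-run inspection can be written as in this hedged sketch, reusing the test object and train_df from the example above and catching the failure instead of letting it propagate:

try:
    test.run()
except Exception as err:  # a FailedTestError when rate_issues_ exceeds the threshold
    print(err)

print(f"{test.rate_issues_:.1%} of the labels look noisy")

# idx_issues_ may be a boolean mask or an array of row positions
if test.idx_issues_.dtype == bool:
    suspicious = train_df[test.idx_issues_]
else:
    suspicious = train_df.iloc[test.idx_issues_]
print(suspicious.head())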
Source code in mercury/robust/data_tests.py
def __init__(self,
             base_dataset: "pandas.DataFrame",  # noqa: F821
             label_name: str,
             threshold: float = None,
             text_col: str = None,
             preprocessor: Union["TransformerMixin", "BaseEstimator"] = None,  # noqa: F821
             ignore_feats: List[str] = None,
             calculate_idx_issues: bool = True,
             label_issues_args: dict = None,
             name: str = None,
             schema_custom_feature_map: Dict[str, FeatType] = None,
             dataset_schema: DataSchema = None,
             *args: Any, **kwargs: Any):
    super().__init__(base_dataset, name, *args, **kwargs)

    self.label_name = label_name
    self.threshold = 0.4 if not threshold else threshold
    self.preprocessor = preprocessor
    self.text_col = text_col
    self.ignore_feats = ignore_feats
    self.calculate_idx_issues = calculate_idx_issues
    self.rate_issues_ = None
    self._schema_custom_feature_map = schema_custom_feature_map
    self._dataset_schema = dataset_schema

    # Default args for function to get label issues
    self.label_issues_args = dict(
        clf=None,
        n_folds=5,
        frac_noise=1.0,
        num_to_remove_per_class=None,
        prune_method='prune_by_noise_rate',
        sorted_index_method=None,
        n_jobs=None,
        seed=None
    )

    if isinstance(label_issues_args, dict):
        self.label_issues_args.update(label_issues_args)

run(*args, **kwargs)

Runs the test.

Source code in mercury/robust/data_tests.py
def run(self, *args, **kwargs):
    """
    Runs the test.

    Raises:
        FailedTestError if noise above threshold is detected
    """
    from ._label_cleaning import get_label_issues, get_confident_joint

    super().run(*args, **kwargs)

    base_dataset = self.base_dataset

    if self.ignore_feats is not None:
        base_dataset = base_dataset.loc[:, [c for c in base_dataset.columns if c not in self.ignore_feats]]

    if self._dataset_schema is None:
        current_schema = DataSchema().generate(base_dataset, force_types=self._schema_custom_feature_map)
    else:
        current_schema = self._dataset_schema

    # Separate features from target
    X = base_dataset.loc[:, [f for f in base_dataset.columns if f != self.label_name]]
    y = base_dataset.loc[:, self.label_name].values

    # Preprocess features
    X = self._preprocess_features(X, current_schema)

    if self.calculate_idx_issues:
        # Get label issues
        self.idx_issues_, _ = get_label_issues(
            X=X,
            y=y,
            **self.label_issues_args
        )

        # Obtain number of label issues
        if self.idx_issues_.dtype == bool:
            self.num_issues = self.idx_issues_.sum()
        else:
            self.num_issues = len(self.idx_issues_)

    else:
        # Compute only confident joint and calculate number of issues as the sum of non-diagonal elements
        confident_join = get_confident_joint(
            X, y, n_folds=self.label_issues_args["n_folds"], clf=self.label_issues_args["clf"], seed=self.label_issues_args["seed"]
        )
        self.num_issues = confident_join.sum() - np.trace(confident_join)

    # Calculate percentage of obtained labels with issues
    self.rate_issues_ = self.num_issues / len(base_dataset)

    # If percentage of issues is higher than threshold, throw exception
    if self.rate_issues_ > self.threshold:
        raise FailedTestError(
            f"Test failed. High level of noise detected in labels. Percentage of labels with noise: {self.rate_issues_}"
        )

SameSchemaTest(base_dataset, schema_ref, custom_feature_map=None, name=None, *args, **kwargs)

Bases: RobustDataTest

This test ensures a certain dataset, or new batch of samples, follows a previously defined schema.

On the run call, it performs a schema validation. If something isn't right, it raises an exception and the test fails.

Parameters:

Name Type Description Default
base_dataset DataFrame

Dataset which will be evaluated

required
schema_ref Union[str, DataSchema]

Schema the base_dataset will be evaluated against. It can be either a DataSchema object or a string. In the case of the latter, it must be a path to a previously serialized schema.

required
custom_feature_map Dict[str, FeatType]

Dictionary mapping column names to their FeatType. You can provide this in case the DataSchema wrongly infers the types for the new data batch. If this is None, the types will be auto-inferred.

None
name str

A name for the test. If not used, it will take the name of the class.

None
Example
>>> from mercury.dataschema import DataSchema
>>> schema_reference = DataSchema().generate(df_train).calculate_statistics()
>>> from mercury.robust.data_tests import SameSchemaTest
>>> test = SameSchemaTest(df_inference, schema_reference)
>>> test.run()
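schema_ref may also be a path string pointing to a previously serialized schema, in which case the test loads it with DataSchema.load() internally. A hedged sketch, reusing df_inference from the example above and a hypothetical schema path:

from mercury.robust.data_tests import SameSchemaTest

# "schemas/train_schema.json" is a hypothetical path to a serialized DataSchema
test = SameSchemaTest(df_inference, schema_ref="schemas/train_schema.json")
try:
    test.run()
except Exception as err:  # a FailedTestError wrapping the schema validation error
    print(err)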
Source code in mercury/robust/data_tests.py
def __init__(self,
             base_dataset: "pandas.DataFrame",  # noqa: F821
             schema_ref: Union[str, DataSchema],
             custom_feature_map: Dict[str, FeatType] = None,
             name: str = None,
             *args: Any, **kwargs: Any):
    super().__init__(base_dataset, name, *args, **kwargs)
    self.schema_ref = schema_ref
    self.custom_feature_map = custom_feature_map

    if type(schema_ref) == str:
        self.schema_ref = DataSchema.load(schema_ref)

run(*args, **kwargs)

Runs the test.

Source code in mercury/robust/data_tests.py
def run(self, *args, **kwargs):
    """
    Runs the test.

    Raises:
        FailedTestError if any of the checks fail.
    """
    super().run(*args, **kwargs)
    current_schema = DataSchema().generate(self.base_dataset, force_types=self.custom_feature_map)
    try:
        self.schema_ref.validate(current_schema)
    except RuntimeError as e:
        raise FailedTestError(str(e))

SampleLeakingTest(base_dataset, test_dataset, ignore_feats=None, threshold=0, name=None, use_hash=None, *args, **kwargs)

Bases: RobustDataTest

This test checks whether there are samples in the test dataset that are identical to samples in the base/train dataset. Two samples are considered the same if they contain the same values for all the columns. If the number of duplicated samples is higher than allowed by the threshold, the test fails.

Parameters:

Name Type Description Default
base_dataset pandas.DataFrame

Training/Base dataset.

required
test_dataset pandas.DataFrame

Test dataset in which we look for samples that also appear in the base_dataset. It must contain the same columns as the base_dataset.

required
ignore_feats List

List of the names of the columns to ignore when comparing samples.

None
threshold Union[float, int]

maximum percentage (float) or number (int) of samples that are allowed to be duplicated. The default is 0.

0
use_hash bool

If True, it will create a hash of the rows in the dataframes in order to find duplicates. If False, it will use the pandas duplicated() method. Using hashes usually results in faster execution on bigger datasets. If None, it will pick one method or the other depending on the available memory.

None
name str

A name for the test. If not used, it will take the name of the class.

None
Example
>>> from mercury.robust.data_tests import SampleLeakingTest
>>> test = SampleLeakingTest(base_dataset=my_train_dataframe, test_dataset=my_test_dataset)
>>> test.run()
>>> # Check test result details
>>> test.info()
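A minimal, self-contained sketch (synthetic data): one of the test rows also appears in the training set, so with the default threshold of 0 the test is expected to fail.

import pandas as pd
from mercury.robust.data_tests import SampleLeakingTest

train = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "y", "z", "w"]})
test_set = pd.DataFrame({"a": [2, 5], "b": ["y", "v"]})  # (2, "y") also appears in train

test = SampleLeakingTest(base_dataset=train, test_dataset=test_set)
try:
    test.run()
except Exception as err:  # a FailedTestError with the number and proportion of leaked samples
    print(err)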
Source code in mercury/robust/data_tests.py
def __init__(
    self,
    base_dataset: "pandas.DataFrame",  # noqa: F821
    test_dataset: "pandas.DataFrame",  # noqa: F821
    ignore_feats: List[str] = None,
    threshold: Union[int, float] = 0,
    name: str = None,
    use_hash: bool = None,
    *args: Any, **kwargs: Any
):
    super().__init__(base_dataset, name, *args, **kwargs)
    self.test_dataset = test_dataset
    self.ignore_feats = ignore_feats if ignore_feats else []
    self.threshold = threshold
    self.use_hash = use_hash
    self._is_duplicated_test = None

run(*args, **kwargs)

Runs the test.

Source code in mercury/robust/data_tests.py
def run(self, *args, **kwargs):
    """
    Runs the test.

    Raises:
        FailedTestError if samples in test set existing in train are above threshold
    """
    self._verify_datasets()
    self._is_duplicated_test = self._find_sample_leaking()

    if self._high_number_of_duplicates():
        raise FailedTestError(
            f"Num of samples in test set that appear in train set is {self._is_duplicated_test.sum()} "
            f"(a proportion of {round(self._is_duplicated_test.sum() / len(self.test_dataset), 3)} test samples )"
            f"and the max allowed is {self.threshold}")