
Data Schema

mercury.dataschema

anonymize

Anonymize(digest_bits=96, safe_crypto=False)

Cryptographically secure anonymization.

This class encrypts or hashes lists of strings using cryptographically secure, standardized algorithms. It can be used with a user-defined key or without one, in which case it produces identical hashes across different platforms.

The key can be given at construction time by setting the environment variable MERCURY_ANONYMIZE_DATASCHEMA_KEY or at any later time by calling the .set_key() method.

Parameters:

Name Type Description Default
digest_bits

This determines the length, in (effective) bits, of the output hash. Because the output is base64-encoded, the number of characters is this value divided by 6; e.g., 96 (the default) produces 16-character hashes. If this is set to a value other than zero, the output length is fixed, the output is irreversible (it cannot be used with .deanonymize_list()) and the hashing algorithm is keyed BLAKE2 (https://www.blake2.net/). If this is set to zero, you get a variable-length secure encryption using AES in Galois/Counter Mode (see the argument safe_crypto), and the result can be deanonymized with the same key using .deanonymize_list().

96
safe_crypto

This argument selects how the encryption is randomized. If True, the same original text with the same key produces a different encrypted text each time. Note that this will change the cardinality of the set of values to the length of the list. If False (the default), the same text produces the same output with the same key. This preserves cardinality, but can be a target of attacks when the attacker has access to encoded pairs.

False
Source code in mercury/dataschema/anonymize.py
def __init__(self, digest_bits=96, safe_crypto=False):
    self.digest_bits = digest_bits
    self.safe_crypto = safe_crypto

    plain_key = os.environ.get('MERCURY_ANONYMIZE_DATASCHEMA_KEY')
    plain_key = '<void>' if plain_key is None else plain_key

    hash_key = hashes.Hash(hashes.BLAKE2s(32))

    hash_key.update(plain_key.encode('utf-8'))

    self.hash_key = hash_key.finalize()[0:16]
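Example (a minimal sketch; the key string is illustrative). With the default digest_bits=96 the object produces fixed-length, irreversible hashes; with digest_bits=0 it produces reversible AES-GCM ciphertexts:

from mercury.dataschema.anonymize import Anonymize

# Keyed hashing (irreversible): 96 effective bits -> 16 base64 characters.
hasher = Anonymize(digest_bits=96)
hasher.set_key('my-secret-key')                   # illustrative key
print(hasher.anonymize_list(['alice', 'bob']))    # two 16-char hashes, stable across runs

# Reversible encryption: digest_bits=0 selects AES-GCM.
cipher = Anonymize(digest_bits=0)
cipher.set_key('my-secret-key')
tokens = cipher.anonymize_list(['alice', 'bob'])
print(cipher.deanonymize_list(tokens))            # ['alice', 'bob']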
anonymize_list(list_of_str)

Anonymize a list of strings.

This hashes or encrypts a list of strings. The precise function is defined at object construction. (See the doc of the class Anonymize for details.)

Parameters:

Name Type Description Default
list_of_str

A list of strings to be anonymized.

required

Returns:

Type Description

The anonymized list of strings encoded in base64.

Source code in mercury/dataschema/anonymize.py
def anonymize_list(self, list_of_str):
    """Anonymize a list of strings.

    This hashes or encrypts a list of strings. The precise function is defined at object construction.
    (See the doc of the class `Anonymize` for details.)

    Args:
        list_of_str:  A list of strings to be anonymized.

    Returns:
        The anonymized list of strings encoded in base64.
    """
    l2 = list()

    if self.digest_bits != 0:
        digest_len = math.ceil(self.digest_bits / 6)

        for s in list_of_str:
            hash = hashes.Hash(hashes.BLAKE2b(64))
            hash.update(self.hash_key)
            hash.update(s.encode('utf-8'))

            l2.append(base64.encodebytes(hash.finalize()).decode()[0:digest_len])
    else:
        aes = AESGCM(self.hash_key)

        if self.safe_crypto:
            for s in list_of_str:
                nonce = os.urandom(12)  # Must be >8 bytes (minimum requirement) and a multiple of 6 (fixed length in base64)
                cipher = aes.encrypt(nonce, s.encode('utf-8'), None)

                l2.append(base64.encodebytes(nonce + cipher).decode())
        else:
            nonce = b'12345678'
            for s in list_of_str:
                cipher = aes.encrypt(nonce, s.encode('utf-8'), None)

                l2.append(base64.encodebytes(cipher).decode())

    return l2
anonymize_list_any_type(list_of_any)

Anonymize a list of anything that supports conversion to string.

This is a wrapper over anonymize_list(). It first checks whether every element in the list is a string. If all elements are strings, it passes the list to anonymize_list() unchanged. Otherwise, it converts each element to a string and passes the new list to anonymize_list().

Parameters:

Name Type Description Default
list_of_any

A list of any data type that supports string conversion via str() to be anonymized.

required

Returns:

Type Description

The anonymized list of strings encoded in base64.

Source code in mercury/dataschema/anonymize.py
def anonymize_list_any_type(self, list_of_any):
    """Anonymize a list of anything that supports conversion to string.

    This is a wrapper over anonymize_list(). It first checks whether every element in the list
    is a string. If all elements are strings, it passes the list to anonymize_list() unchanged.
    Otherwise, it converts each element to a string and passes the new list to anonymize_list().

    Args:
        list_of_any:  A list of any data type that supports string conversion via str() to be anonymized.

    Returns:
        The anonymized list of strings encoded in base64.
    """

    assert type(list_of_any) == list

    all_str = True
    for s in list_of_any:
        if type(s) != str:
            all_str = False
            break

    if all_str:
        return self.anonymize_list(list_of_any)

    return self.anonymize_list([str(e) for e in list_of_any])
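Example (a minimal sketch; the values and key are illustrative):

from mercury.dataschema.anonymize import Anonymize

anon = Anonymize()               # default: keyed BLAKE2 hashing
anon.set_key('my-secret-key')    # illustrative key

# Mixed types are converted with str() before hashing.
print(anon.anonymize_list_any_type([42, 3.14, True, 'text']))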
deanonymize_list(list_of_str)

Deanonymize a list of strings.

Deanonymizes a list of anonymized strings, recovering the original text. This can only be applied if the encryption is reversible (the object was created with digest_bits = 0) and the key is the same key used for encryption.

Parameters:

Name Type Description Default
list_of_str

A list of strings anonymized using a previous .anonymize_list() call.

required

Raises:

Type Description
ValueError

When called on an object that does hashing (i.e. created with digest_bits > 0) rather than encryption.

Returns:

Type Description

The original deanonymized list of strings.

Source code in mercury/dataschema/anonymize.py
def deanonymize_list(self, list_of_str):
    """Deanonymize a list of strings.

    Deanonymizes a list of anonymized strings recovering the original text. This can only be applied if
    the encryption is reversible (The object was created with `digest_bits = 0`) and the key is the same
    key used for encryption.

    Args:
        list_of_str:  A list of strings anonymized using a previous .anonymize_list() call.

    Raises:
        ValueError: When called on an object that does hashing (is created with `digest_bits > 0`)
        rather than encryption.

    Returns:
        The original deanonymized list of strings.
    """
    if self.digest_bits != 0:
        raise ValueError("deanonymize_list() requires passing 'digest_bits = 0' to the constructor.")

    l2 = list()

    aes = AESGCM(self.hash_key)

    if self.safe_crypto:
        for s in list_of_str:
            raw = base64.decodebytes(s.encode())
            nonce = raw[0:12]
            cipher = raw[12:]

            l2.append(aes.decrypt(nonce, cipher, None).decode('utf-8'))
    else:
        nonce = b'12345678'
        for s in list_of_str:
            cipher = base64.decodebytes(s.encode())

            l2.append(aes.decrypt(nonce, cipher, None).decode('utf-8'))

    return l2
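Example (a sketch of the reversible round trip with randomized encryption; the key is illustrative):

from mercury.dataschema.anonymize import Anonymize

anon = Anonymize(digest_bits=0, safe_crypto=True)  # randomized, reversible AES-GCM
anon.set_key('my-secret-key')

tokens = anon.anonymize_list(['alice', 'alice'])
print(tokens[0] != tokens[1])           # True: same input, different ciphertexts
print(anon.deanonymize_list(tokens))    # ['alice', 'alice']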
set_key(encryption_key)

Set the encryption key of an existing Anonymize object.

This changes the encryption key, overriding any key defined via the environment variable MERCURY_ANONYMIZE_DATASCHEMA_KEY at construction. It can be called any number of times.

Parameters:

Name Type Description Default
encryption_key

The key as a string.

required
Source code in mercury/dataschema/anonymize.py
def set_key(self, encryption_key):
    """Set the encryption key of an existing `Anonymize` object.

    This changes the encryption key, overriding any key defined via the environment variable
    MERCURY_ANONYMIZE_DATASCHEMA_KEY at construction. It can be called any number of times.

    Args:
        encryption_key:  The key as a string.
    """
    hash_key = hashes.Hash(hashes.BLAKE2s(32))

    hash_key.update(encryption_key.encode('utf-8'))

    self.hash_key = hash_key.finalize()[0:16]

calculator

FeatureCalculator

This is a base class with the operation definitions. Several classes must extend it, implementing its operations for each of the supported frameworks (namely Pandas and PySpark).

set_config(**kwargs)

Set attributes from the given keyword arguments. These can later be used within specific calculator methods (like distribution(), for specifying the number of bins).

For this to work, the parameter must have been explicitly declared in the object's constructor. That is, you cannot pass a parameter name which the calculator doesn't support (doing so raises a ValueError).

Parameters:

Name Type Description Default
**kwargs

The names and values of the desired parameters to set.

{}
Source code in mercury/dataschema/calculator.py
def set_config(self, **kwargs):
    """ Set attributes with the keys of the dictionary. These can be later used within
    specific calculator methods (like `distribution()` for specifying the number of bins).

    For this to work, the parameter must have been explicitly declared in the object's
    constructor. That is, you cannot pass a parameter name which the calculator doesn't
    support (doing so raises a ValueError).

    Args:
        **kwargs: The names and values of the desired parameters to set.

    Raises:
        ValueError if any keyword argument does not exist among the existing attributes of
        the object.
    """
    if kwargs is None:
        return

    for key, val in kwargs.items():
        if not hasattr(self, key):
            raise ValueError(
                f"Error. This calculator doesn't support the `{key}` parameter. Available options are {self._registered_params}"
            )
        setattr(self, key, val)
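Example (a sketch using the PandasStatCalculator subclass documented below; 'fd' is one of NumPy's bin-estimation methods):

from mercury.dataschema.calculator import PandasStatCalculator

calc = PandasStatCalculator()
calc.set_config(distribution_bins_method='fd')  # declared in the constructor, so this is allowed

try:
    calc.set_config(unknown_param=1)            # not a declared parameter
except ValueError as err:
    print(err)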

PandasStatCalculator()

Bases: FeatureCalculator

Implementation of a Calculator for Pandas

Supported setting keys are the following:
  • distribution_bins_method: The method for setting the number of bins when calling the distribution method. Note that this only has an effect when the feature is either discrete or continuous.
  • limit_categorical_perc: The method for truncating categorical variables with high cardinality.
Source code in mercury/dataschema/calculator.py
def __init__(self):
    super().__init__()
    self.distribution_bins_method = 'sqrt'
    self.limit_categorical_perc = None
distribution(column, feature, bins=None)

Calculates the histogram for a given feature.

Parameters:

Name Type Description Default
column

Pandas column with the data

required
feature

Feature which holds the metadata

required
bins

(Only used for numerical features) If a number, the histogram will have that many bins. If a string, an automatic NumPy method will be used to estimate the number of bins; see the available methods at https://numpy.org/devdocs/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges. If None is provided, the class' default method is used, which is sqrt. For binary features it simply uses bins=2, and for categoricals bins=|categories|, unless limited with 'limit_categorical_perc' via the set_config method.

None
Source code in mercury/dataschema/calculator.py
def distribution(self, column, feature, bins=None):
    """ Calculates the histogram for a given feature.

    Args:
        column: Pandas column with the data
        feature: Feature which holds the metadata
        bins: (Only used for numerical features) If a number, the histogram will
              have `bins` bins. If a string, it will use an automatic NumPy method for
              estimating this number. See more about available methods here:
              https://numpy.org/devdocs/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges.
              If None is provided, it uses the class' default method, which is `sqrt`.
              For binary features it simply uses bins=2 and for categoricals, bins=|categories| unless limited
              with 'limit_categorical_perc' via the set_config method.
    """
    if 'no_nan_filtered' not in feature.cache:
        no_na = column.dropna()
        feature.cache['no_nan_filtered'] = no_na
    else:
        no_na = feature.cache['no_nan_filtered']

    if isinstance(feature, (BinaryFeature, CategoricalFeature)):

        no_na = no_na[no_na.isin(feature.stats['domain'])]  # It may be truncated
        t = (no_na.value_counts() / len(no_na)).sort_index()
        feature.stats['distribution'] = t.values
        feature.stats['distribution'] = [float(x) for x in feature.stats['distribution']]
        feature.stats['distribution_bins'] = list(t.index)

    else:
        bins = self.distribution_bins_method if not bins else bins
        histo = np.histogram(no_na, bins=bins)
        feature.stats['distribution'] = list(histo[0] / no_na.count())
        feature.stats['distribution'] = [float(x) for x in feature.stats['distribution']]
        feature.stats['distribution_bins'] = list(histo[1])
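Example (a sketch on synthetic data; it assumes ContinuousFeature is importable from mercury.dataschema.feature, as documented below):

import numpy as np
import pandas as pd
from mercury.dataschema.calculator import PandasStatCalculator
from mercury.dataschema.feature import ContinuousFeature

column = pd.Series(np.random.default_rng(0).normal(size=1000))
feature = ContinuousFeature(name='score')

calc = PandasStatCalculator()
calc.distribution(column, feature, bins=10)  # bins=None would fall back to the 'sqrt' method

print(feature.stats['distribution'])         # 10 relative frequencies
print(feature.stats['distribution_bins'])    # 11 bin edges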

StatCalculatorFactory

This static class receives a DataFrame and returns a particular implementation of a FeatureCalculator.

feature

BinaryFeature(name=None, dtype=None)

Bases: Feature

This class represents a binary feature within a schema (i.e. only two possible values).

Parameters:

Name Type Description Default
name

Feature name

None
dtype

Data type of the feature

None
Source code in mercury/dataschema/feature.py
def __init__(self,
             name=None,
             dtype=None
             ):

    super().__init__(name, dtype)

CategoricalFeature(name=None, dtype=None)

Bases: Feature

This class represents a categorical feature within a schema (i.e. only N possible values).

Parameters:

Name Type Description Default
name

Feature name

None
dtype

Data type of the feature

None
Source code in mercury/dataschema/feature.py
def __init__(self,
             name=None,
             dtype=None
             ):

    super().__init__(name, dtype)

ContinuousFeature(name=None, dtype=None)

Bases: Feature

This class represents a continuous feature within a schema (e.g. a float).

Parameters:

Name Type Description Default
name

Feature name

None
dtype

Data type of the feature

None
Source code in mercury/dataschema/feature.py
def __init__(self,
             name=None,
             dtype=None
             ):

    super().__init__(name, dtype)

DiscreteFeature(name=None, dtype=None)

Bases: Feature

This class represents a discrete feature within a schema (i.e. any number without decimals).

Parameters:

Name Type Description Default
name

Feature name

None
dtype

Data type of the feature

None
Source code in mercury/dataschema/feature.py
def __init__(self,
             name=None,
             dtype=None
             ):

    super().__init__(name, dtype)

Feature(name=None, dtype=None)

This class represents a generic feature within a schema.

Parameters:

Name Type Description Default
name str

Feature name

None
dtype DataType

Data type of the feature

None
Source code in mercury/dataschema/feature.py
def __init__(self,
             name: str = None,
             dtype: DataType = None
             ):
    self.name = name
    self.dtype = dtype if dtype else DataType.UNKNOWN
    self.stats = {}
    self.cache = {}  # Intermediate heavy calculations

FeatureFactory()

Source code in mercury/dataschema/feature.py
def __init__(self):
    pass
build_feature(column, colname=None, threshold_categorical=1e-05, force_feat_type=None, verbose=True)

Builds a schema Feature object given a column.

Parameters:

Name Type Description Default
column Series

Column to be analyzed

required
colname str

Name of the column (feature)

None
threshold_categorical float

Maximum percentage of unique values for a feature to be considered categorical: if the percentage of unique values is below this threshold, the column will be taken as categorical. This parameter can be a single float (same threshold for all columns) or a dict in which each key is the name of a column. Use the latter for custom thresholds per column.

1e-05
force_feat_type FeatType

If you want to force a variable to be of a certain type, use this parameter; its type will not be auto-inferred, but set to this value.

None
verbose bool

If this is set to False, possible inner warnings won't be shown.

True

Returns:

Type Description
Feature

Feature with only the base statistics calculated

Source code in mercury/dataschema/feature.py
def build_feature(self,
                  column: 'pandas.Series',  # noqa: F821
                  colname: str = None,
                  threshold_categorical: float = 1e-5,
                  force_feat_type: FeatType = None,
                  verbose: bool = True
                  ) -> Feature:
    """ Builds a schema Feature object given a column.

    Args:
        column: Column to be analyzed
        colname: Name of the column (feature)
        threshold_categorical: maximum percentage of unique values for a feature to be considered
                       categorical. If the percentage of unique values is below this threshold, the
                       column will be taken as categorical. This parameter can be a single float
                       (same threshold for all columns) or a dict in which each key is the name of
                       a column. Use the latter for custom thresholds per column.
        force_feat_type: If you want to force a variable to be of a certain type, use this
                        parameter; its type will not be auto-inferred, but set to this value.
        verbose: If this is set to False, possible inner warnings won't be shown.

    Returns:
        Feature with only the base statistics calculated
    """
    feat = Feature().build_stats(column)
    datatype = self.infer_datatype(column, feat)
    feat_type = FeatType.UNKNOWN

    # If the user forces the feature type, we kindly fulfill their wishes
    if force_feat_type is not None:
        featret = self._build_dummy_feature(datatype, force_feat_type, colname)
        featret.stats.update(feat.stats)  # keep the base statistics computed above
        return featret

    if feat.stats['cardinality'] == 2:
        feat_type = FeatType.BINARY
    else:
        # Data could still be either categorical, discrete or continuous
        if datatype is DataType.FLOAT:
            feat_type = self._infer_feature_type_from_float(feat, threshold_categorical, colname, verbose=verbose)

        if datatype is DataType.INTEGER:
            feat_type = self._infer_feature_type_from_int(feat, colname, threshold_categorical, verbose=verbose)

        if (datatype is DataType.STRING) or (datatype is DataType.CATEGORICAL):
            feat_type = FeatType.CATEGORICAL

    featret = self._build_dummy_feature(datatype, feat_type, colname)
    featret.stats.update(feat.stats)
    return featret
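Example (a sketch on synthetic data):

import pandas as pd
from mercury.dataschema.feature import FeatureFactory

factory = FeatureFactory()
feat = factory.build_feature(pd.Series(['a', 'b', 'a', 'c']), colname='segment')

print(type(feat).__name__)        # CategoricalFeature (strings with cardinality != 2)
print(feat.stats['cardinality'])  # 3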
infer_datatype(column, feature)

Finds out the data type of the column.

Parameters:

Name Type Description Default
column Series

column whose datatype will be inferred

required
feature Feature

Feature object. This is needed because we want to cache several internal operations, so future calls are faster.

required

Returns:

Type Description
DataType

Returns the datatype of the column

Source code in mercury/dataschema/feature.py
def infer_datatype(self, column: "pandas.Series", feature: Feature) -> DataType:  # noqa: F821
    """ Finds out the data type of the column.

    Args:
        column: column whose datatype will be inferred
        feature: Feature object. This is needed because we want to cache several internal
                 operations, so future calls are faster.

    Returns:
        Returns the datatype of the column
    """
    datatype = DataType.UNKNOWN

    if column.dtype.name == 'category':
        datatype = DataType.CATEGORICAL
    elif np.issubdtype(column, np.integer):
        datatype = DataType.INTEGER
    elif np.issubdtype(column, np.bool_):
        datatype = DataType.BOOL
    elif np.issubdtype(column, np.floating):
        datatype = DataType.FLOAT
    elif np.issubdtype(column, np.object_):
        sample = feature.cache['no_nan_filtered'].iloc[0]
        if type(sample) is str:
            datatype = DataType.STRING
        # TODO: This type could be another array
        # TODO: This type could be a JSON (dict)
        # TODO: This type could be a datetime

    return datatype

schemagen

DataSchema()

Dataset schema

This class takes a dataframe and generates its schema as a collection of feature.Feature objects. Each of them contains metadata and statistics about a column of the original dataframe that can be further explored.

Example
>>> schma = DataSchema() \
...     .generate(dataset) \
...     .calculate_statistics()
{'DISBURSED_AMOUNT': Categorical Feature (NAME=DISBURSED_AMOUNT, dtype=DataType.INTEGER),
 'ASSET_COST': Categorical Feature (NAME=ASSET_COST, dtype=DataType.INTEGER),
 'LTV': Continuous Feature (NAME=LTV, dtype=DataType.FLOAT),
 'BUREAU_SCORE': Discrete Feature (NAME=BUREAU_SCORE, dtype=DataType.INTEGER),
 'BUREAU_SCORE_DESCRIPTION': Categorical Feature (NAME=BUREAU_SCORE_DESCRIPTION, dtype=DataType.STRING),
 'NEW_LOANS_IN_LAST_SIX_MONTHS': Discrete Feature (NAME=NEW_LOANS_IN_LAST_SIX_MONTHS, dtype=DataType.INTEGER),
 'DEFAULTED_LOANS_IN_LAST_SIX_MONTHS': Discrete Feature (NAME=DEFAULTED_LOANS_IN_LAST_SIX_MONTHS, dtype=DataType.INTEGER),
 'NUM_LOANS_TAKEN': Discrete Feature (NAME=NUM_LOANS_TAKEN, dtype=DataType.INTEGER),
 'NUM_ACTIVE_LOANS': Discrete Feature (NAME=NUM_ACTIVE_LOANS, dtype=DataType.INTEGER),
 'NUM_DEFAULTED_LOANS': Discrete Feature (NAME=NUM_DEFAULTED_LOANS, dtype=DataType.INTEGER),
 'AGE': Discrete Feature (NAME=AGE, dtype=DataType.INTEGER),
 'GENDER': Binary Feature (NAME=GENDER, dtype=DataType.STRING),
 'CIVIL_STATUS': Categorical Feature (NAME=CIVIL_STATUS, dtype=DataType.STRING),
 'ORIGIN': Binary Feature (NAME=ORIGIN, dtype=DataType.STRING),
 'DIGITAL': Binary Feature (NAME=DIGITAL, dtype=DataType.INTEGER),
 'SCORE': Continuous Feature (NAME=SCORE, dtype=DataType.FLOAT),
 'PREDICTION': Binary Feature (NAME=PREDICTION, dtype=DataType.INTEGER)}
>>> schma.feats['SCORE'].stats
{'num_nan': 0,
'percent_nan': 0.0,
'samples': 233154,
'percent_unique': 0.7967352050576014,
'cardinality': 185762,
'min': 0.17454321487679067,
'max': 0.9373813084029072,
'mean': 0.7625553210045813,
'std': 0.15401509786623635,
'distribution': array([7.48617716e-07, 1.07579979e-06, 1.40298186e-06, 1.73016394e-06,
        2.05734601e-06, 2.38452809e-06, 2.71171016e-06, 3.03889224e-06,
        3.36607431e-06, 3.69325638e-06, 4.02043846e-06])}
# Specifying custom parameters (shared among all features) for the calculate_statistics method
>>> schma = DataSchema() \
...     .generate(dataset) \
...     .calculate_statistics({'distribution_bins_method': 'sqrt'})  # Specify bin generation method (see numpy.histogram)

# We can also specify granular statistic parameters per variable
>>> schma = DataSchema() \
...     .generate(dataset) \
...     .calculate_statistics({'SCORE': {'distribution_bins_method': 'sqrt'}})  # Specify bin generation method (see numpy.histogram)

>>> schma = DataSchema() \
...     .generate(dataset) \
...     .calculate_statistics({'SCORE': {'distribution_bins_method': 5}})  # Specify 5 bins only for numerical features
Source code in mercury/dataschema/schemagen.py
def __init__(self):
    self.dataframe = None
    self.feats = {}
    self._feat_factory = None
    self._generated = False
binary_feats: List[str] property

List with the names of all binary features

categorical_feats: List[str] property

List with the names of all categorical features

continuous_feats: List[str] property

List with the names of all continuous features

discrete_feats: List[str] property

List with the names of all discrete features

anonymize(anonymize_params)

Anonymize the selected features of a data schema.

Parameters:

Name Type Description Default
anonymize_params dict

Dictionary where the keys are the names of the columns to be anonymized and the values are mercury.contrib.dataschema.Anonymize objects that can be used to anonymize them.

required
Source code in mercury/dataschema/schemagen.py
def anonymize(self, anonymize_params: dict) -> "DataSchema":
    """
    Anonymize the selected features of a data schema.

    Args:
        anonymize_params: Dictionary where the keys are the names of the columns to be anonymized and the values
                          are mercury.contrib.dataschema.Anonymize objects that can be used to anonymize them.
    Raises:
        UserWarning, if anonymize_params is empty.
        ValueError, if the feature selected to anonymize is not binary or categorical, or is not a feature of the dataschema.
    """
    if not anonymize_params:
        raise UserWarning("To anonymize, it is necessary to use a dictionary with the format: {'var1':anonymizer1, 'var2':anonymizer2}")

    if any(feat not in self.feats.keys() for feat in anonymize_params.keys()):
        raise ValueError("Input Error: Keys of 'anonymize_params' dictionary must be columns name of the data schema")

    for feature in list(self.feats.keys()):
        anon = anonymize_params.get(feature)

        if anon:
            if not isinstance(self.feats[feature], (BinaryFeature, CategoricalFeature)):
                raise ValueError(f"Input Error: Anonymze only supports Categorical or Binary variables -> {feature}, You can use \
                                    the `force_types` param in 'generate()' to specify which features should be categorical ")
            else:
                self.feats[feature].stats['distribution_bins'] = anon.\
                    anonymize_list_any_type(list(self.feats[feature].stats['distribution_bins']))
                self.feats[feature].stats['domain'] = anon.\
                    anonymize_list_any_type(list(self.feats[feature].stats['domain']))

    return self
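Example (a sketch on synthetic data with an illustrative key; it assumes calculate_statistics() has populated 'domain' and 'distribution_bins' for the feature, as the method body requires):

import pandas as pd
from mercury.dataschema.schemagen import DataSchema
from mercury.dataschema.anonymize import Anonymize

df = pd.DataFrame({'GENDER': ['M', 'F', 'F', 'M'], 'AGE': [31, 45, 22, 58]})

anon = Anonymize()               # keyed hashing (irreversible)
anon.set_key('my-secret-key')    # illustrative key

schema = DataSchema().generate(df).calculate_statistics()
schema.anonymize({'GENDER': anon})             # GENDER is binary, so this is allowed
print(schema.feats['GENDER'].stats['domain'])  # two 16-char hashes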
calculate_statistics(calculator_configs=None)

Triggers the computation of all statistics for all registered features of the schema.

Parameters:

Name Type Description Default
calculator_configs dict

Optional configurations for each of the calculator parameters. This can be either a dict or a "dict of dicts". In the first case, the statistics for ALL FEATURES will be computed with those parameters. Additionally, you can specify a mapping of [feature_name: {config}] with granular configurations per feature. The supported configuration keys are the attributes declared within a calculator class. See mercury.contrib.dataschema.calculator.PandasStatCalculator (or Spark) for details.

None
Source code in mercury/dataschema/schemagen.py
def calculate_statistics(
    self,
    calculator_configs: dict = None
) -> "DataSchema":
    """ Triggers the computation of all statistics for all registered features
    of the schema.

    Args:
        calculator_configs: Optional configurations for each of the calculator parameters.
                            This can be either a dict or a "dict of dicts". In the first case,
                            the statistics for ALL FEATURES will be computed with those parameters.
                            Additionally, you can specify a mapping of [feature_name: {config}] with
                            granular configurations per feature.
                            The supported configuration keys are the attributes declared within a calculator class.
                            See mercury.contrib.dataschema.calculator.PandasStatCalculator (or Spark) for details.
    """
    featnames = list(self.feats.keys())

    calculator_configs = calculator_configs if calculator_configs else {}

    # The user can pass two types:
    #  - {'param': 'value', 'param2': 'value'} -> Single config shared by all variables
    #  - {'var1': {...}, 'var2': {...}, ...}   -> One config per variable
    multiple_configs = len(calculator_configs) > 0 and isinstance(list(calculator_configs.values())[0], dict)

    # Case when the user passes a single shared config
    if not multiple_configs:
        calculator = StatCalculatorFactory.build_calculator(self.dataframe)
        calculator.set_config(**calculator_configs)

    for feature in featnames:
        if multiple_configs:
            # Case when the user passes one config per variable
            calculator = StatCalculatorFactory.build_calculator(self.dataframe)
            if feature in calculator_configs:
                calculator.set_config(**(calculator_configs[feature]))

        # Calculate distributions
        self.feats[feature].build_stats(self.dataframe.loc[:, feature], calculator)

    return self
deanonymize(anonymize_params)

De-anonymize the selected features on a preloaded schema.

Parameters:

Name Type Description Default
anonymize_params dict

Dictionary where the keys are the names of the columns to be deanonymized and the values are mercury.contrib.dataschema.Anonymize objects that can be used to deanonymize them.

required
Source code in mercury/dataschema/schemagen.py
def deanonymize(self, anonymize_params: dict) -> "DataSchema":
    """
    De-anonymize the selected features on a preloaded schema.

    Args:
        anonymize_params: Dictionary where the keys are the names of the columns to be deanonymized and the values
                          are mercury.contrib.dataschema.Anonymize objects that can be used to deanonymize them.

    Raises:
        UserWarning, if anonymize_params is empty.
        ValueError, if the feature selected to deanonymize is not binary or categorical, or is not a feature of the dataschema.
    """
    if not anonymize_params:
        raise UserWarning("To De-anonymize, it is necessary to use a dictionary with the format: {'var1':anonym1, 'var2':anonym2}")

    if any(feat not in self.feats.keys() for feat in anonymize_params.keys()):
        raise ValueError("Input Error: Keys of 'anonymize_params' dictionary must be columns name of the data schema")

    for feature in list(self.feats.keys()):
        anon = anonymize_params.get(feature)

        if anon:

            if not isinstance(self.feats[feature], (BinaryFeature, CategoricalFeature)):
                raise ValueError(f"Input Error: Deanonymize only supports Categorical or Binary variables -> {feature} ")
            else:
                operation = int if self.feats[feature].dtype == DataType.INTEGER else str
                self.feats[feature].stats['distribution_bins'] = \
                    list(map(operation, anon.deanonymize_list(self.feats[feature].stats['distribution_bins'])))
                self.feats[feature].stats['domain'] = \
                    list(map(operation, anon.deanonymize_list(self.feats[feature].stats['domain'])))
    return self
from_json(json_obj) classmethod

Rebuilds a schema from a JSON representation.

Returns:

Type Description
DataSchema

The rebuilt schema

Source code in mercury/dataschema/schemagen.py
@classmethod
def from_json(cls, json_obj: dict) -> "DataSchema":
    """ Rebuilds an schema from a JSON representation.

    Returns:
        The rebuilt schema
    """
    schema = DataSchema()
    factory = FeatureFactory()

    for featname, feat in json_obj['feats'].items():
        ftype = FeatType[feat['feat_type']]
        dtype = DataType[feat['dtype']]
        feat_name = feat['name']
        dummy_feat = factory._build_dummy_feature(dtype, ftype, feat_name)
        dummy_feat.stats = feat['stats']
        schema.feats[featname] = dummy_feat

    return schema
generate(dataframe, force_types=None, custom_stats=None, verbose=True)

Builds the schema. For float and integer datatypes, by default the method tries to infer whether a feature is categorical or numeric (Continuous or Discrete) depending on the percentage of unique values. However, that doesn't work in all cases. In those cases, you can use the force_types param to specify which features should be categorical and which should be numeric, independently of the percentage of unique values.

Parameters:

Name Type Description Default
dataframe Union[pandas.DataFrame, pyspark.sql.DataFrame]

DataFrame on which the schema will be inferred.

required
force_types Dict[str, FeatType]

Dictionary of the form <FEATURE_NAME, FeatType> that contains the features to be forced to a specific type (Continuous, Discrete, Categorical...)

None
custom_stats dict

Custom statistics to be calculated for each column

None
verbose bool

whether to show or filter all possible warning messages

True
Source code in mercury/dataschema/schemagen.py
def generate(
    self,
    dataframe: Union["pandas.DataFrame", "pyspark.sql.DataFrame"],  # noqa: F821
    force_types: Dict[str, FeatType] = None,
    custom_stats: dict = None,
    verbose: bool = True,
) -> "DataSchema":
    """ Builds the schema. For float and integer datatypes, by default the method tries to infer
        if a feature is categorical or numeric (Continuous or Discrete) depending on the percentage
        of unique values. However, that doesn't work in all the cases. In those cases, you can use
        the `force_types` param to specify which features should be categorical and which
        should be numeric independently of the percentage of unique values.

    Args:
        dataframe: DataFrame on which the schema will be inferred.
        force_types: Dictionary with the form <FEATURE_NAME, FeatType> that contains the features to be
                    forced to a specific type (Continuous, Discrete, Categorical...)
        custom_stats: Custom statistics to be calculated for each column
        verbose: whether to show or filter all possible warning messages
    """
    if "pyspark" in str(type(dataframe)):
        raise RuntimeError("Sorry, Pyspark is not supported yet...")

    self.dataframe = dataframe
    self._generated = True

    self._feat_factory = FeatureFactory()

    inferring_types = True if force_types is None else False

    for col in self.dataframe.columns:
        thresh = self._get_threshold(len(self.dataframe))

        # Look if the feature type has been specified
        forced_type = None
        if not inferring_types and col in force_types:
            forced_type = force_types[col]

        feat = self._feat_factory.build_feature(
            self.dataframe.loc[:, col],
            col,
            force_feat_type=forced_type,
            threshold_categorical=thresh,
            verbose=inferring_types and verbose  # Only show warnings (if any) when using default args.
        )
        self.feats[col] = feat

    return self
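Example (a sketch on synthetic data; it assumes FeatType is importable from mercury.dataschema.feature, as in the factory's signature):

import pandas as pd
from mercury.dataschema.schemagen import DataSchema
from mercury.dataschema.feature import FeatType

# ZIP codes are integers but numerically meaningless: force them to be categorical.
df = pd.DataFrame({'ZIP': [28001, 28002, 28001, 8035], 'LTV': [0.80, 0.65, 0.92, 0.45]})

schema = DataSchema().generate(df, force_types={'ZIP': FeatType.CATEGORICAL})
print(type(schema.feats['ZIP']).__name__)  # CategoricalFeature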
generate_manual(dataframe, categ_columns, discrete_columns, binary_columns, custom_stats=None)

Builds the schema manually. This acts like generate() but in a more restrictive way. All the names passed to categ_columns will be taken as categorical features, no more, no less. It will avoid making automatic type inference on every feature not in categ_columns. The same rule is applied on discrete_columns.

Note

This method is considered low level. If you use it, make sure the type assigned to each feature is compatible with the datatypes (float, int, string, ...) in the column, or a later call to calculate_statistics could fail.

Parameters:

Name Type Description Default
dataframe Union[pandas.DataFrame, pyspark.sql.DataFrame]

DataFrame on which the schema will be inferred.

required
categ_columns List[str]

list of columns which will be forced to be taken as categorical. Warning: all features not in this list are guaranteed not to be categorical.

required
discrete_columns List[str]

list of columns which will be forced to be taken as discrete. Warning: all features not in this list are guaranteed not to be taken as discrete (i.e. they will be continuous).

required
binary_columns List[str]

list of columns which will be forced to be taken as binary.

required
custom_stats dict

Custom statistics to be calculated for each column.

None
Source code in mercury/dataschema/schemagen.py
def generate_manual(
    self,
    dataframe: Union["pandas.DataFrame", "pyspark.sql.DataFrame"],  # noqa: F821
    categ_columns: List[str],
    discrete_columns: List[str],
    binary_columns: List[str],
    custom_stats: dict = None,
) -> "DataSchema":
    """ Builds the schema manually. This acts like `generate()` but in a more restrictive way.
    All the names passed to `categ_columns` will be taken as categorical features, no more, no less.
    It will avoid making automatic type inference on every feature not in `categ_columns`.
    The same rule is applied on `discrete_columns`.

    Note:
        This method is considered low level. If you use it, make sure the type assigned
        to each feature is compatible with the datatypes (float, int, string, ...) in the column, or
        a later call to `calculate_statistics` could fail.

    Args:
        dataframe: DataFrame on which the schema will be inferred.
        categ_columns: list of columns which will be forced to be taken as categorical. Warning:
                      all features not in this list are guaranteed not to be categorical.
        discrete_columns: list of columns which will be forced to be taken as discrete. Warning:
                      all features not in this list are guaranteed not to be taken as discrete (i.e.
                      they will be continuous).
        binary_columns: list of columns which will be forced to be taken as binary.
        custom_stats: Custom statistics to be calculated for each column.
    """
    force_types = {}
    for col in dataframe.columns:
        if col in categ_columns:
            force_types[col] = FeatType.CATEGORICAL
        else:
            # Is in either binary, continuous or discrete lists
            if col in discrete_columns:
                force_types[col] = FeatType.DISCRETE
            elif col in binary_columns:
                force_types[col] = FeatType.BINARY
            else:
                force_types[col] = FeatType.CONTINUOUS

    return self.generate(
        dataframe=dataframe,
        force_types=force_types,
        verbose=False,
        custom_stats=custom_stats
    )
load(path) classmethod

Loads a previously serialized schema (as JSON)

Parameters:

Name Type Description Default
path str

path to the serialized schema

required

Returns:

Type Description
DataSchema

The rebuilt schema

Source code in mercury/dataschema/schemagen.py
@classmethod
def load(cls, path: str) -> "DataSchema":
    """ Loads a previously serialized schema (as JSON)

    Args:
        path: path to the serialized schema

    Returns:
        The rebuilt schema
    """
    with open(path, 'r') as file:
        json_obj = json.load(file)
    schema = cls.from_json(json_obj)
    return schema
save(path)

Saves a JSON with the schema representation

Parameters:

Name Type Description Default
path

where the JSON will be saved.

required
Source code in mercury/dataschema/schemagen.py
def save(self, path):
    """ Saves a JSON with the schema representation

    Args:
        path: where the JSON will be saved.
    """
    with open(path, 'w') as file:
        json.dump(self.to_json(), file)
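Example (a sketch of the save/load round trip on synthetic data; the path is illustrative and the computed stats are assumed to be JSON-serializable, as to_json() expects):

import pandas as pd
from mercury.dataschema.schemagen import DataSchema

df = pd.DataFrame({'LTV': [0.80, 0.65, 0.92, 0.45]})
schema = DataSchema().generate(df).calculate_statistics()

schema.save('schema.json')
restored = DataSchema.load('schema.json')
print(list(restored.feats.keys()))  # ['LTV']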
to_json()

Converts the schema to a JSON representation

Returns:

Type Description
dict

dictionary with the features and their stats

Source code in mercury/dataschema/schemagen.py
def to_json(self) -> dict:
    """ Converts the schema to a JSON representation

    Returns:
        dictionary with the features and their stats
    """
    retdict = dict(feats=dict())
    for key, val in self.feats.items():
        retdict['feats'][key] = self.feats[key].to_json()

    return retdict
validate(other)

Validates another schema against this one. The other schema will be considered valid if it shares the same feature names and datatypes with this one.

Parameters:

Name Type Description Default
other DataSchema

the other schema to be checked against this one

required
Source code in mercury/dataschema/schemagen.py
def validate(self, other: "DataSchema"):
    """ Validates other schema with this one. The other schema will be considered
    valid if it shares the same feature names and datatypes with this.

    Args:
        other: the other schema to be checked against this one

    Raises:
        RuntimeError if other schema differs from this one
    """
    # Check feature names match
    if list(self.feats.keys()) != list(other.feats.keys()):
        diff = set(self.feats.keys()) - set(other.feats.keys())
        raise RuntimeError(f"Features do not match. These ones are not present on both datasets {list(diff)}")

    # Check feature and data types are the same
    for key, item in other.feats.items():
        if not isinstance(item, self.feats[key].__class__):
            raise RuntimeError(f"""Feature types do not match. '{key}' in other is """
                               f"""{type(item)}. However, {type(self.feats[key])} is expected.""")

        if item.dtype != self.feats[key].dtype:
            raise RuntimeError(f"""Data types types do not match. '{key}' in other is """
                               f"""{item.dtype}. However, {self.feats[key].dtype} is expected.""")