Data Schema

`mercury.dataschema`

`anonymize`

`Anonymize(digest_bits=96, safe_crypto=False)`

Cryptographically secure anonymization.

This class encrypts or hashes lists of strings using cryptographically secure, standardized algorithms. It can be used with a user-defined key or without a key, in which case it produces identical hashes across different platforms.

The key can be given at construction time by setting the environment variable MERCURY_ANONYMIZE_DATASCHEMA_KEY or at any later time by calling the .set_key() method.

Parameters:

Name	Type	Description	Default
`digest_bits`	`int`	The length in (effective) bits of the output hash. Because the output is encoded in base64, the number of characters is 1/6 of this number (rounded up); e.g., 96 (the default) produces 16-character hashes. If this is set to a value other than zero, the output length is fixed, the output is irreversible (cannot be used with .deanonymize_list()) and the algorithm used for hashing is keyed BLAKE2 (https://www.blake2.net/). If this is set to zero, you get variable-length encryption using AES in Galois/Counter Mode (see the argument `safe_crypto`), and the result can be deanonymized with the same key using .deanonymize_list().	`96`
`safe_crypto`	`bool`	Selects how the encryption is randomized. Only used when `digest_bits` is zero. If True, the same original text with the same key produces a different encrypted text each time. Note that this makes the cardinality of the output equal to the length of the list. If False (the default), the same text produces the same output with the same key. This preserves cardinality, but can be a target of attacks when the attacker has access to encoded pairs.	`False`
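
The hashing mode (`digest_bits` other than zero) can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: it uses `hashlib` instead of the `cryptography` package, and the helper names `derive_key` and `hash_strings` are hypothetical. It assumes the default passphrase `'<void>'` and the key-then-text update order visible in the source listings on this page.

```python
import base64
import hashlib
import math


def derive_key(plain_key: str = "<void>") -> bytes:
    """Derive a 16-byte key: 32-byte BLAKE2s digest of the passphrase, truncated."""
    return hashlib.blake2s(plain_key.encode("utf-8"), digest_size=32).digest()[:16]


def hash_strings(values, digest_bits=96, plain_key="<void>"):
    """Hash each string with BLAKE2b over key + text and keep digest_bits of base64."""
    key = derive_key(plain_key)
    digest_len = math.ceil(digest_bits / 6)  # each base64 character carries 6 bits
    out = []
    for s in values:
        h = hashlib.blake2b(digest_size=64)
        h.update(key)                 # key first ...
        h.update(s.encode("utf-8"))   # ... then the text
        out.append(base64.encodebytes(h.digest()).decode()[:digest_len])
    return out


hashed = hash_strings(["alice", "bob", "alice"])
print(hashed)
```

Equal inputs map to equal outputs under the same key, so cardinality is preserved, and a different passphrase yields unrelated hashes.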

Source code in mercury/dataschema/anonymize.py

def __init__(self, digest_bits=96, safe_crypto=False):
    self.digest_bits = digest_bits
    self.safe_crypto = safe_crypto

    plain_key = os.environ.get('MERCURY_ANONYMIZE_DATASCHEMA_KEY')
    plain_key = '<void>' if plain_key is None else plain_key

    hash_key = hashes.Hash(hashes.BLAKE2s(32))

    hash_key.update(plain_key.encode('utf-8'))

    self.hash_key = hash_key.finalize()[0:16]

`anonymize_list(list_of_str)`

Anonymize a list of strings.

This hashes or encrypts a list of strings. The precise function is defined at object construction. (See the doc of the class Anonymize for details.)

Parameters:

Name	Type	Description	Default
`list_of_str`	`list`	A list of strings to be anonymized.	required

Returns (list): The anonymized list of strings encoded in base64.

Source code in mercury/dataschema/anonymize.py

def anonymize_list(self, list_of_str):
    """Anonymize a list of strings.

    This hashes or encrypts a list of strings. The precise function is defined at object construction.
    (See the doc of the class `Anonymize` for details.)

    Args:
        list_of_str (list):  A list of strings to be anonymized.

    Returns (list):
        The anonymized list of strings encoded in base64.
    """
    l2 = list()

    if self.digest_bits != 0:
        digest_len = math.ceil(self.digest_bits / 6)

        for s in list_of_str:
            hash = hashes.Hash(hashes.BLAKE2b(64))
            hash.update(self.hash_key)
            hash.update(s.encode('utf-8'))

            l2.append(base64.encodebytes(hash.finalize()).decode()[0:digest_len])
    else:
        aes = AESGCM(self.hash_key)

        if self.safe_crypto:
            for s in list_of_str:
                nonce = os.urandom(12)		# Must be >8 (min requirement) and multiple of 6 (fixed length in)
                cipher = aes.encrypt(nonce, s.encode('utf-8'), None)

                l2.append(base64.encodebytes(nonce + cipher).decode())
        else:
            nonce = b'12345678'
            for s in list_of_str:
                cipher = aes.encrypt(nonce, s.encode('utf-8'), None)

                l2.append(base64.encodebytes(cipher).decode())

    return l2

`anonymize_list_any_type(list_of_any)`

Anonymize a list of anything that supports conversion to string.

This is a wrapper function over anonymize_list(). It first checks whether every element in the list is a string. If all elements are strings, it passes the list to anonymize_list() unchanged. Otherwise, it creates a new list with every element converted via str() and passes that to anonymize_list().

Parameters:

Name	Type	Description	Default
`list_of_any`	`list`	A list of any data type that supports string conversion via str() to be anonymized.	required

Returns (list): The anonymized list of strings encoded in base64.
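
The conversion step can be sketched in isolation. The helper name `to_str_list` is hypothetical and only mirrors the check described above; it does no hashing.

```python
def to_str_list(list_of_any):
    """Return the list as is if every element is a str, else a new list built with str()."""
    if all(type(e) is str for e in list_of_any):
        return list_of_any
    return [str(e) for e in list_of_any]


converted = to_str_list([1, 2.5, "x", None, True])
collided = to_str_list([1, "1"])
print(converted, collided)
```

One consequence of converting with str() is that values of different types with the same textual form, such as `1` and `"1"`, become the same string and therefore anonymize to the same value.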

Source code in mercury/dataschema/anonymize.py

def anonymize_list_any_type(self, list_of_any):
    """Anonymize a list of anything that supports conversion to string.

    This is a wrapper function over anonymize_list(). It verifies if any element in the list is
    not a string first. If all elements are strings, it passes the list to anonymize_list().
    Otherwise, it creates a new list of string elements and passes that to anonymize_list().

    Args:
        list_of_any (list):  A list of any data type that supports string conversion via str() to be anonymized.

    Returns (list):
        The anonymized list of strings encoded in base64.
    """

    assert type(list_of_any) == list

    all_str = True
    for s in list_of_any:
        if type(s) != str:
            all_str = False
            break

    if all_str:
        return self.anonymize_list(list_of_any)

    return self.anonymize_list([str(e) for e in list_of_any])

`deanonymize_list(list_of_str)`

Deanonymize a list of strings.

Deanonymizes a list of anonymized strings, recovering the original text. This can only be applied if the encryption is reversible (the object was created with digest_bits = 0) and the key is the same key used for encryption.

Raises ValueError when called on an object that does hashing (is created with digest_bits > 0) rather than encryption.

Parameters:

Name	Type	Description	Default
`list_of_str`	`list`	A list of strings anonymized using a previous .anonymize_list() call.	required

Returns (list): The original deanonymized list of strings.

Source code in mercury/dataschema/anonymize.py

def deanonymize_list(self, list_of_str):
    """Deanonymize a list of strings.

    Deanonymizes a list of anonymized strings recovering the original text. This can only be applied if
    the encryption is reversible (The object was created with `digest_bits = 0`) and the key is the same
    key used for encryption.

    Raises ValueError when called on an object that does hashing (is created with `digest_bits > 0`)
    rather than encryption.


    Args:
        list_of_str (list):  A list of strings anonymized using a previous .anonymize_list() call.

    Returns (list):
        The original deanonymized list of strings.
    """
    if self.digest_bits != 0:
        raise ValueError("deanonymize_list() requires passing 'digest_bits = 0' to the constructor.")

    l2 = list()

    aes = AESGCM(self.hash_key)

    if self.safe_crypto:
        for s in list_of_str:
            raw = base64.decodebytes(s.encode())
            nonce = raw[0:12]
            cipher = raw[12:]

            l2.append(aes.decrypt(nonce, cipher, None).decode('utf-8'))
    else:
        nonce = b'12345678'
        for s in list_of_str:
            cipher = base64.decodebytes(s.encode())

            l2.append(aes.decrypt(nonce, cipher, None).decode('utf-8'))

    return l2

`set_key(encryption_key)`

Set the encryption key of an existing Anonymize object.

This changes the encryption key overriding the key possibly defined using the environment variable MERCURY_ANONYMIZE_DATASCHEMA_KEY at construction. It can be called any number of times.

Parameters:

Name	Type	Description	Default
`encryption_key`	`str`	The key as a string.	required

Source code in mercury/dataschema/anonymize.py

def set_key(self, encryption_key):
    """Set the encryption key of an existing `Anonymize` object.

    This changes the encryption key overriding the key possibly defined using the environment variable
    MERCURY_ANONYMIZE_DATASCHEMA_KEY at construction. It can be called any number of times.

    Args:
        encryption_key (str):  The key as a string.
    """
    hash_key = hashes.Hash(hashes.BLAKE2s(32))

    hash_key.update(encryption_key.encode('utf-8'))

    self.hash_key = hash_key.finalize()[0:16]

`calculator`

`FeatureCalculator`

This is a base class with the operation definitions. Subclasses extend it and implement its operations for each of the supported frameworks (namely Pandas and PySpark).

`set_config(**kwargs)`

Set attributes with the keys of the dictionary. These can be later used within specific calculator methods (like distribution() for specifying the number of bins).

For this to work, the parameter must have been explicitly declared in the object's constructor. That is, you cannot pass a parameter name that the calculator doesn't support (doing so raises a ValueError).

Parameters:

Name	Type	Description	Default
`**kwargs`	`dict`	The names and values of the desired parameters to set.	`{}`

Raises ValueError if any keyword argument does not exist among the existing attributes of the object.
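
The guard described above can be illustrated with a stand-in class. `SketchCalculator` is hypothetical and not part of the library; its two attributes are borrowed from the Pandas calculator's constructor shown further down this page, and the error message is simplified.

```python
class SketchCalculator:
    """Hypothetical stand-in for a calculator; only mirrors the set_config guard."""

    def __init__(self):
        # Only attributes declared here can be configured later.
        self.distribution_bins_method = "sqrt"
        self.limit_categorical_perc = None

    def set_config(self, **kwargs):
        for key, val in kwargs.items():
            if not hasattr(self, key):
                raise ValueError(f"Unsupported parameter: {key}")
            setattr(self, key, val)


calc = SketchCalculator()
calc.set_config(distribution_bins_method=10)  # declared in __init__, accepted

try:
    calc.set_config(n_bins=10)  # never declared, rejected
    rejected = False
except ValueError:
    rejected = True
print(calc.distribution_bins_method, rejected)
```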

Source code in mercury/dataschema/calculator.py

def set_config(self, **kwargs):
    """ Set attributes with the keys of the dictionary. These can be later used within
    specific calculator methods (like `distribution()` for specifying the number of bins).

    For this to work, the parameter must have been explicitly declared during object's
    constructor. That is, you cannot pass here a parameter name which the calculator doesn't
    support (or this will raise a ValueError).

    Args:
        **kwargs (dict): The names and values of the desired parameters to set.

    Raises ValueError if any keyword argument does not exist among the existing attributes of
    the object.
    """
    if kwargs is None:
        return

    for key, val in kwargs.items():
        if not hasattr(self, key):
            raise ValueError(
                f"Error. This calculator doesn't support the `{key}` parameter. Available options are {self._registered_params}"
            )
        setattr(self, key, val)

`PandasStatCalculator()`

Bases: FeatureCalculator

Implementation of a Calculator for Pandas

Supported setting keys are the following:

- `distribution_bins_method`: The method for setting the number of bins when
  calling the `distribution` method. Note that this only has effect when feature is
  either discrete or continuous.
- `limit_categorical_perc`: The method for truncating categorical variables with
   high cardinality

Source code in mercury/dataschema/calculator.py

def __init__(self):
    super().__init__()
    self.distribution_bins_method = 'sqrt'
    self.limit_categorical_perc = None

`distribution(column, feature, bins=None)`

Calculates the histogram for a given feature.

Parameters:

Name	Type	Description	Default
`column`	`Series`	Pandas column with the data	required
`feature`	`Feature`	Feature which holds the metadata	required
`bins`	`Union[int, str, None]`	(Only used for numerical features) If a number, the histogram will have `bins` bins. If a string, it will use an automatic NumPy method for estimating this number. See more about available methods here: https://numpy.org/devdocs/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges. If None is provided, it uses the calculator's `distribution_bins_method` setting, which defaults to `sqrt`. For binary features it simply uses bins=2 and for categoricals, bins=|categories| unless the domain is limited with 'limit_categorical_perc' in the set_config method.	`None`
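
For numerical features, the stored distribution is a histogram of relative frequencies over the non-missing values. The sketch below approximates that in pure Python, assuming the `sqrt` rule means ceil(sqrt(n)) equal-width bins; the library itself delegates to NumPy, and `sqrt_histogram` is a hypothetical name.

```python
import math


def sqrt_histogram(values):
    """Relative-frequency histogram with ceil(sqrt(n)) equal-width bins."""
    data = [v for v in values if v is not None]  # stand-in for dropping NaN values
    n = len(data)
    n_bins = max(1, math.ceil(math.sqrt(n)))
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins)] + [hi]
    counts = [0] * n_bins
    for v in data:
        # The last bin is closed on the right, so v == hi falls into it.
        idx = min(int((v - lo) / width), n_bins - 1) if width > 0 else 0
        counts[idx] += 1
    return [c / n for c in counts], edges


dist, edges = sqrt_histogram([1, 2, 2, 3, 3, 3, 4, 4, 10, None])
print(dist, edges)
```

The frequencies sum to 1 and there is always one more edge than there are bins, which matches the shape of the `distribution` and `distribution_bins` entries.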

Source code in mercury/dataschema/calculator.py

def distribution(self, column, feature, bins=None):
    """ Calculates the histogram for a given feature.

    Args:
        column (pd.Series): Pandas column with the data
        feature (Feature): Feature which holds the metadata
        bins (Union[int, str, None]): (Only used for numerical features) If a number, the histogram will
              have `bins` bins. If a string, it will use an automatic NumPy method for
              estimating this number. See more about available methods here:
              https://numpy.org/devdocs/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges.
              If None is provided, it uses the default class' method, which is `sqrt`.
              For binary features it simply uses bins=2 and for categoricals, bins=|categories| if is not limited
              with 'limit_categorical_perc' in set_config method.
    """
    if 'no_nan_filtered' not in feature.cache:
        no_na = column.dropna()
        feature.cache['no_nan_filtered'] = no_na
    else:
        no_na = feature.cache['no_nan_filtered']

    if isinstance(feature, (BinaryFeature, CategoricalFeature)):

        no_na = no_na[no_na.isin(feature.stats['domain'])]  # It may be truncated
        t = (no_na.value_counts() / len(no_na)).sort_index()
        feature.stats['distribution'] = t.values
        feature.stats['distribution'] = [float(x) for x in feature.stats['distribution']]
        feature.stats['distribution_bins'] = list(t.index)

    else:
        bins = self.distribution_bins_method if not bins else bins
        histo = np.histogram(no_na, bins=bins)
        feature.stats['distribution'] = list(histo[0] / no_na.count())
        feature.stats['distribution'] = [float(x) for x in feature.stats['distribution']]
        feature.stats['distribution_bins'] = list(histo[1])

`StatCalculatorFactory`

This static class receives a DataFrame and returns a particular implementation of a FeatureCalculator

`create_tutorials`

`create_tutorials(destination, silent=False)`

Copies mercury.dataschema tutorial notebooks to destination. A folder will be created inside destination, named 'dataschema_tutorials'. The folder destination must exist.

Parameters:

Name	Type	Description	Default
`destination`	`str`	The destination directory	required
`silent`	`bool`	If True, suppresses output on success.	`False`

Raises:

Type	Description
`AssertionError`	If `destination` is equal to the source path, is not an existing directory, or already contains a 'dataschema_tutorials' folder.

Examples:

>>> # copy tutorials to /tmp/dataschema_tutorials
>>> from mercury.dataschema import create_tutorials
>>> create_tutorials('/tmp')

Source code in mercury/dataschema/create_tutorials.py

def create_tutorials(destination, silent = False):
    """
    Copies mercury.dataschema tutorial notebooks to `destination`. A folder will be created inside
    destination, named 'dataschema_tutorials'. The folder `destination` must exist.

    Args:
        destination (str): The destination directory
        silent (bool): If True, suppresses output on success.

    Raises:
        ValueError: If `destination` is equal to source path.

    Examples:
        >>> # copy tutorials to /tmp/dataschema_tutorials
        >>> from mercury.dataschema import create_tutorials
        >>> create_tutorials('/tmp')

    """
    src = pkg_resources.resource_filename(__package__, 'tutorials')
    dst = os.path.abspath(destination)

    assert src != dst, 'Destination (%s) cannot be the same as source.' % src

    assert os.path.isdir(dst), 'Destination (%s) must be a directory.' % dst

    dst = os.path.join(dst, 'dataschema_tutorials')

    assert not os.path.exists(dst), 'Destination (%s) already exists' % dst

    shutil.copytree(src, dst)

    if not silent:
        print('Tutorials copied to: %s' % dst)

`feature`

`BinaryFeature(name=None, dtype=None)`

Bases: Feature

This class represents a binary feature within a schema (i.e. only two possible values).

Parameters:

Name	Type	Description	Default
`name`	`str`	Feature name	`None`
`dtype`	`str`	Data type of the feature	`None`

Source code in mercury/dataschema/feature.py

def __init__(self, name = None, dtype = None):

    super().__init__(name, dtype)

`CategoricalFeature(name=None, dtype=None)`

Bases: Feature

This class represents a categorical feature within a schema (i.e. only N possible values).

Parameters:

Name	Type	Description	Default
`name`	`str`	Feature name	`None`
`dtype`	`str`	Data type of the feature	`None`

Source code in mercury/dataschema/feature.py

def __init__(self, name = None, dtype = None):

    super().__init__(name, dtype)

`ContinuousFeature(name=None, dtype=None)`

Bases: Feature

This class represents a continuous feature within a schema (e.g. a float).

Parameters:

Name	Type	Description	Default
`name`	`str`	Feature name	`None`
`dtype`	`str`	Data type of the feature	`None`

Source code in mercury/dataschema/feature.py

def __init__(self, name = None, dtype = None):

    super().__init__(name, dtype)

`DiscreteFeature(name=None, dtype=None)`

Bases: Feature

This class represents a discrete feature within a schema (i.e. any number without decimals).

Parameters:

Name	Type	Description	Default
`name`	`str`	Feature name	`None`
`dtype`	`str`	Data type of the feature	`None`

Source code in mercury/dataschema/feature.py

def __init__(self, name = None, dtype = None):

    super().__init__(name, dtype)

`Feature(name=None, dtype=None)`

This class represents a generic feature within a schema.

Parameters:

Name	Type	Description	Default
`name`	`str`	Feature name	`None`
`dtype`	`DataType`	Data type of the feature	`None`

Source code in mercury/dataschema/feature.py

def __init__(self,
             name: str = None,
             dtype: DataType = None
             ):
    self.name = name
    self.dtype = dtype if dtype else DataType.UNKNOWN
    self.stats = {}
    self.cache = {}  # Intermediate heavy calculations

`FeatureFactory()`

Source code in mercury/dataschema/feature.py

def __init__(self):
    pass

`_build_dummy_feature(datatype, feat_type, name)`

Returns a dummy and uninitialized feature. This method is not intended to be used apart from serialization purposes.

Source code in mercury/dataschema/feature.py

def _build_dummy_feature(self, datatype: DataType, feat_type: FeatType, name: str) -> Feature:
    """ Returns a dummy and uninitialized feature. This method is not intended to be
    used apart from serialization purposes.
    """
    feat = Feature()
    if feat_type == FeatType.BINARY:
        feat = BinaryFeature()
    if feat_type == FeatType.CATEGORICAL:
        feat = CategoricalFeature()
    if feat_type == FeatType.DISCRETE:
        feat = DiscreteFeature()
    if feat_type == FeatType.CONTINUOUS:
        feat = ContinuousFeature()
    feat.dtype = datatype
    feat.name = name

    return feat

`build_feature(column, colname=None, threshold_categorical=1e-05, force_feat_type=None, verbose=True)`

Builds a schema Feature object given a column.

Parameters:

Name	Type	Description	Default
`column`	`Series`	Column to be analyzed	required
`colname`	`str`	Name of the column (feature)	`None`
`threshold_categorical`	`float`	Threshold on the fraction of unique values below which a feature is considered categorical. If the fraction of unique values < `threshold_categorical`, the column will be taken as categorical. This parameter can be a single float (same threshold for all columns) or a dict in which each key is the name of a column. Use the latter for custom thresholds per column.	`1e-05`
`force_feat_type`	`FeatType`	Forces the feature to be of the given type. When set, the type is not auto-inferred.	`None`
`verbose`	`bool`	If this is set to False, possible inner warnings won't be shown.	`True`

Returns:

Type	Description
`Feature`	Feature with only the base statistics calculated
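
The role of `threshold_categorical` can be sketched as a comparison against the fraction of unique values. This is an illustration under assumptions: the actual inference helpers are not shown on this page, missing values are simply dropped here, and the names `percent_unique` and `looks_categorical` are hypothetical.

```python
def percent_unique(values):
    """Fraction of distinct values among the non-missing entries."""
    data = [v for v in values if v is not None]
    return len(set(data)) / len(data)


def looks_categorical(values, threshold_categorical):
    """True when the fraction of unique values falls below the threshold."""
    return percent_unique(values) < threshold_categorical


codes = [1, 2, 3] * 100   # 3 distinct values in 300 rows
ids = list(range(300))    # every value is distinct

print(percent_unique(codes), looks_categorical(codes, 0.05))
print(percent_unique(ids), looks_categorical(ids, 0.05))
```

With a threshold of 0.05, the low-cardinality integer column would be treated as categorical while the identifier-like column would not.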

Source code in mercury/dataschema/feature.py

def build_feature(self,
                  column: 'pandas.Series',  # noqa: F821
                  colname: str = None,
                  threshold_categorical: float = 1e-5,
                  force_feat_type: FeatType = None,
                  verbose: bool = True
                  ) -> Feature:
    """ Builds a schema Feature object given a column.

    Args:
        column: Column to be analyzed
        colname: Name of the column (feature)
        threshold_categorical: percentage of necessary unique values for a feature to be considered
                       categorical. If the percentage of unique values < cat_threshold, the
                       column will be taken as categorical. This parameter can be a single float
                       (same threshold for all columns) or a dict in which each key is the name of
                       the column. Use the later for custom thresholds per column.
        force_feat_type: If user wants to force a variable to be of certain type, he/she can use
                        this parameter and its type will not be auto-inferred, but set to this.
        verbose: If this is set to False, possible inner warnings won't be shown.

    Returns:
        Feature with only the base statistics calculated
    """
    feat = Feature().build_stats(column)
    datatype = self.infer_datatype(column, feat)
    feat_type = FeatType.UNKNOWN

    # If user forces the feature type we kindly fulfill his/her wishes
    if force_feat_type is not None:
        feat = self._build_dummy_feature(datatype, force_feat_type, colname)
        feat.stats.update(feat.stats)
        return feat

    if feat.stats['cardinality'] == 2:
        feat_type = FeatType.BINARY
    else:
        # Data could still be either categorical, discrete or continuous
        if datatype is DataType.FLOAT:
            feat_type = self._infer_feature_type_from_float(feat, threshold_categorical, colname, verbose = verbose)

        if datatype is DataType.INTEGER:
            feat_type = self._infer_feature_type_from_int(feat, threshold_categorical, colname, verbose = verbose)

        if (datatype is DataType.STRING) or (datatype is DataType.CATEGORICAL):
            feat_type = FeatType.CATEGORICAL

    featret = self._build_dummy_feature(datatype, feat_type, colname)
    featret.stats.update(feat.stats)
    return featret

`infer_datatype(column, feature)`

Finds out the data type of the column.

Parameters:

Name	Type	Description	Default
`column`	`Series`	Column whose datatype will be inferred	required
`feature`	`Feature`	Feature object. This is needed because we want to cache several internal operations, so future calls are faster.	required

Returns:

Type	Description
`DataType`	Returns the datatype of the column

Source code in mercury/dataschema/feature.py

def infer_datatype(self, column: "pandas.Series", feature: Feature) -> DataType:  # noqa: F821
    """ Finds out the data type of the column.

    Args:
        column: column which datatype will be inferred
        feature: Feature object. This is needed because we want to cache several internal
                 operations, so future calls are faster.

    Returns:
        Returns the datatype of the column
    """
    datatype = DataType.UNKNOWN

    if column.dtype.name == 'category':
        datatype = DataType.CATEGORICAL
    elif np.issubdtype(column, np.integer):
        datatype = DataType.INTEGER
    elif np.issubdtype(column, np.bool_):
        datatype = DataType.BOOL
    elif np.issubdtype(column, np.floating):
        datatype = DataType.FLOAT
    elif np.issubdtype(column, np.object_):
        sample = feature.cache['no_nan_filtered'].iloc[0]
        if type(sample) is str:
            datatype = DataType.STRING
        # TODO: This type could be another array
        # TODO: This type could be a JSON (dict)
        # TODO: This type could be a datetime

    return datatype

`schemagen`

`DataSchema()`

Dataset schema

This class takes a dataframe and generates its schema as a collection of feature.Feature objects. Each of them contains metadata and statistics about a column of the original dataframe that can be further explored.

Example

>>> schma = DataSchema() \
...     .generate(dataset) \
...     .calculate_statistics()
 'DISBURSED_AMOUNT': Categorical Feature (NAME=DISBURSED_AMOUNT, dtype=DataType.INTEGER),
 'ASSET_COST': Categorical Feature (NAME=ASSET_COST, dtype=DataType.INTEGER),
 'LTV': Continuous Feature (NAME=LTV, dtype=DataType.FLOAT),
 'BUREAU_SCORE': Discrete Feature (NAME=BUREAU_SCORE, dtype=DataType.INTEGER),
 'BUREAU_SCORE_DESCRIPTION': Categorical Feature (NAME=BUREAU_SCORE_DESCRIPTION, dtype=DataType.STRING),
 'NEW_LOANS_IN_LAST_SIX_MONTHS': Discrete Feature (NAME=NEW_LOANS_IN_LAST_SIX_MONTHS, dtype=DataType.INTEGER),
 'DEFAULTED_LOANS_IN_LAST_SIX_MONTHS': Discrete Feature (NAME=DEFAULTED_LOANS_IN_LAST_SIX_MONTHS, dtype=DataType.INTEGER),
 'NUM_LOANS_TAKEN': Discrete Feature (NAME=NUM_LOANS_TAKEN, dtype=DataType.INTEGER),
 'NUM_ACTIVE_LOANS': Discrete Feature (NAME=NUM_ACTIVE_LOANS, dtype=DataType.INTEGER),
 'NUM_DEFAULTED_LOANS': Discrete Feature (NAME=NUM_DEFAULTED_LOANS, dtype=DataType.INTEGER),
 'AGE': Discrete Feature (NAME=AGE, dtype=DataType.INTEGER),
 'GENDER': Binary Feature (NAME=GENDER, dtype=DataType.STRING),
 'CIVIL_STATUS': Categorical Feature (NAME=CIVIL_STATUS, dtype=DataType.STRING),
 'ORIGIN': Binary Feature (NAME=ORIGIN, dtype=DataType.STRING),
 'DIGITAL': Binary Feature (NAME=DIGITAL, dtype=DataType.INTEGER),
 'SCORE': Continuous Feature (NAME=SCORE, dtype=DataType.FLOAT),
 'PREDICTION': Binary Feature (NAME=PREDICTION, dtype=DataType.INTEGER)}
>>> schma.feats['SCORE'].stats
{'num_nan': 0,
'percent_nan': 0.0,
'samples': 233154,
'percent_unique': 0.7967352050576014,
'cardinality': 185762,
'min': 0.17454321487679067,
'max': 0.9373813084029072,
'mean': 0.7625553210045813,
'std': 0.15401509786623635,
'distribution': array([7.48617716e-07, 1.07579979e-06, 1.40298186e-06, 1.73016394e-06,
        2.05734601e-06, 2.38452809e-06, 2.71171016e-06, 3.03889224e-06,
        3.36607431e-06, 3.69325638e-06, 4.02043846e-06])}
# Specifying custom parameters (shared among all features) for the calculate_statistics method
>>> schma = DataSchema() \
...     .generate(dataset) \
...     .calculate_statistics({'distribution_bins_method': 'sqrt'})  # Specify bin generation method (see numpy.histogram_bin_edges)

# We can also specify granular statistic parameters per variable
>>> schma = DataSchema() \
...     .generate(dataset) \
...     .calculate_statistics({'SCORE': {'distribution_bins_method': 'sqrt'}})  # Specify bin generation method (see numpy.histogram_bin_edges)

>>> schma = DataSchema() \
...     .generate(dataset) \
...     .calculate_statistics({'SCORE': {'distribution_bins_method': 5}})  # Specify 5 bins for the SCORE feature only

Source code in mercury/dataschema/schemagen.py

def __init__(self):
    self.dataframe = None
    self.feats = {}
    self._feat_factory = None
    self._generated = False

`binary_feats` `property`

List with the names of all binary features

`categorical_feats` `property`

List with the names of all categorical features

`continuous_feats` `property`

List with the names of all continuous features

`discrete_feats` `property`

List with the names of all discrete features

`_get_threshold(dataset_size)`

Calculates a dynamic threshold for determining whether a variable is categorical given the dataset size. It uses the function 50 / dataset_size, which tends to 0 as the dataset grows, clipped to a maximum value of 1.
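
A few values make the shape of this threshold concrete. The sketch below re-implements the formula with the built-in `min` rather than NumPy; `dynamic_threshold` is a hypothetical name.

```python
def dynamic_threshold(dataset_size):
    """min(1, 50 / dataset_size): 1 for small datasets, shrinking as they grow."""
    return min(1, 50 / dataset_size)


for n in (10, 50, 1_000, 233_154):
    print(n, dynamic_threshold(n))
```

If this threshold is compared against the fraction of unique values, as the `threshold_categorical` description suggests, it corresponds to a cut-off of roughly 50 distinct values regardless of dataset size.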

Source code in mercury/dataschema/schemagen.py

def _get_threshold(self, dataset_size):
    """ Calculates a dynamic threshold for determining whether a variable is categorical
    given the dataset. It uses an asymptotic function (whose lim->0) clipped to a maximum value of 1.
    """
    return np.minimum(1, 50 / (dataset_size))

`anonymize(anonymize_params)`

Anonymize the selected features of a data schema.

Parameters:

Name	Type	Description	Default
`anonymize_params`	`dict`	Dictionary where the keys are the names of the columns to be anonymized and the values are mercury.dataschema.Anonymize objects that can be used to anonymize them.	required

Raises: UserWarning, if anonymize_params is empty. ValueError, if a feature selected to anonymize is not binary or categorical, or is not a feature of the dataschema.

Source code in mercury/dataschema/schemagen.py

def anonymize(self, anonymize_params: dict) -> "DataSchema":
    """
    Anonymize the selected features of a data schema.

    Args:
        anonymize_params: Dictionary where the keys are the names of the columns to be anonymized and the values
                          are mercury.contrib.dataschema.Anonymize objects that can be used to anonymize them.
    Raises:
        UserWarning, if anonymize_params is empty.
        ValueError, if the feature selected to deanonymize is not binary or categorical, or is not a feature of the dataschema.
    """
    if not anonymize_params:
        raise UserWarning("To anonymize, it is necessary to use a dictionary with the format: {'var1':anonymizer1, 'var2':anonymizer2}")

    if any(feat not in self.feats.keys() for feat in anonymize_params.keys()):
        raise ValueError("Input Error: Keys of 'anonymize_params' dictionary must be columns name of the data schema")

    for feature in list(self.feats.keys()):
        anon = anonymize_params.get(feature)

        if anon:
            if not isinstance(self.feats[feature], (BinaryFeature, CategoricalFeature)):
                raise ValueError(f"Input Error: Anonymze only supports Categorical or Binary variables -> {feature}, You can use \
                                    the `force_types` param in 'generate()' to specify which features should be categorical ")
            else:
                self.feats[feature].stats['distribution_bins'] = anon.\
                    anonymize_list_any_type(list(self.feats[feature].stats['distribution_bins']))
                self.feats[feature].stats['domain'] = anon.\
                    anonymize_list_any_type(list(self.feats[feature].stats['domain']))

    return self

`calculate_statistics(calculator_configs=None)`

Triggers the computation of all statistics for all registered features of the schema.

Parameters:

Name	Type	Description	Default
`calculator_configs`	`dict`	Optional configurations for each of the calculator parameters. This can be either a dict or a "dict of dicts". In the first case, the statistics for ALL FEATURES will be computed with those parameters. In the second case, you specify a mapping of {feature_name: {config}} with granular configurations per feature. The supported configuration keys are the attributes declared within a calculator class. See mercury.dataschema.calculator.PandasStatCalculator (or Spark) for details.	`None`
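
Whether a configuration is treated as shared or per-feature depends on the type of the first value in the dict, as the source listing for this method shows. A minimal sketch of that decision follows; `split_configs` is a hypothetical helper that returns the configuration each feature would receive.

```python
def split_configs(calculator_configs, feature_names):
    """Return the config that would apply to each feature."""
    cfg = calculator_configs or {}
    # Per-feature mode is detected from the first value only.
    per_feature = len(cfg) > 0 and isinstance(list(cfg.values())[0], dict)
    if per_feature:
        # Features without an entry keep the calculator defaults (empty config).
        return {name: cfg.get(name, {}) for name in feature_names}
    return {name: cfg for name in feature_names}


features = ["AGE", "SCORE"]
shared = split_configs({"distribution_bins_method": "sqrt"}, features)
granular = split_configs({"SCORE": {"distribution_bins_method": 5}}, features)
print(shared)
print(granular)
```

Because only the first value is inspected, mixing flat keys and nested dicts in one configuration is best avoided.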

Source code in mercury/dataschema/schemagen.py

def calculate_statistics(
    self,
    calculator_configs: dict = None
) -> "DataSchema":
    """ Triggers the computation of all statistics for all registered features
    of the schema.

    Args:
        calculator_configs: Optional configurations for each of the calculator parameters.
                            This can be either a dict or a "dict of dicts". In the first case,
                            the statistics for ALL FEATURES will be computed with those parameters.
                            Additionally, you can specify a mapping of [feature_name: {config}] with
                            granular configurations per feature.
                            The supported configuration keys are the attributes declared within a calculator class.
                            See mercury.contrib.dataschema.calculator.PandasStatCalculator (or Spark) for details.
    """
    featnames = list(self.feats.keys())

    calculator_configs = calculator_configs if calculator_configs else {}

    # User can pass us two  types:
    #  - {'param': 'value', 'param2': 'value'} -> Single config shared for all variables
    #  - {{config_var1}, {config_var2}, {config_var3}, ...} -> 1 config per variable
    multiple_configs = len(calculator_configs) > 0 and isinstance(list(calculator_configs.values())[0], dict)

    # Case when user pass a single shared config
    if not multiple_configs:
        calculator = StatCalculatorFactory.build_calculator(self.dataframe)
        calculator.set_config(**calculator_configs)

    for feature in featnames:
        if multiple_configs:
            # Case when the user passes one config per variable
            calculator = StatCalculatorFactory.build_calculator(self.dataframe)
            if feature in calculator_configs:
                calculator.set_config(**(calculator_configs[feature]))

        # Calculate distributions
        self.feats[feature].build_stats(self.dataframe.loc[:, feature], calculator)

    return self
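
The dispatch between a shared configuration and per-feature configurations can be illustrated in isolation. The following is a minimal sketch, not the library's code: `resolve_configs` is a hypothetical helper, and `bins` is only a placeholder configuration key. It mirrors the rule in the listing above, where per-feature mode is detected by checking whether the first value of the dictionary is itself a dictionary.

```python
from typing import Any, Dict, List, Optional


def resolve_configs(
    featnames: List[str],
    calculator_configs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Return the configuration each feature would receive.

    Hypothetical helper: it only imitates the dispatch rule, not the library.
    """
    configs = calculator_configs or {}
    # Per-feature mode is detected from the first value only, as in the listing.
    per_feature = len(configs) > 0 and isinstance(next(iter(configs.values())), dict)
    if per_feature:
        # Features without an entry keep the calculator defaults (empty config).
        return {name: dict(configs.get(name, {})) for name in featnames}
    # Shared mode: every feature gets a copy of the same parameters.
    return {name: dict(configs) for name in featnames}


# "bins" is a placeholder key chosen for illustration.
shared = resolve_configs(["age", "city"], {"bins": 10})
granular = resolve_configs(["age", "city"], {"age": {"bins": 5}})
```

In this sketch, `shared` gives both features `{"bins": 10}`, while `granular` configures only `age` and leaves `city` with an empty configuration.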

`deanonymize(anonymize_params)`

De-anonymize the selected features on a preloaded schema.

Raises UserWarning, if anonymize_params is empty. Raises ValueError, if the feature selected to deanonymize is not binary or categorical, or is not a feature of the dataschema.

Parameters:

Name	Type	Description	Default
`anonymize_params`	`dict`	Dictionary where the keys are the names of the columns to be deanonymized and the values are mercury.contrib.dataschema.Anonymize objects that can be used to deanonymize them.	required

Source code in mercury/dataschema/schemagen.py

def deanonymize(self, anonymize_params: dict) -> "DataSchema":
    """
    De-anonymize the selected features on a preloaded schema.

    Raises UserWarning, if anonymize_params is empty.
    Raises ValueError, if the feature selected to deanonymize is not binary or categorical, or is not a feature of the dataschema.

    Args:
        anonymize_params: Dictionary where the keys are the names of the columns to be deanonymized and the values
                          are mercury.contrib.dataschema.Anonymize objects that can be used to deanonymize them.
    """
    if not anonymize_params:
        raise UserWarning("To De-anonymize, it is necessary to use a dictionary with the format: {'var1':anonym1, 'var2':anonym2}")

    if any(feat not in self.feats.keys() for feat in anonymize_params.keys()):
        raise ValueError("Input Error: Keys of 'anonymize_params' dictionary must be columns name of the data schema")

    for feature in list(self.feats.keys()):
        anon = anonymize_params.get(feature)

        if anon:

            if not isinstance(self.feats[feature], (BinaryFeature, CategoricalFeature)):
                raise ValueError(f"Input Error: Deanonymize only supports Categorical or Binary variables -> {feature} ")
            else:
                operation = int if self.feats[feature].dtype == DataType.INTEGER else str
                self.feats[feature].stats['distribution_bins'] = \
                    list(map(operation, anon.deanonymize_list(self.feats[feature].stats['distribution_bins'])))
                self.feats[feature].stats['domain'] = \
                    list(map(operation, anon.deanonymize_list(self.feats[feature].stats['domain'])))
    return self
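
The decode-then-cast step at the end of the listing can be tried without the library. This is a sketch under stated assumptions: `ReversibleCodec` is a hypothetical stand-in for an `Anonymize` object and uses base64, which is an encoding and not encryption, purely so the example is reversible and self-contained. `restore` is also a hypothetical name.

```python
import base64
from typing import List


class ReversibleCodec:
    """Stand-in for an Anonymize object. base64 is NOT encryption."""

    def anonymize_list(self, values: List[str]) -> List[str]:
        return [base64.b64encode(v.encode()).decode() for v in values]

    def deanonymize_list(self, values: List[str]) -> List[str]:
        return [base64.b64decode(v.encode()).decode() for v in values]


def restore(values: List[str], codec: ReversibleCodec, is_integer: bool) -> list:
    # Decoded values come back as strings; integer features are cast back to int.
    cast = int if is_integer else str
    return list(map(cast, codec.deanonymize_list(values)))


codec = ReversibleCodec()
hidden = codec.anonymize_list(["0", "1"])
restored = restore(hidden, codec, is_integer=True)
```

The cast matters because the stored domain of an integer feature would otherwise come back as strings after decoding.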

`from_json(json_obj)` `classmethod`

Rebuilds a schema from a JSON representation.

Returns:

Type	Description
`DataSchema`	The rebuilt schema

Source code in mercury/dataschema/schemagen.py

@classmethod
def from_json(cls, json_obj: dict) -> "DataSchema":
    """ Rebuilds a schema from a JSON representation.

    Returns:
        The rebuilt schema
    """
    schema = DataSchema()
    factory = FeatureFactory()

    for featname, feat in json_obj['feats'].items():
        ftype = FeatType[feat['feat_type']]
        dtype = DataType[feat['dtype']]
        feat_name = feat['name']
        dummy_feat = factory._build_dummy_feature(dtype, ftype, feat_name)
        dummy_feat.stats = feat['stats']
        schema.feats[featname] = dummy_feat

    return schema

`generate(dataframe, force_types=None, custom_stats=None, verbose=True)`

Builds the schema. For float and integer datatypes, the method by default tries to infer whether a feature is categorical or numeric (Continuous or Discrete) from the percentage of unique values. This heuristic does not work in every case. When it fails, use the force_types parameter to specify which features should be categorical and which should be numeric, regardless of the percentage of unique values.

Parameters:

Name	Type	Description	Default
`dataframe`	`Union[pandas.DataFrame, pyspark.sql.DataFrame]`	DataFrame on which the schema will be inferred. PySpark DataFrames are not supported yet and raise a RuntimeError.	required
`force_types`	`Dict[str, FeatType]`	Dictionary of the form {FEATURE_NAME: FeatType} listing the features to be forced to a specific type (Continuous, Discrete, Categorical...).	`None`
`custom_stats`	`dict`	Custom statistics to be calculated for each column.	`None`
`verbose`	`bool`	Whether to show warning messages. Warnings are only shown when force_types is None.	`True`

Source code in mercury/dataschema/schemagen.py

def generate(
    self,
    dataframe: Union["pandas.DataFrame", "pyspark.sql.DataFrame"],  # noqa: F821
    force_types: Dict[str, FeatType] = None,
    custom_stats: dict = None,
    verbose: bool = True,
) -> "DataSchema":
    """ Builds the schema. For float and integer datatypes, by default the method tries to infer
        if a feature is categorical or numeric (Continuous or Discrete) depending on the percentage
        of unique values. However, that doesn't work in all the cases. In those cases, you can use
        the `force_types` param to specify which features should be categorical and which
        should be numeric independently of the percentage of unique values.

    Args:
        dataframe: DataFrame on which the schema will be inferred.
        force_types: Dictionary with the form <FEATURE_NAME, FeatType> that contains the features to be
                    forced to a specific type (Continuous, Discrete, Categorical...)
        custom_stats: Custom statistics to be calculated for each column
        verbose: whether to show or filter all possible warning messages
    """
    if "pyspark" in str(type(dataframe)):
        raise RuntimeError("Sorry, Pyspark is not supported yet...")

    self.dataframe = dataframe
    self._generated = True

    self._feat_factory = FeatureFactory()

    inferring_types = force_types is None

    for col in self.dataframe.columns:
        thresh = self._get_threshold(len(self.dataframe))

        # Look if the feature type has been specified
        forced_type = None
        if not inferring_types and col in force_types:
            forced_type = force_types[col]

        feat = self._feat_factory.build_feature(
            self.dataframe.loc[:, col],
            col,
            force_feat_type=forced_type,
            threshold_categorical=thresh,
            verbose=inferring_types and verbose  # Only show warnings (if any) when using default args.
        )
        self.feats[col] = feat

    return self
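
The idea of inferring a feature type from the share of unique values can be sketched as follows. This is an illustration only: the actual rule lives in `FeatureFactory.build_feature` and is not shown on this page, so the function `looks_categorical`, the comparison it performs, and the threshold value `0.05` are all assumptions.

```python
from typing import Sequence


def looks_categorical(values: Sequence, threshold: float) -> bool:
    """Guess 'categorical' when few distinct values exist relative to the size.

    Hypothetical heuristic, not the library's implementation.
    """
    if len(values) == 0:
        return False
    unique_ratio = len(set(values)) / len(values)
    return unique_ratio < threshold


ratings = [1, 2, 3] * 100        # 3 distinct values in 300 rows
identifiers = list(range(300))   # every value is distinct

ratings_is_categorical = looks_categorical(ratings, threshold=0.05)
identifiers_is_categorical = looks_categorical(identifiers, threshold=0.05)
```

A heuristic of this kind misclassifies small datasets and low-cardinality numeric columns, which is the situation `force_types` is meant to cover.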

`generate_manual(dataframe, categ_columns, discrete_columns, binary_columns, custom_stats=None)`

Builds the schema manually. This acts like generate() but in a more restrictive way. All the names passed to categ_columns will be taken as categorical features, no more, no less. It will avoid making automatic type inference on every feature not in categ_columns. The same rule is applied on discrete_columns.

Note

This method is considered to be low level. If you use this, make sure the type assignment to each feature type is compatible with the datatypes (float, int, string,...) in the column or a later call to calculate_statistics could fail.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	DataFrame on which the schema will be inferred.	required
`categ_columns`	`List[str]`	List of columns which will be forced to be taken as categorical. Warning: all features not in this list are guaranteed not to be categorical.	required
`discrete_columns`	`List[str]`	List of columns which will be forced to be taken as discrete. Warning: all features not in this list are guaranteed not to be taken as discrete (i.e. they will be continuous).	required
`binary_columns`	`List[str]`	List of columns which will be forced to be taken as binary.	required
`custom_stats`	`Optional[Dict[str, Any]]`	Custom statistics to be calculated for each column.	`None`

Source code in mercury/dataschema/schemagen.py

def generate_manual(
    self,
    dataframe: Union["pandas.DataFrame", "pyspark.sql.DataFrame"],  # noqa: F821
    categ_columns: List[str],
    discrete_columns: List[str],
    binary_columns: List[str],
    custom_stats: dict = None,
) -> "DataSchema":
    """ Builds the schema manually. This acts like `generate()` but in a more restrictive way.
    All the names passed to `categ_columns` will be taken as categorical features, no more, no less.
    It will avoid making automatic type inference on every feature not in `categ_columns`.
    The same rule is applied on `discrete_columns`.

    Note:
        This method is considered to be low level. If you use this, make sure the type assignment
        to each feature type is compatible with the datatypes (float, int, string,...) in the column or
        a later call to `calculate_statistics` could fail.

    Args:
        dataframe (pd.DataFrame): DataFrame on which the schema will be inferred.
        categ_columns (List[str]): list of columns which will be forced to be taken as categorical. Warning:
                      all features not in this list are guaranteed not to be categorical
        discrete_columns (List[str]): list of columns which will be forced to be taken as discrete. Warning:
                      all features not in this list are guaranteed not to be taken as discrete (i.e.
                      they will be continuous).
        binary_columns (List[str]): list of columns which will be forced to be taken as binary.
        custom_stats (Optional[Dict[str, Any]]): Custom statistics to be calculated for each column.
    """
    force_types = {}
    for col in dataframe.columns:
        if col in categ_columns:
            force_types[col] = FeatType.CATEGORICAL
        else:
            # Is in either binary, continuous or discrete lists
            if col in discrete_columns:
                force_types[col] = FeatType.DISCRETE
            elif col in binary_columns:
                force_types[col] = FeatType.BINARY
            else:
                force_types[col] = FeatType.CONTINUOUS

    return self.generate(
        dataframe=dataframe,
        force_types=force_types,
        verbose=False,
        custom_stats=custom_stats
    )
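
The listing above resolves overlapping lists in a fixed order: categorical first, then discrete, then binary, with continuous as the fallback. The sketch below reproduces only that precedence. `Kind` is a hypothetical stand-in for `FeatType`, and `assign_types` and the column names are made up for illustration.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Kind(Enum):
    """Hypothetical stand-in for the library's FeatType enum."""
    CATEGORICAL = "categorical"
    DISCRETE = "discrete"
    BINARY = "binary"
    CONTINUOUS = "continuous"


def assign_types(
    columns: Iterable[str],
    categ_columns: List[str],
    discrete_columns: List[str],
    binary_columns: List[str],
) -> Dict[str, Kind]:
    """Mirror the precedence: categorical, then discrete, then binary, else continuous."""
    forced = {}
    for col in columns:
        if col in categ_columns:
            forced[col] = Kind.CATEGORICAL
        elif col in discrete_columns:
            forced[col] = Kind.DISCRETE
        elif col in binary_columns:
            forced[col] = Kind.BINARY
        else:
            # Anything not listed anywhere falls back to continuous.
            forced[col] = Kind.CONTINUOUS
    return forced


forced = assign_types(
    columns=["city", "visits", "is_active", "income"],
    categ_columns=["city", "visits"],  # "visits" is listed twice on purpose
    discrete_columns=["visits"],
    binary_columns=["is_active"],
)
```

Here `visits` ends up categorical because the categorical list is checked first, and `income` becomes continuous because it appears in no list.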

`load(path)` `classmethod`

Loads a previously serialized schema (as JSON)

Parameters:

Name	Type	Description	Default
`path`	`str`	path to the serialized schema	required

Returns:

Type	Description
`DataSchema`	The rebuilt schema

Source code in mercury/dataschema/schemagen.py

@classmethod
def load(cls, path: str) -> "DataSchema":
    """ Loads a previously serialized schema (as JSON)

    Args:
        path: path to the serialized schema

    Returns:
        The rebuilt schema
    """
    with open(path, 'r') as file:
        json_obj = json.load(file)
    schema = cls.from_json(json_obj)
    return schema

`save(path)`

Saves a JSON with the schema representation

Parameters:

Name	Type	Description	Default
`path`	`str`	where the JSON will be saved.	required

Source code in mercury/dataschema/schemagen.py

def save(self, path):
    """ Saves a JSON with the schema representation

    Args:
        path (str): where the JSON will be saved.
    """
    with open(path, 'w') as file:
        json.dump(self.to_json(), file)
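
The `save` and `load` pair amounts to a JSON round trip through `to_json` and `from_json`. The following sketch shows that pattern with a toy class. `TinySchema` is hypothetical and stores plain dictionaries instead of feature objects; the feature name and stats are invented for the example.

```python
import json
import os
import tempfile


class TinySchema:
    """Hypothetical stand-in that keeps only a dict of per-feature stats."""

    def __init__(self, feats=None):
        self.feats = feats or {}

    def to_json(self) -> dict:
        return {"feats": dict(self.feats)}

    @classmethod
    def from_json(cls, json_obj: dict) -> "TinySchema":
        return cls(dict(json_obj["feats"]))

    def save(self, path: str) -> None:
        with open(path, "w") as file:
            json.dump(self.to_json(), file)

    @classmethod
    def load(cls, path: str) -> "TinySchema":
        with open(path, "r") as file:
            return cls.from_json(json.load(file))


original = TinySchema({"age": {"dtype": "INTEGER", "stats": {"min": 0, "max": 99}}})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "schema.json")
    original.save(path)
    reloaded = TinySchema.load(path)
```

As in the listings on this page, only the feature descriptions and their stats travel through the file; the original data is not part of the serialized form.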

`to_json()`

Converts the schema to a JSON representation

Returns:

Type	Description
`dict`	dictionary with the features and their stats

Source code in mercury/dataschema/schemagen.py

def to_json(self) -> dict:
    """ Converts the schema to a JSON representation

    Returns:
        dictionary with the features and their stats
    """
    retdict = dict(feats=dict())
    for key, val in self.feats.items():
        retdict['feats'][key] = self.feats[key].to_json()

    return retdict

`validate(other)`

Validates another schema against this one. The other schema is considered valid if it has the same feature names, feature types, and data types as this one.

Raises RuntimeError if the other schema differs from this one.

Parameters:

Name	Type	Description	Default
`other`	`DataSchema`	Schema to be checked against this one.	required

Source code in mercury/dataschema/schemagen.py

def validate(self, other: "DataSchema"):
    """ Validates other schema with this one. The other schema will be considered
    valid if it shares the same feature names and datatypes with this.

    Raises RuntimeError if other schema differs from this one

    Args:
        other: other schema to be checked from this one
    """
    # Check feature names match
    if list(self.feats.keys()) != list(other.feats.keys()):
        diff = set(self.feats.keys()) - set(other.feats.keys())
        raise RuntimeError(f"Features do not match. These ones are not present on both datasets {list(diff)}")

    # Check feature and data types are the same
    for key, item in other.feats.items():
        if not isinstance(item, self.feats[key].__class__):
            raise RuntimeError(f"""Feature types do not match. '{key}' in other is """
                               f"""{type(item)}. However, {type(self.feats[key])} is expected.""")

        if item.dtype != self.feats[key].dtype:
            raise RuntimeError(f"""Data types do not match. '{key}' in other is """
                               f"""{item.dtype}. However, {self.feats[key].dtype} is expected.""")
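
The three checks performed by `validate` (names, feature types, data types) can be sketched on plain dictionaries. This is an illustration under assumptions: `check_compatible` is a hypothetical function, and each schema is modelled as a mapping from feature name to a `(feature_kind, data_type)` pair. The sketch departs from the listing in one deliberate way: it reports a symmetric difference, so names missing on either side are listed, whereas the listing subtracts in one direction only. Like the listing, it compares name lists in order, so the same names in a different order also fail.

```python
from typing import Dict, Tuple

# Each schema is modelled as {feature_name: (feature_kind, data_type)}.
Schema = Dict[str, Tuple[str, str]]


def check_compatible(reference: Schema, other: Schema) -> None:
    """Raise RuntimeError if `other` differs from `reference` (hypothetical helper)."""
    if list(reference) != list(other):
        # Symmetric difference: names missing on either side are reported.
        diff = sorted(set(reference) ^ set(other))
        raise RuntimeError(f"Features do not match: {diff}")

    for name, (kind, dtype) in other.items():
        expected_kind, expected_dtype = reference[name]
        if kind != expected_kind:
            raise RuntimeError(f"Feature types do not match for '{name}': {kind} != {expected_kind}")
        if dtype != expected_dtype:
            raise RuntimeError(f"Data types do not match for '{name}': {dtype} != {expected_dtype}")


reference = {"age": ("continuous", "INTEGER"), "city": ("categorical", "STRING")}
check_compatible(reference, dict(reference))  # identical schemas pass silently
```

A typical use is to call the check on a schema built from new data against a schema saved at training time, and treat the exception as a signal that the data layout has changed.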
