Data Transformers#
All synthesizers take an optional transformer argument, which accepts a TableTransformer object. The transformer is used to transform the data before synthesis and to reverse the transformation after synthesis. A TableTransformer manages a list of ColumnTransformer objects, one for each column in the table. Multiple transformations of a column can be chained together with a ChainTransformer.
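For example, a transformer can be built and then handed to a synthesizer. This is a minimal sketch, assuming the MWEMSynthesizer and PUMS data used in the examples below, and that the transformer is supplied via the fit method's transformer argument:
from snsynth.mwem import MWEMSynthesizer
from snsynth.transform.table import TableTransformer
import pandas as pd
pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/
pums = pums.drop(['income', 'age'], axis=1)
cat_cols = list(pums.columns)
# build a cube-style transformer and hand it to the synthesizer
tt = TableTransformer.create(pums, style='cube', categorical_columns=cat_cols)
mwem = MWEMSynthesizer(epsilon=2.0)
mwem.fit(pums, transformer=tt)
synthetic = mwem.sample(100)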
Using Data Transformers#
Inferring a TableTransformer#
The create factory method can be used to create a TableTransformer based on a data set, which can be a pandas dataframe, a numpy array, or a list of tuples. The following example creates a transformer that converts categorical columns to one-hot encoding.
from snsynth.transform.table import TableTransformer
import pandas as pd
pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/
pums = pums.drop(['income', 'age'], axis=1)
cat_cols = list(pums.columns)
tt = TableTransformer.create(pums, style='gan', categorical_columns=cat_cols)
pums_encoded = tt.fit_transform(pums)
# 26 columns in one-hot encoding
assert(len(pums_encoded[0]) == 26)
assert(len(pums_encoded) == len(pums))
# round-trip
pums_decoded = tt.inverse_transform(pums_encoded)
assert(pums.equals(pums_decoded))
The default one-hot style is useful for neural networks, but is wasteful for cube-style synthesizers such as MWEM, MST, and PAC-SYNTH. The style argument can be used to specify a different style. The following example creates a transformer that converts categorical columns into sequential label encoding.
from snsynth.transform.table import TableTransformer
import pandas as pd
pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/
pums = pums.drop(['income', 'age'], axis=1)
cat_cols = list(pums.columns)
tt = TableTransformer.create(pums, style='cube', categorical_columns=cat_cols)
pums_encoded = tt.fit_transform(pums)
# 4 columns in sequential label encoding
assert(len(pums_encoded[0]) == 4)
assert(len(pums_encoded) == len(pums))
# round-trip
pums_decoded = tt.inverse_transform(pums_encoded)
assert(pums.equals(pums_decoded))
Inferring Bounds for Continuous Columns#
In the examples above, we used only categorical columns, since continuous values need a min and max value to be transformed. The create method can infer the min and max values from the data set. Inferring the min and max requires some privacy budget, specified by the epsilon argument.
from snsynth.transform.table import TableTransformer
import pandas as pd
pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/
cat_cols = list(pums.columns)
cat_cols.remove('income')
cat_cols.remove('age')
tt = TableTransformer.create(pums, style='cube', categorical_columns=cat_cols, continuous_columns=['age', 'income'])
pums_encoded = tt.fit_transform(pums, epsilon=3.0)
# 6 columns in sequential label encoding
assert(len(pums_encoded[0]) == 6)
assert(len(pums_encoded) == len(pums))
# round-trip
pums_decoded = tt.inverse_transform(pums_encoded)
assert(round(pums['age'].mean()) == round(pums_decoded['age'].mean()))
print(f"We used {tt.odometer.spent} when fitting the transformer")
Declaring a TableTransformer#
In the above example, the transformer used some privacy budget to infer approximate bounds for the two continuous columns. When bounds are known in advance, this is wasteful and can impact the accuracy of the synthesizer. In cases where you want maximum control, you can specify your TableTransformer declaratively:
from snsynth.transform import *
import pandas as pd
pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/
tt = TableTransformer([
MinMaxTransformer(lower=18, upper=70), # age
LabelTransformer(), # sex
LabelTransformer(), # educ
LabelTransformer(), # race
MinMaxTransformer(lower=0, upper=420000), # income
LabelTransformer(), # married
])
pums_encoded = tt.fit_transform(pums)
# no privacy budget used
assert(tt.odometer.spent == (0.0, 0.0))
# round-trip
pums_decoded = tt.inverse_transform(pums_encoded)
assert(round(pums['age'].mean()) == round(pums_decoded['age'].mean()))
Individual column transformers can be chained together with a ChainTransformer. For example, we might want to convert each categorical column to a sequential label encoding, but then convert the resulting columns to one-hot encoding. And we might want to log-transform the income column. The following example shows how to do this:
from snsynth.transform import *
import pandas as pd
import numpy as np
pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/
tt = TableTransformer([
MinMaxTransformer(lower=18, upper=70), # age
ChainTransformer([LabelTransformer(), OneHotEncoder()]), # sex
ChainTransformer([LabelTransformer(), OneHotEncoder()]), # educ
ChainTransformer([LabelTransformer(), OneHotEncoder()]), # race
ChainTransformer([
LogTransformer(),
MinMaxTransformer(lower=0, upper=np.log(420000)) # income
]),
ChainTransformer([LabelTransformer(), OneHotEncoder()]), # married
])
pums_encoded = tt.fit_transform(pums)
# no privacy budget used
assert(tt.odometer.spent == (0.0, 0.0))
# round-trip
pums_decoded = tt.inverse_transform(pums_encoded)
assert(round(pums['age'].mean()) == round(pums_decoded['age'].mean()))
Default TableTransformer#
If the transformer argument is not provided, the synthesizer will attempt to infer the most appropriate transformer to map the data into the format it expects.
from snsynth.pytorch.nn import DPCTGAN
from snsynth.mwem import MWEMSynthesizer
import pandas as pd
pums_csv_path = "PUMS.csv"
pums = pd.read_csv(pums_csv_path, index_col=None) # in datasets/
pums = pums.drop(['income', 'age'], axis=1)
cat_cols = list(pums.columns)
mwem = MWEMSynthesizer(epsilon=2.0)
mwem.fit(pums, categorical_columns=cat_cols)
print(f"MWEM inferred a cube transformer with {mwem._transformer.output_width} columns")
dpctgan = DPCTGAN(epsilon=2.0)
dpctgan.fit(pums, categorical_columns=cat_cols)
print(f"DPCTGAN inferred a onehot transformer with {dpctgan._transformer.output_width} columns")
Anonymizing Personally Identifiable Information (PII)#
To prevent leakage of sensitive information, PII can be anonymized by generating fake data. The AnonymizationTransformer can be used with built-in methods of the Faker library or with a custom callable. By default, existing values are discarded and new values are generated during inverse transformation. If fake_inbound=True is provided, the new values are injected during transformation instead.
import random
from snsynth.transform import *
# example data set with columns: user ID, email, age
pii_data = [(1, "email_1", 29), (2, "email_2", 42), (3, "email_3", 18)]
tt = TableTransformer([
AnonymizationTransformer(lambda: random.randint(0, 1_000)), # generate random user ID
AnonymizationTransformer("email"), # fake email
ChainTransformer([
AnonymizationTransformer(lambda: random.randint(0, 100), fake_inbound=True), # generate random age
MinMaxTransformer(lower=0, upper=100) # then use another transformer
])
])
pii_data_transformed = tt.fit_transform(pii_data)
assert all(len(t) == 1 for t in pii_data_transformed) # only the faked age column could be used by a synthesizer
pii_data_inversed = tt.inverse_transform(pii_data_transformed)
assert all(a != b for a, b in zip(pii_data, pii_data_inversed))
Mixing Inferred and Declared Transformers#
In many cases, the inferred transformers will be mostly acceptable, with only a few columns requiring special handling. In this case, the create method accepts constraints on the inference to make sure that specific columns are handled differently. For example, the following code will use the inferred transformer for all columns except for the income column, which will be transformed using a LogTransformer and a BinTransformer:
import pandas as pd
import math
from snsynth.transform import *
pums = pd.read_csv('PUMS_pid.csv', index_col=None)
tt = TableTransformer.create(
pums,
style='cube',
constraints={
'income':
ChainTransformer([
LogTransformer(),
BinTransformer(bins=20, lower=0, upper=math.log(400_000))
])
}
)
tt.fit(pums, epsilon=1.0)
print(tt.odometer.spent)
income = tt.transformers[4]
assert(isinstance(income, ChainTransformer))
pid = tt.transformers[6]
assert(isinstance(pid, AnonymizationTransformer))
print(f"ID column is a {pid.fake} anonymization")
In the above example, the budget spent will be 0.0, because the bounds were specified for the income column. All other columns are correctly inferred, with the identifier column using a sequence faker.
Note that the inferred columns will use cube style, so we use a BinTransformer to discretize the income column. If we had used gan style, we would have used something more appropriate for a GAN, such as a OneHotEncoder or MinMaxTransformer.
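For example, the same constraint in gan style might swap the BinTransformer for a MinMaxTransformer. A sketch, reusing the pums data and imports from the example above:
tt = TableTransformer.create(
    pums,
    style='gan',
    constraints={
        'income':
            ChainTransformer([
                LogTransformer(),
                MinMaxTransformer(lower=0, upper=math.log(400_000))
            ])
    }
)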
Constraints can also be specified with shortcut strings. For example, if we want the identifier column to be faked with a random GUID rather than an integer sequence, we could manually construct the AnonymizationTransformer similar to the above, or we can just use the string "uuid4":
tt = TableTransformer.create(
pums,
style='cube',
constraints={
'pid': 'uuid4'
}
)
Because this is a faker, it works the same regardless of the style. Likewise, constraints can be specified as ordinal, categorical, or continuous to use the appropriate transformer for the column regardless of style.
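For example, to force educ to be treated as ordinal regardless of what inference would pick (a sketch; the alias behaves like listing the column in ordinal_columns, except that constrained columns are excluded from inference):
tt = TableTransformer.create(
    pums,
    style='cube',
    constraints={
        'educ': 'ordinal'
    }
)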
In the below example, we use the "drop" constraint to drop the identifier column entirely. We also specify that educ should be treated as continuous, even though it is an integer with only 13 levels. This will cause the inferred transformer to use continuous transformers rather than discrete ones. Both of these constraints will do the right thing regardless of the style.
tt = TableTransformer.create(
pums,
style='cube',
constraints={
'educ': 'continuous',
'pid': 'drop'
}
)
TableTransformer API#
- class snsynth.transform.table.TableTransformer(transformers=[], *ignore, odometer=None)[source]#
Transforms a table of data.
- Parameters
transformers – a list of transformers, one per column
odometer – an optional odometer to use to track privacy spent when fitting the data
- property cardinality#
Returns the cardinality of each output column. Returns None for continuous columns.
- classmethod create(data, style='gan', *ignore, nullable=False, categorical_columns=[], ordinal_columns=[], continuous_columns=[], special_types={}, constraints=None)[source]#
Creates a transformer for the given data. Infers all columns if the provided lists are empty. Columns that are referenced in a constraint will be excluded from the inference.
- Parameters
data (pd.DataFrame, np.ndarray, or list of tuples) – The private data to construct a transformer for.
style (string, optional) – The style influences the choice of ColumnTransformers. Can either be ‘gan’ or ‘cube’. Defaults to ‘gan’, which results in one-hot style.
nullable (bool, optional) – Whether to allow null values in the data. This is used as a hint when inferring transformers. Defaults to False.
categorical_columns (list[], optional) – List of column names or indices to be treated as categorical columns, used as hint.
ordinal_columns (list[], optional) – List of column names or indices to be treated as ordinal columns, used as hint.
continuous_columns (list[], optional) – List of column names or indices to be treated as continuous columns, used as hint.
constraints (dict, optional) – Dictionary that maps from column names or indices to constraints. There are multiple ways to specify a constraint. It can be a ColumnTransformer object, type, or class name. Another possibility is the string keyword ‘drop’, which enforces a DropTransformer. Also, a string alias for any of the lists, like ‘categorical’, can be provided. All other values, e.g. a callable or Faker method, will be passed into an AnonymizationTransformer.
- Returns
The transformer object
- Return type
TableTransformer
- fit(data, *ignore, epsilon=None)[source]#
Fits the transformer to the data.
- Parameters
data – a table represented as a list of tuples, a numpy.ndarray, or a pandas DataFrame
epsilon – the privacy budget to spend fitting the data
- property fit_complete#
Returns True if the transformer has been fit.
- fit_transform(data, *ignore, epsilon=None)[source]#
Fits the transformer to the data, then transforms.
- Parameters
data (a list of tuples, a numpy.ndarray, or a pandas DataFrame) – tabular data to transform
epsilon (float, optional) – the privacy budget to spend fitting the data
- Returns
the transformed data
- Return type
a list of tuples
- property needs_epsilon#
Returns True if the transformer needs to spend privacy budget when fitting.
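For example, needs_epsilon and fit_complete can be used to guard fitting. A minimal sketch, assuming a transformer tt and the pums data from the examples above:
if tt.needs_epsilon:
    tt.fit(pums, epsilon=1.0)
else:
    tt.fit(pums)
assert tt.fit_complete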
- class snsynth.transform.table.NoTransformer(*ignore)[source]#
A pass-through table transformer that does nothing. Note that the transform and inverse_transform methods will simply return the data that is passed in, rather than transforming to and from a list of tuples. This transformer is suitable when you know that your input data is exactly what is needed for a specific synthesizer, and you want to skip all pre-processing steps. If you want a pass-through transformer that is slightly more adaptable to multiple synthesizers, you can make a new TableTransformer with a list of IdentityTransformer column transformers.
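A minimal sketch of the pass-through behavior:
from snsynth.transform.table import NoTransformer
import pandas as pd
pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/
nt = NoTransformer()
out = nt.fit_transform(pums)
# the dataframe passes through unchanged, not converted to a list of tuples
assert pums.equals(out)
assert pums.equals(nt.inverse_transform(out))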
Column Transformers Reference#
LabelTransformer#
- class snsynth.transform.label.LabelTransformer(nullable=True)[source]#
Transforms categorical values into integer-indexed labels. Labels will be sorted if possible, so that the output can be used as an ordinal. The indices will be 0-based.
- Parameters
nullable – If null values are expected, a second output will be generated indicating null.
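A minimal example; the outputs assume the sorted, 0-based indexing described above:
from snsynth.transform import LabelTransformer
lt = LabelTransformer(nullable=False)
encoded = lt.fit_transform(['b', 'a', 'c', 'a'])
print(encoded) # [1, 0, 2, 0] -- labels sorted, 0-based
print(lt.inverse_transform(encoded)) # ['b', 'a', 'c', 'a']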
OneHotEncoder#
MinMaxTransformer#
- class snsynth.transform.minmax.MinMaxTransformer(*, lower=None, upper=None, negative=True, epsilon=0.0, nullable=False, odometer=None)[source]#
Transforms a column of values to scale between -1.0 and +1.0.
- Parameters
lower – The minimum value to scale to.
upper – The maximum value to scale to.
negative – If True, scale between -1.0 and 1.0. Otherwise, scale between 0.0 and 1.0.
epsilon – The privacy budget to use to infer bounds, if none provided.
nullable – If null values are expected, a second output will be generated indicating null.
odometer – The optional odometer to use to track privacy budget.
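A minimal sketch with known bounds; with negative=True (the default), the output spans -1.0 to 1.0:
from snsynth.transform import MinMaxTransformer
mm = MinMaxTransformer(lower=0, upper=10)
scaled = mm.fit_transform([0, 5, 10])
print(scaled) # approximately [-1.0, 0.0, 1.0]
print(mm.inverse_transform(scaled)) # recovers the original values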
StandardScaler#
- class snsynth.transform.standard.StandardScaler(*, lower=None, upper=None, epsilon=0.0, nullable=False, odometer=None)[source]#
Transforms a column of values to scale with mean centered on 0 and unit variance. Some privacy budget is always used to estimate the mean and variance. If upper and lower are not supplied, the budget will also be used to estimate the bounds of the column.
- Parameters
lower – The minimum value to scale to.
upper – The maximum value to scale to.
epsilon – The privacy budget to use.
nullable – Whether the column can contain null values. If True, the output will be a tuple of (value, null_flag).
odometer – The optional privacy odometer to use to track privacy budget spent.
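Because the mean and variance are always estimated privately, the scaler needs a budget even when bounds are supplied. A sketch, reusing the PUMS age column:
from snsynth.transform import StandardScaler
import pandas as pd
pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/
# bounds supplied, so epsilon is spent only on the mean and variance
ss = StandardScaler(lower=18, upper=70, epsilon=1.0)
age_scaled = ss.fit_transform(list(pums['age']))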
BinTransformer#
- class snsynth.transform.bin.BinTransformer(*, bins=10, lower=None, upper=None, epsilon=0.0, nullable=False, odometer=None)[source]#
Transforms continuous values into a discrete set of bins.
- Parameters
bins – The number of bins to create.
lower – The minimum value to scale to.
upper – The maximum value to scale to.
epsilon – The privacy budget to use to infer bounds, if none provided.
nullable – If null values are expected, a second output will be generated indicating null.
odometer – The optional odometer to use to track privacy budget.
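A minimal sketch with known bounds; each value is mapped to a bin index:
from snsynth.transform import BinTransformer
bt = BinTransformer(bins=10, lower=0, upper=100)
binned = bt.fit_transform([5, 50, 95])
print(binned) # bin indices, e.g. [0, 5, 9]
# inverse_transform returns representative values from each bin
print(bt.inverse_transform(binned))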
LogTransformer#
ChainTransformer#
ClampTransformer#
AnonymizationTransformer#
- class snsynth.transform.anonymization.AnonymizationTransformer(fake, *args, faker_setup=None, fake_inbound=False, **kwargs)[source]#
Transformer that can be used to anonymize personally identifiable information (PII) or other values. By default, the existing values are discarded during transformation and not used by a synthesizer. During inverse transformation, new values will be generated according to the specified fake.
If fake_inbound is true, the new values will be injected during transformation and passed through on inverse. This might be useful, for example, for operation in a ChainTransformer.
Beware that the provided fake is called once to verify that the provided (keyword) arguments are valid.
- Parameters
fake (str or callable, required) – Text reference to Faker method (e.g. ‘email’) or custom callable
args (args, optional) – Arguments for the method
faker_setup (dict, optional) – Dictionary with keyword arguments for Faker initialization e.g. {‘locale’: ‘de_DE’}
fake_inbound (bool, optional) – Defaults to False.
kwargs (kwargs, optional) – Keyword arguments for the method