Data Transformers

All synthesizers take an optional transformer argument, which accepts a TableTransformer object. The transformer is used to transform the data before synthesis and then reverse the transformation after synthesis. A TableTransformer manages a list of ColumnTransformer objects, one for each column in the table. Multiple transformations of a column can be chained together with a ChainTransformer.

Using Data Transformers

Inferring a TableTransformer

The create factory method can be used to create a TableTransformer based on a data set, which can be a pandas DataFrame, a numpy array, or a list of tuples. The following example creates a transformer that converts categorical columns to one-hot encoding.

import pandas as pd

from snsynth.transform.table import TableTransformer

pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/
pums = pums.drop(['income', 'age'], axis=1)
cat_cols = list(pums.columns)

tt = TableTransformer.create(pums, style='gan', categorical_columns=cat_cols)
pums_encoded = tt.fit_transform(pums)

# 26 columns in one-hot encoding
assert(len(pums_encoded[0]) == 26)
assert(len(pums_encoded) == len(pums))

# round-trip
pums_decoded = tt.inverse_transform(pums_encoded)
assert(pums.equals(pums_decoded))

The default one-hot style is useful for neural networks, but is wasteful for cube-style synthesizers such as MWEM, MST, and PAC-SYNTH. The style argument can be used to specify a different style. The following example creates a transformer that converts categorical columns into sequential label encoding.

import pandas as pd

from snsynth.transform.table import TableTransformer

pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/
pums = pums.drop(['income', 'age'], axis=1)
cat_cols = list(pums.columns)

tt = TableTransformer.create(pums, style='cube', categorical_columns=cat_cols)
pums_encoded = tt.fit_transform(pums)

# 4 columns in sequential label encoding
assert(len(pums_encoded[0]) == 4)
assert(len(pums_encoded) == len(pums))

# round-trip
pums_decoded = tt.inverse_transform(pums_encoded)
assert(pums.equals(pums_decoded))

Inferring Bounds for Continuous Columns

In the examples above, we used only categorical columns, because continuous columns need known lower and upper bounds before they can be transformed. The create method can infer these bounds from the data set. Inferring the bounds requires some privacy budget, specified by the epsilon argument.

import pandas as pd

from snsynth.transform.table import TableTransformer

pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/
cat_cols = list(pums.columns)
cat_cols.remove('income')
cat_cols.remove('age')

tt = TableTransformer.create(pums, style='cube', categorical_columns=cat_cols, continuous_columns=['age', 'income'])
pums_encoded = tt.fit_transform(pums, epsilon=3.0)

# 6 columns in sequential label encoding
assert(len(pums_encoded[0]) == 6)
assert(len(pums_encoded) == len(pums))

# round-trip
pums_decoded = tt.inverse_transform(pums_encoded)
assert(round(pums['age'].mean()) == round(pums_decoded['age'].mean()))
print(f"Spent {tt.odometer.spent} fitting the transformer")

Declaring a TableTransformer

In the above example, the transformer used some privacy budget to infer approximate bounds for the two continuous columns. When bounds are known in advance, this is wasteful and can impact the accuracy of the synthesizer. In most non-trivial cases, you will want to specify your TableTransformer declaratively:

import pandas as pd

from snsynth.transform import *

pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/

tt = TableTransformer([
    MinMaxTransformer(lower=18, upper=70), # age
    LabelTransformer(), # sex
    LabelTransformer(), # educ
    LabelTransformer(), # race
    MinMaxTransformer(lower=0, upper=420000), # income
    LabelTransformer(), # married
])

pums_encoded = tt.fit_transform(pums)

# no privacy budget used
assert(tt.odometer.spent == (0.0, 0.0))

# round-trip
pums_decoded = tt.inverse_transform(pums_encoded)
assert(round(pums['age'].mean()) == round(pums_decoded['age'].mean()))

Individual column transformers can be chained together with a ChainTransformer. For example, we might want to convert each categorical column to a sequential label encoding, but then convert the resulting columns to one-hot encoding. And we might want to log-transform the income column. The following example shows how to do this:

import numpy as np
import pandas as pd

from snsynth.transform import *

pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/

tt = TableTransformer([
    MinMaxTransformer(lower=18, upper=70), # age
    ChainTransformer([LabelTransformer(), OneHotEncoder()]), # sex
    ChainTransformer([LabelTransformer(), OneHotEncoder()]), # educ
    ChainTransformer([LabelTransformer(), OneHotEncoder()]), # race
    ChainTransformer([
        LogTransformer(),
        MinMaxTransformer(lower=0, upper=np.log(420000)) # income
    ]),
    ChainTransformer([LabelTransformer(), OneHotEncoder()]), # married
])

pums_encoded = tt.fit_transform(pums)

# no privacy budget used
assert(tt.odometer.spent == (0.0, 0.0))

# round-trip
pums_decoded = tt.inverse_transform(pums_encoded)
assert(round(pums['age'].mean()) == round(pums_decoded['age'].mean()))

Default TableTransformer

If the transformer argument is not provided, the synthesizer will attempt to infer the most appropriate transformer to map the data into the format it expects.

from snsynth.pytorch.nn import DPCTGAN
from snsynth.mwem import MWEMSynthesizer
import pandas as pd

pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/
pums = pums.drop(['income', 'age'], axis=1)
cat_cols = list(pums.columns)

mwem = MWEMSynthesizer(epsilon=2.0)
mwem.fit(pums, categorical_columns=cat_cols)
print(f"MWEM inferred a cube transformer with {mwem._transformer.output_width} columns")

dpctgan = DPCTGAN(epsilon=2.0)
dpctgan.fit(pums, categorical_columns=cat_cols)
print(f"DPCTGAN inferred a onehot transformer with {dpctgan._transformer.output_width} columns")

TableTransformer API

class snsynth.transform.table.TableTransformer(transformers=[], *ignore, odometer=None)[source]

Transforms a table of data.

Parameters
  • transformers – a list of transformers, one per column

  • odometer – an optional odometer to use to track privacy spent when fitting the data

property cardinality

Returns the cardinality of each output column. Returns None for continuous columns.

classmethod create(data, style='gan', *ignore, nullable=False, categorical_columns=[], ordinal_columns=[], continuous_columns=[])[source]

Creates a transformer from data.

fit(data, *ignore, epsilon=None)[source]

Fits the transformer to the data.

Parameters
  • data – a table represented as a list of tuples, a numpy.ndarray, or a pandas DataFrame

  • epsilon – the privacy budget to spend fitting the data

property fit_complete

Returns True if the transformer has been fit.

fit_transform(data, *ignore, epsilon=None)[source]

Fits the transformer to the data, then transforms.

Parameters
  • data (a list of tuples, a numpy.ndarray, or a pandas DataFrame) – tabular data to transform

  • epsilon (float, optional) – the privacy budget to spend fitting the data

Returns

the transformed data

Return type

a list of tuples

property needs_epsilon

Returns True if the transformer needs to spend privacy budget when fitting.

transform(data)[source]

Transforms the data.

Parameters

data (a list of tuples, a numpy.ndarray, or a pandas DataFrame) – tabular data to transform

Returns

the transformed data

Return type

a list of tuples

class snsynth.transform.table.NoTransformer(*ignore)[source]

A pass-through table transformer that does nothing. Note that the transform and inverse_transform methods will simply return the data that is passed in, rather than transforming to and from a list of tuples. This transformer is suitable when you know that your input data is exactly what is needed for a specific synthesizer, and you want to skip all pre-processing steps. If you want a passthrough transformer that is slightly more adaptable to multiple synthesizers, you can make a new TableTransformer with a list of IdentityTransformer column transformers.

Column Transformers Reference

LabelTransformer

class snsynth.transform.label.LabelTransformer(nullable=True)[source]

Transforms categorical values into integer-indexed labels. Labels will be sorted if possible, so that the output can be used as an ordinal. The indices will be 0-based.

Parameters

nullable – If null values are expected, a second output will be generated indicating null.
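The label-encoding behavior can be sketched in plain Python. This is a conceptual illustration only, not the library's implementation (the library also handles nulls and non-comparable types):

```python
def label_fit(values):
    """Build a sorted, 0-based category-to-index mapping."""
    categories = sorted(set(values))
    return {cat: idx for idx, cat in enumerate(categories)}

def label_transform(values, mapping):
    """Replace each category with its integer label."""
    return [mapping[v] for v in values]

def label_inverse(labels, mapping):
    """Recover the original categories from integer labels."""
    reverse = {idx: cat for cat, idx in mapping.items()}
    return [reverse[i] for i in labels]

sexes = ['M', 'F', 'F', 'M', 'F']
mapping = label_fit(sexes)                 # {'F': 0, 'M': 1}
encoded = label_transform(sexes, mapping)  # [1, 0, 0, 1, 0]
assert label_inverse(encoded, mapping) == sexes
```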

OneHotEncoder

class snsynth.transform.onehot.OneHotEncoder[source]

Transforms integer-labeled data into one-hot encoding. Inputs are assumed to be 0-based. To convert from unstructured categorical data, chain with LabelTransformer first.
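The expansion from 0-based labels to one-hot vectors can be sketched as follows (a minimal illustration, not the library's implementation):

```python
def one_hot(labels, n_categories):
    """Expand 0-based integer labels into one-hot vectors."""
    return [[1 if i == label else 0 for i in range(n_categories)]
            for label in labels]

def one_hot_inverse(vectors):
    """Recover the integer label from each one-hot vector."""
    return [vec.index(1) for vec in vectors]

labels = [2, 0, 1]
encoded = one_hot(labels, 3)
assert encoded == [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
assert one_hot_inverse(encoded) == labels
```

This is why the one-hot style produced 26 output columns from 4 categorical input columns in the earlier example: each column expands to one output per category.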

MinMaxTransformer

class snsynth.transform.minmax.MinMaxTransformer(*, lower=None, upper=None, negative=True, epsilon=0.0, nullable=False, odometer=None)[source]

Transforms a column of values to scale between -1.0 and +1.0.

Parameters
  • lower – The lower bound of the column's values; values are assumed to lie in [lower, upper].

  • upper – The upper bound of the column's values.

  • negative – If True, scale between -1.0 and 1.0. Otherwise, scale between 0.0 and 1.0.

  • epsilon – The privacy budget to use to infer bounds, if none provided.

  • nullable – If null values are expected, a second output will be generated indicating null.

  • odometer – The optional odometer to use to track privacy budget.
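The scaling can be sketched in plain Python (a conceptual illustration only; the library also handles nulls and DP bounds inference):

```python
def minmax_transform(x, lower, upper, negative=True):
    """Scale x from [lower, upper] into [-1, 1] (or [0, 1] if negative=False)."""
    scaled = (x - lower) / (upper - lower)
    return scaled * 2.0 - 1.0 if negative else scaled

def minmax_inverse(y, lower, upper, negative=True):
    """Map a scaled value back to the original range."""
    scaled = (y + 1.0) / 2.0 if negative else y
    return scaled * (upper - lower) + lower

assert minmax_transform(18, 18, 70) == -1.0
assert minmax_transform(70, 18, 70) == 1.0
assert minmax_inverse(minmax_transform(44, 18, 70), 18, 70) == 44.0
```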

StandardScaler

class snsynth.transform.standard.StandardScaler(*, lower=None, upper=None, epsilon=0.0, nullable=False, odometer=None)[source]

Transforms a column of values to scale with mean centered on 0 and unit variance. Some privacy budget is always used to estimate the mean and variance. If upper and lower are not supplied, the budget will also be used to estimate the bounds of the column.

Parameters
  • lower – The lower bound of the column's values.

  • upper – The upper bound of the column's values.

  • epsilon – The privacy budget to use.

  • nullable – Whether the column can contain null values. If True, the output will be a tuple of (value, null_flag).

  • odometer – The optional privacy odometer to use to track privacy budget spent.
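The non-private core of this transform can be sketched as below. Note that this sketch omits the differentially private noise the library always adds when estimating the mean and variance:

```python
def standard_fit(values):
    """Estimate mean and standard deviation (the library adds DP noise; omitted here)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var ** 0.5

def standard_transform(values, mean, std):
    """Center on zero and scale to unit variance."""
    return [(v - mean) / std for v in values]

ages = [20.0, 30.0, 40.0]
mean, std = standard_fit(ages)
scaled = standard_transform(ages, mean, std)
assert abs(sum(scaled)) < 1e-9  # centered on zero
```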

BinTransformer

class snsynth.transform.bin.BinTransformer(*, bins=10, lower=None, upper=None, epsilon=0.0, nullable=False, odometer=None)[source]

Transforms continuous values into a discrete set of bins.

Parameters
  • bins – The number of bins to create.

  • lower – The lower bound of the values to bin.

  • upper – The upper bound of the values to bin.

  • epsilon – The privacy budget to use to infer bounds, if none provided.

  • nullable – If null values are expected, a second output will be generated indicating null.

  • odometer – The optional odometer to use to track privacy budget.
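Equal-width binning can be sketched as follows. This is a conceptual illustration only; in particular, the library's inverse transform may decode bins differently than the midpoint decoding shown here:

```python
def bin_transform(values, bins, lower, upper):
    """Assign each value to one of `bins` equal-width bins over [lower, upper]."""
    width = (upper - lower) / bins
    return [min(int((v - lower) / width), bins - 1) for v in values]

def bin_inverse(labels, bins, lower, upper):
    """Decode each bin label to the midpoint of its bin."""
    width = (upper - lower) / bins
    return [lower + (label + 0.5) * width for label in labels]

incomes = [0, 50_000, 419_999]
assert bin_transform(incomes, 10, 0, 420_000) == [0, 1, 9]
assert bin_inverse([0], 10, 0, 420_000) == [21_000.0]
```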

LogTransformer

class snsynth.transform.log.LogTransformer[source]

Logarithmic transformation of values. Useful for transforming skewed data.
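The round-trip behavior can be sketched in plain Python (a conceptual illustration, not the library's implementation):

```python
import math

def log_transform(values):
    """Compress skewed, positive values; often chained with a scaler afterwards."""
    return [math.log(v) for v in values]

def log_inverse(values):
    """Undo the log transform by exponentiating."""
    return [math.exp(v) for v in values]

incomes = [1.0, 1_000.0, 420_000.0]
roundtrip = log_inverse(log_transform(incomes))
assert all(abs(a - b) < 1e-6 for a, b in zip(incomes, roundtrip))
```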

ChainTransformer

class snsynth.transform.chain.ChainTransformer(transformers)[source]

Sequentially process a column through multiple transforms. When reversed, the inverse transforms are applied in reverse order.

Parameters

transformers – A list of ColumnTransformers to apply sequentially.
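The forward-then-reversed-inverse ordering can be sketched with plain functions standing in for column transformers (a conceptual illustration, not the library's implementation):

```python
import math

def chain_transform(value, transforms):
    """Apply each (forward, inverse) pair's forward step, in order."""
    for forward, _ in transforms:
        value = forward(value)
    return value

def chain_inverse(value, transforms):
    """Apply the inverse steps in reverse order to undo the chain."""
    for _, inverse in reversed(transforms):
        value = inverse(value)
    return value

# hypothetical chain: log-transform first, then shift
transforms = [
    (math.log, math.exp),
    (lambda x: x + 1.0, lambda x: x - 1.0),
]
encoded = chain_transform(420_000.0, transforms)
assert abs(chain_inverse(encoded, transforms) - 420_000.0) < 1e-6
```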

ClampTransformer

class snsynth.transform.clamp.ClampTransformer(upper=None, lower=None)[source]

Clamps values to be within a specified range.

Parameters
  • lower – Values below this bound are clamped up to lower. If None, no lower bound is applied.

  • upper – Values above this bound are clamped down to upper. If None, no upper bound is applied.
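The clamping behavior can be sketched per value (a conceptual illustration, not the library's implementation):

```python
def clamp(value, lower=None, upper=None):
    """Clamp a value into [lower, upper]; None disables that bound."""
    if lower is not None and value < lower:
        return lower
    if upper is not None and value > upper:
        return upper
    return value

assert clamp(15, lower=18, upper=70) == 18
assert clamp(99, lower=18, upper=70) == 70
assert clamp(44, lower=18, upper=70) == 44
assert clamp(-5, upper=70) == -5  # no lower bound applied
```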