SmartNoise Synthesizers

Getting Started

Create a synthesizer with the Synthesizer.create() method, passing in the name of the synthesizer you want to create, along with a privacy budget and any synthesizer-specific hyperparameters. To see a list of available synthesizers, use the Synthesizer.list_synthesizers() method or read the Synthesizers Reference.

Each synthesizer has a fit() method that fits the synthesizer to a private data set, and a sample() method that generates synthetic data from the fitted synthesizer. Each synthesizer also has a fit_sample() helper method that combines fit() and sample() into a single call. Use the sample_conditional() method to generate samples that satisfy given conditions. It performs rejection sampling, so you can run analytics on a subset without first generating and retaining a full synthetic data set.

from snsynth import Synthesizer
import pandas as pd

pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/
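# Drop the continuous columns, leaving only the categorical ones for mwem
pums = pums.drop(['income', 'age'], axis=1)

synth = Synthesizer.create('mwem', epsilon=1.0)
sample = synth.fit_sample(pums)
print(sample)

# The condition can only reference columns the synthesizer was fitted on;
# income and age were dropped above, so filter on educ and race instead
sample_conditional = synth.sample_conditional(100, "educ > 9 AND race < 3")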
print(sample_conditional)
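
The rejection sampling mentioned above can be illustrated with a minimal, library-independent sketch. It is not the snsynth implementation: draw_row is a hypothetical stand-in for a fitted synthesizer's sampler, and the max_tries cap is an assumption added so the loop always terminates.

```python
import random

def sample_conditional_sketch(draw_row, n_rows, predicate, max_tries=10_000):
    """Draw rows one at a time and keep only those that satisfy predicate.

    Stops when n_rows rows have been accepted or max_tries draws were made,
    so fewer than n_rows rows may be returned for a very selective condition.
    """
    accepted = []
    tries = 0
    while len(accepted) < n_rows and tries < max_tries:
        row = draw_row()
        tries += 1
        if predicate(row):
            accepted.append(row)
    return accepted

# Hypothetical toy sampler standing in for a fitted synthesizer
rng = random.Random(0)

def draw_row():
    return {"educ": rng.randint(1, 16), "race": rng.randint(1, 6)}

rows = sample_conditional_sketch(
    draw_row, 100, lambda r: r["educ"] > 9 and r["race"] < 3
)
print(len(rows))
```

Rows that fail the condition are discarded rather than stored, which is why a selective condition costs more draws but no extra retained data.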

Preprocessing Privacy Budget

The synthesizer will attempt to automatically preprocess the data into a format suitable for that synthesizer. For example, the mwem synthesizer requires that categorical variables be encoded as integers, and the dpctgan synthesizer requires categories to be one-hot encoded. In some cases, preprocessing will consume some privacy budget. For example, binning or scaling continuous columns requires bounds, and approximate bounds will be computed with some privacy cost if no bounds are provided by the analyst. To specify the amount of budget to be used for preprocessing, pass a preprocessor_eps argument to the fit() or fit_sample() method.

from snsynth import Synthesizer
import pandas as pd

pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/
synth = Synthesizer.create('dpctgan', epsilon=1.0, verbose=True)
sample = synth.fit_sample(pums, preprocessor_eps=0.5)
print(sample)

In the above, the dpctgan synthesizer will use an epsilon of 0.5 for preprocessing and the remaining 0.5 for synthesizing the data. We also set verbose=True, so we can see whether the preprocessor budget was actually spent. If, for example, the synthesizer decided that all columns in the data were categorical, then no budget would be spent on preprocessing, and the full epsilon of 1.0 would be available to fit the synthesizer.
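
The budget arithmetic described here can be written out as a small sketch. The function name split_budget and the preprocessing_needed flag are hypothetical and only mirror the behaviour described in the text; they are not part of the snsynth API.

```python
def split_budget(epsilon, preprocessor_eps, preprocessing_needed):
    """Return (epsilon spent on preprocessing, epsilon left for fitting)."""
    if not 0.0 <= preprocessor_eps <= epsilon:
        raise ValueError("preprocessor_eps must lie between 0 and epsilon")
    # Budget is only consumed if the preprocessor actually has to touch
    # the private data, for example to estimate approximate bounds.
    spent = preprocessor_eps if preprocessing_needed else 0.0
    return spent, epsilon - spent

print(split_budget(1.0, 0.5, True))   # (0.5, 0.5)
print(split_budget(1.0, 0.5, False))  # (0.0, 1.0)
```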

Preprocessor Hints

When inferring a preprocessor, the synthesizer first attempts to determine whether each column is categorical, ordinal, or continuous. To skip this step and tell the synthesizer how to treat each column, you can pass in the categorical_columns, ordinal_columns, and continuous_columns arguments to the fit() method. Additionally, if you know that columns can have missing values, you can specify nullable=True.

Note

Before using the MST synthesizer, please install mbi by running pip install git+https://github.com/ryan112358/private-pgm.git.

from snsynth import Synthesizer
import pandas as pd

pums = pd.read_csv('PUMS_null.csv', index_col=None) # in datasets/
pums.drop(['pid'], axis=1, inplace=True)
# Every column except income and age is treated as categorical
categorical_columns = [c for c in pums.columns if c not in ('income', 'age')]
synth = Synthesizer.create('mst', epsilon=1.0, verbose=True)
sample = synth.fit_sample(
  pums,
  categorical_columns=categorical_columns,
  continuous_columns=['income', 'age'],
  preprocessor_eps=0.5,
  nullable=True
)
print(sample)

In the above, we tell the synthesizer that all columns are categorical, except for income and age, which are continuous. We also tell the synthesizer that the data may contain missing values, so the synthesizer will use a special preprocessor that can handle missing values.

Data Transforms

Even with preprocessing hints, the preprocessor inferred by the synthesizer may not be exactly what you want. For example, the mwem synthesizer will automatically bin continuous columns into 10 bins. Spending epsilon to infer bounds is also wasteful and reduces accuracy when you already have public bounds for continuous columns. In most cases, you will get the best performance by manually specifying the preprocessor you want to use. Preprocessing is done by a TableTransformer object, which implements a differentially private reversible data transform.
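
To make the point about public bounds concrete, here is a minimal sketch of binning a continuous column into 10 bins when the bounds are already known. Equal-width bins and the helper names bin_index and bin_midpoint are assumptions for illustration; the library's own binning may differ. Because the bounds are fixed in advance rather than derived from the private data, this step needs no epsilon.

```python
def bin_index(value, lower, upper, n_bins=10):
    """Map a value to an equal-width bin index in [0, n_bins - 1]."""
    clamped = min(max(value, lower), upper)  # out-of-range values are clamped
    width = (upper - lower) / n_bins
    # The upper bound itself falls into the last bin
    return min(int((clamped - lower) / width), n_bins - 1)

def bin_midpoint(index, lower, upper, n_bins=10):
    """Approximate inverse: return the midpoint of the given bin."""
    width = (upper - lower) / n_bins
    return lower + (index + 0.5) * width

ages = [17, 18, 23.5, 45, 70, 95]
indices = [bin_index(a, 18, 70) for a in ages]
print(indices)  # [0, 0, 1, 5, 9, 9]
```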

In this example, we provide fixed bounds for age and income, and log-transform income before scaling. We use the one-hot encoding style, because we will be using this transformer in a GAN:

import numpy as np
import pandas as pd
from snsynth import Synthesizer
from snsynth.transform import *

pums = pd.read_csv('PUMS.csv', index_col=None) # in datasets/

tt = TableTransformer([
    MinMaxTransformer(lower=18, upper=70), # age
    ChainTransformer([LabelTransformer(), OneHotEncoder()]), # sex
    ChainTransformer([LabelTransformer(), OneHotEncoder()]), # educ
    ChainTransformer([LabelTransformer(), OneHotEncoder()]), # race
    ChainTransformer([
        ClampTransformer(lower=1),
        LogTransformer(),
        MinMaxTransformer(lower=0, upper=np.log(420000)) # income
    ]),
    ChainTransformer([LabelTransformer(), OneHotEncoder()]), # married
])

synth = Synthesizer.create('dpctgan', epsilon=1.0, verbose=True)
sample = synth.fit_sample(pums, transformer=tt, preprocessor_eps=0.0)

assert (synth.odometer.spent == (0.0, 0.0))
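
As a rough illustration of what the income chain does and why it is reversible, the following standalone sketch applies the same three steps (clamp, log, min-max scale) with plain Python. The [0, 1] output range and the forward and inverse helper names are assumptions; the library's MinMaxTransformer may scale to a different range.

```python
import math

LOWER_CLAMP = 1.0                # matches ClampTransformer(lower=1)
LOG_UPPER = math.log(420000)     # matches the upper bound after the log step

def forward(income):
    """Clamp, log-transform, then scale to [0, 1] (assumed output range)."""
    clamped = max(income, LOWER_CLAMP)   # avoids log(0) and negative inputs
    logged = math.log(clamped)
    return min(max(logged / LOG_UPPER, 0.0), 1.0)

def inverse(scaled):
    """Undo the scaling and the log; clamped inputs map back to the bound."""
    return math.exp(scaled * LOG_UPPER)

print(forward(0))                  # 0.0, because 0 is clamped to 1 and log(1) = 0
print(inverse(forward(50000)))     # approximately 50000
```

Values inside the bounds round-trip up to floating-point error, while values that were clamped come back as the bound itself. Since every bound is supplied by the analyst, no privacy budget is needed, consistent with the assertion above.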

For more information on the different transforms, see the Data Transformers Reference.

Synthesizers Reference

Data Transformers Reference

This is version 1.0.2 of the guides, last built on Nov 03, 2023.