Private Aggregate Seeded from PAC-Synth

A differentially private synthesizer that builds synthetic data from differentially private marginals. It aggregates n-way marginals up to and including a specified reporting length, then synthesizes data from the aggregated counts.

Based on the Synthetic Data Showcase project. DP documentation available here.

import pandas as pd
from snsynth import Synthesizer

pums = pd.read_csv("PUMS.csv")

synth = Synthesizer.create("pacsynth", epsilon=3.0, verbose=True)
synth.fit(pums, preprocessor_eps=1.0)
pums_synth = synth.sample(1000)

The pac-synth synthesizer will suppress marginal combinations that could uniquely fingerprint individuals. Unlike the other synthesizers, however, this synthesizer attempts to minimize the number of spurious dimension combinations that are generated. This may be desirable in some settings, where the goal is to generate synthetic data with dimensions that are as similar as possible to the original data. To achieve this dimensional fidelity, the pac-synth synthesizer will sometimes generate rows with missing values.

from snsynth.aggregate_seeded import AggregateSeededSynthesizer

# gen_data_frame is not imported here; it is assumed to be a user-supplied helper
# that generates a random pandas data frame with 5000 records
# replace this with your own data
sensitive_df = gen_data_frame(5000)

# build synthesizer
synth = AggregateSeededSynthesizer(epsilon=0.5)
synth.fit(sensitive_df)

# sample 5000 records and build a data frame
synthetic_df = synth.sample(5000)

# show 10 example records
print(synthetic_df.sample(10))

# this will output
#      H1 H2  H3 H4 H5 H6 H7 H8 H9 H10
# 2791  2         1  0  1  1  1  0   1
# 2169  1  3   4  1  0  1  0  1  1   0
# 4607     4   7  1  1  0  1  1  1   0
# 4803  1      8  0  0  0  1  1  1   1
# 2635         8  0  1  1  1  0  1   0
# 537   1         1  1  1  1  1  0   0
# 3495     6   7  0  0  1  0  0  1   0
# 2009  1  3   3  0  0  1  0  1  1   0
# 3214  1  5      1  1  1  1  1  0   1
# 4879  2  5  10  0  1  1  1  1  1   1

For more, see the samples notebook.
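
The blank cells in the sample output above are the rows with missing values that the synthesizer may generate. A minimal sketch of how such output might be inspected with pandas follows. The small data frame is a hypothetical stand-in for `synth.sample(...)` output, and the placeholder is assumed to be the empty string, matching the `empty_value=''` default in the class signature.

```python
import pandas as pd

# Hypothetical stand-in for synthesizer output; the values are illustrative only.
synthetic_df = pd.DataFrame(
    {
        "H1": ["2", "1", "", "1"],
        "H2": ["", "3", "4", ""],
        "H3": ["", "4", "7", "8"],
    }
)

EMPTY_VALUE = ""  # assumed to match the synthesizer's empty_value setting

# Turn placeholder cells into real missing values (NaN).
cleaned = synthetic_df.mask(synthetic_df == EMPTY_VALUE)

# Share of missing cells per column.
missing_share = cleaned.isna().mean()
print(missing_share)

# Keep only fully populated rows, if a downstream analysis requires them.
complete_rows = cleaned.dropna()
print(len(complete_rows))
```

Dropping incomplete rows discards information the synthesizer deliberately kept, so whether to filter or to analyze the partial rows depends on the downstream use.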

Parameters

class snsynth.aggregate_seeded.AggregateSeededSynthesizer(reporting_length=3, epsilon=4.0, delta=None, percentile_percentage=99, percentile_epsilon_proportion=0.01, accuracy_mode=<builtins.AccuracyMode object>, number_of_records_epsilon_proportion=0.005, fabrication_mode=<builtins.FabricationMode object>, empty_value='', use_synthetic_counts=False, weight_selection_percentile=95, aggregate_counts_scale_factor=None, verbose=False)

SmartNoise wrapper class for the Private Aggregate Seeded Synthesizer from pac-synth. Works with pandas data frames and raw data, and follows the conventions set by other SmartNoise synthesizers.

Parameters

reporting_length (int) – The maximum length of the combinations to be synthesized. For example, if reporting length is 2, the synthesizer will compute DP marginals for all one-column and two-column combinations in the dataset.
epsilon (float) – The privacy budget to be used for the synthesizer.
delta – The delta value to be used for the synthesizer. If set, should be small, in the range of 1/(n * sqrt(n)), where n is the approximate number of records in the dataset.
percentile_percentage (int) – Because the synthesizer computes multiple n-way marginals, each individual may affect multiple marginals. The percentile_percentage can remove the influence of outliers to reduce sensitivity and improve the accuracy of the synthesizer. For example, if percentile_percentage is 99, the synthesizer will use a sensitivity that can accommodate 99% of the individuals, and will ensure that the records of the outlier 1% are sampled to conform to this sensitivity.
percentile_epsilon_proportion (float) – The proportion of the epsilon budget to be used to estimate the percentile sensitivity.
verbose (bool) – Show diagnostic information about the synthesizer’s progress.
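
To give a feel for the numbers behind these parameters, the following sketch works through three of them in plain Python. It does not call pac-synth. The 10 column names and the record count of 5000 are borrowed from the example above, and treating the percentile budget as a simple product `epsilon * percentile_epsilon_proportion` is an assumption based on the parameter description, not a statement about the library's internal accounting.

```python
from itertools import combinations
from math import sqrt

# Illustrative inputs: 10 columns and 5000 records, as in the example above.
columns = [f"H{i}" for i in range(1, 11)]
n = 5000

# Defaults taken from the class signature.
reporting_length = 3
epsilon = 4.0
percentile_epsilon_proportion = 0.01

# Column combinations of every length from 1 up to reporting_length.
# Each combination corresponds to one marginal table over its value combinations.
combos = [
    combo
    for k in range(1, reporting_length + 1)
    for combo in combinations(columns, k)
]
print(len(combos))  # 10 + 45 + 120 = 175

# Guideline scale for delta: 1 / (n * sqrt(n)).
delta_guideline = 1 / (n * sqrt(n))
print(f"{delta_guideline:.2e}")  # 2.83e-06

# Assumed share of epsilon spent on estimating the percentile sensitivity.
percentile_epsilon = epsilon * percentile_epsilon_proportion
print(round(percentile_epsilon, 4))  # 0.04
```

The combination count grows quickly with `reporting_length`, and each individual can contribute to many of these marginals, which is why the percentile-based sensitivity bound described above matters.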

See the pac-synth documentation for more details about these and other hyperparameters.
Reuses and lightly modifies code from pac-synth.