Multiplicative Weights Exponential Mechanism (MWEM)#
Multiplicative Weights Exponential Mechanism. From “A Simple and Practical Algorithm for Differentially Private Data Release”.
import pandas as pd
from snsynth import Synthesizer
pums = pd.read_csv("PUMS.csv")
synth = Synthesizer.create("mwem", epsilon=3.0, verbose=True)
synth.fit(pums, preprocessor_eps=1.0)
pums_synth = synth.sample(1000)
MWEM maintains an in-memory copy of the full joint distribution, initialized to a uniform distribution, and updated with each iteration. The size of the joint distribution in memory is the product of the cardinalities of columns of the input data, which may be much larger than the number of rows. For example, in the code above, the dimensionality inferred will be about 300,000 cells, and training will take several minutes. In the PUMS dataset with income column dropped, the size of the in-memory histogram is 29,184 cells. The size of the histogram can explode rapidly with multiple columns with high cardinality. You can provide splits to divide the columns into independent subsets, which may dramatically reduce the memory requirement. In the code below, MWEM will split the data into multiple disjoint cubes with 3 columns each (per the split_factor
argument), and train a separate model for each cube. The size of the in-memory histogram will be lower than 3,000 cells, and training will be relatively fast.
import pandas as pd
from snsynth import Synthesizer
pums = pd.read_csv("PUMS.csv")
synth = Synthesizer.create("mwem", epsilon=3.0, split_factor=3, verbose=True)
synth.fit(pums, preprocessor_eps=1.0)
pums_synth = synth.sample(1000)
The MWEM algorithm operates by alternating between a step that selects a poorly-performing query via exponential mechanism, and then updating the estimated joint distribution via the laplace mechanism, to perform better on the selected query. Over multiple iterations, the estimated joint distribution will become better at answering the selected workload of queries.
Because of this, the performance of MWEM is highly dependent on the quality of the candidate query workload. The implementation tries to generate a query workload that will perform well. You can provide some hints to influence the candidate queries. By default, MWEM will generate workloads with all one-way and two-way marginals. If you want to ensure three-way or higher marginals are candidates, you can use marginal_width
. In cases where columns contain ordinal data, particularly in columns that are binned from continuous values, you can use add_ranges
to ensure that candidate queries include range queries. If range queries are more important that two-way marginals, for example, you can combine add_ranges
with a marginal_width=1
to suppress two-way marginals.
Each iteration spends a fraction of the budgeted epsilon. Very large numbers of iterations may divide the epsilon too small, resulting in large measurement error on each measurement. Conversely, using too few iterations will reduce error of individual measurements, but may interefere with the algorithm converging. For example, if your data set has 15 total one-way and two-way cuboids, and you use iterations=15, every cuboid will be measured with the maximum possible epsilon, but the fit may be poor.
By default, the implementation chooses a number of iterations based on the size of the data. The number of candidate queries generated will be based on the number of iterations, and will focus on marginals, unless add_ranges
is specified. You can also specify a q_count
to ensure that a certain number of candidate queries is always generated, regardless of the number of iterations. This can be useful when you care more about range queries than marginals, because range queries each touch a single slice, while marginal cuboids may update hundreds or thousands of slices.
If you are confident that the candidate query workload is high quality, you can use measure_only
to skip the query selection step, and just sample uniformly from the workload. This can double the budget available to measure the queries, but typically is not useful for anything but the smallest datasets.
Parameters#
- class snsynth.mwem.MWEMSynthesizer(epsilon=3.0, *ignore, q_count=None, iterations=None, splits=[], split_factor=None, marginal_width=None, add_ranges=False, measure_only=False, max_retries_exp_mechanism=10, mult_weights_iterations=20, verbose=False)[source]#
N-Dimensional numpy implementation of MWEM. (http://users.cms.caltech.edu/~katrina/papers/mwem-nips.pdf)
From the paper: “[MWEM is] a broadly applicable, simple, and easy-to-implement algorithm, capable of substantially improving the performance of linear queries on many realistic datasets… (circa 2012)…MWEM matches the best known and nearly optimal theoretical accuracy guarantees for differentially private data analysis with linear queries.”
Linear queries used for sampling in this implementation are random contiguous slices of the n-dimensional numpy array.
- Parameters
epsilon (float, optional) – Privacy budget.
q_count (int, optional) – Number of random queries in the pool to generate. Must be more than # of iterations
iterations (int, optional) – Number of iterations of MWEM to run. MWEM will guess a reasonable number of iterations if this is not specified.
splits (list, optional) – Allows you to specify feature dependence when creating internal histograms. Columns that are known to be dependent can be kept together. Example: splits=[[0,1],[2,3]] where columns 0 and 1 are dependent, columns 2 and 3 are dependent, and between groupings there is independence, defaults to []
split_factor (int, optional) – If splits not specified, can instead subdivide pseudo-randomly. For example, split_factor=3 will make groupings of features of size 3 for the histograms. Note: this will likely make synthetic data worse. defaults to None
marginal_width (int, optional) – MWEM by default will create cuboids to measure marginals, and will use a heuristic to determine the maximum width of the marginals. This parameter allows you to specify that MWEM should always measure marginals up to width marginal_width.
add_ranges – In addition to measuring cuboids, MWEM can measure randomly generated range queries. Range queries work well on columns that are binned from continuous values.
measure_only (bool, optional) – MWEM operates by spending some privacy budget to select the best query. This parameter allows you to specify that MWEM should uniformly measure all queries, and not spend any privacy budget on the query selection. This is useful in limited cases. Allowing MWEM to select will usually work best.
max_retries_exp_mechanism (int, optional) – In each iteration, MWEM tries to select a poorly-peforming query that hasn’t yet been measured. If it fails, it will select one of the remaining queries uniformly at random.
mult_weights_iterations (int, optional) – Number of iterations of multiplicative weights, per iteration of MWEM, defaults to 20
verbose (bool, optional) – Set to True to print debug information.