Quailified Architecture to Improve Labeling (QUAIL)#

The QUAIL synthesizer combines a differentially private classifier with a differentially private synthesizer to produce synthetic data that can perform well on both classification and analytics tasks. It first fits the differentially private classifier on the private data, producing a model that predicts the label from the other columns. It then fits the differentially private synthesizer to learn the distribution of the feature columns from the private data. Synthetic data is generated by sampling feature rows from the fitted synthesizer and labeling them with the previously learned classifier. With this hybrid approach, the analyst can control how much privacy budget to spend on classification versus learning the feature distribution.
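The flow described above can be sketched in plain Python. This is a minimal illustration, not the snsynth implementation: it assumes a single categorical feature with a publicly known domain, and it swaps in toy Laplace-noised histograms for both the classifier and the synthesizer. The names `laplace`, `noisy_counts`, and `quail_sketch` are hypothetical, and the sketch leaves out any train/test split.

```python
import random
from collections import Counter


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is a Laplace(0, scale) sample.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(keys, domain, epsilon, rng):
    # Laplace-noised histogram over a fixed, public domain.
    # Each row falls in exactly one cell, so the noise scale is 1 / epsilon.
    # Clipping at zero is post-processing and spends no extra budget.
    counts = Counter(keys)
    return {k: max(0.0, counts[k] + laplace(1.0 / epsilon, rng)) for k in domain}


def quail_sketch(rows, feature_domain, labels, epsilon, eps_split, n_samples, seed=None):
    """Toy QUAIL-style pipeline over rows of (feature, label) pairs."""
    rng = random.Random(seed)

    # Split the total budget between the two components.
    eps_classifier = epsilon * eps_split
    eps_synthesizer = epsilon * (1.0 - eps_split)

    # 1. Toy DP classifier: noisy (feature, label) counts,
    #    predicting the label with the largest noisy count per feature value.
    pair_domain = [(x, y) for x in feature_domain for y in labels]
    pair_counts = noisy_counts(rows, pair_domain, eps_classifier, rng)
    predict = {
        x: max(labels, key=lambda y: pair_counts[(x, y)]) for x in feature_domain
    }

    # 2. Toy DP synthesizer: noisy histogram of the feature column only.
    feature_counts = noisy_counts(
        [x for x, _ in rows], feature_domain, eps_synthesizer, rng
    )
    weights = [feature_counts[x] for x in feature_domain]
    if sum(weights) <= 0:
        # Every noisy count was clipped to zero: fall back to uniform sampling.
        weights = [1.0] * len(feature_domain)

    # 3. Sample synthetic features, then label them with the classifier.
    sampled = rng.choices(feature_domain, weights=weights, k=n_samples)
    return [(x, predict[x]) for x in sampled]


# Illustrative data: feature "a" always has label 1, feature "b" always label 0.
rows = [("a", 1)] * 50 + [("b", 0)] * 50
synthetic = quail_sketch(
    rows, ["a", "b"], [0, 1], epsilon=1.0, eps_split=0.9, n_samples=5, seed=0
)
print(synthetic)
```

Because both toy components read the same private rows, their budgets add up to the total `epsilon` under sequential composition, which is why the split is expressed as two shares of a single budget.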

QUAIL is described in Differentially Private Synthetic Data: Applied Evaluations and Enhancements.

Parameters#

class snsynth.quail.QUAILSynthesizer(epsilon, dp_synthesizer, dp_classifier, target, test_size=0.2, seed=None, eps_split=0.9)

Quailified Architecture to Improve Labeling. For a known classification task, divide epsilon between a differentially private synthesizer and a differentially private classifier. Train the DP classifier on the real data, fit the DP synthesizer to the features (excluding the target label), and use the DP classifier to label synthetic feature rows sampled from the DP synthesizer. Produces complete synthetic data.

More information here: Differentially Private Synthetic Data: Applied Evaluations and Enhancements https://arxiv.org/abs/2011.05537

Parameters
  • epsilon (float) – Total epsilon used across the DP Synthesizer and DP Classifier

  • dp_synthesizer (function (epsilon) -> SDGYMBaseSynthesizer) – A function that returns an instance of a DP Synthesizer for a specified epsilon value

  • dp_classifier (function (epsilon) -> classifier) – A function that returns an instance of a DP Classifier for a specified epsilon value

  • target (str) – The column name of the target column

  • test_size (float, optional) – Fraction of the data that should be used for the test set, defaults to 0.2

  • seed (int, optional) – Seed for controlling randomness for testing, defaults to None

  • eps_split (float, optional) – Fraction of epsilon used for the classifier; the remaining 1 - eps_split is used for the synthesizer, defaults to 0.9
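
The `dp_synthesizer` and `dp_classifier` arguments are functions rather than already-constructed objects, so each component can be built with its own share of the budget. The sketch below illustrates that calling convention only. It does not import snsynth: `ToyClassifier`, `ToySynthesizer`, and `build_components` are hypothetical stand-ins, and the bounds check on `eps_split` is an assumption rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass
class ToyClassifier:
    # Hypothetical stand-in for a DP classifier; it only records its budget.
    epsilon: float


@dataclass
class ToySynthesizer:
    # Hypothetical stand-in for a DP synthesizer; it only records its budget.
    epsilon: float


def make_classifier(epsilon):
    # Factory in the shape `function (epsilon) -> classifier`.
    return ToyClassifier(epsilon=epsilon)


def make_synthesizer(epsilon):
    # Factory in the shape `function (epsilon) -> synthesizer`.
    return ToySynthesizer(epsilon=epsilon)


def build_components(epsilon, dp_synthesizer, dp_classifier, eps_split=0.9):
    """Call each factory with its share of the total budget."""
    # Assumption: the split lies strictly between 0 and 1, so that
    # both components receive a positive budget.
    if not 0.0 < eps_split < 1.0:
        raise ValueError("eps_split must be strictly between 0 and 1")
    classifier = dp_classifier(epsilon * eps_split)
    synthesizer = dp_synthesizer(epsilon * (1.0 - eps_split))
    return classifier, synthesizer


classifier, synthesizer = build_components(
    3.0, make_synthesizer, make_classifier, eps_split=0.9
)
print(classifier.epsilon, synthesizer.epsilon)
```

With a total epsilon of 3.0 and the default `eps_split` of 0.9, the classifier factory receives roughly 2.7 and the synthesizer factory roughly 0.3, up to floating-point rounding.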