SQL API Reference#
- connect.from_connection(*ignore, privacy, metadata, engine=None, **kwargs)#
Open a private SQL connection against an established database connection.
from snsql import from_connection, Privacy conn = pyodbc.connect(dsn) metadata = 'datasets/PUMS.yaml' privacy = Privacy(epsilon=0.1, delta=1/10000) reader = from_connection(conn, metadata=metadata, privacy=privacy) result = reader.execute('SELECT educ, COUNT(*) AS n FROM PUMS.PUMS GROUP BY educ')
- Parameters
conn – An established database connection. Can be pyodbc, psycopg2, spark, pandas, or presto.
privacy – A Privacy object with the desired privacy parameters
metadata – The metadata describing the data source. This will typically be a path to a yaml metadata file, but can also be a dictionary.
engine – Specifies the engine to use. Can be ‘sqlserver’, ‘postgres’, ‘spark’, ‘pandas’, or ‘presto’. If not supplied, from_connection will probe the supplied connection to decide what dialect to use.
- Returns
A PrivateReader that can be used to execute differentially private queries against the database.
- connect.from_df(*ignore, privacy, metadata, **kwargs)#
Open a private SQL connection against a Pandas DataFrame.
from snsql import from_df, Privacy csv = 'datasets/PUMS.csv' pums = pd.read_csv(csv) metadata = 'datasets/PUMS.yaml' privacy = Privacy(epsilon=0.1, delta=1/10000) reader = from_df(pums, metadata=metadata, privacy=privacy) result = reader.execute('SELECT educ, COUNT(*) AS n FROM PUMS.PUMS GROUP BY educ')
- Parameters
df – The Pandas DataFrame to be queried
privacy – A Privacy object with the desired privacy parameters
metadata – The metadata describing the data source. This will typically be a path to a yaml metadata file, but can also be a dictionary.
- Returns
A PrivateReader that can be used to execute differentially private queries against the pandas dataframe.
PrivateReader#
- class snsql.sql.private_reader.PrivateReader(reader, metadata, privacy=None)[source]#
Bases:
snsql.reader.base.Reader
Executes SQL queries against tabular data sources and returns differentially private results.
PrivateReader should be created using the from_connection method.
- property engine: str#
The engine being used by this private reader.
df = pd.read_csv('datasets/PUMS.csv') reader = from_connection(df, metadata=metadata, privacy=privacy) assert(reader.engine == 'pandas')
- execute(query_string, accuracy=False, *ignore, pre_aggregated=None, postprocess=True)[source]#
Executes a query and returns a recordset that is differentially private.
Follows ODBC and DB_API convention of consuming query as a string and returning recordset as tuples. This is useful for cases where existing DB_API clients want to swap out API calls with minimal changes.
- Parameters
query_string – A query string in SQL syntax
pre_aggregated – By default, execute will use the underlying database engine to compute exact aggregates. To use exact aggregates from a different source, pass in the exact aggregates here as an iterable of tuples.
postprocess (bool) – If False, the intermediate result, immediately after adding noise and censoring dimensions, will be returned. All post-processing that does not impact privacy, such as clamping negative counts, LIMIT, HAVING, and ORDER BY, will be skipped.
accuracy (bool) –
- Returns
A recordset structured as an array of tuples, where each tuple represents a row, and each item in the tuple is typed. The first row will contain column names.
result = reader.execute('SELECT sex, AVG(age) AS age FROM PUMS.PUMS GROUP BY sex')
- execute_with_accuracy(query_string)[source]#
Executes a private SQL query, returning accuracy bounds for each column and row. This should only be used if you need analytic bounds for statistics where the bounds change based on partition size, such as AVG and VARIANCE. In cases where simple statistics such as COUNT and SUM are used,
get_simple_accuracy
is recommended. The analytic bounds for AVG and VARIANCE can be quite wide, so it’s better to determine accuracy through simulation, whenever that’s an option.Executes query and advances privacy odometer. Returns accuracies for multiple alphas, using
alphas
property on thePrivacy
object that was passed in when the reader was instantiated.Note that the tuple format of
execute_with_accuracy
is not interchangeable withexecute
, because the accuracy tuples need to be nested in the output rows to allow streamed processing.- Parameters
query_string (str) – The query to execute.
- Returns
A tuple with a dataframe showing row results, and a nested tuple with a dataframe for each set of accuracies. The accuracy dataframes will have the same number of rows and columns as the result dataframe.
# alphas for 95% and 99% intervals privacy = Privacy(epsilon=0.1, delta=1/1000, alphas=[0.05, 0.01]) reader = from_connection(db, metadata=metadata, privacy=privacy) query = 'SELECT educ, AVG(age) AS age FROM PUMS.PUMS GROUP BY educ' res = reader.execute_with_accuracy(query) age_col = 2 for row, accuracies in res: acc95, acc99 = accuracies print(f'Noisy average is {row[age_col]} with 95% +/- {acc95[age_col]} and 99% +/- {acc99[age_col]}')
- execute_with_accuracy_df(query_string, *ignore)[source]#
Executes a private SQL query, returning accuracy bounds for each column and row. This should only be used if you need analytic bounds for statistics where the bounds change based on partition size, such as AVG and VARIANCE. In cases where simple statistics such as COUNT and SUM are used,
get_simple_accuracy
is recommended. The analytic bounds for AVG and VARIANCE can be quite wide, so it’s better to determine accuracy through simulation, whenever that’s an option.Executes query and advances privacy odometer. Returns accuracies for multiple alphas, using
alphas
property on thePrivacy
object that was passed in when the reader was instantiated.Note that the tuple format of
execute_with_accuracy_df
is not interchangeable withexecute
, because the accuracy tuples need to be nested in the output rows to allow streamed processing.- Parameters
query_string (str) – The query to execute.
- Returns
A list of tuples, with each item in the list representing a row. each row has a tuple of the result values, and a nested tuple with each of the column accuracies for that row, for each alpha.
# alphas for 95% and 99% intervals privacy = Privacy(epsilon=0.1, delta=1/1000, alphas=[0.05, 0.01]) reader = from_connection(db, metadata=metadata, privacy=privacy) query = 'SELECT educ, AVG(age) AS age FROM PUMS.PUMS GROUP BY educ' res (acc95, acc99) = reader.execute_with_accuracy_df(query) print(res) print(acc95) print(acc99)
- classmethod from_connection(conn, *ignore, privacy, metadata, engine=None, **kwargs)[source]#
Create a private reader over an established SQL connection. If engine is not passed in, the engine will be automatically detected.
- Parameters
conn – An established database connection. Can be pyodbc, psycopg2, SparkSession, Pandas DataFrame, or Presto.
privacy – A Privacy object with epsilon, delta, and other privacy properties. Keyword-only.
metadata – The metadata describing the database. Metadata documentation is here. Keyword-only.
engine – Optional keyword-only argument that can be used to specify engine-specific rules if automatic detection fails. This should only be necessary when using an uncommon database or middleware.
- Returns
A PrivateReader object initialized to process queries against the supplied connection, using the supplied Privacy properties.
privacy = Privacy(epsilon=1.0, delta=1/1000) metadata = 'datasets/PUMS.yaml' pums = pd.read_csv('datasets/PUMS.csv') reader = PrivateReader.from_connection(pums, privacy=privacy, metadata=metadata)
- get_privacy_cost(query_strings)[source]#
Estimates the epsilon and delta cost for running the given query. Privacy cost is returned without running the query or incrementing the odometer.
- Parameters
query_string – The query string or strings to analyze
query_strings (Union[str, List[str]]) –
- Returns
A tuple of (epsilon, delta) estimating total privacy cost for running this query or queries.
# metadata specifies censor_dims: False privacy = Privacy(epsilon=0.1, delta=1/1000) reader = from_df(df, metadata=metadata, privacy=privacy) query = 'SELECT AVG(age) FROM PUMS.PUMS GROUP BY educ' eps_cost, delta_cost = reader.get_privacy_cost(query) # will be ~0.2 epsilon, since AVG computed from SUM and COUNT print(f'Total epsilon spent will be {eps_cost}') query = 'SELECT SUM(age), COUNT(age), AVG(age) FROM PUMS.PUMS GROUP BY educ' eps_cost, delta_cost = reader.get_privacy_cost(query) # will be ~0.2 epsilon, since noisy SUM and COUNT are re-used print(f'Total epsilon spent will be {eps_cost}') query = 'SELECT COUNT(*), AVG(age) FROM PUMS.PUMS GROUP BY educ' eps_cost, delta_cost = reader.get_privacy_cost(query) # will be ~0.3 epsilon, since COUNT(*) and COUNT(age) can be different print(f'Total epsilon spent will be {eps_cost}')
- get_simple_accuracy(query_string, alpha)[source]#
Return accuracy for each alpha and each mechanism in column order. Columns with no mechanism application return None. Returns accuracy without running the query.
- Parameters
query_string (str) – The SQL query
alpha (float) – The desired accuracy alpha. For example, alpha of 0.05 will return a 95% interval.
reader = from_df(df, metadata=metadata, privacy=privacy) query = 'SELECT COUNT(*) AS n, SUM(age) AS age FROM PUMS.PUMS GROUP BY income' accuracy = reader.get_simple_accuracy(query, 0.05) print(f'For 95% of query executions, n will be within +/- {accuracy[0]} of true value') print(f'For 95% of query executions, age will be within +/- {accuracy[1]} of true value')
- parse_query_string(query_string)[source]#
Parse a query string, returning an AST Query object.
reader = from_connection(db, metadata=metadata, privacy=privacy) query_string = 'SELECT STDDEV(age) AS age FROM PUMS.PUMS' query = reader.parse_query_string(query_string) age_node = query.xpath_first("//NamedExpression[@name='age']") dot = age_node.visualize() # visualize the formula in the AST dot.render('age', view=True, cleanup=True)
- Return type
snsql._ast.ast.Query
Privacy#
- class snsql.sql.privacy.Privacy(*ignore, epsilon=1.0, delta=1e-15, alphas=[], mechanisms=None)[source]#
Bases:
object
Privacy parameters. The Privacy object is passed in when creating any private SQL connection, and applies to all queries executed against that connection.
- Parameters
epsilon (float) – The epsilon value for each statistic returned by the private SQL connection.
delta (float) – The delta value for each query processed by the private SQL connection. Most counts and sums will use delta of 0, but dimension censoring and Gaussian mechanism require delta. Set delta to something like 1/n*sqrt(n), where n is the approximate number of rows in the data source.
alphas (List[float]) – A list of floats representing desired accuracy bounds. Only set this parameter if you plan to use execute_with_accuracy for row-based accuracy. For simple column accuracy bounds, you can pass an alpha directly to get_simple_accuracy, which ignores these alphas.
mechanisms (snsql.sql.privacy.Mechanisms) – A property bag specifying which mechanisms to use for which types of statistics. You will only set this parameter if you want to override default mechanism mapping.