Metadata#
To ensure differential privacy over tabular data, the system needs some metadata with sensitivity, identifiers, and other information about the data source.
This metadata is loaded from a YAML file path during instantiation of a private SQL connections. The same metadata is used for all queries issued against that instantiated connection.
import pandas as pd
from snsql import *
pums = df.read_csv('PUMS.csv')
meta_path = 'PUMS.yaml'
reader = from_df(pums, privacy=Privacy(), metadata=meta_path)
YAML Format#
The YAML metadata format looks like this:
"":
MySchema:
MyTable:
max_ids: 1
user_id:
private_id: True
type: int
age:
type: int
lower: 0
upper: 100
The root node is a collection of tables that all exist in the same namespace. The root node can also be referred to as a “database” or a “schema”.
Each schema
node can have multiple table
nodes. Tables represent tabular data.
Each table
node can have an assortment of properties that can be set by the data curator to control per-table differential privacy. In addition, each table
node has multiple column
nodes.
Each column
node has attributes specifying the atomic datatype for use in differential privacy, sensitivity bounds, and more.
Dictionary Format#
Although loading from file path is preferred, you can also pass in the metadata as a nested dictionary.
import pandas as pd
from snsql import *
pums = df.read_csv('PUMS.csv')
metadata = {
'':{
'MySchema': {
'MyTable': {
'max_ids': 1,
'row_privacy': False,
'user_id': {
'name': 'user_id',
'type': 'int',
'private_id': True
},
'age': {
'name': 'age',
'type': 'int',
'upper': 100,
'lower': 0
}
}
}
}
}
reader = from_df(pums, privacy=Privacy(), metadata=meta_path)
Table Options#
In many cases, the underlying database engine will be configured to enforce constraints that impact differential privacy. In these cases, the data curator can inform the system that these constraints are already enforced, allowing those steps to be optimized in the differential privacy layer.
These overrides should be used with caution, because they may affect privacy if used incorrectly.
max_ids
: Integer. Default is1
. Specifies how many rows each unique user can appear in. If any user appears in more rows than specified, the system will randomly sample to enforce this limit (seesample_max_ids
).row_privacy
: Boolean. Default isFalse
. Tells the system to treat each row as being a single individual. This is common with social science datasets.sample_max_ids
: Boolean. Default isTrue
. If the data curator can be certain that each user appears at mostmax_ids
times in the table, this setting can be enabled to skip the reservoir sampling step. This may be the case if theprivate_id
is the sole primary key of the database. This will not be the case when the source data are in a denormalized fact table where a singleprivate_id
can appear in multiple rows.censor_dims
: Boolean. Default isTrue
. Drops GROUP BY output rows that might reveal the presence of individuals in the database. For example, a query doing GROUP BY over last names would reveal the existence of an individual with a rare last name. Data curators may override this setting if the dimensions are public or non-sensitive.clamp_counts
: Boolean. Default isFalse
. Differentially private counts can sometimes be negative. Setting this option toTrue
will clamp negative counts to be 0. Does not affect privacy, but may impact utility.clamp_columns
: Boolean. Default isTrue
. By default, the system clamps all input data to ensure that it falls within thelower
andupper
bounds specified for that column. If the data curator can be certain that the data never fall outside the specified ranges, this step can be disabled.use_dpsu
: Boolean. Default isFalse
. Tells the system to use Differential Private Set Union for censoring of rare dimensions. Does not impact privacy.
Column Options#
type
: Required. The type attribute indicates the simple type for all values in the column. Type may be one of “int”, “float”, “string”, “boolean”, or “datetime”. The “datetime” type includes date or time types. If type is set to “unknown”, the column will be ignored by the system.private_id
: Boolean. Default isFalse
. indicates that this column is the private identifier (e.g. “UserID”, “Household”). This column is optional. Only columns which have private_id set to true are treated as individuals subject to privacy protection.lower
: Valid on numeric columns. Specifies the lower bound for values in this column.upper
: Valid on numeric columns. Specifies the upper bound for values in this column.nullable
: Boolean. Default isTrue
. Indicates that this column can contain null values. If set toFalse
, the system will assume that all values are set. This is useful when the data curator knows that all values are set, and will allow some budget to be preserved by sharing counts across columns.missing_value
: A value of the same type as thetype
for this column. Default isNone
. If set, the system will replace NULL with the specified value, ensuring that all values are set. If set,nullable
will be treated asFalse
, regardless of its value.sensitivity
: The sensitivity to be used when releasing sums from this column. Default isNone
. If not set, the system will compute the sensitivity from upper and lower bounds. Ifsensitivity
is set, the upper and lower bounds will be ignored for sensitivity, and this value will be used. The upper and lower bounds will still be used to clamp the columns. If this value is set, and no bounds are provided, the metadata must specifyclamp_columns
asFalse
. Note that counts will always use a sensitivity of 1, regardless of the value of this attribute.
Database Names#
Any table or column objects referenced in queries must resolve to objects described in the metadata. For example, the following describes a table named “Sales” in a database named “Finance”
Finance:
Sales:
Orders:
row_privacy: True
SaleAmount:
type: int
lower: 0
upper: 100000
The following two queries will work, and are equivalent:
SELECT SUM(SaleAmount) FROM Sales.Orders;
SELECT SUM(SaleAmount) FROM Finance.Sales.Orders;
If the database engine being used is not case-sensitive, the query can also use lower case.
SELECT SUM(saleamount) FROM sales.orders;
Note that the object names used in the SQL query must match the metadata, but the database name need not match the database name of the underlying connection. For example, if the the existing connection is to a database named “FinanceTest”, the following query will work:
SELECT SUM(SaleAmount) FROM Sales.Orders;
despite the database name in the metadata being “Finance”. However, in this scenario, the following query will not work:
SELECT SUM(SaleAmount) FROM Finance.Sales.Orders;
because the query will be executed against the connection using a database name that doesn’t match the currently connected database.
Data curators may choose to leave the database name blank, in which case any database name will match. To make a blank key in YAML, use double quotes:
"":
Sales:
Orders:
row_privacy: True
SaleAmount:
type: int
lower: 0
upper: 100000
With the above YAML, all three queries will work:
SELECT SUM(SaleAmount) FROM Sales.Orders;
SELECT SUM(SaleAmount) FROM Finance.Sales.Orders;
SELECT SUM(SaleAmount) FROM FinanceTest.Sales.Orders;
Special Characters#
The SQL-92 specification allows object identifiers to include special characters such as hyphens or spaces. For example:
"":
Sales:
Backlog-Orders:
row_privacy: True
Sale-Amount:
type: int
lower: 0
upper: 100000
A query against this table would look like:
SELECT SUM("Sale-Amount") FROM Sales."Backlog-Orders";
In the above example, Backlog-Orders
and Sale-Amount
are not escaped in the metadata, which might give unexpected results. For example, the lowercase versions of these identifiers would also work. When specifying objects in metadata that need to be escaped in the SQL query, it’s a good idea to escape the objects in the metadata:
"":
Sales:
'"Backlog-Orders"':
row_privacy: True
'"Sale-Amount"':
type: int
lower: 0
upper: 100000
Note the single-quotes wrapping the identifiers escaped with double-quotes. This is necessary to ensure the YAML parser preserves the escaping.
The above uses the SQL-92 syntax for quoting identifiers with special characters. Some engines may not support this syntax, and may instead use backticks or square brackets. You can use whatever syntax is appropriate for your engine. For example, the following YAML is suitable when backticks are the preferred escape character:
"":
Sales:
"`Backlog-Orders`":
row_privacy: True
'`Sale-Amount`':
type: int
lower: 0
upper: 100000
Since our escape character is not a double quote in this case, we can wrap the string for the YAML parser equivalently using either single or double quotes.
It’s good practice to use the same escape character in the metadata as is preferred by your database engine, but it’s not strictly necessary. As long as the escape character used in the queries is approproate for the engine, the system should be able to fine the correct metadata. For example, the following query will work fine against SQL Server, using square bracket escaping, despite the fact that the metadata uses backticks:
SELECT SUM([Sale-Amount]) FROM Sales.[Backlog-Orders];
Other Considerations#
The metadata described here is concerned only with differentially private processing, and is agnostic to storage engine (e.g. Spark, SQL, CSV). Engine-specific metadata, such as database connection strings or credentials, are beyond the scope of this metadata.
The root of the metadata is a collection, which represents a collection of named tabular datasets. Tabular datasets in a collection may be joined together, and budget is shared across all tables in the collection sharing a private identifier.
A table
element must specify at least one child column. If the data curator chooses not to expose specific columns in the source table via metadata, the existence of these columns is not revealed to the analyst, and analysts may not add references to data source columns not exposed in the metadata.
The table name is the identifier used to reference the tabular dataset. The name restrictions will depend on the semantics of the data source and implementation. For example, a SQL database may support a single dot-separated namespace name (e.g. “dbo.TableName”), while a CSV file encoded with UTF-8 may support arbitrary Unicode table names.
If present, max_ids
must be the same for all tables that share a private_id
.
Although row-level privacy is often assumed to be default behavior in the literature, we require this to be explicitly opted in, because incorrect assumption will compromise privacy. If row_privacy
is true, max_ids
must be absent or set to 1.
Queries on tables with row_privacy
may query only one table at a time. Joins or other combinations that include more than one relation with row_privacy
on any of the relations, are not allowed.
The analyst may not add columns not specified in the metadata.
Expressions in queries may combine and/or change types of values.
Note that private_id
is not necessarily the same as the primary key. For example, an “Orders” table might have a compound primary key consisting of “CustomerID” and “OrderID”. In this case, orders are not private individuals, so we would specify private_id = True
on the CustomerID, but not on the OrderID.
If private_id
is not set on any column in the table, and row_privacy
is false at the table level, no queries may be performed against this table. If row_privacy is set to true, private_id need not be set, but joins will be disabled, and budget will be shared across all queries touching any table in the collection.
If private_id is set on more than one column in a dataset, the combination of columns is considered to be the compounded private identifier. This should be uncommon.
We require that the same private_id
be used across all tables in a collection, because budget is shared across all queries that access the same private individuals. In some cases, the data curator may wish to allow data to be privatized based on multiple alternative identifiers. For example, a collection of datasets might be keyed by both CustomerID and CompanyID, and the data curator may wish to allow analysts to choose between one or the other. To support this scenario, the data curator can supply two different metadata files.
The data curator should take care not to use the actual min and max values from the dataset, when setting lower
and upper
, if these are sensitive, but instead should use a domain specific lower and upper (e.g. 0-100 for age).
If lower
and upper
are not provided, numeric aggregates such as SUM, AVG, STDDEV, will be unavailable for that column.
The collection element supports an optional engine
attribute, which specifies a set of name disambiguation rules to be used. For example, some data sources may be case-sensitive, such that ‘CustomerID’ and ‘customerID’ refer to two different columns. To provide deterministic evaluation of metadata, implementations must be able to determine how to handle case-sensitivity, character sets, namespaces, and escaping of special characters such as spaces in column names.
If the engine
attribute is not specified, implementations may define implementation-specific name disambiguation rules, presumably tied to a very limited number of supported data sources. Implementations should reject data sources with unknown engine specified.
Collection names, and switching between collections, are implementation-dependent and out of scope for this document.
In some cases, implementation may map the attribute names specified above to avoid collision with reserved keywords. For example, ‘type’ is a reserved keyword in some programming languages, so in-memory objects will use a different attribute name, such as ‘val_type’. Implementations may choose to serialize using different conventions, such as camel casing or snake casing. It is not expected that serializations from one implementation will be used in another.
All typed columns are assumed to allow NULL or missing values. In cases where the data curator knows that missing values are impossible, it may be desirable to allow specification of a ‘no_nulls’ attribute, to improve some calculation. This is out of scope for this document.
Example Metadata#
The following is an example of a collection containing 3 tables, representing Crashes, Rollouts, and Census for a population of devices. The collection name is “Telemetry”, and names are serialized as headings.
Collection:
Telemetry:
Crashes:
rows: 103000
Refurbished:
type: boolean
Temperature:
type: float
lower: 25.0
upper: 65.0
Building:
type: string
Region:
type: string
DeviceID:
type: int
private_id: True
Crashes:
type: int
lower: 0
upper: 10
Census:
DeviceID:
type: int
private_id: true
OEM:
type: string
Memory:
type: string
Disk:
type: int
lower: 100
upper: 10000
Rollouts:
DeviceID:
type: int
private_id: true
RolloutID:
type: int
StartTrial:
type: datetime
EndTrial:
type: datetime
TrialGroup:
type: int