Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specifying a set of possible values for categorical variables #534

Open
mhauru opened this issue Feb 27, 2023 · 2 comments
Open

Specifying a set of possible values for categorical variables #534

mhauru opened this issue Feb 27, 2023 · 2 comments

Comments

@mhauru
Copy link
Contributor

mhauru commented Feb 27, 2023

Is there a way in smartnoise-sql to specify the set of possible values a categorical variable could take, that would then affect count queries, so that all valid values would have a chance of getting a non-zero count reported for them, even if some may not appear in the data?

For a use case, imagine a people table with a gender column. In a small dataset the only values that feature might be 'male' and 'female' (as strings or enums, for instance), but I might want to specify that other values like 'N/A' would also be valid, and have queries like SELECT COUNT(*) AS num, gender FROM people GROUP BY gender have a chance of reporting non-zero values for them.

@joshua-oss
Copy link
Contributor

Hi @mhauru,

The way that SmartNoise SQL is currently designed, rare dimension combinations get censored, because of the possibility that they will "fingerprint" people. This means that the total number of dimension combinations in the output of a GROUP BY will always be less than or equal to the number that existed in the data. This could create bias. Censoring may be unnecessary if all the dimensions are public, and the total size of the product of cardinalities is manageable. For example, if ZIP code is public, and the list of allowed genders is public, we could instead use a zero count for all combinations of ZIP and gender that could possibly occur but didn't, and then add noise to the zero counts. As long as all possible combinations are used to create noisy counts, the censoring of rare dimensions can be ignored. In this situation, the number of dimensions that appear in the output could be considerably larger than the dimensions that actually appeared in the data.

The most direct way to do this in a SQL-friendly manner is to CROSS JOIN the public dimensions, then OUTER JOIN the exploded dimensions back against the original table. This will result in a table with the same schema and metadata as the original table, but with all possible dimension combinations receiving noisy counts. We provide the ability to set censor_dims=False to support this pattern, and you sometimes will want to do clamp_counts=False in this situation as well, because it will allow negative counts to reduce bias. The typical way to do that is by exposing the version with cross-joined dimensions as a view. There is a sample of doing this in Spark at https://github.com/opendp/smartnoise-sdk/blob/main/sql/samples/Synopsis.ipynb (under "Noisy Values for Missing Dimensions").

It's conceivable that this could be done automatically under the covers, but SmartNoise would need to know the cardinalities of the dimensions to avoid creating an explosion. I can think of a few ways to design this:

  1. Allow metadata to specify is_public_dim on a per-column basis. If the caller does a GROUP BY that involves only public dimensions, do a CROSS JOIN under the covers and skip dimension censoring. Additionally, allow a per-column cardinality that allows the engine to decide when a cross join would create too many dimensions.
  2. The above would still disallow dimensions combinations with any value from a single dimension that didn't appear in the data. To address this, we could add support for dimension tables as in a star schema or snowflake schema, then the dimension table would always be outer joined.
  3. Extend the metadata to allow each categorical column to have a list of allowed values. Append all missing dimensions combinations to the end of the query result after receiving the aggregates but before adding noise.

I don't love #3, because it's not very SQL-friendly and could lead to situations where the metadata gets out of date with the SQL dimensions. And we would need to shuffle the rows before returning, or prohibit results without an ORDER BY, because otherwise the adversary would be able to easily guess which rows were added in post-processing. But it might be closest to what you are asking for, and could be pretty fast, because no CROSS JOIN required

Even with #2, we could skip the CROSS JOIN and just use the dimension table to look up values which we then insert as in #3. So the question is which is a better pattern. I imagine for people who already have a star schema in a relational engine, #2 is better, but #3 is more natural for people who are querying a CSV or Pandas dataframe.

Would be interested in your feedback

@mhauru
Copy link
Contributor Author

mhauru commented Apr 12, 2023

Thanks for the extensive response @joshua-oss. I hadn't thought about the option of using CROSS JOINs to achieve this with just SQL. For my use case 3 would be easiest, because we don't have all the allowed values in a dimension table in the database, but it's rather domain knowledge external to the data. I hear what you say though about the benefit of keeping the metadata as part of the data to keep it consistent and up-to-date, and to avoid any accidental leaks with different types of queries.

Regarding using CROSS JOINs, I was trying to do some regular INNER JOINs yesterday, and encountered a ValueError: Support for JOIN queries is currently disabled (while working around that, also ValueError: Support for subqueries is currently disabled, though that's getting off-topic). Do you think this might be supported in the future? I would be interested in JOINs (and subqueries) for both specifying valid values as discussed in this issue, and for other uses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants