This repository has been archived by the owner on Jul 16, 2020. It is now read-only.
This would be useful in my group (the Laboratory Medicine department at the University of Washington Medical Center), which has databases containing PHI (http://en.wikipedia.org/wiki/Protected_health_information). After extracting test data from such a database, the PHI must be mangled before the data is stored in a repository.
Perhaps the user could specify the PHI columns and their data types, and the program could then generate random data of the matching type for each column. This could work well for scalar types.
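To make the suggestion concrete, here is a minimal sketch in plain Python of per-column, type-driven replacement. It is not part of rdbms-subsetter; the names `random_value` and `mangle_row`, the type labels (`"integer"`, `"text"`, `"date"`), and the example column names are all hypothetical and chosen only for illustration.

```python
import datetime
import random
import string


def random_value(type_name, rng):
    """Return a random value for a (hypothetical) scalar type label."""
    if type_name == "integer":
        return rng.randint(0, 999_999)
    if type_name == "text":
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    if type_name == "date":
        # Any day within roughly 100 years after 1900-01-01.
        return datetime.date(1900, 1, 1) + datetime.timedelta(days=rng.randint(0, 36_500))
    raise ValueError(f"no generator for type {type_name!r}")


def mangle_row(row, phi_columns, rng):
    """Copy `row`, replacing each listed PHI column with random data.

    `phi_columns` maps column name -> type label. NULLs (None) are kept
    so that the sparsity of the column is unchanged.
    """
    mangled = dict(row)
    for column, type_name in phi_columns.items():
        if mangled.get(column) is not None:
            mangled[column] = random_value(type_name, rng)
    return mangled


if __name__ == "__main__":
    rng = random.Random()
    row = {"id": 1, "patient_name": "Jane Doe", "dob": datetime.date(1980, 1, 1), "result": 4.2}
    print(mangle_row(row, {"patient_name": "text", "dob": "date"}, rng))
```

Columns not listed in the mapping (here `id` and `result`) pass through untouched, so keys and non-sensitive measurements survive. Note that randomizing a column that participates in a foreign key would break referential integrity unless the same replacement is applied consistently on both sides.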
To be specific, my use case is that I'm the primary developer of Warehouse, which will replace the software that powers PyPI. One of the challenges of that (Warehouse being an OSS project itself) is how to create a public dataset that is representative of the real data without being the entire set of real data and without exposing anything sensitive. Currently my method is basically to copy some data manually and then go in and sanitize it by hand to remove the sensitive parts. It would be great to be able to rely on rdbms-subsetter to automate this for me, though.
For protecting PII, etc., rdbms-subsetter should be able to integrate with an existing library to obscure data while preserving its overall "flavor".
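One possible reading of preserving "flavor" is keeping the shape of each value (its length, punctuation, and the position of letters versus digits) while randomizing the characters themselves. The sketch below uses only the standard library as a stand-in for whatever external library might eventually be integrated; the function name `scramble_preserving_shape` is hypothetical.

```python
import random
import string


def scramble_preserving_shape(value, rng):
    """Replace each character with a random one of the same class.

    Digits become random ASCII digits, uppercase letters become random
    ASCII uppercase letters, lowercase letters become random ASCII
    lowercase letters; everything else (spaces, punctuation) is kept.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random()
    print(scramble_preserving_shape("555-867-5309", rng))
    print(scramble_preserving_shape("Jane.Doe@example.com", rng))
```

A phone number still looks like a phone number and an email address still has its `@` and dots, but the result is gibberish rather than a realistic name or address. Producing plausible-looking substitutes would need a data-generation library, which is outside what this sketch shows.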