From 8744979debb5a4d32ebcbdcc08db18e5dea372ab Mon Sep 17 00:00:00 2001
From: Simon Willison <swillison@gmail.com>
Date: Wed, 20 Mar 2024 13:49:53 -0700
Subject: [PATCH] Reviewing your history of public GitHub repositories using ClickHouse

---
 clickhouse/github-public-history.md | 51 +++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)
 create mode 100644 clickhouse/github-public-history.md

diff --git a/clickhouse/github-public-history.md b/clickhouse/github-public-history.md
new file mode 100644
index 0000000000..b2b727d967
--- /dev/null
+++ b/clickhouse/github-public-history.md
@@ -0,0 +1,51 @@
+# Reviewing your history of public GitHub repositories using ClickHouse
+
+There's a story going around at the moment that people have found code from their private GitHub repositories in the AI training data known as The Stack, using this search tool: https://huggingface.co/spaces/bigcode/in-the-stack
+
+I'm very doubtful that private data has been included in that training set. I think it's far more likely that the repositories in question were public at some point in time, and were gathered up by the https://www.softwareheritage.org/ project when they archived code from GitHub.
+
+But how can we tell if a private repository was public at some point in the past?
+
+GitHub have [a security audit log](https://github.com/settings/security-log) for logged in users, but sadly it appears to only cover the past six months.
+
+For a longer history, we can look things up in the [GitHub Archive](https://www.gharchive.org/) project, which has been recording public events from the GitHub API since 2011.
+
+The [ClickHouse](https://clickhouse.com/) team provide a public tool for querying that data using SQL as a demo of their software. We can use that to try and find out if a repository was public at some point in the past.
+
+Access the tool here - no login required: https://play.clickhouse.com/play
+
+Now execute the following SQL, replacing my username with yours in both places where it occurs:
+
+```sql
+with public_events as (
+  select
+    created_at as timestamp,
+    'Private repo made public' as action,
+    repo_name
+  from github_events
+  where actor_login = 'simonw'
+    and event_type in ('PublicEvent')
+),
+most_recent_public_push as (
+  select
+    max(created_at) as timestamp,
+    'Most recent public push' as action,
+    repo_name
+  from github_events
+  where event_type = 'PushEvent'
+    and actor_login = 'simonw'
+  group by repo_name
+),
+combined as (
+  select * from public_events
+  union all select * from most_recent_public_push
+)
+select * from combined order by timestamp
+```
+The result is a combined timeline showing two things:
+- `PublicEvent` events - which [GitHub describes](https://docs.github.com/en/rest/using-the-rest-api/github-event-types?apiVersion=2022-11-28#publicevent) as "When a private repository is made public. Without a doubt: the best GitHub event."
+- The most recent `PushEvent` for each repository. Repositories which started life public won't show up in the `PublicEvent` list, so this aims to capture them.
+
+Here's an extract from the data I get back when I run the query for myself:
+
+![2017-09-10: Most recent public push, simonw/github-large-file-test - 2017-09-12: Most recent public push, simonw/Houston-Shelters - 2017-09-26: Private repo made public, simonw/squirrelspotter - 2017-10-01: Private repo made public, simonw/simonwillisonblog - 2017-10-12: Most recent public push, simonw/ratelimitcache - 2017-10-15: Most recent public push, simonw/irma-scraped-data - 2017-10-15: Most recent public push, simonw/fema-history - 2017-11-06: Most recent public push, simonw/factory_worker_python - 2017-11-13: Private repo made public, simonw/datasette](https://github.com/simonw/til/assets/9599/5541e0d0-9b34-4eb6-bb43-6a2fd91ce7d1)
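
The username substitution described in the post ("in both places where it occurs") could also be scripted. The following is a minimal Python sketch, not part of the original post. The helper names `build_query` and `run_query` are hypothetical. The HTTP endpoint `https://play.clickhouse.com/?user=play` is an assumption about how the playground exposes ClickHouse's HTTP interface. The username check assumes ordinary GitHub logins contain only letters, digits and hyphens.

```python
import re
import urllib.request

# Assumption: the public playground accepts queries over ClickHouse's HTTP
# interface at this URL using the read-only "play" user.
PLAY_URL = "https://play.clickhouse.com/?user=play"

# The query from the post, with the username replaced by a placeholder
# in both places where it occurs.
QUERY_TEMPLATE = """
with public_events as (
  select
    created_at as timestamp,
    'Private repo made public' as action,
    repo_name
  from github_events
  where actor_login = '{login}'
    and event_type in ('PublicEvent')
),
most_recent_public_push as (
  select
    max(created_at) as timestamp,
    'Most recent public push' as action,
    repo_name
  from github_events
  where event_type = 'PushEvent'
    and actor_login = '{login}'
  group by repo_name
),
combined as (
  select * from public_events
  union all select * from most_recent_public_push
)
select * from combined order by timestamp
"""


def build_query(login: str) -> str:
    """Return the timeline query with `login` filled into both filters."""
    # Assumption: ordinary GitHub usernames are letters, digits and hyphens.
    # Rejecting anything else keeps quotes out of the SQL string literal.
    if not re.fullmatch(r"[A-Za-z0-9-]+", login):
        raise ValueError(f"unexpected characters in username: {login!r}")
    return QUERY_TEMPLATE.format(login=login)


def run_query(login: str) -> str:
    """POST the query to the playground and return tab-separated text."""
    sql = build_query(login) + "format TabSeparatedWithNames"
    request = urllib.request.Request(PLAY_URL, data=sql.encode("utf-8"))
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Replace with your own GitHub username.
    print(run_query("simonw"))
```

Keeping `build_query` separate from `run_query` means the generated SQL can be inspected, or pasted into the web tool, without making any network request.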