Today github announced new BigQuery tables that let you query data about github repos using the BigQuery. I found about it from hacker news. Also see announcement from Google. The killer new feature is the ability to query the contents of all files stored on github.
I will use it to list top packages required in elisp files on github. If you are impatient you can jump directly to Results. To run a query go to BigQuery console and click “Compose Query”.
In contrast to old BigQuery with top starred repos I will list all packages required in any elisp file, rather than just repositories written in emacs lisp. This will include built in emacs packages or packages not stored on github.
Revision history of this blog post is stored on github.
As this hacker news post mentions, only repos with license are indexed. That will bias the result towards packages rather than personal emacs configs people put on github.
sample_contents only contains 10% sample files from top 130k repos according to stars. It also contain pre-joined data between tables contents and files. It’s good for data exploration before running the bigger query on the full dataset.
SELECT
line,
COUNT(DISTINCT(sample_repo_name)) as count
FROM (
SELECT
SPLIT(content, '\n') line,
sample_repo_name
FROM
[bigquery-public-data:github_repos.sample_contents]
WHERE
(sample_path LIKE '%.el'
OR sample_path LIKE '%.emacs')
HAVING
line LIKE '(require%)')
GROUP BY
1
ORDER BY
count DESC
LIMIT
10;
Top 10 results:
line | count |
---|---|
(require ‘cl-lib) | 32 |
(require ‘ert) | 22 |
(require ‘cl) | 20 |
(require ‘comint) | 19 |
(require ‘org) | 14 |
(require ‘font-lock) | 13 |
(require ‘compile) | 11 |
(require ‘eieio) | 10 |
(require ‘package) | 9 |
(require ‘dash) | 9 |
Since the top entry got only 32 hits it made me worry that sample size is too small.
Only 266 emacs repos are included in sample_contents:
SELECT
COUNT(DISTINCT(sample_repo_name))
FROM [bigquery-public-data:github_repos.sample_contents]
WHERE
(sample_path LIKE '%.el' OR
sample_path LIKE '%.emacs')
266
That made me think that I need to query the unsampled data.
contents contains all files, but require join with files to get file path. My query:
SELECT
TOP(line, 5000),
COUNT(*) AS c
FROM (
SELECT
SPLIT(contents.content, '\n') line,
contents.id AS id
FROM
[bigquery-public-data:github_repos.contents] AS contents
JOIN (
SELECT
path,
id
FROM
[bigquery-public-data:github_repos.files]
WHERE
path LIKE '%.el'
OR path LIKE '%.emacs') AS files
ON
(contents.id == files.id)
WHERE
contents.content CONTAINS '(require \''
HAVING
LEFT(line, 10) == "(require '"
AND RIGHT(line, 1) == ")");
Thanks to all optimizations getting top 5000 packages takes only 60 seconds.
I have been getting a few errors, including Error: Resources exceeded during query execution.
for a while.
It required a couple of BigQuery tricks to make it work
- TOP/COUNT is faster than GROUP BY/ORDER/LIMIT.
- Filtering data prior to join in sub-query reduces memory usage.
- Regexps and globs are expensive. LEFT/RIGHT is much faster.
- Avoid scanning files without .el or .emacs extension (thanks to filtering done by join).
Full list of top 5000 is at https://github.com/kozikow/kozikow-blog/blob/master/github_emacs_packages_5000.csv .
Packages with 1000+ usages:
package | count |
---|---|
(require ‘cl-lib) | 18901 |
(require ‘org) | 12144 |
(require ‘cl) | 11695 |
(require ‘ob) | 11073 |
(require ‘ert) | 7978 |
(require ‘semantic) | 6684 |
(require ‘easymenu) | 5821 |
(require ‘comint) | 5255 |
(require ‘eieio) | 4691 |
(require ‘gnus) | 4446 |
(require ‘font-lock) | 4272 |
(require ‘quail) | 3991 |
(require ‘helm) | 3839 |
(require ‘dash) | 3650 |
(require ‘calc-macs) | 3602 |
(require ‘compile) | 3524 |
(require ‘gnus-util) | 3418 |
(require ‘erc) | 3371 |
(require ‘calc-ext) | 3295 |
(require ‘thingatpt) | 3294 |
(require ‘org-compat) | 3282 |
(require ‘org-macs) | 3268 |
(require ‘shimbun) | 3019 |
(require ‘url-parse) | 2921 |
(require ‘format-spec) | 2696 |
(require ‘company) | 2607 |
(require ‘package) | 2512 |
(require ‘ring) | 2399 |
(require ‘message) | 2334 |
(require ‘cc-mode) | 2329 |
(require ‘python) | 2243 |
(require ‘nnheader) | 2232 |
(require ‘mm-util) | 2227 |
(require ‘custom) | 2223 |
(require ‘calendar) | 2190 |
(require ‘nnoo) | 2187 |
(require ‘gnus-sum) | 2044 |
(require ‘mh-e) | 2022 |
(require ‘ox) | 1978 |
(require ‘gyp) | 1964 |
(require ‘magit) | 1953 |
(require ‘ede) | 1900 |
(require ‘json) | 1874 |
(require ‘dired) | 1849 |
(require ‘nnmail) | 1799 |
(require ‘button) | 1781 |
(require ‘tramp) | 1693 |
(require ‘slime) | 1629 |
(require ‘etags) | 1619 |
(require ‘ansi-color) | 1618 |
(require ‘wid-edit) | 1611 |
(require ‘url) | 1566 |
(require ‘mm-decode) | 1562 |
(require ‘gnus-art) | 1543 |
(require ‘helm-help) | 1538 |
(require ‘semantic/format) | 1504 |
(require ‘outline) | 1495 |
(require ‘imenu) | 1493 |
(require ‘ob-eval) | 1457 |
(require ‘ob-core) | 1419 |
(require ‘url-util) | 1396 |
(require ‘ecb-util) | 1374 |
(require ‘pcomplete) | 1357 |
(require ‘url-vars) | 1337 |
(require ‘speedbar) | 1336 |
(require ‘widget) | 1273 |
(require ‘esh-util) | 1260 |
(require ‘s) | 1239 |
(require ‘helm-utils) | 1234 |
(require ‘auto-complete) | 1232 |
(require ‘xml) | 1231 |
(require ‘semantic/db) | 1219 |
(require ‘haskell-mode) | 1201 |
(require ‘semantic/analyze) | 1195 |
(require ‘ewoc) | 1152 |
(require ‘gnus-range) | 1151 |
(require ‘eshell) | 1065 |
(require ‘cider-client) | 1040 |
(require ‘eieio-base) | 1037 |
(require ‘find-func) | 1032 |
(require ‘semantic/ctxt) | 1008 |
(require ‘help-mode) | 1006 |
(require ‘gnus-int) | 1001 |
In sampled query I counted unique repos that reference given library. In full query I just counted number of times given line appeared in any emacs file. It is better to count unique repos, but calculating distinct repos on a full contents would require additional join.