Classification of HTTP Archive origins by the Topics API.
- Clone this repository along with its submodule:
  git clone --recurse-submodules <HTTPS or SSH URL>
- Place the .csv files with the HA origins under ha_urls.
- Launch the classification (we recommend using a screen session):
  - If you have the dependencies installed:
    ./classify_origins.sh
    - System dependencies: python3, GNU parallel, unzip
    - Python dependencies: pandas, tflite-support
  - If using Docker:
    docker build -t topics-image:latest .
    docker run --rm -it -v ${PWD}:/workspaces/topics \
      -w /workspaces/topics --entrypoint ./classify_origins.sh topics-image:latest
- Refer to the created .tsv file for the classification results. Find the matching taxonomy in the corresponding folder under topics_classifier (-2 stands for the Unknown topic).
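As a rough illustration of inspecting the results, the sketch below counts how many rows of a tab-separated file carry the Unknown topic (-2). The column names `origin` and `topic_id`, the sample rows, and the helper `count_topics` are hypothetical; check the header of the .tsv file actually produced before adapting it.

```python
import csv
import io
from collections import Counter

UNKNOWN_TOPIC = "-2"  # per this README, -2 stands for the Unknown topic


def count_topics(tsv_file, topic_column="topic_id"):
    """Count rows per topic ID in a tab-separated file with a header row.

    `topic_column` is an assumed column name, not taken from the repository.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row[topic_column] for row in reader)


# In-memory stand-in for a results file (made-up rows and topic IDs).
sample = io.StringIO(
    "origin\ttopic_id\n"
    "https://a.example\t-2\n"
    "https://b.example\t215\n"
    "https://c.example\t-2\n"
)

counts = count_topics(sample)
unknown = counts[UNKNOWN_TOPIC]
total = sum(counts.values())
print(f"{unknown} of {total} rows are Unknown")
```

To run this against a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` sample.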
To classify millions of domains, deploy a VM with a large number of vCPUs so that GNU parallel can be used to its full extent. There is no special need for RAM or storage beyond the minimum required for the chosen instance.
As a reference, classifying the latest CrUX top 1M list on a c6g.8xlarge (32 vCPUs) EC2 instance takes about 40 minutes.
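The reference figure above can be turned into a back-of-the-envelope estimate for other list sizes and instance sizes. The sketch below assumes that runtime scales linearly with the number of origins and inversely with the number of vCPUs, which is a simplification and not something measured here; `estimate_minutes` is a hypothetical helper, not part of the repository.

```python
# Reference point from this README: about 40 minutes for 1M origins on 32 vCPUs.
REFERENCE_ORIGINS = 1_000_000
REFERENCE_MINUTES = 40
REFERENCE_VCPUS = 32


def estimate_minutes(n_origins, vcpus):
    """Estimate wall-clock minutes, assuming perfectly linear scaling.

    Real runs may differ with CPU architecture, I/O, and startup overhead.
    """
    # Origins processed per vCPU per minute at the reference point.
    per_vcpu_per_minute = REFERENCE_ORIGINS / (REFERENCE_MINUTES * REFERENCE_VCPUS)
    return n_origins / (per_vcpu_per_minute * vcpus)


print(estimate_minutes(1_000_000, 32))  # reproduces the reference point
print(estimate_minutes(5_000_000, 64))  # larger list on a larger instance
```

Treat the result as an order-of-magnitude guide only; a short trial run on a sample of the list gives a more reliable figure.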