HTTP Archive Topics API Classification

Classification of HTTP Archive origins by the Topics API.

Getting Started

Clone this repository together with its submodule: git clone --recurse-submodules <HTTPS or SSH URL>.
Place the .csv files containing the HTTP Archive (HA) origins under ha_urls.
Launch classification (we recommend using a screen session):

(if the dependencies are installed locally): ./classify_origins.sh
- System Dependencies: python3, GNU parallel, unzip
- Python Dependencies: pandas, tflite-support
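
Before launching a long run, it can help to confirm that the dependencies listed above are visible from the current environment. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and it assumes that GNU parallel is installed as the `parallel` command and that tflite-support is importable as `tflite_support`.

```python
import importlib.util
import shutil

# Assumed command and module names for the dependencies listed in the README.
SYSTEM_TOOLS = ["python3", "parallel", "unzip"]
PYTHON_MODULES = ["pandas", "tflite_support"]


def missing_tools(tools):
    """Return the command names that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    problems = missing_tools(SYSTEM_TOOLS) + missing_modules(PYTHON_MODULES)
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All dependencies found.")
```

The check only verifies that each command or module can be found, not that its version is compatible with the classifier.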

(if using Docker):

docker build -t topics-image:latest .
docker run --rm -it -v "${PWD}":/workspaces/topics \
    -w /workspaces/topics --entrypoint ./classify_origins.sh topics-image:latest

Refer to the generated .tsv file for the classification results. The matching taxonomy is in the corresponding folder of topics_classifier (-2 stands for the Unknown topic).
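
As a rough illustration of how the results file could be inspected, the sketch below counts how many rows fall under each topic ID and what share is Unknown. The column name `topic_id` and the presence of a header row are assumptions, since the README does not describe the .tsv layout; adjust `topic_column` to match the actual file.

```python
import csv
from collections import Counter

UNKNOWN_TOPIC = "-2"  # per the README, -2 stands for the Unknown topic


def topic_counts(tsv_path, topic_column="topic_id"):
    """Count rows per topic ID in a tab-separated file with a header row.

    The default column name is an assumption, not taken from the repository.
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row[topic_column] for row in reader)


def unknown_share(counts):
    """Return the fraction of rows classified as the Unknown topic."""
    total = sum(counts.values())
    return counts[UNKNOWN_TOPIC] / total if total else 0.0
```

Topic IDs are kept as strings here, so they can be looked up directly against whatever identifiers the taxonomy file uses.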

Parallelization

To classify millions of domains, deploy a VM with a large number of vCPUs so that GNU parallel can be used to its full extent. No RAM or storage is needed beyond the minimum that comes with the chosen instance.

As a reference, classifying the latest CrUX top 1M list on a c6g.8xlarge (32 vCPUs) EC2 instance takes about 40 minutes.
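
The reference figure above can be turned into a back-of-the-envelope runtime estimate for other list sizes and instance sizes. This sketch assumes throughput scales linearly with the number of vCPUs and that per-origin cost matches the reference run; neither is stated by the README, so treat the result as a rough planning number only. The function name is hypothetical.

```python
# Reference run from the README: 1M origins, 32 vCPUs, about 40 minutes.
REFERENCE_ORIGINS = 1_000_000
REFERENCE_VCPUS = 32
REFERENCE_MINUTES = 40


def estimate_minutes(num_origins, num_vcpus):
    """Estimate wall-clock minutes, assuming linear scaling with vCPU count."""
    if num_origins < 0 or num_vcpus <= 0:
        raise ValueError("num_origins must be >= 0 and num_vcpus must be > 0")
    # Origins processed per vCPU per minute in the reference run.
    per_vcpu_rate = REFERENCE_ORIGINS / (REFERENCE_MINUTES * REFERENCE_VCPUS)
    return num_origins / (per_vcpu_rate * num_vcpus)


if __name__ == "__main__":
    # Example: 10M origins on a hypothetical 64-vCPU instance.
    print(f"{estimate_minutes(10_000_000, 64):.0f} minutes")
```

Under these assumptions the reference run works out to roughly 780 origins per vCPU per minute.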

yohhaan/httparchive-topics-classification
