Caching host lists to avoid unnecessary downloads for speed and searching the cache to find which source blocked which domain #88

Open
yuannan opened this issue Dec 2, 2021 · 1 comment

yuannan commented Dec 2, 2021

Is your feature request related to a problem? Please describe:
Sometimes a list blocks a domain and I want to either exclude that host list or report it upstream if it's a false positive.

This is currently hard, if not impossible, without the help of scripts, as it involves the user downloading the host lists individually and then searching through them.

Describe the solution you'd like:
The host lists are cached within /etc/hblock/host_lists/.
Using -fhs or --find-host-source, hBlock searches that cache and reports which host list blocked the domain.
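To make the request concrete, here is a minimal sketch of what such a lookup could do once the lists are on disk. The function name `find_host_source` is hypothetical, the cache is assumed to be a flat directory of plain-text lists, and none of this is an existing hBlock option. It matches the domain as a whole field, so searching for `example.com` does not also report `ads.example.com`.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of hBlock: print the path of every cached
# list that contains the given domain as a whole whitespace-separated field.
find_host_source() {
	local domain="$1" cache_dir="$2"
	local files=("$cache_dir"/*)
	# An unmatched glob stays literal, so bail out if the cache is empty.
	[ -e "${files[0]}" ] || return 1
	awk -v d="$domain" '
		/^#/ { next }                # ignore comment lines
		!(FILENAME in seen) {        # report each list at most once
			for (i = 1; i <= NF; i++) {
				if ($i == d) { seen[FILENAME] = 1; print FILENAME; break }
			}
		}
	' "${files[@]}"
}
```

With the cache location proposed above, it would be called as `find_host_source ads.example.com /etc/hblock/host_lists`.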

Describe alternatives you've considered:
So far I've written my own script to search the lists. It has to download them first and then search, which is why I think a cache is a good idea.

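#!/usr/bin/env bash
# Download every source in the hBlock sources list, then search them for a domain.
block_list_path="/etc/hblock/sources.list"

if [[ -z "$1" ]]
then
	echo "Needs domain to search for..." >&2
	exit 1
fi

dir=$(mktemp -d)
echo "$dir"
cd "$dir" || exit 1

# Read one source per line so that a comment containing spaces is skipped as a whole.
while IFS= read -r s
do
	# Skip blank lines and comments.
	if [[ -n "$s" && ! "$s" =~ ^# ]]
	then
		wget -q "$s" 2>/dev/null &
	fi
done < "$block_list_path"
wait

# -F matches the domain literally, so dots are not treated as regex wildcards.
grep -F -- "$1" "$dir"/*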

The cache deserves its own feature request, but I think it should either:

  1. Keep the host lists as they are right now but add a meta file at https://raw.githubusercontent.com/hectorm/hmirror/master/data/lists.meta.txt to keep track of when each list was last updated. This speeds up downloads and avoids fetching files unnecessarily. It moves hblock closer to a package manager for host files.

  2. Have a header at the top of the file, written as a # comment, indicating when it was last updated. The header can be downloaded without downloading the entire file. The "raw" version of this file should be kept cached. When it is processed, all lines starting with '#' will be ignored before it is made into /etc/hosts. I've tested this with

curl -r 0-100 https://raw.githubusercontent.com/hectorm/hmirror/master/data/ublock/list.txt

to get the first 101 bytes of the file (the byte range is inclusive). This can easily be downloaded first and checked against the database on the device. If the header within the file is newer, then the rest is downloaded. I think this should be avoided in favour of the first option, as I imagine others have set up their own scripts with the assumption that your mirrored lists have no comments. Option 1 is much easier to implement and has fewer side effects.
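As an illustration of the freshness check in option 2, here is a rough sketch. The header format (`# updated: <Unix timestamp>` on the first line) and the function names `header_timestamp` and `needs_update` are assumptions made up for this example; the mirrored lists do not carry such a header today. The same comparison would work for option 1 if the meta file stored one timestamp per list.

```shell
#!/usr/bin/env bash
# Assumed header (hypothetical, not something the mirror emits today):
#   # updated: 1638403200

# Read the beginning of a list on stdin and print the timestamp found on
# its first line. Fails if the first line is not the assumed header.
header_timestamp() {
	local first re='^# updated: ([0-9]+)$'
	IFS= read -r first || [ -n "$first" ] || return 1
	first=${first%$'\r'}             # tolerate CRLF line endings
	[[ "$first" =~ $re ]] || return 1
	printf '%s\n' "${BASH_REMATCH[1]}"
}

# Succeed when the remote timestamp is newer than the cached one.
# A missing cached timestamp counts as 0, so the first run always updates.
needs_update() {
	local remote="$1" cached="${2:-0}"
	[ "$remote" -gt "$cached" ]
}
```

Combined with the ranged request above, the check could look like `remote=$(curl -fsS -r 0-100 "$url" | header_timestamp) && needs_update "$remote" "$cached"`, with the full download happening only when that succeeds.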

Owner

hectorm commented Dec 6, 2021

Hi, thank you for taking the time to write this request.

I think a cache is outside the scope of the project; I would not like hBlock to create any files other than the one specified in the --output option.

However, I understand that it would be useful to easily know which sources are blocking a particular domain, so I'm thinking about adding a feature that downloads the sources and prints this information. It would be quite similar to what your script does.
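One way such a feature could work without writing any files is to stream each source straight into the search. The sketch below is only a guess at the shape of it, not the maintainer's design: the function name `which_sources_block`, the `FETCH` variable, and the use of curl are all assumptions. It fetches sources one after another rather than in parallel, which keeps the output order stable at the cost of speed.

```shell
#!/usr/bin/env bash
# Command used to fetch a source to stdout. curl is an assumption here;
# set FETCH to something else to change it.
FETCH=${FETCH:-curl -fsSL}

# Read source URLs on stdin (one per line, blank lines and "#" comments
# allowed) and print each URL whose content lists the domain as a whole field.
which_sources_block() {
	local domain="$1" url
	while IFS= read -r url; do
		case "$url" in ''|'#'*) continue ;; esac
		# </dev/null keeps the fetcher from consuming the URL list on stdin.
		if $FETCH "$url" 2>/dev/null </dev/null | awk -v d="$domain" '
			/^#/ { next }
			{ for (i = 1; i <= NF; i++) if ($i == d) { found = 1; exit } }
			END { exit !found }
		'; then
			printf '%s\n' "$url"
		fi
	done
}
```

Using the sources file from the script above, a call would be `which_sources_block ads.example.com < /etc/hblock/sources.list`.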
