expose "Dead Weight" of a list #303

Closed
jawz101 opened this issue Jun 28, 2018 · 7 comments
Labels
analytics (service that provides various statistics about FilterLists), url-validation (service that validates URLs)

Comments

@jawz101
Contributor

jawz101 commented Jun 28, 2018

I would think displaying the number of dead hosts on the webpage may encourage list maintainers to make sure their lists are actively maintained and indicate to users when a list is poorly maintained.

https://pyfunceble.readthedocs.io/en/latest/

I came across this utility when looking for ways to clean up a tiny host file I maintain.

@collinbarrett
Owner

Nice. I love the idea.

@collinbarrett collinbarrett changed the title Feature Request: Consider integrating PyFunceble to show % of dead entries add support for flagging a list of and/or displaying a count of unreachaple domains/IPs in lists Jun 28, 2018
@jawz101
Contributor Author

jawz101 commented Jun 29, 2018

I played around with it last night on a list I maintain on GitHub. Really easy to use.

I saw it when reviewing a pull request on Adaway's host file (I hate that list, btw. The tool proves my point). 41% of the 410 hosts don't even exist anymore. The list hasn't been updated in 2 years. I tried once to submit some actual mobile ad domains and they reverted it because I'd accidentally blocked a URL shortener. Grr.

@collinbarrett
Owner

PyFunceble looks cool, but would prefer to find a .NET approach or a 3rd-party API rather than incurring the extra overhead of rolling in a Python tool.

@collinbarrett collinbarrett changed the title add support for flagging a list of and/or displaying a count of unreachaple domains/IPs in lists support flagging a list of and/or displaying a count of unreachaple domains/IPs in lists Jun 30, 2018
@collinbarrett collinbarrett changed the title support flagging a list of and/or displaying a count of unreachaple domains/IPs in lists support flagging a list of and/or displaying a count of unreachable domains/IPs in lists Aug 13, 2018
@collinbarrett collinbarrett added the web front-end user interface label Aug 13, 2018
@collinbarrett collinbarrett changed the title support flagging a list of and/or displaying a count of unreachable domains/IPs in lists expose "Dead Weight" of a list Aug 21, 2018
@collinbarrett collinbarrett added the url-validation service that validates URLs label Aug 24, 2018
@funilrys

Hi there,
may I ask how large the list is?

I could add it to https://github.com/dead-hosts. You would then only have to pull the clean.list (generated once the whole list has been tested) and do your comparison, business logic, or whatever you want with the generated data.

@collinbarrett
Owner

Hey, @funilrys, @dead-hosts looks like a neat project. I hadn't heard about it.

The idea of this issue for FilterLists.com would be to monitor all of the ~800 (as of now) lists that we index for any dead domains, IPs, and maybe even (though this is much harder to do) static syntax rules that are no longer valid. We could then expose this in various ways: a percentage of dead vs. active entries, a listing of dead rules, etc.
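
As a rough illustration of the "percentage of dead vs. active" idea, here is a minimal Python sketch. It assumes the original list and a list of active domains (for example, the contents of a clean.list) are both available as text. The function names `parse_hosts` and `dead_weight` are hypothetical, and the parsing is deliberately naive: it keeps only the last token on each line.

```python
def parse_hosts(text):
    """Extract domains from hosts-file or plain-list text, ignoring comments."""
    domains = set()
    for line in text.splitlines():
        # Drop anything after a '#' comment marker, then surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # Hosts format is "<ip> <domain>"; plain lists have the domain alone.
        # Assumption: only the last token on the line is treated as the domain.
        domains.add(line.split()[-1].lower())
    return domains


def dead_weight(all_domains, active_domains):
    """Return (dead_domains, percent_dead) for one list."""
    dead = all_domains - active_domains
    percent = 100.0 * len(dead) / len(all_domains) if all_domains else 0.0
    return dead, percent
```

The percentage could then be shown next to each list, and the set of dead entries could back a "listing of dead rules" view.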

I need to learn a bit more about how @dead-hosts works. Maybe we could partner up to not re-invent the wheel.

Is @dead-hosts scalable to the point where it could potentially monitor a majority or all of the lists that FilterLists indexes? (see here)

@funilrys

Hi @collinbarrett ,

In reality, it depends on the frequency of the tests.
I'm part of @Ultimate-Hosts-Blacklist (main repository: https://github.com/mitchellkrogza/Ultimate.Hosts.Blacklist), and from that experience I can say that, with a large amount of information to test, it really only depends on the frequency of the tests.

Let me try to detail some parts in order to help you understand.


About Dead-Hosts

The idea behind Dead-Hosts is to offer list or hosts file maintainers a place where they can get the results of PyFunceble without having to think about the logistics behind it.

We recently started generating a clean.list file, which contains the list of ACTIVE domains found by PyFunceble. I recently discovered that some maintainers use it to clean up their lists.

How I work

Getting in touch with maintainers

One of the ways is for me to contact a maintainer personally and ask whether they are interested in PyFunceble.
If they are, I offer to create an instance for them in @dead-hosts. If they agree, I create the repository structure and the whole process runs automatically.

Maintainers contact me

As an example, https://github.com/dead-hosts/WindowsSpyBlocker-spy_git_crazy-max came from a spontaneous request by the list maintainer. He asked me to create an instance so he could work with the results of PyFunceble. So I did 😸.

Issues or other talks about PyFunceble

I'm pretty sure you have already heard of https://github.com/StevenBlack/hosts.
All the lists that are part of its unified hosts file are part of dead-hosts. Indeed, it all started with a discussion we had about dead domains. Steven does not agree with cleaning the unified file, as that is the role of the curators, but he agrees that knowing some numbers may make some of the curators rethink their lists. So we did 😸 Now every new list that becomes part of the unified hosts file gets its own instance under @dead-hosts.

Automation

How do I create a new instance?

For now, a new repository is created manually. For building the repository structure, though, I use a Python script I wrote.

How are external users, maintainers, and teams handled?

One of the advantages of the GitHub organization system is that it allows us to create teams and permissions. I use those features to give the original maintainer (and their team members) write access to the repository.

I work as follows:

  • I create the repository and the repository structure.
  • I assign the repository maintainer as the repository Admin.
  • I create a team with the name of the maintainer.
  • I invite the repository maintainer to the team.
  • Once he joins, I make him a maintainer of the team so he can add his team members.
  • Finally (and even before the maintainer joins the team), I assign write permission on the created repository to the team.

How do we run the tests?

The tests run inside Travis CI containers.
We set up every repository automatically with the help of our internal script.

We simply call a script called update.py, which is responsible for the whole process before we even start PyFunceble.

update.py works as follows:

  • We check whether a test is currently in progress.
  • If so, we launch PyFunceble and it continues its work normally.
  • If no test is in progress, we check the number of days since the last test. Indeed, as some lists are not updated every day, we set the number of days between retests in the info.json of each repository.
  • If that number of days has passed, we clean the whole repository (the output directory) and test the list from the beginning.
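
The decision described in the steps above could be sketched roughly like this in Python. This is not the actual update.py. The info.json keys used here (`currently_under_test`, `last_test_finished`, `days_until_next_test`) are hypothetical names chosen for illustration, and the `"skip"` outcome for a list that is not yet due is an assumption.

```python
from datetime import datetime, timedelta, timezone


def next_action(info, now):
    """Decide what to do for one repository, given its parsed info.json.

    All key names are assumptions made for this sketch.
    """
    if info.get("currently_under_test"):
        # A test is already in progress: let PyFunceble carry on.
        return "continue"
    last = datetime.fromisoformat(info["last_test_finished"])
    if now - last >= timedelta(days=info["days_until_next_test"]):
        # Retest interval reached: wipe the output directory and start over.
        return "clean_and_retest"
    # Not due yet (assumed behaviour): do nothing this run.
    return "skip"
```

For example, with `now = datetime(2018, 9, 10, tzinfo=timezone.utc)`, a repository last tested on 2018-09-01 with a 7-day interval would be cleaned and retested, while one with a 30-day interval would be skipped.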

The push process is done by PyFunceble thanks to the Travis CI "mode", which allows us to work around the limitations of Travis CI.

Note: Once a month I check the whole process to see whether anything is going wrong.

What are the limitations of Travis CI?

Travis CI is great for what we are doing. Indeed, it's free for public repositories and it has many IPs, which allows us to run our tests correctly without being constantly blocked by WHOIS servers, for example.

The only limitation I can find with Travis CI is that it only allows 5 instances to run at the same time. But it's not that hard to live with that limitation.

Note: In order to avoid being blocked by WHOIS servers (when we are allowed to use WHOIS records for our tests), we stop after 10 minutes of testing and continue in a new container.
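
The "stop after 10 minutes and continue elsewhere" idea can be illustrated with a small time-budgeted loop. This is only a sketch of the general technique, not how PyFunceble implements it. The function `run_with_budget`, the `check` callable, and the injectable `clock` are all hypothetical names introduced for this example.

```python
import time


def run_with_budget(domains, start_index, check, budget_seconds=600,
                    clock=time.monotonic):
    """Test domains from start_index until done or the time budget is spent.

    Returns (results, next_index). When next_index == len(domains) the list
    is finished; otherwise a later run can resume from next_index.
    """
    started = clock()
    results = {}
    index = start_index
    while index < len(domains) and clock() - started < budget_seconds:
        # `check` stands in for whatever availability test is performed.
        results[domains[index]] = check(domains[index])
        index += 1
    return results, index
```

The returned `next_index` would need to be persisted somewhere (for instance, committed to the repository) so that the next container knows where to pick up.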


About Ultimate-Hosts-Blacklist and Ultimate.Hosts.Blacklist

You have to understand that @Ultimate-Hosts-Blacklist is the backend of Ultimate.Hosts.Blacklist.

@Ultimate-Hosts-Blacklist works and runs almost exactly like @dead-hosts. The only difference between the two is the way PyFunceble is configured.

What we do with Ultimate.Hosts.Blacklist is:

  • We run our update script every day at 14:09:13 UTC.
  • The update script:
    • Pulls every clean.list
    • Pulls the domains.list where the clean.list does not exist
    • Generates all hosts files, lists, deny files ...
    • Pushes everything to the repository
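
The "prefer clean.list, fall back to domains.list" step of the update script might look something like the following Python sketch. It is an illustration under assumptions, not the real script: `fetch(repo, filename)` is a hypothetical callable that returns the file's text or `None` when the file does not exist, and the `0.0.0.0` target used when rendering the hosts file is also an assumption.

```python
def collect_domains(repos, fetch):
    """Merge domains from every repository, preferring clean.list.

    `fetch(repo, filename)` is a hypothetical callable returning the file
    text, or None when the file does not exist.
    """
    merged = set()
    for repo in repos:
        text = fetch(repo, "clean.list")
        if text is None:
            # No clean.list yet: fall back to the untested domains.list.
            text = fetch(repo, "domains.list")
        if text is None:
            continue
        merged.update(line.strip() for line in text.splitlines() if line.strip())
    return merged


def render_hosts(domains, ip="0.0.0.0"):
    """Render a sorted hosts file, one '<ip> <domain>' entry per line."""
    return "\n".join(f"{ip} {domain}" for domain in sorted(domains)) + "\n"
```

Passing `fetch` in as a parameter keeps the merging logic independent of how the files are actually downloaded.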

About your question:

Is @dead-hosts scalable to the point where it could potentially monitor a majority or all of the lists that FilterLists indexes?

For 800 entries!? No, we had better create and set up a new organization just for your usage 😸

I hope that this may help you understand more. If you need something else or if you have any other questions, please let me know.

Have a nice day/night.

Cheers,
Nissar

@collinbarrett collinbarrett removed the web front-end user interface label Sep 29, 2018
@collinbarrett collinbarrett added the web front-end user interface label Sep 3, 2019
@collinbarrett collinbarrett added analytics service that provides various statistics about FilterLists and removed url-validation service that validates URLs web front-end user interface labels Sep 13, 2020
@collinbarrett collinbarrett added the url-validation service that validates URLs label Sep 21, 2020
@collinbarrett
Owner

closing into #371


3 participants