User community insight using an improved crawler #1750

Open
synctext opened this issue Nov 25, 2015 · 3 comments

Comments

@synctext
Member

Goal: a validated, documented and reliable crawler to understand user behavior. This enables the future step of measuring behavioral change.

We have an existing crawler for Dispersy communities and Tribler. The general Tribler crawler stopped being updated in 2013. See: http://Statistics.tribler.org
Its graphs are annotated with our releases and major news events. However, it is completely unmaintained and difficult to maintain.
image

This crawler needs to be moved to a Proxmox machine and improved. Better insight will help us understand the network health and shape the roadmap.

Expected results: real-time daily graphs of Tribler network size:
image

User upgrade behavior:
image

Examples taken from: http://crawler.doxu.org/uptimes.html

image

ToDo: NAT type as reported by Dispersy in our community, and its evolution over time.

According to these GitHub download stats, we have 302000 downloads of Tribler.
http://www.somsubhra.com/github-release-stats/?username=tribler&repository=tribler
However, our non-validated, many-years-old crawler only sees a few thousand users.

The thesis of Niels contains an extensive user community evaluation and "data science" portion.
http://www.tribler.org/SimilarityFunction/
Thesis.pdf: http://kayapo.tribler.org/trac/raw-attachment/wiki/SimilarityFunction/thesis.pdf

Current setup:
Kayapo web space: /var/www/statistics.tribler.org/htdocs/img/
Soft links to: /home/tribler/generate-periodic-statistics
kayapo:/home/tribler/generate-periodic-statistics# wc -l *.py
193 first_last.py
191 parse.py
169 reduce.py
553 total

Some crawlers died a few years ago:

kayapo:/home/tribler/generate-periodic-statistics# ls -lah /collected/logs/superpeer
total 3.4M
drwxr-xr-x 26 tribler tribler 4.0K Jan 29  2015 .
drwxr-xr-x  8 tribler tribler 4.0K Feb  7  2014 ..
drwxr-xr-x  2 tribler tribler  12K May 23  2012 dispersy-tracker-1
drwxr-xr-x  2 tribler tribler  16K May 23  2012 dispersy-tracker-2
drwxr-xr-x  2 tribler tribler  12K Feb 10  2012 dispersy-tracker-3
drwxr-xr-x  2 tribler tribler  12K Feb 10  2012 dispersy-tracker-4
drwxr-xr-x  2 tribler tribler  20K May 23  2012 dispersy-tracker-5
drwxr-xr-x  2 tribler tribler  20K May 23  2012 dispersy-tracker-6
drwxr-xr-x  2 tribler tribler 764K Nov 25 05:15 dispersy-tracker-6421-kayapo
drwxr-xr-x  2 tribler tribler 740K Feb  9  2015 dispersy-tracker-6422-kayapo
drwxr-xr-x  2 tribler tribler  80K Sep 24  2012 dispersy-tracker-6423-om.cs.vu.nl
drwxr-xr-x  2 tribler tribler  20K Nov 25 05:16 dispersy-tracker-6424-leaseweb
drwxr-xr-x  2 tribler tribler  84K Sep 24  2012 dispersy-tracker-6424-om.cs.vu.nl
drwxr-xr-x  2 tribler tribler 180K Nov 21 05:16 dispersy-tracker-6425-asmat
drwxr-xr-x  2 tribler tribler 172K Nov 22 05:16 dispersy-tracker-6426-asmat
drwxr-xr-x  2 tribler tribler 340K Aug  3 05:17 dispersy-tracker-6427-pygmee
drwxr-xr-x  2 tribler tribler 340K Aug  3 05:17 dispersy-tracker-6428-pygmee
drwxr-xr-x  2 tribler tribler  20K Nov 25 05:16 dispersy-tracker-6434-leaseweb
drwxr-xr-x  2 tribler tribler  72K Nov 25 01:34 superpeer1
drwxr-xr-x  2 tribler tribler  80K Aug 16  2010 superpeer2
drwxr-xr-x  2 tribler tribler  96K Sep 24  2012 superpeer3
drwxr-xr-x  2 tribler tribler  20K Sep 24  2012 superpeer4
drwxr-xr-x  2 tribler tribler  48K Feb  2  2010 superpeer5
drwxr-xr-x  2 tribler tribler  72K Nov 25 04:04 superpeer6
drwxr-xr-x  2 tribler tribler  92K Nov 25 04:35 superpeer7
drwxr-xr-x  2 tribler tribler  84K Nov 25 05:05 superpeer8
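
The listing above shows dead crawlers through the stale modification times of their log directories. A minimal sketch of automating that check, assuming each crawler writes into its own subdirectory and that a directory untouched for 30 days means the crawler has stopped (both the helper name `find_stale_dirs` and the threshold are hypothetical, not part of the existing scripts):

```python
import time
from pathlib import Path


def find_stale_dirs(root, max_age_days=30, now=None):
    """Return (name, age_in_days) for each subdirectory of `root`
    whose modification time is older than `max_age_days`."""
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_dir():
            continue
        # A directory's mtime changes when files are created or removed in it,
        # which is what the `ls -lah` dates above reflect.
        age_days = (now - entry.stat().st_mtime) / 86400
        if age_days > max_age_days:
            stale.append((entry.name, int(age_days)))
    return stale


# Hypothetical usage on the path from the listing:
# for name, age in find_stale_dirs("/collected/logs/superpeer"):
#     print(f"{name}: last modified {age} days ago")
```

Run periodically, a check like this would flag a crawler shortly after it stops writing logs, rather than years later.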
@synctext
Member Author

synctext commented Sep 10, 2016

Related to multichain crawling. We don't want to spy on our users for profit, but to identify faults, failures, and points for improvement. Respect privacy, do not expose any individual, and only provide insight into the global system behavior. #2532 #1429
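
One possible reading of "global system behavior only" is that the crawler publishes nothing but aggregate counts. A rough sketch under stated assumptions: the input is a stream of `(day, peer_id)` sightings (a hypothetical record format, not the actual crawler log format), and days with fewer than `min_count` distinct peers are suppressed so that small groups cannot be singled out. The function name and threshold are illustrative only.

```python
from collections import defaultdict


def daily_network_size(sightings, min_count=10):
    """Map each day to its number of distinct peers.

    `sightings` is an iterable of (day, peer_id) pairs (assumed format).
    Days with fewer than `min_count` distinct peers are omitted, and
    peer identifiers never appear in the output.
    """
    peers_per_day = defaultdict(set)
    for day, peer_id in sightings:
        peers_per_day[day].add(peer_id)
    return {
        day: len(peers)
        for day, peers in sorted(peers_per_day.items())
        if len(peers) >= min_count
    }
```

The resulting per-day counts are the kind of data a network-size graph needs, without keeping any per-user record in the published output.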

@qstokkink
Contributor

qstokkink commented Nov 10, 2017

http://statistics.tribler.org/ is back with IPv8, showing user communities. We just need longer-term statistics now.

@qstokkink
Contributor

A 2024 update: we now have multiple crawlers but they do not meet the original goal of OP. They are semi-validated, not documented, and not reliable. Frankly, we have too many crawlers: I have a hard time remembering what we even have running.

@qstokkink qstokkink removed this from the Backlog milestone Aug 23, 2024