External use of Internet.nl #1255
Hi, product owner of Digital Insights Platform here. We are happy to add (more) content about internet.nl within our product. As for now it is:
Hey @simonbesters, thanks for joining the conversation. Currently zero attribution is required, but we're thinking of some 'advised' attribution to make a clearer distinction between Internet.nl and tools using Internet.nl results. I wondered: do you currently 'scrape' the site, or do you use the JSON REST API via batch.internet.nl? BTW, Internet.nl is in the process of updating the API server to Docker (see #1253). There are still a few release blockers, but after that it should be fairly easy to set up a private Internet.nl API instance (and then you could also set up a brand in the scraper UA, see #1257).
We scrape, outside office hours.
@simonbesters: Please note the new rule number 7 of the application form (which is not yet deployed on the online application form).
Hi @bwbroersma, DIP programmer here. We really like the 'scraping' method, because it gives us a nice internet.nl page to show the user (like a municipality CISO). We poll the request status very slowly, and we visit the result page only once. We request about 300 domains per day, in random order (in our checks queue), so your server gets them spread across 6 to 12 hours. We scrape some key results from the result page (e.g. https://internet.nl/site/www.waterland.nl/2722343/) and save the permalink, and that's what we give the user if they want to see why their website didn't get a 100%. We really need that internet.nl result page for the CISO. I don't think the API makes one, or does it? And if it does, why use the API instead of the front-end?
See other users of the API: batch calls will, besides JSON, also give a result page, e.g. https://batch.internet.nl/site/www.rijksoverheid.nl/5899563/ in this case. The benefit of using the API is that it performs fair scheduling and makes better use of resources: you will be able to make one request for 300 domains, or 2000 domains, and don't have to guess how best to handle the scheduling. Furthermore, the batch resources are separate from the single-test (internet.nl) instance, so large batches will never slow down regular users of the site.
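To illustrate the "one request for many domains" point above, here is a minimal sketch of building such a batch request. The field names, label, and endpoint noted in the comments are assumptions modeled loosely on the Internet.nl batch API, not a verified client; check the official API documentation before relying on them.

```python
import json

# Assumed base URL for the batch API (illustrative, verify against the docs).
API_BASE = "https://batch.internet.nl/api/batch/v2"

def build_batch_request(domains, name="dip-daily"):
    """Build the JSON body for one hypothetical 'web' (site) batch request."""
    return {
        "type": "web",       # 'web' for site tests, 'mail' for mail tests (assumed)
        "name": name,        # free-form label for this batch (assumed field)
        "domains": domains,  # all 300 (or 2000) domains in a single request
    }

payload = build_batch_request(["www.waterland.nl", "www.rijksoverheid.nl"])
body = json.dumps(payload)  # would be POSTed to the batch request endpoint
```

The design point is that scheduling moves server-side: the client submits the whole list once instead of queueing 300 separate single-domain tests.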
What do you mean by scheduling? Our jobs run synchronously, so we wait for results (about 20 seconds on average in 2023 and 2024). Will the batch API requests be much slower, or more unpredictable? We will still do requests per single domain, not all 300 at the same time, because the queue doesn't know that. If possible we'll ask
From the TOS:
That sounds like a problem... Even if I completely changed the way the queue works, it would be 7 batches per week. And users run ad-hoc tests for a single site (site & mail), so there would also be batches of 1, OR those would still use the scraping method. I'm gonna sleep on it. I've requested API access, and I'll give the batch API a try soon.
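Trying the batch API as described above amounts to replacing the synchronous per-domain wait with slow polling of one request's status. A minimal sketch, where the status strings and the `fetch_status` callable are illustrative assumptions rather than the real API:

```python
import time

def wait_until_done(fetch_status, poll_interval=60, max_polls=120):
    """Call fetch_status() until it returns 'done'; True on success.

    fetch_status is a stand-in for an HTTP call that reads the batch
    request's status; the status values here are hypothetical.
    """
    for _ in range(max_polls):
        if fetch_status() == "done":
            return True
        time.sleep(poll_interval)  # poll slowly to be kind to the server
    return False

# Usage with a stub standing in for the real HTTP status call:
states = iter(["registering", "running", "done"])
done = wait_until_done(lambda: next(states), poll_interval=0)
```

One batch of 300 domains polled once a minute generates far fewer requests than 300 synchronous tests of ~20 seconds each.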
@bwbroersma Our batch API implementation is live, and it works beautifully ❤️ so this should take some load off the internet.nl instance. Thanks guys.