Identical URLs requested multiple times #38

Closed
fredcy opened this issue Aug 24, 2018 · 8 comments
Labels
duplicate This issue or pull request already exists

Comments


fredcy commented Aug 24, 2018

When I run muffet against a local site, I see in the logs that some pages are requested many times in a single run. This seems unnecessary and puts extra load on the server being tested.

Here is a simple example. Create "test.html" with this content:

<html><body>
<a href="/foo.html">foo</a>
<a href="/test2.html">test2</a>
</body></html>

and "test2.html" with this:

<html><body>
<a href="/foo.html">foo</a>
</body></html>

Then serve this content with python3 -m http.server.

And run muffet http://localhost:8000/test.html.

The python http.server output I get is this:

Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /test.html HTTP/1.1" 200 -
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /test2.html HTTP/1.1" 200 -
127.0.0.1 - - [24/Aug/2018 16:14:01] code 404, message File not found
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /foo.html HTTP/1.1" 404 -
127.0.0.1 - - [24/Aug/2018 16:14:01] code 404, message File not found
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /foo.html HTTP/1.1" 404 -

This shows that "/foo.html" was requested multiple times.
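To make the duplicate easy to spot, the server output above can be tallied per path. This is only a sketch: the regular expression and the `count_requests` helper are my own assumptions about the `http.server` log format shown above, not part of muffet.

```python
import re
from collections import Counter

# The request lines copied from the http.server output above.
LOG = """\
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /test.html HTTP/1.1" 200 -
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /test2.html HTTP/1.1" 200 -
127.0.0.1 - - [24/Aug/2018 16:14:01] code 404, message File not found
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /foo.html HTTP/1.1" 404 -
127.0.0.1 - - [24/Aug/2018 16:14:01] code 404, message File not found
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /foo.html HTTP/1.1" 404 -
"""

# Matches only the quoted request lines; the "code 404, message ..." lines are skipped.
REQUEST = re.compile(r'"GET (\S+) HTTP/[\d.]+" (\d{3})')


def count_requests(log_text):
    """Return a Counter mapping each requested path to its number of GETs."""
    return Counter(match.group(1) for match in REQUEST.finditer(log_text))


counts = count_requests(LOG)
duplicates = {path: n for path, n in counts.items() if n > 1}
print(duplicates)  # {'/foo.html': 2}
```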

Strangely, small changes to those html files cause different results. If I add a link to test.html, muffet requests foo.html only once in the run.

@raviqqe raviqqe added the duplicate This issue or pull request already exists label Aug 24, 2018
Owner

raviqqe commented Aug 24, 2018

Please see #27 and #34.

@raviqqe raviqqe closed this as completed Aug 24, 2018
Author

fredcy commented Aug 27, 2018

I assume you are referring to the message in #34 (comment) saying that the caching is "best effort". In my use case that best effort results in hitting the same URLs many, many times. As shown in my test case above, even the most trivial case of multiple HTML files containing the same link results in multiple hits. Oh well. I'll see what I can do on my fork then.

Owner

raviqqe commented Aug 28, 2018

Can you try this branch which visits each URL only once?
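For readers following along, here is one way "visit each URL only once" can be made to hold under concurrency. It is a Python sketch under my own assumptions, not the code in the branch: `OnceFetcher` and `fake_fetch` are hypothetical names, and the fetch is simulated rather than a real HTTP request. The idea is that the first worker to ask for a URL claims it under a lock, and later workers wait for that result instead of sending their own request.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class OnceFetcher:
    """Calls fetch(url) at most once per URL, even with concurrent callers."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._lock = threading.Lock()
        self._futures = {}  # url -> Future holding the fetch result

    def get(self, url):
        # Check and claim the URL atomically, so two workers cannot both miss.
        with self._lock:
            future = self._futures.get(url)
            owner = future is None
            if owner:
                future = Future()
                self._futures[url] = future
        if owner:
            # Only the claiming worker performs the fetch.
            try:
                future.set_result(self._fetch(url))
            except Exception as error:
                future.set_exception(error)
        # Everyone else blocks here until the owner has stored the result.
        return future.result()


calls = []
calls_lock = threading.Lock()


def fake_fetch(url):
    """Stand-in for an HTTP request; records each call and returns a status."""
    with calls_lock:
        calls.append(url)
    return 404 if url.endswith("/foo.html") else 200


fetcher = OnceFetcher(fake_fetch)
urls = ["http://localhost:8000/foo.html"] * 20 + ["http://localhost:8000/test2.html"] * 20
with ThreadPoolExecutor(max_workers=20) as pool:
    statuses = list(pool.map(fetcher.get, urls))

print(len(calls))  # 2
```

A plain "look in the cache, then fetch on a miss" check without the lock would let several workers miss on the same URL before any of them stores a result, which is consistent with the repeated misses reported in this thread.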

Author

fredcy commented Aug 28, 2018

I just tried that branch and it does seem to solve the test case that I posted above.

When I run it against my real site, however, I still see some multiple hits to the same URLs. Even on the master branch, though, that duplicate effort is perhaps not as bad as I thought. I instrumented a fork of the code (as of yesterday's master) to log each cache hit and miss. In a run against my site under test there were 11192 cache misses overall, covering 9870 distinct URLs. Of those distinct URLs, 408 had more than one cache miss, with the worst two having 34 and 32 misses. I was running with -c 20 concurrency. So muffet made 1322 more requests than it needed to get responses for 9870 URLs. That 13% excess is something I can live with.
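The arithmetic behind those figures, as a quick check. The variable names are mine; the two input numbers are the ones reported above.

```python
total_misses = 11192   # cache misses logged in the run
distinct_urls = 9870   # distinct URLs that had at least one miss

excess = total_misses - distinct_urls        # requests beyond one per URL
excess_vs_needed = excess / distinct_urls    # relative to the requests actually needed
excess_vs_total = excess / total_misses      # share of all requests that were redundant

print(excess)                      # 1322
print(f"{excess_vs_needed:.1%}")   # 13.4%
print(f"{excess_vs_total:.1%}")    # 11.8%
```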

Owner

raviqqe commented Aug 28, 2018

If you don't mind, can you tell me which server program and website you tested with, so that I can try it on my machine? The branch should remove all duplicate requests to the same URL, so I guess the branch has bugs.

Author

fredcy commented Aug 28, 2018

Sorry, the web server I'm testing against is not yet public. It will be soon though; cleaning up broken links after a migration is one of our last steps before going live.

Author

fredcy commented Aug 28, 2018

I should say, our content management team was delighted with the report that I provided based directly on muffet output. Thanks for making the tool available to us all.

Owner

raviqqe commented Aug 28, 2018

Thank you Fred. I'm glad to hear that.

I ran the new branch on my mid-sized website of around 40 pages and it didn't have any problems. I did see some duplicate log entries, as you mentioned, but they are expected.

For example, the URLs https://foo.com/bar, https://foo.com/bar/, https://foo.com/bar/index.html, and http://foo.com/bar are all different as far as muffet is concerned, because they are different URLs. A web server, however, might redirect every URL other than the first to the first one, depending on its implementation and configuration. As a result, we would see 4 log lines of accesses to https://foo.com/bar even though muffet, on the client side, requested 4 different URLs.

So maybe the duplicate accesses you saw were because of those reasons.
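A small sketch of that effect. The `canonicalize` function below is a hypothetical redirect rule invented for illustration (force https, drop a trailing `index.html`, drop a trailing slash); real servers differ, and this is not something muffet does.

```python
from urllib.parse import urlsplit, urlunsplit

URLS = [
    "https://foo.com/bar",
    "https://foo.com/bar/",
    "https://foo.com/bar/index.html",
    "http://foo.com/bar",
]


def canonicalize(url):
    """Hypothetical server-side redirect target for a URL (illustration only)."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]  # "/bar/index.html" -> "/bar/"
    if len(path) > 1:
        path = path.rstrip("/")            # "/bar/" -> "/bar"
    return urlunsplit(("https", parts.netloc, path, parts.query, ""))


distinct_for_client = set(URLS)
distinct_for_server = {canonicalize(url) for url in URLS}

print(len(distinct_for_client))  # 4
print(distinct_for_server)       # {'https://foo.com/bar'}
```

A client that deduplicates on the URL string sees four separate URLs here, while the server log shows four hits on one page.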
