The library should validate the document before processing it #34

Open
sneko opened this issue Jan 9, 2024 · 2 comments

Comments


sneko commented Jan 9, 2024

Hi @samclarke,

I have a script that watches the robots.txt of multiple websites, but in some cases a site has no robots.txt and instead serves fallback content. The issue is that your library returns isAllowed() -> true even when HTML is passed as the file contents.

  // robotsUrl and rootUrl point at the site being checked; robotsParser
  // comes from the robots-parser package.
  it('should not confirm it can be indexed', async () => {
    const body = `<html></html>`;

    const robots = robotsParser(robotsUrl, body);
    const canBeIndexed = robots.isAllowed(rootUrl);

    // Fails: isAllowed() returns true even though the body is HTML.
    expect(canBeIndexed).toBeFalsy();
  });

(This test fails, whereas it should pass; or, better, it should throw, since there are both isDisallowed() and isAllowed() methods.)

Did I miss a way to check the robots.txt format?

Does it make sense to throw an error instead of allowing/disallowing something based on nothing?

Thank you,

EDIT: a workaround could be to check whether there is any HTML inside the file... hoping the website does not return another format (JSON, raw text...). But it's a bit hacky, no?

EDIT2: one point of view: https://stackoverflow.com/a/31598530/3608410

@samclarke
Owner

Thanks for reporting!

It's a bit counter-intuitive, but I believe the behaviour of isAllowed() -> true for invalid robots.txt files is correct.

A robots.txt file is part of the Robots Exclusion Protocol. The default behaviour is to assume URLs are allowed unless specifically excluded.

Since an invalid robots.txt file doesn't exclude anything, and the default behaviour is to assume allow, everything should be allowed.

You're right that an invalid robots.txt file is a sign that something is misconfigured, but I don't think this library can assume that misconfigured means disallow. If the file is empty or returns a 404, nothing is excluded, so an invalid file shouldn't be treated differently.

The draft specification says invalid characters should be ignored, but says nothing about what to do if the whole file is invalid. However, Google's implementation does specify that, if given HTML, it will ignore the invalid lines, just as this library does.
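
For illustration, something like this shows the default-allow behaviour in practice (a minimal sketch; the example.com URLs are just placeholders):

    // Minimal sketch: an empty file and an HTML "file" both exclude nothing,
    // so the library reports the URL as allowed in both cases.
    const robotsParser = require('robots-parser');

    const fromEmpty = robotsParser('http://example.com/robots.txt', '');
    const fromHtml = robotsParser('http://example.com/robots.txt', '<html></html>');

    console.log(fromEmpty.isAllowed('http://example.com/')); // true
    console.log(fromHtml.isAllowed('http://example.com/'));  // true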

I have a script that watches the robots.txt of multiple websites

Are you using the library to validate the robots.txt files? If so, an isValid() and/or a getInvalidLines() method could be added. Every robots.txt parser I'm aware of will ignore invalid lines, but it could be useful for website owners to check that nothing is misconfigured.
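
If that's the use case, a getInvalidLines() could be something along these lines (a rough sketch written as a standalone helper, not existing library code; the directive list here is illustrative):

    // Hypothetical helper: report lines that don't parse as a known directive.
    const KNOWN_DIRECTIVES = new Set([
      'user-agent', 'allow', 'disallow', 'sitemap', 'crawl-delay', 'host',
    ]);

    function getInvalidLines(contents) {
      const invalid = [];
      contents.split(/\r\n|\r|\n/).forEach((line, index) => {
        // Drop comments and surrounding whitespace; blank lines are valid.
        const stripped = line.replace(/#.*$/, '').trim();
        if (stripped === '') return;

        const separator = stripped.indexOf(':');
        const directive = separator === -1
          ? ''
          : stripped.slice(0, separator).trim().toLowerCase();

        if (!KNOWN_DIRECTIVES.has(directive)) {
          invalid.push({ lineNumber: index + 1, line });
        }
      });
      return invalid;
    }

    // An HTML fallback page produces nothing but invalid lines:
    // getInvalidLines('<html></html>').length === 1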

@jasonbarry

Chiming in that I would find an isValid() method useful. I'm caching robots.txt values in a KV store, but sometimes servers incorrectly return HTML at /robots.txt (likely an SPA fallthrough to index.html as a 200 rather than 4xx). If I could easily detect whether the response was a true robots.txt format, it would save me $ by preventing an unnecessary KV write op.

For now I'm checking whether the first character is <; if it is, the response is likely not an actual robots.txt file. But that feels hacky.
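
Roughly what that check looks like (just a heuristic sketch, not real validation of the format):

    // Heuristic: a response whose first non-whitespace character is '<' is
    // probably an HTML fallback page rather than a robots.txt file.
    function looksLikeHtml(body) {
      return body.trimStart().startsWith('<');
    }

    // looksLikeHtml('<html></html>')              -> true
    // looksLikeHtml('User-agent: *\nDisallow: /') -> false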
