This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Storage location of XML Sitemap should be freely determinable #8728

Open
madmaharaja opened this issue Jun 13, 2017 · 22 comments

@madmaharaja

madmaharaja commented Jun 13, 2017

Currently all XML sitemaps in Contao are automatically stored in the /share folder. For multi-site installations that use subfolders rather than different (sub)domains, this becomes a problem when you want to submit a sitemap to Google Webmaster Tools.

If you have the following structure:
www.example.com (main page)
www.example.com/location1/ (local site with its own root page)
www.example.com/location2/
etc.

All sitemaps will be stored in /share:
www.example.com/share/main_sitemap.xml
www.example.com/share/location1_sitemap.xml
etc.

The problem
If you have an individual Google property for each site and want to submit a sitemap for each one to Google Webmaster Tools, the submission form looks as shown in this screenshot:
screenshot-webmaster-tools
Currently the only workaround I could think of is to set a redirect in .htaccess:
RedirectPermanent /location1/location1_sitemap.xml /share/location1_sitemap.xml
However, I don't find this very convenient to do for every new location we add online.
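
As a stopgap for the repetition described here, the redirect lines could be generated rather than typed by hand. The following is only a minimal sketch: the function name `redirect_lines` and the list of location aliases are hypothetical, and it assumes every sitemap follows the `<alias>_sitemap.xml` naming from the example. Apache's `Redirect` directives expect the source to be a URL path starting with a slash, so the sketch emits it that way.

```python
def redirect_lines(locations, share_dir="/share"):
    """Build one RedirectPermanent line per location alias (hypothetical helper)."""
    lines = []
    for alias in locations:
        # Source path as it would be submitted for the per-location property.
        source = f"/{alias}/{alias}_sitemap.xml"
        # Target path where Contao actually stores the file.
        target = f"{share_dir}/{alias}_sitemap.xml"
        lines.append(f"RedirectPermanent {source} {target}")
    return lines


if __name__ == "__main__":
    print("\n".join(redirect_lines(["location1", "location2"])))
```

The output can be pasted into .htaccess; it does not remove the need to update the file for each new location.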

My suggestion
Give the option to select an alternative folder where the sitemap can be stored (e.g. see illustration)
screenshot-xml-speicherort

Or another alternative:
Override storing a sitemap in the standard /share folder by typing in folder + name of sitemap into the respective field:
sitemap-loesungsansatz2

@fritzmg
Contributor

fritzmg commented Jun 14, 2017

> Currently the only workaround I could think of is to set a redirect in .htaccess:
> RedirectPermanent /location1/location1_sitemap.xml /share/location1_sitemap.xml
> However, I don't find this very convenient to do for every new location we add online.

I think that's sufficient. In a multidomain installation you might want to add redirects for sitemap.xml anyway, if you want search engines to be able to find the correct sitemap.xml for each domain automatically, e.g.

http://www.example1.org/sitemap.xml
http://www.example2.org/sitemap.xml
http://www.example3.org/sitemap.xml

etc.

Also, why do you use these virtual subfolders instead of (sub)domains? How does that even work within Contao? I don't think that's a supported use case in general?

@madmaharaja
Author

Yeah, that's how I do it (redirects), it's just not very user-friendly and it "blows up" the .htaccess over time.

I use virtual subfolders instead of subdomains for SEO reasons. Our main domain has been around for quite a while and has earned significant trust, backlinks and "ranking power" for certain keywords. Google used to treat subdomains pretty much as individual domains (so the subdomains hardly inherit any domain authority that the main domain has earned), while URLs with subfolders under the same domain benefitted much more from the authority of the main domain. We're ranked well for keywords x, y and z -- this way the new sites pretty quickly rank well for the keywords "location1 + x, y, z". Nowadays Google says it doesn't really make a difference anymore -- however, this is the system we started out with so that's why we're still operating that way. :-)

In Contao I simply set "example.com" as domain in every root page -- the "subfolder" is determined by the alias of the root page.

@frontendschlampe

Check https://github.com/hofff/contao-robots-txt-editor in combination with https://github.com/hofff/contao-htaccess

There are some more problems with the sitemap:

  1. It's great to have an entry in robots.txt with the direct link to the sitemap
  2. You need a robots.txt for each domain
  3. You need redirects in .htaccess to access the various robots.txt files

We solve these 3 steps with the 2 extensions.

@leofeyer
Member

@madmaharaja Did you check the two extensions above?

@KaiserCh

Has anybody considered that placing the sitemap.xml in a subdirectory of the webroot it is used for violates the standard? https://www.sitemaps.org/protocol.html#location

A solution like the one used in Contao always requires either a symlink or a redirect. Maybe the cross-submit rule also applies to subfolders, so using a modified robots.txt would work, too. But anyway, relying on either extensions or on administrators actively working around things doesn't feel right.

@fritzmg
Contributor

fritzmg commented Sep 27, 2017

So actually, /sitemap.xml should be a route that returns the appropriate sitemap depending on the domain.
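
To illustrate the idea of a single /sitemap.xml route that picks the sitemap by domain, here is a rough sketch in Python rather than Contao's PHP. It is not how Contao implements anything; the mapping `SITEMAPS_BY_HOST`, the file names in it, and the function `resolve_sitemap` are all made up for illustration.

```python
# Hypothetical mapping from request host to the sitemap that belongs to it.
SITEMAPS_BY_HOST = {
    "www.example1.org": "example1_sitemap.xml",
    "www.example2.org": "example2_sitemap.xml",
}


def resolve_sitemap(host, sitemaps=SITEMAPS_BY_HOST):
    """Return the sitemap name for a request host, or None if the host is unknown."""
    # Host headers are case-insensitive and may carry a port ("example.org:8080").
    # IPv6 literals are not handled in this sketch.
    hostname = host.lower().split(":", 1)[0]
    return sitemaps.get(hostname)
```

A route handler would call `resolve_sitemap` with the request's Host header and answer with a 404 when it returns `None`.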

@Toflar
Member

Toflar commented Sep 27, 2017

> So actually, /sitemap.xml should be a route that returns the appropriate sitemap depending on the domain.

Yes. That's something we should have by default. Makes no sense to enable a sitemap by checkbox etc. We just have to make sure the correct one is output. That would be a superb feature ;)

@fritzmg
Contributor

fritzmg commented Sep 27, 2017

Indeed :). Also - couldn't the (appropriate) sitemap simply be generated on the fly within that route instead of going through the trouble of generating the XML files in the cron whenever there was a change? On large sites this can cause memory overflow problems and (as discussed in contao/check#134) its generation blocks the response in Contao 4 (if you do not use php-fpm).

@Toflar
Member

Toflar commented Sep 27, 2017

Yeah, it can be generated on demand but obviously not every time it is requested. So I'd still cache it somewhere in /cache/contao/sitemaps or so (with sitemap_<root_page_id>.xml maybe?) and just delete it when needed (pages updated etc. = same routine as we already have). Would you work on something like this?
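
The "generate on demand, cache, delete when pages change" idea can be sketched roughly as follows. This is an assumption-laden illustration in Python: it keeps entries in memory instead of files under /cache/contao/sitemaps, and the class `SitemapCache` and its `build` callback are hypothetical names, not Contao APIs.

```python
class SitemapCache:
    """Build a sitemap on first request and reuse it until invalidated."""

    def __init__(self, build):
        # build: callable taking a root page ID and returning the XML string.
        self._build = build
        self._entries = {}

    def get(self, root_page_id):
        # Generate only on a cache miss, so repeated requests stay cheap.
        if root_page_id not in self._entries:
            self._entries[root_page_id] = self._build(root_page_id)
        return self._entries[root_page_id]

    def invalidate(self, root_page_id=None):
        # Called when pages change; with no ID, drop every cached sitemap.
        if root_page_id is None:
            self._entries.clear()
        else:
            self._entries.pop(root_page_id, None)
```

The first `get` after an `invalidate` pays the generation cost again; everything in between is served from the cache.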

@fritzmg
Contributor

fritzmg commented Sep 27, 2017

I would like to - unfortunately we are overbooked currently ...

@Toflar
Member

Toflar commented Sep 27, 2017

@leofeyer can you move that to contao/core-bundle please? Because it's not going to change for Contao 3.5 anyway but would be a super nice addition to any future Contao 4 version.

@leofeyer
Member

There is no need to move the ticket. Do you want me to assign it to you?

@Toflar
Member

Toflar commented Sep 27, 2017

Talked to @frontendschlampe about it, maybe they'll be working on a PR :)

@frontendschlampe

I've talked to @Toflar via Mumble, because we're currently updating our hofff/contao-robots-txt-editor and hofff/contao-htaccess. If you want, we will make a PR for this:

  • create a sitemap.xml for every website root (there's no need to have an extra option for creating a sitemap)
  • create a robots.txt file
  • add all absolute sitemap urls to robots.txt
  • make separate routings (like @Toflar suggested)

For a website with various languages under the same domain, there will be a sitemap for every language (maybe we add the language to the sitemap name) and one robots.txt with the absolute URL of every sitemap. I hope I described it correctly. :-)
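
A robots.txt that lists every sitemap could be assembled along these lines. This is only a sketch in Python: the function name `robots_txt`, the default allow-all rules, and the example URLs are assumptions. The `Sitemap:` line itself comes from the sitemaps.org protocol, which expects a full URL.

```python
def robots_txt(sitemap_urls, base_rules=("User-agent: *", "Disallow:")):
    """Return robots.txt content with one Sitemap line per absolute URL."""
    lines = list(base_rules)
    lines.append("")  # blank line between the rules and the sitemap list
    lines.extend(f"Sitemap: {url}" for url in sitemap_urls)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(robots_txt([
        "https://www.example.com/sitemap_en.xml",
        "https://www.example.com/sitemap_de.xml",
    ]))
```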

/cc @cliffparnitzky

@leofeyer
Member

Very good, except the "create a robots.txt file" part. We have discussed this several times and decided not to mess with user generated files.

@KaiserCh

Should the URL limit per sitemap be considered? A sitemap may not contain more than 50,000 URLs. Are use cases like a huge news portal, shop (e.g. Isotope), music catalogue, ... with more than 50k "objects" relevant?

@Toflar
Member

Toflar commented Sep 27, 2017

If it is a route, we do not mess with it at all :) It's sort of a fallback. If you upload a robots.txt, Apache (or whatever server) will take this and otherwise rewrite to app.php and thus Contao 😄 It's a wonderful concept because you get a sane default without doing anything at all and if you want to, you can :)
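
The fallback behaviour described here (a real file wins, otherwise the application answers) is normally handled by the web server's rewrite rules. Purely to make the logic concrete, here is a small Python sketch; `serve_with_fallback` and its parameters are hypothetical and stand in for what the server and the route would do together.

```python
from pathlib import Path


def serve_with_fallback(webroot, name, generate):
    """Return the uploaded file's content if it exists, else the generated default."""
    candidate = Path(webroot) / name
    if candidate.is_file():
        # A user-provided file always takes precedence and is never modified.
        return candidate.read_text(encoding="utf-8")
    # No file on disk: fall back to the dynamically generated response.
    return generate()
```

The generated default is never written to disk, so user-created files are left untouched.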

@frontendschlampe

> Should the URL limit per sitemap be considered? A sitemap may not contain more than 50,000 URLs. Are use cases like a huge news portal, shop (e.g. Isotope), music catalogue, ... with more than 50k "objects" relevant?

Yes ... we will. Should we use the 50,000 URL limit or fewer? Maybe 20,000?

@Toflar
Member

Toflar commented Sep 27, 2017

Google recommends splitting them up (did not check how exactly) and I'm sure there's some recommendation on the threshold somewhere :)
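
Splitting could work roughly like this: cut the URL list into chunks of at most 50,000 (the limit from the sitemaps.org protocol) and publish a sitemap index that points at the chunk files. The sketch below is in Python with hypothetical helper names (`chunk_urls`, `sitemap_index`); the chunk size is a parameter, so a lower threshold such as 20,000 is a one-argument change.

```python
from xml.sax.saxutils import escape

MAX_URLS = 50000  # per-sitemap limit from the sitemaps.org protocol


def chunk_urls(urls, size=MAX_URLS):
    """Split a list of URLs into consecutive chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def sitemap_index(sitemap_urls):
    """Render a sitemap index document that references each chunk's sitemap URL."""
    entries = "".join(
        # escape() turns "&", "<" and ">" into entities, as XML requires.
        f"  <sitemap><loc>{escape(url)}</loc></sitemap>\n"
        for url in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}"
        "</sitemapindex>\n"
    )
```

Each chunk would be rendered as its own sitemap file, and only the index URL needs to be submitted or listed in robots.txt.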

@frontendschlampe

I will check!

@ghost

ghost commented Sep 27, 2017 via email

@aschempp
Member

> So I'd still cache it somewhere in /cache/contao/sitemaps or so (with sitemap_<root_page_id>.xml maybe?) and just delete it when needed (pages updated etc. = same routine as we already have).

Please use the existing cache! The response simply needs appropriate cache headers, and everything's taken care of 😉 . No need to store the files anywhere. I've used debril/rss-atom-bundle to create something like this, though I'm not sure they support sitemap XMLs. But the principles are exactly the same.
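
Relying on HTTP caching instead of stored files means the route only has to return the XML with suitable headers. As a framework-neutral sketch in Python (not Symfony or Contao code), with `sitemap_response` and the one-hour default lifetime as assumptions:

```python
import hashlib


def sitemap_response(xml, max_age=3600):
    """Return (status, headers, body) for a sitemap that shared caches may store."""
    body = xml.encode("utf-8")
    headers = {
        "Content-Type": "application/xml; charset=UTF-8",
        # s-maxage applies to shared caches such as a reverse proxy.
        "Cache-Control": f"public, s-maxage={max_age}",
        # A content hash as ETag lets caches revalidate cheaply.
        "ETag": '"' + hashlib.sha256(body).hexdigest() + '"',
    }
    return 200, headers, body
```

With headers like these, a reverse proxy in front of the application can serve repeated requests without regenerating the sitemap until the entry expires or is purged.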


7 participants