Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow a limit to be set on expiry of line segments #2185

Closed

Conversation

tomhughes
Copy link
Contributor

Accidentally or deliberately moving a node a long way creates a long line segment which can expire large numbers of tiles causing disruption to the rendering process as osm2pgsql either runs out of memory or generates a huge list of tiles to expire.

Limiting expiry of long line segments prevents that and also helps reduce disruption of the resulting rendered maps with long lines that are almost certainly incorrect.

Accidentally or deliberately moving a node a long way creates a
long line segment which can expire large numbers of tiles causing
disruption to the rendering process as osm2pgsql either runs out
of memory or generates a huge list of tiles to expire.

Limiting expiry of long line segments prevents that and also helps
reduce disruption of the resulting rendered maps with long lines
that are almost certainly incorrect.
@joto
Copy link
Collaborator

joto commented May 23, 2024

The PR is clean and technically looks okay to me. But I am concerned with a few things:

  1. We are moving away from configuration by command line options. I don't think it is a good idea to create more of them. In this case we would need it as long as we are using the pgsql output.
  2. The change seems to be rather specific to a very specific problem. There are many other cases possible which would ceate the same kind of damage and which are not caught by this code. Are we going to solve this by adding ever more tuning options for, say, the overall length of a line (as opposed to the segment length which we have here), or similar settings for polygons and so on?
  3. The value this option is set to looks totally "magic" to me. There is no clear way how to decide what this number should be. If we can't explain to a user why they should set an option to what value, that's problematic from a usability standpoint.
  4. If we had had this change in place for the recent vandalism it might have prevented some tiles showing the vandalism from rendering, but some tiles would have been re-rendered for other reasons so the vandalism would still show up. But than this very change prevents the tile from being re-rendered after a fix is applied. (Because expire is not only calculated based on the new geometry of an object but also the old geometry.)

I think we should clearly think about the two issues raised here, one is about protecting "the map", the other is about protecting "the system". osm2pgsql (potentially) running out of memory is bad and needs addressing. And if an excessive number of tiles in the expire list creates some situation where other parts of the system are likely going to fail, than that needs addressing, too. I'd rather see those issues solved in a different way though that's more directly tied to the problem, i.e. the excessive number of tiles, not to just one case which this PR addresses. And I am not sure whether osm2pgsql is the right place to solve system stability issues (or if there is even an issue), but if that is the case we should talk about that. To solve this problem we probably need some limits in the number of tiles that can show up in the expire list or so. But the goal of that would not be to protect "the map", but to make the system more robust. Protecting "the map" from changes in the data that we don't want to see rendered is a different issue we should not conflate with the first.

@lonvia
Copy link
Collaborator

lonvia commented May 29, 2024

Generally speaking, I'd be okay with accepting this as an operational quick-fix. The code is minimally invasive and in a part of osm2pgsql that's on the way out anyway. So we don't have to maintain that stuff forever.

My main concern here is with joto's point 4. The change not only prevents expiry when the vandalism occurs but also when the vandalism is fixed. This may not be a much of a problem on the server with average load. However, especially on the osm.org servers with their very high load, this may lead to a lot of stale tiles where the vandalism remains visible because another change caused the tiles to be rerendered while the vandalism was active. Opinions?

I wonder if it wouldn't be better to attack the problem a bit earlier and not allow to lines with very long segments to be written to the output tables. Sadly, this is only possible with the flex output right now, so we'd need the osm-carto port for flex first. But once done it can be flexibly handled in style, so that a sensible maximum length can be set depending on the type of object (allow more for administrative boundaries, less for highways).

@lonvia
Copy link
Collaborator

lonvia commented May 29, 2024

Administrative note: if accepted, this change would need backporting. We are not in a state right now to create a release from master.

@simonpoole
Copy link
Contributor

simonpoole commented Jun 16, 2024

My main concern here is with joto's point 4. The change not only prevents expiry when the vandalism occurs but also when the vandalism is fixed.

Maybe (and I would note that I'm suggesting this strctly as a quick hack) simply abort processing (aka exit osm2pgsql) when a new linestring would be generated that exceeds a certain length.

As long as the input to osm2pgsql at some point covers both vandalism and revert, osm2pgsql would then continue as normal.

@joto
Copy link
Collaborator

joto commented Jun 16, 2024

@simonpoole Something like this would be easily possible when the flex output is used. It allows you to get the linestring length in Lua and then you can do any check you want and simply call os.exit(1) and osm2pgsql is killed.

Of course this would mean switching OSM Carto to flex, but fortunately we are getting near that. There is an issue and PR.

@lonvia
Copy link
Collaborator

lonvia commented Sep 10, 2024

A new release of osm2pgsql is imminent, so we need to decide what happens to the this PR. As said above, I would be okay to merge if this is essential to the operations of the OSMF servers. However, I do share some of @joto's concerns that this isn't really addressing the root cause and thus only working around one very specific form of vandalism, which may never occur again in this form.

@tomhughes what is the status here? I understand that we have switched the servers to expire on metatile level. This should address the memory issue (#2190) and the issue of being stuck for hours writing out the expired tiles. Would that be sufficient for now?

The other possible advantage I can see is that the render servers wouldn't end up with basically all tiles marked as outdated. Was that a concern? Was the additional load on rerendering causing issues?

We do know that the chance will not prevent vandalised tiles from appearing. It just will be on less tiles. And after the change, it will be necessary to manually expire all affected tiles. Probably not much of a change in the case of global vandalism. However I can see where this might become a major pain when we run into the case where a user accidentally draws a single line over half the globe and then globally all tiles rendered in a certain timespan need to be invalidated because finding just the ones that touch the single line is difficult.

Any other pros and cons?

@tomhughes
Copy link
Contributor Author

We are now expiring at the metatile level, yes.

The idea of the length limit was really twofold, to reduce the amount of time we spent re-rendering tiles and to reduce the visibility of the damage. I was kind of thinking that we wouldn't then need to do a global dirty after the revert but I guess that we probably would as you say.

I don't think we've really had an incident that allows to test the effectiveness of either change though.

Maybe we should drop this for now and see how things go with the other change - at least then the revert should correctly dirty the tiles that are changed avoiding the need for a global dirty.

@lonvia
Copy link
Collaborator

lonvia commented Sep 10, 2024

Okay, sounds good to me. I'll close this for now and we can revisit if necessary.

@lonvia lonvia closed this Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants