Strip more kinds of whitespace from news titles and summaries #128

Open
Mr0grog opened this issue Sep 18, 2020 · 0 comments
Labels: bug (Something isn't working), news (Related to scraping news rather than data)

Comments

Mr0grog (Collaborator) commented Sep 18, 2020

In a recent news update, I noticed that Contra Costa County had an entry whose title started with a zero-width space (\u200b) (sfbrigade/stop-covid19-sfbayarea#392 (comment)).

We already strip whitespace around titles and summaries in most news feeds, but it turns out string.strip() doesn’t handle some of the more unusual whitespace-like characters, such as \u200b. We should probably have a better text-stripping function that also handles these characters (e.g. \u2000 - \u200b).
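
For illustration, here is a minimal sketch of what such a helper might look like, assuming the scrapers are Python and that titles and summaries are plain `str` values. The name `strip_whitespace` and the exact set of extra characters are hypothetical, not something that exists in the project. One detail worth noting: Python's `str.strip()` already removes U+2000 through U+200A (they count as Unicode whitespace), but not U+200B, which is a format character rather than a space.

```python
# Zero-width and invisible characters that str.strip() leaves in place.
# This particular set is an assumption; adjust it to what the feeds contain.
EXTRA_STRIP_CHARS = frozenset(
    '\u200b'  # zero-width space
    '\u200c'  # zero-width non-joiner
    '\u200d'  # zero-width joiner
    '\u2060'  # word joiner
    '\ufeff'  # zero-width no-break space (byte order mark)
)


def _is_strippable(char: str) -> bool:
    """True for Unicode whitespace or one of the extra invisible characters."""
    return char.isspace() or char in EXTRA_STRIP_CHARS


def strip_whitespace(text: str) -> str:
    """Strip whitespace and zero-width characters from both ends of text.

    Characters in the middle of the text are left untouched.
    """
    start, end = 0, len(text)
    while start < end and _is_strippable(text[start]):
        start += 1
    while end > start and _is_strippable(text[end - 1]):
        end -= 1
    return text[start:end]
```

Feed parsers could then call `strip_whitespace(title)` wherever they currently call `title.strip()`. Only leading and trailing characters are removed, so a zero-width joiner inside an emoji sequence or a word is preserved.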

Mr0grog added the bug and news labels Sep 18, 2020