Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the finder return only published documents #158

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

octopusinvitro
Copy link
Contributor

@octopusinvitro octopusinvitro commented Apr 11, 2017

In order to show links to unlinked pages in the scraping view, we need to make a difference between published and unpublished posts. This is because posts/events that have a the published field off won't be linked in the site. In the case of info pages and summaries, they are always published, but they have to be linked manually in the site, so those who were not linked won't be scraped either.

At the moment, find_all doesn't make a distinction between documents with the published field on or off, which is a bug. To fix it, the class was modified so that find_all selects only the documents with a published set to true, and a new method was added for the unpublished documents, find_unpublished.

Finally, the find_all method was renamed to find_published.

After that we can send all the unpublished and unlinked documents to the scraper view, since those are the documents that will not be linked anywhere in the site.

Related issues

Closes #151

@octopusinvitro octopusinvitro force-pushed the scrape-unlinked-pages branch from 5cdb86c to 0eaba71 Compare April 11, 2017 15:38
@octopusinvitro octopusinvitro requested a review from mhl April 11, 2017 17:33
@octopusinvitro octopusinvitro force-pushed the scrape-unlinked-pages branch 3 times, most recently from 71e7098 to b452406 Compare April 18, 2017 15:17
Copy link
Contributor

@mhl mhl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some comments on the detail of this pull request, but I was quite tempted, along the guideline that one should do code review "outside-in", to stop at the pull request description, because it's not completely clear to me.

There are two distinct concepts here:

  1. Whether a document is "published" in the sense that it has a published: true attribute in the frontmatter, or being published is the default for that collection of documents.
  2. Whether a page is linked to from somewhere on the site, so will be scraped by default.

When I mentioned the issue of unlinked-to pages not being scraped, I was specifically thinking about info pages, and not worrying about whether they're considered published or not. (i.e. I was thinking of point 2, not point 1).

On the other hand, you have clearly been thinking mostly about the "published" status of documents, point 1, and might have been conflating 1 and 2 somewhat, from the way you added to the scraper start page unpublished posts described as "Unlinked"? I think you've also spotted and fixed a bug to do with the published status of posts, in that /blog would previously have included unpublished posts, which is clearly a bug!

The question of whether unpublished posts should be linked to on the scraper start page is one that I hadn't thought about, but I guess you could argue it either way ("it might be convenient to see unpublished posts on the live site if you know the URL" vs "someone could guess the URL of an unpublished blog post") The reason I particularly wanted the unlinked-to info pages to be scraped was to help with us testing the site. The same reasoning could apply to posts and events. And draft blog posts aren't really private anyway, since the GitHub repository is publicly browseable. Perhaps the best solution is to scrape the unpublished pages too, but make sure we let our partners know that drafts are possible for people to discover, so they should never put any sensitive information in them.

@@ -7,11 +7,32 @@
let(:finder) { Document::Finder.new(pattern: "#{filename}.md", baseurl: '/path/') }

it 'finds a single document' do
Dir.stub :glob, [filename] do
Dir.stub :glob, [new_tempfile('')] do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't have to change, does it? If you want to change to using new_tempfile for all of these consistently, I'd make that a separate preparatory commit, and also change the remaining use of filename.

Copy link
Contributor Author

@octopusinvitro octopusinvitro Apr 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is because now things have changed: when you call find_single you are also calling find_all, so with the introduction of select(&:published?), the method find_all now has to look at the contents of the file, which before it didn't have to, so this test fails. But I wouldn't create a temp file for the other tests if it is not needed... 🤔 What do you think?

@@ -21,7 +21,7 @@ def find_or_empty
end

def find_all
filenames.map { |filename| create_document(filename) }
filenames.map { |filename| create_document(filename) }.select(&:published?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit confusing to me that this is called find_all while it's not returning all the documents, just those for which published? is true. Perhaps you could call this find_published instead? (I realise that we don't have published fields for some of the collections, but I think it's less confusing to assume that every document of those is published to having something called find_all that isn't returning all documents.)

create_documents.select(&:published?)
end

def find_previews
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling this find_previews adds a surprising interpretation to what it returns which is surprising, I think. They're not previews in any sense yet: it's still the full document. I'd call this find_unpublished.

@@ -33,6 +33,18 @@
end
end

it 'can find unpublished documents' do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of robustness of the tests, in this case even if you reversed the logic of find_previews, so that it only returned published documents, the tests would still pass. A simple improvement so that can't so easily happen is to stub it with [new_tempfile(published), new_tempfile(unpublished), new_tempfile(unpublished)] so that there are 2 unpublished and 1 published documents, and assert that find_previews must return 2.

(This is a general point that I try to think about when writing tests: "How could this test still pass if I were making plausible but incorrect changes to the code under test, and how likely is someone to make such a change?" In this case, getting the logic of a predicate the wrong way round is a very easy and common mistake to make, so I'd try to make sure the test would still fail in that case.)

get '/scraper-start-page.html'
refute_empty(subject.xpath('//a[contains(@href, "/blog/")]'))
refute_empty(subject.xpath('//a[contains(@href, "/events/")]'))
refute_empty(subject.xpath('//a[contains(@href, "/info/") and not(contains(@href, "events"))][1]'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this tests what the test says it tests. e.g. any info page other than /info/events existing will allow the test to pass, not just unlinked-to ones.

@octopusinvitro octopusinvitro force-pushed the scrape-unlinked-pages branch 3 times, most recently from b044632 to 8575c96 Compare April 20, 2017 12:06
@octopusinvitro octopusinvitro force-pushed the scrape-unlinked-pages branch from 8575c96 to 858d571 Compare April 27, 2017 16:36
This method allows the scraper route to send the unpublished
documents to the view so that they can be scraped and pushed to
the static repo.

Documents that are unpublished are not linked anywhere on the
site. With this change, they can be loaded as a preview without
being explicitly linked from the site.

This means that all documents, linked or unlinked, are published.
There is a small nuance between "published" documents and "unlinked"
documents that may not be obvious to a newcomer to the codebase. So
an explanation was filled under the tribal knowledge section of the
contributing guide.
@octopusinvitro octopusinvitro force-pushed the scrape-unlinked-pages branch from 858d571 to e7e0ad0 Compare July 12, 2017 22:26
@coveralls
Copy link

Coverage Status

Coverage increased (+0.003%) to 99.758% when pulling e7e0ad0 on scrape-unlinked-pages into ce9d5bd on master.

@theyworkforyou theyworkforyou deleted a comment from coveralls Jul 12, 2017
@octopusinvitro
Copy link
Contributor Author

@mhl I rebased this branch as it was ready to be reviewed after addressing feedback 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add info and event pages to the scraper start page
3 participants