Make the finder return only published documents #158
base: master
Conversation
(force-pushed from 5cdb86c to 0eaba71)
(force-pushed from 71e7098 to b452406)
I've left some comments on the details of this pull request, but I was quite tempted, following the guideline that one should do code review "outside-in", to stop at the pull request description, because it's not completely clear to me.
There are two distinct concepts here:
- Whether a document is "published", in the sense that it has a `published: true` attribute in the frontmatter, or being published is the default for that collection of documents.
- Whether a page is linked to from somewhere on the site, so will be scraped by default.
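Concept 1 can be sketched as a small frontmatter predicate (a hypothetical helper for illustration, not the project's actual `Document` class):

```ruby
require 'yaml'

# Hypothetical helper, not the project's real Document class: a document
# counts as published when its frontmatter says `published: true`, or when
# the key is absent and the collection defaults to published.
def published?(frontmatter, default: true)
  data = YAML.safe_load(frontmatter) || {}
  data.fetch('published', default)
end

puts published?("published: false")    # false
puts published?("title: An info page") # true, falls back to the default
```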
When I mentioned the issue of unlinked-to pages not being scraped, I was specifically thinking about info pages, and not worrying about whether they're considered published or not. (i.e. I was thinking of point 2, not point 1).
On the other hand, you have clearly been thinking mostly about the "published" status of documents, point 1, and might have been conflating 1 and 2 somewhat, judging from the way you added unpublished posts, described as "Unlinked", to the scraper start page? I think you've also spotted and fixed a bug to do with the published status of posts, in that `/blog` would previously have included unpublished posts, which is clearly a bug!
The question of whether unpublished posts should be linked to on the scraper start page is one that I hadn't thought about, but I guess you could argue it either way ("it might be convenient to see unpublished posts on the live site if you know the URL" vs "someone could guess the URL of an unpublished blog post").

The reason I particularly wanted the unlinked-to info pages to be scraped was to help with us testing the site. The same reasoning could apply to posts and events. And draft blog posts aren't really private anyway, since the GitHub repository is publicly browseable.

Perhaps the best solution is to scrape the unpublished pages too, but make sure we let our partners know that drafts are possible for people to discover, so they should never put any sensitive information in them.
```diff
@@ -7,11 +7,32 @@
   let(:finder) { Document::Finder.new(pattern: "#(unknown).md", baseurl: '/path/') }

   it 'finds a single document' do
-    Dir.stub :glob, [filename] do
+    Dir.stub :glob, [new_tempfile('')] do
```
This doesn't have to change, does it? If you want to change to using `new_tempfile` for all of these consistently, I'd make that a separate preparatory commit, and also change the remaining use of `filename`.
Ah, this is because things have now changed: when you call `find_single` you are also calling `find_all`, so with the introduction of `select(&:published?)`, the method `find_all` now has to look at the contents of the file, which it didn't have to before, so this test fails. But I wouldn't create a temp file for the other tests if it is not needed... 🤔 What do you think?
lib/document/finder.rb
Outdated
```diff
@@ -21,7 +21,7 @@ def find_or_empty
   end

   def find_all
-    filenames.map { |filename| create_document(filename) }
+    filenames.map { |filename| create_document(filename) }.select(&:published?)
```
It's a bit confusing to me that this is called `find_all` while it's not returning all the documents, just those for which `published?` is true. Perhaps you could call this `find_published` instead? (I realise that we don't have published fields for some of the collections, but I think it's less confusing to assume that every document in those collections is published than to have something called `find_all` that isn't returning all documents.)
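The suggested rename might look roughly like this (a simplified stand-in for `Document::Finder`, with a bare `Struct` in place of the real document class):

```ruby
# Simplified stand-in for the finder; Document here is a bare Struct,
# not the project's real Document class.
Document = Struct.new(:name, :published) do
  def published?
    published
  end
end

class Finder
  def initialize(documents)
    @documents = documents
  end

  # Renamed from find_all: the name now says what it actually returns.
  def find_published
    @documents.select(&:published?)
  end

  # Counterpart for the drafts (suggested instead of find_previews).
  def find_unpublished
    @documents.reject(&:published?)
  end
end

finder = Finder.new([Document.new('post', true), Document.new('draft', false)])
puts finder.find_published.map(&:name).inspect    # ["post"]
puts finder.find_unpublished.map(&:name).inspect  # ["draft"]
```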
lib/document/finder.rb
Outdated
```ruby
  create_documents.select(&:published?)
end

def find_previews
```
Calling this `find_previews` gives a surprising impression of what it returns, I think. They're not previews in any sense yet: it's still the full document. I'd call this `find_unpublished`.
tests/document/finder.rb
Outdated
```diff
@@ -33,6 +33,18 @@
     end
   end

+  it 'can find unpublished documents' do
```
In terms of robustness of the tests: in this case, even if you reversed the logic of `find_previews` so that it only returned published documents, the tests would still pass. A simple improvement, so that can't happen so easily, is to stub it with `[new_tempfile(published), new_tempfile(unpublished), new_tempfile(unpublished)]`, so that there are 2 unpublished documents and 1 published document, and assert that `find_previews` must return 2.
(This is a general point that I try to think about when writing tests: "How could this test still pass if I were making plausible but incorrect changes to the code under test, and how likely is someone to make such a change?" In this case, getting the logic of a predicate the wrong way round is a very easy and common mistake to make, so I'd try to make sure the test would still fail in that case.)
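The asymmetric-count suggestion can be shown with a toy example (plain hashes standing in for documents; not the project's test code):

```ruby
# Toy illustration of the asymmetric-count idea: stub 1 published and
# 2 unpublished documents. A finder whose predicate got reversed by
# mistake returns 1 instead of 2, so asserting on the count catches it;
# with a symmetric 1-and-1 stub, both versions would return 1 and the
# test would still pass.
docs = [
  { file: 'a.md', published: true },
  { file: 'b.md', published: false },
  { file: 'c.md', published: false },
]

find_unpublished    = ->(ds) { ds.reject { |d| d[:published] } }
reversed_by_mistake = ->(ds) { ds.select { |d| d[:published] } }

puts find_unpublished.call(docs).size     # 2, an assert_equal 2 passes
puts reversed_by_mistake.call(docs).size  # 1, an assert_equal 2 fails
```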
tests/web/unlinked_pages.rb
Outdated
```ruby
get '/scraper-start-page.html'
refute_empty(subject.xpath('//a[contains(@href, "/blog/")]'))
refute_empty(subject.xpath('//a[contains(@href, "/events/")]'))
refute_empty(subject.xpath('//a[contains(@href, "/info/") and not(contains(@href, "events"))][1]'))
```
I don't think this tests what the test says it tests, e.g. any info page other than `/info/events` existing will allow the test to pass, not just unlinked-to ones.
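To see why, the XPath predicate can be rendered as plain Ruby over some hypothetical hrefs (illustrative paths only, not pages from the actual site):

```ruby
# The XPath predicate from the test, rendered as plain Ruby: it matches
# ANY info page whose URL doesn't contain "events", whether or not that
# page is unlinked-to. The paths below are hypothetical examples.
matches = ->(href) { href.include?('/info/') && !href.include?('events') }

puts matches.call('/info/about')   # true, even if this page is linked
puts matches.call('/info/hidden')  # true, the unlinked case the test meant
puts matches.call('/info/events')  # false
```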
(force-pushed from b044632 to 8575c96)
(force-pushed from 8575c96 to 858d571)
This method allows the scraper route to send the unpublished documents to the view so that they can be scraped and pushed to the static repo. Documents that are unpublished are not linked anywhere on the site. With this change, they can be loaded as a preview without being explicitly linked from the site. This means that all documents, linked or unlinked, are published.
There is a small nuance between "published" documents and "unlinked" documents that may not be obvious to a newcomer to the codebase, so an explanation was filed under the tribal knowledge section of the contributing guide.
(force-pushed from 858d571 to e7e0ad0)
@mhl I rebased this branch as it was ready to be reviewed after addressing feedback 👍
In order to show links to unlinked pages in the scraping view, we need to make a distinction between published and unpublished posts. This is because posts/events that have the `published` field off won't be linked in the site. In the case of info pages and summaries, they are always published, but they have to be linked manually in the site, so those that were not linked won't be scraped either.

At the moment, `find_all` doesn't make a distinction between documents with the `published` field on or off, which is a bug. To fix it, the class was modified so that `find_all` selects only the documents with `published` set to `true`, and a new method was added for the unpublished documents, `find_unpublished`. Finally, the `find_all` method was renamed to `find_published`.

After that we can send all the unpublished and unlinked documents to the scraper view, since those are the documents that will not be linked anywhere in the site.
Related issues
Closes #151