Catch twitterbot on shared RECAP PDFs and insert twitter card data #863

Closed
mlissner opened this issue Aug 30, 2018 · 5 comments

Comments

@mlissner
Member

So, when people share PDFs on Twitter, we don't have any card data associated with them. This is a missed opportunity.

To address this, we can tweak the code here:

elif file_path.startswith('recap'):
    # Create an empty object, and set it to blocked. No need to hit the DB
    # since all RECAP documents are blocked.
    item = RECAPDocument()
    item.blocked = True
    mimetype = 'application/pdf'

The tweak would watch for the Twitterbot user agent. When it's detected, we return a useful HTML page with Twitter/Facebook card information. When it's not, we serve up the PDF.
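
A minimal sketch of that user-agent check, assuming the header value is available as a plain string. The helper names `is_twitterbot` and `pick_response_kind` are hypothetical and not taken from the CourtListener code; the sketch only relies on Twitter's crawler sending a user agent that contains "Twitterbot".

```python
def is_twitterbot(user_agent):
    """Return True when the User-Agent header looks like Twitter's crawler."""
    # Case-insensitive substring match; a missing header counts as "not a bot".
    return 'twitterbot' in (user_agent or '').lower()


def pick_response_kind(user_agent):
    """Decide what to serve: the card-bearing HTML page or the raw PDF."""
    return 'html' if is_twitterbot(user_agent) else 'pdf'
```

In a Django view the header would typically come from `request.META.get('HTTP_USER_AGENT')`, and the result would select between rendering the document page and streaming the file.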

I guess the next question is, what info do we want in the card? This is what we show on our HTML pages for documents:

{% block title %}{{ title|safe|striptags }} – CourtListener.com{% endblock %}
{% block og_title %}{{ title|safe|striptags }} – CourtListener.com{% endblock %}
{% block description %}{{ title|safe|striptags }} — Brought to you by the RECAP Initiative and Free Law Project, a non-profit dedicated to creating high quality open legal information.{% endblock %}
{% block og_description %}{{ title|safe|striptags }} — Brought to you by the
  RECAP Initiative and Free Law Project, a non-profit dedicated to creating
  high quality open legal information.{% endblock %}

And the title variable is defined as:

    title = '%sDocument #%s%s in %s' % (
        '%s – ' % item.description if item.description else '',
        item.document_number,
        ', Attachment #%s' % item.attachment_number if
        item.document_type == RECAPDocument.ATTACHMENT else '',
        best_case_name(item.docket_entry.docket),
    )

So the result is something like:

  <meta property="og:type" content="website"/>
  <meta property="og:title" content="Complaint &ndash; Document #1 in NATIONAL VETERANS LEGAL SERVICES PROGRAM v. United States – CourtListener.com"/>
  <meta property="og:description"
        content="Complaint &ndash; Document #1 in NATIONAL VETERANS LEGAL SERVICES PROGRAM v. United States — Brought to you by the
  RECAP Initiative and Free Law Project, a non-profit dedicated to creating
  high quality open legal information.">
  <meta property="og:url" content="https://www.courtlistener.com/docket/4214664/1/national-veterans-legal-services-program-v-united-states/"/>
  <meta property="og:site_name" content="CourtListener"/>
  <meta property="og:image"
        content="https://www.courtlistener.com/static/png/og-image-300x300.png"/>
  <meta property="og:image:type" content="image/png"/>
  <meta property="og:image:width" content="300"/>
  <meta property="og:image:height" content="300"/>
@mlissner
Member Author

We could probably improve this to show better values for item.description in the meta description and og:description fields. Right now we show the PACER short description (Complaint), but we could probably show the long description when we have that.

@johnhawkinson
Contributor

imo you should iframe the PDF (perhaps with ?nometa and exclude ?nometa from this special handling, so that Twitterbot can get the PDF if it wants it).
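
A rough sketch of the `?nometa` escape hatch, assuming the raw query string and the user agent are both available as strings. `wants_card_page` is a hypothetical helper name, and whether the parameter would carry a value is not specified here, so the sketch accepts a bare `?nometa` as well as `?nometa=1`.

```python
from urllib.parse import parse_qs


def wants_card_page(user_agent, query_string):
    """True when the request should get the HTML card page, not the PDF."""
    # keep_blank_values=True so a bare "?nometa" (no "=value") is still seen.
    params = parse_qs(query_string, keep_blank_values=True)
    if 'nometa' in params:
        # Explicit opt-out: always hand back the PDF, even to Twitterbot.
        return False
    return 'twitterbot' in (user_agent or '').lower()
```

With this shape, the iframe on the card page can point at the same URL plus `?nometa`, and the crawler still reaches the PDF if it follows that link.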

I would include the twitter:creator and twitter:site tags (@recapthelaw and @freelawproject?).
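
For illustration, a small sketch that renders those tags. The function name `twitter_card_tags` is hypothetical, the `summary` card type is an assumption, and the mapping of handles to tags (site vs. creator) simply follows the order suggested above, which was itself a guess.

```python
from html import escape


def twitter_card_tags(site='@freelawproject', creator='@recapthelaw'):
    """Render Twitter card <meta> tags as an HTML string."""
    tags = [
        ('twitter:card', 'summary'),  # assumed card type
        ('twitter:site', site),
        ('twitter:creator', creator),
    ]
    # Escape the values so quotes or angle brackets cannot break the markup.
    return '\n'.join(
        '<meta name="%s" content="%s"/>' % (name, escape(value, quote=True))
        for name, value in tags
    )
```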

Only og:description is relevant to Twitter.

I don't think "Complaint — Document #1 in NATIONAL VETERANS LEGAL SERVICES PROGRAM v. United States — Brought to you by the RECAP Initiative and Free Law Project, a non-profit dedicated to creating" is very good. Document number first, without a huge waste of characters like "Document." More verbose description? maybe, but, you run out of characters first. I wonder if the date or the author is better, e.g.:

"#1 COMPLAINT against All Defendants United States of America filed by NATIONAL VETERANS LEGAL SERVICES PROGRAM, ALLIANCE FOR JUSTICE, NATIONAL CONSUMER LAW CENTER (Gupta, Deepak) in NATIONAL VETERANS LEGAL SERVICES PROGRAM v. United States"

vs.

"#1, Apr. 21, 2016 by Deepak Gupta: COMPLAINT against All Defendants United States of America filed by NATIONAL VETERANS LEGAL SERVICES PROGRAM, ALLIANCE FOR JUSTICE, NATIONAL CONSUMER LAW CENTER (Gupta, Deepak) in NATIONAL VETERANS LEGAL SERVICES PROGRAM v. United States"

or some other variant. I parse to elide parens (Filing fee $ 400 receipt number 0090-4495374), which, well...yeah...

What is this limited to, 70 characters? I dunno how much twitter clients actually display.
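
Since the real limit is an open question, here is a sketch that takes it as a parameter rather than assuming a number. Both helpers are hypothetical names. Note that this naive version strips every parenthetical, including ones like "(Gupta, Deepak)", so a real version would need to be pickier about which asides to drop.

```python
import re


def elide_parens(text):
    """Drop parenthesized asides such as "(Filing fee $ 400 ...)"."""
    without = re.sub(r'\s*\([^()]*\)', '', text)
    # Collapse any double spaces left behind by the removal.
    return re.sub(r'\s{2,}', ' ', without).strip()


def truncate(text, limit):
    """Cut text to at most `limit` characters, ending with an ellipsis."""
    if len(text) <= limit:
        return text
    # Reserve one character for the ellipsis and avoid trailing whitespace.
    return text[:limit - 1].rstrip() + '…'
```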

@mlissner
Member Author

The real fun here will be creating thumbnails of the PDFs, which seems like a noble pursuit, so that you can have a snapshot, à la the @big_cases bot.

@mlissner
Member Author

Hey, look, I did finish the code I was working on for this:

import subprocess
from tempfile import NamedTemporaryFile

from django.core.files import File

# `app` (the Celery app) and FinancialDisclosure come from the project's
# own modules.


@app.task
def make_png_thumbnail_from_pdf(pk, width=350):
    """Create a png thumbnail from a financial disclosure PDF"""
    fd = FinancialDisclosure.objects.get(pk=pk)
    # Use a temporary location for the file, then save it to the model.
    with NamedTemporaryFile(prefix='financial_disclosure_',
                            suffix=".png") as tmp:
        convert = [
            'convert',
            # Only do the first page.
            '%s[0]' % fd.filepath.path,
            '-resize', '%s' % width,
            # This and the next line handle transparency problems
            '-background', 'white',
            '-alpha', 'remove',
            tmp.name,
        ]
        p = subprocess.Popen(convert, close_fds=True, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE, universal_newlines=True)
        stdout, stderr = p.communicate()
        if p.returncode != 0:
            fd.thumbnail_status = fd.THUMBNAIL_FAILED
            fd.save()
            return fd.pk
        fd.thumbnail_status = fd.THUMBNAIL_COMPLETE
        filename = '%s.thumb.%sw.png' % (fd.person.slug, width)
        # Saving the field also saves the model, and must happen while the
        # temporary file still exists.
        fd.thumbnail.save(filename, File(tmp))
    # Return the pk on both paths so the task result is consistent.
    return fd.pk

That's from the financial disclosure project, but it'll be trivial to generalize it. AWESOME. I think the performance should be there to do this on the fly so we don't have to do it for every PDF we get.
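
One way the generalization might start is by pulling the ImageMagick argument list out into a function that knows nothing about any particular model. This is only a sketch: `build_convert_command` is a hypothetical name, and the flags are copied from the task above rather than tuned for RECAP PDFs.

```python
def build_convert_command(pdf_path, out_path, width=350):
    """Return the ImageMagick argv that renders page one of a PDF to a PNG."""
    return [
        'convert',
        # "[0]" selects only the first page of the PDF.
        '%s[0]' % pdf_path,
        '-resize', str(width),
        # Flatten transparency onto a white background.
        '-background', 'white',
        '-alpha', 'remove',
        out_path,
    ]
```

The list can be passed straight to `subprocess.Popen`, as the task above does, with the caller deciding where the output goes and what to do on a non-zero return code.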

@mlissner
Member Author

This is pretty much done. I expect some trouble during deployment because there are a lot of weird file system things going on:

  • New directory for the thumbnails
  • Apache gets new alias
  • Can't really test this until it's live

But those shouldn't be a big deal really.

I implemented this three times, each time refactoring to decrease complexity. Ugh:

  1. First, I just copied the recap document template and tweaked it as needed to make a good Twitter template.

  2. Second, I realized there was too much overlap between Twitter's special template and the regular one, so I used includes and blocks to simplify the Twitter template.

  3. Finally I realized there's no need for the special twitter template at all and that I can just use our NORMAL HTML page, but add the new Twitter stuff to it.

Each refactoring wasn't too bad, but it was kind of dumb. Anyway, the implementation is very simple now. Whenever Twitter comes crawling, instead of serving the PDF directly, we serve the regular recap document HTML page, which has the Twitter card info, and now has an embedded PDF. Simple.


Addressing the thoughts about what to put in the title and the description:

  • I upgraded the title a bit to simplify it, but not a ton. I played with adding the date and considered adding the author. Adding the date just...looked cluttered. Adding the author isn't something we have the data for. This is easy to further tweak if we want.

  • For the description, I changed it to show the full docket entry description if we have it. And if not, to show just the document short description. I think that'll work OK.
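
The description fallback in the second bullet amounts to something like the following sketch. The function and argument names are hypothetical stand-ins and do not mirror the actual model fields.

```python
def card_description(entry_description, short_description):
    """Prefer the full docket entry description; fall back to the short one."""
    full = (entry_description or '').strip()
    if full:
        return full
    # No long description available, so use the document's short description.
    return (short_description or '').strip()
```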

Tests are running now. If they pass, I'll deploy.
