Catch twitterbot on shared RECAP PDFs and insert twitter card data #863

mlissner · 2018-08-30T21:13:08Z

So, when people share PDFs on twitter, we don't have any card data associated with them. This is a missed opportunity.

To address this, we can tweak the code here:

Lines 336 to 341 in 481e4f9

    elif file_path.startswith('recap'):
        # Create an empty object, and set it to blocked. No need to hit the DB
        # since all RECAP documents are blocked.
        item = RECAPDocument()
        item.blocked = True
        mimetype = 'application/pdf'

The tweak would be to watch for the Twitterbot user agent. When it's detected, we provide a useful HTML response page with Twitter/Facebook card information. When it's not, we serve up the PDF.
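A minimal sketch of that user-agent check, written as plain functions so it runs without Django. The names `is_twitterbot` and `choose_response` are hypothetical and not from the codebase; in a Django view the header would typically come from `request.META.get('HTTP_USER_AGENT', '')`.

```python
def is_twitterbot(user_agent):
    """Return True if the User-Agent header looks like Twitter's crawler."""
    # Case-insensitive substring match; a missing header counts as "not a bot".
    return 'twitterbot' in (user_agent or '').lower()


def choose_response(file_path, user_agent):
    """Pick what to serve for a file request (hypothetical helper).

    Returns 'html' for the card-bearing page, 'pdf' for the raw file.
    """
    if file_path.startswith('recap') and is_twitterbot(user_agent):
        return 'html'
    return 'pdf'
```

Only RECAP paths get the special handling here, which mirrors the `elif file_path.startswith('recap')` branch quoted above; every other request still gets the PDF.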

I guess the next question is: what info do we want in the card? This is what we show on our HTML pages for documents:

{% block title %}{{ title|safe|striptags }} – CourtListener.com{% endblock %}
{% block og_title %}{{ title|safe|striptags }} – CourtListener.com{% endblock %}
{% block description %}{{ title|safe|striptags }} — Brought to you by the RECAP Initiative and Free Law Project, a non-profit dedicated to creating high quality open legal information.{% endblock %}
{% block og_description %}{{ title|safe|striptags }} — Brought to you by the
  RECAP Initiative and Free Law Project, a non-profit dedicated to creating
  high quality open legal information.{% endblock %}

And the title variable is defined as:

    title = '%sDocument #%s%s in %s' % (
        '%s &ndash; ' % item.description if item.description else '',
        item.document_number,
        ', Attachment #%s' % item.attachment_number if
        item.document_type == RECAPDocument.ATTACHMENT else '',
        best_case_name(item.docket_entry.docket),
    )

So the result is something like:

  <meta property="og:type" content="website"/>
  <meta property="og:title" content="Complaint &ndash; Document #1 in NATIONAL VETERANS LEGAL SERVICES PROGRAM v. United States – CourtListener.com"/>
  <meta property="og:description"
        content="Complaint &ndash; Document #1 in NATIONAL VETERANS LEGAL SERVICES PROGRAM v. United States — Brought to you by the
  RECAP Initiative and Free Law Project, a non-profit dedicated to creating
  high quality open legal information.">
  <meta property="og:url" content="https://www.courtlistener.com/docket/4214664/1/national-veterans-legal-services-program-v-united-states/"/>
  <meta property="og:site_name" content="CourtListener"/>
  <meta property="og:image"
        content="https://www.courtlistener.com/static/png/og-image-300x300.png"/>
  <meta property="og:image:type" content="image/png"/>
  <meta property="og:image:width" content="300"/>
  <meta property="og:image:height" content="300"/>


mlissner · 2018-08-30T21:15:33Z

We could probably improve this to show better values for item.description in the meta description and og:description fields. Right now we show the PACER short description (Complaint), but we could probably show the long description when we have that.

johnhawkinson · 2018-08-30T21:34:33Z

imo you should iframe the PDF (perhaps with ?nometa and exclude ?nometa from this special handling, so that Twitterbot can get the PDF if it wants it).

I would include the twitter:creator and twitter:site tags (@recapthelaw and @freelawproject?).

Only og:description is relevant to Twitter.
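A rough sketch of those two suggestions, assuming Python 3. `wants_card_page` and `twitter_account_tags` are made-up names for illustration, and which handle goes in which tag is only a guess, as the question mark above suggests. Title and description are left to the existing og: tags.

```python
from html import escape


def wants_card_page(user_agent, query_params):
    """True when Twitterbot should get the HTML page instead of the PDF.

    A request carrying ?nometa is excluded, so the iframe (or the bot
    itself) can still fetch the raw PDF.
    """
    ua = (user_agent or '').lower()
    return 'twitterbot' in ua and 'nometa' not in query_params


def twitter_account_tags(site, creator, card='summary'):
    """Render the Twitter-specific meta tags as an HTML string."""
    pairs = [
        ('twitter:card', card),
        ('twitter:site', site),
        ('twitter:creator', creator),
    ]
    return '\n'.join(
        '<meta name="%s" content="%s"/>' % (name, escape(value, quote=True))
        for name, value in pairs
    )
```

For example, `twitter_account_tags('@freelawproject', '@recapthelaw')` gives three `<meta>` lines that could sit next to the og: tags in the page head. `query_params` can be any mapping that supports `in`, such as Django's `request.GET`.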

I don't think "Complaint — Document #1 in NATIONAL VETERANS LEGAL SERVICES PROGRAM v. United States — Brought to you by the RECAP Initiative and Free Law Project, a non-profit dedicated to creating" is very good. Document number first, without a huge waste of characters like "Document." More verbose description? maybe, but, you run out of characters first. I wonder if the date or the author is better, e.g.:

"#1 COMPLAINT against All Defendants United States of America filed by NATIONAL VETERANS LEGAL SERVICES PROGRAM, ALLIANCE FOR JUSTICE, NATIONAL CONSUMER LAW CENTER (Gupta, Deepak) in NATIONAL VETERANS LEGAL SERVICES PROGRAM v. United States"

vs.

"#1, Apr. 21, 2016 by Deepak Gupta: COMPLAINT against All Defendants United States of America filed by NATIONAL VETERANS LEGAL SERVICES PROGRAM, ALLIANCE FOR JUSTICE, NATIONAL CONSUMER LAW CENTER (Gupta, Deepak) in NATIONAL VETERANS LEGAL SERVICES PROGRAM v. United States"

or some other variant. I parse to elide parens (Filing fee $ 400 receipt number 0090-4495374), which, well...yeah...

What is this limited to, 70 characters? I dunno how much twitter clients actually display.
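To make the character-budget question concrete, here is a small sketch that puts the document number first and trims to a limit. `card_title` is a hypothetical helper, and the default of 70 is just the number floated above, not a verified limit.

```python
def card_title(document_number, description, case_name, limit=70):
    """Build a 'number first' title and trim it to at most `limit` characters."""
    title = '#%s %s in %s' % (document_number, description, case_name)
    if len(title) <= limit:
        return title
    # Leave room for a three-character ellipsis and avoid a trailing space.
    return title[:limit - 3].rstrip() + '...'
```

With a long docket description the case name is what gets cut off, which may be an argument for a shorter description or for putting the case name earlier.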

mlissner · 2018-09-10T23:52:25Z

The real fun here will be creating thumbnails of the PDFs, which seems like a noble pursuit, so that you can have a snapshot, à la the @big_cases bot.

mlissner · 2018-09-11T00:26:15Z

Hey, look, I did finish the code I was working on for this:

courtlistener/cl/people_db/tasks.py

Lines 10 to 40 in 9ee76e4

    @app.task
    def make_png_thumbnail_from_pdf(pk, width=350):
        """Create a png thumbnail from a financial disclosure PDF"""
        fd = FinancialDisclosure.objects.get(pk=pk)
        # Use a temporary location for the file, then save it to the model.
        with NamedTemporaryFile(prefix='financial_disclosure_',
                                suffix=".png") as tmp:
            convert = [
                'convert',
                # Only do the first page.
                '%s[0]' % fd.filepath.path,
                '-resize', '%s' % width,
                # This and the next line handle transparency problems
                '-background', 'white',
                '-alpha', 'remove',
                tmp.name,
            ]
            p = subprocess.Popen(convert, close_fds=True, stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE, universal_newlines=True)
            stdout, stderr = p.communicate()
            if p.returncode != 0:
                fd.thumbnail_status = fd.THUMBNAIL_FAILED
                fd.save()
                return fd.pk
            fd.thumbnail_status = fd.THUMBNAIL_COMPLETE
            filename = '%s.thumb.%sw.png' % (fd.person.slug, width)
            fd.thumbnail.save(filename, File(tmp))
        return fd

That's from the financial disclosure project, but it'll be trivial to generalize it. AWESOME. I think the performance should be there to do this on the fly so we don't have to do it for every PDF we get.
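One way the generalization might look, as a sketch only: pull the model-specific parts out so the function takes plain file paths. `build_thumbnail_command` and `make_png_thumbnail` are hypothetical names; the `convert` arguments are the same ones used in the task above, and ImageMagick is assumed to be installed.

```python
import subprocess


def build_thumbnail_command(pdf_path, png_path, width=350):
    """Return the ImageMagick argument list for a first-page PNG thumbnail."""
    return [
        'convert',
        # Only do the first page.
        '%s[0]' % pdf_path,
        '-resize', '%s' % width,
        # These two options handle transparency problems.
        '-background', 'white',
        '-alpha', 'remove',
        png_path,
    ]


def make_png_thumbnail(pdf_path, png_path, width=350):
    """Run the conversion; return True if `convert` exited with status 0."""
    p = subprocess.Popen(
        build_thumbnail_command(pdf_path, png_path, width),
        close_fds=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    )
    p.communicate()
    return p.returncode == 0
```

Keeping the command builder separate means it can be checked without ImageMagick present, and a caller for any model (financial disclosures, RECAP documents) only has to supply the two paths and record the status. If `convert` is not on the PATH, `Popen` raises an exception rather than returning False.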

mlissner · 2018-09-13T21:06:04Z

This is pretty much done. I expect some trouble during deployment because there are a lot of weird file system things going on:

New directory for the thumbnails
Apache gets new alias
Can't really test this until it's live

But those shouldn't be a big deal really.

I implemented this three times, each time refactoring to decrease complexity. Ugh:

First, I just copied the recap document template and tweaked it however I needed to make for a good twitter template.
Second, I realized there was too much overlap between Twitter's special template and the regular one, so I used includes and blocks to simplify the Twitter template.
Finally I realized there's no need for the special twitter template at all and that I can just use our NORMAL HTML page, but add the new Twitter stuff to it.

Each refactoring wasn't too bad, but it was kind of dumb. Anyway, the implementation is very simple now. Whenever twitter comes crawling, instead of serving the PDF directly, we serve the regular recap document HTML page, which has the Twitter card info, and now has an embedded PDF. Simple.

Addressing the thoughts about what to put in the title and the description:

I upgraded the title a bit to simplify it, but not a ton. I played with adding the date and considered adding the author. Adding the date just...looked cluttered. Adding the author isn't something we have the data for. This is easy to further tweak if we want.
For the description, I changed it to show the full docket entry description if we have it. And if not, to show just the document short description. I think that'll work OK.
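The description fallback boils down to something like the following sketch. `card_description` and its parameter names are made up for illustration; the real field names on the docket entry and document models may differ.

```python
def card_description(docket_entry_description, short_description):
    """Prefer the full docket entry description; fall back to the short one."""
    # Treat None and whitespace-only values as missing.
    full = (docket_entry_description or '').strip()
    if full:
        return full
    return (short_description or '').strip()
```

If neither value is present this returns an empty string, so the template would still need some default text for that case.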

Tests are running now. If they pass, I'll deploy.

mlissner added the easy pickins label Aug 31, 2018

mlissner closed this as completed Sep 15, 2018

mlissner mentioned this issue Sep 18, 2018

Automatically download Document Selection Menus (aka Attachment Pages) #852

Open
