[Feature/Addition] Adding a connector for the LegiScan API #935

matthewkrausse · 2023-11-29T18:18:08Z

I'm working on a python library to wrap the Legiscan API. This allows you to get data on legislators, bills, votes, etc.
Wanted to see if other people would be interested in this. There is not that much work to be done on it but would be happy to have someone pair on it.

https://legiscan.com/

Priority

low

lkesich · 2023-12-07T20:24:30Z

It's not being updated anymore, but I still use the pylegiscan library (https://github.com/poliquin/pylegiscan). I use it pretty regularly and would be happy to help test a new Parsons legiscan class.

matthewkrausse · 2023-12-07T20:52:14Z

@lkesich What are some of the methods you primarily use? I wanted to create a priority order of me working on it. Also how are you working with the data? A lot of information is nested and I would like to provide some same use cases for being able to extract the data in a useful way.

Thanks!

lkesich · 2023-12-08T23:05:28Z

@matthewkrausse Honestly, I mostly just use the getDatsetList and getDataset methods. I have redshift tables for people, bills, votes etc that I overwrite once a day with fresh data from those bulk imports. My application doesn't require real-time updates, so this was simplest for me.

I've also used getMasterListRaw and getBill for comparing change hashes and replacing stale data, but I found this was actually slower than just overwriting the current session completely. It also creates many more API calls than the bulk import methods. A more advanced Python user probably could have optimized my process here, but I ended up just going back to what worked for me.

The nesting is definitely the most difficult part of working with Legiscan data. I use the long_table() method from Parsons.Table to expand the fields I'm interested in as separate tables, then drop the other nested columns before loading to the data warehouse. I don't like to work with semi-structured data in redshift, so all the data I load is fully unpacked. When I was using a postgres-based data warehouse, I loaded the data basically as-is.

My redshift schema has the following tables:

votes
roll_calls
people
bills
sponsors
subjects

sponsors and subjects are extracted from the nested structure of the bill JSONs prior to loading. votes and roll_calls are both extracted from the vote JSONs (in my data model, votes has one row per legislator per roll_call, roll_calls has one row per roll_call).

I use this data to fulfill one-off research requests (e.g. "which current Dem legislators are most likely to break with their party on gun control votes"), and to support applications that allied organizations use for opposition research and end-of-year scorecard building.

I can imagine that other people might want to use Legiscan for real time bill monitoring, so I'm not sure how representative my use case is. If it's helpful, I can share more details about it and attach my Python script.

matthewkrausse · 2023-12-09T00:24:51Z

Thanks for taking the time to respond. That would be helpful if you would share any resources you use it for and that structure. I envision a lot of other people having that similar use case. When we release the new connector, it would be great to have some sample script to show how to handle the data.

lkesich · 2023-12-09T02:28:00Z

This is the script for creating or overwriting my database. The functions in the Legiscan class are from poliquin's pylegiscan library. There's probably more elegant ways of doing a lot of this, but at some point I had to get it working and move on to the actual project :)

import os
import pandas as pd
import numpy as np
from parsons import Table, Redshift
import json
import requests
import base64
import zipfile
import io
from urllib.parse import urlencode, quote_plus

rs = Redshift()
table = Table()

class LegiScanError(Exception):
    pass

class LegiScan(object):
    BASE_URL = 'http://api.legiscan.com/?key={0}&op={1}&{2}'

    def __init__(self, apikey=None):
        """LegiScan API.  State parameters should always be passed as
           USPS abbreviations.  Bill numbers and abbreviations are case
           insensitive.  Register for API at http://legiscan.com/legiscan
        """
        # see if API key available as environment variable
        if apikey is None:
            apikey = os.getenv('LEGISCAN_API_KEY')
        self.key = apikey.strip()

    def _url(self, operation, params=None):
        """Build a URL for querying the API."""
        if not isinstance(params, str) and params is not None:
            params = urlencode(params)
        elif params is None:
            params = ''
        return self.BASE_URL.format(self.key, operation, params)

    def _get(self, url):
        """Get and parse JSON from API for a url."""
        req = requests.get(url)
        if not req.ok:
            raise LegiScanError('Request returned {0}: {1}'\
                    .format(req.status_code, url))
        data = json.loads(req.content)
        if data['status'] == "ERROR":
            raise LegiScanError(data['alert']['message'])
        return data

    def get_dataset_list(self, state=None, year=None):
        """Get list of available datasets, with optional state and year filtering.
        """
        if state is not None:
            url = self._url('getDatasetList', {'state': state})
        elif year is not None:
            url = self._url('getDatasetList', {'year': year})
        else:
            url = self._url('getDatasetList')
        data = self._get(url)
        return data['datasetlist']

    def get_dataset(self, id, access_key):
        """Get list of available datasets, with optional state and year filtering.
        """
        url = self._url('getDataset', {'id': id, 'access_key': access_key})
        data = self._get(url)
        return data['dataset']
    
legis = LegiScan()
    
    
""" Warning: these lines will create one API call per session + one more API call. """

datasets = legis.get_dataset_list(state = 'my_state')

votes = []
bills = []
people = []
    
for dataset in datasets:
    session_id = dataset['session_id']
    access_key = dataset['access_key']   
    api_output = legis.get_dataset(session_id, access_key)
    encoded = base64.b64decode(api_output['zip'])
    zipped = zipfile.ZipFile(io.BytesIO(encoded))
    files = zipped.namelist()
    
    for file in files:
        content = zipped.read(file).decode("utf-8")   
        try:
            data = json.loads(content)
            if '/bill/' in file:
                bills.append(data['bill'])
            elif '/vote/' in file:
                votes.append(data['roll_call'])
            elif '/people/' in file:
                data['person']['session_id'] = session_id
                people.append(data['person'])       
        except:
            pass
    
bill_df = pd.json_normalize(bills)
vote_df = pd.json_normalize(votes)
people_df = pd.json_normalize(people)


""" Unpack nested columns into separate tables + clean column names """

def simplify_df(df):
    
    nested_cols = []
    
    for col in df.columns:
        if isinstance(df[col][0],list):
            nested_cols.append(col)
            
    new_names = []
    old_names = df.columns

    for old_name in old_names:
        new_name = old_name.split('.')[-1]
        new_names.append(new_name)
            
    name_dict = dict(zip(old_names, new_names))
    
    output = df.rename(columns = name_dict)
    
    output.drop(columns = nested_cols, axis = 1, inplace = True)
    
    output.mask(output.applymap(type).eq(list) & ~output.astype(bool), inplace = True)
    output.replace({np.nan: None}, inplace = True)
    
    return output

def clean_names(tbl, prefix = ''):
    old_names = tbl.columns
    
    for old_name in tbl.columns:
        new_name = old_name.replace(prefix,'')
        try:
            tbl.rename_column(old_name, new_name)
        except:
            pass
        
    return tbl

# roll call
roll_call_tbl = table.from_dataframe(bill_df).long_table('bill_id','votes')
clean_names(roll_call_tbl, 'votes_')

# subjects
subjects_tbl = table.from_dataframe(bill_df).long_table('bill_id','subjects')
clean_names(subjects_tbl, 'subjects_')

# sponsors
sponsors_tbl = table.from_dataframe(bill_df).long_table('bill_id','sponsors')
clean_names(sponsors_tbl, 'sponsors_')

# vote
vote_tbl = table.from_dataframe(vote_df).long_table(['roll_call_id','bill_id'],'votes')
clean_names(vote_tbl, 'votes_')

# bill
simplified_bill_df = simplify_df(bill_df)
bill_tbl = table.from_dataframe(simplified_bill_df)

# person
simplified_people_df = simplify_df(people_df)
people_tbl = table.from_dataframe(simplified_people_df)


""" Warning: this will completely overwrite all legiscan tables in redshift. """

legiscan_tables = [
    {
        'table': 'legiscan_people',
        'schema': 'my_schema',
        'sortkey': 'people_id',
        'distkey': 'people_id',
        'tbl': people_tbl
    },
    {
        'table': 'legiscan_roll_calls',
        'schema': 'my_schema',
        'sortkey': 'date',
        'distkey': 'roll_call_id',
        'tbl': roll_call_tbl
    },
    {
        'table': 'legiscan_bills',
        'schema': 'my_schema',
        'sortkey': 'session_id',
        'distkey': 'bill_id',
        'tbl': bill_tbl
    },
    {
        'table': 'legiscan_votes',
        'schema': 'my_schema',
        'sortkey': 'roll_call_id, people_id',
        'distkey': 'roll_call_id',
        'tbl': vote_tbl
    },
    {
        'table': 'legiscan_subjects',
        'schema': 'my_schema',
        'sortkey': 'subject_id',
        'distkey': 'bill_id',
        'tbl': subjects_tbl
    },
    {
        'table': 'legiscan_sponsors',
        'schema': 'my_schema',
        'sortkey': 'people_id',
        'distkey': 'bill_id',
        'tbl': sponsors_tbl
    }
]


""" Send to redshift. If you have not defined a distkey or sortkey, 
    you will get a warning that can be safely ignored."""

for key in legiscan_tables:
    destination = key['schema'] + '.' + key['table']
    tbl = key['tbl']
    dist = key['distkey']
    sort = key['sortkey']
    tbl.to_redshift(
        destination, 
        distkey = dist, 
        sortkey = sort, 
        if_exists = 'drop')

matthewkrausse · 2023-12-10T20:56:25Z

This is great. Thanks! I went ahead and used this for testing the new LegiScan connector and also rewrote it to send the data to BigQuery. Once I get it cleaned up, I'll add it to the a sample use case script to share when the new connector is ready.

lkesich · 2023-12-11T02:30:39Z

I look forward to trying the new connector! Thanks for your work on this.

matthewkrausse added the enhancement Impact - something should be added to or changed about Parsons that isn't causing a current breakage label Nov 29, 2023

shaunagm added low priority Priority - this is a nice-to-have, is non-urgent, and/or has a minor overall imapact on Parsons new connector Work type - creating a new Parsons connector for a tool labels Dec 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature/Addition] Adding a connector for the LegiScan API #935

[Feature/Addition] Adding a connector for the LegiScan API #935

matthewkrausse commented Nov 29, 2023 •

edited

Loading

lkesich commented Dec 7, 2023

matthewkrausse commented Dec 7, 2023

lkesich commented Dec 8, 2023

matthewkrausse commented Dec 9, 2023

lkesich commented Dec 9, 2023 •

edited

Loading

matthewkrausse commented Dec 10, 2023 •

edited

Loading

lkesich commented Dec 11, 2023

[Feature/Addition] Adding a connector for the LegiScan API #935

[Feature/Addition] Adding a connector for the LegiScan API #935

Comments

matthewkrausse commented Nov 29, 2023 • edited Loading

Priority

lkesich commented Dec 7, 2023

matthewkrausse commented Dec 7, 2023

lkesich commented Dec 8, 2023

matthewkrausse commented Dec 9, 2023

lkesich commented Dec 9, 2023 • edited Loading

matthewkrausse commented Dec 10, 2023 • edited Loading

lkesich commented Dec 11, 2023

matthewkrausse commented Nov 29, 2023 •

edited

Loading

lkesich commented Dec 9, 2023 •

edited

Loading

matthewkrausse commented Dec 10, 2023 •

edited

Loading