Load archived MUR data into postgres #4151

Closed
1 task
patphongs opened this issue Jan 17, 2020 · 1 comment
patphongs commented Jan 17, 2020

What we're after:

Our source of truth for archived MUR data has been the Classic site and Elasticsearch. That is risky: we have retired Classic, so the data now lives only in Elasticsearch. In addition, we don't have much experience managing and restoring from backups in Elasticsearch.

Improvement plan

We should insert the archived MUR data into postgres for better data preservation and maintenance.

Code references

We should modify this function to scrape from Classic, insert into Postgres, and upload to S3 if needed (the documents should already be there, but confirm).

def load_archived_murs(from_mur_no=None, specific_mur_no=None, num_processes=1, tasks_per_child=None):
    """
    Reads data for archived MURs from http://classic.fec.gov/MUR/, assembles a JSON
    document corresponding to the MUR and indexes this document in Elasticsearch
    in the index `archived_murs` with a doc_type of `murs`. In addition, the MUR
    document is uploaded to an S3 bucket under the _directory_ `legal/murs/`.
    """
    logger.info("Loading archived MURs")
    table_text = requests.get('http://classic.fec.gov/MUR/MURData.do').text
    raw_mur_tr_element_list = re.findall("<tr [^>]*>(.*?)</tr>", table_text, re.S)[1:]
    if from_mur_no is not None:
        raw_mur_tr_element_list = list(itertools.dropwhile(
            lambda x: re.search(r'/disclosure_data/mur/([0-9]+)(?:_[A-Z])*\.pdf', x, re.M).group(1) != from_mur_no, raw_mur_tr_element_list))
    elif specific_mur_no is not None:
        raw_mur_tr_element_list = list(filter(
            lambda x: re.search(r'/disclosure_data/mur/([0-9]+)(?:_[A-Z])*\.pdf', x, re.M).group(1) == specific_mur_no, raw_mur_tr_element_list))
    process_murs(raw_mur_tr_element_list)
    logger.info("%d archived MURs loaded", len(raw_mur_tr_element_list))
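As a rough sketch of the "insert into Postgres" step, something like the following could be called for each assembled MUR document. Everything named here is an assumption for illustration: the `archived_murs` table and its columns, the `save_archived_mur` helper, and the `no` key on the MUR dict are not defined in this issue. It expects a DB-API cursor that uses `%s` placeholders (psycopg2 style).

```python
import json

# Hypothetical table layout; the real schema is not decided in this issue.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS archived_murs (
    mur_no TEXT PRIMARY KEY,
    doc JSONB NOT NULL
)
"""

# Upsert so that re-running the scrape does not create duplicate rows.
UPSERT_SQL = """
INSERT INTO archived_murs (mur_no, doc)
VALUES (%s, %s)
ON CONFLICT (mur_no) DO UPDATE SET doc = EXCLUDED.doc
"""


def save_archived_mur(cursor, mur):
    """Store one assembled MUR document as JSON, keyed by MUR number.

    `cursor` is any DB-API cursor with %s placeholders (assumed: psycopg2).
    `mur` is assumed to be a dict with the MUR number under the key "no".
    """
    cursor.execute(UPSERT_SQL, (mur["no"], json.dumps(mur, sort_keys=True)))
```

Keeping the whole document in a single JSON column mirrors what is indexed in Elasticsearch today, so the scraper would not need to change how it assembles documents. A normalized schema may turn out to be a better fit.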

Completion criteria:

  • Reload archived MURs from postgres like we do other legal resources
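
For the reload side, a minimal sketch of reading the stored documents back out of Postgres so they can be re-indexed. The `archived_murs` table, its `mur_no` and `doc` columns, and the `iter_archived_murs` helper are hypothetical names, not existing code. The indexing call itself is left out because it depends on how the other legal resources are loaded.

```python
import json

# Hypothetical table and columns; adjust to whatever schema is chosen.
SELECT_SQL = "SELECT mur_no, doc FROM archived_murs ORDER BY mur_no"


def iter_archived_murs(cursor):
    """Yield each stored MUR document as a dict.

    `cursor` is any DB-API cursor. Some drivers return JSON columns
    already parsed and others return text, so both cases are handled.
    """
    cursor.execute(SELECT_SQL)
    for _mur_no, doc in cursor.fetchall():
        yield json.loads(doc) if isinstance(doc, str) else doc
```

Each yielded dict could then be handed to the same Elasticsearch indexing step that the scraper uses now.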