Load archived MUR data into postgres #4151

patphongs · 2020-01-17T16:12:19Z

What we're after:

It's risky to have our source of truth be the Classic site and Elasticsearch: we have retired Classic, so that data now lives only in Elasticsearch. In addition, we don't have much experience managing and restoring from backups in Elasticsearch.

Improvement plan

We should insert the archived MUR data into postgres for better data preservation and maintenance.

Code references

We should modify this function to scrape from classic, insert into postgres, and upload to s3 if needed (should already be there, but confirm)

openFEC/webservices/legal_docs/load_legal_docs.py

Lines 445 to 462 in 3c78a21

    
    def load_archived_murs(from_mur_no=None, specific_mur_no=None, num_processes=1, tasks_per_child=None):
        """
        Reads data for archived MURs from http://classic.fec.gov/MUR/, assembles a JSON
        document corresponding to the MUR and indexes this document in Elasticsearch
        in the index `archived_murs` with a doc_type of `murs`. In addition, the MUR
        document is uploaded to an S3 bucket under the _directory_ `legal/murs/`.
        """
        logger.info("Loading archived MURs")
        table_text = requests.get('http://classic.fec.gov/MUR/MURData.do').text
        raw_mur_tr_element_list = re.findall("<tr [^>]*>(.*?)</tr>", table_text, re.S)[1:]
        if from_mur_no is not None:
            raw_mur_tr_element_list = list(itertools.dropwhile(
                lambda x: re.search('/disclosure_data/mur/([0-9]+)(?:_[A-Z])*\.pdf', x, re.M).group(1) != from_mur_no, raw_mur_tr_element_list))
        elif specific_mur_no is not None:
            raw_mur_tr_element_list = list(filter(
                lambda x: re.search('/disclosure_data/mur/([0-9]+)(?:_[A-Z])*\.pdf', x, re.M).group(1) == specific_mur_no, raw_mur_tr_element_list))
        process_murs(raw_mur_tr_element_list)
        logger.info("%d archived MURs loaded", len(raw_mur_tr_element_list))
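As a rough idea of what the "insert into postgres" step might look like, here is a minimal sketch. It is not taken from the openFEC codebase: the table `archived_murs`, its columns `mur_no` and `doc`, the `no` key on the document, and the helper name `save_archived_mur` are all hypothetical. It assumes a DB-API cursor that uses `%s` placeholders (as psycopg2 does) and a unique constraint on `mur_no`, so that re-running the loader updates rows instead of duplicating them.

```python
import json

# Hypothetical table and column names; the real schema is not decided in this issue.
# Assumes a unique constraint on mur_no so ON CONFLICT can turn the insert into an update.
UPSERT_ARCHIVED_MUR_SQL = """
INSERT INTO archived_murs (mur_no, doc)
VALUES (%s, %s)
ON CONFLICT (mur_no) DO UPDATE SET doc = EXCLUDED.doc
"""


def save_archived_mur(cursor, mur):
    """Upsert one assembled MUR document through a DB-API cursor.

    `mur` is the JSON-serializable dict built for a single archived MUR;
    the key "no" holding the MUR number is an assumption made for this sketch.
    """
    cursor.execute(UPSERT_ARCHIVED_MUR_SQL, (mur["no"], json.dumps(mur, sort_keys=True)))
```

The function takes the cursor as an argument so the caller keeps control of the connection and transaction, and so it can be exercised in tests with a stand-in cursor.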

Completion criteria:

Reload archived MURs from postgres like we do other legal resources
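A reload along these lines could read the stored documents back out of postgres and hand them to the existing indexing code. The sketch below is only an illustration under assumptions: the table `archived_murs`, its columns `mur_no` and `doc`, and the helper name `iter_archived_murs` are hypothetical, and the cursor is assumed to be a DB-API cursor such as psycopg2's.

```python
import json

# Hypothetical table and column names, used here for illustration only.
SELECT_ARCHIVED_MURS_SQL = "SELECT doc FROM archived_murs ORDER BY mur_no"


def iter_archived_murs(cursor):
    """Yield each stored archived MUR document as a dict."""
    cursor.execute(SELECT_ARCHIVED_MURS_SQL)
    for (doc,) in cursor.fetchall():
        # A json/jsonb column typically arrives already decoded (psycopg2 does this
        # by default); a text column arrives as a string and needs json.loads.
        yield json.loads(doc) if isinstance(doc, str) else doc
```

Each yielded document could then be indexed into Elasticsearch the same way the loader indexes documents it assembles today.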

JonellaCulmer · 2020-07-30T15:46:12Z

Closing in favor of #4465

patphongs added the Work: Back-end, Needs refinement, Needs prioritization, and Work: PI 11 no milestone labels Jan 17, 2020

lbeaufort mentioned this issue Jan 17, 2020

Preserve archived MUR data in Postgres or archived MUR XML #4129

Closed

2 tasks

lbeaufort removed the Work: PI 11 no milestone label Jul 16, 2020

JonellaCulmer added this to the Sprint 13.2 milestone Jul 30, 2020

JonellaCulmer closed this as completed Jul 30, 2020

patphongs mentioned this issue Mar 4, 2024

Sub-epic: Elasticsearch fecgov/fec-epics#183

Closed

18 tasks
