
Elasticsearch load script doesn't properly seed a fresh database #3

Open
Woozl opened this issue Feb 23, 2024 · 0 comments
Woozl commented Feb 23, 2024

I was trying to deploy the chart into my personal namespace to test, but the code fails unless the neuro_query_docker index already exists in the Elasticsearch database. This doesn't impact the current architecture, since any chart upgrade reuses the existing PVC, but it would be good to fix eventually in case we ever need to redeploy from scratch. The offending function is here:

def insertDataIntoIndex(termFile, indexName, shards, esConn):
    # First question: has the data already been loaded?
    res = esConn.indices.refresh(indexName)
    res = esConn.cat.count(indexName, params={"format": "json"})
    nData = (res[0]['count'])
    if int(nData) > 0:
        print(f"{nData} data items already loaded")
        sys.exit(0)
    tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
    model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
    rowId = 1
    terms = open(termFile, 'r')
    for line in terms:
        cleanLine = line.strip()
        toks = tokenizer.batch_encode_plus([cleanLine],
                                           padding="max_length",
                                           max_length=25,
                                           truncation=True,
                                           return_tensors="pt")
        output = model(**toks)
        cls_rep = output[0][:, 0, :]
        print(type(cls_rep))
        embeddingArray = cls_rep.detach().numpy()
        print(type(embeddingArray))
        print(embeddingArray)
        insertBody = {'term_name': cleanLine,
                      'term': cleanLine,
                      'term_vec': embeddingArray[0],
                      'row_id': rowId}
        rowId += 1
        esConn.index(index=indexName, body=insertBody)
    print(f"number of rows inserted is {rowId - 1}")
    terms.close()

I believe we just need a try/except around the esConn.indices.refresh call to handle elasticsearch's NotFoundError.
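A minimal sketch of what that fix could look like. The NotFoundError class and StubConnection below are stand-ins so the control flow can be demonstrated without a live cluster; in the actual script you would import NotFoundError from elasticsearch.exceptions and pass the real connection.

```python
class NotFoundError(Exception):
    """Stand-in for elasticsearch.exceptions.NotFoundError."""

class StubConnection:
    """Stand-in connection simulating a fresh database with no index."""
    class indices:
        @staticmethod
        def refresh(index_name):
            # Mimics the 404 raised when the index doesn't exist yet.
            raise NotFoundError(404, "index_not_found_exception")

def count_existing_rows(es_conn, index_name):
    """Return the document count for index_name, or 0 if the index
    doesn't exist yet (fresh database)."""
    try:
        es_conn.indices.refresh(index_name)
        res = es_conn.cat.count(index_name, params={"format": "json"})
        return int(res[0]["count"])
    except NotFoundError:
        # Fresh deployment: the index hasn't been created, so nothing
        # has been loaded. Let the loader proceed instead of crashing.
        return 0

print(count_existing_rows(StubConnection(), "neuro_query_docker"))  # 0
```

insertDataIntoIndex would then call this helper and only sys.exit(0) when the count is greater than zero, so a fresh database falls through to the loading loop.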

Stack Trace
Traceback (most recent call last):
  File "/usr/local/renci/bin/loadNeuroQueryTermsSapBert.py", line 148, in <module>
    main(args)
  File "/usr/local/renci/bin/loadNeuroQueryTermsSapBert.py", line 41, in main
    insertDataIntoIndex(termFile, indexName, shards, esConn)
  File "/usr/local/renci/bin/loadNeuroQueryTermsSapBert.py", line 46, in insertDataIntoIndex
    res = esConn.indices.refresh(indexName)
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/client/utils.py", line 153, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/client/indices.py", line 60, in refresh
    return self.transport.perform_request(
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/transport.py", line 415, in perform_request
    raise e
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/transport.py", line 381, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/connection/http_urllib3.py", line 275, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/connection/base.py", line 322, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.NotFoundError: NotFoundError(404, 'index_not_found_exception', 'no such index [neuro_query_docker]', neuro_query_docker, index_or_alias)