
Elasticsearch load script doesn't properly seed a fresh database #3

Open
Woozl opened this issue Feb 23, 2024 · 0 comments
Woozl commented Feb 23, 2024

I was trying to deploy the chart into my personal namespace to test, but the code fails unless the neuro_query_docker index already exists in the Elasticsearch database. This doesn't impact the current architecture, since any chart upgrade reuses the existing PVC, but it would be good to fix eventually in case we ever need to redeploy from scratch. The offending function is here:

def insertDataIntoIndex(termFile, indexName, shards, esConn):
    # First question: has the data already been loaded?
    res = esConn.indices.refresh(indexName)
    res = esConn.cat.count(indexName, params={"format": "json"})
    nData = (res[0]['count'])
    if int(nData) > 0:
        print(f"{nData} data items already loaded")
        sys.exit(0)
    tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
    model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
    rowId = 1
    terms = open(termFile, 'r')
    for line in terms:
        cleanLine = line.strip()
        toks = tokenizer.batch_encode_plus([cleanLine],
                                           padding="max_length",
                                           max_length=25,
                                           truncation=True,
                                           return_tensors="pt")
        output = model(**toks)
        cls_rep = output[0][:, 0, :]
        print(type(cls_rep))
        embeddingArray = cls_rep.detach().numpy()
        print(type(embeddingArray))
        print(embeddingArray)
        insertBody = {'term_name': cleanLine,
                      'term': cleanLine,
                      'term_vec': embeddingArray[0],
                      'row_id': rowId}
        rowId += 1
        esConn.index(index=indexName, body=insertBody)
    print(f"number of rows inserted is {rowId - 1}")
    terms.close()

I believe we just need a try/except around the esConn.indices.refresh call to handle elasticsearch's NotFoundError.
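A minimal sketch of what that fix could look like. The NotFoundError class and StubConnection below are stand-ins so the control flow can be demonstrated without a live cluster; in the actual script you would import NotFoundError from elasticsearch.exceptions and pass the real connection.

```python
class NotFoundError(Exception):
    """Stand-in for elasticsearch.exceptions.NotFoundError."""

class StubConnection:
    """Stand-in connection simulating a fresh database with no index."""
    class indices:
        @staticmethod
        def refresh(index_name):
            # Mimics the 404 raised when the index doesn't exist yet.
            raise NotFoundError(404, "index_not_found_exception")

def count_existing_rows(es_conn, index_name):
    """Return the document count for index_name, or 0 if the index
    doesn't exist yet (fresh database)."""
    try:
        es_conn.indices.refresh(index_name)
        res = es_conn.cat.count(index_name, params={"format": "json"})
        return int(res[0]["count"])
    except NotFoundError:
        # Fresh deployment: the index hasn't been created, so nothing
        # has been loaded. Let the loader proceed instead of crashing.
        return 0

print(count_existing_rows(StubConnection(), "neuro_query_docker"))  # 0
```

insertDataIntoIndex would then call this helper and only sys.exit(0) when the count is greater than zero, so a fresh database falls through to the loading loop.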

Stack Trace
Traceback (most recent call last):
  File "/usr/local/renci/bin/loadNeuroQueryTermsSapBert.py", line 148, in <module>
    main(args)
  File "/usr/local/renci/bin/loadNeuroQueryTermsSapBert.py", line 41, in main
    insertDataIntoIndex(termFile, indexName, shards, esConn)
  File "/usr/local/renci/bin/loadNeuroQueryTermsSapBert.py", line 46, in insertDataIntoIndex
    res = esConn.indices.refresh(indexName)
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/client/utils.py", line 153, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/client/indices.py", line 60, in refresh
    return self.transport.perform_request(
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/transport.py", line 415, in perform_request
    raise e
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/transport.py", line 381, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/connection/http_urllib3.py", line 275, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/connection/base.py", line 322, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.NotFoundError: NotFoundError(404, 'index_not_found_exception', 'no such index [neuro_query_docker]', neuro_query_docker, index_or_alias)