I noticed that the lm_score code processes a single sentence at a time, which is pretty slow if you're processing a large amount of data. I wrote a batched version, though it's a bit ugly. It increases processing speed by about 8x on a single RTX 3090.
import math

import torch
import torch.nn.functional as F
from tqdm import tqdm
from transformers import BertForMaskedLM, BertTokenizerFast

# Use a GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def get_lm_score(sentences, batch_tokens=42000):
    def score_batch(batch, tokenizer, model):
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to(device)
        batch_scores = []
        with torch.no_grad():
            # Mark padding positions as -100 so cross_entropy ignores them
            # (its default ignore_index)
            labels = inputs["input_ids"].clone()
            labels[inputs["input_ids"] == tokenizer.pad_token_id] = -100
            out = model(input_ids=inputs["input_ids"], labels=labels,
                        attention_mask=inputs["attention_mask"],
                        token_type_ids=inputs["token_type_ids"])
            logits = out["logits"]
            # One pseudo-perplexity per sentence: exp of the mean token cross-entropy
            for j in range(labels.shape[0]):
                loss = F.cross_entropy(logits[j].view(-1, tokenizer.vocab_size), labels[j].view(-1))
                batch_scores.append(math.exp(loss.item()))
        return batch_scores

    model_name = 'bert-base-cased'
    model = BertForMaskedLM.from_pretrained(model_name).to(device)
    model.eval()
    tokenizer = BertTokenizerFast.from_pretrained(model_name)
    lm_score = []

    # Sort sentences by length so similar-length sentences share a batch and
    # padding is minimized (tokenizing everything up front takes too long,
    # so string length is used as an approximation of token count)
    sentences_flat = []
    for sent in sentences:
        for s in sent:
            sentences_flat.append((s, len(s)))
    sentences_flat.sort(key=lambda x: x[1], reverse=True)

    # Greedily pack the sorted sentences into batches of roughly batch_tokens characters
    batches = []
    current_batch_count = 0
    current_batch = []
    for sent in sentences_flat:
        current_batch.append(sent[0])
        current_batch_count += sent[1]
        if current_batch_count > batch_tokens:
            batches.append(current_batch)
            current_batch_count = 0
            current_batch = []
    if len(current_batch) > 0:
        batches.append(current_batch)

    # Score each unique sentence once, then look the results up per input
    score_dict = {}
    for batch in tqdm(batches):
        batch_score = score_batch(batch, tokenizer, model)
        for j, sent in enumerate(batch):
            score_dict[sent] = batch_score[j]

    for sentence in sentences:
        if len(sentence) == 0:
            lm_score.append(0.0)
            continue
        score_i = 0.0
        for x in sentence:
            if x in score_dict:
                score_i += score_dict[x]
            else:
                # Large fallback penalty for anything that failed to get scored
                score_i += 10000
        score_i /= len(sentence)
        lm_score.append(score_i)
    return lm_score
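For reference, here's a minimal usage sketch. The nested-list input shape (one list of sentence strings per item) is just my reading of how get_lm_score iterates, so adjust it to whatever the repo actually passes in:

    # Minimal usage sketch: get_lm_score takes a list of sentence lists
    # and returns one averaged pseudo-perplexity per inner list
    sentences = [
        ["The cat sat on the mat.", "It purred quietly."],
        ["Colorless green ideas sleep furiously."],
    ]
    scores = get_lm_score(sentences, batch_tokens=42000)
    print(scores)  # lower scores mean the LM finds the sentences more likely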
Hugging Face seems to have changed their API for the model.forward call: the above code works with transformers 4.20 (the latest release) but not with the version pinned in this repo (3.3.1). The code would have to be changed if you want to keep the current transformers version.
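For what it's worth, here is a sketch of the change needed for 3.3.1: as far as I remember, the forward call there returns a plain tuple rather than a dict-like output, with the logits as the second element when labels are passed. I haven't re-tested this against 3.3.1, so treat the indexing as an assumption:

    # Inside score_batch, for transformers 3.3.1 (untested sketch):
    out = model(input_ids=inputs["input_ids"], labels=labels,
                attention_mask=inputs["attention_mask"],
                token_type_ids=inputs["token_type_ids"])
    # 3.3.1 returns a tuple; with labels it should be (loss, prediction_scores, ...)
    logits = out[1]
    # On 4.x, out["logits"] (as in the code above) works instead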
The batched code also requires a new parameter for either a batch size or a number of tokens per batch. This parameter needs to be set depending on how much VRAM you have, and I'm not sure how you'd like to expose the option in your code.
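One option, purely as a sketch (the flag name is made up), would be a command-line argument with a conservative default that users with more VRAM can raise (42000 is what worked for me on the 24 GB 3090):

    import argparse

    parser = argparse.ArgumentParser()
    # hypothetical flag name; default chosen to suit a mid-range GPU
    parser.add_argument("--batch-tokens", type=int, default=20000,
                        help="approximate characters packed into each scoring batch")
    args = parser.parse_args()
    scores = get_lm_score(sentences, batch_tokens=args.batch_tokens)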