As this is (extremely useful!) example code, it should be as clean as possible.
I'm looking at word_language_model/data.py, and there are two places where clarity and speed could be improved by removing redundant code.
tokenize() makes two passes over the file, labelled # Add words to the dictionary and # Tokenize file content. The first pass calls add_word(), which both adds the word to the dictionary and returns its token id, so everything can be done in one pass. The cleanest fix is to remove the first pass entirely and change the line ids.append(self.dictionary.word2idx[word]) to ids.append(self.dictionary.add_word(word)).
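For reference, Dictionary.add_word() in data.py looks roughly like this (a sketch of the existing class, but the shape is what matters: it already returns the index, so the separate word2idx lookup in the second pass is redundant):

```python
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        # Register the word if unseen, then return its index either way.
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]
```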
In # Tokenize file content, a list of torch tensors is built per line and then torch.cat() is used to merge them into the final tensor. It is both cleaner and faster to skip the intermediate tensors and simply do:
```python
# Tokenize file content
with open(path, 'r', encoding="utf8") as f:
    ids = []
    for line in f:
        words = line.split() + ['<eos>']
        for word in words:
            # Append plain Python ints; no per-line tensors to torch.cat() later
            ids.append(self.dictionary.word2idx[word])
return torch.tensor(ids).type(torch.int64)
```
In both cases I've just tried to take out redundant code to make things cleaner to read and faster to execute (data loading took about 20 minutes for the billion-word corpus).
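Putting both suggestions together, the whole of tokenize() collapses to a single pass. A sketch of what the combined method could look like, assuming the existing Corpus class and its self.dictionary from data.py:

```python
def tokenize(self, path):
    """Tokenizes a text file in a single pass."""
    assert os.path.exists(path)
    with open(path, 'r', encoding="utf8") as f:
        ids = []
        for line in f:
            words = line.split() + ['<eos>']
            for word in words:
                # add_word() registers the word (if new) and returns its
                # index, so no separate dictionary-building pass is needed.
                ids.append(self.dictionary.add_word(word))
    # One tensor allocation at the end, instead of one per line plus torch.cat()
    return torch.tensor(ids).type(torch.int64)
```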