Number of Sentences in corpusfile don't match trained sentences. #2693
Comments
How much of a discrepancy between these numbers did you see in your case, and exactly which two outputs were you comparing?

You can find some discussion of why the individual threads' stopping conditions are approximate, and thus why exact word/text counts are not necessarily expected to line up, in the #2127 PR that added this mode. (You will have to click GitHub's "Load more..." link to reveal hidden items to get the leading/following context.)

As the expected behavior in ...

The approximate nature of this approach seemed a bit fishy to me at the time, in that it might risk some (tiny?) ranges/contexts of the file being trained on multiple times, while other ranges get missed. But in largish corpora perhaps such little discrepancies along the "seams" between shards don't matter much.
@persiyanov may be able to comment further.
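To make the "seams" concern concrete, here is a toy simulation. It is not gensim's actual implementation: it only assumes that each worker starts at a line boundary near an even byte split and stops once it has consumed roughly its share of the total word count. The names `shard_starts` and `simulate` are made up for this sketch.

```python
def shard_starts(lines, n_workers):
    """For each worker, pick the first line at or after an even byte split."""
    sizes = [len(line.encode("utf-8")) + 1 for line in lines]  # +1 for "\n"
    total = sum(sizes)
    starts, pos, idx = [], 0, 0
    for w in range(n_workers):
        target = total * w // n_workers
        # Advance to the first line boundary at or past the target offset.
        while idx < len(lines) and pos < target:
            pos += sizes[idx]
            idx += 1
        starts.append(idx)
    return starts


def simulate(lines, n_workers):
    """Return how many times each line is processed under the toy scheme."""
    total_words = sum(len(line.split()) for line in lines)
    budget = total_words / n_workers  # approximate per-worker stopping point
    seen = [0] * len(lines)
    for start in shard_starts(lines, n_workers):
        words, i = 0, start
        # The budget is only checked between lines, so a worker can overrun
        # into the next shard or stop before reaching it.
        while i < len(lines) and words < budget:
            words += len(lines[i].split())
            seen[i] += 1
            i += 1
    return seen


lines = ["aaaaaaaaaaaaaaaaaaaa", "b c d e f", "g h i j k", "l", "m"]
seen = simulate(lines, n_workers=2)
print(seen)                   # [1, 1, 2, 1, 1]
print(sum(seen), len(lines))  # 6 5
```

In this toy case the byte split and the word budget disagree because the first line is one long word, so the third line is processed by both workers and the processed-example count (6) exceeds the line count (5). With other inputs the same scheme can skip lines instead.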
Problem description
I'm training a fasttext model (CBOW) over a corpus, for instance `enwik8`. The number of sentences trained on (or `example_count`, as it is referred to in the log methods) doesn't equal the number of sentences in the file (`wc -l` or `len(f.readlines())`, referred to as `expected_count` or `total_examples`).

Why is this happening? Also, in the method here, this warning has been suppressed for corpus mode.
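For reference, the file-side count can be computed without loading the whole corpus into memory. This is a minimal sketch; `count_lines` and `count_newlines` are hypothetical helper names, not part of any library. Note that the two methods mentioned above can differ by one: `wc -l` counts newline characters, while `len(f.readlines())` also counts a final line that has no trailing newline.

```python
def count_lines(path):
    """Count lines as len(f.readlines()) would, but streaming."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def count_newlines(path, chunk_size=1 << 20):
    """Count newline bytes, which is what `wc -l` reports."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += chunk.count(b"\n")
    return total
```

Checking both on the corpus file rules out a missing trailing newline as a source of an off-by-one before comparing against the trained example count.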
Versions