Extracted topics make no sense; might have something to do with Unicode #132
Hey @hedgy123, it's hard for me to tell what's going wrong here, but since your code looks correct, I'm guessing that the garbage topics result from some combination of problems with the data, the term normalization, and the parameters of the topic model being trained. Here are a few things to try:
If none of that works, I'd assume that either your corpus isn't conducive to topic modeling (ugh for you) or there's a bug somewhere in textacy.
Off the top of my head, I think it might be an escaping/formatting issue related to special characters in the raw text. It's worth trying to escape them properly, or remove them altogether from your raw data, before pushing it into textacy. This might help if you want to escape them but keep the punctuation: https://stackoverflow.com/questions/18935754/how-to-escape-special-characters-of-a-string-with-single-backslashes
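A minimal sketch of that kind of clean-up step, assuming the raw comments are already decoded to unicode and live in a list called `raw_comments` (a hypothetical name); the NFKC normalization and the curly-quote/dash substitutions are just illustrative choices, not anything textacy requires:

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

def clean_comment(text):
    # Normalize to a composed Unicode form so visually identical
    # characters end up as the same code points.
    text = unicodedata.normalize("NFKC", text)
    # Replace curly quotes and dashes with plain ASCII equivalents
    # (an illustrative set; extend as needed for your data).
    for fancy, plain in ((u"\u2018", u"'"), (u"\u2019", u"'"),
                         (u"\u201c", u'"'), (u"\u201d", u'"'),
                         (u"\u2013", u"-"), (u"\u2014", u"-")):
        text = text.replace(fancy, plain)
    # Collapse any leftover runs of whitespace.
    return re.sub(r"\s+", u" ", text).strip()

cleaned_comments = [clean_comment(c) for c in raw_comments]
```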
@hedgy123 I get the same issue: the topics do not make sense. Did you figure out what the problem was?
Ran into the same issue here 🤔
Okay, sounds like I should confirm that the topic model's behavior is expected... I've been punting on major …
Hi,
I've just installed the latest version of textacy under Python 2.7 on a Mac. I am trying to extract topics from a set of comments that contain quite a few non-ASCII characters. The topics I am getting make no sense.
Here's what's going on. I create a corpus of comments like this:
This creates a Corpus(3118 docs; 71018 tokens). If I print out the first three documents/tokens from the corpus, they look normal:
Then:
And that's where I get back "topics" that make no sense:
Somehow, the fact that everything comes back with u'' prefixes seems to indicate to me that Unicode handling is potentially messing things up, but I am not sure how to fix that. The printed corpus seemed perfectly fine.
Could you please help? Thanks a lot!
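For reference, the pipeline described above looks roughly like the sketch below. It assumes a textacy release from around that era (Corpus, vsm.Vectorizer, and tm.TopicModel), uses `comments` as a stand-in for the list of raw comment strings, and leaves every parameter at its default; argument names moved around in later textacy versions, so treat this as an approximation rather than the exact code behind the output above:

```python
# -*- coding: utf-8 -*-
import textacy
import textacy.tm
import textacy.vsm

# `comments` is a placeholder for the list of unicode comment strings.
corpus = textacy.Corpus(u"en", texts=comments)

# Turn each document into a list of normalized term strings,
# then into a document-term matrix.
terms_lists = (doc.to_terms_list(ngrams=1, as_strings=True) for doc in corpus)
vectorizer = textacy.vsm.Vectorizer()
doc_term_matrix = vectorizer.fit_transform(terms_lists)

# Fit an NMF topic model and print the top terms for each topic.
model = textacy.tm.TopicModel("nmf", n_topics=10)
model.fit(doc_term_matrix)
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=8):
    print(u"topic {}: {}".format(topic_idx, u", ".join(top_terms)))
```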