URL links in reddit comments should be replaced with token or filtered out #8

Open
bwuu opened this issue Sep 3, 2015 · 1 comment

Comments

@bwuu
Collaborator

bwuu commented Sep 3, 2015

URL links in the processed comments look something like
"httpvideo nationalgeographic comvideoplayeranimalsinvertebratesanimalsoctopusandsquidoctopuscyanealocomotion html" and are obviously not good for training. We should replace them with a placeholder token or filter them out.

Related: maybe we need to tokenize or filter gibberish words in general to limit the vocab size?
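
One common way to cap the vocabulary is to keep only the most frequent words and map everything else to a single unknown-word token. The sketch below is only an illustration of that idea, not code from this repository: the function names and the `<unk>` token are hypothetical, and splitting on whitespace is an assumption about how comments are tokenized.

```python
from collections import Counter

# Hypothetical token name; the project may use a different one.
UNK_TOKEN = "<unk>"


def build_vocab(sentences, max_size):
    """Return the set of the max_size most frequent whitespace-separated words."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word for word, _ in counts.most_common(max_size)}


def replace_rare_words(sentence, vocab):
    """Replace every word that is not in vocab with UNK_TOKEN."""
    return " ".join(w if w in vocab else UNK_TOKEN for w in sentence.split())
```

With this approach, one-off strings such as a mangled URL fall outside the vocabulary and collapse into the unknown-word token instead of each taking a vocabulary slot.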

@timothywangdev
Owner

Currently DataGenerator only converts numbers to a placeholder token. Feel free to add a few lines to convert URLs to a token as well. We may also need to handle phone numbers (xxx-xxx-xxxx).
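
A minimal sketch of what those few lines might look like, assuming the replacement runs on the raw comment text before punctuation is stripped (once the text looks like the example in the issue, the URL can no longer be recognized reliably). The token names `<url>`, `<phone>`, and `<number>` and the function name are hypothetical, and none of this is taken from DataGenerator itself.

```python
import re

# Hypothetical placeholder tokens; the names used by the project are not shown in this thread.
URL_TOKEN = "<url>"
PHONE_TOKEN = "<phone>"
NUMBER_TOKEN = "<number>"

# Anything starting with http://, https://, or www. up to the next whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
# Phone numbers in the xxx-xxx-xxxx form mentioned above.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
# Plain integers and simple decimals.
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")


def replace_special_tokens(text: str) -> str:
    """Replace URLs, phone numbers, and numbers with placeholder tokens.

    Order matters: URLs and phone numbers contain digits, so they are
    replaced before the generic number pattern runs.
    """
    text = URL_RE.sub(URL_TOKEN, text)
    text = PHONE_RE.sub(PHONE_TOKEN, text)
    text = NUMBER_RE.sub(NUMBER_TOKEN, text)
    return text
```

The URL pattern is deliberately loose and will also swallow punctuation attached to the end of a link; a stricter pattern may be needed for Markdown-style links in Reddit comments.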
