
Using the trained model to get sentences split #1

Open

genyunus opened this issue Aug 17, 2018 · 1 comment

@genyunus

Hi,

Thanks for the great work! My question is: how do I use the trained model to split a sentence? For example:

sentence: last month we went on vocation the trip was very hard but it was worth doing because we had a lot of fun however on the way I lost my favorite shoes

output: (the split sentences produced by the trained model)

Also, I couldn't find the part where the sentences are vectorized for training; I only see where you split them and save them as dataset.sentences.

Thanks a lot

@brandonrobertz
Owner

Hey there,

Sorry for the confusion here. This project was kind of a disorganized tool I was using for a research paper, so it didn't really get the love it needed. I ended up abandoning this idea, though, and went a different path.

The overall strategy for turning character streams into sentences was this:

- Model 1 is a binary classifier that takes a window of characters/words and decides whether a punctuation mark is needed anywhere in it. If model 1 says "yes", we pass the data to model 2.
- Model 2 is a multiclass classifier that decides where in the window the punctuation mark goes.

These two models were intended to be trained separately; a rough sketch of the inference loop follows below.
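In rough pseudocode, the two-stage inference loop would look something like this. This is a minimal sketch, not code from the repo: `binary_model`, `position_model`, and `vectorize` are stand-ins for the two trained Keras-style classifiers and the featurizer, and padding of the final short window is elided.

```python
import numpy as np

WINDOW = 20  # assumed sliding-window width, in characters

def segment(text, binary_model, position_model, vectorize):
    """Two-stage segmentation: model 1 flags windows that need a
    punctuation mark, model 2 picks the position within the window."""
    out = []
    i = 0
    while i < len(text):
        window = text[i:i + WINDOW]
        x = vectorize(window)[np.newaxis, ...]  # batch of one window
        # Stage 1: binary decision -- does this window need a mark at all?
        if binary_model.predict(x)[0, 0] > 0.5:
            # Stage 2: multiclass decision -- which position gets the mark?
            pos = int(np.argmax(position_model.predict(x)[0]))
            # Insert a period (the only mark handled in this sketch).
            out.append(window[:pos + 1] + ". ")
            i += pos + 1
        else:
            out.append(window)
            i += WINDOW
    return "".join(out)
```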

I know some others who worked on this (https://github.com/jaggzh/nn-punk), and they tried the straight single-model approach with a char-CNN, but I found the signal too sparse with that higher-dimensional output, since most of the time a window doesn't need any punctuation at all.
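For reference, the single-model version would look something like this. Layer sizes and constants here are hypothetical, not taken from either repo; the point is that the `N_CLASSES`-way softmax is dominated by the "no punctuation" class.

```python
from tensorflow.keras import layers, models

N_CHARS = 64            # assumed alphabet size
WINDOW = 20             # assumed window width
N_CLASSES = WINDOW + 1  # one class per position, plus "no punctuation"

model = models.Sequential([
    # One-hot character windows in, conv features out.
    layers.Conv1D(64, 3, activation="relu",
                  input_shape=(WINDOW, N_CHARS)),
    layers.GlobalMaxPooling1D(),
    # Sparse target space: most training examples hit the "none" class.
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```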

The dual-model approach was derived from some previous work, as described in the comment here: https://github.com/brandonrobertz/sentence-autosegmentation/blob/master/classifier.py#L17

RE: vectorization: it happens in the precompute function in load_data.py (https://github.com/brandonrobertz/sentence-autosegmentation/blob/master/load_data.py#L8).
Currently it's a character-based model (which failed). It wouldn't be difficult to change it to a word-embedding-based model, but I just didn't get that far. (I suspect it would work a lot better.)
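If it helps, character-level one-hot encoding in the spirit of precompute would look roughly like this. The alphabet, window size, and function name are assumptions for illustration, not the repo's actual code.

```python
import numpy as np

# Assumed alphabet; the real precompute() may use a different one.
ALPHABET = "abcdefghijklmnopqrstuvwxyz .,;:!?'"
CHAR_IDX = {c: i for i, c in enumerate(ALPHABET)}
WINDOW = 20

def vectorize(window):
    """One-hot encode a character window into (WINDOW, len(ALPHABET)),
    zero-padding short windows and skipping unknown characters."""
    x = np.zeros((WINDOW, len(ALPHABET)), dtype=np.float32)
    for i, c in enumerate(window.lower()[:WINDOW]):
        if c in CHAR_IDX:
            x[i, CHAR_IDX[c]] = 1.0
    return x
```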

Does that answer it?
