Dealing with stop words and NER in multilingual texts #114
Current workflow is to detect language using the Python guess-language package and then select appropriate stopwords if it's a language nltk has stopwords for. I hadn't thought about mixed languages, though. Might be helpful to have some sample mixed-language text so we can see what guess-language thinks of it, and write some tests.
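For reference, a minimal sketch of that single-language flow (names and the language-code mapping are illustrative, and the exact function name varies between guess-language ports, e.g. guessLanguage vs guess_language):

```python
# Sketch of the current flow: guess the language, then pick the matching
# nltk stopword list. Assumes the guess-language package and nltk's
# stopwords corpus are installed; LANG_NAMES is illustrative only.
from guess_language import guessLanguage  # some ports name this guess_language
from nltk.corpus import stopwords

# guess-language returns ISO 639-1 codes; nltk names its lists in full
LANG_NAMES = {"en": "english", "fr": "french", "de": "german", "it": "italian"}

def stopwords_for(text):
    code = guessLanguage(text)       # e.g. "en", "fr", or "UNKNOWN"
    lang = LANG_NAMES.get(code)
    if lang in stopwords.fileids():  # only languages nltk ships lists for
        return set(stopwords.words(lang))
    return set()                     # skip stopword filtering if unsure
```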
Here is some text that hugely confuses the guess-language function: Later, pressure increased to focus less on animal conservation and more on the welfare of urban-dwellers and tourism promotion. As from 1930 hunting permits were sold and in 1932 the journal of the Italian Alpine Club published an article proposing to transform the Gran Paradiso into a sort of huge open-air zoological garden, with all the features of an urban park. In the same years the Aostan autonomist politician Emile Chanoux lamented that until then the park had stressed too much its scientific aims, forgetting to respond to what it called its “social function”:
After discussing it with my friendly local multilingual historian and thinking over Wilko's issue, I wonder if there are two parts to the problem: the first is dealing with stop words in the appropriate languages, the second is NER (named entity recognition) in other languages. Does dbpedia automatically query Wikipedia content from all languages or just English? If not, can we use the current language detection to query the appropriate instances as well as applying different sets of stop words? Thoughts @moltude ? Also thanks @wilkohardenberg for your input and earlier comments!
I'm still thinking about this but I have a couple of thoughts so far:
I'm still chewing on this so any additional thoughts would be appreciated.
Useful points, thanks! We could possibly assume that any non-English text is more pertinent and prioritise those queries - but do we actually need to run separate queries against the search APIs, or do we just add non-English terms into the mix?
Perhaps in the meantime we can make it clear that Serendip-o-matic only supports English-language text in version 1.0?
Hi everyone. I sent the pull request for FR stop words and was referred to this discussion (thanks Mia!). One way to solve this problem might be to break a text up into chunks and run guess-language on each chunk, aggregating results to build a list of search terms. Chunks could be separated by punctuation and line breaks. This should work for Wilko's text above. For single words and short phrases from one language inserted into a text written mainly in another language, it may be too much trouble to determine the different languages.
I was thinking paragraphs, as detected by various forms of line breaks (assuming they're still slightly different between OSs). How does that sound?
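A rough sketch of that per-chunk detection, reusing the hypothetical guessLanguage helper from the earlier example (the regex covers Windows \r\n, old-Mac \r, and Unix \n conventions):

```python
import re
from guess_language import guessLanguage  # as in the sketch above

# Any run of line breaks, in any OS convention, marks a chunk boundary
PARA_RE = re.compile(r"(?:\r\n|\r|\n)+")

def detect_paragraph_languages(text):
    """Guess a language per paragraph so results can be aggregated."""
    paras = [p.strip() for p in PARA_RE.split(text) if p.strip()]
    return [(guessLanguage(p), p) for p in paras]
```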
If feasible it sounds good to me. Single words or short sentences should not be too much of a problem in most cases. I wonder however how this should work on a Zotero library: separate language guessing for each entry?
Paragraphs are natural chunks, so that works for me.
Working by paragraph sounds like a feasible solution, although I worry about how that will scale to larger texts (although I suppose there are probably lots of parts of the code where larger text may cause issues). I also wonder if I could adapt the guess-language code to give multiple languages back if there are multiple languages with very high scores - it looks like it might be possible from glancing at the code, but I would need to experiment some. Is there likely to be a problem with combining stop words from all the languages detected? Although that doesn't help as much for knowing which dbpedia spotlight endpoint to use, I guess. As for #78 - we probably need some simple input type detection first - plain text, html/xml, csv, etc - and then do some pre-processing based on the input format before generating search terms.
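On the input-detection point, a very rough sketch of the kind of sniffing I mean (illustrative only; csv.Sniffer is an eager stdlib heuristic and a real implementation would want something more robust):

```python
import csv

def sniff_input_type(data):
    """Crude input-format detection to run before generating search terms."""
    head = data.lstrip()[:200].lower()
    if head.startswith(("<?xml", "<!doctype", "<html")):
        return "html/xml"
    try:
        csv.Sniffer().sniff(data[:1024])  # does a sample parse as delimited values?
        return "csv"
    except csv.Error:
        return "text"
```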
Hi everyone. Combining stop words from different languages will create problems, e.g. "den" is an article in German and a noun in English. mw
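To make that collision concrete, a quick check against nltk's lists (assuming the stopwords corpus is downloaded):

```python
from nltk.corpus import stopwords

english = set(stopwords.words("english"))
german = set(stopwords.words("german"))
combined = english | german

# "den" comes in via the German list only...
assert "den" in german and "den" not in english
# ...so filtering English text against the combined list throws away
# a legitimate content word:
tokens = "the fox returned to its den".split()
print([t for t in tokens if t not in combined])  # ['fox', 'returned']
```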
We don't need to keep the paragraph structure, just pass things into a bucket for the appropriate language, then push each one through the appropriate tokenisation, stop words and entity recognition steps... Though we might want to adjust the mix of query terms according to the proportional amount of each language - too fussy? (At some future point we may want to use the languages detected to query for objects from particular cultures or in particular languages, but that'd need to be considered carefully in relation to 'serendipity' and any future 'hint' function)
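A minimal sketch of that bucketing, again assuming the hypothetical guessLanguage helper from the earlier examples:

```python
from collections import defaultdict
from guess_language import guessLanguage  # as in the sketches above

def bucket_by_language(paragraphs):
    """Group text chunks by detected language, discarding document order."""
    buckets = defaultdict(list)
    for para in paragraphs:
        buckets[guessLanguage(para)].append(para)
    return buckets

# Each bucket then gets its own tokenisation / stopword / NER pass, and
# len(bucket) per language gives a weight if we want the query-term mix
# to reflect how much of each language the text contains.
```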
Just a note that it might be easiest to work out and document design decisions on the wiki, then return here to finish integrating them: https://github.com/chnm/serendipomatic/wiki/Serendipomatic-architecture
Do we need a chat to decide on the best solution? If so, who's interested?
I'm interested.
I'm also ready to dig back in on this.
I'm interested too.
Cool! Is there an asynchronous way we can talk through the options or should we try for a chat? (I'm complicating things slightly by being in a completely different timezone).
I can make time 9-5 M-F for a chat if that makes the timezone problem easier. Thursday or Friday would be the best day for me this week if we wanted to meet.
This Friday afternoon (10/18), US East Coast time, would work for me. Could we videoconference? mw
I'm GMT+11, the other East Coast Time (I'm in Australia). I could just about do 7am here, though I'd make more sense at 8am! http://www.timeanddate.com/worldclock/meetingtime.html?iso=20131018&p1=240&p2=179
I can meet Friday 8:00 AM Mia's time (Thursday 5:00 PM my time). mw
Skype? I don't have a camera on the dinosaur laptop I'm travelling with so it's voice-only for me at the best of times.
Thursday 5:00 EST on Skype would work for me.
I'm available on Thursday at 5pm EST too. Is Skype audio conference calling free? How do we exchange Skype account names (prefer not to post them publicly, obviously)? When the OWOT team did a video/audio chat last week it was kind of laggy and a bit difficult to communicate at times, which makes me wonder if a text chat might be more useful - but I guess Skype has a chat tool built in that we can use if the audio is too laggy, right? Alternatively we could try a Google+ hangout if we want to do video for those who have cameras.
The document for collecting sample text for testing is 'Help us collect multilingual text for testing Serendip-o-matic': https://docs.google.com/document/d/100UygYyACS7tgU70FYpc4d00NTwoXaDzDmSUCu3naJE/edit#
Here's a record of the decisions reached during our chat:
a) set up analytics to keep track of word count, languages
Of those, a, f, g will be new issues, b adds weight to #11, h is related to #78, and c, d, e, i and j are related to the original issue.
Slightly off-topic, but this article on NER might be worth a look: 'Exploring Entity Recognition and Disambiguation |
Is the current workflow: 'detect language, apply appropriate stopwords' or 'apply generic multilingual stopwords'? If it's the former, can we detect multiple languages and apply the appropriate lists of stopwords?
As this conversation hints (https://twitter.com/wilkohardenberg/status/363677752391516161), many scholars work in two or more languages, so ideally we could cope with returning entities and tokens for at least two languages, and also apply stop words.
The trickiness of dealing with this might also be a call for more randomness in the way query terms are mixed, so people can refresh the results and see different terms applied.