Default Normalizers not working #66

Open
thomasegense opened this issue Jul 19, 2018 · 9 comments

Comments

@thomasegense

thomasegense commented Jul 19, 2018

I am using the latest 1.13 release.

The FrequencyAnalyzer default constructor adds the following normalizers:

public FrequencyAnalyzer() {
    this.normalizers.add(new TrimToEmptyNormalizer());
    this.normalizers.add(new CharacterStrippingNormalizer());
    this.normalizers.add(new LowerCaseNormalizer());
}

This looks correct, but it does not work properly: whitespace is left in place, so the trim is not working correctly for some reason. Here is the log output.
Notice the first line: the most frequent token is nothing but whitespace.
Also notice how many times the word "crack" appears below, with and without leading or trailing spaces.

2018-07-19 09:42:44,639 [main] INFO  com.kennycason.kumo.WordCloud - placed:    (1/300)
2018-07-19 09:42:44,642 [main] INFO  com.kennycason.kumo.WordCloud - placed: the (2/300)
2018-07-19 09:42:44,643 [main] INFO  com.kennycason.kumo.WordCloud - placed: music (3/300)
2018-07-19 09:42:44,644 [main] INFO  com.kennycason.kumo.WordCloud - placed: and (4/300)
2018-07-19 09:42:44,644 [main] INFO  com.kennycason.kumo.WordCloud - placed: user (5/300)
2018-07-19 09:42:44,645 [main] INFO  com.kennycason.kumo.WordCloud - placed:  crack (6/300)
2018-07-19 09:42:44,646 [main] INFO  com.kennycason.kumo.WordCloud - placed: this (7/300)
2018-07-19 09:42:44,646 [main] INFO  com.kennycason.kumo.WordCloud - placed: you (8/300)
2018-07-19 09:42:44,647 [main] INFO  com.kennycason.kumo.WordCloud - placed: csdb (9/300)
2018-07-19 09:42:44,689 [main] INFO  com.kennycason.kumo.WordCloud - placed: comment (10/300)
2018-07-19 09:42:44,689 [main] INFO  com.kennycason.kumo.WordCloud - placed: submitted (11/300)
2018-07-19 09:42:44,689 [main] INFO  com.kennycason.kumo.WordCloud - placed: for (12/300)
2018-07-19 09:42:44,690 [main] INFO  com.kennycason.kumo.WordCloud - placed: graphics (13/300)
2018-07-19 09:42:44,690 [main] INFO  com.kennycason.kumo.WordCloud - placed: scene (14/300)
2018-07-19 09:42:44,691 [main] INFO  com.kennycason.kumo.WordCloud - placed: demo (15/300)
2018-07-19 09:42:44,702 [main] INFO  com.kennycason.kumo.WordCloud - placed: crack   (16/300)
2018-07-19 09:42:44,702 [main] INFO  com.kennycason.kumo.WordCloud - placed: c64 (17/300)
2018-07-19 09:42:44,702 [main] INFO  com.kennycason.kumo.WordCloud - placed: crack (18/300)
2018-07-19 09:42:44,710 [main] INFO  com.kennycason.kumo.WordCloud - placed: demo   (19/300)
2018-07-19 09:42:44,711 [main] INFO  com.kennycason.kumo.WordCloud - placed: can (20/300)
2018-07-19 09:42:44,713 [main] INFO  com.kennycason.kumo.WordCloud - placed: made (21/300)
2018-07-19 09:42:44,714 [main] INFO  com.kennycason.kumo.WordCloud - placed: commodore (22/300)
2018-07-19 09:42:44,714 [main] INFO  com.kennycason.kumo.WordCloud - placed: find (23/300)
2018-07-19 09:42:44,715 [main] INFO  com.kennycason.kumo.WordCloud - placed: all (24/300)
2018-07-19 09:42:44,719 [main] INFO  com.kennycason.kumo.WordCloud - placed: one-file (25/300)
2018-07-19 09:42:44,721 [main] INFO  com.kennycason.kumo.WordCloud - placed: intro (26/300)
2018-07-19 09:42:44,721 [main] INFO  com.kennycason.kumo.WordCloud - placed: 1990 (27/300)
2018-07-19 09:42:44,723 [main] INFO  com.kennycason.kumo.WordCloud - placed: about (28/300)
2018-07-19 09:42:44,723 [main] INFO  com.kennycason.kumo.WordCloud - placed: out (29/300)
@kennycason
Owner

Thanks for posting this. I'll check it out. Seems like it should be a straightforward fix.

@kennycason
Owner

Could you give me a sample input? Are you loading from a raw text file, or are you loading a "frequency file" of the format:

100: frog
94: dog
43: cog
3: fog
1: log
1: pog

@kennycason
Owner

I created a simple unit test with some weird text and could not immediately replicate your issue.
Test:

    @Test
    public void defaultTokenizerTrimTest() throws IOException {
        final FrequencyAnalyzer frequencyAnalyzer = new FrequencyAnalyzer();
        final List<WordFrequency> wordFrequencies = frequencyAnalyzer.load(
                Thread.currentThread().getContextClassLoader().getResourceAsStream("trim_test.txt"));

        final Map<String, WordFrequency> wordFrequencyMap = wordFrequencies
                .stream()
                .collect(Collectors.toMap(WordFrequency::getWord,
                                          Function.identity()));

        assertEquals(2, wordFrequencyMap.get("random").getFrequency());
        assertEquals(1, wordFrequencyMap.get("some").getFrequency());
        assertEquals(1, wordFrequencyMap.get("with").getFrequency());
        assertEquals(1, wordFrequencyMap.get("spaces").getFrequency());
        assertEquals(1, wordFrequencyMap.get("i'm").getFrequency());
    }

The contents of trim_test.txt:
I'm some random random text with spaces .

Feel free to post your raw text/file and I can add tests around it and help debug.

@kennycason
Owner

I went ahead and pushed up the test since there was no existing FrequencyAnalyzerTest. https://github.com/kennycason/kumo/blob/master/kumo-core/src/test/java/com/kennycason/kumo/nlp/FrequencyAnalyzerTest.java

@thomasegense
Author

thomasegense commented Jul 23, 2018

Here is an example text file that triggers the bug.
(removed sample file)
The result is the same whether I load from a text file or from an InputStream.
Most special characters are removed, but not "-". I am not sure whether this is intended,
but I end up with several distinct dash-only tokens:

-
--
---

etc.
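
Until the library decides how to treat these, dash-only tokens can be dropped after tokenization. The following is a minimal plain-Java sketch that does not use any kumo API; the class name `DashTokenFilter` and both method names are hypothetical, and it assumes the tokens are available as a list of strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of kumo: removes tokens made up only of '-' characters.
final class DashTokenFilter {

    // True for "-", "--", "---", ...; false for the empty string and for words like "one-file".
    static boolean isDashOnly(String token) {
        return !token.isEmpty() && token.chars().allMatch(c -> c == '-');
    }

    // Returns a new list without the dash-only tokens, preserving order.
    static List<String> dropDashOnly(List<String> tokens) {
        return tokens.stream()
                .filter(token -> !isDashOnly(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("-", "--", "one-file", "---", "crack");
        System.out.println(dropDashOnly(tokens)); // prints [one-file, crack]
    }
}
```

Hyphenated words such as "one-file" are kept on purpose, since they appear as legitimate tokens in the output.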

@kennycason
Owner

@thomasegense thanks for the sample! I'll check it out.

@thomasegense
Author

Hi again, can you reproduce the error?

@kennycason
Owner

@thomasegense Hi, sorry, this week has been hectic for me at work. I'll try to look this over this weekend. I have this tab open in my browser. :)

@kennycason
Owner

kennycason commented Aug 5, 2018

I was able to replicate this error.

    @Test
    public void largeTextFileTest() throws IOException {
        final FrequencyAnalyzer frequencyAnalyzer = new FrequencyAnalyzer();
        final List<WordFrequency> wordFrequencies = frequencyAnalyzer.load(
                Thread.currentThread().getContextClassLoader().getResourceAsStream("text/csdb.txt"));

        wordFrequencies
                .forEach(wordFrequency ->
                                 System.out.println(
                                         String.format("[%s] -> [%d]", wordFrequency.getWord(), wordFrequency.getFrequency())));
    }

Result:

[  ] -> [258594]
[the] -> [251345]
[music] -> [82106]
[and] -> [69944]
[user] -> [66652]
[ crack] -> [55529]
[this] -> [54919]
[you] -> [54355]
[csdb] -> [53250]
[comment] -> [50887]
[submitted] -> [50417]
[for] -> [49680]
[graphics] -> [44411]
[scene] -> [40164]
[demo] -> [38855]
[crack  ] -> [37584]
[c64] -> [36656]
[crack] -> [35495]
[demo  ] -> [35339]
[can] -> [31646]
[made] -> [28503]
[commodore] -> [27584]
[find] -> [27268]
[all] -> [25895]
[one-file] -> [25843]
[intro] -> [25235]
[1990] -> [22883]
[about] -> [22095]
[out] -> [21743]
[1989] -> [21269]
[here] -> [21171]
[not] -> [21055]
[but] -> [21001]
[which] -> [20647]
[was] -> [20377]
[are] -> [20349]
[forum] -> [20110]
[release] -> [20101]
[search] -> [19774]
[sceners] -> [19406]
[page] -> [19343]
[home] -> [19306]
[1988] -> [19037]
[that] -> [18841]
[code] -> [18535]
[website] -> [18503]
[computer] -> [18459]
[] -> [18446]
[1991] -> [17545]
[comments] -> [17502]

Looking at [ crack] in the debugger shows character code 160 (U+00A0, which lies outside 7-bit ASCII). This is a non-breaking space.
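
A possible explanation, stated as an assumption rather than something checked against kumo's source: if the trim normalizer ultimately relies on `String.trim()`, a non-breaking space would survive, because `trim()` only removes characters with code points up to U+0020. The sketch below is plain Java with no kumo dependency (the class name `NbspTrimDemo` and the helper `describe` are illustrative). It lists a token's code points so the invisible character becomes visible.

```java
// Illustrative only: shows that String.trim() leaves U+00A0 (non-breaking space) in place.
final class NbspTrimDemo {

    // Renders each code point of the token as U+XXXX so invisible characters show up.
    static String describe(String token) {
        StringBuilder sb = new StringBuilder();
        token.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String token = "\u00A0crack"; // non-breaking space followed by "crack"

        System.out.println(describe(token));                  // first entry is U+00A0
        System.out.println(token.trim().length());            // 6: the leading character is not removed
        System.out.println(Character.isWhitespace('\u00A0')); // false: non-breaking spaces are excluded
        System.out.println(Character.isSpaceChar('\u00A0'));  // true: it is a Unicode space separator
    }
}
```

This also suggests why the simple unit test above passed: a file containing only ordinary spaces (U+0020) would not exercise this case.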


One unquestionable bug is the empty token, which shows up in the output above as [] -> [18446].

I will consider how to handle these cases. In the meantime, I recommend you strip character 160 (U+00A0) from your text file. Its hex code, and the regex escape that matches it, is \xA0.
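
As a stopgap, the input can be cleaned in code before it reaches the analyzer. The following is a minimal sketch under stated assumptions: the class `NbspCleaner` and its methods are hypothetical and not part of kumo, and the text is assumed to be UTF-8. It replaces each U+00A0 with an ordinary space rather than deleting it, so that words separated only by a non-breaking space are not glued together.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical pre-processing helper, not part of kumo.
final class NbspCleaner {

    // Replaces every non-breaking space (U+00A0) with a regular space (U+0020).
    static String replaceNbsp(String text) {
        return text.replace('\u00A0', ' ');
    }

    // Wraps the cleaned text in an InputStream (UTF-8 assumed).
    static InputStream cleanedStream(String text) {
        return new ByteArrayInputStream(replaceNbsp(text).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String raw = "\u00A0crack\u00A0\u00A0demo";
        String cleaned = replaceNbsp(raw);
        System.out.println("[" + cleaned + "]");        // prints [ crack  demo]
        System.out.println("[" + cleaned.trim() + "]"); // prints [crack  demo]
    }
}
```

The stream returned by `cleanedStream` could then be passed to `frequencyAnalyzer.load(...)` in the same way the tests above pass a resource stream.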
