Improve logic behind spell checking text #8

neomatrix369 · 2020-09-20T10:06:03Z

Core issue
NLP Profiler's spell checking functionality uses a third-party library, TextBlob. It does a decent job, but the scores returned per misspelt word still need to be amortised correctly across the whole text.

In other words, we need a fair measure of how bad the spelling is across the text as a whole.

At the moment it's using the below logic:

def spelling_quality_score(text: str) -> float:
    if (not isinstance(text, str)) or (len(text.strip()) == 0):
        return NaN

    tokenized_text = get_tokenized_text(text)
    misspelt_words = [
        each_word for _, each_word in enumerate(tokenized_text)
        if actual_spell_check(each_word) is not None
    ]
    avg_words_per_sentence = \
        len(tokenized_text) / get_sentence_count(text)
    result = 1 - (len(misspelt_words) / avg_words_per_sentence)

    return result if result >= 0.0 else 0.0

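This can be improved, as there are visible chances of false positive or false negative scores: the misspelt-word count is divided by the average number of words per sentence rather than by the total word count, so the result changes with how the text is split into sentences.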
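To illustrate the problem, here is a minimal sketch that applies the same arithmetic as the function above to plain counts. The function name `score_per_sentence_average` and its parameters are hypothetical stand-ins for the tokenizer, spell checker and sentence counter, which are not reproduced here.

```python
def score_per_sentence_average(
    misspelt_count: int, word_count: int, sentence_count: int
) -> float:
    """Same arithmetic as the logic quoted above, on plain counts.

    Hypothetical helper for illustration only; it is not part of NLP Profiler.
    """
    avg_words_per_sentence = word_count / sentence_count
    result = 1 - (misspelt_count / avg_words_per_sentence)
    return result if result >= 0.0 else 0.0


# The same 20 words with the same 2 misspelt words, split differently:
as_one_sentence = score_per_sentence_average(2, 20, 1)    # 1 - 2/20
as_four_sentences = score_per_sentence_average(2, 20, 4)  # 1 - 2/5

# 30 words, 3 misspelt, written as 10 short sentences: clamps to 0.0
many_short_sentences = score_per_sentence_average(3, 30, 10)

print(as_one_sentence, as_four_sentences, many_short_sentences)
```

The proportion of misspelt words is identical in the first two calls, yet the score drops once the text is split into more sentences, and the third text scores zero even though nine out of ten words are spelt correctly.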

PS: performance of this feature is being addressed in #2, so this particular issue isn't about improving its speed/performance. Performance issues may be addressed via other issues at a later stage. There have already been some significant performance improvements to the spell check and other aspects of NLP Profiler via #2.

The fix for #14 affects this issue, so the two will need to be fixed together.

Secondary issue

~~Replace the spellchecker with the package pyspellchecker (on PyPI) which appears to be closer to Peter Norvig's work.~~ Replaced with Symspellpy (https://pypi.org/project/symspellpy/).


neomatrix369 · 2020-10-07T13:22:23Z

A simpler solution would be to revert to the original logic:

score = 1 - (number_of_incorrect_words / number_of_correct_words)

and adjust the Words of Estimative Probability table to a stricter scoring:

    ["Very good", 99, 100],
    ["Quite good", 95, 99],
    ["Good", 90, 95],
    ["Pretty good", 85, 90],
    ["Bad", 60, 85],
    ["Pretty bad", 12, 60],
    ["Quite bad", 2, 12],
    ["Very bad", 0, 2]

We can tune this logic further with new input from users in the community. Eventually, this table could be made customisable or passed as a parameter to assist in the scoring.
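A minimal sketch of how the formula and the stricter table could fit together, with the table passed as a parameter. The names `simple_spelling_score`, `spelling_quality_label` and `STRICT_WEP_TABLE` are hypothetical, and several details are assumptions rather than the project's actual behaviour: the zero-division guard, the clamp to 0.0, and the bucket boundaries (lower bound inclusive, upper bound exclusive, except that the top bucket includes 100).

```python
from typing import List, Tuple

# (label, lower bound in percent, upper bound in percent)
WepTable = List[Tuple[str, float, float]]

STRICT_WEP_TABLE: WepTable = [
    ("Very good", 99, 100),
    ("Quite good", 95, 99),
    ("Good", 90, 95),
    ("Pretty good", 85, 90),
    ("Bad", 60, 85),
    ("Pretty bad", 12, 60),
    ("Quite bad", 2, 12),
    ("Very bad", 0, 2),
]


def simple_spelling_score(incorrect_words: int, correct_words: int) -> float:
    """score = 1 - (incorrect / correct), as written in the comment above."""
    if correct_words == 0:
        # Assumption: with no correct words the score is the minimum.
        return 0.0
    # Assumption: clamp at 0.0 when incorrect words outnumber correct ones.
    return max(0.0, 1 - incorrect_words / correct_words)


def spelling_quality_label(score: float, table: WepTable = STRICT_WEP_TABLE) -> str:
    """Map a score in [0, 1] to a label using the given table."""
    percent = score * 100
    top = max(high for _, _, high in table)
    for label, low, high in table:
        # Assumption: low is inclusive, high is exclusive, and the
        # highest bucket also accepts its own upper bound.
        if low <= percent < high or (percent == high == top):
            return label
    raise ValueError(f"score {score!r} is outside the table's range")


score = simple_spelling_score(incorrect_words=1, correct_words=50)
print(score, spelling_quality_label(score))

# A custom (hypothetical) table passed as a parameter:
lenient_table: WepTable = [("Acceptable", 50, 100), ("Poor", 0, 50)]
print(spelling_quality_label(0.5, lenient_table))
```

Keeping the table as an argument with a default means callers who are happy with the strict scoring change nothing, while others can supply their own thresholds.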


neomatrix369 · 2020-10-07T15:15:30Z

The new logic can be found in https://github.com/neomatrix369/nlp_profiler/blob/master/nlp_profiler/spelling_quality_check.py#L59 and the changes follow the earlier comment on this issue (#8).

It may not be the best or the optimal fix, but it's a simple fix to start with.

neomatrix369 · 2020-10-07T15:18:10Z

Issue is partially fixed via #16.

neomatrix369 added the enhancement and help wanted labels Sep 20, 2020

neomatrix369 added the Hacktoberfest label Sep 25, 2020

neomatrix369 added a commit that referenced this issue Oct 7, 2020

Tests: Spellings test have been modified to adjust for the new changes to the sentences count fix. The logic to calculate spelling score needs to be attended to, see issue #8.

91fcfb4

neomatrix369 mentioned this issue Oct 7, 2020

Sentences (in general) are getting an incorrect sentence_count value #14

Closed

4 tasks

neomatrix369 added a commit that referenced this issue Oct 7, 2020

Fix: High-level features: Spelling check: fixing the logic on how the score is calculated and also adjusting the WEP table - making it stricter. Fixes issues #8.

9844c31

neomatrix369 mentioned this issue Oct 7, 2020

Fix spelling score issue #16

Merged

neomatrix369 mentioned this issue Mar 12, 2023

Spelling checker has been modified #71

Merged

6 tasks

neomatrix369 linked a pull request Mar 12, 2023 that will close this issue

Spelling checker has been modified #71

Merged

6 tasks

neomatrix369 closed this as completed in #71 Mar 12, 2023
