Improve logic behind spell checking text #8

Closed
2 tasks done
neomatrix369 opened this issue Sep 20, 2020 · 3 comments · Fixed by #71
Labels: 2. medium-priority, enhancement, hacktoberfest, help wanted, high-level feature(s)

Comments

@neomatrix369
Owner

neomatrix369 commented Sep 20, 2020

  • Core issue
    NLP Profiler has a spell-checking feature that uses a third-party library, TextBlob. It does a decent job, but the scores returned per misspelt word still need to be correctly amortised across the whole text.

In other words, we need a fair way to evaluate how bad the spelling is across the text as a whole.

At the moment it uses the logic below:

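from math import nan as NaN  # assumption: stands in for the project's own NaN constant

def spelling_quality_score(text: str) -> float:
    # Non-string or blank input has no meaningful score
    if (not isinstance(text, str)) or (len(text.strip()) == 0):
        return NaN

    tokenized_text = get_tokenized_text(text)
    # A word counts as misspelt when actual_spell_check returns a non-None value
    misspelt_words = [
        each_word for each_word in tokenized_text
        if actual_spell_check(each_word) is not None
    ]
    # Note: the denominator is the average sentence length, not the total word count
    avg_words_per_sentence = \
        len(tokenized_text) / get_sentence_count(text)
    result = 1 - (len(misspelt_words) / avg_words_per_sentence)

    # Clamp negative results to zero
    return max(result, 0.0)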

This logic can be improved, as it is visibly prone to producing false positive or false negative scores.
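
To illustrate how skewed scores can arise, here is a small sketch that applies the same formula to plain counts instead of real text. The function name current_score and the counts are hypothetical, chosen only for illustration. It keeps the word count and the number of misspelt words fixed and varies only the number of sentences.

```python
def current_score(total_words: int, sentence_count: int, misspelt: int) -> float:
    """Apply the formula from the logic above to plain counts."""
    avg_words_per_sentence = total_words / sentence_count
    result = 1 - (misspelt / avg_words_per_sentence)
    return result if result >= 0.0 else 0.0


# Same 20 words and same 2 misspelt words, split into different sentence counts.
one_sentence = current_score(total_words=20, sentence_count=1, misspelt=2)
two_sentences = current_score(total_words=20, sentence_count=2, misspelt=2)
ten_sentences = current_score(total_words=20, sentence_count=10, misspelt=2)

print(one_sentence, two_sentences, ten_sentences)
```

The three scores come out at roughly 0.9, 0.8 and 0.0, even though the proportion of misspelt words is 10% in every case. The score depends on how the text is split into sentences rather than only on the spelling.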

PS: the performance of this feature is being addressed in #2, so this issue isn't about improving its speed. Performance may be revisited in other issues at a later stage. There have already been some significant performance improvements to the spell check and other aspects of NLP Profiler via #2.

The fix for #14 affects this issue, so the two will need to be fixed together.


  • Secondary issue

Replace the spellchecker with the package pyspellchecker (on PyPI), which appears to be closer to Peter Norvig's work. Update: replaced with symspellpy (https://pypi.org/project/symspellpy/).

@neomatrix369 neomatrix369 added enhancement New feature or request help wanted Extra attention is needed labels Sep 20, 2020
@neomatrix369 neomatrix369 added hacktoberfest-accepted Approved/merged. Part of the Hacktoberfest 2020 (https://hacktoberfest.digitalocean.com) 2. medium-priority Good if it can be attended to be soon, but not urgent enough high-level feature(s) hacktoberfest Classify topic. Part of the Hacktoberfest 2020 (https://hacktoberfest.digitalocean.com) and removed Hacktoberfest2019 hacktoberfest-accepted Approved/merged. Part of the Hacktoberfest 2020 (https://hacktoberfest.digitalocean.com) labels Oct 6, 2020
neomatrix369 added a commit that referenced this issue Oct 7, 2020
…s to the sentences count fix. The logic to calculate spelling score needs to be attended to, see issue #8.
@neomatrix369
Owner Author

A simpler solution would be to revert to the original logic:

score = 1 - (number_of_incorrect_words / number_of_correct_words)

and adjust the Words of Estimative Probability table to a stricter scoring:

    ["Very good", 99, 100],
    ["Quite good", 95, 99],
    ["Good", 90, 95],
    ["Pretty good", 85, 90],
    ["Bad", 60, 85],
    ["Pretty bad", 12, 60],
    ["Quite bad", 2, 12],
    ["Very bad", 0, 2]

We can tune this logic further with new input from users in the community. Eventually, this table could be made customisable or passed as a parameter to assist in the scoring.
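
A minimal sketch of how the formula and the stricter table could fit together, including passing the table as a parameter. The names SPELLING_QUALITY_WEP, spelling_score and score_to_words are hypothetical and not taken from the NLP Profiler code base. Two further assumptions: each row's lower bound is inclusive (the table's ranges share their boundary values), and a text with no correctly spelt words scores 0.0.

```python
# Stricter table from the comment above: [label, lower bound %, upper bound %]
SPELLING_QUALITY_WEP = [
    ["Very good", 99, 100],
    ["Quite good", 95, 99],
    ["Good", 90, 95],
    ["Pretty good", 85, 90],
    ["Bad", 60, 85],
    ["Pretty bad", 12, 60],
    ["Quite bad", 2, 12],
    ["Very bad", 0, 2],
]


def spelling_score(incorrect_words: int, correct_words: int) -> float:
    """score = 1 - (incorrect / correct), clamped so it never goes below 0."""
    if correct_words == 0:
        return 0.0  # assumption: nothing spelt correctly means the lowest score
    return max(0.0, 1 - (incorrect_words / correct_words))


def score_to_words(score: float, table=SPELLING_QUALITY_WEP) -> str:
    """Map a score in [0, 1] to a label using a table sorted from best to worst."""
    percent = score * 100
    for label, low, _high in table:
        if percent >= low:  # assumption: lower bound is inclusive
            return label
    return table[-1][0]  # fall back to the worst label


print(score_to_words(spelling_score(incorrect_words=4, correct_words=50)))
```

Because the table is an ordinary argument, a caller could supply a looser or stricter one without touching the scoring function.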

neomatrix369 added a commit that referenced this issue Oct 7, 2020
… score is calculated and also adjusting the WEP table - making it stricter. Fixes issues #8.
@neomatrix369
Owner Author

neomatrix369 commented Oct 7, 2020

The new logic can be found in https://github.com/neomatrix369/nlp_profiler/blob/master/nlp_profiler/spelling_quality_check.py#L59 and the changes are as per the comment #8 (comment).

It may not be the optimal fix, but it's a simple one to start with.

@neomatrix369
Owner Author

Issue is partially fixed via #16.

@neomatrix369 neomatrix369 linked a pull request Mar 12, 2023 that will close this issue
6 tasks