Spelling checker has been modified #71

Merged
merged 12 commits into from
Mar 12, 2023

Conversation

bitanb1999
Contributor

@bitanb1999 bitanb1999 commented Mar 11, 2023

Please check the options that you have completed and strike out the options that do not apply via this pull request:

  • a clear title and description of the Pull Request has been provided
  • you have read
      • the Contributing doc
      • the Developer Guide
  • the pull request passes the tests (`./test-coverage "tests slow-tests"`); this will also be visible via the Code coverage report and CI/CD task on the Pull Request
  • you have performed some kind of smoke test by running your changes in an isolated environment, e.g. Docker container, Google Colab, Kaggle
  • [ ] the notebooks are updated (see notebooks folder, read the Notebooks docs)
  • CHANGELOG.md has been updated (please follow the existing format)

Goal or purpose of the PR

The spelling checker previously used TextBlob and required tokenization for both spell checking and the spelling quality summary. This was slow, and the resulting score was unsatisfactory.

Changes implemented in the PR

I replaced the checker function with symspellpy, a package that claims to be much faster than both TextBlob and JamSpell. In addition, the previous score was based entirely on the ratio of the number of misspelled words to the total length of the string, which doesn't take ease of reading or "whether the phrase makes sense" into account. To resolve these issues, I used fuzzy-matching techniques that compare the original text with the corrected text and score the text accordingly.
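The difference between the two scoring approaches can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `correct_text` is a hypothetical stand-in for a SymSpell-based corrector (here just a tiny hand-written lookup table), the old score is assumed to count length in words, and the standard library's `difflib.SequenceMatcher` stands in for whichever fuzzy-matching library the PR actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for a real spelling corrector such as symspellpy.
# A real corrector would consult a frequency dictionary instead.
TOY_CORRECTIONS = {"teh": "the", "qick": "quick", "borwn": "brown"}


def correct_text(text: str) -> str:
    """Return the text with known misspellings replaced (toy corrector)."""
    return " ".join(TOY_CORRECTIONS.get(w.lower(), w) for w in text.split())


def misspelled_ratio_score(text: str) -> float:
    """Ratio-style score: 1 minus the fraction of misspelled words.

    Assumption: the text length is measured in words.
    """
    words = text.split()
    if not words:
        return 1.0
    misspelled = sum(1 for w in words if w.lower() in TOY_CORRECTIONS)
    return 1.0 - misspelled / len(words)


def fuzzy_spelling_score(text: str) -> float:
    """Fuzzy score: character-level similarity of original vs corrected text."""
    return SequenceMatcher(None, text, correct_text(text)).ratio()
```

With the toy table above, `"teh qick brown fox"` gets a ratio-style score of 0.5 because two of its four words are misspelled, while the fuzzy score stays closer to 1.0 because most characters survive the correction unchanged.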

@sourcery-ai

sourcery-ai bot commented Mar 11, 2023

Sourcery Code Quality Report

❌  Merging this PR will decrease code quality in the affected files by 2.06%.

| Quality metrics | Before | After | Change |
| --- | --- | --- | --- |
| Complexity | 3.06 ⭐ | 3.51 ⭐ | 0.45 👎 |
| Method Length | 38.27 ⭐ | 43.00 ⭐ | 4.73 👎 |
| Working memory | 4.81 ⭐ | 5.10 ⭐ | 0.29 👎 |
| Quality | 86.71% | 84.65% | -2.06% 👎 |

| Other metrics | Before | After | Change |
| --- | --- | --- | --- |
| Lines | 137 | 154 | 17 |

| Changed files | Quality Before | Quality After | Quality Change |
| --- | --- | --- | --- |
| nlp_profiler/high_level_features/ease_of_reading_check.py | 85.73% ⭐ | 85.18% ⭐ | -0.55% 👎 |
| nlp_profiler/high_level_features/spelling_quality_check.py | 87.36% ⭐ | 84.28% ⭐ | -3.08% 👎 |

Here are some functions in these files that still need a tune-up:

File Function Complexity Length Working Memory Quality Recommendation

Legend and Explanation

The emojis denote the absolute quality of the code:

  • ⭐ excellent
  • 🙂 good
  • 😞 poor
  • ⛔ very poor

The 👍 and 👎 indicate whether the quality has improved or gotten worse with this pull request.


Please see our documentation here for details on how these metrics are calculated.


@neomatrix369
Owner

neomatrix369 commented Mar 12, 2023

This PR depends on PR #69 being merged. Once that PR is merged, this one can proceed; until then, let's resolve any comments on this PR.

@neomatrix369
Owner

neomatrix369 commented Mar 12, 2023

Please also do one last check in https://github.com/neomatrix369/nlp_profiler/blob/master/CONTRIBUTING.md to see if any dependent files need changing, e.g. re-running notebooks. The Developer Guide is also worth reviewing as a closing action.

Maybe you can enhance the existing grammar check example in the notebook(s) to illustrate the new package's features.

There are notebooks in this repo; please take a look at them and re-run them on your local machine to check that your changes have taken effect and that no issues have arisen.

There are also markdown files in this repo that may need a touch-up due to this change. Can you please check whether that's the case?

Owner

@neomatrix369 neomatrix369 left a comment


LGTM - just a few changes requested

nlp_profiler/high_level_features/spelling_quality_check.py (outdated)
@neomatrix369
Owner

This PR is related to #8, have a good read of the issue to see if all or most of the requirements there are resolved by this PR

@neomatrix369 neomatrix369 linked an issue Mar 12, 2023 that may be closed by this pull request
2 tasks
@bitanb1999
Contributor Author

> This PR is related to #8, have a good read of the issue to see if all or most of the requirements there are resolved by this PR

I checked #8 and #2, and this PR addresses both. The score is now computed with a fuzzy-matching algorithm that penalizes each misspelled word as well as changes in the arrangement of tokens. See this article: https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe
Also, multiple articles report that the SymSpell package is much faster than TextBlob, so #2 is addressed as well.
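
To illustrate what penalizing the arrangement of tokens means, here is a small sketch that contrasts an order-sensitive similarity with an order-insensitive one. It is an assumption-laden illustration rather than the PR's code: both helpers are hypothetical and built on the standard library's `difflib`, and which variant the PR relies on is not stated in this conversation.

```python
from difflib import SequenceMatcher


def plain_ratio(a: str, b: str) -> float:
    """Order-sensitive similarity: rearranged tokens lower the score."""
    return SequenceMatcher(None, a, b).ratio()


def token_sorted_ratio(a: str, b: str) -> float:
    """Order-insensitive similarity: tokens are lowercased and sorted first."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, sorted_a, sorted_b).ratio()
```

For `"quick brown fox"` versus `"fox brown quick"`, `token_sorted_ratio` returns 1.0 because the same tokens appear in both strings, whereas `plain_ratio` returns less than 1.0 because the word order differs.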

@neomatrix369
Copy link
Owner

One last thing to do is update the CHANGELOG.md for this change. It's very easy to do; see how the previous entries are done.

@neomatrix369 neomatrix369 merged commit 0100ac0 into neomatrix369:master Mar 12, 2023
Labels
enhancement New feature or request good first issue Good for newcomers high-level feature(s)
Development

Successfully merging this pull request may close these issues.

Improve logic behind spell checking text
2 participants