Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix #2844 #3911; add --spellsuggest to suggest symbols in scope with similar spellings on undefined symbol error #16067

Merged
merged 16 commits into from
Mar 16, 2021

Conversation

timotheecour
Copy link
Member

@timotheecour timotheecour commented Nov 20, 2020

fix #2844
fix #3911
fix #9197

saves time when perplexed about a symbol that you think should exist but was misspelt.

Some other languages have a similar feature too, and it helps.

note

  • only suggests the symbols with smallest edit distance wrt typo
  • among those, only suggests the symbols in the nearest scope

example 1

when true:
  let foobar0 = 1
  proc main()=
    type foobar1 = int
    var foobar2 = 1
    echo foobar
  main()
nim r --spellsuggest --listfullpaths:off t11334.nim
t11334.nim(26, 10) Error: undeclared identifier: 'foobar'
  candidate misspelling: 'foobar1' [type declared in t11334.nim(24, 10)]
  candidate misspelling: 'foobar2' [var declared in t11334.nim(25, 9)]

example 2

when true:
  import strutils
  proc main()=
    strutils2.normalizeStr 1 #  candidate misspelling: 'strutils' [module declared in strutils.nim(1, 2)]
    # strutils.normalizeStr 1 # candidate misspelling: 'normalize' [proc declared in strutils.nim(304, 6)]
    # echo2 1 # candidate misspelling: 'echo' [proc declared in system.nim(1985, 8)]
  main()

future work

  • make this flag also suggest when a module import is misspellt, including things like import exitprocs instead of import std/exitprocs

  • also use spellsuggestions in more contexts, eg unrecognized pragmas, compiler flags/commands etc

  • add Damerau–Levenshtein distance algorithm to std/editdistance, which allows adjacent transpositions with cost 1 instead of cost 2, and use that instead (see misc std/editdistance timotheecour/Nim#397)

  • assign low cost (or cost 1) to moves: abcDef => defAbc; eg for pathJoin vs joinPath

  • I've left open for future work the currently un-allowed --spellsuggest (without argument), which will use some secret sauce based on length of query + max edit distance, but even with that secret sauce, --spellsuggest:N will still be useful (EDIT: now implemented)

  • I don't believe this adds any meaningful overhead so --spellsuggest can become a default once the 0 arg version is implemented
    => followup #16067 --spellSuggest #17401

  • make this work with dotExpr:

import strutils
echo "FOO".toLowerAscii2 # currently doesn't honor --spellsuggest

links

@Araq
Copy link
Member

Araq commented Nov 20, 2020

In my experience with NimEdit the "edit distance" is really quite bad for this. See https://norvig.com/spell-correct.html for ideas how to do it.

@ringabout
Copy link
Member

ringabout commented Nov 20, 2020

Great job!
Maybe fix #3911
and related issue #2844

@timotheecour
Copy link
Member Author

timotheecour commented Nov 21, 2020

@Araq norvig's algorithm is unsuitable in the context of compiler spell correction:

  • exponential in the distance: (alphabetLen*wordLen*2)^distance, alphabetLen=26+10 for nim's case (without considering even unicode)
  • in practice, prohibitively expensive for edit distance >= 3 (see below)
    and edit distance <= 2 is insufficient for long workds:
    getCurrentSourcePath => currentSourcePath requires 3
    getCurrentCompilerExecutable => getCurrentCompilerExe requires 7
  • regardless of performance issue, it's actually a bad cost function in the context of compiler spell correction, since it doesn't care about edit distance amongst the candidate set, only about the probabilities of the words (which are not useful in our context IMO).

as shown in this simplified (but realistic performance wise) example, it would take:

  • 1mn for dist = 3 to correct a mis-spelt editDistance
  • 6mn for dist = 3 to correct a mis-spelt getCurrentCompilerExe
import std/sets
import std/math
import std/times
import std/editdistance
import std/random
import std/sugar

proc main()=
  var r = initRand(987)
  var words: seq[string]
  let n = 1_000
  var buf: string
  let alphabet = {'a'..'z', '0'..'9'}
  for i in 0..<n:
    let m = (i mod 20) + 1
    buf.setLen 0
    for j in 0..<m: buf.add r.sample(alphabet)
    words.add buf

  var query = "editDistances"

  block:
    let editDist = 3 # param

    let t0 = epochTime()
    var t: HashSet[string]
    for a in words: t.incl a
    let W = query.len
    let m = pow(2*W.float*36, editDist.float).int
    let t1 = epochTime()
    var c = 0
    var prefix = "foobar"
    for j in 0..<m:
      if (prefix & $j) in t:
        c.inc
    let t2 = epochTime()
    echo (t1: t2 - t1, t2: t1 - t0, candidates: m, dummy: c)

  block:
    let t1 = epochTime()
    var c = 0
    for ai in words:
      c += editDistance(ai, query)
    let t2 = epochTime()
    echo (t: t2 - t1, dummy: c)
main()

prints, with -d:danger:
(t1: 67.17855405807495, t2: 0.0001928806304931641, candidates: 644972544, dummy: 0)
(t: 0.0008981227874755859, dummy: 12918)

=> > 1 minute is impractical.

whereas the algo in this PR would take < 1 millisecond.

for getCurrentCompilerExe: this would give: t1: 391 seconds.

In other words, you'd need 400 million symbols in scope to break-even. And even then, there are trivial ways to dramatically speedup the existing editdistance based algorithm (I can go into details if needed).

And that's only considering edit distances of size <= 3, and not even considering unicode, for which norvig's approach wouldn't work (prohibitive or at least unclear, the naive utf8 analog has issues).

norvig's algo is only useful for:

  • very large vocabulary sizes
  • small query words (since polynomial in word size)
  • edit distance 1 or 2

In my experience with NimEdit the "edit distance" is really quite bad for this

try this PR, the heuristic I used (show all shortest edit distance, taking into account scope as tie breaker) does a pretty good job IMO. One improvement (to be done as future work) is to add Damerau–Levenshtein distance algorithm to std/editdistance, which allows adjacent transpositions (writenl => writeln) with cost 1 instead of cost 2, and that's still O(word1.len * word2.len) complexity, see https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

Once this is implemented, it'll be trivial to use that instead.

@Araq
Copy link
Member

Araq commented Nov 21, 2020

only about the probabilities of the words (which are not useful in our context IMO).

This is what I was getting at and even nimsuggest uses basic identifier counting for suggestions. But maybe you're right and the current scopes provide sufficient context.

However, another thing that you should really do -- if you haven't already -- is to limit the edit-distance.

@timotheecour
Copy link
Member Author

timotheecour commented Nov 27, 2020

@Araq PTAL

However, another thing that you should really do -- if you haven't already -- is to limit the edit-distance.

instead i now have: --spellsuggest:5 to show up to 5 suggestions, up to user to choose his preferred setting on cmdline/nim cfg. An absolute threshold wouldn't be very meaningful eg for these:

getCurrentSourcePath => currentSourcePath requires 3
getCurrentCompilerExecutable => getCurrentCompilerExe requires 7

and furthermore, a limit on number of suggestions is more intuitive than a limit on edit-distance (the ranking criterion might change); see also future work in PR msg item I've left open for future work the currently un-allowed [...]

@timotheecour timotheecour changed the title add --spellsuggest to suggest symbols in scope with similar spellings on undefined symbol error fix #2844; add --spellsuggest to suggest symbols in scope with similar spellings on undefined symbol error Dec 1, 2020
@timotheecour timotheecour changed the title fix #2844; add --spellsuggest to suggest symbols in scope with similar spellings on undefined symbol error fix #2844 #3911; add --spellsuggest to suggest symbols in scope with similar spellings on undefined symbol error Dec 1, 2020
@timotheecour
Copy link
Member Author

friendly ping @Araq

@Araq
Copy link
Member

Araq commented Dec 14, 2020

Accepted but blocked by #15935

Sorry.

@ringabout
Copy link
Member

what's the progress of this PR?

@timotheecour
Copy link
Member Author

Well it was ready and green for a while but (as I kind of predicted) broke after IC PR #15935 (ditto for other PR's I care a lot about such as import {.all.} and alias); The conflict resolving are easy to fix but I wasn't able to make it work after resolving those conflicts... help welcome after I rebase + fix those conflicts

@timotheecour timotheecour force-pushed the pr_fixSpelling branch 2 times, most recently from a4e122a to 48243fe Compare March 14, 2021 23:56
@timotheecour
Copy link
Member Author

timotheecour commented Mar 15, 2021

@Araq PTAL, this PR now works again (i had to use allSyms since previous variant didn't consider imported symbols after recent changes like #15935).

In addition:

  • --spellSuggest (with no argument) is now supported (with implementation defined set of suggestions, currently using dist = dist(closest_match) ); --spellSuggest:N is still very useful and doesn't add any complexity so it should stay
  • the locations of suggestions are shown even without --declaredLocs
  • added tests

the 32bit linux CI failure is unrelated and tracked in another issue

@timotheecour timotheecour added the TODO: followup needed remove tag once fixed or tracked elsewhere label Mar 15, 2021
@timotheecour timotheecour requested a review from ringabout March 16, 2021 03:42
@Araq Araq merged commit 48eab53 into nim-lang:devel Mar 16, 2021
@timotheecour timotheecour deleted the pr_fixSpelling branch March 16, 2021 22:46
timotheecour added a commit to timotheecour/Nim that referenced this pull request Mar 16, 2021
Araq pushed a commit that referenced this pull request Mar 17, 2021
* followup #16067 --spellSuggest

* enable --spellSuggest by default

* fixup
ringabout pushed a commit to ringabout/Nim that referenced this pull request Mar 22, 2021
…s in scope with similar spellings on undefined symbol error (nim-lang#16067)

* add --spellsuggest to suggest symbols in scope with similar spellings on undefined symbol errors
* implement --spellsuggest with 0 arguments
ringabout pushed a commit to ringabout/Nim that referenced this pull request Mar 22, 2021
* followup nim-lang#16067 --spellSuggest

* enable --spellSuggest by default

* fixup
ardek66 pushed a commit to ardek66/Nim that referenced this pull request Mar 26, 2021
…s in scope with similar spellings on undefined symbol error (nim-lang#16067)

* add --spellsuggest to suggest symbols in scope with similar spellings on undefined symbol errors
* implement --spellsuggest with 0 arguments
ardek66 pushed a commit to ardek66/Nim that referenced this pull request Mar 26, 2021
* followup nim-lang#16067 --spellSuggest

* enable --spellSuggest by default

* fixup
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
TODO: followup needed remove tag once fixed or tracked elsewhere
Projects
None yet
3 participants