Support plain text .dic dictionary files #931
Comments
A file of valid words is insufficient for |
I'm not sure what that means. Could you please elaborate?
See https://github.com/crate-ci/typos/blob/master/crates/typos-dict/assets/words.csv for the dictionary format we use at compile time.
@epage thx, I understand about the conversion from "bad" to "good" words. What I don't understand is the workflow for the most typical use-case:
As such, the .dic files seem to be a perfect fit. |
Ok, I misunderstood. You aren't asking for us to treat this as a collection of words to correct to but as a list of words we shouldn't attempt to correct. Is that right? |
Exactly! Thanks :) |
P.S. And of course you may consider using these words to auto-correct INTO (e.g. if I have a custom |
Is there a spec for this format? Can you link to examples of where open source projects use these files with descriptions of how they are used? |
I am not certain there is an official "spec" similar to .csv (some variants, not perfectly standardized). For example, UTF-8 seems to be a relatively "recent" change to it, while many programs still treat those files as being in their language's own encoding (i.e. whatever encoding was commonly used for the language of the dictionary). A quick search showed these:
|
P.S. I think this is the best documentation page I found: https://proofingtoolgui.org/proofingtoolgui_files/ProofingToolGUI_manual_V30.html |
Looks like At this point, I'm going to step back and restart the conversation. Can you describe the problem being addressed ( |
My understanding was that Now, to the main question of what I would like solved: I would like to have a very easy, minimal, no-frills way to store a custom list of words per project. I have done many PRs for big FOSS projects doing spell checking, e.g. using IntelliJ's spellchecking tool to go through the code. As part of that process, I often have thousands (!!!) of words that are custom to each project, and I have to go through them one by one, "accepting" them into the dictionary. This is an extremely tedious and boring task, and I would much rather have a tool that lists all suspicious words in a plain text file, so that I can sort it and quickly read through it to delete any words that are likely spelling mistakes. Whatever is left is my new "project dictionary": a file I can check into the project. The dictionary file should not have any structure, because such files are much easier to work with when they get fairly large -- no spaces or commas or quotes or escapes, no mandatory wrapping braces, easy to edit, easy to sort the whole file if needed, easy to diff between multiple files, easy to load with LibreOffice to do some multi-file meshes or lookups, etc. P.S. A few times I even had to manually create this file out of the code by concatenating needed code files, replace all |
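The workflow described in this comment (dump every suspicious word into a plain text file, sort it, then prune it by hand) can be sketched roughly as follows. This is only an illustration, not something typos provides: the function names `candidate_words` and `to_word_list` are hypothetical, and the tokenization (runs of ASCII letters, lower-cased) is an assumption that ignores camelCase splitting and non-English alphabets.

```python
import re


def candidate_words(text: str, known: set[str]) -> list[str]:
    """Return the sorted, de-duplicated, lower-cased words found in
    `text` that are not already in `known`.

    Assumption: a "word" is a run of ASCII letters. A real spell checker
    would also split identifiers such as camelCase or snake_case.
    """
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
    return sorted(words - known)


def to_word_list(words: list[str]) -> str:
    """Serialize words as plain text: one word per line, nothing else."""
    return "".join(f"{w}\n" for w in words)


if __name__ == "__main__":
    source = "Tokio spawns a tokio runtime"
    already_known = {"a", "runtime", "spawns"}
    # The remaining words are what a reviewer would read through and prune.
    print(to_word_list(candidate_words(source, already_known)), end="")
```

The output is deliberately structure-free, so it stays easy to sort, diff, and edit by hand, which is the property the comment asks for.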
Looks like those are used by both your wooorm and LibreOffice links. This is an example of why I wanted to step back: to understand your request and how people are using these files today, so I can tell whether you are asking us to support LibreOffice dic files or whether there are uses that share a common subset. It also didn't help that when I searched on my own for the referenced Chromium dic file, I accidentally ended up in a dict file which had a different format.
Would you be able to find those and link to them? I'd like to see how projects are using them in practice. A part of all of this is that we have a way to define blessed words, so an important part of this is "why do we need something different". Prior art / meeting existing projects where they are at is important. This also helps guide discussions on auto-discovery vs specified paths in config, single or multiple files, etc.
I wonder if Speaking of, I assume we would want to support specifying these for both words and identifiers. |
(I found it with a simple github search https://github.com/search?q=path%3A*.dic&type=code ) |
Looks like tokio is using |
Sure - advanced usages are always possible -- once the simple cases are solved. They mention |
I can confirm that the good enough solution is to provide a file with known words. My use case: the code uses non-English "business" words. I already maintain a file with these valid words (it is in fact a Lack of this feature prevents me from using this tool in pre-commit checks in some of our projects. Probably generating config in |
For us to say we are supporting a format and then only supporting a fraction of it feels like it would be setting invalid expectations for users. I looked around and am not seeing other tools implement this. cspell only discusses it in passing in streetsidesoftware/cspell#4942. codespell makes no reference to a specific format but does have an "ignore file" with a line per word and a custom dictionary format. scspell uses a modified format with headers for saying what the valid words apply to, e.g. their own dict |
With all of that said, the fact that we have native support for words makes resolving this a lower priority for me.
@epage I understand your desire to have an "ideal" solution (nothing wrong with that :) ). My point with this ticket is that in my experience, the most common need is a plain text |
I'm not shooting for an ideal; I just don't want a lie. |
So the current workaround is to place Would it be possible to extend the configuration to accept a path to such a file? (I would like to not pollute my I think the format itself is not so important and solution in |
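Until a path-based option exists, one possible bridge is to generate the configuration from a plain word list. The sketch below assumes the typos `[default.extend-words]` table, where mapping a word to itself marks it as valid; check that against the typos configuration reference before relying on it. The function names are hypothetical, and rejecting words that contain quotes, backslashes, or control characters is a simplification made here to avoid implementing full TOML escaping.

```python
def toml_quote(word: str) -> str:
    """Wrap `word` in double quotes for use as a TOML key or value.

    Simplification: words that would need escaping are rejected instead
    of being escaped.
    """
    if '"' in word or "\\" in word or any(ord(c) < 0x20 for c in word):
        raise ValueError(f"unsupported character in word: {word!r}")
    return f'"{word}"'


def words_to_typos_toml(lines: list[str]) -> str:
    """Turn a one-word-per-line list into a `[default.extend-words]` table.

    Blank lines are skipped; duplicates are removed; output is sorted so
    the generated file diffs cleanly.
    """
    words = sorted({line.strip() for line in lines if line.strip()})
    out = ["[default.extend-words]"]
    # Assumed typos convention: `word = word` means "treat as correct".
    out += [f"{toml_quote(w)} = {toml_quote(w)}" for w in words]
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(words_to_typos_toml(["tokio\n", "\n", "serde\n", "tokio\n"]), end="")
```

Writing the result to a separate generated file rather than editing the main configuration by hand keeps the plain word list as the single source of truth.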
I agree, if you think |
I'm also interested in being able to provide a list of words to ignore via a simple file (no matter the extension). I would expect to be able to provide something like this via the .toml file:
[files]
extend-ignore = ["ignore1.txt", ".github/ignored.bar"]
Many projects like Chromium use standard .dic files to list all "known" words, i.e. those words that should NOT be corrected. Is it possible to add support for this? Or is this something already supported (I couldn't find it in the readme or code search)?
A .dic file is a simple text file with one word per line. I don't recall how capitalization is specified (i.e. whether it must match exactly, or whether a lower-cased word in the .dic file also allows its upper-case form to be ignored, but not the other way around).
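To make the capitalization question concrete, here is a minimal sketch of one possible rule: an entry always matches itself exactly, and an all-lowercase entry additionally accepts capitalized or upper-cased spellings, but not the other way around. This rule is an assumption for illustration only; it is not confirmed to be what Chromium or any other tool does. The function names are hypothetical, and Hunspell-style count headers and affix flags are deliberately not handled.

```python
def load_word_list(text: str) -> set[str]:
    """Parse a plain word list: one word per line, blank lines ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def is_known(word: str, known: set[str]) -> bool:
    """Decide whether `word` should be left alone by the spell checker.

    Assumed rule (not taken from any tool's documentation):
    - an exact match always counts;
    - a lower-case entry also covers differently-cased forms of itself;
    - an entry containing upper-case letters must match exactly.
    """
    if word in known:
        return True
    # Only succeeds when the list holds the all-lowercase spelling.
    return word.lower() in known


if __name__ == "__main__":
    known = load_word_list("tokio\nGitHub\n")
    for candidate in ["tokio", "Tokio", "GitHub", "github"]:
        print(candidate, is_known(candidate, known))
```

Under this rule, listing `tokio` covers `Tokio` and `TOKIO`, while listing `GitHub` does not excuse `github`.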