Automatic language classification for Untitled files #118455

Closed
isidorn opened this issue Mar 8, 2021 · 22 comments
Labels: languages-guessing (Language guessing issues), under-discussion (Issue is under discussion for relevance, priority, approach), workbench-untitled-editors (Managing of untitled editors in workbench window)

Comments

@isidorn
Contributor

isidorn commented Mar 8, 2021

We could do automatic language classification for Untitled files so users do not have to explicitly choose what language mode to use.

There are already modules for this in Python https://github.com/yoeo/guesslang
However, it would be cool to have something in JavaScript. One idea is to use TensorFlow.js and try to reuse the guesslang model.

Nice demo by Tyler:

screen_recording_2021-03-17_at_9.27.13_pm.mov
@isidorn isidorn added the under-discussion Issue is under discussion for relevance, priority, approach label Mar 8, 2021
@isidorn isidorn added this to the March 2021 milestone Mar 8, 2021
@isidorn isidorn self-assigned this Mar 8, 2021
@bpasero bpasero added the languages-guessing Language guessing issues label Mar 8, 2021
@bpasero
Member

bpasero commented Mar 8, 2021

Some related issues: see the languages-guessing label (Language guessing issues).

@bpasero
Member

bpasero commented Mar 8, 2021

I would actually also use this for the CLI support for reading from stdin. Currently we create a tmp txt file that contains the contents that are piped into VS Code, but it would be awesome if we could change the language mode based on contents 👍

@isidorn
Contributor Author

isidorn commented Mar 8, 2021

@bpasero cool, glad there are more use cases.
Did not know we have this label :)

@isidorn
Contributor Author

isidorn commented Mar 19, 2021

@TylerLeonhardt did a great experiment. We plan to continue this in the next milestone, so I am assigning it to April.
This PR has a checklist and discussion points #119325

@isidorn isidorn modified the milestones: March 2021, April 2021 Mar 19, 2021
@isidorn isidorn mentioned this issue Mar 19, 2021
8 tasks
@TylerLeonhardt
Member

@bpasero I think for your request, if we fix this: #41614

Then what you're requesting could come "for free" if we add the logic to run the model on untitled files, as I did in my prototype.

@bpasero
Member

bpasero commented Mar 19, 2021

Yeah unfortunately opening an untitled file from piping is currently not an option because we have no good way of talking from the one process directly into an untitled file. But you are probably right that this would be the right solution for that issue.

However, should we also consider supporting language detection for plain text files? I wonder if the detection should also run when you open a txt file without having a language mode set. If we think that makes sense, it would solve the case for piping too.

@TylerLeonhardt
Member

In my eyes a .txt file is a plaintext file, so I don't really think that scenario needs to be addressed. But if we were to run detection there, we'd need to:

  • change the file extension as well
  • maybe run it less often (we said once every 100ish chars for untitled), since the user is less likely to want the detection logic than in the untitled case (then again, running the model might not be intensive enough to need this)
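
The "once every 100ish chars" idea could be sketched roughly as below. This is only an illustration, not the prototype's actual code: `DetectionThrottle` and `recordEdit` are hypothetical names, and the threshold of 100 is just the rough figure mentioned above.

```typescript
// Hypothetical helper (not a VS Code API): decides when to re-run language
// detection, based on how many characters changed since the last run.
class DetectionThrottle {
  private changedSinceLastRun = 0;
  private readonly threshold: number;

  // `threshold` defaults to the "100ish chars" figure from the discussion.
  // A .txt file could be given a larger value so the model runs less often.
  constructor(threshold: number = 100) {
    this.threshold = threshold;
  }

  // Record an edit and report whether detection should run now.
  recordEdit(charsChanged: number): boolean {
    this.changedSinceLastRun += Math.abs(charsChanged);
    if (this.changedSinceLastRun >= this.threshold) {
      // Reset the counter so the next run needs another full threshold.
      this.changedSinceLastRun = 0;
      return true;
    }
    return false;
  }
}
```

An untitled editor might use `new DetectionThrottle()` while a plain text file might use something like `new DetectionThrottle(500)`; both numbers would need tuning against real profiling data.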

@bpasero
Member

bpasero commented Mar 19, 2021

I am not sure the file extension needs to change, we can change the language today in any file without the need to change the file extension. And this is also persisted between restarts for as long as the editor is not closed.

@TylerLeonhardt
Member

> I am not sure the file extension needs to change, we can change the language today in any file without the need to change the file extension.

I'm not sure all language runtimes can handle unexpected file extensions (PowerShell can't, for example), so if you try to F5 debug a file that is a .txt file even though the language mode is set to Python/PowerShell/TypeScript, the debugger might get confused.

I'd want the language mode to change and then have F5 debugging and running in the terminal (node foo.js) all work flawlessly. Leaving the file as a txt might prevent that or cause confusion.

@bpasero
Member

bpasero commented Mar 19, 2021

Yeah I think that is fine, I actually think #41614 is doable with some file watching tricks so I guess it is another good reason to look into it eventually.

@isidorn isidorn modified the milestones: April 2021, May 2021 Mar 29, 2021
@isidorn isidorn modified the milestones: May 2021, June 2021 May 4, 2021
@bpasero bpasero added the workbench-untitled-editors Managing of untitled editors in workbench window label May 21, 2021
@isidorn
Contributor Author

isidorn commented Jun 10, 2021

Also assigning @TylerLeonhardt, since I know he is interested and I will be on vacation for half of June and most of July.
Tyler, if you do not have time, feel free to unassign yourself. Thanks!

@isidorn
Contributor Author

isidorn commented Jun 11, 2021

After discussion with @joaomoreno here are some of the things we should do in order to successfully ship this as part of VS Code:

  • Update the model (and figure out how to do this) so it supports more languages. More details here Guesslang in VS Code yoeo/guesslang#29
  • Compress the model, so it can be shipped as part of VS Code. Joao just used gzip to make it 20kb so we are good here
  • Check the model's performance so we know how often we can run it and whether it can run in the renderer process. Maybe @pyu10055 already has some ideas. Either way, we should profile with Chrome DevTools
  • Is the performance affected by the length of the text passed in the model? We need to figure out how large of a text should be sent and how often.
  • Fine-tune how we use model confidence based on our VS Code audience. For example, if the model says 0.3 confidence for CoffeeScript and 0.25 confidence for TypeScript, we should obviously choose TypeScript. I can dig out the numbers for our most popular languages and we can use them as a hint for which languages to lean towards.
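
As a rough illustration of the last point, model confidences could be weighted by a popularity prior before picking a language. Everything in this sketch is an assumption: the `Guess` shape, the `pickLanguage` function, the weights in `popularity`, and the `minScore` and `defaultWeight` cutoffs are made up for the example and are not taken from guesslang or VS Code.

```typescript
// Assumed shape of one model result; the real model output may differ.
interface Guess {
  languageId: string;
  confidence: number;
}

// Hypothetical popularity weights. Real values would come from usage data.
const popularity: Record<string, number> = {
  typescript: 1.0,
  coffeescript: 0.2,
};

// Pick the language with the highest popularity-weighted confidence.
// Returns undefined when nothing scores at least `minScore`, in which
// case the caller would keep the current language mode.
function pickLanguage(
  guesses: Guess[],
  prior: Record<string, number>,
  minScore: number = 0.1,
  defaultWeight: number = 0.5,
): string | undefined {
  let bestId: string | undefined;
  let bestScore = -Infinity;
  for (const guess of guesses) {
    // Languages missing from the prior get a neutral default weight.
    const weight = prior[guess.languageId] ?? defaultWeight;
    const score = guess.confidence * weight;
    if (score > bestScore) {
      bestId = guess.languageId;
      bestScore = score;
    }
  }
  return bestScore >= minScore ? bestId : undefined;
}
```

With the example from the list, `pickLanguage([{ languageId: "coffeescript", confidence: 0.3 }, { languageId: "typescript", confidence: 0.25 }], popularity)` picks `"typescript"`, because the weighted CoffeeScript score falls below the TypeScript one.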

@TylerLeonhardt feel free to update this list and let me know what you think

fyi @yoeo

@TylerLeonhardt
Member

TylerLeonhardt commented Jun 11, 2021

A couple of pre-reqs:

Also if you'd like to play with the experience today (not fine tuned, but works...) the code is published as an extension here:
https://marketplace.visualstudio.com/items?itemName=TylerLeonhardt.auto-language

@TylerLeonhardt TylerLeonhardt removed this from the June 2021 milestone Jun 16, 2021
@TylerLeonhardt TylerLeonhardt added this to the July 2021 milestone Jun 16, 2021
@TylerLeonhardt
Member

Moving this to July, as there are a few things above that we are blocked on and that I don't think we'll figure out in the next week or so.

@ghost

ghost commented Jun 24, 2021

Is it possible to increase the accuracy using language heuristics data from github/linguist?

@isidorn
Contributor Author

isidorn commented Jun 25, 2021

@4086606 Good idea. However, that library is a Ruby library, so we would have a dependency on Ruby, which is a no-go for us since we want this working in the browser. Unless there is a JS alternative?

@ghost

ghost commented Jun 25, 2021

Should be able to compile the entire gem to WASM, but I do wonder about size

@isidorn
Contributor Author

isidorn commented Jun 25, 2021

If you succeed do let us know how it went and the size. Though I see this as step 2, only in the case that the initial model classification does not prove super accurate.

@ghost

ghost commented Jun 25, 2021

It seems like we will need to take the YAML files from that library and reimplement the heuristics logic. You're right that it should be a step 2, seeing as it's rather blind to content and the accuracy is limited.

@ghost

ghost commented Jun 26, 2021

@isidorn didn't work out - github/linguist can only disambiguate file extensions. Guesslang could use the linguist samples to support every single GitHub language.

@TylerLeonhardt
Member

TylerLeonhardt commented Jul 12, 2021

I was able to address the pre-reqs and published a new version of my extension that doesn't make a single network request:
https://marketplace.visualstudio.com/items?itemName=TylerLeonhardt.auto-language

Next order of business is to understand how to bring this into Core.

Also, I used @yoeo's updated model with support for extra languages, and JSON + YAML is sooooo nice.

@TylerLeonhardt
Member

Closing in favor of #129004

@github-actions github-actions bot locked and limited conversation to collaborators Sep 7, 2021