Automatic language classification for Untitled files #118455

isidorn · 2021-03-08T15:48:14Z

We could do automatic language classification for Untitled files so users do not have to explicitly choose what language mode to use.

There are already modules for this in Python https://github.com/yoeo/guesslang
However, it would be cool to have something in JavaScript. One idea is to use TensorFlow.js and try to reuse the guesslang model.

Nice demo by Tyler:

screen_recording_2021-03-17_at_9.27.13_pm.mov

bpasero · 2021-03-08T16:53:31Z

Some related issues: see the languages-guessing label (Language guessing issues).

bpasero · 2021-03-08T16:55:04Z

I would actually also use this for the CLI support for reading from stdin. Currently we create a tmp txt file that contains the contents that are piped into VS Code, but it would be awesome if we could change the language mode based on contents 👍

isidorn · 2021-03-08T17:08:00Z

@bpasero cool, glad there are more use cases.
Did not know we have this label :)

isidorn · 2021-03-19T13:39:23Z

@TylerLeonhardt did a great experiment. We plan to continue on this next milestone, thus assigning to April.
This PR has a checklist and discussion points #119325

TylerLeonhardt · 2021-03-19T14:28:14Z

@bpasero I think for your request, if we fix this: #41614

Then what you're requesting could come "for free" if we add the logic to run the model on untitled files as I did in my prototype.

bpasero · 2021-03-19T14:31:50Z

Yeah unfortunately opening an untitled file from piping is currently not an option because we have no good way of talking from the one process directly into an untitled file. But you are probably right that this would be the right solution for that issue.

However, are we also considering language detection for plain text files? I wonder if the detection should also run when you open a txt file without having a language mode set. If we think that makes sense, it would solve the case for piping too.

TylerLeonhardt · 2021-03-19T14:59:14Z

In my eyes a .txt file is a plaintext file, so I don't really think that scenario needs to be addressed. But if we were to run detection there, we'd need to:

change the file extension as well
maybe run it less often (we said once every 100ish chars for untitled) since it's less likely that the user wants the detection logic than in the untitled case (then again, running the model might not be intensive enough to need this)
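To make the "once every 100ish chars" idea concrete, here is a minimal sketch of a length-based throttle. `DetectionThrottle` and `shouldDetect` are hypothetical names, not part of the prototype, and the larger interval for .txt files is an arbitrary placeholder rather than a measured value.

```typescript
// Hypothetical helper: decides when to re-run language detection on a buffer.
class DetectionThrottle {
  private lastDetectedLength = 0;
  private readonly minDelta: number;

  constructor(minDelta: number = 100) {
    this.minDelta = minDelta;
  }

  // Returns true when the content length has changed by at least `minDelta`
  // characters since the last time detection was allowed to run.
  shouldDetect(currentLength: number): boolean {
    if (Math.abs(currentLength - this.lastDetectedLength) < this.minDelta) {
      return false;
    }
    this.lastDetectedLength = currentLength;
    return true;
  }
}

// Assumed intervals: roughly 100 chars for untitled files as discussed above,
// and an arbitrary, larger value for .txt files where detection is less likely wanted.
const untitledThrottle = new DetectionThrottle(100);
const plainTextThrottle = new DetectionThrottle(400);
```

Counting the change in length rather than keystrokes means a large paste or deletion also triggers a fresh detection.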

bpasero · 2021-03-19T16:05:13Z

I am not sure the file extension needs to change, we can change the language today in any file without the need to change the file extension. And this is also persisted between restarts for as long as the editor is not closed.

TylerLeonhardt · 2021-03-19T16:42:49Z

> I am not sure the file extension needs to change, we can change the language today in any file without the need to change the file extension.

I'm not sure all language runtimes can handle non-expected file extensions (PowerShell can't, for example) so if you try to F5 debug a file, and that file is a .txt file even though the language mode is set to Python/PowerShell/TypeScript, the debugger might get confused.

I'd want the language mode to change and then F5 debugging/running in their terminal (node foo.js) to all work flawlessly and leaving it as a txt might prevent that/cause confusion.

bpasero · 2021-03-19T16:58:07Z

Yeah I think that is fine, I actually think #41614 is doable with some file watching tricks so I guess it is another good reason to look into it eventually.

isidorn · 2021-06-10T17:13:57Z

Assigning also @TylerLeonhardt since I know he is interested and half of June and most of July I will be on vacation.
Tyler if you will not have time feel free to unassign yourself. Thanks!

isidorn · 2021-06-11T09:53:05Z

After discussion with @joaomoreno here are some of the things we should do in order to successfully ship this as part of VS Code:

Update the model (and figure out how to do this) so it supports more languages. More details here: Guesslang in VS Code (yoeo/guesslang#29)
Compress the model so it can be shipped as part of VS Code. Joao just used gzip to make it 20kb, so we are good here
Check the model's performance so we know how often we can run it and whether it can run in the renderer process. Maybe @pyu10055 already has some ideas. Either way, we should profile it with Chrome dev tools
Is the performance affected by the length of the text passed to the model? We need to figure out how much text should be sent and how often.
Fine-tune how we use model confidence based on our VS Code audience. For example, if the model says 0.3 confidence CoffeeScript and 0.25 confidence TypeScript, we should obviously choose TypeScript. I can dig out the numbers for our most popular languages and we can use that as a hint for which languages to lean towards.
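One way the last point could work is to multiply each confidence by a popularity weight before picking a winner. The sketch below is only an illustration: `rerank`, the `Guess` shape, and the weights are all hypothetical, and real weights would have to come from actual usage numbers.

```typescript
interface Guess {
  languageId: string;
  confidence: number;
}

// Hypothetical popularity weights; real values would come from usage data.
const popularityPrior: Record<string, number> = {
  typescript: 1.0,
  coffeescript: 0.2,
};

// Scales each model confidence by the language's popularity weight and
// returns the guesses sorted from most to least likely.
function rerank(
  guesses: Guess[],
  prior: Record<string, number>,
  defaultWeight: number = 0.5
): Guess[] {
  return guesses
    .map((g) => ({
      languageId: g.languageId,
      confidence: g.confidence * (prior[g.languageId] ?? defaultWeight),
    }))
    .sort((a, b) => b.confidence - a.confidence);
}

// The example from the list above: the raw model output slightly prefers CoffeeScript.
const ranked = rerank(
  [
    { languageId: "coffeescript", confidence: 0.3 },
    { languageId: "typescript", confidence: 0.25 },
  ],
  popularityPrior
);
// With these assumed weights, TypeScript ends up first.
```

A multiplicative weight keeps the model's ordering intact when two languages are similarly popular and only flips close calls like the one above.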

@TylerLeonhardt feel free to update this list and let me know what you think

fyi @yoeo

TylerLeonhardt · 2021-06-11T15:50:18Z

A couple of pre-reqs:

Figure out how to convert the native model to the tfjs model: "Can not convert TensorFlow model to TensorFlowJS: Unsupported Ops in the model before optimization" tensorflow/tfjs#4838 (comment)
Figure out how to get tfjs to load the model over something other than http: "Can not convert TensorFlow model to TensorFlowJS: Unsupported Ops in the model before optimization" tensorflow/tfjs#4838 (comment)
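For the second pre-req, one possible direction is a custom load handler that serves the model from bytes already in memory instead of a URL. The sketch below does not import TensorFlow.js: the interfaces are local stand-ins whose shapes are assumed to mirror the tfjs `IOHandler` and `ModelArtifacts` types and the `model.json` layout, and `toArtifacts` and `inMemoryHandler` are hypothetical names. It is not necessarily how the linked comment solved it.

```typescript
// Local stand-ins for the TensorFlow.js I/O types (assumed shapes, not imported).
interface WeightSpec {
  name: string;
  shape: number[];
  dtype: string;
}
interface ModelArtifacts {
  modelTopology: unknown;
  weightSpecs: WeightSpec[];
  weightData: ArrayBuffer;
}
interface LoadHandler {
  load(): Promise<ModelArtifacts>;
}

// Assumed layout of a converted model.json file.
interface ModelJson {
  modelTopology: unknown;
  weightsManifest: { paths: string[]; weights: WeightSpec[] }[];
}

// Builds the artifacts from the model.json text and the raw weight bytes,
// both of which the caller has already read from disk or a bundle.
function toArtifacts(modelJsonText: string, weightData: ArrayBuffer): ModelArtifacts {
  const modelJson = JSON.parse(modelJsonText) as ModelJson;
  const weightSpecs: WeightSpec[] = [];
  for (const group of modelJson.weightsManifest) {
    weightSpecs.push(...group.weights);
  }
  return { modelTopology: modelJson.modelTopology, weightSpecs, weightData };
}

// Wraps the artifacts in a handler object so that no network request is needed.
function inMemoryHandler(modelJsonText: string, weightData: ArrayBuffer): LoadHandler {
  return { load: () => Promise.resolve(toArtifacts(modelJsonText, weightData)) };
}
```

With the real library, an object of this shape could presumably be passed to `tf.loadGraphModel` in place of a URL string. If the weights are split across several shard files, their bytes would need to be concatenated in manifest order first.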

Also if you'd like to play with the experience today (not fine tuned, but works...) the code is published as an extension here:
https://marketplace.visualstudio.com/items?itemName=TylerLeonhardt.auto-language

TylerLeonhardt · 2021-06-16T17:37:53Z

Moving this to July, as there are a few things above that we are blocked on and that I don't think we'll figure out in the next week or so.

ghost · 2021-06-24T22:59:42Z

Is it possible to increase the accuracy using language heuristics data from github/linguist?

isidorn · 2021-06-25T08:02:31Z

@4086606 Good idea. However, that library is a Ruby library, so we would have a dependency on Ruby, which is a no-go for us since we want this working in the browser. Unless there is a JS alternative?

ghost · 2021-06-25T08:27:49Z

Should be able to compile the entire gem to WASM, but I do wonder about size

isidorn · 2021-06-25T09:00:28Z

If you succeed do let us know how it went and the size. Though I see this as step 2, only in the case that the initial model classification does not prove super accurate.

ghost · 2021-06-25T18:20:41Z

It seems like we will need to take the YAML files from that library and reimplement the heuristics logic. You're right that it should be a step 2, seeing as it's rather blind to content and the accuracy is limited.

ghost · 2021-06-26T13:59:41Z

@isidorn didn't work out: github/linguist can only disambiguate file extensions. Guesslang could use the linguist samples to support every single GitHub language.
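To illustrate why extension-based heuristics do not help with Untitled files, here is a small sketch of that style of disambiguation. The rules are hypothetical and only loosely in the spirit of linguist's heuristics; they are not copied from its YAML files, and `disambiguate` is an illustrative name.

```typescript
interface Rule {
  languageId: string;
  pattern: RegExp;
}

// Hypothetical rules keyed by file extension: the extension narrows the
// candidates, and the patterns only choose between those candidates.
const disambiguations: Record<string, Rule[]> = {
  ".h": [
    { languageId: "objective-c", pattern: /^\s*@(interface|protocol|end)\b/m },
    { languageId: "cpp", pattern: /^\s*(template\s*<|class\s+\w+|namespace\s+\w+)/m },
  ],
};

// Returns the first language whose pattern matches, or undefined when the
// extension has no rules or none of them match.
function disambiguate(extension: string, content: string): string | undefined {
  const rules = disambiguations[extension];
  if (!rules) {
    return undefined;
  }
  for (const rule of rules) {
    if (rule.pattern.test(content)) {
      return rule.languageId;
    }
  }
  return undefined;
}
```

An Untitled file has no extension to look up, so a scheme like this returns nothing for it no matter what the content is, which is where a content-based model is needed.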

TylerLeonhardt · 2021-07-12T21:58:26Z

I was able to address the pre-reqs and published a new version of my extension that doesn't make a single network request:
https://marketplace.visualstudio.com/items?itemName=TylerLeonhardt.auto-language

Next order of business is to understand how to bring this into Core.

Also, I used @yoeo's updated model with extra language support, and JSON + YAML is sooooo nice.

TylerLeonhardt · 2021-07-19T17:42:47Z

Closing in favor of #129004

isidorn added the under-discussion Issue is under discussion for relevance, priority, approach label Mar 8, 2021

isidorn added this to the March 2021 milestone Mar 8, 2021

isidorn self-assigned this Mar 8, 2021

egamma mentioned this issue Mar 8, 2021

Iteration Plan for March 2021 #118334

Closed

76 tasks

bpasero added the languages-guessing Language guessing issues label Mar 8, 2021

isidorn modified the milestones: March 2021, April 2021 Mar 19, 2021

isidorn mentioned this issue Mar 19, 2021

Language detection #119325

Closed

8 tasks

isidorn mentioned this issue Mar 19, 2021

Can not convert TensorFlow model to TensorFlowJS: Unsupported Ops in the model before optimization tensorflow/tfjs#4838

Closed

isidorn modified the milestones: April 2021, May 2021 Mar 29, 2021

isidorn modified the milestones: May 2021, June 2021 May 4, 2021

bpasero added the workbench-untitled-editors Managing of untitled editors in workbench window label May 21, 2021

isidorn assigned TylerLeonhardt Jun 10, 2021

isidorn mentioned this issue Jun 11, 2021

Guesslang in VS Code yoeo/guesslang#29

Closed

TylerLeonhardt removed this from the June 2021 milestone Jun 16, 2021

TylerLeonhardt added this to the July 2021 milestone Jun 16, 2021

This was referenced Jul 14, 2021

Initial support for language detection #128708

Merged

Automatic language detection plan #129004

Closed

TylerLeonhardt closed this as completed Jul 19, 2021

AlexDev2020 mentioned this issue Aug 11, 2021

Get rid of association files microsoft/vscode-cpptools#7934

Closed

github-actions bot locked and limited conversation to collaborators Sep 7, 2021
