Improve Programming language detection and classification #1445

pombredanne · 2019-03-13T08:26:39Z

Description

ScanCode's programming language detection is not as accurate as it could be, and it is important to get this right to drive further automation. We also need to automatically classify each file into facets when possible.

The goal of this ticket is to improve the quality of programming language detection, which uses only Pygments today and could use another tool (e.g. a Bayesian classifier such as GitHub Linguist or enry?), and to create and implement a flexible framework of rules to automate assigning files to facets, which could use machine learning and a classifier.

See https://github.com/nexB/aboutcode/wiki/GSOC-2019#improve-programming-language-detection-and-classification-in-scancode

Here are some existing tools for general file type and programming language detection:
In use today:

Python stdlib mime detection (mimetypes): based on extensions only AFAIK. We use it.
libmagic: we use it with our own ctypes binding; it would need to be upgraded to the latest libmagic as part of the project.
Pygments lexers: this is a code lexing and highlighting library, so it also detects programming languages as a side effect. GitHub also used it in Linguist a while back.

(We also use a Shannon entropy detector and binaryornot to detect binaries.)
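
For readers unfamiliar with entropy-based binary detection, here is a minimal sketch of the idea. It is not ScanCode's actual detector: the function names, the NUL-byte shortcut, and the 6.0 bits-per-byte threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # byte value -> number of occurrences
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_binary(data: bytes, threshold: float = 6.0) -> bool:
    """Hypothetical heuristic: a NUL byte or very high entropy suggests binary.

    The threshold is an illustrative assumption, not a tuned value.
    """
    return b"\x00" in data or shannon_entropy(data) > threshold
```

Plain source text tends to use a small alphabet and scores well below the threshold, while compressed or encrypted content approaches 8 bits per byte.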

Things to look at could include:

freedesktop shared-mime-info: a signature-based approach and the gold standard on Linux desktops and more. There are a few Python libraries that support this.
GitHub Linguist: in Ruby, used to count LOC and detect languages. Uses a combination of signatures/lexers from Sublime and a naive Bayesian classifier on top, AFAICR.
douban linguist: a Python port of GitHub Linguist; interesting but not very active.
enry: a Go port of GitHub Linguist
ohcount: uses Ragel lexers
guesslang (https://github.com/yoeo/guesslang): uses TensorFlow
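
To make the "naive Bayesian classifier" option concrete, here is a toy token-based sketch in plain Python. It is only an illustration of the general technique, not how Linguist or enry implement it; the tokenizer, the class name, and the tiny training samples in the usage note are assumptions.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed tokenizer: identifiers/keywords, plus single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(text):
    return TOKEN_RE.findall(text)


class NaiveBayesLanguageClassifier:
    """Hypothetical multinomial naive Bayes over source tokens."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token -> count
        self.sample_counts = Counter()            # language -> training samples
        self.vocabulary = set()

    def train(self, language, text):
        tokens = tokenize(text)
        self.token_counts[language].update(tokens)
        self.sample_counts[language] += 1
        self.vocabulary.update(tokens)

    def scores(self, text):
        """Return a log-probability score per trained language."""
        tokens = tokenize(text)
        total_samples = sum(self.sample_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for language, counts in self.token_counts.items():
            # Prior: share of training samples for this language.
            log_prob = math.log(self.sample_counts[language] / total_samples)
            total_tokens = sum(counts.values())
            for token in tokens:
                # Add-one (Laplace) smoothing so unseen tokens do not zero out.
                log_prob += math.log((counts[token] + 1) / (total_tokens + vocab_size))
            result[language] = log_prob
        return result

    def classify(self, text):
        """Return the best-scoring language, or None if nothing was trained."""
        scores = self.scores(text)
        return max(scores, key=scores.get) if scores else None
```

Usage would look something like this, with real training corpora in place of the one-line samples:

```python
clf = NaiveBayesLanguageClassifier()
clf.train("python", "def foo(self):\n    import os\n    return None\n")
clf.train("ruby", "def foo\n  puts 'hi'\nend\nrequire 'json'\n")
print(clf.classify("import sys\ndef bar(self):\n    return None"))
```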

See also: #1036 #1012 and #426 #1355 #1201

Ritvyk · 2019-03-24T14:31:21Z

Hi there!
We can also use regular expressions to detect the programming language in which the code has been written.
I will soon upload source code for it; working on it right now!

mjherzog · 2019-09-23T22:47:44Z

Some recent examples of errors for Programming Language (pygments):

.rST files reported as VB.Net
.yaml and .yml files reported as ActionScript3
.less files reported as GAS
.md files reported as Objective-C

pombredanne · 2019-09-24T14:12:14Z

@mjherzog thanks. I pushed an updated Pygments library in 4aaec8c, but this is only a first baby step.

mjherzog · 2019-09-24T14:46:28Z

I don't know how/if this factors into a solution, but I would say that "false positives" are the main concern. It would be better for a .rST file to be reported as No Value Detected for Programming Language than as a false positive for VB.Net.
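
One way to read the "prefer no value over a false positive" idea is as a post-filter on whatever the detector guesses. The sketch below is only that, under stated assumptions: `filter_language`, the deny-list, the `confidence` input, and the 0.8 cutoff are hypothetical and not part of ScanCode or Pygments. The extensions come from the misdetections listed earlier in this thread.

```python
from pathlib import PurePath

# Hypothetical deny-list: extensions from the misdetections reported above,
# for which a lexer guess should not be trusted as a programming language.
NON_PROGRAMMING_EXTENSIONS = {".rst", ".yaml", ".yml", ".less", ".md"}


def filter_language(path, guessed_language, confidence, min_confidence=0.8):
    """Return guessed_language only when it looks trustworthy, else None.

    `confidence` is assumed to be a 0.0-1.0 score supplied by the detector;
    `min_confidence` is an illustrative cutoff, not a tuned value.
    """
    extension = PurePath(path).suffix.lower()
    if extension in NON_PROGRAMMING_EXTENSIONS:
        return None  # report "no value detected" instead of a wrong language
    if confidence < min_confidence:
        return None
    return guessed_language
```

With this in place, `filter_language("docs/index.rST", "VB.Net", 0.9)` yields no language at all instead of VB.Net.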

pombredanne added new feature file info labels Mar 13, 2019

pombredanne mentioned this issue Jan 13, 2024

Meta Issue: File classification and categorization #3639

Open
