ScanCode's programming language detection is not as accurate as it could be, and it is important to get this right to drive further automation. We also need to automatically classify each file into facets when possible.
The goal of this ticket is to improve the quality of programming language detection, which uses only Pygments today and could use another tool (e.g. a Bayesian classifier like GitHub Linguist or enry?), and to create and implement a flexible framework of rules to automate assigning files to facets, which could use some machine learning and a classifier.
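As a rough illustration of what a rule-based facet framework could look like, here is a minimal sketch using glob patterns where the first matching rule wins. The facet names, the patterns, and the `assign_facet` function are all hypothetical and are not taken from ScanCode.

```python
from fnmatch import fnmatchcase

# Hypothetical rules as (glob pattern, facet) pairs; the first match wins.
# Neither the patterns nor the facet names come from ScanCode itself.
FACET_RULES = [
    ("*/tests/*", "tests"),
    ("*/test_*.py", "tests"),
    ("*/docs/*", "docs"),
    ("*.md", "docs"),
    ("*.rst", "docs"),
    ("*/examples/*", "examples"),
    ("*.json", "data"),
]


def assign_facet(path, rules=FACET_RULES, default="core"):
    """Return the facet of the first rule matching path, else the default."""
    # Normalize separators and add a leading slash so that "*/tests/*"
    # also matches a top-level "tests/" directory.
    normalized = "/" + path.replace("\\", "/").lstrip("/")
    for pattern, facet in rules:
        if fnmatchcase(normalized, pattern):
            return facet
    return default


if __name__ == "__main__":
    for sample in ("src/tests/test_scan.py", "README.rst", "src/scancode/cli.py"):
        print(sample, "->", assign_facet(sample))
```

Keeping the rules as plain data means they could later be loaded from a config file or supplemented by a classifier for files that no rule matches.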
Here are some existing tools for general file type and programming language detection:
In use today:
Python stdlib mimetypes detection: based on file extensions only, AFAIK. We use it.
libmagic: we use it with our own ctypes binding, and it would need to be upgraded to the latest libmagic as part of the project.
Pygments lexers: this is a code lexing and highlighting library, so it also detects programming languages as a side effect. This is also what GitHub used in Linguist a while back.
(We also use a Shannon entropy detector and binaryornot to detect binaries.)
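To make two of the simpler techniques above concrete, here is a small stdlib-only sketch: extension-based MIME guessing with `mimetypes.guess_type`, and a Shannon entropy measure over bytes. The helper names are made up for illustration, and this is not ScanCode's actual code. How to turn an entropy value into a "binary or not" decision is left open, since any threshold would be an assumption.

```python
import math
import mimetypes
from collections import Counter


def guess_mime_from_name(filename):
    """Guess a MIME type from the file name alone (extension-based)."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    print(guess_mime_from_name("index.html"))
    # Repetitive text has low entropy; uniformly distributed bytes reach 8 bits.
    print(shannon_entropy(b"aaaaaaaa"))
    print(shannon_entropy(bytes(range(256))))
```

Because the MIME guess never looks at file content, a misnamed file gets the wrong type, which is one reason content-based tools such as libmagic and Pygments are used alongside it.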
GitHub Linguist: written in Ruby, used to count LOC and detect languages. Uses a combination of signatures/lexers from Sublime and a naive Bayesian classifier on top, AFAICR.
douban linguist: a Python port of GitHub Linguist. Interesting, but not very active.
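For readers unfamiliar with the naive Bayesian approach mentioned above, here is a toy multinomial naive Bayes classifier over source tokens with add-one smoothing. It is only a sketch of the general technique, not Linguist's implementation. The tokenizer, the class name, and the two tiny training samples are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Words and single punctuation characters; digits and whitespace are skipped.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(source):
    return TOKEN_RE.findall(source)


class NaiveBayesLanguageClassifier:
    """Toy multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token counts
        self.doc_counts = Counter()               # language -> sample count
        self.vocabulary = set()

    def train(self, language, source):
        tokens = tokenize(source)
        self.token_counts[language].update(tokens)
        self.doc_counts[language] += 1
        self.vocabulary.update(tokens)

    def log_scores(self, source):
        tokens = tokenize(source)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for language, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods per token.
            score = math.log(self.doc_counts[language] / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
            scores[language] = score
        return scores

    def classify(self, source):
        scores = self.log_scores(source)
        return max(scores, key=scores.get)


classifier = NaiveBayesLanguageClassifier()
classifier.train("python", "def main():\n    import os\n    print(os.getcwd())\n")
classifier.train("ruby", "def main\n  require 'json'\n  puts JSON.dump({})\nend\n")

if __name__ == "__main__":
    print(classifier.classify("import sys\nprint(sys.argv)"))
```

A real classifier would need a large training corpus per language. Note that `classify` always returns some language, so on its own it does nothing to avoid false positives.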
Hi there!
We could also use regular expressions to detect the programming language a piece of code is written in.
I will upload the source code for this soon; I am working on it right now.
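A minimal sketch of what regex-based detection could look like, assuming a small hand-written table of per-language patterns. The patterns, the `LANGUAGE_PATTERNS` table, and `detect_language` are hypothetical and only cover three languages. The function returns `None` when nothing matches or when the result is a tie.

```python
import re

# Hypothetical signature patterns; a real table would need many more entries.
LANGUAGE_PATTERNS = {
    "Python": [
        r"^\s*def \w+\(.*\):",
        r"^\s*import \w+",
        r"^\s*from \w[\w.]* import ",
    ],
    "C": [
        r"^\s*#include\s*[<\"]",
        r"\bint main\s*\(",
    ],
    "Ruby": [
        r"^\s*require\s+['\"]",
        r"^\s*end\s*$",
        r"^\s*def \w+\s*$",
    ],
}

COMPILED = {
    language: [re.compile(pattern, re.MULTILINE) for pattern in patterns]
    for language, patterns in LANGUAGE_PATTERNS.items()
}


def detect_language(source):
    """Return the language with the most matching patterns, or None."""
    hits = {
        language: sum(1 for pattern in patterns if pattern.search(source))
        for language, patterns in COMPILED.items()
    }
    best = max(hits.values())
    if best == 0:
        return None  # nothing matched: report no language
    winners = [language for language, count in hits.items() if count == best]
    # A tie is treated as "unknown" rather than picking arbitrarily.
    return winners[0] if len(winners) == 1 else None


if __name__ == "__main__":
    print(detect_language("#include <stdio.h>\nint main(void) { return 0; }\n"))
    print(detect_language("Title\n=====\n\nSome prose.\n"))
```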
I don't know how/if this factors into a solution, but I would say that "false positives" are the main concern. It would be better for a .rst file to be reported as No Value Detected for Programming Language than as a false positive for VB.Net.
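One way to favor "no value" over a false positive is to abstain whenever the best candidate is weak or too close to the runner-up. The sketch below assumes some detector has already produced per-language scores between 0 and 1. The `pick_language` function and both thresholds are hypothetical, and the threshold values would need tuning on real data.

```python
def pick_language(scores, min_score=0.6, min_margin=0.2):
    """Return the top-scoring language, or None when the result is doubtful.

    scores maps language name to a score in [0, 1]. Both thresholds are
    illustrative assumptions, not values taken from any existing tool.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_language, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score < min_score or best_score - runner_up_score < min_margin:
        return None  # abstain instead of risking a false positive
    return best_language


if __name__ == "__main__":
    # A weak, ambiguous result is reported as no language at all.
    print(pick_language({"VB.NET": 0.35, "reStructuredText": 0.30}))
    print(pick_language({"Python": 0.9, "Ruby": 0.05}))
```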
See https://github.com/nexB/aboutcode/wiki/GSOC-2019#improve-programming-language-detection-and-classification-in-scancode
Things to look at could include:
See also: #1036 #1012 and #426 #1355 #1201