Improve Programming language detection and classification #1445

Open
pombredanne opened this issue Mar 13, 2019 · 4 comments

Comments

@pombredanne
Member

Description

ScanCode's programming language detection is not as accurate as it could be, and getting it right is important for driving further automation. We also need to automatically classify each file into facets when possible.

The goal of this ticket is twofold. First, improve the quality of programming language detection, which uses only Pygments today and could use another tool (e.g. a Bayesian classifier like GitHub Linguist or enry?). Second, design and implement a flexible framework of rules to automate assigning files to facets, which could use some machine learning and a classifier.

See https://github.com/nexB/aboutcode/wiki/GSOC-2019#improve-programming-language-detection-and-classification-in-scancode
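
To make the "framework of rules" idea more concrete, here is a minimal sketch of path-based facet assignment. It is not an existing ScanCode API: the facet names, the `FACET_RULES` table and the `assign_facet` function are all hypothetical and only illustrate the first-match-wins approach.

```python
from fnmatch import fnmatchcase

# Hypothetical facet rules as (glob pattern, facet) pairs. First match wins.
# Note that "*" in fnmatch patterns also matches "/" characters.
FACET_RULES = [
    ("tests/*", "tests"),
    ("*/tests/*", "tests"),
    ("*_test.py", "tests"),
    ("docs/*", "docs"),
    ("*.md", "docs"),
    ("*.rst", "docs"),
]


def assign_facet(path, rules=FACET_RULES, default="core"):
    """Return the facet of the first rule matching path, else the default."""
    # Normalize separators and case so rules behave the same on all platforms.
    normalized = path.replace("\\", "/").lower()
    for pattern, facet in rules:
        if fnmatchcase(normalized, pattern):
            return facet
    return default
```

A learned classifier could later replace or complement the rule table while keeping the same function signature.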

Here are some existing tools for general file type and programming language detection:
In use today:

  • Python stdlib MIME type detection: based on extensions only, AFAIK. We use it.
  • libmagic: we use it with our own ctypes binding. It would need to be upgraded to the latest libmagic as part of the project.
  • Pygments lexers: Pygments is a code lexing and highlighting library, so it also detects programming languages as a side effect. GitHub also used it in Linguist a while back.

(We also use a Shannon entropy detector and binaryornot to detect binaries.)
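
For readers unfamiliar with the entropy approach, the following is a minimal sketch of how a Shannon entropy check can flag likely binary content. It is not ScanCode's actual detector: the `looks_binary` helper and its threshold of 6.0 bits per byte are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_binary(data: bytes, threshold: float = 6.0) -> bool:
    """Hypothetical heuristic: a NUL byte or high entropy suggests a binary."""
    return b"\x00" in data or shannon_entropy(data) > threshold
```

Plain source code tends to have low entropy because it reuses a small set of characters, while compressed or encrypted data approaches 8 bits per byte.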

Things to look at could include:

See also: #1036 #1012 and #426 #1355 #1201

@Ritvyk

Ritvyk commented Mar 24, 2019

Hi there!
We can also use regular expressions to detect the programming language in which the code was written.
I will soon upload source code for it. I'm working on it right now!
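
As an illustration only (this is not the code mentioned above), a regex-based detector might look like the sketch below. The `LANGUAGE_PATTERNS` table and `detect_language` function are hypothetical, and the rule set is deliberately tiny. The sketch returns `None` when nothing matches or when two languages tie, rather than guessing.

```python
import re

# Hypothetical rule set: each language maps to a few distinctive patterns.
LANGUAGE_PATTERNS = {
    "Python": [
        re.compile(r"^\s*def\s+\w+\s*\(.*\)\s*:", re.M),
        re.compile(r"^\s*import\s+\w+", re.M),
    ],
    "C": [
        re.compile(r"^\s*#include\s*[<\"]", re.M),
        re.compile(r"\bint\s+main\s*\("),
    ],
    "Java": [
        re.compile(r"\bpublic\s+(?:final\s+)?class\s+\w+"),
        re.compile(r"\bSystem\.out\.println\s*\("),
    ],
}


def detect_language(text, min_hits=1):
    """Return the language with the most matching patterns, or None."""
    scores = {
        lang: sum(1 for pattern in patterns if pattern.search(text))
        for lang, patterns in LANGUAGE_PATTERNS.items()
    }
    ranked = sorted(scores.values(), reverse=True)
    # Abstain when nothing matched or when the top two languages tie.
    if ranked[0] < min_hits or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(scores, key=scores.get)
```

Hand-written patterns like these are easy to read but hard to keep accurate across many languages, so they probably work best as one signal among several.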

@mjherzog
Member

Some recent examples of errors for Programming Language (pygments):

  • .rST files reported as VB.Net
  • .yaml and .yml files reported as ActionScript3
  • .less files reported as GAS
  • .md files reported as Objective-C

@pombredanne
Member Author

@mjherzog thanks. I pushed an updated pygments library in 4aaec8c but this is only a first baby step

@mjherzog
Member

I don't know how/if this factors into a solution, but I would say that "false positives" are the main concern. It would be better for a .rST file to be reported as No Value Detected for Programming Language than as a false positive for VB.Net.
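
One way to favor "no value" over a false positive is to vet each guess against what the file extension allows. The sketch below assumes a hypothetical `EXPECTED_BY_EXTENSION` table and `vet_guess` function, and the language labels in it are illustrative rather than the exact names any detector reports.

```python
import os

# Hypothetical table of extensions with known expectations.
# An empty set means "not a programming language: report nothing".
EXPECTED_BY_EXTENSION = {
    ".rst": set(),
    ".md": set(),
    ".yaml": {"YAML"},
    ".yml": {"YAML"},
    ".less": {"Less"},
}


def vet_guess(path, guessed_language):
    """Return guessed_language, or None when the extension contradicts it."""
    ext = os.path.splitext(path)[1].lower()
    expected = EXPECTED_BY_EXTENSION.get(ext)
    if expected is None:
        # Unknown extension: nothing to check against, so keep the guess.
        return guessed_language
    return guessed_language if guessed_language in expected else None
```

With this guard, `vet_guess("README.rST", "VB.net")` returns `None` instead of propagating the wrong language.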
