In telemetry data, hash all imports, not just known ones, so that new packages can be found in the future #3555
Comments
For #4852

- [x] Pull request represents a single change (i.e. not fixing disparate/unrelated things in a single PR)
- [x] Title summarizes what is changing
- [x] Has a [news entry](https://github.com/Microsoft/vscode-python/tree/master/news) file (remember to thank yourself!)
- [ ] Has sufficient logging.
- [x] Has telemetry for enhancements.
- [x] Unit tests & system/integration tests are added/updated
- [ ] [Test plan](https://github.com/Microsoft/vscode-python/blob/master/.github/test_plan.md) is updated as appropriate
- [ ] [`package-lock.json`](https://github.com/Microsoft/vscode-python/blob/master/package-lock.json) has been regenerated by running `npm install` (if dependencies have changed)
- [ ] The wiki is updated with any design decisions/details.
Some thoughts on perf:
Proposal
Impact:
https://github.com/Microsoft/vscode-python/blob/master/news/3%20Code%20Health/4852.md Is this correct, can we …
There's no unhashing; it's comparing against known hashes, like the unit test does.
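For concreteness, here's a minimal sketch of what "comparing against known hashes" could look like; the names (`hashName`, `KNOWN_PACKAGE_HASHES`) and the package list are hypothetical, not the extension's actual code:

```typescript
import { createHash } from 'crypto';

function hashName(name: string): string {
    return createHash('sha256').update(name).digest('hex');
}

// Hypothetical table: hash of a known package name -> the readable name.
const KNOWN_PACKAGE_HASHES = new Map<string, string>(
    ['numpy', 'pandas', 'jupyter'].map(name => [hashName(name), name] as [string, string])
);

// Telemetry only ever carries the hash; a unit test (or offline analysis)
// recovers the package by comparing against hashes of known names - no unhashing.
function lookupKnownPackage(importHash: string): string | undefined {
    return KNOWN_PACKAGE_HASHES.get(importHash);
}
```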
We should measure the perf impact of this before we go off and design something more complicated. It only looks at the first 1K lines.
Text is already in memory as far as I can tell, so we don't add any extra overhead by reading it. Spawning a Python process may take more CPU time than applying a regex to 1000 lines of text. I'll measure it to find out, though.
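A sketch of the in-process approach under discussion, assuming the document text is already in memory; the regex and the 1000-line cap are illustrative, not the extension's exact implementation:

```typescript
// Illustrative regex: grabs the top-level module name from `import X` / `from X import ...`.
const IMPORT_RE = /^\s*(?:import|from)\s+(\w+)/;

function extractImports(text: string, maxLines = 1000): string[] {
    const found = new Set<string>();
    // The document text is already in memory; only the first `maxLines` lines are scanned.
    for (const line of text.split(/\r?\n/, maxLines)) {
        const match = IMPORT_RE.exec(line);
        if (match) {
            found.add(match[1]);
        }
    }
    return Array.from(found);
}
```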
@DonJayamanne debouncing sounds like a good idea. No need to queue up a ton of these if the file is just being saved repeatedly. I'll add that too while I look at the perf.
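A hedged sketch of the debounce idea, so a burst of saves collapses into a single scan; the per-file map, function name, and 5-second window are assumptions, not the extension's actual values:

```typescript
// One pending timer per file; saves within the window reset it, so repeated
// saves result in a single import scan instead of a queue of them.
const pendingScans = new Map<string, ReturnType<typeof setTimeout>>();
const DEBOUNCE_MS = 5000; // assumed window, not necessarily the extension's value

function scheduleImportScan(filePath: string, scan: () => void): void {
    const pending = pendingScans.get(filePath);
    if (pending) {
        clearTimeout(pending);
    }
    pendingScans.set(
        filePath,
        setTimeout(() => {
            pendingScans.delete(filePath);
            scan();
        }, DEBOUNCE_MS)
    );
}
```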
You might want to use the …
The key difference is that it's out of proc, i.e. the extension process's time isn't used. Chewing up extension process time is where we need to be careful.
I didn't mean the time to run the Python code; I meant simply the time to call proc.spawn (and hook up the stdio handlers and read from them). Iterating over 1000 lines is really damn fast, maybe faster than the amount of code that runs when calling proc.spawn. I'm profiling now, and for a single save of a 3000K-line file the profiler doesn't even register the time for the import. I have to save 3 or 4 times to get anything to show up for the import in the profile. Our code lens parsing is a much bigger part of on-save. And this is with debug bits; minimized bits are likely even faster, but it's hard to tell what the CPU time is from.
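For anyone wanting to reproduce this kind of comparison, a rough sketch of how the two costs could be timed; this is illustrative scaffolding, not the profiling that was actually run:

```typescript
import { performance } from 'perf_hooks';
import { spawn } from 'child_process';

// Synchronous cost of calling spawn and hooking up the stdio handlers - the
// overhead being compared against the in-process scan, not the Python runtime.
function timeSpawnCall(): number {
    const start = performance.now();
    const proc = spawn('python', ['-c', 'pass']);
    proc.stdout.on('data', () => { /* reader hooked up; output ignored here */ });
    proc.stderr.on('data', () => { /* reader hooked up; output ignored here */ });
    return performance.now() - start;
}

// Cost of doing the work in-process (e.g. the regex scan sketched above).
function timeInProcess(work: () => void): number {
    const start = performance.now();
    work();
    return performance.now() - start;
}
```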
Here's a detailed analysis (of debug bits). Essentially, most of the time doing this work (on 1000 lines) is spent generating the hashes, and it's not much time, just 5 ms. We could do the hash generation in the Python code, but it doesn't seem worth it. Reparsing code lenses averages around 24 ms per save on this same file, and it seems there's already a Python spawn in the profile; spawn itself takes 9 ms. So this is actually faster. I'll do the debounce though.
Right now we return the package name for an import only if it comes from a known list. That doesn't work if we want to discover new packages in the future. Instead, this should report a hash of the package name.
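A minimal sketch of the change being asked for, assuming SHA-256 via Node's `crypto` module; the function name is illustrative rather than the extension's real telemetry API:

```typescript
import { createHash } from 'crypto';

// Instead of filtering imports down to a known list, hash every import name
// and send the hashes. Unknown packages can then be discovered later by
// hashing candidate names offline and matching them against the telemetry.
function hashImportNames(importNames: string[]): string[] {
    return importNames.map(name => createHash('sha256').update(name).digest('hex'));
}
```

Since the hash is one-way, raw package names never need to leave the user's machine; analysis hashes a candidate list of package names offline and joins on the hash, which is also how new packages can be identified later.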