In telemetry data, hash all imports and not just known ones so that in the future new packages can be found. #3555

Closed
rchiodo opened this issue Mar 20, 2019 · 11 comments
Comments

@rchiodo
Contributor

rchiodo commented Mar 20, 2019

Right now we only return the package name for an import if it comes from a known list.

This doesn't work if we want to look for new packages in the future.

Instead this should use a hash of the package name.
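
Roughly, the idea is something like this (a minimal sketch; the hash algorithm and helper name are illustrative, not the extension's actual implementation):

```typescript
import { createHash } from 'crypto';

// Hash a package name so the raw import name never leaves the machine,
// while still allowing later comparison against hashes of known packages.
function getTelemetrySafeHash(packageName: string): string {
    return createHash('sha256').update(packageName).digest('hex');
}

// Every import found in the document gets hashed, known or not, so new
// packages can be discovered later by hashing candidate names and comparing.
const imports = ['numpy', 'pandas', 'somebrandnewpackage'];
const hashedImports = imports.map(getTelemetrySafeHash);
```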

@rchiodo rchiodo self-assigned this Mar 20, 2019
rchiodo referenced this issue in microsoft/vscode-python Mar 21, 2019
For #4852

<!--
  If an item below does not apply to you, then go ahead and check it off as "done" and strikethrough the text, e.g.:
    - [x] ~Has unit tests & system/integration tests~
-->
- [x] Pull request represents a single change (i.e. not fixing disparate/unrelated things in a single PR)
- [x] Title summarizes what is changing
- [x] Has a [news entry](https://github.com/Microsoft/vscode-python/tree/master/news) file (remember to thank yourself!)
- [ ] Has sufficient logging.
- [x] Has telemetry for enhancements.
- [x] Unit tests & system/integration tests are added/updated
- [ ] [Test plan](https://github.com/Microsoft/vscode-python/blob/master/.github/test_plan.md) is updated as appropriate
- [ ] [`package-lock.json`](https://github.com/Microsoft/vscode-python/blob/master/package-lock.json) has been regenerated by running `npm install` (if dependencies have changed)
- [ ] The wiki is updated with any design decisions/details.
@DonJayamanne
Contributor

DonJayamanne commented Mar 21, 2019

Some thoughts on perf:

  • Getting all the document content into memory, splitting it into lines, and processing each line one by one can be slow and CPU intensive (regex and hashing).
    • Yes, it might not be much, but the more files you have open, the larger the files, and the more frequently this is done, the worse it gets.
    • Assume a user has auto save on; now every few seconds we're performing this operation (split text, regex processing line by line).
    • This all happens in the main thread that all other extensions share.
  • Event handlers are not debounced.
    • With auto save on, as and when you make changes, all of them will queue up.
    • I.e. if we change the same file 5 times in x seconds, we don't need to process it 5 times. If the user is still making changes, let's wait until they are done (e.g. debounce for 10s), then process the file.

Proposal

  • Perform this (as much as possible) out of proc:
    • Language Server (do everything there; however, not everyone has this on).
    • Use jedi (let jedi do the parsing, tokenizing, etc.) by using the symbol provider.
      • Then in TS all we do is parse JSON for a much smaller subset of the data...
    • Preferred approach: send the file name(s) to a Python script, let the Python script parse everything, and send back a JSON array of the module names (see the sketch after this comment).
      • This way, all text processing (reading the document), parsing, identifying modules, hashing, etc. is done out of proc.

Impact:

  • As we chew more CPU time, other extensions can get adversely affected.
  • As we chew more CPU time during on-save handling, features in our own extension such as format on save & on-save actions (sort imports on save) can get adversely affected.
    • VSC could end up abandoning those operations if they take more than x ms (we know today that this happens with format on save, as VSC only allocates 750ms for the code to format the text and provide the text edits).
    • I.e. if we chew CPU on other operations, the likelihood of these existing operations timing out increases...
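
Roughly, the preferred approach could look like this on the TS side (a sketch only; extract_imports.py is a hypothetical helper script, and real code would reuse the extension's interpreter discovery):

```typescript
import { spawn } from 'child_process';

// Hand the file name to a Python helper and let it do the parsing/hashing
// out of proc, returning a JSON array of module names on stdout.
function getImportedModules(pythonPath: string, fileName: string): Promise<string[]> {
    return new Promise((resolve, reject) => {
        const proc = spawn(pythonPath, ['extract_imports.py', fileName]);
        let stdout = '';
        proc.stdout.on('data', (chunk) => (stdout += chunk.toString()));
        proc.on('error', reject);
        proc.on('close', (code) => {
            if (code === 0) {
                resolve(JSON.parse(stdout) as string[]);
            } else {
                reject(new Error(`extract_imports.py exited with code ${code}`));
            }
        });
    });
}
```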

@DonJayamanne
Contributor

We have to unhash them on the other side to determine if a specific package or not.

https://github.com/Microsoft/vscode-python/blob/master/news/3%20Code%20Health/4852.md

Is this correct? Can we unhash them, or do we compare against hashes of known packages?

@rchiodo
Contributor Author

rchiodo commented Mar 21, 2019

There's no unhashing. It's comparing against known hashes. Like the unit test does.
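
In other words, something like this (a sketch of the comparison, not the actual unit test):

```typescript
import { createHash } from 'crypto';

const sha256 = (name: string) => createHash('sha256').update(name).digest('hex');

// On the analysis side we precompute hashes of packages we care about and
// look up the hashes coming in from telemetry. No unhashing involved.
const knownPackages = ['numpy', 'pandas', 'matplotlib'];
const knownHashes = new Map<string, string>(
    knownPackages.map((name): [string, string] => [sha256(name), name])
);

function identifyKnownPackage(receivedHash: string): string | undefined {
    return knownHashes.get(receivedHash);
}
```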

@rchiodo
Contributor Author

rchiodo commented Mar 21, 2019

We should measure the perf impact of this before we go off and design something more complicated. It only looks at the first 1K of lines.

@rchiodo
Contributor Author

rchiodo commented Mar 21, 2019

Text is already in memory as far as I can tell, so we don't add any extra overhead by reading it.

Spawning a Python process may take more CPU time than applying a regex against 1000 lines of text. I'll measure it to find out, though.

@rchiodo
Contributor Author

rchiodo commented Mar 21, 2019

@DonJayamanne debouncing sounds like a good idea. No need to queue up a ton of these if the file is just being saved a bunch. I'll add that too while I look at the perf.
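
Something along these lines, per document (a sketch only; the 10s window matches the debounce suggested above):

```typescript
// If the same document keeps getting saved, reset the timer and only process
// imports once saves have stopped for `delayMs`.
const pendingTimers = new Map<string, NodeJS.Timeout>();

function onDidSaveDocument(
    fileName: string,
    processImports: (f: string) => void,
    delayMs = 10_000
): void {
    const existing = pendingTimers.get(fileName);
    if (existing) {
        clearTimeout(existing);
    }
    pendingTimers.set(
        fileName,
        setTimeout(() => {
            pendingTimers.delete(fileName);
            processImports(fileName);
        }, delayMs)
    );
}
```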

@DonJayamanne
Contributor

It only looks at the first 1K of lines.

You might want to use the lineAt method of the document object.
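
For reference, roughly what that could look like (a sketch; the import regex is illustrative):

```typescript
import * as vscode from 'vscode';

// Read at most the first 1000 lines one at a time via lineAt, rather than
// calling getText() and splitting the entire document into lines.
function getImportLines(document: vscode.TextDocument, maxLines = 1000): string[] {
    const lines: string[] = [];
    const limit = Math.min(document.lineCount, maxLines);
    for (let i = 0; i < limit; i += 1) {
        const text = document.lineAt(i).text;
        if (/^\s*(import|from)\s+/.test(text)) {
            lines.push(text);
        }
    }
    return lines;
}
```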

@DonJayamanne
Contributor

DonJayamanne commented Mar 21, 2019

Spawning a Python process may take more CPU time than applying a regex against 1000 lines of text

The key difference is that it's out of proc, i.e. the extension process's time isn't used. Chewing extension process time is where we need to be careful.

@rchiodo
Contributor Author

rchiodo commented Mar 21, 2019

I didn't mean the time to run the Python code, I mean simply the time to call proc.spawn (and hook up the stdio handlers and read from them).

Iterating over 1000 lines is really damn fast. Maybe faster than the amount of code that runs when calling proc.spawn.

I'm profiling now and for a single save of a 3000K line file, the profiler doesn't even register the time for the import. I have to save 3 or 4 times to get anything to show up for the import in the profile.

Our code lens parsing is a much bigger part of the on-save work.

And this is with debug bits. Minified bits are likely even faster, but then it's hard to tell what the CPU time is from.

@rchiodo
Contributor Author

rchiodo commented Mar 21, 2019

Here's a detailed analysis (of debug bits):

[profiler screenshot]

Essentially, most of the time doing this work on 1000 lines (and it's not much time, just 5 ms) is spent generating the hashes. We could do the hash generation in the Python code, but it doesn't seem worth it.

Reparsing codelens on average is around 24ms each save on this same file.

And it seems there's a python spawn in the profile. Spawn itself takes 9ms. So this is actually faster.

I'll do the debounce though.

@DonJayamanne
Contributor

DonJayamanne commented Mar 21, 2019

  • I think we should still use document.lineAt(<line number/index>).text rather than reading the entire text and splitting the lines; it should be more efficient, even though the difference may be negligible.

  • It would be good to track the total time taken for this processing (we generally do that in most places when tracking telemetry); see the sketch below.

    • That will also tell us how often this is done (how often we run this code).
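
A sketch of the timing idea (the sendTelemetryEvent signature and event name here are stand-ins, not the extension's actual telemetry helper):

```typescript
// Stand-in for whatever telemetry helper the extension already uses.
declare function sendTelemetryEvent(eventName: string, durationMs: number): void;

// Wrap the processing so its duration (and, implicitly, how often it runs)
// shows up in telemetry.
async function processWithTiming(
    fileName: string,
    processImports: (f: string) => Promise<void>
): Promise<void> {
    const start = Date.now();
    try {
        await processImports(fileName);
    } finally {
        sendTelemetryEvent('IMPORT_HASHING_PERF', Date.now() - start);
    }
}
```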

@lock lock bot locked as resolved and limited conversation to collaborators May 13, 2019
@microsoft microsoft unlocked this conversation Nov 14, 2020
@DonJayamanne DonJayamanne transferred this issue from microsoft/vscode-python Nov 14, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 7, 2021