Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for sharing state across analyzer instances #6324

Closed
mavasani opened this issue Oct 26, 2015 · 56 comments
Closed

Add support for sharing state across analyzer instances #6324

mavasani opened this issue Oct 26, 2015 · 56 comments
Assignees
Labels
Area-Analyzers Concept-API This issue involves adding, removing, clarification, or modification of an API. Feature Request
Milestone

Comments

@mavasani
Copy link
Contributor

Currently, our diagnostic analyzer API provides no support for sharing any state between diagnostic analyzers. This causes extreme pain or performance issues for our customers when they have some expensive computation and/or large data to share across analyzer instances. For example:

  1. Group of analyzers that need identical configuration settings from an additional file - they have identical computation logic and resulting data, and would get a real performance benefit if they can share this data across analyzer instances without doing any complicated lifetime management.
  2. StyleCop analyzers want to walk through all trees in a compilation and identify generated files - they want to do it just once per compilation across all analyzer instances. In absence of any API support, they ended up creating a static conditional weak table keyed on compilation, which caused GC related perf problems when executing in the IDE. See Improve state management in analyzers DotNetAnalyzers/StyleCopAnalyzers#1671 for the workaround they put to avoid CWT.

An ideal solution here is to provide an analyzer API to share per-compilation and per-additional file data across analyzer instances, so that analyzer authors don't have to explicitly deal with lifetime management for such state. I am going to propose an initial API for this feature, and we can iterate over the best design for it.

Principles:

  1. API should guarantee easy state access/creation within all analyzer callbacks.
  2. Analyzer author needs to handle concurrent access/update to shared state instance. (@JohnHamby proposes we should think about enhancing the API to have this as an opt-in rather a requirement)
  3. [Implementation] No conditional weak tables for storing share state: CWT are evil in most instances, the shared state should be strongly referenced within CompilationWithAnalyzers/AnalyzerDriver instance.

Proposed API (per-compilation state):
Add the below API to each DiagnosticAnalysisContext parameter type for analyzer callbacks:

TState GetOrCreateSharedState<T, TState>(Key<T> key, Func<Key<T>, TState> createSharedState) where T: object, TState: object

where public sealed class Key<T> { } is a separate definition.

@mavasani mavasani added Area-Analyzers Concept-API This issue involves adding, removing, clarification, or modification of an API. Feature Request labels Oct 26, 2015
@mavasani
Copy link
Contributor Author

Tagging people who might be interested in this feature/design discussion: @sharwell @tmeschter @srivatsn @JohnHamby @heejaechang @pharring

@msJohnHamby
Copy link
Contributor

Can we do better in terms of the safety of the concurrency model?

@msJohnHamby
Copy link
Contributor

If we are going to require analyzers that share state to be in the same assembly, can't any analyzers that want to share state be written as a single analyzer with multiple actions?

@msJohnHamby
Copy link
Contributor

Adding @gafter @mattwar

@sharwell
Copy link
Member

Can't any analyzers that want to share state be written as a single analyzer with multiple actions?

We would prefer to avoid that, as it complicates the implementation of single analyzers. For example, it's currently easy to review the behavior of SA1018 (NullableTypeSymbolsMustNotBePrecededBySpace) against the documentation for that rule. It would not be nearly so easy if the analysis behavior for 150+ other violations was included in the same analyzer.

@sharwell
Copy link
Member

📝 Another thing to consider here. In DotNetAnalyzers/StyleCopAnalyzers#1683 and DotNetAnalyzers/StyleCopAnalyzers#1689, we dramatically reduced the number of allocations required to register our analyzers with a new compilation. However, due to our use of shared state we are currently unable to remove the closure created here, which results in an allocation for every call we make to RegisterSyntaxNodeAction (there are perhaps 1000 such calls overall).

An ideal solution to state management would allow us to reduce or eliminate these per-compilation allocations.

@heejaechang
Copy link
Contributor

@mavasani why limiting sharing state per assembly make threading model more safe? I thought we only serialize analyzer execution not assembly? are we saying if an analyzer from a assembly register shared state, we will serialize all analyzer executions from that assembly regardless whether a particular analyzer from the assembly uses shared state or not?

@sharwell
Copy link
Member

Allow state sharing only between analyzer instances from the same assembly - this is needed to ensure a secure threat model, we don't want unrelated analyzers to be able to access/manipulate each other's state.

In my opinion, we don't need this goal. The intended outcome has questionable value at best, but the impact could result in very real limitations for analyzer developers. For example, in the future we may want to provide a common API for detecting generated code in a StyleCop-compliant manner, which many analyzers then use.

API should guarantee easy state access/manipulation within all analyzer callbacks. However, analyzer author needs to handle concurrent access/update to shared state instance.

We've found that simply having a singly-initialized value associated with a compilation would be sufficient. For example, all we need for generated files support is to associate a single (initially empty) instance of ConcurrentDictionary<SyntaxTree, bool> with a Compilation instance, and then access that dictionary in each analyzer. We never need to modify the dictionary instance which is associated with the compilation, even though we do have code to modify the contents of that instance.

I believe the priorities for the API should be:

  1. Allow lazily- or singly-initialized object to be associated with a compilation (allocating one dictionary even eagerly would not be a problem)
  2. Reduce the number of allocations required when registering analyzer callbacks for the case where shared state is used

[Implementation] No conditional weak tables for storing share state: CWT are evil in most instances, the shared state should be strongly referenced within CompilationWithAnalyzers/AnalyzerDriver instance.

More specifically, after the data is initialized we need access to this data to be lock-free. The implementation of CWT does not allow us to meet this requirement.

@heejaechang
Copy link
Contributor

@mavasani ah, I misinterpret the meaning of thread model. sorry. ignore my previous comment.

@sharwell
Copy link
Member

@heejaechang

why limiting sharing state per assembly make threading model more safe?

The original post says threat, which I interpreted as a discussion of security, as opposed to thread which would be a discussion of concurrency.

@heejaechang
Copy link
Contributor

@mavasani so this reduce repeated work per each analyzer but analyzer need to repeat the same work per compilation, right? I think that is good goal. we should clear the shared state as soon as all analysis of a compilation has finished. otherwise we could get into memory leak.

@mavasani
Copy link
Contributor Author

@heejaechang @sharwell We need some kind of threat model/security to ensure unrelated analyzers cannot view/modify state or worse throw exceptions based on the assumption about the data type of the state. I was proposing we start with drawing a boundary around the assembly in which analyzers exists, and can extend the API later to support sharing data across assemblies. However, we do need some measures to prevent unrelated analyzers from accidentally/purposefully corrupt shared data.

@heejaechang
Copy link
Contributor

@mavasani @sharwell 2 questions.

  1. I think we can change "key" from string to object. by doing it, the author can decide whether they want to share their state or not by making the object key shared between analyzers. since we don't restrict the state as a serializable object, I think we can just use object as key.
  2. why limit TState as object if we are not going to use CWT?

@sharwell
Copy link
Member

so this reduce repeated work per each analyzer but analyzer need to repeat the same work per compilation, right?

We've actually been successful in accomplishing this, but it's been a challenge with many pitfalls. We've completely changed the way we manage per-compilation state twice as a result of performance problems that snuck in. The primary benefit of a common API for us will be the elimination of many additional allocations (several hundred at least) which occur during the registration of our analyzers.

We need some kind of threat model/security to ensure unrelated analyzers cannot view/modify state or worse throw exceptions based on the assumption about the data type of the state.

If you change the key to be a Type which is compared by reference (as opposed to a string), then the simple use of a type which is not publicly exposed as a key would be sufficient to address this concern.

@pharring
Copy link
Contributor

If you use "object" as the key instead of "string", then you can leave it up to the analyzer authors to determine "sharability". If you need private state, then use a "private static readonly object" key. If you need something visible to other analyzers in the same assembly, then use "internal", etc.

#5542 had that

@mavasani
Copy link
Contributor Author

@pharring that is surely a better approach, let me update the original comment accordingly.

@heejaechang
Copy link
Contributor

@mavasani can we remove this API.
bool TryGetSharedState(string key, out TState value) where TState: object

can one always use GetOrCreate ? that will remove people to assume there is any dependency we guarantee.

@heejaechang
Copy link
Contributor

@mavasani also, can we remove "State" from the API and rather put "Cache" in them. so that people doesn't use it as a state but use as a "Cache". implication is, "state" can break functionality but cache can't. this also remove assumption on dependency and also, we don't need to re-think our statefull and stateless analyzer story.

@mavasani
Copy link
Contributor Author

can we remove this API.
bool TryGetSharedState(string key, out TState value) where TState: object

Yes, probably but API design probably want to decide that. If the analyzer wants to have null as an allowable value for the data for exceptional circumstances, then a reader invoking GetOrCreate needs to do an additional null check. Just a minor API preference..

can we remove "State" from the API and rather put "Cache" in them. so that people doesn't use it as a state but use as a "Cache".

I thought we explicitly want it to be strongly referenced state, that lives the lifetime of the compilation. We don't want to create conditional weak table or weakly referenced property bags or any kind of cache, that has already proven to be bad for IDE perf. Am I misunderstanding your comment?

@gafter
Copy link
Member

gafter commented Oct 26, 2015

I would prefer this API be typesafe, for example as in #5708.

@heejaechang
Copy link
Contributor

@mavasani

"Yes, probably but API design probably want to decide that. If the analyzer wants to have null as an allowable value for the data for exceptional circumstances, then a reader invoking GetOrCreate needs to do an additional null check. Just a minor API preference.."

user can still return null from GetOrCreateXXX if they want to.

"I thought we explicitly want it to be strongly referenced state, that lives the lifetime of the compilation. We don't want to create conditional weak table or weakly referenced property bags or any kind of cache, that has already proven to be bad for IDE perf. Am I misunderstanding your comment?"

this is not what I meant. implementation-wise, leave it exactly as you suggested.

what I meant was just remove "State" from API and rather put "Catch" in them. and tell them you can't depends on the data for your functional correctness. so stateless or statefull analyzer both still can use this API without making them stateful analyzer. conjunction with the removing of "TryXXXX" API, all user of the API, which uses GetOrCreateXXX should be able to create the cache if it doesn't exist.

@sharwell
Copy link
Member

I like the simplicity of a single GetOrCreateXXX method, and it resolves every single case we have where we need to register per-compilation analyzer callbacks as opposed to calling back to static methods which are registered once for the lifetime of the analyzer assembly.

For this to work, it is essential that the fast path (case where the item already exists) be very efficient in a concurrent environment, because it could be invoked in each of the analysis callbacks (potentially dozens of times per SyntaxNode depending on the number of analyzers enabled).

I would prefer this API be typesafe, for example as in #5708.

We could get this by replacing the object key in the proposed API with Key<T>.

@mavasani
Copy link
Contributor Author

@gafter I have updated the proposed API to take a TKey instead of object key, hopefully this is what you meant.

@heejaechang

tell them you can't depends on the data for your functional correctness

Why would we do so? The fact that you want per-compilation shared state makes you a stateful analyzer, and we should be able to guarantee functional correctness by keeping this object alive as long as the underlying compilation lives. We can just store the state on CompilationWithAnalyzers instance, which has strong reference to the underlying compilation on which analyzers are executed. Analyzers can safely assume that we will initialize this state only once per compilation.

@gafter
Copy link
Member

gafter commented Oct 26, 2015

@mavasani No, it doesn't enforce any relationship between the type of the key and the type of the value. The implementation of TryGetSharedState will contain a cast that can fail. Please see #5708 for a model of type safety for this kind of API.

@sharwell
Copy link
Member

I have updated the proposed API to take a TKey instead of object key, hopefully this is what you meant.

Define a class Key<T> as follows:

public sealed class Key<T> { }

Then update the API to require the key's type match the created value's type:

TState GetOrCreateSharedState<TState>(Key<TState> key, Func<object, TState> createSharedState)
    where TState: class

@gafter
Copy link
Member

gafter commented Oct 26, 2015

@sharwell Yes, almost like that. get rid of the object:

TState GetOrCreateSharedState<TState>(Key<TState> key, Func<Key<TState>, TState> createSharedState)
    where TState : class

@heejaechang
Copy link
Contributor

@mavasani first analyzer become stateful analyzer. you get one benefit and you lost one - concurrency since state basically implies dependency.

since it is state that they relies on for functionality, the order of state mutation must be correct and observable from all callers.

making it not state is much simpler and stateless analyzer can use it as well.

@msJohnHamby
Copy link
Contributor

Here's a proposal to cover reading an external resource from multiple analyzers. It does not cover sharing mutable state between analyzers.

A significant aspect of the issues discussed here is a need for multiple analyzers to read configuration, options, or directive context from external resources. Reading and parsing a resource can be expensive, motivating a desire to read a given resource once per compilation or per analysis session rather than once per analyzer per compilation. Here is a proposed API with the following goals:
-- Minimize the number of times a resource is read and parsed.
-- Guarantee that a resource will be reread if it changes between accesses.

To be added to:
-- AnalysisContext
-- CompilationStartAnalysisContext
-- CompilationAnalysisContext

We propose:
bool TryGetResource<T>(Key<T> key, SourceText text, Func<SourceText, T> reader, out T result)

Semantics:
-- At first invocation for a given key value and a given text value, invoke the reader function with the text value. If the reader returns normally, set result to the value returned by the reader and return true. If the reader terminates with an exception, set result to Default(T) and return false.
-- A subsequent invocation with an equivalent key value and equivalent text value is permitted (and is likely) to bypass invoking the reader, set result to the previous result, and return the same value as the prior invocation. However, it is also permitted to invoke the reader function again.
-- A subsequent invocation with a different key value or a different text value must invoke the reader.

Caching can cross contexts. For example, if there is a call to AnalysisContext.TryGetResource in one analyzer and an equivalent call to CompilationStartAnalysisContext.TryGetResource in another, the resource will likely be read only once. The lifetime of a cached result is arbitrary, but is expected to be as long as practical without introducing excessive memory pressure.

Expectations on clients:
-- Every invocation of a given reader with equivalent text will produce equivalent results.
-- All invocations of these methods for a given key will supply readers with equivalent behavior and text values intended to represent (versions of) the same resource.
-- The value returned by a reader will not be mutated carelessly.
-- For a given key, a single analyzer will make no more than one invocation of AnalysisContext.TryGetResource and no more than one invocation per compilation of CompilationStartAnalysisContext.TryGetResource or CompilationAnalysisContext.TryGetResource, and will invoke no more than one of these methods.

Violating these expectations can produce difficult-to-explain behavioral inconsistencies.

@sharwell
Copy link
Member

sharwell commented Dec 5, 2015

For a given key, a single analyzer will make no more than one invocation of AnalysisContext.TryGetResource and no more than one invocation per compilation of CompilationStartAnalysisContext.TryGetResource or CompilationAnalysisContext.TryGetResource, and will invoke no more than one of these methods.

I don't understand the need for either of these limitations. Can you clarify how this benefits the implementation and the reason it needs to be different from the behavior described by your other expectations?

@msJohnHamby
Copy link
Contributor

Violating these restrictions might produce a surprise. If the resource changes between invocations, the result you get back will also change. If the resource has not changed and you supply an equivalent reader delegate, you might get back an identical object from two invocations, or you might get different (but equivalent) results. If you supply a different delegate to two invocations with the same key, the second invocation might return the results of the first and not call the delegate supplied with the second invocation.

The implementation won't enforce these restrictions. It also won't do anything to guarantee predictable behavior if they are violated.

@sharwell
Copy link
Member

sharwell commented Dec 5, 2015

If the resource changes between invocations, the result you get back will also change. If the resource has not changed and you supply an equivalent reader delegate, you might get back an identical object from two invocations, or you might get different (but equivalent) results. If you supply a different delegate to two invocations with the same key, the second invocation might return the results of the first and not call the delegate supplied with the second invocation.

Based on the description in the previous comment, none of these behaviors is unexpected. I think what you currently have documented as an expectation on clients would be better worded as a recommendation that care be taken, along with clear examples like you provided in the follow-up. With the expectation in place, clients will not be able to leverage this functionality even if the developers understand the special considerations and know that they would not negatively impact their analyzers.

@mavasani
Copy link
Contributor Author

mavasani commented Dec 9, 2015

Thanks @JohnHamby for the proposal.
@sharwell @paul1956 Apart from the minor modifications to the wording of the semantics/expectations, does the proposal seem adequate for your requirements?

@sharwell
Copy link
Member

sharwell commented Dec 9, 2015

Apart from the minor modifications to the wording of the semantics/expectations, does the proposal seem adequate for your requirements?

Overall, yes. Some questions remain which may or may not be in scope of this:

  1. Where does the SourceText instance come from (second parameter of proposed API)?
  2. Would a similar API be added for code fix contexts?

@mavasani
Copy link
Contributor Author

Where does the SourceText instance come from (second parameter of proposed API)?

Additional files: http://source.roslyn.io/Microsoft.CodeAnalysis/R/b33eb1a9beed263a.html
Regular source files: http://source.roslyn.io/Microsoft.CodeAnalysis/R/5967ebe19d531e5d.html

Would a similar API be added for code fix contexts?

Seems reasonable to me. I'll file a separate issue for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Analyzers Concept-API This issue involves adding, removing, clarification, or modification of an API. Feature Request
Projects
None yet
Development

No branches or pull requests

9 participants