Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TokenStream.IncrementToken() is called after Dispose() is called #271

Open
clambertus opened this issue Jul 13, 2019 · 23 comments · May be fixed by #1058
Open

TokenStream.IncrementToken() is called after Dispose() is called #271

clambertus opened this issue Jul 13, 2019 · 23 comments · May be fixed by #1058

Comments

@clambertus
Copy link

When overriding Dispose(bool) in either TokenStream subclasses, Lucene.Net will call Dispose() before it is done using the TokenStream, and call IncrementToken() again.

 

The behavior can be observed in the Lucene.Net.Collation.TestICUCollationKeyFilter.TestCollationKeySort() when the ICUCollationKeyFilter implements Dispose().

JIRA link - [LUCENENET-611] created by nightowl888
@clambertus
Copy link
Author

It is possible that this issue was due to a bug in Lucene.Net.TestFramework that has been addressed already. We need to confirm that it was the test framework that was errantly causing the double-call, not a feature of Lucene.NET.

Also, the ICUCollationKeyFilter no longer implements Dispose() - that was when we were referencing icu-dotnet (which had unmanaged resources) instead of ICU4N. However, the behavior can be observed in any existing disposable TokenStream or by overriding Dispose(bool) in one that isn't disposable.

by nightowl888

@NightOwl888
Copy link
Contributor

I looked into this a little, and it appears that the bug that was fixed in the test framework didn't solve this. When adding a Dispose(bool) override to the ICUCollationKeyFilter, Dispose(true) is still being called followed by additional calls to IncrementToken().

This is occuring in DocFieldProcessor which does a tight loop around some fields. It is possible that the consumer will dispose the TokenFilter in one loop and then another loop will retrieve the same instance of that TokenFilter by calling Field.GetTokenStream(Analyzer) and call IncrementToken() on it after it is disposed.

At this point I haven't confirmed this is how it works in Java so it is still unclear if this is expected behavior or a bug. If the former, we should consider the possibility of making TokenStream "close" and do that separately than "dispose" (see #265).

@NightOwl888
Copy link
Contributor

I started by trying to solve the two TestRandomChains failures (see #269), but it ended up unraveling back to this issue again.

I isolated a specific case that fails in .NET:

[Test]
public void TestWordDelimiterFilterChain()
{
    Analyzer a = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
    {
        Tokenizer tokenizer = new NGramTokenizer(LuceneVersion.LUCENE_48, reader, 70, 79);
        TokenStream filter = tokenizer;
        filter = new WordDelimiterFilter(LuceneVersion.LUCENE_48, filter, WordDelimiterFlags.CATENATE_NUMBERS | WordDelimiterFlags.SPLIT_ON_CASE_CHANGE, new CharArraySet(LuceneVersion.LUCENE_48, new HashSet<string> { "htxodcrfoi", "zywcmh", "pkw", "urdvcqr", "gbuskh" }, ignoreCase: true));
        filter = new ShingleFilter(filter, "EMAIL");
        return new TokenStreamComponents(tokenizer, filter);
    });

    string text = "ufsczdev -[({0, \u20d0\u20f4\u20f4\u20d4\u20e0\u20e4\u20d0\u20ef\u20e7\u20dd\u20f7\u20d2\u20e1 kdoxod \u175b\u1758\u174a pirsk --><scr \ufe5d\ufe51\ufe6c\ufe5d\ufe5e\ufe61 \ud802\udf57 \ua6fb\ua6cd \ufc78\uf998\u06e9\u1b3a \ud802\ude2c\ud802\ude59\ud802\ude22\ud802\ude3b lkpnukew rqniibv \ue990\ud8d0\udfe6\ue6a8[ lmev wzpts \u306a\u304a\u3093\u3080\u3092";
    bool useCharFilter = false;
    bool offsetsAreCorrect = true;

    long seed = 0x21773254b5ac8fa0L;
    Random random = new J2N.Randomizer(seed);


    try
    {
        CheckAnalysisConsistency(random, a, useCharFilter, text, offsetsAreCorrect);
    }
    catch (Exception e) when (e.IsThrowable())
    {
        Console.WriteLine("Exception from random analyzer: " + a);
        throw;
    }
}

And after translating it to Java, it still fails:

public void testWordDelimiterFilterChain() throws Throwable {
  Analyzer a = new Analyzer() {
    public TokenStreamComponents createComponents(String fieldName, Reader reader) {
      Tokenizer tokenizer  = new org.apache.lucene.analysis.ngram.NGramTokenizer(Version.LUCENE_48, reader, 70, 79);
      TokenStream filter = tokenizer;
      filter = new WordDelimiterFilter(Version.LUCENE_48, filter, WordDelimiterFilter.CATENATE_NUMBERS | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE, new CharArraySet(Version.LUCENE_48, asSet("htxodcrfoi", "zywcmh", "pkw", "urdvcqr", "gbuskh"), /*ignoreCase:*/ true));
      filter = new org.apache.lucene.analysis.shingle.ShingleFilter(filter, "EMAIL");
      
      return new TokenStreamComponents(tokenizer, filter);
    }
  };
  
  String text = "ufsczdev -[({0, \u20d0\u20f4\u20f4\u20d4\u20e0\u20e4\u20d0\u20ef\u20e7\u20dd\u20f7\u20d2\u20e1 kdoxod \u175b\u1758\u174a pirsk --><scr \ufe5d\ufe51\ufe6c\ufe5d\ufe5e\ufe61 \ud802\udf57 \ua6fb\ua6cd \ufc78\uf998\u06e9\u1b3a \ud802\ude2c\ud802\ude59\ud802\ude22\ud802\ude3b lkpnukew rqniibv \ue990\ud8d0\udfe6\ue6a8[ lmev wzpts \u306a\u304a\u3093\u3080\u3092";
  boolean useCharFilter = false;
  boolean offsetsAreCorrect = false;
  
  long seed = 0x21773254b5ac8fa0L;
  java.util.Random random = new java.util.Random(seed);
  
  
  try
  {
      checkAnalysisConsistency(random, a, useCharFilter, text, offsetsAreCorrect);
  }
  catch (Throwable e)
  {
      System.out.println("Exception from random analyzer: " + a);
      throw e;
  } 
 }

I discovered this is primarily due to the fact that some flags are not supported during indexing: https://stackoverflow.com/a/66478863/, and confirmed that removing the WordDelimiterFlags.CATENATE_NUMBERS flag makes the test pass.

In TestRandomChains, WordDelimiterFilter is added to the brokenOffsetsConstructors dictionary. This causes offsetsAreCorrect to be set to true, which BaseTokenStreamTestCase.CheckAnalysisConsistency is supposed to handle. BaseTokenStreamTestCase has been changed from its original implementation to include try-finally and using blocks (both in CheckAnalysisConsistency() and in AssertTokenStreamContents() and that is where we are running into trouble.

Long story short, this circles back to the fact that we converted the Closable interface in Java to IDisposable in .NET on TokenStream. But TokenStream's close() method has preconditions that must be met or it will throw an exception. The exception doesn't play well when you are trying to close open TextReaders successfully. Before these finally blocks were put into place to call Dispose() one failing analysis test would cause nearly all of them to fail.

Since most of Lucene's tests do not call close() in a finally block, I have long suspected that there is something automatic happening to call these methods, and after a bit of research, I discovered that Closable inherits AutoClosable, which has configuration options that can be set to make it implicit. But it is clear this is going to take more research to work out how it works in Lucene and the best way to make it translate to .NET in a reasonable way.

Some Options to Consider

  1. Add a Close() method that has the precondition checks and use Dispose() to only do the cleanup.
  2. In Dispose(), only suppress the finalizer in the case where the preconditions pass, but rely on a finalizer in cases where it fails.
 private bool isValid;

~TokenStream() // Only called if we don't meet our preconditions
{
    Dispose(false);
}

public void Dispose()
{
   Dispose(true);
   GC.SuppressFinalize(this);
}

protected virtual void Dispose(bool disposing)
{
    if (disposing)
    {
        isValid = Validate(out string reason);
        if (!isValid)
            throw new InvalidOperationException(reason); // Exception short circuits the call to SuppressFinalize

        DoDispose();
    }

    if (!isValid)
        DoDispose();
}

internal void DoDispose()
{
  // Dispose file handles here...
}

Of course, that means we need to wait for the finalizer to execute before beginning the next test. From the finalizer docs:

The programmer has no control over when the finalizer is called; the garbage collector decides when to call it. The garbage collector checks for objects that are no longer being used by the application. If it considers an object eligible for finalization, it calls the finalizer (if any) and reclaims the memory used to store the object. It's possible to force garbage collection by calling Collect, but most of the time, this call should be avoided because it may create performance issues.

Options for fixing the tests:

  1. Call GC.Collect() in a catch block to ensure the finalizer gets called first.
  2. Call DoDispose() in a catch block to ensure we dispose everything right away, without waiting for the GC.

Since the preconditions are only meant to prevent programming bugs that aren't likely to make it into a production application, it seems reasonable to rely on a finalizer as a fallback for this edge case.

Further Consideration

Being that Lucene may call Close() on a TokenStream and then Reset() to start the process all over again, we should separate the concept of Close() from Dispose(). Whether we actually need a Dispose() method needs more analysis to figure out if TokenStream is actually responsible for cleaning up file handles, and if not, where that responsibility lies. Since TokenStream can effectively "Reopen" after it is called, it seems to make sense to at least have a Close() method to signal that we are not actually disposing anything and can still use the TokenStream instance.

But we definitely need to revert BaseTokenStreamTestCase to be closer to its original form and remove all of the finally blocks that "correctly consume" the TokenStream that are now causing other issues in the tests.

@vvdb-architecture
Copy link

Throwing exceptions in Dispose() is bad form.
See https://docs.microsoft.com/en-us/dotnet/fundamentals/code-analysis/quality-rules/ca1065
Or https://docs.microsoft.com/en-us/dotnet/api/system.idisposable.dispose?view=net-6.0:

If an object's Dispose method is called more than once, the object must ignore all calls after the first one. The object must not throw an exception if its Dispose method is called multiple times. Instance methods other than Dispose can throw an ObjectDisposedException when resources are already disposed.

@NightOwl888
Copy link
Contributor

NightOwl888 commented Nov 7, 2022

@vvdb-architecture

Thanks. So that leaves the question - how do we enforce the TokenStream workflow contract in .NET?

The exception is meant to be a guide to help at development time to ensure all of the requirements of the contract are followed. Once the contract is being followed, this exception will never occur at runtime. So, that means the part that they are worried about - adding an exception handler inside a finally block - is not a thing that will ever need to happen.

One option I have considered is to build a Roslyn code analyzer to enforce this contract at design time, but I am not sure how complex it would be to analyze APIs of the user code no matter how they have it configured and ensure that all of the methods are called in the right sequence and the right number of times in the code. They could be building one or more abstractions that do some of the TokenStream operations while leaving the others somewhere else, or they may be in conditional parts of the code, for example.

Some of the rules, such as enforcing the calls to base.Reset(), and base.End() when they are overridden would be easy to enforce, while others would take more work. The exception in the Dispose() method is definitely a simpler approach, but if we had an Roslyn code analyzer that worked reliably, it would help to speed up the development process by catching this sort of thing at design time instead of runtime.

However, this still doesn't get us out of the situation we are in where close() then a call to reset() to start the TokenStream over from the beginning is allowed in Lucene - Dispose() is meant to be the final call that doesn't allow anything else to happen to the object. This is still a violation of the Dispose() contract in .NET, so we need to dig into whether Close() makes sense here, just to inform consumers that it is not actually a Dispose() operation. Or maybe End() is supposed to signify when the end of the stream is reached (after which case Reset() is allowed) and Dispose() is supposed to signify the end of the operation. In that case the tight loop should be refactored never to call Dispose() inside of the loop and have a final loop at the end that calls Dispose() on all of the TokenStreams to ensure methods never are called after Dispose().

According to this error message, it is supposed to be invalid to call reset() multiple times. If that is really the case, then the tight loop can be considered a bug that needs to be fixed, and we don't need to change Dispose() back to Close(). Although, the bug may boil down to the fact we added an extra call to Dispose() in .NET and it shouldn't actually be there.

@NightOwl888
Copy link
Contributor

NightOwl888 commented Nov 8, 2022

After some research, I have determined the following:

  1. Dispose() doesn't actually throw anything. The failure happens when attempting to set up the next field using the same TokenStream instance.
  2. The reason why we get cascade failures in the tests when enforcing the dispose call is that the current Dispose() implementation both calls TextReader.Dispose() (which is good) and sets the state to an ILLEGAL_STATE_READER (which is bad). Ideally, we only close the open file handles if there is an exception somewhere in the middle that is interfering with our calls and leave the other state alone so the next test can pass.
  3. The design of the close() method being called multiple times per TokenStream instance is intentional. They are designed to be reused according to the ReuseStrategy used by the Analyzer. There is some discussion of this in LUCENE-8651.
  4. The bit that loops through the fields and calls the TokenStream methods Reset(), IncrementToken(), End(), and Dispose() is all in DocInverterPerField.
  5. While Analyzer is IDisposable, it doesn't actually dispose any of its children (the state of the ReuseStrategy or any of its TokenStreamComponents), they are simply left to go out of scope.
  6. If we change Dispose() to Close() it only partially fixes the issue. The design doesn't allow for any long-lived or unmanaged state that is supposed to last the lifetime of the TokenStream instance. This makes it difficult to use anything that is expensive to set up in the implementation of a TokenStream. Something like that would need to be done in the Analyzer and passed into the TokenStream in the constructor in the call to CreateComponents(), which is not a clean separation of concerns.

The proposed solution to fix these issues is:

  • Add a new interface ICloseable with a single method Close(). We clearly need to separate the concept of Close() (that can be reused) from Dispose() (where we are getting rid of the instance for good).
  • Add back the IOUtils.Close... methods using the ICloseable interface. These are used in some places in a finally block to close a TokenStream and reset its state to ILLEGAL_STATE_READER.
  • Change all current Tokenizer, TokenStream, and TokenFilter implementations from using Dispose(bool) to Close().
  • Add a Dispose(bool) method to the base Tokenizer, TokenStream, and TokenFilter implementation and possibly anything that needs to override it. The base implementation of Dispose() will differ from Close() only in that it will not use ILLEGAL_STATE_READER.
  • Track the disposed state of Tokenizer, TokenStream, and TokenFilter and throw ObjectDisposedException in all methods after Dispose() is called. This functionality will need to be disabled for the token stream tests, but we should have tests to verify it works in other cases. (needs more consideration)
  • Make ReuseStrategy implement IDisposable and have Analyzer dispose it in its Dispose(bool) method.
  • Make TokenStreamComponents implement IDisposable and have each ReuseStrategy implementation call each instance of it in its Dispose(bool) method. It can do this by calling GetStoredValue(Analyzer), casting to the appropriate type, and calling Dispose() on all instances of TokenStreamComponents in the returned value.
  • Make TokenStreamComponents.Dispose(bool) cascade the call to its Tokenizer and TokenStream.
  • Change BaseTokenStreamTestCase to only call TokenStream.Dispose() in its finally block (a block that was added for .NET) rather than Reset(), IncrementToken(), End(), and Close(). This allows the test to dispose the open file handles without mutating the other state of the TokenStream that causes test failures.
  • Leave the existing runtime exceptions for misusing the TokenStream workflow contract or other rules in place. This ensures we don't need to make any changes to the tests.
  • Create Roslyn code analyzers to ensure that the basics, such as cascading calls to base.Reset(), base.End(), base.Close() and base.Dispose(bool), are enforced in subclasses. These should be set up to produce compile errors, not warnings. Ideally, they would be smart enough to identify when the user is delegating this responsibility to another class or method. Any gaps that we leave here will be caught by the existing Lucene runtime exceptions.

This neatly puts the lifetime of Tokenizer, TokenStream, and TokenFilter instances in sync with the Analyzer that creates them. So, if the user decides to create an Analyzer that has unmanaged resources to clean up, they simply need to change:

Analyzer analyzer = new MyAnalyzer();

to

using Analyzer analyzer = new MyAnalyzer();

Analyzer won't need to be disposed during normal usage, only for cases where there are expensive long-lived components to dispose.

I am toying with the idea of creating a DoClose() method that is shared between Close() and Dispose(), simply because cascading the call from Close() to Dispose() or from Dispose() to Close() violates the rules in both directions but we generally have common components to clean up in both methods. However, only those who need to implement their own Dispose() for long-lived components will need to deal with this duplication. In all other cases, the code will look almost exactly like it does in Lucene.

@paulirwin
Copy link
Contributor

@NightOwl888 Found that ReusableStringReader should also implement this proposed ICloseable interface instead of IDisposable, especially given that its design to be reused is right there in the name.

@paulirwin paulirwin modified the milestones: 4.8.0, 4.8.0-beta00018 Nov 18, 2024
@paulirwin paulirwin self-assigned this Nov 27, 2024
@paulirwin
Copy link
Contributor

I'll take this one.

Add a new interface ICloseable with a single method Close().

@NightOwl888 Are you good with the Lucene.Net.Support.IO namespace for this interface, to match the io in java.io.Closeable?

@NightOwl888
Copy link
Contributor

Add a new interface ICloseable with a single method Close().

@NightOwl888 Are you good with the Lucene.Net.Support.IO namespace for this interface, to match the io in java.io.Closeable?

There is no public Support namespace, so, no. It should be in the Support folder, but perhaps this belongs in the Lucene.Net.Util namespace.

@paulirwin
Copy link
Contributor

Next question: on the IOUtils restoring the Close... methods, do we need the unused CloseWhileHandlingException overloads that take an Exception priorException first argument? These don't exist upstream, and I'm assuming they exist here because their Disposable counterparts exist, and aren't needed. But let me know if there's a reason why we'd need them for Close.

@paulirwin
Copy link
Contributor

FYI my branch for this issue is https://github.com/paulirwin/lucene.net/tree/issue/271 - note that I might occasionally force-push to rebase this against latest master as other PRs are merged, as I expect this one to take a bit to complete.

I'll also be checking off @NightOwl888's checklist above as I go, but this is just an informal progress tracker.

@NightOwl888
Copy link
Contributor

Next question: on the IOUtils restoring the Close... methods, do we need the unused CloseWhileHandlingException overloads that take an Exception priorException first argument? These don't exist upstream, and I'm assuming they exist here because their Disposable counterparts exist, and aren't needed. But let me know if there's a reason why we'd need them for Close.

I don't recall adding any overloads, just moving over the ones that existed. I see this defined upstream:

https://github.com/apache/lucene/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/IOUtils.java#L82

Is that not what you are referring to?

@paulirwin
Copy link
Contributor

Oh interesting, I was looking at the 4.8.1 code. It looks like those were removed in 4.8.1: https://github.com/apache/lucene/blob/releases/lucene-solr/4.8.1/lucene/core/src/java/org/apache/lucene/util/IOUtils.java

Should we go ahead and remove them and target 4.8.1 compatibility?

@NightOwl888
Copy link
Contributor

If they are unused, I guess we can remove them. So, clearly Lucene isn't following semantic versioning (or, at least wasn't at the time).

@paulirwin
Copy link
Contributor

FYI, here is the 4.8.1 commit that removed the usage of them: apache/lucene@56ab1d1

Also note that they are currently unused because the Disposable versions are used instead; if we target 4.8.1 we can clearly remove them. If we must target 4.8.0 then I haven't yet evaluated if these will be used, or if the code will continue to use the Disposable versions (because their types would not be ICloseable).

@paulirwin
Copy link
Contributor

paulirwin commented Nov 29, 2024

I spent quite a bit of time looking at this over the past few days, and I've come to what I think is a surprising idea, or at least it would have surprised past me when I started looking into this: I think we should remove IDisposable from TokenStream and its descendants completely, in favor of ICloseable. This is, I believe, the simplest solution, as well as the one that most closely matches Lucene. Then we avoid the whole IDisposable problem completely.

As noted above, these token streams are intended to be reused after Close is called. The fact that they are stored in an Analyzer via a "ReuseStrategy" speaks to this. So we know that having ICloseable is the right design choice here. This has the added benefit of more closely matching Lucene. I went through and moved all reusable close logic to Close overrides, and that was pretty easy (although time-consuming to go through ~200 implementations to review them all).

I made Dispose call Close, because Close is supposed to be idempotent, so it should be able to be called multiple times, i.e. if you call .Close() and then .Dispose() that is allowed. I do not think this violates anything (a concern expressed above), because any resources cleaned up by Close would thus need to be cleaned up by Dispose if Close is not called, but the opposite may not be true, so it's not violating any rules in two directions. This works, all tests pass, but there's a concern I had about object lifetimes, and who should be calling Dispose. That started down the rabbit hole that led to this proposal.

The problem with IDisposable TokenStreams is that you're expected to dispose of IDisposables when you're done, and as noted above, TokenStreams are intended to be reused by the Analyzer, per the ReuseStrategy. We cannot make ReuseStrategy disposable, because it is supposed to use the singleton instances Analyzer.PER_FIELD_REUSE_STRATEGY and Analyzer.GLOBAL_REUSE_STRATEGY. There's also no real need, as those classes just encapsulate behavior and not state. The state is stored in the Analyzer, in the storedValue property. Now, we could make Analyzer's DisposableThreadLocal<T> cascade the Dispose call to its state if its state is disposable, and create a disposable wrapper around JCG.Dictionary for the PerFieldReuseStrategy (so that it would thus dispose of all of its dictionary values), but I actually think this might be unnecessary complexity, because then we would need to make sure that Analyzers are disposed everywhere, etc.

I asked myself: why is this not a problem in Lucene? And after digging, and asking ChatGPT, the answer is that Analyzers and TokenStreams are not supposed to hold state that can't be garbage collected or at least released via Close (in the case of TokenStreams). And if you do need one that has to be disposed, then you can track the lifetime of that outside of the analyzer, and dispose of it yourself, because users are usually instantiating Analyzers directly anyways. There's nothing stopping someone from adding IDisposable to their custom Tokenizer or even Analyzer... but we don't have to make Lucene.NET unnecessarily complex and divergent from Lucene in order to accommodate this.

(Aside: there was a thought above that in Java, AutoCloseable means that Close is sometimes called implicitly, which implies automatically, and this is not exactly true. It is called implicitly when the AutoCloseable is used in a try-with-resources statement (akin to a using statement in C#), but it is not called automatically as part of resource cleanup.)

You also might be asking: what about the IDisposable TextReader in Tokenizer? Well, that is already going to be disposed in Close as a result of this change, but also it is passed into the Tokenizer via SetReader, so really it's the responsibility of the caller to dispose of it if Close is not called. The only case where this realistically could be a problem (since most cases use StringReader which is safely GC-able without being disposed) is if you use a TextReader Field value to some non-GC-able implementation of TextReader, in which case you're probably already safely disposing of that in a using block/statement anyways.

There is very little good reason within the Lucene.NET code to have TokenStream be IDisposable, as there are only two cases that I found in the entire codebase, after reviewing all TokenStream subclasses, where the cleanup logic is not reusable (and thus would not already have its Dispose logic moved into Close): PrefixTreeStrategy.CellTokenStream and MemoryIndex.TokenStreamAnonymousClass<T>.

PrefixTreeStrategy.CellTokenStream has an IEnumerator<Cell> field that is passed into its constructor, which is why it is not reusable (the enumerator is not reassigned/reinitialized inside this class). But it is just enumerating over an in-memory list of cells, so there's no concern about leaking this (it would just be garbage collected). Slightly sloppy code, but nothing to lose sleep over. We could easily dispose of the enumerator in a Close method, even though that's not reusable, as it's only used by the encapsulating Field, and is basically a private implementation detail. We could make a LUCENENET-specific change to have this type take an IEnumerable instead in its ctor, dispose of the enumerator in Close, and re-initialize it if it's attempted to be reused. Based on the comment at the call site, this seems like it might have been what was intended. Or, we could take the latest code which avoids this problem entirely (and does not need to be disposed). I think I'd prefer the latter, if possible.

MemoryIndex.TokenStreamAnonymousClass<T> likewise holds an IEnumerator<T> passed into its ctor, and likewise is created from an ICollection that likely can be safely GC'ed if not disposed (although there are no usages to its caller KeywordTokenStream<T>(ICollection<T> keywords), even in tests, so it's hard to tell). Upstream, latest Lucene has a comment indicating that this method possibly should be deprecated. But regardless, it's apparently safe if this is not reused.

So out of the O(200) derived types from TokenStream, only two have non-reusable state that needs to be cleaned up, and even then, both are private implementation details, and it appears that the only valid intended use of these is with in-memory data where it's safe to not clean up the enumerator in Dispose, and one of them even has updated code that avoids the problem. So we'd be going through the hassle of making TokenStream and its descendants IDisposable for no discernible benefit, when users that need their custom implementations to be disposable can handle that outside of our library code with a custom or anonymous Analyzer.

So if we remove IDisposable from TokenStream in favor of ICloseable, it simplifies this dramatically, and avoids a lot of the concerns expressed in the comments above. We can still have .Close() called in try/finally blocks where it makes sense to do so, to make sure that's called if an exception occurs, but this way we don't have to worry about someone accidentally disposing of a TokenStream via a using statement or otherwise when it's possible it might be reused. And we get the added benefit of ensuring that these types are reusable (whereas they are not always today).

I'm going to prototype this approach to determine if it's viable, but I have a hard time believing it wouldn't be at this point.

@eladmarg
Copy link
Contributor

@paulirwin thanks for the detailed information,
After some thoughts I tend to agree.

At the beginning of the port, it was clearly that the closable Java matches with the net disposable

However, looks like it's not a real dispose, especially when this can be reused after dispose

How complex this might be?
Seems like this is a big change that will break things already stabilized

@NightOwl888
Copy link
Contributor

NightOwl888 commented Nov 30, 2024

Thanks so much for the detailed analysis, Paul.

So, what I gather, you are saying a disposable class would be injected like this?

using var someDisposable = new SomeDisposable();

Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
    var src = new StandardTokenizer(m_matchVersion, reader);
    TokenStream tok = new StandardFilter(m_matchVersion, src);
    tok = new LowerCaseFilter(m_matchVersion, tok);
    tok = new StopFilter(m_matchVersion, tok, m_stopwords);
    tok = new MyFilter(someDisposable, tok);
    return new TokenStreamComponents(src, tok);
});

Alternatively, the user would need to implement a whole Analyzer subclass, but they could pass the disposable instance into the constructor to accomplish the same thing. I have to admit, doing it that way hadn't crossed my mind, but I am not sure it covers everything we need to fix. It may address all usability issues, though.

The main motivation for this issue is to fix the BaseTokenStreamTestCase class. We have added hacks to prevent cascade failures from occurring with the original design. But this has bled over into other areas, such as the two TestRandomChains test failures. IIRC, both of these things require the state of ILLEGAL_STATE_READER to be carefully managed.

There were 3 main issues we ran into with BaseTokenStreamTestCase:

  1. If an assert happened, it could leave the TokenStream in an inconsistent state because it would fail to fulfill the TokenStream contract. This would in turn cause all of the remaining tests in the same test class to fail because the TokenStream is in an invalid state each time it is reused.
  2. If a TokenStream class owns a file handle and we skip the call Dispose() on it, other tests that use the same file would get a "File in use by another process" exception. This is probably not an issue in our tests since we embedded the data files as resources into the assemblies (which means they open using a managed stream instance), but I see no reason to impose a limitation of strictly using embedded resources for our users.
  3. If a TokenStream class owns a native resource handle (as was once the case when we were using icu-dotnet), there is also no way to clean it up using Close() since it needs to happen after all of the reuse is done.

It was the 3rd problem that made me open this issue, as using icu-dotnet with Lucene.NET was completely impossible without having a Dispose() method that was only called once at the end of the process. The above solution would have been very ugly to hand our users - imagine having to instruct users to Init() and Cleanup() in every piece of code that uses a particular analyzer implementation (at one point, those were required). And what if a future implementation we own requires us to call something disposable?

My thought was to separate the concepts of Close() from Dispsose() so the test framework could clean up a reusable TokenStream instance without modifying the ILLEGAL_STATE_READER so it doesn't cause all of the other tests in the class to fail along with it. But, IIRC there has never been an attempt to make a failure happen in Lucene to see if it had the same cascade failures or if there is some other hook that is supposed to clean up this state. Having those cascade failures made finding the test that was actually failing very hard. If Lucene does not have cascade failures, I am curious as to why they don't.

Of course, we have changed a lot of things since those hacks were put in place, such as fixing the error handling to work the same way it did in Lucene and switching from external resource files to embedded resources.

A good proof of concept would be to ensure the TestRandomChains test failures can be patched with the proposed fix. I would be interested to know

  1. Can we make those pass without having disposable analysis components?
  2. Can we make the tests that use a single analyzer instance for a whole test fixture class fail independently without getting cascade failures? Even on TextReader implementations that own file handles?

It is that last condition that made me think there was no way out except for having disposable token stream components, since the original configuration would always leave the file handle open upon test failure.

@paulirwin
Copy link
Contributor

So, what I gather, you are saying a disposable class would be injected like this?

That's the idea, yeah. Although your example of the icu-dotnet case is a good thought experiment. After some thinking, I think we can made TokenStream IDisposable as well as ICloseable, although the virtual Dispose(boolean) method would only be an extension point for anyone that needs an example like that. In almost all other cases, Close should be used instead. And as I noted above, we would be able to get away with not having any of our TokenStream subclasses override Dispose.

To do this, I think we would need to make TokenStreamComponents implement IDisposable and dispose of its two fields, although we would have to note that because there is no guarantee that the sink ends up disposing of its source, that we would have to allow for Dispose to be idempotent and able to be called multiple times, because if we just i.e. not dispose of the TokenStreamComponents source field (assuming the sink would dispose it), then there's a chance that it might not, and then something could leak. We could add a trivial check to not redundantly call Dispose on the sink if it ReferenceEquals the source, but other than that I think we have to allow for Dispose to be called more than once.

We then need to change DisposableThreadLocal to dispose of all of its Values that are IDisposable. I'll prototype this and see how it goes.

Note that I do not think we should do the suggestion earlier in this thread of throwing if any of the other TokenStream methods are called on it after it has been disposed. The Closeable implementation does not do this in the upstream code, and we'd have to override many methods from AttributeSource, and make IncrementToken virtual instead of abstract, which would significantly deviate from upstream, and cause a performance penalty, just to prevent someone from doing this in their custom types (as again, we would not be overriding Dispose in our code). I think we should make that the responsibility of the implementer if they care about that, and just let it be.

Can we make those [TestRandomChains] tests pass without having disposable analysis components?

They do pass, even with high iteration counts. I've removed the AwaitsFix attribute from them.

Can we make the tests that use a single analyzer instance for a whole test fixture class fail independently without getting cascade failures?

This also seems to work fine. I think it was due to wrapping the IDisposables in using statements previously like you said, and now I've replaced those with try/finally and calling .Close() in the finally block. This is effectively the same as a using block, which the compiler lowers to try/finally and calling .Dispose() in the finally block. So the behavior is basically the same, just the semantics have changed: Close means reusable, Dispose means you're done with the object. We already effectively were treating disposed objects as reusable, this just creates the distinction in meaning.

Even on TextReader implementations that own file handles?

Do we have any tests that you know of that do this that I can confirm? But again, if they were passing previously, they're still passing now, as there's basically little difference in runtime behavior than before.

@paulirwin
Copy link
Contributor

Also, I can still reproduce this test failure: #271 (comment)

I'm going to attempt removing the try/finally from those BaseTokenStreamTestCase methods to match upstream and see what happens.

@paulirwin paulirwin linked a pull request Dec 5, 2024 that will close this issue
4 tasks
@paulirwin
Copy link
Contributor

I don't have a solution to that failing test, but I've created a draft PR with my work on this: #1058. It worked out pretty well I think to have TokenStreamComponents be IDisposable, and to create a disposable wrapper around JCG.Dictionary for PerFieldReuseStrategy.

@NightOwl888
Copy link
Contributor

I don't have a solution to that failing test

As pointed out above, the backported test failed in Java as well.

And after translating it to Java, it still fails

However, IIRC there were still some problems with managing the state of the TokenStream where the execution paths diverge between .NET and Java. when stepping through. I think it had something to do with whether WordDelimiterFlags.CATENATE_NUMBERS is included in the test, which was controlled by another condition (this may have been fixed in a later version of Lucene).

I discovered this is primarily due to the fact that some flags are not supported during indexing: https://stackoverflow.com/a/66478863/, and confirmed that removing the WordDelimiterFlags.CATENATE_NUMBERS flag makes the test pass.

In TestRandomChains, WordDelimiterFilter is added to the brokenOffsetsConstructors dictionary. This causes offsetsAreCorrect to be set to true, which BaseTokenStreamTestCase.CheckAnalysisConsistency is supposed to handle. BaseTokenStreamTestCase has been changed from its original implementation to include try-finally and using blocks (both in CheckAnalysisConsistency() and in AssertTokenStreamContents() and that is where we are running into trouble.

Long story short, this circles back to the fact that we converted the Closable interface in Java to IDisposable in .NET on TokenStream. But TokenStream's close() method has preconditions that must be met or it will throw an exception. The exception doesn't play well when you are trying to close open TextReaders successfully. Before these finally blocks were put into place to call Dispose() one failing analysis test would cause nearly all of them to fail.

There is some context that I failed to document between the last two paragraphs, but it shouldn't be too hard to figure out why I thought this was failing due to the addition of the try-finally blocks.

@paulirwin
Copy link
Contributor

I spent more time looking at this last night and updated the BaseTokenStreamTestCase to be much closer to the original, mostly by removing some code that didn't exist upstream. The only thing that really differs from upstream is that .Close() is still called in the try/finally blocks, and I added Dispose to those as well, even though none of our implementations now use Dispose as of this PR. (#1058)

TestRandomChains still randomly fails (fairly reliably if you put a Repeat(100) on it), but it doesn't cause cascading failures, and I'm not convinced it's failing because of this issue. I'm not entirely sure where to go next from here, so I could use some help on this one with not only reviewing what I've got, but trying it out to figure out if more work needs to be done here or not.

If TestRandomChains randomly failing is related to this issue, I'd appreciate some enlightenment as to why, as that isn't clear to me. If the core problem with this issue was that failing analysis tests (TestRandomChains or otherwise) caused other cascading failures, that does not seem to be the case anymore.

I am 99% sure I've solved the issue of TokenStreams being reused after Close is called. Next up I'll make a test that tests to make sure Dispose is not called before reuse (as that should not be done in a reusable way), and make those two types mentioned above reusable (even though they aren't causing any failing tests).

paulirwin added a commit to paulirwin/lucene.net that referenced this issue Dec 19, 2024
paulirwin added a commit to paulirwin/lucene.net that referenced this issue Dec 21, 2024
@NightOwl888 NightOwl888 self-assigned this Jan 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants