Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows bwcTestSnapshots unable to upgrade a node due to assertion failing #85725

Closed
martijnvg opened this issue Apr 6, 2022 · 8 comments · Fixed by #86192
Closed

Windows bwcTestSnapshots unable to upgrade a node due to assertion failing #85725

martijnvg opened this issue Apr 6, 2022 · 8 comments · Fixed by #86192
Assignees
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI

Comments

@martijnvg
Copy link
Member

CI Link

https://gradle-enterprise.elastic.co/s/vn4gmfzui5s3q

Repro line

NONE

Does it reproduce?

Didn't try

Applicable branches

master, 8.2

Failure history

No response

Failure excerpt

java.lang.AssertionError: expected file size 16777216 but was 0 |  
-- | --
  | »  	at org.elasticsearch.xpack.searchablesnapshots.cache.shared.SharedBytes.<init>(SharedBytes.java:64) |  
  | »  	at org.elasticsearch.xpack.searchablesnapshots.cache.shared.FrozenCacheService.<init>(FrozenCacheService.java:295) |  
  | »  	at org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshots.createComponents(SearchableSnapshots.java:339) |  
  | »  	at org.elasticsearch.node.Node.lambda$new$16(Node.java:685) |  
  | »  	at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:273) |  
  | »  	at java.base/java.util.AbstractList$RandomAccessSpliterator.forEachRemaining(AbstractList.java:720) |  
  | »  	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) |  
  | »  	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) |  
  | »  	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:575) |  
  | »  	at java.base/java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260) |  
  | »  	at java.base/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:616) |  
  | »  	at java.base/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:622) |  
  | »  	at java.base/java.util.stream.ReferencePipeline.toList(ReferencePipeline.java:627) |  
  | »  	at org.elasticsearch.node.Node.<init>(Node.java:699) |  
  | »  	at org.elasticsearch.node.Node.<init>(Node.java:291) |  
  | »  	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:234) |  
  | »  	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:234) |  
  | »  	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:358) |  
  | »  	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:166) |  
  | »  	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:157) |  
  | »  	at org.elasticsearch.common.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:81) |  
  | »  	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:112) |  
  | »  	at org.elasticsearch.cli.Command.main(Command.java:77) |  
  | »  	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:122) |  
  | »  	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:80)


@martijnvg martijnvg added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Apr 6, 2022
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Apr 6, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@mark-vieira
Copy link
Contributor

Looks like these bwc tests don't work on Windows?

Not sure what you mean. They did work. Is it possible we've introduced a regression here that only manifests on Windows?

@tlrx
Copy link
Member

tlrx commented Apr 12, 2022

Most recent change in this area is #85638 but I'm not sure it is caused by this change. Would you mind taking a look @fcofdez?

@fcofdez fcofdez self-assigned this Apr 12, 2022
@mark-vieira
Copy link
Contributor

mark-vieira commented Apr 19, 2022

Any updates here? This has been failing for about two weeks now.

@fcofdez
Copy link
Contributor

fcofdez commented Apr 25, 2022

I've been checking this and I think that frozen file preallocation is not working correctly in Windows.
We rely on RandomAccessFile#setLength in Windows, that ends up calling https://github.com/openjdk/jdk/blob/master/src/java.base/windows/native/libjava/io_util_md.c#L458-L473:

jint
handleSetLength(FD fd, jlong length) {
    HANDLE h = (HANDLE)fd;
    FILE_END_OF_FILE_INFO eofInfo;

    eofInfo.EndOfFile.QuadPart = length;

    if (h == INVALID_HANDLE_VALUE) {
        return -1;
    }
    if (!SetFileInformationByHandle(h, FileEndOfFileInfo, &eofInfo,
            sizeof(FILE_END_OF_FILE_INFO))) {
        return -1;
    }
    return 0;
}

It calls SetFileInformationByHandle using FILE_END_OF_FILE_INFO which I think it doesn't preallocate the file, as it should call it using FILE_ALLOCATION_INFO (see this article with a really similar example). Sadly the JDK doesn't expose that API and I guess we should rely again on JNA to perform this operation as we do in other platforms. cc @tlrx

@tlrx
Copy link
Member

tlrx commented Apr 25, 2022

If I understand correctly before #85638 bwc upgrade tests passed successfully but without using a shared cached file at all (as its computed size was 0 the shared cache file was not even created). After #85638 was merged the tests started to fail on Windows with an assertion error expected file size 16777216 but was 0.

The assertion was introduced by #79447 which speeds up the shared cache file creation. I remember the setLength method worked on the Windows instances we used (ie, the file was created faster than filling TB with zeroes). If I remember correctly the setLength method on Windows created a sparse file with the expected size, but I don't remember what was the size (as returned by fileChannel.size()) of this file after the node is stopped.

Anyway, it seems that creating the file using setLength works so maybe it is just a matter of adapting the assertion for Windows? I'd prefer to avoid the cost of maintaining another native method for Windows.

@fcofdez
Copy link
Contributor

fcofdez commented Apr 25, 2022

Your summary is correct.

Anyway, it seems that creating the file using setLength works so maybe it is just a matter of adapting the assertion for Windows? I'd prefer to avoid the cost of maintaining another native method for Windows.

I haven't checked with a Windows machine, but I agree that we can adjust the assertion for Windows. I'll open a PR shortly.

fcofdez added a commit to fcofdez/elasticsearch that referenced this issue Apr 26, 2022
…indows

In Windows, if there's a pre-existing shared cache file with the
expected size we were deleting the existing cache file as the flag
controlling the shared cache file creation wasn't populated in that
case.

Closes elastic#85725
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants