.fdt file still exist when disable _source #69584

Closed
lexlee327 opened this issue Feb 25, 2021 · 9 comments
Labels
>bug, :Distributed Indexing/Engine, feedback_needed, Team:Distributed (Obsolete)

Comments

@lexlee327

lexlee327 commented Feb 25, 2021

Elasticsearch version (bin/elasticsearch --version):
7.11

Plugins installed: []
no

JVM version (java -version):
jdk 12

OS version (uname -a if on a Unix-like system):
mac

Description of the problem including expected versus actual behavior:
When I disable _source, I expect lower storage usage, but storage usage does not go down.
After setting _source: { enabled: false }, Elasticsearch no longer returns _source in get/search requests, but it still keeps the .fdt file in the segment.
Elasticsearch is widely used for OLAP workloads. When we only need the analysis report and never show the documents, we should be able to get rid of the stored fields.

Steps to reproduce:
1. Add sample data:
image

2. Reindex it with _source disabled:
image

3. The new index returns documents without _source:
image

4. The new index has the same size:
image

5. The .fdt file is still present:
image
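
For readers who cannot see the screenshots, the reproduction steps above can be approximated with the request bodies below. This is a minimal sketch, not the reporter's actual requests: the index names `sample` and `sample_no_source` are hypothetical, and the bodies assume the typeless mapping format of Elasticsearch 7.x. The script only builds and prints the JSON; it does not contact a cluster.

```python
import json


def create_index_body(source_enabled: bool) -> dict:
    """Build a create-index body that enables or disables the _source field."""
    return {"mappings": {"_source": {"enabled": source_enabled}}}


def reindex_body(source_index: str, dest_index: str) -> dict:
    """Build a _reindex body that copies documents from one index to another."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


if __name__ == "__main__":
    # Hypothetical index names, chosen only for illustration.
    print("PUT /sample_no_source")
    print(json.dumps(create_index_body(False), indent=2))
    print("POST /_reindex")
    print(json.dumps(reindex_body("sample", "sample_no_source"), indent=2))
```

The destination index has to be created with the mapping first, because `_source` cannot be toggled on an index that already exists.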
lexlee327 added the >bug and needs:triage labels Feb 25, 2021
@markharwood
Contributor

> When I disable the _source, It is expected lower storage usage, but it didn't.

Thanks for the report. One thing that looks to be missing is the file size comparison between the source enabled and source disabled indices. Could you provide that? Otherwise all you may be showing is that there is some residual use of stored values (possibly not related to source).
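
One way to produce the file size comparison requested here is to total the on-disk bytes per file extension in each shard's Lucene directory. The following is a rough sketch under assumptions: the directory path is something you would need to locate yourself, and the helper name `size_by_extension` is made up for this example.

```python
from collections import defaultdict
from pathlib import Path


def size_by_extension(directory: str) -> dict:
    """Sum file sizes in bytes per file extension for one directory (non-recursive)."""
    totals = defaultdict(int)
    for path in Path(directory).iterdir():
        if path.is_file():
            # Files without an extension (for example segments_N) are grouped under "".
            totals[path.suffix] += path.stat().st_size
    return dict(totals)
```

Running it against the shard directory of both indices and comparing the `.fdt` totals would show whether stored fields account for the difference. Note that small segments may be packed into compound files (`.cfs`), in which case stored field data does not appear as a separate `.fdt` file.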

@DaveCTurner
Contributor

I think we need to know exactly what version this is about too, the OP says 7.x but this is not at all helpful.

In some versions we temporarily retain the source for replica recovery (and CCR) even if _source is disabled. It gets removed by a later merge once it's no longer needed.

@ywelsch
Contributor

ywelsch commented Feb 25, 2021

Relates #41628

pgomulka added the :Distributed Indexing/Engine label Mar 1, 2021
elasticmachine added the Team:Distributed (Obsolete) label Mar 1, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added feedback_needed and removed needs:triage labels Mar 1, 2021
@lexlee327
Author

lexlee327 commented Mar 2, 2021

> > When I disable the _source, It is expected lower storage usage, but it didn't.
>
> Thanks for the report. One thing that looks to be missing is the file size comparison between the source enabled and source disabled indices. Could you provide that? Otherwise all you may be showing is that there is some residual use of stored values (possibly not related to source).

image
I reindexed the sample data into two indices named source_enable and source_disable. As you can see, the source_disable index is even bigger.
I think the key issue is that it still generates the .fdt file in the segment.

@lexlee327
Author

> I think we need to know exactly what version this is about too, the OP says 7.x but this is not at all helpful.
>
> In some versions we temporarily retain the source for replica recovery (and CCR) even if _source is disabled. It gets removed by a later merge once it's no longer needed.

I tested it on 7.9.0, 7.10.0, and 7.11.0.

@lexlee327
Author

Just as @DaveCTurner said, I had to disable soft deletes:

    "settings": {
      "number_of_replicas" : "0",
      "soft_deletes": {
        "enabled": false
      }
    }

Then the .fdt file has the expected size:

image
image

@DaveCTurner
Contributor

> just like @DaveCTurner said, I have to disable soft_delete

I said nothing about disabling soft deletes, and I certainly don't recommend that either.

What I did say was that the source is only kept while it's needed for recovery or CCR, and it's removed by a later merge. It's therefore the expected behaviour for there to be stored fields in some segments: there's only a bug here if we are retaining stored fields in merges even though there's no reason to retain it, and so far we haven't seen any evidence of that.
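
To look for the kind of evidence described here, one could trigger a merge and then inspect the segments again. The sketch below only builds the method and path for the force merge and `_cat/segments` APIs. The index name `source_disable` is taken from the earlier comment, and the helper names are hypothetical. It rests on the assumption that a force merge gives Elasticsearch an opportunity to drop retained source, which is not guaranteed while that source is still needed for recovery or CCR.

```python
def force_merge_request(index: str, max_num_segments: int = 1) -> tuple:
    """Return (method, path) for a force merge down to max_num_segments segments."""
    return ("POST", f"/{index}/_forcemerge?max_num_segments={max_num_segments}")


def segments_request(index: str) -> tuple:
    """Return (method, path) for listing the index's segments via the cat API."""
    return ("GET", f"/_cat/segments/{index}?v")


if __name__ == "__main__":
    for method, path in (force_merge_request("source_disable"),
                         segments_request("source_disable")):
        print(method, path)
```

If the `.fdt` files in the merged segments remain as large as before, and nothing still needs the retained source, that would be the sort of observation worth attaching to the issue.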

@DaveCTurner
Contributor

No further response after a few weeks so I'm closing this. If you can find and share evidence of a bug (see my previous message) then we can reopen this.
