
Indices with "_source.enabled: false" same size as indices with "_source.enabled: true" #41628

Closed
davemoore- opened this issue Apr 28, 2019 · 7 comments
davemoore- commented Apr 28, 2019

Elasticsearch version: 7.0.0

Plugins installed: []

JVM version: OpenJDK 1.8.0_191

OS version: Ubuntu 16.04 (or Elastic Cloud)

Description of the problem including expected versus actual behavior:

When setting _source.enabled: false in the index mapping, the _source should not be stored.

In 7.0.0, when two indices have identical data and mappings (except for one having _source.enabled: false), the indices will be almost exactly the same size. This isn't the expected behavior.

In 6.7.1, when two indices have identical data and mappings (except for one having _source.enabled: false), the index with _source.enabled: false is roughly half the size of the one with _source enabled. This is the expected behavior.

Steps to reproduce:

Overview:

  1. Create two Elasticsearch clusters: version 6.7.1 and version 7.0.0.

  2. Create two index templates with identical mappings, but let the second template use _source.enabled: false. Put these two index templates in both clusters.

  3. Load data into the two indices on both clusters.

  4. Force merge the indices to a single segment.

  5. Compare the "Storage Size" of the two indices in Kibana for each cluster: /app/kibana#/management/elasticsearch/index_management/indices
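Steps 4 and 5 can also be done from the Dev Tools console instead of the Kibana UI. A minimal sketch, assuming the index names logs and logs-nosource used in the detailed steps below:

```
POST logs/_forcemerge?max_num_segments=1
POST logs-nosource/_forcemerge?max_num_segments=1

GET _cat/indices/logs,logs-nosource?v&h=index,docs.count,store.size
```

The store.size column from _cat/indices is the same figure Kibana reports as "Storage Size".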

More detailed:

Create the following templates and pipelines in the 7.0.0 cluster:

PUT _template/logs
{
  "index_patterns": ["logs"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "agent": {
        "type": "text"
      },
      "auth": {
        "type": "keyword"
      },
      "bytes": {
        "type": "long"
      },
      "clientip": {
        "type": "ip"
      },
      "httpversion": {
        "type": "double"
      },
      "ident": {
        "type": "keyword"
      },
      "message": {
        "type": "text"
      },
      "referrer": {
        "type": "keyword"
      },
      "request": {
        "type": "keyword"
      },
      "response": {
        "type": "long"
      },
      "verb": {
        "type": "keyword"
      }
    }
  }
}
PUT _template/logs-nosource
{
  "index_patterns": ["logs-nosource"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "agent": {
        "type": "text"
      },
      "auth": {
        "type": "keyword"
      },
      "bytes": {
        "type": "long"
      },
      "clientip": {
        "type": "ip"
      },
      "httpversion": {
        "type": "double"
      },
      "ident": {
        "type": "keyword"
      },
      "message": {
        "type": "text"
      },
      "referrer": {
        "type": "keyword"
      },
      "request": {
        "type": "keyword"
      },
      "response": {
        "type": "long"
      },
      "verb": {
        "type": "keyword"
      }
    }
  }
}
PUT _ingest/pipeline/logs
{
  "description": "Ingest pipeline for logs",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{COMBINEDAPACHELOG}"
        ]
      }
    },
    {
      "date": {
        "field": "timestamp",
        "formats": [
          "dd/MMM/yyyy:HH:mm:ss XX"
        ]
      }
    },
    {
      "remove": {
        "field": "timestamp"
      }
    }
  ]
}

Create the following indices and templates in the 6.7.1 cluster:

PUT _template/logs
{
  "index_patterns": ["logs"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "agent": {
          "type": "text"
        },
        "auth": {
          "type": "keyword"
        },
        "bytes": {
          "type": "long"
        },
        "clientip": {
          "type": "ip"
        },
        "httpversion": {
          "type": "double"
        },
        "ident": {
          "type": "keyword"
        },
        "message": {
          "type": "text"
        },
        "referrer": {
          "type": "keyword"
        },
        "request": {
          "type": "keyword"
        },
        "response": {
          "type": "long"
        },
        "verb": {
          "type": "keyword"
        }
      }
    }
  }
}
PUT _template/logs-nosource
{
  "index_patterns": ["logs-nosource"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "agent": {
          "type": "text"
        },
        "auth": {
          "type": "keyword"
        },
        "bytes": {
          "type": "long"
        },
        "clientip": {
          "type": "ip"
        },
        "httpversion": {
          "type": "double"
        },
        "ident": {
          "type": "keyword"
        },
        "message": {
          "type": "text"
        },
        "referrer": {
          "type": "keyword"
        },
        "request": {
          "type": "keyword"
        },
        "response": {
          "type": "long"
        },
        "verb": {
          "type": "keyword"
        }
      }
    }
  }
}
PUT _ingest/pipeline/logs
{
  "description": "Ingest pipeline for logs",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{COMBINEDAPACHELOG}"
        ]
      }
    },
    {
      "date": {
        "field": "timestamp",
        "formats": [
          "dd/MMM/yyyy:HH:mm:ss ZZ"
        ]
      }
    },
    {
      "remove": {
        "field": "timestamp"
      }
    }
  ]
}

Download and unzip the data from https://storage.googleapis.com/elasticsearch-sizing-workshop/data/nginx.zip and then load the nginx.log file into the "logs" and "logs-nosource" indices on both clusters.

Force merge the indices to a single segment.

Compare the size of the indices in Kibana. Elasticsearch 7.0.0 shows both indices as being roughly the same size, whereas Elasticsearch 6.7.1 shows the "logs-nosource" index being roughly half the size of the "logs" index.
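As a sanity check that the mapping took effect, a search against the no-source index should return hits that contain _id but no _source field (sketch):

```
GET logs-nosource/_search?size=1
```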

@davemoore- davemoore- changed the title "_source" Indices with "_source.enabled: false" same size as "_source.enabled: true" Apr 28, 2019
@davemoore- davemoore- changed the title Indices with "_source.enabled: false" same size as "_source.enabled: true" Indices with "_source.enabled: false" same size as indices with "_source.enabled: true" Apr 28, 2019
@davemoore- davemoore- added v7.0.0 >bug :Search Foundations/Mapping Index mappings, including merging and defining field types labels Apr 28, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@jtibshirani
Contributor

I think the increase in size is due to the fact that we now add a _recovery_source field to the document if _source is disabled but soft deletes are enabled (#31106). From my understanding, the _recovery_source fields will eventually get removed during merges once Elasticsearch determines the documents aren't needed for replay/recovery.
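The effective value of the soft-deletes setting (including the version-dependent default) can be checked on the index itself. A sketch:

```
GET logs-nosource/_settings?include_defaults=true&filter_path=**.soft_deletes
```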

In 6.7, index.soft_deletes.enabled defaults to false, whereas in 7.0 it defaults to true, so this could explain the difference in index size you're observing between the two versions. When I try setting index.soft_deletes.enabled: false on a 7.0 index, disabling _source in fact decreases the size of the index by around one half.
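To reproduce that comparison, soft deletes can be disabled at index creation time on 7.0. A sketch, assuming an illustrative index name; the field mappings from the logs-nosource template above would be added under "mappings", and note that index.soft_deletes.enabled cannot be changed on an existing index:

```
PUT logs-nosource-nosoft
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.soft_deletes.enabled": false
  },
  "mappings": {
    "_source": {
      "enabled": false
    }
  }
}
```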

For this reason, I'm not sure the behavior indicates a bug. I'll tag @elastic/es-distributed to see if they think a follow-up is in order or would like to add any information.

@dnhatn dnhatn self-assigned this May 1, 2019
@dnhatn dnhatn added the :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. label May 1, 2019
@jtibshirani jtibshirani added :Search Foundations/Mapping Index mappings, including merging and defining field types and removed :Search Foundations/Mapping Index mappings, including merging and defining field types labels May 1, 2019
@dnhatn
Member

dnhatn commented May 1, 2019

Thanks @davemoore- for reporting this. @jtibshirani your explanation is correct.

This test creates a single segment (the dataset is quite small), and TieredMergePolicy considers that segment already merged. Hence, RecoverySourcePruneMergePolicy is never triggered to prune away _recovery_source.

In this scenario the behaviour is probably okay, as the store size is pretty small (25MB). However, it can be problematic with a larger dataset, because _recovery_source won't be pruned away when the retention lease advances if the segments are already merged.
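In console terms: once the index is already down to one segment, repeating the force merge finds nothing to merge and returns without rewriting the segment, so the stored _recovery_source stays on disk. A sketch:

```
GET _cat/segments/logs-nosource?v&h=segment,docs.count,size

POST logs-nosource/_forcemerge?max_num_segments=1
```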

I have a test that demonstrates this behavior.

    public void testPruneRecoverySource() throws Exception {
        Settings.Builder settings = Settings.builder()
            .put(defaultSettings.getSettings())
            .put(IndexSettings.INDEX_SOFT_DELETES_SETTING.getKey(), true)
            .put(IndexSettings.INDEX_SOFT_DELETES_RETENTION_OPERATIONS_SETTING.getKey(), 0);
        final IndexMetaData indexMetaData = IndexMetaData.builder(defaultSettings.getIndexMetaData()).settings(settings).build();
        final IndexSettings indexSettings = IndexSettingsModule.newIndexSettings(indexMetaData);
        final AtomicLong globalCheckpoint = new AtomicLong(SequenceNumbers.NO_OPS_PERFORMED);
        final MapperService mapperService = createMapperService("test");
        final MergePolicy mp = new TieredMergePolicy(); // works with LogDocMergePolicy
        try (Store store = createStore();
             InternalEngine engine = createEngine(config(indexSettings, store, createTempDir(), mp, null, null, globalCheckpoint::get))) {
            int numDocs = 10;
            for (int i = 0; i < numDocs; i++) {
                ParsedDocument doc = testParsedDocument(Integer.toString(i), null, testDocument(), new BytesArray("{}"), null, true);
                engine.index(indexForDoc(doc));
            }
            globalCheckpoint.set(engine.getLocalCheckpoint());
            engine.syncTranslog();
            engine.flush(true, true);
            engine.forceMerge(true, 1, false, false, false);
            try (Translog.Snapshot snapshot = engine.newChangesSnapshot("test", mapperService, 0, Long.MAX_VALUE, true)) {
                IllegalStateException sourceNotFound = expectThrows(IllegalStateException.class, snapshot::next);
                assertThat(sourceNotFound.getMessage(), startsWith("source not found"));
            }
        }
    }

@s1monw @jpountz What do you think?

@s1monw
Contributor

s1monw commented May 2, 2019

I think we can look into triggering a merge that would drop all sources. Yet, this is really only something that is relevant, or should be done, in a force_merge context? @dnhatn WDYT? I mean, this might look confusing, but it is what it is. We can't magically make it go away. I think if you run force merge too quickly you would still have the same issue if we need to retain the sources.

@jpountz
Contributor

jpountz commented May 7, 2019

I think if you run force merge too quickly you would still have the same issue if we need to retain the sources.

Maybe we should fail _forcemerge calls if not all recovery sources can be reclaimed. I can't think of any case when having recovery sources in force-merged segments would be the desired behavior.

@dnhatn dnhatn removed :Search Foundations/Mapping Index mappings, including merging and defining field types v7.0.0 labels May 23, 2019
dnhatn added a commit that referenced this issue Nov 1, 2019
This test failure manifests the limitation of the recovery source merge policy explained in #41628. If we have already merged down to a single segment, then subsequent force merges will be no-ops even though they could prune recovery source. We need to adjust this test until we have a fix for the merge policy.

Relates #41628
Closes #48735
dnhatn added four more commits referencing this issue on Nov 3 and Nov 9, 2019, each with the same message (Relates #41628, Closes #48735).
@rjernst rjernst added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 4, 2020
@dnhatn dnhatn added :StorageEngine/Logs You know, for Logs and removed :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. labels Apr 26, 2024
@elasticsearchmachine elasticsearchmachine added Team:StorageEngine and removed Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Apr 26, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@dnhatn
Member

dnhatn commented Nov 20, 2024

Will be fixed in #114618

@dnhatn dnhatn closed this as completed Nov 20, 2024
8 participants