
Indices with "_source.enabled: false" same size as indices with "_source.enabled: true" #41628

Closed
davemoore- opened this issue Apr 28, 2019 · 7 comments
davemoore- commented Apr 28, 2019

Elasticsearch version: 7.0.0

Plugins installed: []

JVM version: OpenJDK 1.8.0_191

OS version: Ubuntu 16.04 (or Elastic Cloud)

Description of the problem including expected versus actual behavior:

When setting _source.enabled: false in the index mapping, the _source should not be stored.

In 7.0.0, when two indices have identical data and mappings (except for one having _source.enabled: false), the indices will be almost exactly the same size. This isn't the expected behavior.

In 6.7.1, when two indices have identical data and mappings (except for one having _source.enabled: false), the index with _source.enabled: false is roughly half the size of the one with _source enabled. This is the expected behavior.

Steps to reproduce:

Overview:

  1. Create two Elasticsearch clusters: version 6.7.1 and version 7.0.0.

  2. Create two index templates with identical mappings, but let the second template use _source.enabled: false. Put these two index templates in both clusters.

  3. Load data into the two indices on both clusters.

  4. Force merge the indices to a single segment.

  5. Compare the "Storage Size" of the two indices in Kibana for each cluster: /app/kibana#/management/elasticsearch/index_management/indices
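Steps 4 and 5 can also be done from the Dev Tools console instead of the Kibana UI. A minimal sketch, assuming the index names logs and logs-nosource used in the detailed steps below:

```
POST logs/_forcemerge?max_num_segments=1
POST logs-nosource/_forcemerge?max_num_segments=1

GET _cat/indices/logs,logs-nosource?v&h=index,docs.count,store.size
```

The store.size column from _cat/indices is the same figure Kibana reports as "Storage Size".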

More detailed:

Create the following templates and pipelines in the 7.0.0 cluster:

PUT _template/logs
{
  "index_patterns": ["logs"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "agent": {
        "type": "text"
      },
      "auth": {
        "type": "keyword"
      },
      "bytes": {
        "type": "long"
      },
      "clientip": {
        "type": "ip"
      },
      "httpversion": {
        "type": "double"
      },
      "ident": {
        "type": "keyword"
      },
      "message": {
        "type": "text"
      },
      "referrer": {
        "type": "keyword"
      },
      "request": {
        "type": "keyword"
      },
      "response": {
        "type": "long"
      },
      "verb": {
        "type": "keyword"
      }
    }
  }
}
PUT _template/logs-nosource
{
  "index_patterns": ["logs-nosource"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "agent": {
        "type": "text"
      },
      "auth": {
        "type": "keyword"
      },
      "bytes": {
        "type": "long"
      },
      "clientip": {
        "type": "ip"
      },
      "httpversion": {
        "type": "double"
      },
      "ident": {
        "type": "keyword"
      },
      "message": {
        "type": "text"
      },
      "referrer": {
        "type": "keyword"
      },
      "request": {
        "type": "keyword"
      },
      "response": {
        "type": "long"
      },
      "verb": {
        "type": "keyword"
      }
    }
  }
}
PUT _ingest/pipeline/logs
{
  "description": "Ingest pipeline for logs",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{COMBINEDAPACHELOG}"
        ]
      }
    },
    {
      "date": {
        "field": "timestamp",
        "formats": [
          "dd/MMM/yyyy:HH:mm:ss XX"
        ]
      }
    },
    {
      "remove": {
        "field": "timestamp"
      }
    }
  ]
}

Create the following indices and templates in the 6.7.1 cluster:

PUT _template/logs
{
  "index_patterns": ["logs"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "agent": {
          "type": "text"
        },
        "auth": {
          "type": "keyword"
        },
        "bytes": {
          "type": "long"
        },
        "clientip": {
          "type": "ip"
        },
        "httpversion": {
          "type": "double"
        },
        "ident": {
          "type": "keyword"
        },
        "message": {
          "type": "text"
        },
        "referrer": {
          "type": "keyword"
        },
        "request": {
          "type": "keyword"
        },
        "response": {
          "type": "long"
        },
        "verb": {
          "type": "keyword"
        }
      }
    }
  }
}
PUT _template/logs-nosource
{
  "index_patterns": ["logs-nosource"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "agent": {
          "type": "text"
        },
        "auth": {
          "type": "keyword"
        },
        "bytes": {
          "type": "long"
        },
        "clientip": {
          "type": "ip"
        },
        "httpversion": {
          "type": "double"
        },
        "ident": {
          "type": "keyword"
        },
        "message": {
          "type": "text"
        },
        "referrer": {
          "type": "keyword"
        },
        "request": {
          "type": "keyword"
        },
        "response": {
          "type": "long"
        },
        "verb": {
          "type": "keyword"
        }
      }
    }
  }
}
PUT _ingest/pipeline/logs
{
  "description": "Ingest pipeline for logs",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{COMBINEDAPACHELOG}"
        ]
      }
    },
    {
      "date": {
        "field": "timestamp",
        "formats": [
          "dd/MMM/yyyy:HH:mm:ss ZZ"
        ]
      }
    },
    {
      "remove": {
        "field": "timestamp"
      }
    }
  ]
}

Download and unzip the data from https://storage.googleapis.com/elasticsearch-sizing-workshop/data/nginx.zip and then load the nginx.log file into the "logs" and "logs-nosource" indices on both clusters.

Force merge the indices to a single segment.

Compare the size of the indices in Kibana. Elasticsearch 7.0.0 shows both indices as being roughly the same size, whereas Elasticsearch 6.7.1 shows the "logs-nosource" index being roughly half the size of the "logs" index.
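As a sanity check that the mapping took effect, a search against the no-source index should return hits that contain _id but no _source field (sketch):

```
GET logs-nosource/_search?size=1
```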

@davemoore- davemoore- changed the title "_source" Indices with "_source.enabled: false" same size as "_source.enabled: true" Apr 28, 2019
@davemoore- davemoore- changed the title Indices with "_source.enabled: false" same size as "_source.enabled: true" Indices with "_source.enabled: false" same size as indices with "_source.enabled: true" Apr 28, 2019
@davemoore- davemoore- added v7.0.0 >bug :Search Foundations/Mapping Index mappings, including merging and defining field types labels Apr 28, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@jtibshirani
Contributor

I think the increase in size is due to the fact that we now add a _recovery_source field to the document if _source is disabled but soft deletes are enabled (#31106). From my understanding, the _recovery_source fields will eventually get removed during merges once Elasticsearch determines the documents aren't needed for replay/recovery.
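The effective value of the soft-deletes setting (including the version-dependent default) can be checked on the index itself. A sketch:

```
GET logs-nosource/_settings?include_defaults=true&filter_path=**.soft_deletes
```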

In 6.7, index.soft_deletes.enabled defaults to false, whereas in 7.0 it defaults to true, so this could explain the difference in index size you're observing between the two versions. When I try setting index.soft_deletes.enabled: false on a 7.0 index, disabling _source in fact decreases the size of the index by around one half.
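To reproduce that comparison, soft deletes can be disabled at index creation time on 7.0. A sketch, assuming an illustrative index name; the field mappings from the logs-nosource template above would be added under "mappings", and note that index.soft_deletes.enabled cannot be changed on an existing index:

```
PUT logs-nosource-nosoft
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.soft_deletes.enabled": false
  },
  "mappings": {
    "_source": {
      "enabled": false
    }
  }
}
```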

For this reason, I'm not sure the behavior indicates a bug. I'll tag @elastic/es-distributed to see if they think a follow-up is in order or would like to add any information.

@dnhatn dnhatn self-assigned this May 1, 2019
@dnhatn dnhatn added the :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. label May 1, 2019
@jtibshirani jtibshirani added :Search Foundations/Mapping Index mappings, including merging and defining field types and removed :Search Foundations/Mapping Index mappings, including merging and defining field types labels May 1, 2019
@dnhatn
Member

dnhatn commented May 1, 2019

Thanks @davemoore- for reporting this. @jtibshirani your explanation is correct.

This test creates a single segment (the dataset is quite small), and TieredMergePolicy considers that segment already merged. Hence, RecoverySourcePruneMergePolicy is never triggered to prune away _recovery_source.

In this scenario the behaviour is probably okay, as the store size is pretty small (25MB). However, it can be problematic with a larger dataset, because _recovery_source won't be pruned away when the retention lease advances if the segments are already merged.
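In console terms: once the index is already down to one segment, repeating the force merge finds nothing to merge and returns without rewriting the segment, so the stored _recovery_source stays on disk. A sketch:

```
GET _cat/segments/logs-nosource?v&h=segment,docs.count,size

POST logs-nosource/_forcemerge?max_num_segments=1
```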

I have a test that demonstrates this behavior.

    public void testPruneRecoverySource() throws Exception {
        Settings.Builder settings = Settings.builder()
            .put(defaultSettings.getSettings())
            .put(IndexSettings.INDEX_SOFT_DELETES_SETTING.getKey(), true)
            .put(IndexSettings.INDEX_SOFT_DELETES_RETENTION_OPERATIONS_SETTING.getKey(), 0);
        final IndexMetaData indexMetaData = IndexMetaData.builder(defaultSettings.getIndexMetaData()).settings(settings).build();
        final IndexSettings indexSettings = IndexSettingsModule.newIndexSettings(indexMetaData);
        final AtomicLong globalCheckpoint = new AtomicLong(SequenceNumbers.NO_OPS_PERFORMED);
        final MapperService mapperService = createMapperService("test");
        final MergePolicy mp = new TieredMergePolicy(); // works with LogDocMergePolicy
        try (Store store = createStore();
             InternalEngine engine = createEngine(config(indexSettings, store, createTempDir(), mp, null, null, globalCheckpoint::get))) {
            int numDocs = 10;
            for (int i = 0; i < numDocs; i++) {
                ParsedDocument doc = testParsedDocument(Integer.toString(i), null, testDocument(), new BytesArray("{}"), null, true);
                engine.index(indexForDoc(doc));
            }
            globalCheckpoint.set(engine.getLocalCheckpoint());
            engine.syncTranslog();
            engine.flush(true, true);
            engine.forceMerge(true, 1, false, false, false);
            try (Translog.Snapshot snapshot = engine.newChangesSnapshot("test", mapperService, 0, Long.MAX_VALUE, true)) {
                IllegalStateException sourceNotFound = expectThrows(IllegalStateException.class, snapshot::next);
                assertThat(sourceNotFound.getMessage(), startsWith("source not found"));
            }
        }
    }

@s1monw @jpountz What do you think?

@s1monw
Contributor

s1monw commented May 2, 2019

I think we can look into triggering a merge that would drop all sources. Yet, this is really only something that is relevant, or should be done, in a force_merge context? @dnhatn WDYT? I mean, this might look confusing, but it is what it is. We can't magically make it go away. I think if you run force merge too quickly you would still have the same issue if we need to retain the sources.

@jpountz
Contributor

jpountz commented May 7, 2019

I think if you run force merge too quickly you would still have the same issue if we need to retain the sources.

Maybe we should fail _forcemerge calls if not all recovery sources can be reclaimed. I can't think of any case when having recovery sources in force-merged segments would be the desired behavior.

@dnhatn dnhatn removed :Search Foundations/Mapping Index mappings, including merging and defining field types v7.0.0 labels May 23, 2019
dnhatn added a commit that referenced this issue Nov 1, 2019
This test failure manifests the limitation of the recovery source merge policy explained in #41628. If we have already merged down to a single segment, then subsequent force merges will be no-ops even though they could prune recovery source. We need to adjust this test until we have a fix for the merge policy.

Relates #41628
Closes #48735
dnhatn added four more commits referencing this issue on Nov 3 and Nov 9, 2019, each with the same message (Relates #41628, Closes #48735).
@rjernst rjernst added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 4, 2020
@dnhatn dnhatn added :StorageEngine/Logs You know, for Logs and removed :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. labels Apr 26, 2024
@elasticsearchmachine elasticsearchmachine added Team:StorageEngine and removed Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Apr 26, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@dnhatn
Member

dnhatn commented Nov 20, 2024

Will be fixed in #114618

@dnhatn dnhatn closed this as completed Nov 20, 2024
8 participants