Remove index_mode property from index templates #85985

@martijnvg martijnvg commented Apr 19, 2022

This PR removes the index_mode field from index templates (it was part of the data_stream JSON object).
This field was used to determine whether a tsdb data stream or a regular data stream should be created.
As part of this change, a tsdb data stream is now created when the template that matches the data stream to be created has the following properties:

  • An index.routing_path setting has been defined in the composable index template.
  • The fields mentioned in the index.routing_path setting refer to fields in the mapping of this template that are of type keyword and have the time_series_dimension attribute set to true.
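
As a rough illustration of the two conditions above, the check could be sketched as follows. This is not the code in this PR: `TsdbTemplateCheck` and `createsTsdbDataStream` are hypothetical names, the mapping is assumed to be flattened to full field paths, and returning false for a non-matching field is a simplification.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not an Elasticsearch class.
final class TsdbTemplateCheck {

    /**
     * Returns true when routingPath is non-empty and every entry names a field
     * that is of type keyword and has time_series_dimension set to true.
     * Assumption: fields maps a full field path to its mapping attributes.
     */
    static boolean createsTsdbDataStream(List<String> routingPath, Map<String, ? extends Map<String, ?>> fields) {
        if (routingPath == null || routingPath.isEmpty()) {
            return false; // no index.routing_path: treated as a regular data stream
        }
        for (String path : routingPath) {
            Map<String, ?> field = fields.get(path);
            if (field == null
                || !"keyword".equals(field.get("type"))
                || !Boolean.TRUE.equals(field.get("time_series_dimension"))) {
                return false;
            }
        }
        return true;
    }
}
```

Wildcards in index.routing_path are deliberately left out of this sketch.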

Prior to this change, a template that creates a tsdb data stream looked like:

{
  "index_patterns": ["k8s*"],
  "data_stream": {
    "index_mode": "time_series"
  },
  "template": {
    "settings": {
      "index.routing_path": ["metricset.name"]
    },
    "mappings": {
      "properties": {
        "metricset": {
          "properties": {
            "name": {
              "type": "keyword",
              "time_series_dimension": true
            }
          }
        }
      }
    }
  }
}

And with this change a template that creates a tsdb data stream looks like:

{
  "index_patterns": ["k8s*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.routing_path": ["metricset.name"]
    },
    "mappings": {
      "properties": {
        "metricset": {
          "properties": {
            "name": {
              "type": "keyword",
              "time_series_dimension": true
            }
          }
        }
      }
    }
  }
}

After this PR the following changes will be made:

  • Generating the index.routing_path setting from the mapping if it is not defined. This will make specifying this setting optional.
  • Allowing tsdb data streams to be created when dynamic mappings are used. In this case the concrete routing fields may not be known until ingestion starts, so configuring index.routing_path with a wildcard pattern is required and configuring these fields in the template mapping isn't. Validation will happen upon indexing, when dynamic fields are mapped.
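
A minimal sketch of how a wildcard index.routing_path entry could be checked against a dynamically mapped field at indexing time. The class and method names are hypothetical, `*` is assumed to match any run of characters, and the rule that a matching field must be a keyword dimension is an assumption drawn from the description above rather than the actual implementation.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not an Elasticsearch class.
final class RoutingPathPatterns {

    /** Returns true if fieldName matches pattern, where '*' matches any run of characters. */
    static boolean matches(String pattern, String fieldName) {
        StringBuilder regex = new StringBuilder();
        String[] literals = pattern.split("\\*", -1); // -1 keeps trailing empty parts
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*"); // each '*' in the pattern becomes ".*"
            }
            regex.append(Pattern.quote(literals[i])); // everything else is literal
        }
        return Pattern.matches(regex.toString(), fieldName);
    }

    /** Throws if a dynamically mapped field matches routing_path but is not a keyword dimension. */
    static void validateDynamicField(List<String> routingPath, String fieldName, String type, boolean dimension) {
        for (String pattern : routingPath) {
            if (matches(pattern, fieldName) && !("keyword".equals(type) && dimension)) {
                throw new IllegalArgumentException(
                    "field [" + fieldName + "] matches routing_path [" + pattern + "] but is not a keyword dimension");
            }
        }
    }
}
```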

@martijnvg martijnvg force-pushed the remove_index_mode_property_from_index_template branch 3 times, most recently from aa63aae to da14b26 Compare April 19, 2022 12:29
@martijnvg martijnvg force-pushed the remove_index_mode_property_from_index_template branch from da14b26 to 720cf09 Compare April 20, 2022 09:45
@martijnvg martijnvg marked this pull request as ready for review April 21, 2022 09:00
@martijnvg martijnvg added >non-issue :Data Management/Data streams Data streams and their lifecycles :StorageEngine/TSDB You know, for Metrics labels Apr 21, 2022
@elasticmachine elasticmachine added Team:Data Management Meta label for data/management team Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Apr 21, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@martijnvg martijnvg requested review from dakrone and imotov April 21, 2022 09:00
@sethmlarson sethmlarson added the Team:Clients Meta label for clients team label Apr 21, 2022
@elasticmachine
Collaborator

Pinging @elastic/clients-team (Team:Clients)

Contributor

@imotov imotov left a comment

Overall it looks reasonable to me, but I am a bit concerned about switching from an enum to a boolean to indicate a time series database. I think we should either rename this boolean to reflect what it actually does instead of which type of data streams it is used in, or we should keep it an enum, since I anticipate other types of specialized indices will follow if this one is successful.

@weizijun
Contributor

After this PR the following changes will be made:

  • Generating the index.routing_path setting from the mapping if it is not defined. This will make specifying this setting optional.
  • Allowing tsdb data streams to be created when dynamic mappings are used. In this case the concrete routing fields may not be known until ingestion starts, so configuring index.routing_path with a wildcard pattern is required and configuring these fields in the template mapping isn't. Validation will happen upon indexing, when dynamic fields are mapped.

About generating the index.routing_path: I had an in-depth discussion with @nik9000; the details are in #82633 (comment).
It seems that generating it is hard to implement.

@martijnvg
Member Author

It seems that generating it is hard to implement.

The plan is to not support the dynamic case; in that case the routing_path setting must be defined.
The routing_path index setting will only be generated if the template has mappings with fields of type keyword that have the time_series_dimension attribute enabled. We'd like to support this because it makes getting started with tsdb easier: if the first template someone defines has data streams enabled and time series dimension fields defined, then we create data streams in tsdb mode.

If no time series dimension fields are defined as concrete fields but a dynamic mapping template is defined, then we will require a routing_path (possibly with field names containing wildcards). But this is for more advanced use cases and likely not for anyone getting started with tsdb.
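
To make the static case concrete, here is a minimal sketch of deriving a routing path from a template mapping by collecting every keyword field that has time_series_dimension enabled. `RoutingPathGenerator` is a hypothetical name, the mapping is assumed to be the parsed "properties" object as nested maps, and dynamic templates are ignored entirely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not the implementation planned for Elasticsearch.
final class RoutingPathGenerator {

    /** Returns the sorted full paths of all keyword fields marked as time series dimensions. */
    static List<String> generate(Map<?, ?> properties) {
        List<String> paths = new ArrayList<>();
        collect("", properties, paths);
        Collections.sort(paths); // stable order regardless of map iteration order
        return paths;
    }

    private static void collect(String prefix, Map<?, ?> properties, List<String> out) {
        for (Map.Entry<?, ?> entry : properties.entrySet()) {
            if (!(entry.getValue() instanceof Map)) {
                continue; // not a field definition
            }
            Map<?, ?> field = (Map<?, ?>) entry.getValue();
            String path = prefix + entry.getKey();
            if ("keyword".equals(field.get("type")) && Boolean.TRUE.equals(field.get("time_series_dimension"))) {
                out.add(path);
            }
            Object nested = field.get("properties");
            if (nested instanceof Map) {
                collect(path + ".", (Map<?, ?>) nested, out); // descend into object fields
            }
        }
    }
}
```

If the result is empty, the template would either need an explicit routing_path or would create a regular data stream.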

@martijnvg
Member Author

I think we should either rename this boolean to reflect what it actually does instead of which type of data streams it is used in or we should keep it an enum since I anticipate other types of specialized indices will follow if this one is successful.

@imotov What other specialised index modes do you expect? I'm not against adding IndexMode back to DataStream class, but if we don't add more index modes in the future, it feels like 'an extension point' in data streams that isn't real.

Member

@dakrone dakrone left a comment

This generally looks good to me also, but I think I agree with @imotov about keeping it as an enum for future index types.

// (index_mode was behind a feature flag in the xcontent parser, so it could never actually be used)
// (this used to be an optional enum, so we just need to (de-)serialize a false boolean value here)
boolean value = in.readBoolean();
assert value == false;
Member

Can you add a message to this assert?
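
For illustration, an assert with a message might look like the sketch below. It uses plain java.io streams instead of Elasticsearch's stream classes, and `RemovedIndexModeReader` and the message wording are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in that reads from a byte array rather than a StreamInput.
final class RemovedIndexModeReader {

    /** Consumes the placeholder boolean that the removed optional field left on the wire. */
    static void skipRemovedIndexMode(byte[] wireBytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wireBytes))) {
            boolean value = in.readBoolean();
            // The expression after ':' becomes the AssertionError message.
            // It is only evaluated when the check fails and assertions are enabled (-ea).
            assert value == false : "expected false for the removed index_mode field but got [" + value + "]";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```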

}

var settings = MetadataIndexTemplateService.resolveSettings(indexTemplate, componentTemplates());
if (IndexMetadata.INDEX_ROUTING_PATH.exists(settings)) {
Member

If we're going to auto-generate the routing path maybe we should use index_mode here?

Member Author

This method is going to change once index.routing_path is generated from the template's mapping. Maybe we can also use the index_mode index setting here, but that isn't something we said a user would need to configure in order to create data streams in tsdb mode.

Member

I think configuring the mode is still a good thing. Right now we support configuring fields as time_series_dimensions without modifying the index itself. It'd be pretty surprising if upgrading Elasticsearch caused new indices with those mappings to be time_series. I'm not sure that's a terrible idea, but I think it's worth double checking. Personally I'd only try to generate the routing_path if the index is configured in time_series mode. But I might be wrong here.
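
The more conservative rule suggested here could be sketched as a small gate. This is only an illustration: `RoutingPathGate` is a hypothetical name, and settings are modelled as a flat string map rather than Elasticsearch's Settings class.

```java
import java.util.Map;

// Hypothetical helper, not an Elasticsearch class.
final class RoutingPathGate {

    /**
     * Generate a routing path only when the index is explicitly in time_series mode
     * and no index.routing_path has been configured already.
     */
    static boolean shouldGenerateRoutingPath(Map<String, String> settings) {
        return "time_series".equals(settings.get("index.mode"))
            && !settings.containsKey("index.routing_path");
    }
}
```

With this gate, existing templates that only mark fields as dimensions would keep producing regular indices after an upgrade.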

Member

Either way, that concern doesn't block this PR - it'd block the PR that generates the routing_path.

@martijnvg
Member Author

After chatting with @dakrone I realised there will be more instances of IndexMode, so I will change DataStream's timeSeries field back to indexMode.

@martijnvg martijnvg requested review from dakrone, imotov and nik9000 April 26, 2022 18:40
@nik9000
Member

nik9000 commented Apr 26, 2022

weizijun commented 5 days ago

About generating the index.routing_path: I had an in-depth discussion with @nik9000; the details are in #82633 (comment).
It seems that generating it is hard to implement.

martijnvg commented yesterday

The plan is to not support the dynamic case; in that case the routing_path setting must be defined. The routing_path index setting will only be generated if the template has mappings with fields of type keyword that have the time_series_dimension attribute enabled. We'd like to support this because it makes getting started with tsdb easier: if the first template someone defines has data streams enabled and time series dimension fields defined, then we create data streams in tsdb mode.

If no time series dimension fields are defined as concrete fields but a dynamic mapping template is defined, then we will require a routing_path (possibly with field names containing wildcards). But this is for more advanced use cases and likely not for anyone getting started with tsdb.

I think this is a pretty decent way to generate it, yeah. It's much better than what I was trying to do - building it dynamically from the keyword fields, updating it as things were added. Down that road lies madness.

I also spent a while thinking about how to generate the field's value when initializing the index itself for the first time. Like, build it from templates and stuff and then derive the routing_path statically from the keyword dimensions. This should have worked. But everything happens in the wrong order. So I never pushed on it.

I'm a little worried that this is abstracting away a complex and easy to misconfigure thing inside of templates that may not be owned by folks that know what's up. But it doesn't seem like a bad default. I've mostly listed all of the keyword dimensions in the routing_path for my tests. It's usually right, I think. And if it weren't for dynamic mappings the whole thing would be easier.....

Contributor

@imotov imotov left a comment

I think I would prefer to keep it enum everywhere, but I am ok with converting it to boolean internally. I am not sure I like the tsdb abbreviation though.

@@ -81,6 +81,9 @@ public final class DataStream implements SimpleDiffable<DataStream>, ToXContentO
private final boolean replicated;
private final boolean system;
private final boolean allowCustomRouting;
// This boolean field is for keeping track of whether a data stream is a tsdb data stream.
Contributor

That looks like a leftover from the previous iteration.

*
* @return new {@code DataStream} instance with the rollover operation applied
*/
- public DataStream rollover(Index writeIndex, long generation, IndexMode indexModeFromTemplate) {
+ public DataStream rollover(Index writeIndex, long generation, boolean isTsdbTemplate) {
Contributor

This is internal, so it is probably ok to keep it a boolean here, but I would still prefer not to use the abbreviation Tsdb. I think we are calling it TimeSeries everywhere else in the code.

@martijnvg
Member Author

I'm a little worried that this is abstracting away a complex and easy to misconfigure thing inside of templates that may not be owned by folks that know what's up. But it doesn't seem like a bad default. I've mostly listed all of the keyword dimensions in the routing_path for my tests. It's usually right, I think. And if it weren't for dynamic mappings the whole thing would be easier.....

Right, dynamic field mappings make things more difficult for us. But I suspect, like you, that in most cases the keyword dimension fields are statically configured in the mapping of the template.

@martijnvg
Member Author

I am not sure I like the tsdb abbreviation though.

@imotov I've renamed the variables/parameters to not use the tsdb abbreviation and removed the stale comment.

Contributor

@imotov imotov left a comment

LGTM, except for the enum serialization part

// (index_mode is removed and was part of the code base when tsdb was behind a feature flag)
// (index_mode was behind a feature flag in the xcontent parser, so it could never actually be used)
// (this used to be an optional enum, so we just need to (de-)serialize a false boolean value here)
boolean value = in.readBoolean();
Contributor

I think this is the most important place to keep it enum.

Member Author

The index_mode is removed from the composable index templates; the only reason that a boolean is read here is that index mode wasn't behind a feature flag in the binary serialisation code (it was in the xcontent serialisation).

The index mode is kept in the DataStream class.

So I think this is good?
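
To illustrate why a lone boolean is enough here: an optional enum is commonly written as a presence flag followed by the value, so a field that could never be set only ever puts `false` on the wire. The sketch below uses plain java.io streams and a hypothetical `SketchIndexMode` enum; it is not Elasticsearch's stream API and does not reproduce its exact wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for an index mode enum.
enum SketchIndexMode { STANDARD, TIME_SERIES }

final class OptionalEnumWire {

    /** Writes a presence flag, then the ordinal only when a value is present. */
    static byte[] write(SketchIndexMode mode) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeBoolean(mode != null);
            if (mode != null) {
                out.writeInt(mode.ordinal());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the presence flag and, if set, the ordinal; returns null when absent. */
    static SketchIndexMode read(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            return in.readBoolean() ? SketchIndexMode.values()[in.readInt()] : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch an absent value serialises to a single zero byte, which a reader that no longer knows the enum can consume with one boolean read.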

Member

@dakrone dakrone left a comment

LGTM also

Labels
:Data Management/Data streams Data streams and their lifecycles >non-issue :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Clients Meta label for clients team Team:Data Management Meta label for data/management team v8.3.0
8 participants