Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ES|QL Add initial support for semantic_text field type #113920

Merged
merged 38 commits into from
Oct 21, 2024

Conversation

ioanatia
Copy link
Contributor

@ioanatia ioanatia commented Oct 2, 2024

Support is added behind a feature flag. We could not simply use EsqlCapabilities since that's not available in esql-core.
Right now we have no support in existing functions.
I followed the approach for adding initial support for date_nanos which was also added behind a feature flag which allowed for incremental progress, rather than adding support for everything in one PR: #110205

With this PR we will return semantic_text fields as part of the results and it will also allow us to refer to semantic_text fields in the match function (to run semantic search):

{
  "columns": [
    {
      "name": "semantic_text_field",
      "type": "semantic_text"
    }
  ],
  "values": [
    [
      "these are not the droids you're looking for. He's free to go around"
    ],
    [
      "all we have to decide is what to do with the time that is given to us"
    ],
    [
      "I have no idea what I'm doing, but I know I'm doing it really, really well."
    ]
  ]
}

realtes: #115103

@elasticsearchmachine
Copy link
Collaborator

Hi @ioanatia, I've created a changelog YAML for you.

@ioanatia ioanatia force-pushed the semantic_text_support branch from 8537aaa to 36e4c7d Compare October 4, 2024 13:07
@ioanatia ioanatia marked this pull request as ready for review October 7, 2024 17:02
@ioanatia ioanatia requested a review from a team as a code owner October 7, 2024 17:02
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Oct 7, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@ioanatia ioanatia added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 7, 2024
@elasticsearchmachine elasticsearchmachine removed the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 7, 2024
@@ -43,6 +44,7 @@ public static ElasticsearchCluster localCluster(ElasticsearchCluster remoteClust
.setting("cluster.remote.connections_per_cluster", "1")
.shared(true)
.setting("cluster.routing.rebalance.enable", "none")
.plugin("inference-service-test")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this plugin is used for testing semantic_text - it's an inference service that can create sparse or dense embeddings - that have no actual "semantic" meaning since they are not using a model, but they are supposed to be deterministic.

Request request = new Request("PUT", "_inference/sparse_embedding/test_sparse_inference");
request.setJsonEntity("""
{
"service": "test_service",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the inference service from the inference-service-test test plugin

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one's probably worth javadoc to explain that it's for the semantic text fields.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It kinda surprises me that this is the first capability conditional in the test infra. How expensive is this service endpoint? Conditionally creating it is of course fine, but the test infra would be simpler if it was unconditionally registered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test endpoint is quite lightweight. We could always register it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javadoc was added

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. @not-napoleon might want to take a look because this is beginning to follow your lead.

@@ -14,7 +14,7 @@

public class EsqlSpecIT extends EsqlSpecTestCase {
@ClassRule
public static ElasticsearchCluster cluster = Clusters.testCluster(spec -> {});
public static ElasticsearchCluster cluster = Clusters.testCluster(spec -> spec.plugin("inference-service-test"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will everything need this plugin? should it be in testCluster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think only this one needs it - it's the only one for multi_node that extends from EsqlSpecTestCase that loads the CSV data sets.

Request request = new Request("PUT", "_inference/sparse_embedding/test_sparse_inference");
request.setJsonEntity("""
{
"service": "test_service",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one's probably worth javadoc to explain that it's for the semantic text fields.

@costin costin requested a review from fang-xing-esql October 7, 2024 23:11
Copy link
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@carlosdelest carlosdelest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

semantic_text support, yay! 💯

Request request = new Request("PUT", "_inference/sparse_embedding/test_sparse_inference");
request.setJsonEntity("""
{
"service": "test_service",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test endpoint is quite lightweight. We could always register it.

@ioanatia
Copy link
Contributor Author

ioanatia commented Oct 8, 2024

There are some issues with loading this plugin for multi-cluster and mixed-versions when one of the cluster nodes in on 8.16.0:

       Caused by:
        java.lang.RuntimeException: Execution of elasticsearch-plugin failed with exit code 70
            at org.elasticsearch.test.cluster.local.AbstractLocalClusterFactory$Node.runToolScript(AbstractLocalClusterFactory.java:795)
            at org.elasticsearch.test.cluster.local.AbstractLocalClusterFactory$Node.installPlugins(AbstractLocalClusterFactory.java:633)
            at org.elasticsearch.test.cluster.local.AbstractLocalClusterFactory$Node.start(AbstractLocalClusterFactory.java:159)
            at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.lambda$start$0(DefaultLocalClusterHandle.java:78)
            at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
            at java.base/java.util.AbstractList$RandomAccessSpliterator.forEachRemaining(AbstractList.java:722)
            at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:556)
            at java.base/java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
            at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:759)
            at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:507)
            at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1491)
            at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:2073)
            at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2035)
            at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:187)

multi-node and single-node should continue to work just fine.
For now I conditionally disabled loading the plugin in multi-cluster and mixed-versions when one of the cluster nodes is on 8.x. Once we get a fix for the plugin, we can remove this check, but I don't have that much familiarity with the plugin and don't know how to fix it. I reached out to the team that owns it.

The checks CsvTestsDataLoader have also been changed - they no longer check for SEMANTIC_TEXT_TYPE ESQL capability, but rather check whether a dataset requires the test inference endpoint. This kind of makes more sense than to check for an ES|QL capability, we are not using ES|QL APIs to load the data, but the create index and bulk APIs.
If the test inference endpoint does not exist and a dataset requires it, we simply do not load the dataset.
In EsqlSpecTestCase I added another check for shouldSkipTest - in case SEMANTIC_TEXT_TYPE is required we also check that we can use the test inference service. This makes sense because if the test inference service we have no data with semantic_text fields.

@ChrisHegarty
Copy link
Contributor

Ok, mixed node testing will be avoided until the following issue is resolved as a follow up. #115166. (given that mixed mode will only be relevant when this PR is backported)

@ChrisHegarty
Copy link
Contributor

@elasticmachine update branch

@ChrisHegarty ChrisHegarty added the test-release Trigger CI checks against release build label Oct 19, 2024
Copy link
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR has had a number of review iterations. It's a great first step towards support for semantic search in ES|QL. LGTM 👍

@ChrisHegarty ChrisHegarty changed the title Add initial support for semantic_text field type ES|QL Add initial support for semantic_text field type Oct 19, 2024
@ChrisHegarty
Copy link
Contributor

@elasticmachine update branch

@elastic elastic deleted a comment from elasticmachine Oct 20, 2024
@ChrisHegarty
Copy link
Contributor

@elasticmachine update branch

@ioanatia ioanatia removed the test-release Trigger CI checks against release build label Oct 21, 2024
@ioanatia
Copy link
Contributor Author

@elasticmachine test this please

@ioanatia
Copy link
Contributor Author

@elasticmachine update branch

@ioanatia
Copy link
Contributor Author

Thank you folks, especially @fang-xing-esql and @ChrisHegarty for your help on this PR!
All the comment have been addressed, CI is green, the PR has a final approval from Chris after we addressed the comments, so it looks good to go 🚀

@ioanatia ioanatia added auto-backport Automatically create backport pull requests when merged v8.17.0 labels Oct 21, 2024
@ioanatia ioanatia merged commit fd43adc into elastic:main Oct 21, 2024
16 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

The backport operation could not be completed due to the following error:

An unexpected error occurred when attempting to backport this PR.

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 113920

@ioanatia ioanatia deleted the semantic_text_support branch October 21, 2024 13:44
ioanatia added a commit to ioanatia/elasticsearch that referenced this pull request Oct 21, 2024
* Add initial support for semantic_text field type

* Update docs/changelog/113920.yaml

* More tests and fixes

* Use mock inference service

* Fix tests

* Spotless

* Fix mixed-cluster and multi-clusters tests

* sort

* Attempt another fix for bwc tests

* Spotless

* Fix merge

* Attempt another fix

* Don't load the inference-service-test plugin for mixed versions/clusters

* Add more tests, address review comments

* trivial

* revert

* post-merge fix block loader

* post-merge fix compile

* add mixed version testing

* whitespace

* fix MultiClusterSpecIT

* add more fields to mapping

* Revert  mixed version testing

* whitespace

---------

Co-authored-by: ChrisHegarty <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
ioanatia added a commit that referenced this pull request Oct 22, 2024
…5256)

* Add initial support for semantic_text field type

* Update docs/changelog/113920.yaml

* More tests and fixes

* Use mock inference service

* Fix tests

* Spotless

* Fix mixed-cluster and multi-clusters tests

* sort

* Attempt another fix for bwc tests

* Spotless

* Fix merge

* Attempt another fix

* Don't load the inference-service-test plugin for mixed versions/clusters

* Add more tests, address review comments

* trivial

* revert

* post-merge fix block loader

* post-merge fix compile

* add mixed version testing

* whitespace

* fix MultiClusterSpecIT

* add more fields to mapping

* Revert  mixed version testing

* whitespace

---------

Co-authored-by: ChrisHegarty <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
* Add initial support for semantic_text field type

* Update docs/changelog/113920.yaml

* More tests and fixes

* Use mock inference service

* Fix tests

* Spotless

* Fix mixed-cluster and multi-clusters tests

* sort

* Attempt another fix for bwc tests

* Spotless

* Fix merge

* Attempt another fix

* Don't load the inference-service-test plugin for mixed versions/clusters

* Add more tests, address review comments

* trivial

* revert

* post-merge fix block loader

* post-merge fix compile

* add mixed version testing

* whitespace

* fix MultiClusterSpecIT

* add more fields to mapping

* Revert  mixed version testing

* whitespace

---------

Co-authored-by: ChrisHegarty <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
* Add initial support for semantic_text field type

* Update docs/changelog/113920.yaml

* More tests and fixes

* Use mock inference service

* Fix tests

* Spotless

* Fix mixed-cluster and multi-clusters tests

* sort

* Attempt another fix for bwc tests

* Spotless

* Fix merge

* Attempt another fix

* Don't load the inference-service-test plugin for mixed versions/clusters

* Add more tests, address review comments

* trivial

* revert

* post-merge fix block loader

* post-merge fix compile

* add mixed version testing

* whitespace

* fix MultiClusterSpecIT

* add more fields to mapping

* Revert  mixed version testing

* whitespace

---------

Co-authored-by: ChrisHegarty <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged backport pending >enhancement :Search Relevance/Search Catch all for Search Relevance Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v8.17.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants