
Add support for Amazon OpenSearch Serverless #269

Open
sameercaresu opened this issue Jun 1, 2023 · 11 comments
Labels
enhancement New feature or request

Comments

@sameercaresu

Is your feature request related to a problem?

I am trying to connect to an OpenSearch Serverless collection from Databricks. I can connect to an OpenSearch managed cluster using this connector. However, when connecting to a serverless collection, I keep getting this error:

```
OpenSearchHadoopIllegalArgumentException: Cannot detect OpenSearch version - typically this happens if the network/OpenSearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'opensearch.nodes.wan.only'
Caused by: OpenSearchHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[https://xxx.aoss.amazonaws.com:9200
```

I have tried the following configuration:

```scala
"pushdown" -> "true",
"opensearch.nodes" -> "https://xxx.aoss.amazonaws.com",
"opensearch.nodes.wan.only" -> "true",
"opensearch.aws.sigv4.region" -> "us-east-1",
"opensearch.aws.sigv4.service.name" -> "aoss",
"opensearch.aws.sigv4.enabled" -> "true"
```

What solution would you like?

Is it already possible to connect to OpenSearch Serverless? If yes, could you please point me to the correct set of configuration options? If not, I would like to request this feature.

What alternatives have you considered?

I used elasticsearch-hadoop, but that doesn't work with OpenSearch Serverless either.

Do you have any additional context?

No.

@sameercaresu sameercaresu added enhancement New feature or request untriaged labels Jun 1, 2023
@harshavamsi
Collaborator

Hi @sameercaresu, thanks for bringing this up. This is a known issue with OpenSearch Serverless. The Hadoop client makes a GET / root call to the cluster to fetch cluster info such as the UUID and version, but since serverless does not expose those attributes, the client errors out. I am working on a fix as we speak.
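To illustrate the failure mode, here is a minimal, hypothetical sketch of the kind of version probe involved; the class name and regex are illustrative and are not the actual opensearch-hadoop internals. A managed cluster's GET / response carries a `version.number`, while a serverless collection returns no such info document, so the probe comes back empty and the "Cannot detect OpenSearch version" error is raised:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionProbe {
    // Naive extraction of version.number from a GET / response body.
    private static final Pattern VERSION =
            Pattern.compile("\"number\"\\s*:\\s*\"([^\"]+)\"");

    // Returns the cluster version, or null when the info document
    // lacks one (as is the case for a serverless collection).
    static String detectVersion(String rootResponseBody) {
        Matcher m = VERSION.matcher(rootResponseBody);
        return m.find() ? m.group(1) : null;
    }
}
```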

@ktech-rob

Hi,
Just ran into the same issue as @sameercaresu. @harshavamsi, I was wondering if there has been any update on this?

@wbeckler

I haven't heard of any changes to Serverless to address this API gap.

@eswar7216

We have a use case that requires connecting to OpenSearch Serverless from Apache Spark, and I am running into a similar issue. Is there a workaround to connect to OpenSearch Serverless from Apache Spark?

@wbeckler

There is still no known workaround. If you do figure out a way, please share it here or propose a PR so we can patch the client.

@eswar7216

Not sure if everyone is doing the same thing I was, but this works for me.
I have data in a database that I am loading into OpenSearch using Apache Spark on EMR:
Database -> EMR (Spark) -> OpenSearch Serverless.

I am using opensearch-hadoop (Java) to connect to OpenSearch Serverless (deployed in a VPC) through a VPC endpoint, with something like the configuration below, and it works for me:

```java
Map<String, String> map = new HashMap<>();
map.put("opensearch.nodes", "vpc domain url");
map.put("opensearch.port", "443");
map.put("opensearch.resource", "index_name");
map.put("opensearch.nodes.wan.only", "true");

// "data" here is the Dataset<Row> read from the database
JavaOpenSearchSparkSQL.saveToOpenSearch(data, map);
```

@Xtansia
Collaborator

Xtansia commented Nov 2, 2023

I've done some quick investigation into this and it's more extensive than just the / info request.
The first few missing APIs I hit while trying to do a simple write from spark can be worked around, though not ideally:

  • GET / - can be hardcoded to return a dummy info if targeting serverless.
  • GET /_cluster/health/{index} - can be hardcoded to GREEN if serverless
  • POST /{index}/_refresh - can be NOOP if serverless

The bigger issue I then hit trying to do a read:

  • GET /{index}/_search_shards which is used to determine partitions for reading and serverless doesn't support shard information
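The three write-path workarounds above could be sketched as a shim layer that short-circuits the unsupported endpoints with canned responses when a serverless flag is set. The class and method names here are illustrative, not the actual opensearch-hadoop internals:

```java
public class ServerlessShims {
    static String rootInfo(boolean serverless) {
        if (serverless) {
            // GET / : return a dummy info document with a plausible version
            return "{\"version\":{\"number\":\"2.11.0\",\"distribution\":\"opensearch\"}}";
        }
        throw new UnsupportedOperationException("issue the real GET / call");
    }

    static String clusterHealth(boolean serverless, String index) {
        if (serverless) {
            return "GREEN"; // GET /_cluster/health/{index} : hardcode GREEN
        }
        throw new UnsupportedOperationException("issue the real health call");
    }

    static void refresh(boolean serverless, String index) {
        if (serverless) {
            return; // POST /{index}/_refresh : NOOP, serverless manages refresh itself
        }
        throw new UnsupportedOperationException("issue the real refresh call");
    }
}
```

The read-path gap (_search_shards) is harder, because its response feeds the partition planner rather than a simple status check.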

@wbeckler

@Xtansia It looks like _search_shards is getting called even when the setting os.nodes.client.only is set to TRUE. In that scenario the _search_shards call is useless and shouldn't execute, since no shards will map to non-data nodes. That means this should be a no-op:

```java
protected Map<ShardInfo, NodeInfo> doGetWriteTargetPrimaryShards(boolean clientNodesOnly) {
```

Thoughts?

@Xtansia
Collaborator

Xtansia commented May 5, 2024

> @Xtansia It looks like _search_shards is getting called even when the setting os.nodes.client.only is set to TRUE. In that scenario the _search_shards call is useless and shouldn't execute, since no shards will map to non-data nodes. That means `doGetWriteTargetPrimaryShards(boolean clientNodesOnly)` should be a no-op. Thoughts?

It's not quite as simple as just not calling it, as the connector uses the shards to determine how to partition the job within Spark for parallelisation, and serverless doesn't expose any shard information. It may be possible to work around this by hardcoding 1, or a configurable number, of partitions for serverless, but I haven't dug in far enough to know whether that's feasible if other parts of the code expect an actual shard ID.
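The "configurable number of partitions" idea could be sketched roughly as below; the class and setting are hypothetical, not connector internals, and this sidesteps rather than answers the open question of code paths that expect a real shard ID:

```java
import java.util.ArrayList;
import java.util.List;

public class ServerlessPartitioner {
    // Without _search_shards, plan a fixed number of synthetic read slices
    // instead of one partition per shard.
    static List<Integer> planPartitions(int configuredCount) {
        // Fall back to a single partition when nothing is configured,
        // mirroring the "hardcode 1 or a configurable number" idea above.
        int n = configuredCount > 0 ? configuredCount : 1;
        List<Integer> slices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            slices.add(i); // each entry could map to a _search "slice" id
        }
        return slices;
    }
}
```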

@itiyama

itiyama commented May 6, 2024

@Xtansia How does the Hadoop client use the following APIs? I am exploring a solution that returns a dummy/empty response for these APIs in serverless to support backward compatibility, but without understanding how the client uses them, returning a dummy response would be of no use.

  1. GET / - what does the client do with the response? Say the serverless deployment returns an empty response for all fields, or some dummy value - would that work?
  2. GET /_cluster/health/{index} - we can hardcode this to GREEN; are there any other response params that the client relies on?
  3. POST /{index}/_refresh - yes, this can be a NOOP.

@dblock dblock pinned this issue May 23, 2024
@dblock dblock changed the title Is it possible to connect to Opensearch serverless? Add support for Amazon OpenSearch Serverless May 23, 2024
@Leo-Rola

@dblock I have seen you pinned this issue about using Hadoop with OpenSearch Serverless. Can I ask what has been solved? If, for example, I want to use Glue to transfer documents from one OpenSearch Serverless collection to another, could I do that now? Thanks in advance.


8 participants