
Add support for Amazon OpenSearch Serverless #269

Open
sameercaresu opened this issue Jun 1, 2023 · 11 comments
Labels
enhancement New feature or request

Comments

@sameercaresu

Is your feature request related to a problem?

I am trying to connect to an OpenSearch Serverless collection from Databricks. I can connect to an OpenSearch managed cluster using this connector. However, when connecting to a serverless collection, I keep getting this error:

```
OpenSearchHadoopIllegalArgumentException: Cannot detect OpenSearch version - typically this happens if the network/OpenSearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'opensearch.nodes.wan.only'
Caused by: OpenSearchHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[https://xxx.aoss.amazonaws.com:9200
```

I have tried the following configuration:

```scala
"pushdown" -> "true",
"opensearch.nodes" -> "https://xxx.aoss.amazonaws.com",
"opensearch.nodes.wan.only" -> "true",
"opensearch.aws.sigv4.region" -> "us-east-1",
"opensearch.aws.sigv4.service.name" -> "aoss",
"opensearch.aws.sigv4.enabled" -> "true"
```

What solution would you like?

Is it already possible to connect to OpenSearch Serverless? If yes, could you please point me to the correct set of configuration options? If not, I would like to request this feature.

What alternatives have you considered?

I used elasticsearch-hadoop, but that doesn't work with OpenSearch Serverless either.

Do you have any additional context?

No.

@sameercaresu sameercaresu added enhancement New feature or request untriaged labels Jun 1, 2023
@harshavamsi
Collaborator

Hi @sameercaresu, thanks for bringing this up. This is a known issue with OpenSearch Serverless. The Hadoop client makes a GET / root call to the cluster to fetch cluster info such as the UUID and version, but since serverless does not expose those attributes, the client errors out. I am working on a fix as we speak.
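To illustrate the failure mode, here is a minimal, hypothetical sketch of the kind of version probe involved; the class name and regex are illustrative and are not the actual opensearch-hadoop internals. A managed cluster's GET / response carries a `version.number`, while a serverless collection returns no such info document, so the probe comes back empty and the "Cannot detect OpenSearch version" error is raised:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionProbe {
    // Naive extraction of version.number from a GET / response body.
    private static final Pattern VERSION =
            Pattern.compile("\"number\"\\s*:\\s*\"([^\"]+)\"");

    // Returns the cluster version, or null when the info document
    // lacks one (as is the case for a serverless collection).
    static String detectVersion(String rootResponseBody) {
        Matcher m = VERSION.matcher(rootResponseBody);
        return m.find() ? m.group(1) : null;
    }
}
```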

@ktech-rob

Hi,
Just ran into the same issue as @sameercaresu. @harshavamsi, I was wondering if there has been any update on this?

@wbeckler

I haven't heard of any changes to Serverless to address this API gap.

@eswar7216

We have a use case that requires connecting to OpenSearch Serverless from Apache Spark, and I am running into a similar issue. Is there a workaround to connect to OpenSearch Serverless from Apache Spark?

@wbeckler

There is still no known workaround. If you do figure out a way, please share it here or propose a PR so we can patch the client.

@eswar7216

Not sure if everyone is doing the same thing I was, but this works for me.
I have data in a database that I am loading into OpenSearch using Apache Spark on EMR:
Database -> EMR (Spark) -> OpenSearch Serverless.

I am using opensearch-hadoop (Java) to connect to OpenSearch Serverless (deployed in a VPC) through a VPC endpoint, with something like the configuration below, and it works for me:

```java
Map<String, String> map = new HashMap<>();
map.put("opensearch.nodes", "vpc domain url");
map.put("opensearch.port", "443");
map.put("opensearch.resource", "index_name");
map.put("opensearch.nodes.wan.only", "true");

// "data" here is the Dataset<Row> read from the database
JavaOpenSearchSparkSQL.saveToOpenSearch(data, map);
```

@Xtansia
Collaborator

Xtansia commented Nov 2, 2023

I've done some quick investigation into this and it's more extensive than just the / info request.
The first few missing APIs I hit while trying to do a simple write from spark can be worked around, though not ideally:

  • GET / - can be hardcoded to return a dummy info if targeting serverless.
  • GET /_cluster/health/{index} - can be hardcoded to GREEN if serverless
  • POST /{index}/_refresh - can be NOOP if serverless

The bigger issue I then hit trying to do a read:

  • GET /{index}/_search_shards which is used to determine partitions for reading and serverless doesn't support shard information
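The three write-path workarounds above could be sketched as a shim layer that short-circuits the unsupported endpoints with canned responses when a serverless flag is set. The class and method names here are illustrative, not the actual opensearch-hadoop internals:

```java
public class ServerlessShims {
    static String rootInfo(boolean serverless) {
        if (serverless) {
            // GET / : return a dummy info document with a plausible version
            return "{\"version\":{\"number\":\"2.11.0\",\"distribution\":\"opensearch\"}}";
        }
        throw new UnsupportedOperationException("issue the real GET / call");
    }

    static String clusterHealth(boolean serverless, String index) {
        if (serverless) {
            return "GREEN"; // GET /_cluster/health/{index} : hardcode GREEN
        }
        throw new UnsupportedOperationException("issue the real health call");
    }

    static void refresh(boolean serverless, String index) {
        if (serverless) {
            return; // POST /{index}/_refresh : NOOP, serverless manages refresh itself
        }
        throw new UnsupportedOperationException("issue the real refresh call");
    }
}
```

The read-path gap (_search_shards) is harder, because its response feeds the partition planner rather than a simple status check.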

@wbeckler

@Xtansia It looks like _search_shards is getting called even when the setting os.nodes.client.only is set to TRUE. In that scenario the _search_shards call is useless and shouldn't execute, since no shards will map to non-data nodes. That means this should be a no-op:

```java
protected Map<ShardInfo, NodeInfo> doGetWriteTargetPrimaryShards(boolean clientNodesOnly) {
```

Thoughts?

@Xtansia
Collaborator

Xtansia commented May 5, 2024

> @Xtansia It looks like _search_shards is getting called even when the setting os.nodes.client.only is set to TRUE. In that scenario the _search_shards call is useless and shouldn't execute, since no shards will map to non-data nodes. That means `doGetWriteTargetPrimaryShards(boolean clientNodesOnly)` should be a no-op. Thoughts?

It's not quite as simple as just not calling it, as the connector uses the shards to determine how to partition the job within Spark for parallelisation, and serverless doesn't expose any shard information. It may be possible to work around this by hardcoding 1, or a configurable number, of partitions for serverless, but I haven't dug in far enough to know whether that's feasible if other parts of the code expect an actual shard ID.
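The "configurable number of partitions" idea could be sketched roughly as below; the class and setting are hypothetical, not connector internals, and this sidesteps rather than answers the open question of code paths that expect a real shard ID:

```java
import java.util.ArrayList;
import java.util.List;

public class ServerlessPartitioner {
    // Without _search_shards, plan a fixed number of synthetic read slices
    // instead of one partition per shard.
    static List<Integer> planPartitions(int configuredCount) {
        // Fall back to a single partition when nothing is configured,
        // mirroring the "hardcode 1 or a configurable number" idea above.
        int n = configuredCount > 0 ? configuredCount : 1;
        List<Integer> slices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            slices.add(i); // each entry could map to a _search "slice" id
        }
        return slices;
    }
}
```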

@itiyama

itiyama commented May 6, 2024

@Xtansia How does the Hadoop client use the following APIs? I am exploring a solution that returns a dummy/empty response for these APIs in serverless to support backward compatibility, but without understanding how the client uses them, returning a dummy response would be of no use.

  1. GET / - what does the client do with the response? Say the serverless deployment returns an empty response for all fields, or some dummy value - would that work?
  2. GET /_cluster/health/{index} - we can hardcode this to GREEN; are there any other response params that the client relies on?
  3. POST /{index}/_refresh - yes, this can be a NOOP.

@dblock dblock pinned this issue May 23, 2024
@dblock dblock changed the title Is it possible to connect to Opensearch serverless? Add support for Amazon OpenSearch Serverless May 23, 2024
@Leo-Rola

@dblock I have seen you pinned this issue about using Hadoop with OpenSearch Serverless. Can I ask what has been solved? If, for example, I want to use Glue to transfer documents from one OpenSearch Serverless collection to another, could I do that now? Thanks in advance.


8 participants