Rebase main code to feature branch (opensearch-project#1386)

* Add Auto Release Workflow (opensearch-project#1306)
  * Add Auto Release Workflow
    Signed-off-by: Sicheng Song <[email protected]>
  * Fix release note address
    Signed-off-by: Sicheng Song <[email protected]>
  ---------
  Signed-off-by: Sicheng Song <[email protected]>
* Bump aws-encryption-sdk-java to fix CVE-2023-33201 (opensearch-project#1309)
  Signed-off-by: Sicheng Song <[email protected]>
* Add release note for 2.10.0 release (opensearch-project#1312)
  * Add release note for 2.10.0
    Signed-off-by: Sicheng Song <[email protected]>
  * Add CVE fix
    Signed-off-by: Sicheng Song <[email protected]>
  ---------
  Signed-off-by: Sicheng Song <[email protected]>
* fixing doc link (opensearch-project#1318)
  * fixing doc link
    Signed-off-by: Dhrubo Saha <[email protected]>
  * fixing indentation
    Signed-off-by: Dhrubo Saha <[email protected]>
  ---------
  Signed-off-by: Dhrubo Saha <[email protected]>
* Fix unassigned ml system shard replicas (opensearch-project#1315) (opensearch-project#1324)
  * Fix unassigned ml system shard replicas
  * Adjust auto replica settings to keep it consistent with AOS default setting
  * Update plugin/src/main/java/org/opensearch/ml/indices/MLIndicesHandler.java
  * Modify exception handling
  * Modify exception messages
  * Add response check
  * Add response check and exception handling
  * Keep error message consistent
  * Keep error message consistent
  * Keep error message consistent
  ---------
  Signed-off-by: Sicheng Song <[email protected]>
  Co-authored-by: Yaliang Wu <[email protected]>
* Adjust index replicas settings to keep consistent with AOS 2.9 (opensearch-project#1325)
  Signed-off-by: Sicheng Song <[email protected]>
* Make 2.10 release notes up to date (opensearch-project#1345)
  Signed-off-by: Sicheng Song <[email protected]>
* fix spelling (opensearch-project#1363)
  Signed-off-by: Kalyan <[email protected]>
* Add neural search default processor for non OpenAI/Cohere scenario (opensearch-project#1274)
  * Add neural search default pre/post process function support
    Signed-off-by: zane-neo <[email protected]>
  * Fix UT failures
    Signed-off-by: zane-neo <[email protected]>
  * Address PR comment to remove nonJson response case
    Signed-off-by: zane-neo <[email protected]>
  * Fix low code coverage issue
    Signed-off-by: zane-neo <[email protected]>
  * fix format issue
    Signed-off-by: zane-neo <[email protected]>
  * Try to fix classNotFound issue in IT
    Signed-off-by: zane-neo <[email protected]>
  * revert Try to fix classNotFound issue in IT
    Signed-off-by: zane-neo <[email protected]>
  * Change gson dependency to compileOnly
    Signed-off-by: zane-neo <[email protected]>
  * Change default pre/post process function name
    Signed-off-by: zane-neo <[email protected]>
  * Address code review comments
    Signed-off-by: zane-neo <[email protected]>
  * Make preprocess function to default
    Signed-off-by: zane-neo <[email protected]>
  * Remove GsonUtil since there already a single instance in StringUtils
    Signed-off-by: zane-neo <[email protected]>
  * Fix UT failures
    Signed-off-by: zane-neo <[email protected]>
  * Address comments
    Signed-off-by: zane-neo <[email protected]>
  * use import instead of fully qualified name
    Signed-off-by: zane-neo <[email protected]>
  ---------
  Signed-off-by: zane-neo <[email protected]>

---------
Signed-off-by: Sicheng Song <[email protected]>
Signed-off-by: Dhrubo Saha <[email protected]>
Signed-off-by: Kalyan <[email protected]>
Signed-off-by: zane-neo <[email protected]>
Co-authored-by: Sicheng Song <[email protected]>
Co-authored-by: Dhrubo Saha <[email protected]>
Co-authored-by: Yaliang Wu <[email protected]>
Co-authored-by: Kalyan <[email protected]>
zane-neo · Sep 26, 2023 · 8780d1c
1 parent a9687fc
commit 8780d1c
Showing 22 changed files with 473 additions and 252 deletions.
diff --git a/.github/workflows/auto-release.yml b/.github/workflows/auto-release.yml
@@ -0,0 +1,28 @@
+name: Releases
+
+on:
+  push:
+    tags:
+      - '*'
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    permissions:
+      contents: write
+    steps:
+      - name: GitHub App token
+        id: github_app_token
+        uses: tibdex/[email protected]
+        with:
+          app_id: ${{ secrets.APP_ID }}
+          private_key: ${{ secrets.APP_PRIVATE_KEY }}
+          installation_id: 22958780
+      - name: Get tag
+        id: tag
+        uses: dawidd6/action-get-tag@v1
+      - uses: actions/checkout@v2
+      - uses: ncipollo/release-action@v1
+        with:
+          github_token: ${{ steps.github_app_token.outputs.token }}
+          bodyFile: release-notes/opensearch-ml-common.release-notes-${{steps.tag.outputs.tag}}.md
diff --git a/README.md b/README.md
@@ -26,7 +26,7 @@ Machine Learning Commons for OpenSearch is a new solution that make it easy to d
 Until today, the challenge is significant to build a new machine learning feature inside OpenSearch. The reasons include:
 
 * **Disruption to OpenSearch Core features**. Machine learning is very computationally intensive. But currently there is no way to add dedicated computation resources in OpenSearch for machine learning jobs, hence these jobs have to share same resources with Core features, such as: indexing and searching. That might cause the latency increasing on search request, and cause circuit breaker exception on memory usage. To address this, we have to carefully distribute models and limit the data size to run the AD job. When more and more ML features are added into OpenSearch, it will become much harder to manage.
-* **Lack of support for machine learning algorithms.** Customers need more algorighms within Opensearch, otherwise the data need be exported to outside of elasticsearch, such as s3 first to do the job, which will bring extra cost and latency.
+* **Lack of support for machine learning algorithms.** Customers need more algorithms within Opensearch, otherwise the data need be exported to outside of elasticsearch, such as s3 first to do the job, which will bring extra cost and latency.
 * **Lack of resource management mechanism between multiple machine learning jobs.** It's hard to coordinate the resources between multi features.
 
 

diff --git a/common/src/main/java/org/opensearch/ml/common/CommonValue.java b/common/src/main/java/org/opensearch/ml/common/CommonValue.java
@@ -35,13 +35,13 @@ public class CommonValue {
     public static final String ML_MODEL_GROUP_INDEX = ".plugins-ml-model-group";
     public static final String ML_MODEL_INDEX = ".plugins-ml-model";
     public static final String ML_TASK_INDEX = ".plugins-ml-task";
-    public static final Integer ML_MODEL_GROUP_INDEX_SCHEMA_VERSION = 1;
-    public static final Integer ML_MODEL_INDEX_SCHEMA_VERSION = 6;
+    public static final Integer ML_MODEL_GROUP_INDEX_SCHEMA_VERSION = 2;
+    public static final Integer ML_MODEL_INDEX_SCHEMA_VERSION = 7;
     public static final String ML_CONNECTOR_INDEX = ".plugins-ml-connector";
-    public static final Integer ML_TASK_INDEX_SCHEMA_VERSION = 1;
-    public static final Integer ML_CONNECTOR_SCHEMA_VERSION = 1;
+    public static final Integer ML_TASK_INDEX_SCHEMA_VERSION = 2;
+    public static final Integer ML_CONNECTOR_SCHEMA_VERSION = 2;
     public static final String ML_CONFIG_INDEX = ".plugins-ml-config";
-    public static final Integer ML_CONFIG_INDEX_SCHEMA_VERSION = 1;
+    public static final Integer ML_CONFIG_INDEX_SCHEMA_VERSION = 2;
     public static final String USER_FIELD_MAPPING = "      \""
             + CommonValue.USER
             + "\": {\n"

diff --git a/common/src/main/java/org/opensearch/ml/common/connector/Connector.java b/common/src/main/java/org/opensearch/ml/common/connector/Connector.java
@@ -5,6 +5,17 @@
 
 package org.opensearch.ml.common.connector;
 
+
+import java.io.IOException;
+import java.security.AccessController;
+import java.security.PrivilegedActionException;
+import java.security.PrivilegedExceptionAction;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.function.Function;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
 import org.apache.commons.text.StringSubstitutor;
 import org.opensearch.core.common.io.stream.StreamInput;
 import org.opensearch.core.common.io.stream.StreamOutput;
@@ -20,17 +31,6 @@
 import org.opensearch.ml.common.MLCommonsClassLoader;
 import org.opensearch.ml.common.output.model.ModelTensor;
 
-import java.io.IOException;
-import java.security.AccessController;
-import java.security.PrivilegedActionException;
-import java.security.PrivilegedExceptionAction;
-import java.util.List;
-import java.util.Map;
-import java.util.Optional;
-import java.util.function.Function;
-import java.util.regex.Matcher;
-import java.util.regex.Pattern;
-
 import static org.opensearch.core.xcontent.XContentParserUtils.ensureExpectedToken;
 import static org.opensearch.ml.common.utils.StringUtils.gson;
 

diff --git a/common/src/main/java/org/opensearch/ml/common/connector/MLPostProcessFunction.java b/common/src/main/java/org/opensearch/ml/common/connector/MLPostProcessFunction.java
@@ -5,61 +5,64 @@
 
 package org.opensearch.ml.common.connector;
 
+import org.opensearch.ml.common.output.model.MLResultDataType;
+import org.opensearch.ml.common.output.model.ModelTensor;
+
+import java.util.ArrayList;
 import java.util.HashMap;
+import java.util.List;
 import java.util.Map;
+import java.util.function.Function;
 
 public class MLPostProcessFunction {
 
-    private static Map<String, String> POST_PROCESS_FUNCTIONS;
     public static final String COHERE_EMBEDDING = "connector.post_process.cohere.embedding";
     public static final String OPENAI_EMBEDDING = "connector.post_process.openai.embedding";
 
+    public static final String DEFAULT_EMBEDDING = "connector.post_process.default.embedding";
+
+    private static final Map<String, String> JSON_PATH_EXPRESSION = new HashMap<>();
+
+    private static final Map<String, Function<List<List<Float>>, List<ModelTensor>>> POST_PROCESS_FUNCTIONS = new HashMap<>();
+
+
     static {
-        POST_PROCESS_FUNCTIONS = new HashMap<>();
-        POST_PROCESS_FUNCTIONS.put(COHERE_EMBEDDING, "\n      def name = \"sentence_embedding\";\n" +
-                "      def dataType = \"FLOAT32\";\n" +
-                "      if (params.embeddings == null || params.embeddings.length == 0) {\n" +
-                "          return null;\n" +
-                "      }\n" +
-                "      def embeddings = params.embeddings;\n" +
-                "      StringBuilder builder = new StringBuilder(\"[\");\n" +
-                "      for (int i=0; i<embeddings.length; i++) {\n" +
-                "        def shape = [embeddings[i].length];\n" +
-                "        def json = \"{\" +\n" +
-                "                 \"\\\"name\\\":\\\"\" + name + \"\\\",\" +\n" +
-                "                 \"\\\"data_type\\\":\\\"\" + dataType + \"\\\",\" +\n" +
-                "                 \"\\\"shape\\\":\" + shape + \",\" +\n" +
-                "                 \"\\\"data\\\":\" + embeddings[i] +\n" +
-                "                 \"}\";\n" +
-                "        builder.append(json);\n" +
-                "        if (i < embeddings.length - 1) {\n" +
-                "          builder.append(\",\");\n" +
-                "        }\n" +
-                "      }\n" +
-                "      builder.append(\"]\");\n" +
-                "      \n" +
-                "      return builder.toString();\n    ");
+        JSON_PATH_EXPRESSION.put(OPENAI_EMBEDDING, "$.data[*].embedding");
+        JSON_PATH_EXPRESSION.put(COHERE_EMBEDDING, "$.embeddings");
+        JSON_PATH_EXPRESSION.put(DEFAULT_EMBEDDING, "$[*]");
+        POST_PROCESS_FUNCTIONS.put(OPENAI_EMBEDDING, buildModelTensorList());
+        POST_PROCESS_FUNCTIONS.put(COHERE_EMBEDDING, buildModelTensorList());
+        POST_PROCESS_FUNCTIONS.put(DEFAULT_EMBEDDING, buildModelTensorList());
+    }
 
-        POST_PROCESS_FUNCTIONS.put(OPENAI_EMBEDDING, "\n      def name = \"sentence_embedding\";\n" +
-                "      def dataType = \"FLOAT32\";\n" +
-                "      if (params.data == null || params.data.length == 0) {\n" +
-                "          return null;\n" +
-                "      }\n" +
-                "      def shape = [params.data[0].embedding.length];\n" +
-                "      def json = \"{\" +\n" +
-                "                 \"\\\"name\\\":\\\"\" + name + \"\\\",\" +\n" +
-                "                 \"\\\"data_type\\\":\\\"\" + dataType + \"\\\",\" +\n" +
-                "                 \"\\\"shape\\\":\" + shape + \",\" +\n" +
-                "                 \"\\\"data\\\":\" + params.data[0].embedding +\n" +
-                "                 \"}\";\n" +
-                "      return json;\n    ");
+    public static Function<List<List<Float>>, List<ModelTensor>> buildModelTensorList() {
+        return embeddings -> {
+            List<ModelTensor> modelTensors = new ArrayList<>();
+            if (embeddings == null) {
+                throw new IllegalArgumentException("The list of embeddings is null when using the built-in post-processing function.");
+            }
+            embeddings.forEach(embedding -> modelTensors.add(
+                ModelTensor
+                    .builder()
+                    .name("sentence_embedding")
+                    .dataType(MLResultDataType.FLOAT32)
+                    .shape(new long[]{embedding.size()})
+                    .data(embedding.toArray(new Number[0]))
+                    .build()
+            ));
+            return modelTensors;
+        };
     }
 
-    public static boolean contains(String functionName) {
-        return POST_PROCESS_FUNCTIONS.containsKey(functionName);
+    public static String getResponseFilter(String postProcessFunction) {
+        return JSON_PATH_EXPRESSION.get(postProcessFunction);
     }
 
-    public static String get(String postProcessFunction) {
+    public static Function<List<List<Float>>, List<ModelTensor>> get(String postProcessFunction) {
         return POST_PROCESS_FUNCTIONS.get(postProcessFunction);
     }
+
+    public static boolean contains(String postProcessFunction) {
+        return POST_PROCESS_FUNCTIONS.containsKey(postProcessFunction);
+    }
 }
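
Note on the hunk above: the Painless post-processing scripts are replaced by a JSONPath filter per function name plus one shared Java function that wraps each embedding in a `ModelTensor`. The following is a rough standalone sketch of that wrapping step only. `PostProcessSketch` and its nested `Tensor` class are hypothetical stand-ins (the commit itself uses `ModelTensor.builder()` from ml-commons), so treat it as an illustration of the shape of the conversion, not as the plugin's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class PostProcessSketch {

    // Hypothetical stand-in for org.opensearch.ml.common.output.model.ModelTensor.
    public static final class Tensor {
        public final String name;
        public final long[] shape;
        public final Number[] data;

        public Tensor(String name, long[] shape, Number[] data) {
            this.name = name;
            this.shape = shape;
            this.data = data;
        }
    }

    // Mirrors the structure of buildModelTensorList() in the diff:
    // one tensor per embedding, with a one-dimensional shape equal to its length.
    public static Function<List<List<Float>>, List<Tensor>> buildTensorList() {
        return embeddings -> {
            if (embeddings == null) {
                throw new IllegalArgumentException("The list of embeddings is null.");
            }
            List<Tensor> tensors = new ArrayList<>();
            for (List<Float> embedding : embeddings) {
                tensors.add(new Tensor(
                    "sentence_embedding",
                    new long[] { embedding.size() },
                    embedding.toArray(new Number[0])));
            }
            return tensors;
        };
    }

    public static void main(String[] args) {
        // Assumed input: what a filter such as "$.embeddings" might yield for two documents.
        List<List<Float>> embeddings = List.of(
            List.of(0.1f, 0.2f, 0.3f),
            List.of(0.4f, 0.5f, 0.6f));
        List<Tensor> tensors = buildTensorList().apply(embeddings);
        System.out.println(tensors.size());          // prints 2
        System.out.println(tensors.get(0).shape[0]); // prints 3
    }
}
```

Because all three registered names (`OPENAI_EMBEDDING`, `COHERE_EMBEDDING`, `DEFAULT_EMBEDDING`) map to the same function in the diff, the per-provider difference appears to live entirely in the JSONPath expression.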
diff --git a/common/src/main/java/org/opensearch/ml/common/connector/MLPreProcessFunction.java b/common/src/main/java/org/opensearch/ml/common/connector/MLPreProcessFunction.java
@@ -6,44 +6,37 @@
 package org.opensearch.ml.common.connector;
 
 import java.util.HashMap;
+import java.util.List;
 import java.util.Map;
+import java.util.function.Function;
 
 public class MLPreProcessFunction {
 
-    private static Map<String, String> PRE_PROCESS_FUNCTIONS;
+    private static final Map<String, Function<List<String>, Map<String, Object>>> PRE_PROCESS_FUNCTIONS = new HashMap<>();
     public static final String TEXT_DOCS_TO_COHERE_EMBEDDING_INPUT = "connector.pre_process.cohere.embedding";
     public static final String TEXT_DOCS_TO_OPENAI_EMBEDDING_INPUT = "connector.pre_process.openai.embedding";
 
+    public static final String TEXT_DOCS_TO_DEFAULT_EMBEDDING_INPUT = "connector.pre_process.default.embedding";
+
+    private static Function<List<String>, Map<String, Object>> cohereTextEmbeddingPreProcess() {
+        return inputs -> Map.of("parameters", Map.of("texts", inputs));
+    }
+
+    private static Function<List<String>, Map<String, Object>> openAiTextEmbeddingPreProcess() {
+        return inputs -> Map.of("parameters", Map.of("input", inputs));
+    }
+
     static {
-        PRE_PROCESS_FUNCTIONS = new HashMap<>();
-        //TODO: change to java for openAI, embedding and Titan
-        PRE_PROCESS_FUNCTIONS.put(TEXT_DOCS_TO_COHERE_EMBEDDING_INPUT, "\n    StringBuilder builder = new StringBuilder();\n" +
-                "    builder.append(\"[\");\n" +
-                "    for (int i=0; i< params.text_docs.length; i++) {\n" +
-                "        builder.append(\"\\\"\");\n" +
-                "        builder.append(params.text_docs[i]);\n" +
-                "        builder.append(\"\\\"\");\n" +
-                "        if (i < params.text_docs.length - 1) {\n" +
-                "          builder.append(\",\")\n" +
-                "        }\n" +
-                "    }\n" +
-                "    builder.append(\"]\");\n" +
-                "    def parameters = \"{\" +\"\\\"prompt\\\":\" + builder + \"}\";\n" +
-                "    return  \"{\" +\"\\\"parameters\\\":\" + parameters + \"}\";");
-
-        PRE_PROCESS_FUNCTIONS.put(TEXT_DOCS_TO_OPENAI_EMBEDDING_INPUT, "\n    StringBuilder builder = new StringBuilder();\n" +
-                        "    builder.append(\"\\\"\");\n" +
-                        "    builder.append(params.text_docs[0]);\n" +
-                        "    builder.append(\"\\\"\");\n" +
-                        "    def parameters = \"{\" +\"\\\"input\\\":\" + builder + \"}\";\n" +
-                        "    return  \"{\" +\"\\\"parameters\\\":\" + parameters + \"}\";");
+        PRE_PROCESS_FUNCTIONS.put(TEXT_DOCS_TO_COHERE_EMBEDDING_INPUT, cohereTextEmbeddingPreProcess());
+        PRE_PROCESS_FUNCTIONS.put(TEXT_DOCS_TO_OPENAI_EMBEDDING_INPUT, openAiTextEmbeddingPreProcess());
+        PRE_PROCESS_FUNCTIONS.put(TEXT_DOCS_TO_DEFAULT_EMBEDDING_INPUT, openAiTextEmbeddingPreProcess());
     }
 
     public static boolean contains(String functionName) {
         return PRE_PROCESS_FUNCTIONS.containsKey(functionName);
     }
 
-    public static String get(String postProcessFunction) {
+    public static Function<List<String>, Map<String, Object>> get(String postProcessFunction) {
         return PRE_PROCESS_FUNCTIONS.get(postProcessFunction);
     }
 }
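
Note on the hunk above: the pre-process registry now maps each function name to a Java lambda instead of a Painless script string. A minimal JDK-only sketch of that lookup pattern is shown here, assuming the same keys and request shapes as the diff. The class name `PreProcessSketch`, the shortened constant names, and the `main` method are illustrative and not part of the commit.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class PreProcessSketch {

    // Keys copied from the diff; the constant names are shortened for the sketch.
    public static final String COHERE = "connector.pre_process.cohere.embedding";
    public static final String OPENAI = "connector.pre_process.openai.embedding";
    public static final String DEFAULT = "connector.pre_process.default.embedding";

    private static final Map<String, Function<List<String>, Map<String, Object>>> FUNCTIONS = new HashMap<>();

    static {
        // Cohere-style requests carry the documents under "texts", OpenAI-style under "input".
        FUNCTIONS.put(COHERE, inputs -> Map.of("parameters", Map.of("texts", inputs)));
        FUNCTIONS.put(OPENAI, inputs -> Map.of("parameters", Map.of("input", inputs)));
        // As in the diff, the default name reuses the OpenAI-style function.
        FUNCTIONS.put(DEFAULT, FUNCTIONS.get(OPENAI));
    }

    public static boolean contains(String name) {
        return FUNCTIONS.containsKey(name);
    }

    // Returns null for an unknown name, matching Map.get semantics.
    public static Function<List<String>, Map<String, Object>> get(String name) {
        return FUNCTIONS.get(name);
    }

    public static void main(String[] args) {
        Map<String, Object> body = get(DEFAULT).apply(List.of("hello", "world"));
        System.out.println(body); // prints {parameters={input=[hello, world]}}
    }
}
```

Unlike the removed OpenAI script, which read only `params.text_docs[0]`, the lambda passes the whole input list through, so callers no longer lose documents after the first.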
diff --git a/common/src/main/java/org/opensearch/ml/common/input/remote/RemoteInferenceMLInput.java b/common/src/main/java/org/opensearch/ml/common/input/remote/RemoteInferenceMLInput.java
@@ -14,14 +14,9 @@
 import org.opensearch.ml.common.utils.StringUtils;
 
 import java.io.IOException;
-import java.security.AccessController;
-import java.security.PrivilegedActionException;
-import java.security.PrivilegedExceptionAction;
-import java.util.HashMap;
 import java.util.Map;
 
 import static org.opensearch.core.xcontent.XContentParserUtils.ensureExpectedToken;
-import static org.opensearch.ml.common.utils.StringUtils.gson;
 
 @org.opensearch.ml.common.annotation.MLInput(functionNames = {FunctionName.REMOTE})
 public class RemoteInferenceMLInput extends MLInput {

diff --git a/common/src/main/java/org/opensearch/ml/common/utils/StringUtils.java b/common/src/main/java/org/opensearch/ml/common/utils/StringUtils.java
@@ -24,6 +24,7 @@
 public class StringUtils {
 
     public static final Gson gson;
+
     static {
         gson = new Gson();
     }

diff --git a/common/src/test/java/org/opensearch/ml/common/connector/MLPostProcessFunctionTest.java b/common/src/test/java/org/opensearch/ml/common/connector/MLPostProcessFunctionTest.java
@@ -6,12 +6,21 @@
 package org.opensearch.ml.common.connector;
 
 import org.junit.Assert;
+import org.junit.Rule;
 import org.junit.Test;
+import org.junit.rules.ExpectedException;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
 
 import static org.opensearch.ml.common.connector.MLPostProcessFunction.OPENAI_EMBEDDING;
 
 public class MLPostProcessFunctionTest {
 
+    @Rule
+    public ExpectedException exceptionRule = ExpectedException.none();
+
     @Test
     public void contains() {
         Assert.assertTrue(MLPostProcessFunction.contains(OPENAI_EMBEDDING));
@@ -23,4 +32,24 @@ public void get() {
         Assert.assertNotNull(MLPostProcessFunction.get(OPENAI_EMBEDDING));
         Assert.assertNull(MLPostProcessFunction.get("wrong value"));
     }
+
+    @Test
+    public void test_getResponseFilter() {
+        assert null != MLPostProcessFunction.getResponseFilter(OPENAI_EMBEDDING);
+        assert null == MLPostProcessFunction.getResponseFilter("wrong value");
+    }
+
+    @Test
+    public void test_buildModelTensorList() {
+        Assert.assertNotNull(MLPostProcessFunction.buildModelTensorList());
+        List<List<Float>> numbersList = new ArrayList<>();
+        numbersList.add(Collections.singletonList(1.0f));
+        Assert.assertNotNull(MLPostProcessFunction.buildModelTensorList().apply(numbersList));
+    }
+
+    @Test
+    public void test_buildModelTensorList_exception() {
+        exceptionRule.expect(IllegalArgumentException.class);
+        MLPostProcessFunction.buildModelTensorList().apply(null);
+    }
 }