Public Preview Refresh Add MLIndex and DataIndex examples and documentation. (#2624)

* Public Preview Refresh Add MLIndex and DataIndex examples and documentation.

* Rename chat-with-index internal code to src and apply various black formatting fixes.

* Rename pup_refresh to code_first.

* Remove artifacts produced by local examples.

* Address comments.

---------

Co-authored-by: Lucas Pickup <[email protected]>
tot0 and Lucas Pickup authored Sep 8, 2023
1 parent fbfe7fc commit 06786a0
Showing 47 changed files with 1,912 additions and 0 deletions.
87 changes: 87 additions & 0 deletions sdk/python/generative-ai/rag/code_first/README.md
@@ -0,0 +1,87 @@
# AzureML MLIndex Asset creation

MLIndex assets in AzureML represent a model used to generate embeddings from text and an index which can be searched using embedding vectors.
Read more about their structure [here](./docs/mlindex.md).
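
To make "an index which can be searched using embedding vectors" concrete, here is a minimal, self-contained sketch. It is not how MLIndex, Faiss, or Azure Cognitive Search are implemented: the three-dimensional vectors, the document names, and the `search` helper are all made up for illustration, and a real embedding model produces much longer vectors.

```python
import math

# Hypothetical toy "index": document name -> made-up 3-dimensional embedding.
TOY_INDEX = {
    "faiss.md": [0.9, 0.1, 0.0],
    "acs.md": [0.1, 0.9, 0.0],
    "promptflow.md": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, index, k=2):
    """Return the names of the k entries most similar to query_vector."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query_vector, index[name]),
        reverse=True,
    )
    return ranked[:k]


# The query vector stands in for the embedding of a user's question.
top_matches = search([0.8, 0.2, 0.1], TOY_INDEX)
print(top_matches)
```

In an actual MLIndex the query vector would come from the same embedding model that was used to embed the documents, so that query and document vectors are comparable.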

## Pre-requisites

0. Install `azure-ai-ml` and `azureml-rag`:
   - `pip install 'azure-ai-ml==1.10.0a20230825006' --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/`
   - `pip install -U 'azureml-rag[document_parsing,faiss,cognitive_search]>=0.2.0'`
1. You have unstructured data.
   - In one of [AzureML's supported data sources](https://learn.microsoft.com/azure/machine-learning/concept-data?view=azureml-api-2): Blob, ADLS Gen2, OneLake, S3, Git
   - In any of these supported file formats: md, txt, py, pdf, ppt(x), doc(x)
2. You have an embedding model. Either:
   - [Create an Azure OpenAI service + connection](https://learn.microsoft.com/azure/machine-learning/prompt-flow/concept-connections?view=azureml-api-2)
   - Use a HuggingFace `sentence-transformer` model (works without extra setup; to use the resulting MLIndex in PromptFlow, a [Custom Runtime](https://promptflow.azurewebsites.net/how-to-guides/how-to-customize-environment-runtime.html) is required)
3. You have an Index to ingest data to. Either:
   - [Create an Azure Cognitive Search service + connection](https://learn.microsoft.com/azure/machine-learning/prompt-flow/concept-connections?view=azureml-api-2)
   - Use a Faiss index (works without extra setup)

## Let's Ingest and Index

A DataIndex job is configured using the `azure-ai-ml` Python SDK/CLI, either directly in code or with a YAML file.

### SDK

The examples are runnable as Python scripts, assuming the pre-requisites have been acquired and configured in the script.
Opening them in VS Code enables executing each block below a `# %%` comment like a Jupyter notebook cell.
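
As a rough illustration of the `# %%` cell convention, the sketch below splits a script into cells at each marker line. The sample script text and the `split_cells` helper are hypothetical and exist only for this example; editors implement cell detection in their own way.

```python
import re

# Hypothetical script text using the `# %%` cell convention.
SCRIPT = """# %% Prerequisites
# %pip install example-package

# %% Say hello
message = "hello"
print(message)
"""


def split_cells(source):
    """Split source into (title, body) pairs at lines starting with `# %%`."""
    cells = []
    for line in source.splitlines():
        match = re.match(r"#\s*%%(.*)", line)
        if match:
            # A marker line starts a new cell; the rest of the line is its title.
            cells.append([match.group(1).strip(), []])
        elif cells:
            cells[-1][1].append(line)
    return [(title, "\n".join(body).strip()) for title, body in cells]


cells = split_cells(SCRIPT)
for title, body in cells:
    print(f"[{title}] {len(body.splitlines())} line(s)")
```

Note that a line such as `# %pip install ...` has a single `%`, so it stays inside its cell as an ordinary comment rather than starting a new one.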

#### Cloud Creation

##### Process this documentation using Azure OpenAI and Azure Cognitive Search

- [local_docs_to_acs_mlindex.py](./data_index_job/local_docs_to_acs_mlindex.py)

##### Index data from S3 using OneLake

- [s3_to_acs_mlindex.py](./data_index_job/s3_to_acs_mlindex.py)
- [scheduled_s3_to_asc_mlindex.py](./data_index_job/scheduled_s3_to_asc_mlindex.py)

##### Ingest Azure Search docs from GitHub into a Faiss Index

- [cog_search_docs_faiss_mlindex.py](./data_index_job/cog_search_docs_faiss_mlindex.py)

#### Local Creation

##### Process this documentation using Azure OpenAI and Azure Cognitive Search

- [local_docs_to_acs_aoai_mlindex.py](./mlindex_local/local_docs_to_acs_aoai_mlindex.py)

##### Process this documentation using SentenceTransformers and Faiss

- [local_docs_to_faiss_mlindex.py](./mlindex_local/local_docs_to_faiss_mlindex.py)
- [local_docs_to_faiss_mlindex_with_promptflow.py](./mlindex_local/local_docs_to_faiss_mlindex_with_promptflow.py)
- Learn more about [Promptflow here](https://microsoft.github.io/promptflow/)

##### Use LangChain Documents to create an Index

- [langchain_docs_to_mlindex.py](./mlindex_local/langchain_docs_to_mlindex.py)

## Using the MLIndex asset

More information about how to use MLIndex in various places [here]().

## Appendix

### Which Embeddings Model to use?

There are currently two supported Embedding options: OpenAI's `text-embedding-ada-002` embedding model or HuggingFace embedding models. Here are some factors that might influence your decision:

#### OpenAI

OpenAI has [great documentation](https://platform.openai.com/docs/guides/embeddings) on its embeddings model, `text-embedding-ada-002`. The model can handle up to 8191 tokens and can be accessed using [Azure OpenAI](https://learn.microsoft.com/azure/cognitive-services/openai/concepts/models#embeddings-models) or OpenAI directly.
If you have an existing Azure OpenAI instance, you can connect it to AzureML; if you don't, AzureML provisions a default one for you called `Default_AzureOpenAI`.
The main limitation when using `text-embedding-ada-002` is the cost and quota available for the model. Otherwise, it provides high-quality embeddings across a wide array of text domains while being simple to use.
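
Because the model accepts a bounded number of tokens per input, documents are split into chunks before they are embedded (the `chunk_size` setting in a DataIndex source controls this). The sketch below shows the general idea only. It approximates tokens with whitespace-separated words, which is an assumption made to keep the example dependency-free; a real tokenizer counts differently, and `chunk_words` is a hypothetical helper, not part of `azureml-rag`.

```python
def chunk_words(text, chunk_size):
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[start : start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


document = "word " * 450  # stand-in for a long document
chunks = chunk_words(document, chunk_size=200)
chunk_lengths = [len(chunk.split()) for chunk in chunks]
print(chunk_lengths)
```

Smaller chunks mean more embedding calls, which matters when cost or quota is the limiting factor.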

#### HuggingFace

HuggingFace hosts many different models capable of embedding text into vectors. The [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) ranks the performance of embedding models along several axes. Not all ranked models can be run locally (e.g. `text-embedding-ada-002` is on the list), though many can, and they range from small to large. When embedding with HuggingFace, the model is loaded locally for inference, which may affect your choice of compute resources.

**NOTE:** The default PromptFlow Runtime does not come with HuggingFace model dependencies installed, so indexes created using HuggingFace embeddings will not work in PromptFlow by default. **Pick OpenAI if you want to use PromptFlow.**

### Setting up OneLake and S3

[Create a lakehouse with OneLake](https://learn.microsoft.com/fabric/onelake/create-lakehouse-onelake)

[Set up a shortcut to S3](https://learn.microsoft.com/fabric/onelake/create-s3-shortcut)
@@ -0,0 +1,173 @@
# %%[markdown]
# # Azure Cognitive Search Docs from GitHub to Faiss Index

# %% Prerequisites
# %pip install 'azure-ai-ml==1.10.0a20230825006' --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/
# %pip install 'azureml-rag[faiss]>=0.2.0'
# %pip install 'promptflow[azure]' promptflow-tools promptflow-vectordb

# %% Authenticate to your AzureML Workspace, download a `config.json` from the top right hand corner menu of the Workspace.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(
    credential=DefaultAzureCredential(), path="config.json"
)

# %% Create DataIndex configuration
from azureml.rag.dataindex.entities import (
    Data,
    DataIndex,
    IndexSource,
    CitationRegex,
    Embedding,
    IndexStore,
)

asset_name = "azure_search_docs_aoai_faiss"

data_index = DataIndex(
    name=asset_name,
    description="Azure Cognitive Search docs embedded with text-embedding-ada-002 and indexed in a Faiss Index.",
    source=IndexSource(
        input_data=Data(
            type="uri_folder",
            path="<This will be replaced later>",
        ),
        input_glob="articles/search/**/*",
        citation_url="https://learn.microsoft.com/en-us/azure",
        # Remove articles from the final citation url and remove the file extension so url points to hosted docs, not GitHub.
        citation_url_replacement_regex=CitationRegex(
            match_pattern="(.*)/articles/(.*)(\\.[^.]+)$", replacement_pattern="\\1/\\2"
        ),
    ),
    embedding=Embedding(
        model="text-embedding-ada-002",
        connection="azureml-rag-oai",
        cache_path=f"azureml://datastores/workspaceblobstore/paths/embeddings_cache/{asset_name}",
    ),
    index=IndexStore(type="faiss"),
    # name is replaced with a unique value each time the job is run
    path=f"azureml://datastores/workspaceblobstore/paths/indexes/{asset_name}/{{name}}",
)

# %% Use git_clone Component to clone Azure Search docs from github
ml_registry = MLClient(credential=ml_client._credential, registry_name="azureml")

git_clone_component = ml_registry.components.get("llm_rag_git_clone", label="latest")

# %% Clone Git Repo and use as input to index_job
from azure.ai.ml.dsl import pipeline
from azureml.rag.dataindex.data_index import index_data


@pipeline(default_compute="serverless")
def git_to_faiss(
    git_url,
    branch_name="",
    git_connection_id="",
):
    git_clone = git_clone_component(git_repository=git_url, branch_name=branch_name)
    git_clone.environment_variables[
        "AZUREML_WORKSPACE_CONNECTION_ID_GIT"
    ] = git_connection_id

    index_job = index_data(
        description=data_index.description,
        data_index=data_index,
        input_data_override=git_clone.outputs.output_data,
        ml_client=ml_client,
    )

    return index_job.outputs


# %%
git_index_job = git_to_faiss("https://github.com/MicrosoftDocs/azure-docs.git")
# Ensure repo cloned each run to get latest, comment out to have first clone reused.
git_index_job.settings.force_rerun = True

# %% Submit the DataIndex Job
git_index_run = ml_client.jobs.create_or_update(
    git_index_job,
    experiment_name=asset_name,
)
git_index_run

# %% Wait for it to finish
ml_client.jobs.stream(git_index_run.name)

# %% Check the created asset, it is a folder on storage containing an MLIndex yaml file
mlindex_docs_index_asset = ml_client.data.get(asset_name, label="latest")
mlindex_docs_index_asset

# %% Try it out with langchain by loading the MLIndex asset using the azureml-rag SDK
from azureml.rag.mlindex import MLIndex

mlindex = MLIndex(mlindex_docs_index_asset)

index = mlindex.as_langchain_vectorstore()
docs = index.similarity_search("How can I enable Semantic Search on my Index?", k=5)
docs

# %% Take a look at those chunked docs
import json

for doc in docs:
    print(json.dumps({"content": doc.page_content, **doc.metadata}, indent=2))

# %% Try it out with Promptflow

import promptflow

pf = promptflow.PFClient()

# %% List all the available connections
for c in pf.connections.list():
    print(c.name + " (" + c.type + ")")

# %% Load index qna flow
from pathlib import Path

flow_path = Path.cwd().parent / "flows" / "bring_your_own_data_chat_qna"
mlindex_path = mlindex_docs_index_asset.path

# %% Put MLIndex uri into Vector DB Lookup tool inputs in [bring_your_own_data_chat_qna/flow.dag.yaml](../flows/bring_your_own_data_chat_qna/flow.dag.yaml)
import re

with open(flow_path / "flow.dag.yaml", "r") as f:
    flow_yaml = f.read()
    # Pass flags by keyword: the fourth positional argument of re.sub is `count`, not `flags`.
    flow_yaml = re.sub(
        r"path: (.*)# Index uri",
        f"path: {mlindex_path} # Index uri",
        flow_yaml,
        flags=re.M,
    )
with open(flow_path / "flow.dag.yaml", "w") as f:
    f.write(flow_yaml)

# %% Run qna flow
output = pf.flows.test(
    flow_path,
    inputs={
        "chat_history": [],
        "chat_input": "How recently was Vector Search support added to Azure Cognitive Search?",
    },
)

chat_output = output["chat_output"]
for part in chat_output:
    print(part, end="")

# %% Run qna flow with multiple inputs
data_path = Path.cwd().parent / "flows" / "data" / "azure_search_docs_questions.jsonl"

column_mapping = {
    "chat_history": "${data.chat_history}",
    "chat_input": "${data.chat_input}",
    "chat_output": "${data.chat_output}",
}
run = pf.run(flow=flow_path, data=data_path, column_mapping=column_mapping)
pf.stream(run)

print(f"{run}")


# %%
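
The `CitationRegex` in the script above is easier to follow with a worked substitution. The sketch below applies the same two patterns with Python's `re.sub`. It assumes the job builds the raw citation URL by appending a file's repository path to `citation_url` and then applies the patterns as a regular-expression substitution; that behaviour is an assumption here, not taken from the `azureml-rag` source, and the sample file path is hypothetical.

```python
import re

# Patterns copied from the CitationRegex in the script above.
MATCH_PATTERN = "(.*)/articles/(.*)(\\.[^.]+)$"
REPLACEMENT_PATTERN = "\\1/\\2"

# Hypothetical raw citation URL: citation_url followed by a file path in the cloned repo.
raw_url = "https://learn.microsoft.com/en-us/azure/articles/search/search-what-is-azure-search.md"

# Group 1 is everything before "/articles/", group 2 is the path without its extension.
hosted_url = re.sub(MATCH_PATTERN, REPLACEMENT_PATTERN, raw_url)
print(hosted_url)
```

The `articles` path segment and the `.md` extension are dropped, so the resulting URL has the shape of a hosted docs page rather than a file in the GitHub repository.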
@@ -0,0 +1,45 @@
# %%[markdown]
# # Local Documents to Azure Cognitive Search Index

# %% Prerequisites
# %pip install 'azure-ai-ml==1.10.0a20230825006' --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/
# %pip install 'azureml-rag[cognitive_search]>=0.2.0'

# %% Authenticate to your AzureML Workspace, download a `config.json` from the top right hand corner menu of the Workspace.
from azure.ai.ml import MLClient, load_data
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(
    credential=DefaultAzureCredential(), path="config.json"
)

# %% Load DataIndex configuration from file
data_index = load_data("local_docs_to_acs_mlindex.yaml")
print(data_index)

# %% Submit the DataIndex Job
index_job = ml_client.data.index_data(data_index=data_index)

# %% Wait for it to finish
ml_client.jobs.stream(index_job.name)

# %% Check the created asset, it is a folder on storage containing an MLIndex yaml file
mlindex_docs_index_asset = ml_client.data.get(data_index.name, label="latest")
mlindex_docs_index_asset

# %% Try it out with langchain by loading the MLIndex asset using the azureml-rag SDK
from azureml.rag.mlindex import MLIndex

mlindex = MLIndex(mlindex_docs_index_asset)

index = mlindex.as_langchain_vectorstore()
docs = index.similarity_search("What is an MLIndex?", k=5)
docs

# %% Take a look at those chunked docs
import json

for doc in docs:
    print(json.dumps({"content": doc.page_content, **doc.metadata}, indent=2))

# %% Try it out with Promptflow
@@ -0,0 +1,23 @@
$schema: http://azureml/sdk-2-0/DataIndex.json
type: uri_folder
name: mlindex_docs_aoai_acs
description: Python embedded with text-embedding-ada-002 and indexed in Azure Cognitive Search.

source:
  input_data:
    type: uri_folder
    path: ../
  chunk_size: 200
  citation_url: 'https://github.com/Azure/azureml-examples/tree/main/sdk/python/generative-ai/rag/refresh'

embedding:
  model: azure_open_ai://deployment/text-embedding-ada-002/model/text-embedding-ada-002
  connection: azureml-rag-oai
  cache_path: azureml://datastores/workspaceblobstore/paths/embeddings_cache/mlindex_docs_aoai_acs

index:
  type: acs
  connection: azureml:azureml-rag-acs
  name: mlindex_docs_aoai

path: azureml://datastores/workspaceblobstore/paths/indexes/mlindex_docs_aoai_acs/{name}