Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Public Preview Refresh Add MLIndex and DataIndex examples and documentation. (#2624)

* Public Preview Refresh Add MLIndex and DataIndex examples and documentation.
* Rename chat-with-index internal code to src and apply various black formatting fixes.
* Rename pup_refresh to code_first.
* Remove artifacts produced by local examples.
* Address comments.

---------

Co-authored-by: Lucas Pickup <[email protected]>
Showing 47 changed files with 1,912 additions and 0 deletions.
@@ -0,0 +1,87 @@
# AzureML MLIndex Asset creation

MLIndex assets in AzureML represent a model used to generate embeddings from text and an index which can be searched using embedding vectors.
Read more about their structure [here](./docs/mlindex.md).
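
To make the two halves of an MLIndex concrete, here is a minimal, self-contained sketch of the idea: texts are mapped to vectors, and a query is answered by ranking the stored vectors by cosine similarity. The `toy_embed` function (word counts over a fixed vocabulary) and the `ToyIndex` class are hypothetical stand-ins for illustration only. They are not part of `azureml-rag`; a real MLIndex would use an embedding model and a Faiss or Azure Cognitive Search index instead.

```python
import math
from collections import Counter

# Hypothetical fixed vocabulary; a real embedding model learns its own representation.
VOCAB = ["index", "search", "embedding", "vector", "blob", "storage"]


def toy_embed(text: str) -> list[float]:
    """Map text to a vector of word counts over VOCAB (illustration only)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


class ToyIndex:
    """Stores (text, vector) pairs and ranks them against a query vector."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        self._entries.append((text, toy_embed(text)))

    def similarity_search(self, query: str, k: int = 1) -> list[str]:
        query_vec = toy_embed(query)
        ranked = sorted(
            self._entries, key=lambda e: cosine(query_vec, e[1]), reverse=True
        )
        return [text for text, _ in ranked[:k]]


index = ToyIndex()
index.add("vector search over an embedding index")
index.add("blob storage holds the raw files")
top = index.similarity_search("where is blob storage", k=1)
print(top)
```

The method name `similarity_search` mirrors the call the example scripts make on the Langchain vector store, but the ranking logic here is only a sketch of the concept.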

## Pre-requisites

0. Install `azure-ai-ml` and `azureml-rag`:
    - `pip install 'azure-ai-ml==1.10.0a20230825006' --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/`
    - `pip install -U 'azureml-rag[document_parsing,faiss,cognitive_search]>=0.2.0'`
1. You have unstructured data.
    - In one of [AzureML's supported data sources](https://learn.microsoft.com/azure/machine-learning/concept-data?view=azureml-api-2): Blob, ADLSgen2, OneLake, S3, Git
    - In any of these supported file formats: md, txt, py, pdf, ppt(x), doc(x)
2. You have an embedding model.
    - [Create an Azure OpenAI service + connection](https://learn.microsoft.com/azure/machine-learning/prompt-flow/concept-connections?view=azureml-api-2)
    - Use a HuggingFace `sentence-transformer` model (usable right away; to leverage the MLIndex in PromptFlow, a [Custom Runtime](https://promptflow.azurewebsites.net/how-to-guides/how-to-customize-environment-runtime.html) will be required)
3. You have an Index to ingest data to.
    - [Create an Azure Cognitive Search service + connection](https://learn.microsoft.com/azure/machine-learning/prompt-flow/concept-connections?view=azureml-api-2)
    - Use a Faiss index (usable right away)
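
As a rough pre-flight check for prerequisite 1, the sketch below lists which files under a folder have one of the extensions named above. The `SUPPORTED_EXTENSIONS` set and the `find_supported_files` helper are hypothetical names for this illustration and are not part of `azure-ai-ml` or `azureml-rag`. The extension list is taken from the bullet above, reading `ppt(x)` and `doc(x)` as both variants.

```python
import tempfile
from pathlib import Path

# Extensions from the pre-requisites list; ppt(x)/doc(x) expanded to both forms.
SUPPORTED_EXTENSIONS = {".md", ".txt", ".py", ".pdf", ".ppt", ".pptx", ".doc", ".docx"}


def find_supported_files(root: Path) -> list[Path]:
    """Return files under root whose extension is in SUPPORTED_EXTENSIONS, sorted by path."""
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "guide.md").write_text("# Guide")
    (root / "logo.png").write_bytes(b"")
    (root / "nested").mkdir()
    (root / "nested" / "notes.TXT").write_text("notes")
    found = [p.relative_to(root).as_posix() for p in find_supported_files(root)]

print(found)
```

The extension comparison is case-insensitive in this sketch; whether the ingestion pipeline treats upper-case extensions the same way is not stated in this document.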

## Let's Ingest and Index

A DataIndex job is configured using the `azure-ai-ml` Python SDK/CLI, either directly in code or with a YAML file.

### SDK

The examples are runnable as Python scripts, assuming the pre-requisites have been acquired and configured in the script.
Opening them in VS Code enables executing each block below a `# %%` comment like a Jupyter notebook cell.

#### Cloud Creation

##### Process this documentation using Azure OpenAI and Azure Cognitive Search

- [local_docs_to_acs_mlindex.py](./data_index_job/local_docs_to_acs_mlindex.py)

##### Index data from S3 using OneLake

- [s3_to_acs_mlindex.py](./data_index_job/s3_to_acs_mlindex.py)
- [scheduled_s3_to_asc_mlindex.py](./data_index_job/scheduled_s3_to_asc_mlindex.py)

##### Ingest Azure Search docs from GitHub into a Faiss Index

- [cog_search_docs_faiss_mlindex.py](./data_index_job/cog_search_docs_faiss_mlindex.py)
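
The GitHub-to-Faiss script configures a `CitationRegex` so that citations point at the hosted docs rather than at GitHub: it drops the `articles/` path segment and the file extension. The sketch below applies the same match and replacement patterns with Python's standard `re` module to show their effect. How `azureml-rag` joins `citation_url` with each file's relative path is an assumption here, and `example_url` is a made-up input for illustration.

```python
import re

# Patterns copied from the CitationRegex in cog_search_docs_faiss_mlindex.py.
MATCH_PATTERN = r"(.*)/articles/(.*)(\.[^.]+)$"
REPLACEMENT_PATTERN = r"\1/\2"

# Assumed shape of a citation before rewriting: citation_url + "/" + repo-relative path.
example_url = (
    "https://learn.microsoft.com/en-us/azure"
    "/articles/search/search-what-is-azure-search.md"
)

# Group 1 is everything before /articles/, group 2 the path without its extension,
# group 3 the final ".ext", which the replacement leaves out.
rewritten = re.sub(MATCH_PATTERN, REPLACEMENT_PATTERN, example_url)
print(rewritten)
```

A URL that does not contain `/articles/` or has no file extension does not match the pattern and is returned unchanged by `re.sub`.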

#### Local Creation

##### Process this documentation using Azure OpenAI and Azure Cognitive Search

- [local_docs_to_acs_aoai_mlindex.py](./mlindex_local/local_docs_to_acs_aoai_mlindex.py)

##### Process this documentation using SentenceTransformers and Faiss

- [local_docs_to_faiss_mlindex.py](./mlindex_local/local_docs_to_faiss_mlindex.py)
- [local_docs_to_faiss_mlindex_with_promptflow.py](./mlindex_local/local_docs_to_faiss_mlindex_with_promptflow.py)
- Learn more about [Promptflow here](https://microsoft.github.io/promptflow/)

##### Use Langchain Documents to create an Index

- [langchain_docs_to_mlindex.py](./mlindex_local/langchain_docs_to_mlindex.py)

## Using the MLIndex asset

More information about how to use MLIndex in various places is available [here]().

## Appendix

### Which Embeddings Model to use?

There are currently two supported Embedding options: OpenAI's `text-embedding-ada-002` embedding model or HuggingFace embedding models. Here are some factors that might influence your decision:

#### OpenAI

OpenAI has [great documentation](https://platform.openai.com/docs/guides/embeddings) on their Embeddings model `text-embedding-ada-002`. It can handle up to 8191 tokens and can be accessed using [Azure OpenAI](https://learn.microsoft.com/azure/cognitive-services/openai/concepts/models#embeddings-models) or OpenAI directly.
If you have an existing Azure OpenAI Instance you can connect it to AzureML; if you don't, AzureML provisions a default one for you called `Default_AzureOpenAI`.
The main limitation when using `text-embedding-ada-002` is the cost/quota available for the model. Otherwise it provides high quality embeddings across a wide array of text domains while being simple to use.
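
Because the model accepts at most 8191 tokens per input, documents are split into chunks before they are embedded (the DataIndex YAML in this commit sets `chunk_size: 200`). The sketch below shows the general idea with a hypothetical `chunk_words` helper that splits on whitespace. Counting whitespace-separated words only approximates model tokens, and this is not the chunking logic used by `azureml-rag`.

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# A synthetic 450-word document, split with a budget of 200 words per chunk.
document = " ".join(f"word{i}" for i in range(450))
chunks = chunk_words(document, chunk_size=200)
sizes = [len(chunk.split()) for chunk in chunks]
print(sizes)
```

Smaller chunks mean more embedding calls, which matters when cost or quota is the limiting factor described above.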

#### HuggingFace

HuggingFace hosts many different models capable of embedding text into vectors. The [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) ranks the performance of embedding models along several axes. Not all ranked models can be run locally (e.g. `text-embedding-ada-002` is on the list), though many can, and they range from small to large. When embedding with HuggingFace, the model is loaded locally for inference, which may affect your choice of compute resources.
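
As a back-of-the-envelope aid for sizing compute, the sketch below estimates the memory taken by a locally loaded model's weights as parameter count times bytes per parameter. The `estimate_weights_mib` helper and both parameter counts are hypothetical, chosen only to contrast a smaller and a larger model; the estimate ignores activations, tokenizer data, and framework overhead.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def estimate_weights_mib(num_parameters: int, dtype: str = "float32") -> float:
    """Rough size of the model weights alone, in MiB."""
    return num_parameters * BYTES_PER_PARAM[dtype] / (1024**2)


# Hypothetical parameter counts for a smaller and a larger embedding model.
small = estimate_weights_mib(22_000_000)
large = estimate_weights_mib(335_000_000)
print(f"small: {small:.0f} MiB, large: {large:.0f} MiB")
```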

**NOTE:** The default PromptFlow Runtime does not come with HuggingFace model dependencies installed, so Indexes created using HuggingFace embeddings will not work in PromptFlow by default. **Pick OpenAI if you want to use PromptFlow.**

### Setting up OneLake and S3

[Create a lakehouse with OneLake](https://learn.microsoft.com/fabric/onelake/create-lakehouse-onelake)

[Set up a shortcut to S3](https://learn.microsoft.com/fabric/onelake/create-s3-shortcut)
173 changes: 173 additions & 0 deletions
sdk/python/generative-ai/rag/code_first/data_index_job/cog_search_docs_faiss_mlindex.py
@@ -0,0 +1,173 @@
# %%[markdown]
# # Azure Search Docs from GitHub to Faiss Index

# %% Prerequisites
# %pip install 'azure-ai-ml==1.10.0a20230825006' --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/
# %pip install 'azureml-rag[faiss]>=0.2.0'
# %pip install 'promptflow[azure]' promptflow-tools promptflow-vectordb

# %% Authenticate to your AzureML Workspace, download a `config.json` from the top right hand corner menu of the Workspace.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(
    credential=DefaultAzureCredential(), path="config.json"
)

# %% Create DataIndex configuration
from azureml.rag.dataindex.entities import (
    Data,
    DataIndex,
    IndexSource,
    CitationRegex,
    Embedding,
    IndexStore,
)

asset_name = "azure_search_docs_aoai_faiss"

data_index = DataIndex(
    name=asset_name,
    description="Azure Cognitive Search docs embedded with text-embedding-ada-002 and indexed in a Faiss Index.",
    source=IndexSource(
        input_data=Data(
            type="uri_folder",
            path="<This will be replaced later>",
        ),
        input_glob="articles/search/**/*",
        citation_url="https://learn.microsoft.com/en-us/azure",
        # Remove articles from the final citation url and remove the file extension so url points to hosted docs, not GitHub.
        citation_url_replacement_regex=CitationRegex(
            match_pattern="(.*)/articles/(.*)(\\.[^.]+)$", replacement_pattern="\\1/\\2"
        ),
    ),
    embedding=Embedding(
        model="text-embedding-ada-002",
        connection="azureml-rag-oai",
        cache_path=f"azureml://datastores/workspaceblobstore/paths/embeddings_cache/{asset_name}",
    ),
    index=IndexStore(type="faiss"),
    # name is replaced with a unique value each time the job is run
    path=f"azureml://datastores/workspaceblobstore/paths/indexes/{asset_name}/{{name}}",
)

# %% Use git_clone Component to clone Azure Search docs from github
ml_registry = MLClient(credential=ml_client._credential, registry_name="azureml")

git_clone_component = ml_registry.components.get("llm_rag_git_clone", label="latest")

# %% Clone Git Repo and use as input to index_job
from azure.ai.ml.dsl import pipeline
from azureml.rag.dataindex.data_index import index_data


@pipeline(default_compute="serverless")
def git_to_faiss(
    git_url,
    branch_name="",
    git_connection_id="",
):
    git_clone = git_clone_component(git_repository=git_url, branch_name=branch_name)
    git_clone.environment_variables[
        "AZUREML_WORKSPACE_CONNECTION_ID_GIT"
    ] = git_connection_id

    index_job = index_data(
        description=data_index.description,
        data_index=data_index,
        input_data_override=git_clone.outputs.output_data,
        ml_client=ml_client,
    )

    return index_job.outputs


# %%
git_index_job = git_to_faiss("https://github.com/MicrosoftDocs/azure-docs.git")
# Ensure repo cloned each run to get latest, comment out to have first clone reused.
git_index_job.settings.force_rerun = True

# %% Submit the DataIndex Job
git_index_run = ml_client.jobs.create_or_update(
    git_index_job,
    experiment_name=asset_name,
)
git_index_run

# %% Wait for it to finish
ml_client.jobs.stream(git_index_run.name)

# %% Check the created asset, it is a folder on storage containing an MLIndex yaml file
mlindex_docs_index_asset = ml_client.data.get(asset_name, label="latest")
mlindex_docs_index_asset

# %% Try it out with langchain by loading the MLIndex asset using the azureml-rag SDK
from azureml.rag.mlindex import MLIndex

mlindex = MLIndex(mlindex_docs_index_asset)

index = mlindex.as_langchain_vectorstore()
docs = index.similarity_search("How can I enable Semantic Search on my Index?", k=5)
docs

# %% Take a look at those chunked docs
import json

for doc in docs:
    print(json.dumps({"content": doc.page_content, **doc.metadata}, indent=2))

# %% Try it out with Promptflow

import promptflow

pf = promptflow.PFClient()

# %% List all the available connections
for c in pf.connections.list():
    print(c.name + " (" + c.type + ")")

# %% Load index qna flow
from pathlib import Path

flow_path = Path.cwd().parent / "flows" / "bring_your_own_data_chat_qna"
mlindex_path = mlindex_docs_index_asset.path

# %% Put MLIndex uri into Vector DB Lookup tool inputs in [bring_your_own_data_chat_qna/flow.dag.yaml](../flows/bring_your_own_data_chat_qna/flow.dag.yaml)
import re

with open(flow_path / "flow.dag.yaml", "r") as f:
    flow_yaml = f.read()
    # Pass re.M by keyword: the fourth positional argument of re.sub is `count`, not `flags`.
    flow_yaml = re.sub(
        r"path: (.*)# Index uri",
        f"path: {mlindex_path} # Index uri",
        flow_yaml,
        flags=re.M,
    )
with open(flow_path / "flow.dag.yaml", "w") as f:
    f.write(flow_yaml)

# %% Run qna flow
output = pf.flows.test(
    flow_path,
    inputs={
        "chat_history": [],
        "chat_input": "How recently was Vector Search support added to Azure Cognitive Search?",
    },
)

chat_output = output["chat_output"]
for part in chat_output:
    print(part, end="")

# %% Run qna flow with multiple inputs
data_path = Path.cwd().parent / "flows" / "data" / "azure_search_docs_questions.jsonl"

column_mapping = {
    "chat_history": "${data.chat_history}",
    "chat_input": "${data.chat_input}",
    "chat_output": "${data.chat_output}",
}
run = pf.run(flow=flow_path, data=data_path, column_mapping=column_mapping)
pf.stream(run)

print(f"{run}")


# %%
45 changes: 45 additions & 0 deletions
sdk/python/generative-ai/rag/code_first/data_index_job/local_docs_to_acs_mlindex.py
@@ -0,0 +1,45 @@
# %%[markdown]
# # Local Documents to Azure Cognitive Search Index

# %% Prerequisites
# %pip install 'azure-ai-ml==1.10.0a20230825006' --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/
# %pip install 'azureml-rag[cognitive_search]>=0.2.0'

# %% Authenticate to your AzureML Workspace, download a `config.json` from the top right hand corner menu of the Workspace.
from azure.ai.ml import MLClient, load_data
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(
    credential=DefaultAzureCredential(), path="config.json"
)

# %% Load DataIndex configuration from file
data_index = load_data("local_docs_to_acs_mlindex.yaml")
print(data_index)

# %% Submit the DataIndex Job
index_job = ml_client.data.index_data(data_index=data_index)

# %% Wait for it to finish
ml_client.jobs.stream(index_job.name)

# %% Check the created asset, it is a folder on storage containing an MLIndex yaml file
mlindex_docs_index_asset = ml_client.data.get(data_index.name, label="latest")
mlindex_docs_index_asset

# %% Try it out with langchain by loading the MLIndex asset using the azureml-rag SDK
from azureml.rag.mlindex import MLIndex

mlindex = MLIndex(mlindex_docs_index_asset)

index = mlindex.as_langchain_vectorstore()
docs = index.similarity_search("What is an MLIndex?", k=5)
docs

# %% Take a look at those chunked docs
import json

for doc in docs:
    print(json.dumps({"content": doc.page_content, **doc.metadata}, indent=2))

# %% Try it out with Promptflow
23 changes: 23 additions & 0 deletions
sdk/python/generative-ai/rag/code_first/data_index_job/local_docs_to_acs_mlindex.yaml
@@ -0,0 +1,23 @@
$schema: http://azureml/sdk-2-0/DataIndex.json
type: uri_folder
name: mlindex_docs_aoai_acs
description: Python embedded with text-embedding-ada-002 and indexed in Azure Cognitive Search.

source:
  input_data:
    type: uri_folder
    path: ../
  chunk_size: 200
  citation_url: 'https://github.com/Azure/azureml-examples/tree/main/sdk/python/generative-ai/rag/refresh'

embedding:
  model: azure_open_ai://deployment/text-embedding-ada-002/model/text-embedding-ada-002
  connection: azureml-rag-oai
  cache_path: azureml://datastores/workspaceblobstore/paths/embeddings_cache/mlindex_docs_aoai_acs

index:
  type: acs
  connection: azureml:azureml-rag-acs
  name: mlindex_docs_aoai

path: azureml://datastores/workspaceblobstore/paths/indexes/mlindex_docs_aoai_acs/{name}