---
title: Incremental Adoption
---
Migrating to Indexify is easy! Most multi-stage application code can be ported with minimal changes. Here is an example.

The following code reads a list of documents, chunks them, embeds the chunks, and writes the embeddings to a database.
```python
from typing import List

def chunk(doc: str) -> List[str]:
    return doc.split("\n")

def embed_chunk(chunk: str) -> List[float]:
    return embed(chunk)

# Indexing code
def myapp(docs: List[str]) -> None:
    # chunk() returns a list of chunks per document, so flatten before embedding
    chunks = [c for doc in docs for c in chunk(doc)]
    embeddings = [embed_chunk(c) for c in chunks]
    for chunk_text, embedding in zip(chunks, embeddings):
        vectordb.write(chunk_text, embedding)
```
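The `embed` helper (and the `vectordb` client) are assumed by this example. A minimal sketch of what `embed` might look like, here using sentence-transformers purely as an illustration:

```python
from typing import List

from sentence_transformers import SentenceTransformer

# Illustrative stand-in for the embed() helper used above;
# any embedding model or API client would work here.
_model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str) -> List[float]:
    # encode() returns a numpy array; convert it to a plain list of floats
    return _model.encode(text).tolist()
```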
This is a typical pattern in many applications. While it works in notebooks and prototypes, it has the following limitations -
- The code is synchronous and runs on a single machine, so it can't scale to large datasets.
- If any step fails, you have to re-run the entire process, e.g. re-chunk and re-embed everything if the database write fails.
- You have to build a solution for re-indexing data if you change the embedding model.
- You have to build a solution for managing access control and namespaces for different data sources.
- You have to wrap it with FastAPI or another server framework to make it callable from other applications, as the sketch below illustrates.
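For example, making `myapp` callable from other applications means writing and operating server code like the following hypothetical minimal wrapper (the endpoint and request model are illustrative), and it still does nothing about retries, scaling, or versioning -

```python
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class IndexRequest(BaseModel):
    docs: List[str]

# Illustrative endpoint; this wrapper only exposes the pipeline over HTTP.
@app.post("/index")
def index(request: IndexRequest) -> dict:
    # The whole pipeline runs synchronously inside the request handler,
    # so a failure at any step forces the caller to resubmit everything.
    myapp(request.docs)
    return {"status": "ok"}
```

With Indexify, the same pipeline is expressed as decorated functions instead: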
```python
from typing import List

from indexify import indexify_function
from pydantic import BaseModel

@indexify_function()
def chunk(doc: str) -> List[str]:
    return doc.split("\n")

class ChunkEmbeddings(BaseModel):
    chunk: str
    embeddings: List[float]

@indexify_function()
def embed_chunk(chunk: str) -> ChunkEmbeddings:
    embeddings = embed(chunk)
    return ChunkEmbeddings(chunk=chunk, embeddings=embeddings)

@indexify_function()
def db_writer(chunk_embeddings: ChunkEmbeddings) -> None:
    # Chunking and embedding already happened upstream in the graph;
    # this step only persists the result.
    vectordb.write(chunk_embeddings.chunk, chunk_embeddings.embeddings)
```
That is the main change to the functions themselves. Next, we define a graph that connects them; the edges of the graph define the data flow across the functions.
```python
from indexify import RemoteGraph, Graph

g = Graph(start_node=chunk)
g.add_edge(chunk, embed_chunk)
g.add_edge(embed_chunk, db_writer)

# Creates a remote endpoint for the graph.
# This is the "deployment" step.
graph = RemoteGraph.deploy(g)
```
At this point, your graph is deployed as a remote API, and you can call it from any application code.
```python
docs = ["doc1", "doc2", "doc3"]
for doc in docs:
    invocation_id = graph.run(block_until_done=True, doc=doc)
    # Fetch the outputs of the embed_chunk node for this invocation.
    chunk_embeddings: List[ChunkEmbeddings] = graph.output(invocation_id, "embed_chunk")
```

This gives you the following benefits -
- You get a remote API for your workflow without writing any server code.
- Invocations of the graph are automatically load balanced and parallelized.
- If any step fails, only that step is retried; the entire process doesn't re-run.
- Graphs are automatically versioned, and when you roll out a new model, you can re-run the graph on older data.
- The embedding function can run on GPUs while chunking and database writes run on CPUs, so every second of GPU time is spent on embedding and you are not paying for a GPU while the database write happens. A hypothetical sketch of such placement follows.
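The exact API for pinning functions to hardware depends on your Indexify version; the `gpu` parameter below is a hypothetical illustration of the idea, not a confirmed Indexify API -

```python
# Hypothetical: `gpu=True` is illustrative only; consult the Indexify
# documentation for the actual way to assign functions to GPU or CPU workers.
@indexify_function(gpu=True)
def embed_chunk(chunk: str) -> ChunkEmbeddings:
    return ChunkEmbeddings(chunk=chunk, embeddings=embed(chunk))

@indexify_function()  # runs on the default CPU workers
def db_writer(chunk_embeddings: ChunkEmbeddings) -> None:
    vectordb.write(chunk_embeddings.chunk, chunk_embeddings.embeddings)
```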