feat: vectorize #177

Merged
merged 64 commits into from
Oct 5, 2024

Conversation

RihanArfan
Contributor

@RihanArfan RihanArfan commented Jun 19, 2024

Closes #174, Related #173

Adds support for using Vectorize indexes.

For docs: Vectorize is exposed through Cloudflare bindings, accessed via const vectorize = hubVectorize(<index>), so Cloudflare's docs apply. https://developers.cloudflare.com/vectorize/reference/client-api/


How to use now?

Although Vectorize support is still a WIP PR, it's pretty straightforward to use today, as long as you're fine with temporary caveats such as manually adding bindings to environments and developing via --remote. There may be breaking changes, so review the changes when updating @nuxthub/core.

1. Create Index

For now, you'll need to create the index manually via wrangler while this PR is still in progress. This will eventually be handled by NuxtHub while deploying.

An index's dimensions and metric should be set based on the embeddings model you're using. I'm using bge-base-en-v1.5, which needs 768 dimensions and the cosine metric. You cannot change these later without recreating the index (and triggering a new Pages deployment).


pnpx wrangler vectorize create ecommerce-products --dimensions=768 --metric=cosine
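To make the 768/cosine requirement a bit more concrete, here is a minimal sketch of what the cosine metric computes and why the dimensions have to match. It is plain TypeScript and not part of this PR or of Vectorize; the function name cosineSimilarity is just an illustration.

```typescript
// Illustrative helper (not a NuxtHub or Vectorize API): cosine similarity
// between two vectors. Both vectors must have the same number of dimensions,
// which is why the index dimensions must match the embeddings model.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Convention chosen for this sketch: a zero vector has similarity 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

With bge-base-en-v1.5 both inputs would have 768 entries; a vector of any other length would be rejected by the check at the top.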

Once you've made the index, you can add a binding for it via the Cloudflare dash (Pages -> Settings -> Functions -> Vectorize index bindings).

Update: Make sure the binding name follows this format: VECTORIZE_<index name in upper case>. In this scenario it'd be VECTORIZE_PRODUCTS.
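As a quick sanity check for the naming rule above, a sketch like the following derives the expected binding name from the index name passed to hubVectorize(). The helper vectorizeBindingName is hypothetical and only mirrors the VECTORIZE_<index name in upper case> format described here; how the module handles other characters in a name is not covered.

```typescript
// Hypothetical helper: builds the binding name in the format described above,
// VECTORIZE_<index name in upper case>. Not an exported NuxtHub function.
function vectorizeBindingName(indexName: string): string {
  return `VECTORIZE_${indexName.toUpperCase()}`;
}
```

For the config in step 3, vectorizeBindingName('products') gives VECTORIZE_PRODUCTS, which is the name to enter in the Cloudflare dash.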

2. Use the @nuxthub/core version built from this PR

npm i https://pkg.pr.new/nuxt-hub/core/@nuxthub/core@177

3. Enable Vectorize

// nuxt.config.ts
export default defineNuxtConfig({
  hub: {
    vectorize: {
      products: {
        metric: 'cosine',
        dimensions: 768,
        metadataIndexes: { name: 'string', price: 'number' }
      },
    },
    // ...
  },
})

4. Deploy the website

Cloudflare doesn't support local Vectorize bindings yet, so --remote is required. Push and deploy now so that Vectorize can be used via bindings on the deployed application.

5. Done!

You need to use --remote for now.

pnpm dev --remote

Docs

What are vector databases for?

Read https://developers.cloudflare.com/vectorize/reference/what-is-a-vector-database/

Usage

See operations here https://developers.cloudflare.com/vectorize/reference/client-api/#operations

const vectorize = hubVectorize('products')
const { matches } = await vectorize.query(vectors, { topK: 5 })

vectorize.insert()
vectorize.upsert()
// etc.

Usage example

Querying

This example (1) generates a vector from the search query, (2) searches for similar vectors via Vectorize, and (3) enriches the matches by querying the database. https://developers.cloudflare.com/vectorize/reference/what-is-a-vector-database/#vector-search

If you wanted to build a RAG experience, you'd have a 4th step where you pass all this information to an LLM as context in a prompt. See https://developers.cloudflare.com/workers-ai/tutorials/build-a-retrieval-augmented-generation-ai/
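For the RAG variant, the extra step is mostly string assembly. Below is a rough sketch, assuming each match has already been enriched with some text from the database. The ContextDoc shape, the buildRagPrompt name, and the prompt wording are all made up for illustration; the linked tutorial shows Cloudflare's own approach.

```typescript
// Hypothetical shape of an enriched match: the text comes from your database,
// the score from the Vectorize query.
interface ContextDoc {
  id: string;
  text: string;
  score: number;
}

// Builds a prompt that passes the best-scoring documents to an LLM as context.
// The prompt wording is an assumption, not taken from any Cloudflare docs.
function buildRagPrompt(question: string, docs: ContextDoc[], maxDocs = 5): string {
  const context = [...docs]
    .sort((a, b) => b.score - a.score) // highest similarity first
    .slice(0, maxDocs)
    .map((doc, i) => `[${i + 1}] ${doc.text}`)
    .join("\n");

  return [
    "Answer the question using only the context below.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

The resulting string would then be sent to a text generation model, for example through hubAi().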

Code

import { z } from "zod";

interface EmbeddingResponse {
  shape: number[];
  data: number[][];
}

const Query = z.object({
  query: z.string().min(1).max(256),
  limit: z.coerce.number().int().min(1).max(20).default(10),
});

export default defineEventHandler(async (event) => {
  const { query, limit } = await getValidatedQuery(event, Query.parse);

  // 1. generate embeddings for search query
  const ai = hubAi();
  const embeddings: EmbeddingResponse = await ai.run(
    "@cf/baai/bge-base-en-v1.5",
    { text: [query] },
    // cache using ai gateway - https://developers.cloudflare.com/ai-gateway/
    // commented it out as requires creating an AI gateway from cf dash
    // { gateway: { id: "new-role" } },
  );
  const vectors = embeddings.data[0];

  // 2. query vectorize to find similar results
  const vectorize: VectorizeIndex = hubVectorize('jobs');
  const { matches } = await vectorize.query(vectors, {
    topK: limit,
    namespace: "job-titles",
  });

  // 3. get details for matching jobs
  const jobMatches = await useDrizzle().query.jobs.findMany({
    where: (jobs, { inArray }) =>
      inArray(
        jobs.id,
        matches.map((match) => match.id),
      ),
    with: {
      division: true,
      department: true,
      subDepartment: true,
    },
  });

  // 4. add score to job matches
  const jobMatchesWithScore = jobMatches.map((job) => {
    const match = matches.find((match) => match.id === job.id);
    return { ...job, score: match!.score };
  });

  // 5. sort by score
  jobMatchesWithScore.sort((a, b) => b.score - a.score);

  return jobMatchesWithScore;
});
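Steps 4 and 5 above call matches.find() once per job, which is fine for 20 results. As a possible alternative, the sketch below does the same merge with a Map lookup and also skips rows that have no match instead of relying on the non-null assertion. The attachScores name and the Match shape are assumptions for this example, reduced to the two fields used here.

```typescript
// Reduced shape of a query match, limited to the fields this sketch needs.
interface Match {
  id: string;
  score: number;
}

// Attaches the match score to each row and sorts by score, highest first.
// Rows without a corresponding match are dropped.
function attachScores<T extends { id: string }>(
  rows: T[],
  matches: Match[],
): (T & { score: number })[] {
  const scoreById = new Map(matches.map((m) => [m.id, m.score] as const));
  return rows
    .filter((row) => scoreById.has(row.id))
    .map((row) => ({ ...row, score: scoreById.get(row.id)! }))
    .sort((a, b) => b.score - a.score);
}
```

In the handler this would replace steps 4 and 5 with a single call such as attachScores(jobMatches, matches).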

Bulk vector generation and import

This example bulk generates and imports vectors for items in a database, using a text embeddings model to create a search experience.

Code

// server/tasks/generate-embeddings.ts

import { jobs } from "../database/schema";
import { asc, count } from "drizzle-orm";

import type { VectorizeIndex } from "@nuxthub/core";

export default defineTask({
  meta: {
    name: "vectorize:seed",
    description: "Generate vector text embeddings",
  },
  async run() {
    console.log("Running Vectorize seed task...");

    // count all rows
    const jobCount = (await useDrizzle().select({ count: count() }).from(tables.jobs))[0].count;

    // page through the job rows in batches of INCREMENT_AMOUNT, using limit/offset
    const INCREMENT_AMOUNT = 20;

    // log total batches
    const totalBatches = Math.ceil(jobCount / INCREMENT_AMOUNT);
    console.log(`Total jobs: ${jobCount} (${totalBatches} batches)`);

    for (let i = 0; i < jobCount; i += INCREMENT_AMOUNT) {
      console.log(`⏳ Processing jobs ${i} - ${i + INCREMENT_AMOUNT}...`);

      // get id and job titles for batch
      const jobsChunk = await useDrizzle()
        .select()
        .from(tables.jobs)
        .orderBy(asc(jobs.id))
        .limit(INCREMENT_AMOUNT)
        .offset(i);

      // generate embeddings for job titles
      const ai = hubAi();
      const embeddings = await ai.run(
        "@cf/baai/bge-base-en-v1.5",
        { text: jobsChunk.map((job) => job.jobTitle) },
        { gateway: { id: "new-role" } },
      );
      const vectors = embeddings.data;

      // format embeddings with id and metadata (jobTitle) for vectorize index
      const formattedEmbeddings = jobsChunk.map((job, index) => {
        const { sufaCode: id, ...metadata } = job;

        return {
          id,
          namespace: "job-titles",
          metadata: { ...metadata },
          values: vectors[index],
        };
      });

      // save embeddings to vectorize index
      const vectorize: VectorizeIndex = hubVectorize('jobs');
      await vectorize.upsert(formattedEmbeddings);

      console.log(`✅ Processed jobs ${i} - ${i + INCREMENT_AMOUNT}...`);
    }

    console.log("Vectorize seed task completed!");
    return { result: "success" };
  },
});

Vectorize currently supports upserting 1000 vectors per batch via the Workers API and 5000 via the HTTP API, so batching this finely is probably unnecessary. However, I ran into some issues earlier that I didn't have time to debug, so I made the chunks smaller.
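If the rows are already in memory, the batching can be separated from the database paging. Here is a small sketch of a generic chunk helper, assuming the batch limit mentioned above; chunk is not part of NuxtHub or Vectorize, just an illustration.

```typescript
// Illustrative helper: splits an array into consecutive batches of at most
// `size` items, so each batch can be passed to a separate upsert call.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each batch from something like chunk(formattedEmbeddings, 1000) could then be passed to vectorize.upsert() in a loop.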

You can likely simplify this code a lot, but it's a starting point.

More

It's possible to store core data in Vectorize directly as metadata on the record. If you fetch from Vectorize with metadata or values, you're limited to the top 30 results. If you only want IDs and similarity scores back, you can get the top 100 results. (https://developers.cloudflare.com/vectorize/platform/limits/)
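One way to respect these limits is to clamp a user-supplied limit before passing it as topK. The sketch below hard-codes the figures quoted in this comment (30 and 100); those are taken from the text above and may not match the current limits page, so treat them as placeholders. clampTopK is a made-up name.

```typescript
// Figures quoted in the comment above; check the Vectorize limits page
// before relying on them.
const MAX_TOP_K_WITH_DATA = 30; // when returning metadata or values
const MAX_TOP_K_IDS_ONLY = 100; // when returning only IDs and scores

// Clamps a requested result count to the range allowed for the query type.
function clampTopK(requested: number, returnsMetadataOrValues: boolean): number {
  const max = returnsMetadataOrValues ? MAX_TOP_K_WITH_DATA : MAX_TOP_K_IDS_ONLY;
  return Math.min(Math.max(1, Math.floor(requested)), max);
}
```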

@RihanArfan RihanArfan force-pushed the feat/vectorize branch 2 times, most recently from d374cd8 to df6a752 Compare June 19, 2024 15:25

pkg-pr-new bot commented Jun 19, 2024


pnpm add https://pkg.pr.new/nuxt-hub/core/@nuxthub/core@177

commit: cf33615

@RihanArfan RihanArfan force-pushed the feat/vectorize branch 2 times, most recently from 69f60a0 to b2ee77b Compare June 19, 2024 17:18
@RihanArfan
Contributor Author

RihanArfan commented Jun 19, 2024

Turns out Vectorize doesn't support local development; it only works through wrangler with --remote. This is unlike Workers AI, which supports local development, although the models actually run on Cloudflare under your account.

Issue tracking Vectorize local bindings: cloudflare/workers-sdk#4360
https://developers.cloudflare.com/workers/testing/local-development/#supported-resource-bindings-in-different-environments

For now, this feature could only be supported with --remote (either via NuxtHub's proxy or wrangler remote). This would involve adding Vectorize endpoints to NuxtHub's backend, and I don't think that's OSS. Alternatively, it could be blocked until local development is supported with Vectorize. Alternatively, t

Contributor

atinux commented Jun 20, 2024

Thanks for looking at it so quickly.

I think this could anyway be possible within the OSS as you would need to deploy your application at first in order to use Vectorize.

Would you be happy to work on the proxy API routes?

@RihanArfan
Contributor Author

RihanArfan commented Jun 20, 2024

I didn't realise those routes were for anything more than just the devtools preview with --remote for some reason lol 😄 I've added the proxy routes, but I don't think I can test them yet. If I understand correctly, bindings are added once the build hook is run. Could you support adding the Vectorize bindings? Edit: With a fresh mind I realised I can manually add the bindings from CF dash myself 🤦

I'll continue my dissertation, where I'll be testing both the AI and Vectorize integrations to build a simple vector search engine.

@RihanArfan RihanArfan force-pushed the feat/vectorize branch 6 times, most recently from 6173476 to aedb7c2 Compare June 25, 2024 15:11
@RihanArfan
Contributor Author

RihanArfan commented Jun 25, 2024

Vectorize and AI works ✨


Got some small things to clean up code-wise, which I'll hopefully get sorted by mid August.

@RihanArfan
Contributor Author

45 minutes of rebasing 😓 git is not my passion

@RihanArfan RihanArfan marked this pull request as ready for review October 4, 2024 03:35
@atinux atinux merged commit af4dc62 into nuxt-hub:main Oct 5, 2024
4 of 5 checks passed