PGVectorStore: Allow insertion of rows with set ids #3594

Closed
MJDeligan opened this issue Dec 8, 2023 · 1 comment

@MJDeligan
Contributor

Currently, the PGVectorStore implementation only allows setting the page content and metadata when adding documents (or vectors directly). The Python API additionally accepts ids that are used when inserting the rows.

This feature is necessary for an implementation of the Indexing API, as the hashed ids would need to be set manually for the rows of the vector store. See also #3390.

@dosubot dosubot bot added the auto:improvement Medium size change to existing code to handle new use-cases label Dec 8, 2023

dosubot bot commented Dec 8, 2023

🤖

Hey @MJDeligan! Back at it again, huh? Always good to see familiar faces around these parts 👾

Yes, it is possible to modify the PGVectorStore in the langchainjs repository to allow the insertion of rows with set ids, similar to the Python API, for the implementation of the Indexing API.

To achieve this, you would need to modify the addVectors method in the PGVectorStore class to accept an additional parameter for the ids. You would also need to modify the buildInsertQuery method to include the id column in the insert statement.

Here is how you could modify these methods:

  /**
   * Constructs the SQL query for inserting rows into the specified table.
   *
   * @param rows - The rows of data to be inserted; each row holds the id, content, embedding string, metadata and, if a collection is used, the collection id.
   * @returns A promise resolving to the complete SQL INSERT INTO query string.
   */
  private async buildInsertQuery(rows: (string | Record<string, unknown>)[][]) {
    let collectionId;
    if (this.collectionTableName) {
      collectionId = await this.getOrCreateCollection();
    }

    const columns = [
      this.idColumnName,
      this.contentColumnName,
      this.vectorColumnName,
      this.metadataColumnName,
    ];

    if (collectionId) {
      columns.push("collection_id");
    }

    const valuesPlaceholders = rows
      .map((_, j) => this.generatePlaceholderForRowAt(j, columns.length))
      .join(", ");

    const text = `
      INSERT INTO ${this.tableName}(
        ${columns.join(", ")}
      )
      VALUES ${valuesPlaceholders}
    `;
    return text;
  }

  /**
   * Method to add vectors to the vector store. It converts the vectors into
   * rows and inserts them into the database.
   *
   * @param vectors - Array of vectors.
   * @param documents - Array of `Document` instances.
   * @param ids - Array of ids.
   * @returns Promise that resolves when the vectors have been added.
   */
  async addVectors(vectors: number[][], documents: Document[], ids: string[]): Promise<void> {
    const rows = [];
    let collectionId;
    if (this.collectionTableName) {
      collectionId = await this.getOrCreateCollection();
    }

    for (let i = 0; i < vectors.length; i += 1) {
      const values = [];
      const embedding = vectors[i];
      const embeddingString = `[${embedding.join(",")}]`;
      values.push(
        ids[i],
        documents[i].pageContent,
        embeddingString,
        documents[i].metadata
      );
      if (collectionId) {
        values.push(collectionId);
      }
      rows.push(values);
    }

    for (let i = 0; i < rows.length; i += this.chunkSize) {
      const chunk = rows.slice(i, i + this.chunkSize);
      const insertQuery = await this.buildInsertQuery(chunk);
      const flatValues = chunk.flat();
      try {
        await this.pool.query(insertQuery, flatValues);
      } catch (e) {
        console.error(e);
        throw new Error(`Error inserting: ${(e as Error).message}`);
      }
    }
  }

In the addVectors method, I added a new parameter, ids, which is an array of strings containing one id per document. Each id is then pushed into that row's values array, ahead of the page content.

In the buildInsertQuery method, I added this.idColumnName to the columns array. This will include the id column in the insert statement.
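
To make the effect of the extra column concrete, here is a minimal standalone sketch of how the placeholders line up once the id column is included. The function names, table name, and column names are hypothetical, and since generatePlaceholderForRowAt is not shown in this thread, the numbering scheme below is an assumption about how it behaves rather than the library's actual code.

```typescript
// Hypothetical stand-in for generatePlaceholderForRowAt: builds "($1, $2, ...)"
// for one row, numbering parameters continuously across rows.
function placeholderForRowAt(rowIndex: number, numColumns: number): string {
  const placeholders: string[] = [];
  for (let col = 0; col < numColumns; col += 1) {
    placeholders.push(`$${rowIndex * numColumns + col + 1}`);
  }
  return `(${placeholders.join(", ")})`;
}

// Hypothetical helper: assembles a multi-row parameterized INSERT statement.
function buildInsertSql(
  tableName: string,
  columns: string[],
  rowCount: number
): string {
  const values = Array.from({ length: rowCount }, (_, i) =>
    placeholderForRowAt(i, columns.length)
  ).join(", ");
  return `INSERT INTO ${tableName} (${columns.join(", ")}) VALUES ${values}`;
}

// Assumed column names; with the id column there are four values per row.
const sql = buildInsertSql(
  "documents",
  ["id", "content", "embedding", "metadata"],
  2
);
console.log(sql);
// INSERT INTO documents (id, content, embedding, metadata) VALUES ($1, $2, $3, $4), ($5, $6, $7, $8)
```

Because the placeholder count is derived from columns.length, adding the id column shifts every parameter index automatically, as long as each row's values are pushed in the same order as the columns.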

You can find the current implementation of these methods in the langchain/src/vectorstores/pgvector.ts file in the langchainjs repository.

Please note that this modification requires you to provide the ids yourself when calling the addVectors method. If the ids array is shorter than the vectors array, or an entry is missing, the corresponding value will be undefined, which may cause the insert to fail.
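
One possible way to guard against this is to validate the ids before building the rows and to fall back to generated ids when none are passed. The sketch below is only an illustration: resolveIds is a hypothetical helper that is not part of PGVectorStore, and it assumes the id column accepts UUID strings.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical helper: returns exactly one id per row.
// - If no ids are given, a random UUID is generated for each row
//   (assumes the id column accepts UUID strings).
// - If ids are given, their count must match the number of rows.
function resolveIds(count: number, ids?: string[]): string[] {
  if (ids === undefined) {
    return Array.from({ length: count }, () => randomUUID());
  }
  if (ids.length !== count) {
    throw new Error(`Expected ${count} ids but received ${ids.length}`);
  }
  return ids;
}

// Caller-supplied ids are passed through unchanged.
const explicitIds = resolveIds(2, ["doc-a", "doc-b"]);

// Without ids, one UUID is generated per row.
const generatedIds = resolveIds(3);

console.log(explicitIds, generatedIds.length);
```

With a helper like this, addVectors could take ids as an optional argument and use the resolved array in place of ids[i], so a mismatch is reported before any query is sent.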

I hope this helps! If you have any further questions, feel free to ask.

Best regards,
Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Mar 9, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 16, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Mar 16, 2024