bug: SimilaritySearch with scoreThreshold does not work with PgVector. #974

Closed
AnthonyDasse opened this issue Jul 30, 2024 · 3 comments

AnthonyDasse commented Jul 30, 2024

Hello.

I think there is an issue with the SimilaritySearch function when the scoreThreshold option is used.

When I use the SimilaritySearch function with PgVector and set the 'scoreThreshold' option to 0.80, no documents are returned.
If I remove the 'scoreThreshold' option, many documents are returned with a score greater than 0.80.

According to my research and the GitHub issues of the langchain library, the problem comes from confusion between the distance strategies.

See the issues:

I think the error is around this part of the code in langchaingo/vectorstores/pgvector:

if scoreThreshold != 0 {
	whereQuerys = append(whereQuerys, fmt.Sprintf("data.distance < %f", 1-scoreThreshold))
}

According to the pgvector documentation (https://github.com/pgvector/pgvector), the SQL query should look like this:

SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;
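The relationship between the two quantities can be sketched in plain Go. This is only an illustration, not langchaingo's code: `scoredDoc`, `similarity`, and `filterByScore` are hypothetical names, and the sketch assumes the stored distance is the cosine distance produced by `<=>`, so that the score is `1 - distance` as in the pgvector query above.

```go
package main

import "fmt"

// scoredDoc is a hypothetical row: a document plus the cosine distance
// that pgvector's `<=>` operator is assumed to have returned for it.
type scoredDoc struct {
	Content  string
	Distance float64
}

// similarity converts a cosine distance into a cosine similarity,
// mirroring `1 - (embedding <=> query)`.
func similarity(distance float64) float64 {
	return 1 - distance
}

// filterByScore keeps the documents whose similarity is at least
// scoreThreshold. It compares scores with scores, so the direction of
// the comparison cannot be confused with a distance comparison.
func filterByScore(docs []scoredDoc, scoreThreshold float64) []scoredDoc {
	var kept []scoredDoc
	for _, d := range docs {
		if similarity(d.Distance) >= scoreThreshold {
			kept = append(kept, d)
		}
	}
	return kept
}

func main() {
	docs := []scoredDoc{{"close match", 0.10}, {"weak match", 0.40}}
	for _, d := range filterByScore(docs, 0.80) {
		fmt.Printf("%s (similarity %.2f)\n", d.Content, similarity(d.Distance))
	}
}
```

Under these assumptions, `similarity >= threshold` is equivalent to `distance <= 1 - threshold`. If the column being filtered holds something other than a cosine distance (for example a Euclidean distance or an already converted score), the same filter would reject the wrong rows.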

chew-z commented Aug 13, 2024

No one answered you so I will try...

  1. It seems to me that langchaingo can't handle newer embedding models. It depends on github.com/pkoukk/tiktoken-go v0.1.6, which is good only for the older text-embedding-ada-002 embeddings. Newer embedding models are handled since 0.1.7... You can't use text-embedding-3-large by definition, and results with text-embedding-3-small seem suspicious to me.

That's only my experience. I should have investigated further but ...

  2. Since vector search is so critical, I would not recommend depending on langchaingo but writing this yourself. It is really just one database function and one small Go function after all.

chew-z commented Aug 13, 2024

CREATE OR REPLACE FUNCTION public.find_closest_vector(input_vector VECTOR(1536), limit_results INT, filename VARCHAR, collection_name VARCHAR)
RETURNS TABLE(
    doc VARCHAR,
    similarity DOUBLE PRECISION
) AS $$
BEGIN
    RETURN QUERY
    SELECT
        c.document,
        (1 - (c.embedding <=> input_vector)) AS similarity
    FROM
        langchain_pg_embedding c
    INNER JOIN langchain_pg_collection col ON c.collection_id = col.uuid
    WHERE
        col.name = collection_name AND
        c.cmetadata ->> 'filename' = filename
    ORDER BY
        c.embedding <=> input_vector -- This operator calculates the cosine distance
    LIMIT limit_results;
END;
$$ LANGUAGE plpgsql;

// Import paths assume pgx v5 and pgvector-go.
import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgtype"
	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/pgvector/pgvector-go"
)

// VectorSearchResult is one row returned by public.find_closest_vector.
type VectorSearchResult struct {
	Document   pgtype.Text   `db:"doc"`
	Similarity pgtype.Float8 `db:"similarity"`
}

// VectorSearch queries the database for the vectors closest to the given vector, with the specified limit and filename.
// It returns a slice of VectorSearchResult structs, or nil if the query or row collection fails.
// PGCOLLECTION is assumed to be a package-level constant holding the collection name.
func VectorSearch(dbPool *pgxpool.Pool, vector *[]float32, limit int, filename string) []VectorSearchResult {
	ctx := context.Background()

	// Create a new vector using the given vector slice
	v := pgvector.NewVector(*vector)

	// Execute the query using the dbPool and the vector, limit, filename, and collection name as parameters
	rows, err := dbPool.Query(ctx, "SELECT doc, similarity FROM public.find_closest_vector($1, $2, $3, $4)", v, limit, filename, PGCOLLECTION)
	if err != nil {
		log.Println("error while executing query - ", err)
		return nil
	}

	// Collect the rows into a slice of VectorSearchResult structs
	result, err := pgx.CollectRows(rows, pgx.RowToStructByNameLax[VectorSearchResult])
	if err != nil {
		log.Printf("CollectRows error: %s", err.Error())
		return nil
	}

	// Return the result
	return result
}
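
To sanity-check the similarity values coming back from the database, the cosine distance can be recomputed on the client side. The following is a minimal sketch using only the standard library; `cosineDistance` is a hypothetical helper (not part of pgvector-go or langchaingo), and it assumes the usual definition of cosine distance as one minus the cosine of the angle between the vectors.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// cosineDistance returns 1 - cos(angle between a and b).
// It accumulates in float64 to limit rounding error on float32 inputs.
func cosineDistance(a, b []float32) (float64, error) {
	if len(a) == 0 || len(a) != len(b) {
		return 0, errors.New("vectors must be non-empty and of equal length")
	}
	var dot, normA, normB float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		normA += x * x
		normB += y * y
	}
	if normA == 0 || normB == 0 {
		return 0, errors.New("cosine distance is undefined for a zero vector")
	}
	return 1 - dot/(math.Sqrt(normA)*math.Sqrt(normB)), nil
}

func main() {
	query := []float32{3, 1, 2}
	stored := []float32{1, 2, 3}

	d, err := cosineDistance(query, stored)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The similarity is the value to compare against a score threshold.
	fmt.Printf("distance=%.4f similarity=%.4f\n", d, 1-d)
}
```

If a value computed this way differs noticeably from the `similarity` column for the same pair of vectors, the query is probably using a different distance operator than expected.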

AnthonyDasse (author) commented

Thank you @chew-z, I will look into that.

tmc closed this as completed in 238d1c7 on Oct 24, 2024