ParadeDB (pg_search, pg_analytics) #758

Merged
merged 11 commits into from
Sep 17, 2024

vitabaks
Owner

@vitabaks vitabaks commented Sep 13, 2024

Issue: #739

Add ParadeDB (pg_search and pg_analytics extensions)

Variables

  • enable_paradedb to install pg_analytics, pg_search, and pgvector extensions
  • enable_pg_search, enable_pg_analytics to install a specific extension
  • pg_search_version, pg_analytics_version to select the version of each extension. Default: latest
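
As a minimal sketch, the same variables could be set in a vars file instead of being passed with `-e`. The file location is an assumption and depends on how your inventory is organized; the variable names and the `latest` default are the ones listed above.

```yaml
# Hypothetical vars file (the location is an assumption, e.g. vars/main.yml)

# Option A: install pg_search, pg_analytics, and pgvector together
enable_paradedb: true

# Option B: install only a specific extension instead
# enable_pg_search: true
# enable_pg_analytics: true

# Versions default to "latest" when not overridden
pg_search_version: "latest"
pg_analytics_version: "latest"
```

Setting `enable_paradedb` alone should be enough for the full set; the per-extension switches are only needed when you want one extension without the others.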

Compatible with Debian 12, Ubuntu 22.04 and 24.04, and Red Hat 8 and 9 for Postgres 14-17

Deploy ParadeDB HA Cluster

To deploy a PostgreSQL High-Availability Cluster with the ParadeDB extensions, add the enable_paradedb variable:

ansible-playbook deploy_pgcluster.yml -e "enable_paradedb=true"

Additionally

Remove duplicated extension-installation code from the PostgreSQL upgrade process (pg_upgrade.yml playbook). The upgrade now imports the existing extensions.yml task from the packages role, so the logic for installing extensions lives in one place.
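
One way to reuse a task file from another role is `ansible.builtin.import_role` with `tasks_from`. The sketch below only illustrates that pattern; the task name is made up, and the actual change in this PR may be written differently (for example with `import_tasks` and a relative path).

```yaml
# Illustrative only: pull in tasks/extensions.yml from the "packages" role
# instead of repeating the extension-installation tasks in the upgrade code.
- name: Install extensions for the new PostgreSQL version  # hypothetical task name
  ansible.builtin.import_role:
    name: packages
    tasks_from: extensions
```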

@vitabaks vitabaks added the enhancement Improvement of the current functionality label Sep 13, 2024
@vitabaks vitabaks self-assigned this Sep 13, 2024
@vitabaks vitabaks marked this pull request as draft September 13, 2024 11:41
@vitabaks vitabaks added the new feature New functionality label Sep 16, 2024
@vitabaks vitabaks marked this pull request as ready for review September 17, 2024 11:25
@vitabaks
Owner Author

vitabaks commented Sep 17, 2024

Test (ParadeDB HA Cluster)

ansible-playbook deploy_pgcluster.yml -e "enable_paradedb=true"

Ansible log:

PLAY [Deploy PostgreSQL HA Cluster (based on "Patroni")] ***********************
...
TASK [pre-checks : Create a list of extensions] ********************************
ok: [10.172.0.20]

TASK [pre-checks : Add required extensions to 'shared_preload_libraries' (if missing)] ***
ok: [10.172.0.20] => (item=pg_search)
ok: [10.172.0.20] => (item=pg_analytics)
...
TASK [packages : Looking up the latest version of pg_search] *******************
ok: [10.172.0.20]
ok: [10.172.0.21]
ok: [10.172.0.22]

TASK [packages : Install pg_search v0.9.4 package] *****************************
changed: [10.172.0.22]
changed: [10.172.0.20]
changed: [10.172.0.21]

TASK [packages : Looking up the latest version of pg_analytics] ****************
ok: [10.172.0.21]
ok: [10.172.0.22]
ok: [10.172.0.20]

TASK [packages : Install pg_analytics v0.1.4 package] **************************
changed: [10.172.0.20]
changed: [10.172.0.22]
changed: [10.172.0.21]
...

Create extensions

Note

Extensions can be created automatically if you define them in the postgresql_extensions variable.
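
A sketch of what that definition might look like. The entry format (an `ext` name plus a target `db`) is an assumption based on the project's usual variable layout, so check it against the defaults in your version before relying on it.

```yaml
# Assumed entry format: one item per extension and target database
postgresql_extensions:
  - { ext: "pg_search", db: "postgres" }
  - { ext: "pg_analytics", db: "postgres" }
  - { ext: "vector", db: "postgres" }
```

With entries like these, the manual `create extension` statements shown in this test would not be needed.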

postgres=# \dx
                 List of installed extensions
  Name   | Version |   Schema   |         Description          
---------+---------+------------+------------------------------
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
(1 row)

postgres=# show shared_preload_libraries ;
                    shared_preload_libraries                                                      
------------------------------------------------------------------
 pg_stat_statements,auto_explain,pg_search,pg_analytics
(1 row)

postgres=# create extension pg_analytics ;
CREATE EXTENSION
postgres=# create extension pg_search ;
CREATE EXTENSION
postgres=# create extension vector ;
CREATE EXTENSION
postgres=# 
postgres=# \dx
                                 List of installed extensions
     Name     | Version |   Schema   |                       Description                       
--------------+---------+------------+---------------------------------------------------------
 pg_analytics | 0.1.4   | public     | pg_analytics: Postgres for analytics, powered by DuckDB
 pg_search    | 0.9.4   | paradedb   | pg_search: Full text search for PostgreSQL using BM25
 plpgsql      | 1.0     | pg_catalog | PL/pgSQL procedural language
 vector       | 0.7.4   | public     | vector data type and ivfflat and hnsw access methods
(4 rows)

Check pg_search

postgres=# CALL paradedb.create_bm25_test_table(
  schema_name => 'public',
  table_name => 'mock_items'
);

SELECT description, rating, category
FROM mock_items
LIMIT 3;
CALL
       description        | rating |  category   
--------------------------+--------+-------------
 Ergonomic metal keyboard |      4 | Electronics
 Plastic Keyboard         |      4 | Electronics
 Sleek running shoes      |      5 | Footwear
(3 rows)

postgres=# CALL paradedb.create_bm25(
        index_name => 'search_idx',
        schema_name => 'public',
        table_name => 'mock_items',
        key_field => 'id',
        text_fields => paradedb.field('description', tokenizer => paradedb.tokenizer('en_stem')) ||
                       paradedb.field('category'),
        numeric_fields => paradedb.field('rating')
);
CALL
postgres=# SELECT description, rating, category
FROM search_idx.search(
  '(description:keyboard OR category:electronics) AND rating:>2',
  limit_rows => 5
);
         description         | rating |  category   
-----------------------------+--------+-------------
 Plastic Keyboard            |      4 | Electronics
 Ergonomic metal keyboard    |      4 | Electronics
 Innovative wireless earbuds |      5 | Electronics
 Fast charging power bank    |      4 | Electronics
 Bluetooth-enabled speaker   |      3 | Electronics
(5 rows)

Check pg_search + pgvector

postgres=# ALTER TABLE mock_items ADD COLUMN embedding vector(3);

UPDATE mock_items m
SET embedding = ('[' ||
    ((m.id + 1) % 10 + 1)::integer || ',' ||
    ((m.id + 2) % 10 + 1)::integer || ',' ||
    ((m.id + 3) % 10 + 1)::integer || ']')::vector;

SELECT description, rating, category, embedding
FROM mock_items
LIMIT 3;
ALTER TABLE
UPDATE 41
       description        | rating |  category   | embedding 
--------------------------+--------+-------------+-----------
 Ergonomic metal keyboard |      4 | Electronics | [3,4,5]
 Plastic Keyboard         |      4 | Electronics | [4,5,6]
 Sleek running shoes      |      5 | Footwear    | [5,6,7]
(3 rows)

postgres=# CREATE INDEX on mock_items
USING hnsw (embedding vector_l2_ops);
CREATE INDEX
postgres=# SELECT description, category, rating, embedding
FROM mock_items
ORDER BY embedding <-> '[1,2,3]'
LIMIT 3;
       description       |  category  | rating | embedding 
-------------------------+------------+--------+-----------
 Artistic ceramic vase   | Home Decor |      4 | [1,2,3]
 Modern wall clock       | Home Decor |      4 | [1,2,3]
 Designer wall paintings | Home Decor |      5 | [1,2,3]
(3 rows)

postgres=# SELECT * FROM search_idx.score_hybrid(
    bm25_query => 'description:keyboard OR category:electronics',
    similarity_query => '''[1,2,3]'' <-> embedding',
    bm25_weight => 0.9,
    similarity_weight => 0.1
) LIMIT 5;
 id | score_hybrid 
----+--------------
  2 |   0.95714283
  1 |    0.8490507
 29 |          0.1
 39 |          0.1
  9 |          0.1
(5 rows)

postgres=# SELECT m.description, m.category, m.embedding, s.score_hybrid
FROM mock_items m
LEFT JOIN (
    SELECT * FROM search_idx.score_hybrid(
        bm25_query => 'description:keyboard OR category:electronics',
        similarity_query => '''[1,2,3]'' <-> embedding',
        bm25_weight => 0.9,
        similarity_weight => 0.1
    )
) s
ON m.id = s.id
LIMIT 5;
       description        |  category   | embedding | score_hybrid 
--------------------------+-------------+-----------+--------------
 Plastic Keyboard         | Electronics | [4,5,6]   |   0.95714283
 Ergonomic metal keyboard | Electronics | [3,4,5]   |    0.8490507
 Designer wall paintings  | Home Decor  | [1,2,3]   |          0.1
 Handcrafted wooden frame | Home Decor  | [1,2,3]   |          0.1
 Modern wall clock        | Home Decor  | [1,2,3]   |          0.1
(5 rows)

Check pg_analytics

postgres=# CREATE FOREIGN DATA WRAPPER parquet_wrapper
HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator;

CREATE SERVER parquet_server FOREIGN DATA WRAPPER parquet_wrapper;

CREATE FOREIGN TABLE trips ()
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet');
CREATE FOREIGN DATA WRAPPER
CREATE SERVER
CREATE FOREIGN TABLE
postgres=# SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
 vendorid | passenger_count | trip_distance 
----------+-----------------+---------------
        2 |               1 |          1.72
(1 row)

postgres=# SELECT COUNT(*) FROM trips;
  count  
---------
 2964624
(1 row)

passed

@vitabaks vitabaks merged commit 714fd72 into master Sep 17, 2024
15 checks passed
@vitabaks vitabaks deleted the paradedb branch September 17, 2024 14:08
@philippemnoel

Thank you for adding us @vitabaks. This is really exciting.
