Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request]: Plugin Framework to execute a command in SQL #19709

Open
1 task done
cpegeric opened this issue Oct 30, 2024 · 0 comments
Open
1 task done

[Feature Request]: Plugin Framework to execute a command in SQL #19709

cpegeric opened this issue Oct 30, 2024 · 0 comments
Assignees
Labels
kind/feature priority/p0 Critical feature that should be implemented in this version
Milestone

Comments

@cpegeric
Copy link
Contributor

Is there an existing issue for the same feature request?

  • I have checked the existing issues.

Is your feature request related to a problem?

We need a plugin framework in SQL that allows to execute a custom plugin executable that can manipulate the data from (datalink or string) in database.  The result generated by the plugin can then be shown as a result in SQL.

For application, 
1. generate embedding from LLM vector with source data (datalink or string) stored in database and then insert embedding into the table in database.
2. perform RAG application ask() function to LLM in SQL.  You can create a plugin executable to implement the ask function. The plugin will do the following:
a. Read the question from stdin
b. generate the embedding of the question from LLM
c. perform SQL with the LLM index from our database to get the N-best results
d. send N-best results and question to LLM to get the RAG answer
e. send the RAG answer to database
f. show the answer as result

Describe the feature you'd like

Plugin is a framework to run an executable with input data (datalink or string) from stdin and
output must be a JSON array

Function signature

Plugin needs two input columns

  1. command line with arguments in JSON, e.g. ["cat", "filename"]. Note: file path must be an absolute path
  2. input data with datalink or string. With offset and size parameter in datalink, only portion
    of data is read and send to stdin.
    Return:
    output buffer in JSON Array format

Create Embedding in Database

One of the example is wikipedia dumps. Wiki dump is a multisteam file which combine multiple bzip2
data chunks and its index file has the (offset, ID, title). The advantage is we don't need to unpack
the whole file before getting the required data.
Also, for LLM application, you can run the table_function plugin_exec() to convert the data from datalink
into the embedding by calling external LLM vendor.
All you have to do it to write a plugin executable in any language that

  1. read the data from stdin, unzip data with bzip2, parse xml to get all pages
  2. For each page, cut into multiple chunks
  3. send each chunk to LLM to convert into embedding
  4. format list of chunk embedding into JSON Array format
  5. Write the JSON to stdout
  6. Insert the embedding into table with the JSON array returned from stdout

To do this, simple run the SQL like this:
INSERT INTO t1 SELECT json_unquote(json_extract(result, '$.chunk')), json_unquote(json_extract(result, '$.e')) from
plugin_exec('["python3", "gen_embedding.py", "arg1", "arg2"]', cast('stage://mys3/wiki.bz2?offset=0&size=123456' as datalink)) as f;

Example,

create stage llmstage URL='file:///$resources/plugin/';
create table src (pkey int primary key, dlink datalink);
create table embed (pkey int, chunk int, e vecf32(3), t varchar);
insert into src values
(0, 'stage://llmstage/multistream.json?offset=0&size=155'),
(1, 'stage://llmstage/multistream.json?offset=155&size=164');
insert into embed select src.pkey, json_unquote(json_extract(f.result, "$.chunk")),
json_unquote(json_extract(f.result, "$.e")), json_unquote(json_extract(f.result, "$.t"))
from src CROSS APPLY plugin_exec('["cat"]', src.dlink) as f;
select * from embed;
pkey    chunk    e    t
0    0    [1.1, 2.2, 3.3]    text0
0    1    [2.5, 3.5, 4.5]    text1
0    2    [9.1, 8.2, 7.3]    text2
1    3    [10.1, 20.2, 30.3]    text3
1    4    [20.5, 30.5, 40.5]    text4
1    5    [90.1, 80.2, 70.3]    text5
drop stage llmstage;
drop table src;
drop table embed;

LLM Ask function

You can also do ASK function with plugin "ask" command like this:
SELECT * FROM plugin_exec('["ask", "index_table"]', 'this is a question?') as f;

the plugin ask do the following:

  1. get the question from stdin
  2. get the embedding of the question from LLM
  3. query the database with index_table. SELECT * from index_table order by L2_DISTANCE(e, "[1.3,2.3,3.4]") LIMIT 3;
  4. send the result from database to LLM and get the answer
  5. return answer to stdout

Example

select * from plugin_exec('["python3", "$resources/plugin/ask.py", "index_table"]', 'this is question to LLM') as f;
result
"this is answer from LLM with index table name = index_table. Question: this is question to LLM"

Describe implementation you've considered

No response

Documentation, Adoption, Use Case, Migration Strategy

No response

Additional information

No response

@matrix-meow matrix-meow self-assigned this Oct 30, 2024
@sukki37 sukki37 assigned cpegeric and unassigned matrix-meow Oct 31, 2024
@sukki37 sukki37 added this to the 2.1.0 milestone Oct 31, 2024
@sukki37 sukki37 added the priority/p0 Critical feature that should be implemented in this version label Oct 31, 2024
@cpegeric cpegeric mentioned this issue Jan 8, 2025
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature priority/p0 Critical feature that should be implemented in this version
Projects
None yet
Development

No branches or pull requests

3 participants