
feat(postgres sink): Add postgres sink #21248

Open · wants to merge 26 commits into master

Conversation

@jorgehermo9 (Contributor) commented Sep 9, 2024:

Closes #15765

This PR is not 100% ready on my side and there are likely a few things wrong, but I had a few questions and wanted to know whether the direction seems right, so I would like an initial round of review if possible.

I tested the sink and it seems to be working, but I lack a lot of knowledge about Vector's internals and I'm not sure whether the implementation is okay.

I took a lot of inspiration from the databend and clickhouse sinks, but left a few questions as TODOs in the source. I found this sink a bit different from the others: they use the request_builder machinery and encode the payload to bytes (as most sinks are HTTP-based), but I didn't think that fit well here, since with the sqlx API I should wrap the events in the sqlx::types::Json type, which does all the encoding with serde internally.

If someone wants to test it manually, I used this Vector config:

[sources.demo_logs]
type = "demo_logs"
format = "apache_common"

[transforms.payload]
type = "remap"
inputs = ["demo_logs"]
source = """
.payload = .
"""

[sinks.postgres]
type = "postgres"
inputs = ["payload"]
endpoint = "postgres://postgres:postgres@localhost/test"
table = "test"

Run a Postgres server with podman run -e POSTGRES_PASSWORD=postgres -p 5432:5432 docker.io/postgres

and execute the following with psql -h localhost -U postgres:

CREATE DATABASE test;

then connect to it with \c test, and finally:

CREATE TABLE test (message TEXT, timestamp TIMESTAMP WITH TIME ZONE, payload JSONB);

And then, you will see logs in that table:

[screenshot: rows of the test table populated with demo log events]
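
For a quick check from psql, any query against the table works, for example:

SELECT * FROM test LIMIT 5;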

@jorgehermo9 requested a review from a team as a code owner September 9, 2024 22:33
@github-actions bot added the domain: sinks and domain: ci labels Sep 9, 2024
}

/// Configuration for the `postgres` sink.
#[configurable_component(sink("postgres", "Deliver log data to a PostgreSQL database."))]
@jorgehermo9 (Contributor Author):

Should I call this sink postgres or postgres_logs?

@pront (Member):

Hm, good question. Could this evolve to handle both logs and metrics in the future?

@jorgehermo9 (Contributor Author) commented Oct 15, 2024:

I'm wondering whether this could evolve to integrate with other Postgres flavours such as TimescaleDB, which is oriented toward time series.

My thoughts on this: #21308 (comment)

Timescaledb tracking issue: #939

@jorgehermo9 (Contributor Author):

I think it would be interesting to change the input of this sink (Input::log()) to allow for metrics and traces too. I'll give it a try.

@jorgehermo9 requested review from a team as code owners September 9, 2024 22:38
@github-actions bot added the domain: external docs label Sep 9, 2024
// TODO: If a single item of the batch fails, the whole batch will fail its insert.
// Is this intended behaviour?
sqlx::query(&format!(
"INSERT INTO {table} SELECT * FROM jsonb_populate_recordset(NULL::{table}, $1)"
@jorgehermo9 (Contributor Author):

The table configuration can be a victim of SQL injection, but in my opinion we shouldn't try to prevent that kind of attack at this level; the user should be responsible for ensuring that there is no SQL injection in the config. The databend sink works like this.

@pront (Member):

Hm, I suppose sqlx does not support parameterized table names? Does the query builder help here? If neither works, we can leave it as is.

@jorgehermo9 (Contributor Author):

I don't think that could help in this case. See this statement about sqlx's query builder.

And we cannot use a variable bind ($ syntax) in Postgres for table names, as prepared statements are bound to a query plan, which cannot change if the target table changes.

I think this is the best way to do it; sqlx does not check for SQL injection.
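
To illustrate in plain SQL, using the test table from the example config above: values, including the whole JSONB batch, can be bound as parameters, but an identifier cannot. The first statement below is essentially what this sink prepares; the second is rejected at parse time:

-- Binding the JSONB batch as a value works:
PREPARE ins (jsonb) AS
    INSERT INTO test SELECT * FROM jsonb_populate_recordset(NULL::test, $1);
EXECUTE ins ('[{"message": "GET /", "payload": {}}]');

-- A table name cannot be a parameter; this fails with a syntax error at $1:
PREPARE bad (text) AS INSERT INTO $1 SELECT 1;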

@pront (Member) commented Jan 3, 2025:

Got it. If it's just the table config param currently, we can add a validation check when building the config. This becomes more complicated if table becomes a template (per comment).

Edit: In that case we might be able to validate per event, or we could just add a notice in the docs to communicate that this sink isn't trying to be smart about security.

pub endpoint: String,

/// The table that data is inserted into.
pub table: String,
@jorgehermo9 (Contributor Author):

Should we make the table templatable, like the clickhouse sink? That would complicate the code a little bit (with KeyPartitioner and so on). If yes, I would like some guidance about it if possible.

@pront (Member):

It is a nice feature but not a must-have; we can do this incrementally. Once we finalize the rest of the comments, we can come back to this if you are motivated to add the feature.

@jorgehermo9 (Contributor Author):

Okay!

@aliciascott (Contributor) left a comment:

good for docs

@pront self-assigned this Oct 15, 2024
@pront (Member) left a comment:

Hi @jorgehermo9, thank you for this sizable contribution! On a high level, it looks great. I did a first review and left some comments. Don't hesitate to follow up, happy to discuss details.

scripts/integration/postgres/test.yaml (outdated thread; resolved)

/// The table that data is inserted into.
pub table: String,

/// The postgres connection pool size.
@pront (Member):

Here it would be useful to explain what this pool is used for. Maybe a link to relevant docs would suffice.

@jorgehermo9 (Contributor Author):

Done in 21f3ad3. Do you think that's enough?

I also have doubts about using a connection pool. Can the event batches be executed in parallel for this sink? I don't know the specifics of Vector's internals...

.batched(self.batch_settings.as_byte_size_config())

If the batches of events can be processed in parallel, then a connection pool is beneficial. If the batches are processed sequentially, we should use a single Postgres connection instead, as a pool would not make sense.
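
One low-tech way to observe this, for instance, is to watch the server-side connection count for the sink's database (test in the example config) while Vector is under load:

SELECT count(*) FROM pg_stat_activity WHERE datname = 'test';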

src/sinks/postgres/integration_tests.rs (two outdated threads; resolved)
@jorgehermo9 (Contributor Author):

Thank you very much for the review @pront! I'm kinda busy these days but I will revisit this as soon as I can :)

@pront (Member) commented Nov 25, 2024:

There are a few failing checks. Also, let's add a new postgres semantic scope in https://github.com/vectordotdev/vector/blob/master/.github/workflows/semantic.yml. I will review once these are addressed. Thank you!

@pront changed the title from feat(sink): Add postgres sink to feat(postgres sink): Add postgres sink Nov 25, 2024
@jorgehermo9 (Contributor Author) commented Nov 25, 2024:

I will work on this PR these days; I'll ping you whenever it is ready for another round. Thank you so much @pront!

@github-actions bot added the domain: sources label Dec 6, 2024
@jorgehermo9 (Contributor Author) commented Dec 22, 2024:

Hi @pront. I'm sorry for the delay. I added a bunch of new integration tests and I'm ready for another review round.

I still have a few doubts left, marked as // TODO in the code.

I also wonder if I should enable metrics & traces inputs in this sink (see this thread: #21248 (comment)). If so, I will add a few more tests similar to the insert_multiple_events test I added in this PR. Metrics can be useful for sinks like timescale #939, but I don't know if the current postgres sink is compatible with TimescaleDB. I think with TimescaleDB we should use this same sink but with another flag (see this comment), so we should really think ahead about whether this sink will be used for more than just logs.

There are a few open comment threads; feel free to resolve them if you think my answer is enough. I didn't resolve them myself just in case my answer is not clear.


const POSTGRES_SINK_TAGS: [&str; 2] = ["endpoint", "protocol"];

fn pg_url() -> String {
@jorgehermo9 (Contributor Author):

Simplified this a little bit.

I see that #21248 (comment) stated that this should be moved to test_utils.rs, but as this is simplified now, I don't know if it's worth it.

And please note that the PG_URL env var is now required. Before, it was optional and fell back to localhost if not set.

Take a look at the postgresql_metrics.rs diff:
[screenshot of the postgresql_metrics.rs diff]

I prefer it to be simplified and not do any kind of magic fallback, but I'm open to changing it to what you suggest, of course.

@pront (Member):

Nit: Hmm, I am a bit skeptical of changing src/sources/postgresql_metrics.rs behavior in this PR, even if it's just for tests. For example, when I grep for pg_socket_url(), I don't see it used in multiple places. It's fine to have a different pg_url() in this file.

@pront (Member) commented Jan 3, 2025:

> Hi @pront. I'm sorry for the delay. I added a bunch of new integration tests and I'm ready for another review round.
> I still have a few doubts left, marked as // TODO in the code.

Hi @jorgehermo9, I will take another look now. Thanks!

> I also wonder if I should enable metrics & traces inputs in this sink (see this thread: #21248 (comment)). If so, I will add a few more tests similar to the insert_multiple_events test I added in this PR.

Yes, I was going to suggest exactly this. Add more inputs and observe how it works.

> Metrics can be useful for sinks like timescale #939, but I don't know if the current postgres sink is compatible with TimescaleDB. I think with TimescaleDB we should use this same sink but with another flag (see this comment), so we should really think ahead about whether this sink will be used for more than just logs.

Are we talking about timescaledb and risingwave? Are there more? I am not familiar with all the nuances and I don't think we can plan for all extensions in advance. The proposed flavour (or flavor) property can be added incrementally without breaking the behavior of the sink.
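
For what it's worth, TimescaleDB layers on plain Postgres, so the example table from this PR could in principle become a hypertable with no sink changes. A sketch, assuming the extension is installed (note that create_hypertable also wants the time column declared NOT NULL, so the example table would need that tweak):

CREATE EXTENSION IF NOT EXISTS timescaledb;
SELECT create_hypertable('test', 'timestamp');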

@pront added the meta: awaiting author label Jan 13, 2025
Labels
domain: ci - Anything related to Vector's CI environment
domain: external docs - Anything related to Vector's external, public documentation
domain: sinks - Anything related to Vector's sinks
domain: sources - Anything related to Vector's sources
meta: awaiting author - Pull requests that are awaiting their author
Development

Successfully merging this pull request may close these issues.

New sink: postgres
3 participants