Add pgtier blog post #654

Open · wants to merge 4 commits into main
Binary file added public/community-tiering.png
8 changes: 8 additions & 0 deletions src/blogAuthors.ts
@@ -32,6 +32,7 @@ export const authorsEnum = z.array(
'nick',
'ash',
'amy',
'shahnawaz',
])
.default('ryw'),
);
@@ -183,4 +184,11 @@ export const AUTHORS: Record<string, Author> = {
image_url: 'https://github.com/nhudson.png',
email: '[email protected]',
},
shahnawaz: {
name: 'Shahnawaz',
title: 'Senior Software Engineer',
url: 'https://github.com/shhnwz',
image_url: 'https://github.com/shhnwz.png',
email: '[email protected]',
},
};
101 changes: 101 additions & 0 deletions src/content/blog/2024-09-20-community-tiering/index.mdx
@@ -0,0 +1,101 @@
---
slug: open-source-tiering
title: 'Open and Accessible Data Tiering'
authors: [adam, shahnawaz]
description: |
Open source data tiering project
tags: [postgres, workloads]
date: 2024-09-20T09:00
image: './community-tiering.png'
planetPostgres: false
---


When evaluating the value of data, it's crucial to look beyond its raw size and consider the resources required to manage it across its lifecycle. As data grows, users often face rising storage costs, turning an initially appealing subscription into a burden. One answer to this problem focuses not so much on how much data you have, but on how and where it is stored.

At Tembo, we heard this challenge echoed repeatedly from both internal teams and the community. To address it, we built and open-sourced `pg_tier`, a Postgres extension that simplifies integration with AWS S3 and other object stores. With `pg_tier`, users can move Postgres tables to S3 while retaining the ability to query them as if they were still in Postgres.

## Data lifecycle management

Before jumping into functionality, it's worth understanding the context within which this extension operates: the fundamentals of data lifecycle management.

As data progresses through the stages of its lifecycle, its access patterns change with it. While it is being ingested and actively queried, data is considered to be at the "hot" stage. Although the details vary by workload, it's safe to assume that, as data ages, it is accessed less and less frequently. Metaphorically, the data continues to cool and eventually finds itself in "cold" storage.

In addition to access patterns, organizations such as banks often adhere to governance requirements that enforce a data retention period. For these financial institutions, it simply wouldn't make sense to keep 7- to 10-year-old data front and center when it can be stored at much lower cost.
Moreover, these stages aren't defined simply by where the data is stored, but by a combination of its location and its format.

A good way to visualize these stages would be to break them down as follows:

### Postgres Database (Hot Storage)

In the initial stage, data is frequently accessed and queried. This is where data is most "active," requiring quick retrieval and processing.

### Aged Data Stage (Cool Storage)

As data becomes less frequently accessed over time, it eventually reaches a point where moving it out of Postgres and into an object store becomes more practical. At this stage, the data still needs to be readily accessible, even though it's not accessed as often. It's at this (and the following) stage that `pg_tier` offers the most value: it moves the data to an object store, stores it in the Parquet file format, and generates the foreign data wrapper metadata needed to keep it queryable.

### Archival Stage (Cold Storage)

In the final stage, data is rarely accessed but retained for long-term storage and compliance. Object stores have evolved various tiers, but even the lowest tiers still bring unnecessary costs when you need to access your data. `pg_tier` addresses this by providing users with bottomless storage and a low cost of data access.

## Everyone can have bottomless storage on Postgres

The need for scalable and affordable storage is clear, and this is where Postgres users can benefit significantly. Engineers often archive data by copying it to S3 and deleting it from Postgres (a rough sketch of that manual approach is shown below), but querying the archived data presents challenges. Tools like [DuckDB](https://duckdb.org/), [Apache Pinot](https://pinot.apache.org/), or [ClickHouse](https://clickhouse.com/) offer solutions, but users typically need to build custom pipelines to move data to S3 and integrate it into these systems. The goal of `pg_tier` is to make this a standardized process, across all object storage formats and cloud providers, with a first-class experience on Postgres.
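
For context, a hand-rolled version of that archive step often looks something like the sketch below. This is purely illustrative: the `events` table, bucket path, and retention window are hypothetical, and `COPY ... TO PROGRAM` runs on the database server, so it requires elevated privileges and the AWS CLI installed there.

```sql
-- Hypothetical manual archive: export old rows to S3 as CSV, then delete them from Postgres.
-- Once exported this way, the data is no longer queryable from Postgres without extra tooling.
COPY (
    SELECT * FROM events WHERE created_at < now() - interval '1 year'
) TO PROGRAM 'aws s3 cp - s3://my-archive-bucket/events/archive.csv' WITH (FORMAT csv);

DELETE FROM events WHERE created_at < now() - interval '1 year';
```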

## Enhancing `parquet_s3_fdw` for a touch-free experience

`pg_tier` builds on the established `parquet_s3_fdw` project, which enables the creation of a foreign data wrapper around S3 data, allowing users to query it as if it were still in Postgres. This integration eliminates the need for manual AWS credential configuration, offering a streamlined experience for working with S3 data directly from Postgres.
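
To get a feel for what that automation saves you, a manual `parquet_s3_fdw` setup looks roughly like the sketch below. The server name, column list, and option names are assumptions (the `dirname` option mirrors the tiered table shown later in this post), so treat this as an outline rather than exact syntax for your version of the extension.

```sql
-- Rough outline of the foreign data wrapper plumbing that pg_tier sets up for you.
-- Names and options are illustrative; consult parquet_s3_fdw's documentation for exact syntax.
CREATE EXTENSION parquet_s3_fdw;

CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw;

-- AWS credentials are supplied through a user mapping.
CREATE USER MAPPING FOR CURRENT_USER SERVER parquet_s3_srv
    OPTIONS (user 'AWS_ACCESS_KEY', password 'AWS_SECRET_KEY');

-- Point a foreign table at the Parquet files in the bucket.
CREATE FOREIGN TABLE people_archive (
    name text,
    age  numeric
) SERVER parquet_s3_srv
  OPTIONS (dirname 's3://my-storage-bucket/public_people/');
```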

## Using `pg_tier` for Data Tiering
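
The steps below assume `pg_tier` (and the foreign data wrapper it builds on) is already installed on your Postgres instance and enabled in the database, along these lines; the exact command may depend on how you installed it:

```sql
-- Assumed prerequisite: the pg_tier extension is installed on the server.
-- CASCADE also creates any extensions pg_tier depends on.
CREATE EXTENSION IF NOT EXISTS pg_tier CASCADE;
```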

### Create a Table and Insert Data

Start by creating a table and populating it with some data:

```sql
CREATE TABLE people (
    name text NOT NULL,
    age numeric NOT NULL
);
INSERT INTO people VALUES ('Alice', 34), ('Bob', 45), ('Charlie', 56);
```

### Set Up Your S3 Credentials and Bucket

```sql
SELECT tier.set_tier_config(
    'my-storage-bucket',
    'AWS_ACCESS_KEY',
    'AWS_SECRET_KEY',
    'AWS_REGION'
);
```

### Tier the Table to S3

After setting up your S3 configuration, you can tier the table by moving it to S3 and converting it into a foreign table:

```sql
SELECT tier.table('people');
```

This command moves the `people` table to S3 and converts it into a foreign table that Postgres can still query.

### Check the Table's Foreign Status

Once tiered, the table becomes a foreign table stored in S3. You can verify this by checking its schema:

```text
\d+ people
```

You should see something similar to this output:

```text
Foreign table "public.people"
Column | Type | Collation | Nullable | Default | FDW options | Storage | Stats target | Description
--------+---------+-----------+----------+---------+--------------+----------+--------------+-------------
name | text | | not null | | (key 'true') | extended | |
age | numeric | | not null | | (key 'true') | main | |
Server: pg_tier_s3_srv
FDW options: (dirname 's3://my-storage-bucket/public_people/')
```
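
Even though the data now lives in S3 as Parquet, the table can still be queried with ordinary SQL, for example:

```sql
-- Served transparently from the Parquet data in the S3 bucket.
SELECT name, age FROM people WHERE age > 40 ORDER BY age;
```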