Introduction to Query Performance

on PostgreSQL

Table of Contents

What will cover
Good Performance
Principles
Agenda
Configurations & Tooling
Table Schemas
Queries
Indices
- The Nature
- EXPLAIN & Query Plan
Monitoring
To be Extreme
- No code
- No Database queries
Reference

What will cover

Purpose
- OLTP (Online Transaction Processing)
  - store and query the transactions
- ~~OLAP (Online Analytical Processing)~~
  - report data, for planning and management
Vendors
- Postgres
- ~~MySQL~~
- ~~OracleDB~~
- ~~SQL Server~~

Good Performance

Good Enough Number of Concurrent Connections
No Always-Slow Queries
Minimal Randomly-Slow Queries

It is IMPOSSIBLE to avoid Randomly-Slow Queries

Not all factors we are able to control

networking

connection pools

read-write bottlenecks from storage

Principles

Transactional queries should be FAST (<500ms)
Number of concurrently queries of the same sessions should be less than 5
Queries inside a transaction should be a constant in most cases
Locks that blocks other queries should be avoided (Especially Table-level Locking)
- optimistic lock - concurrent read is allowed; write is not allowed
- pessimistic lock - concurrent read nor write allowed

Advanced Topic: Locks, Latches, Enqueues and Mutex

Agenda

Configurations & Tooling
Table Schemas
Queries
Indices
Monitoring
To be Extreme

Configurations & Tooling

PgBouncer
- Manage connection pool efficiently
- Azure Flexible Server provides the integration of PgBounder as an addon
PgTune
- Optimize Postgres configuration file according to the hardware specification

Performance Insights

All query statistics are stored in schema azure_sys - check here

SELECT
    max(datname),
    query_id,
    query_sql_text,
    SUM(calls) total_calls,
    SUM(total_time) total_time,
    SUM(total_time) * 1.0 / SUM(calls) avg_time,
    SUM(rows) * 1.0 / SUM(calls)  avg_rows
FROM query_store.qs_view q
JOIN pg_database d ON q.db_id = d.oid
GROUP BY query_id, query_sql_text
ORDER BY avg_rows DESC
LIMIT 10 -- change top N based on preferences;

Reference is here

Table Schemas

Normalization
- design the schema so that any change of one field only affect one record
De-normalization
- store module key on tables under the module
- store the number of sum / average result of certain table in its parent table

Queries

How does a database engine process queries

Whether the query plan has cached, if yes, use the query plan
- What developers can do: Make sure the query plans can be re-used as much as possible
Analyze the query
- What developers can do: Be avoid to write complicated queries
Decide the query plan (Relational Algebra)
- The Execution Plan Optimizer of Postgres is Cost-based, NOT Rule-based
- What developers can do: Design the queries that can use the indices
Execute
- What developers can do: Do not fetch too much data per query

Common mistakes

Include non-DML (Data Manipulation Language) in a single Transaction
- Always-Slow Queries on itself, Randomly-Slow Queries
- Include Queries are purely to fetch data
- Include long running function calls NOT RELATED to database
  - send emails
  - send to service bus
  - communicate with external APIs
Get too many records from one query
- Always-Slow queries on itself, Randomly-Slow Queries
- growing tables
  - with OR and UNION
  - module tables without limit

N+1 Queries

Affect Concurrent Connections, Randomly-Slow Queries
while fetching data from the grand-children tables (companies -> projects -> records)

SELECT id, "name" FROM classes WHERE school_code = 'LSC';
-- | id     | name |
-- | 200101 | 1A   |
-- | 200102 | 1B   |
-- | 200103 | 1C   |
-- | 200104 | 1D   |
-- | 200105 | 1E   |
-- | 200106 | 1F   |
-- ...
SELECT id, "name" FROM students WHERE class_id = '200101';
SELECT id, "name" FROM students WHERE class_id = '200102';
SELECT id, "name" FROM students WHERE class_id = '200103';
SELECT id, "name" FROM students WHERE class_id = '200104';
SELECT id, "name" FROM students WHERE class_id = '200105';
SELECT id, "name" FROM students WHERE class_id = '200106';
-- ...

Store huge records (>500KB each)
- Always-Slow Queries on itself, Randomly-Slow Queries
- Unstructured JSON
  - no schema
  - schema with recursive nature
  - schema with dynamic keys
- Store encoded images
- Store binary data
Illegitimate Use of TEXT filter with LIKE
- Always-Slow Queries on itself, Randomly-Slow Queries
- prefix search is acceptable
- infix search is slow
- postfix search is slow

Optimizations with Full-Text Search in PostgreSQL

What should know

Query Plan
Be avoid to query with complicated conditions that requires Full Table Scan on several tables
Be aware of queries on Critical Tables (User Table, Modules Tables)
For each query, indices should be used for tables > 1k records

Indices

The Nature

Indices are a separate store that can help running the queries faster (no clustered indices support in PG)
Good - Make some DDL queries faster while it makes a small portions of DDL queries slower
Bad - Require more storage, DML queries slower

EXPLAIN & Query Plan

IF you are unsure how the query performs, try EXPLAIN or EXPLAIN ANALYZE

How to READ Query Plan

Scan Nodes

Advanced Topic(s): What does "Recheck Cond" in Explain result mean? "Recheck Cond:" line in query plans with a bitmap index scan

The Concept

Selectivity
Cardinality

Index Types

B-tree use by default - O(log n)
Hash Index could be a good choice UNDER a few circumstances - O(1)
- REQUIREMENT: Postgres Version > 11, not us
GIST & GIN for JSON data

Advanced Topic(s): How is it possible for Hash Index not to be faster than Btree for equality lookups? POSTGRESQL HASH INDEX PERFORMANCE

Micro-optimization

Good for frequently used queries
Partial Index
Index on Expressions

-- Can only use inefficient PREFIX search
SELECT * FROM projects WHERE directory LIKE '00123>%';

-- Can use Index on Expressions, IF the length of SEARCH token is FIXED
SELECT * FROM projects WHERE SUBSTRING(directory, 0, 6) = '00123>';

CREATE INDEX IF NOT EXISTS project_directory_idx ON "projects"(SUBSTRING(directory, 0, 6));

Monitoring

Alerts on Database Server Resource Usage
Alerts to slow queries

To be Extreme

No code

No code is the best way to write secure and reliable applications. Write nothing; deploy nowhere.

No Database queries

No queries is the best way to make the database performing good.

Query the replica
Cache the result in application
Do not store static configuration data
- store in the services providing strong consistency
  - S3
  - Blob storage

No Only necessary queries to Master Database is the best way to make the database performing good

Reference

Awesome Postgres
SQL Performance Explained
Use the Index, Luke
Percona PostgreSQL Blog

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

query-performance.md

query-performance.md

Introduction to Query Performance

on PostgreSQL

What will cover

Good Performance

Principles

Agenda

Configurations & Tooling

Table Schemas

Queries

How does a database engine process queries

Common mistakes

What should know

Indices

The Nature

EXPLAIN & Query Plan

How to READ Query Plan

The Concept

Index Types

Micro-optimization

Monitoring

To be Extreme

No code

No Database queries

Reference

Files

query-performance.md

Latest commit

History

query-performance.md

File metadata and controls

Introduction to Query Performance

on PostgreSQL

What will cover

Good Performance

Principles

Agenda

Configurations & Tooling

Table Schemas

Queries

How does a database engine process queries

Common mistakes

What should know

Indices

The Nature

EXPLAIN & Query Plan

How to READ Query Plan

The Concept

Index Types

Micro-optimization

Monitoring

To be Extreme

No code

No Database queries

Reference