Don't want to waste time figuring out whether a given LLM can write SQL for your use case? This app answers that question.
Shortcomings of current benchmarks (BIRD, Spider, WikiSQL, SParC, etc.):
These benchmarks largely lack "complicated" queries - for example, queries involving window functions, CTEs, or pre-aggregations. The natural-language prompts they use are rarely complicated, industry-specific, or realistic. And none of them measure how well a generative model performs against a messy, unorganized database with duplicate tables, unclear column names, and bad data.
This SQL Benchmark consists of ~200 Question/Query samples run against a database that tracks the use and creation of coupons across events, newsletters, and stores. The SQL queries were designed to test window functions, CTEs, pre-aggregations, regex, multiple joins, NULL-value handling, and JSON formatting, among others. We use our own custom version of Execution Accuracy, which tolerates extra columns, differing column names, and differing row and column order. For reference, the Execution Accuracy used in BIRD compares rows irrespective of row order or repeating rows. As a baseline, we tested the performance of OpenAI's GPT-4 and Anthropic's Claude 3.
Sign up for updates on Twitter
- Fanout / fact table - questions whose correct answer must handle join fanout (row multiplication across one-to-many joins)
- Non-trivial joins (inequality, multiple fields, non-obvious id, multiple similar ids, many-to-many, self-joins)
- Period-over-period requests
- Window functions
- JSON and XML Data Handling
- Spatial and Geographical Data Analysis (see the PostGIS sketch after this list)
- Date functions: time zone conversions, time-based joins, etc.
- Complex string matching or conversion
- Activity schema handling
- Complex aggregates: HAVING, NULL-safety, group by calculated fields
- CTE usage / recursive queries
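Since no spatial example appears later in this document, here is a minimal sketch of the kind of query that category covers, assuming PostGIS is installed and a hypothetical stores table with a GEOGRAPHY(Point, 4326) location column:
SELECT s.store_name
FROM stores s
WHERE ST_DWithin(
    s.location,                                  -- hypothetical GEOGRAPHY(Point, 4326) column
    ST_MakePoint(-73.9857, 40.7484)::geography,  -- reference point given as (longitude, latitude)
    10000                                        -- radius in meters (10 km) for geography inputs
);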
Question: Calculate the total sales revenue by product category.
Incorrect SQL Answer:
SELECT c.category_name, SUM(o.amount) AS total_sales
FROM categories c
JOIN product_categories pc ON c.id = pc.category_id
JOIN products p ON pc.product_id = p.id
JOIN orders o ON p.id = o.product_id
GROUP BY c.category_name;
Explanation: This query might double-count amount values for orders if products belong to multiple categories, inflating total sales figures.
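If no fact table is available, one way to neutralize the fanout is to pin each product to a single category before joining; a minimal sketch using Postgres's DISTINCT ON against the same hypothetical schema (which category wins is a business decision, here simply the lowest category_id):
WITH primary_category AS (
    -- keep exactly one (product_id, category_id) pair per product so each order is counted once
    SELECT DISTINCT ON (product_id) product_id, category_id
    FROM product_categories
    ORDER BY product_id, category_id
)
SELECT c.category_name, SUM(o.amount) AS total_sales
FROM categories c
JOIN primary_category pc ON c.id = pc.category_id
JOIN orders o ON o.product_id = pc.product_id
GROUP BY c.category_name;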
Question: Calculate the total sales revenue by product category using a fact table to correctly aggregate sales data without duplication.
SELECT c.category_name, SUM(f.sales_amount) AS total_sales
FROM category_dim c
JOIN sales_fact f ON c.category_id = f.category_id
GROUP BY c.category_name;
Explanation: This query accurately calculates total sales by category, using a sales fact table that avoids duplication by design.
Question: Determine the total number of products sold for each product, using a fact table to correctly handle sales data aggregation.
SELECT p.product_name, SUM(f.quantity_sold) AS total_quantity
FROM product_dim p
JOIN sales_fact f ON p.product_id = f.product_id
GROUP BY p.product_name;
Explanation: This query uses a fact table that records the quantity sold for each product, ensuring accurate aggregation without the risk of duplication due to fanout.
Question: How can you calculate the month-over-month sales growth percentage for each product in a Postgres database?
Version 1:
SELECT
    product_id,
    EXTRACT(YEAR FROM sale_date) AS sale_year,
    EXTRACT(MONTH FROM sale_date) AS sale_month,
    SUM(sales_amount) AS total_sales,
    (SUM(sales_amount) - LAG(SUM(sales_amount)) OVER (PARTITION BY product_id ORDER BY EXTRACT(YEAR FROM sale_date), EXTRACT(MONTH FROM sale_date)))
        / LAG(SUM(sales_amount)) OVER (PARTITION BY product_id ORDER BY EXTRACT(YEAR FROM sale_date), EXTRACT(MONTH FROM sale_date)) * 100 AS mom_growth
FROM sales
GROUP BY product_id, sale_year, sale_month
ORDER BY product_id, sale_year, sale_month;
Explanation: This query calculates the month-over-month sales growth for each product. It uses the LAG window function to access the previous row's sales amount, allowing for the calculation of the growth percentage.
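One caveat for Version 1: the division fails outright if the previous month's total is exactly 0. A common guard is to wrap the divisor in NULLIF(divisor, 0), which turns a zero divisor into NULL so the growth comes back as NULL instead of an error; a self-contained illustration:
-- NULLIF(x, 0) returns NULL when x = 0, so the division yields NULL rather than an error
SELECT (150 - 0) / NULLIF(0, 0) AS mom_growth;  -- NULL, not "division by zero"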
Version 2:
WITH monthly_sales AS (
    SELECT
        product_id,
        DATE_TRUNC('month', sale_date) AS month,
        SUM(sales_amount) AS total_sales
    FROM sales
    GROUP BY product_id, month
),
previous_month_sales AS (
    SELECT
        a.product_id,
        a.month,
        a.total_sales,
        b.total_sales AS prev_month_sales
    FROM monthly_sales a
    LEFT JOIN monthly_sales b
        ON a.product_id = b.product_id
        AND a.month = (b.month + INTERVAL '1 month')
)
SELECT
    product_id,
    month,
    total_sales,
    prev_month_sales,
    CASE
        WHEN prev_month_sales IS NULL OR prev_month_sales = 0 THEN NULL
        ELSE (total_sales - prev_month_sales) * 100.0 / prev_month_sales
    END AS mom_growth_percentage
FROM previous_month_sales
ORDER BY product_id, month;
Explanation:
- The monthly_sales CTE (Common Table Expression) calculates the total sales for each product per month.
- The previous_month_sales CTE then performs a self-join on monthly_sales to match each month's sales with the previous month's sales for the same product. We achieve the month-over-month comparison by using an interval of 1 month in the join condition.
- Finally, we select the results, including the calculated month-over-month growth percentage. The CASE statement handles months with no prior-month data (e.g., the first month in the dataset) as well as months where the prior total was zero, avoiding division by zero.
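Version 2's self-join can also be collapsed into a LAG over the monthly_sales CTE; a shorter sketch, assuming the same sales table:
WITH monthly_sales AS (
    SELECT
        product_id,
        DATE_TRUNC('month', sale_date) AS month,
        SUM(sales_amount) AS total_sales
    FROM sales
    GROUP BY product_id, month
)
SELECT
    product_id,
    month,
    total_sales,
    (total_sales - LAG(total_sales) OVER w) * 100.0
        / NULLIF(LAG(total_sales) OVER w, 0) AS mom_growth_percentage
FROM monthly_sales
WINDOW w AS (PARTITION BY product_id ORDER BY month)
ORDER BY product_id, month;
Note the trade-off: LAG compares against the previous row, so a product with a missing month would be compared against a non-adjacent month, whereas the interval join above correctly yields NULL for gaps.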
Question: How do you calculate the duration of each session for web visits, assuming a session ends when a user is inactive for more than 30 minutes?
WITH ranked_visits AS (
    SELECT
        user_id,
        visit_timestamp,
        LAG(visit_timestamp) OVER (PARTITION BY user_id ORDER BY visit_timestamp) AS previous_visit_timestamp,
        EXTRACT(EPOCH FROM (visit_timestamp - LAG(visit_timestamp) OVER (PARTITION BY user_id ORDER BY visit_timestamp))) / 60 AS minutes_since_last_visit
    FROM web_visits
),
sessions AS (
    SELECT
        user_id,
        visit_timestamp,
        SUM(CASE WHEN minutes_since_last_visit > 30 OR minutes_since_last_visit IS NULL THEN 1 ELSE 0 END)
            OVER (PARTITION BY user_id ORDER BY visit_timestamp) AS session_id
    FROM ranked_visits
)
SELECT
    user_id,
    session_id,
    MIN(visit_timestamp) AS session_start,
    MAX(visit_timestamp) AS session_end,
    EXTRACT(EPOCH FROM (MAX(visit_timestamp) - MIN(visit_timestamp))) / 60 AS session_duration_minutes
FROM sessions
GROUP BY user_id, session_id
ORDER BY user_id, session_start;
Explanation:
- This query first calculates the time difference between consecutive visits for each user to identify session boundaries (defined as a gap of more than 30 minutes). It then assigns a session ID to each visit, incrementing the ID whenever a new session starts. Finally, it aggregates these visits by session to calculate the start, end, and duration of each session.
Question: How do you determine the conversion funnel for leads coming from various marketing channels, tracking their progress from lead creation to becoming a qualified lead, then to an opportunity, and finally to a closed sale?
WITH lead_stages AS (
    SELECT
        lead_id,
        channel,
        stage,
        stage_timestamp,
        LEAD(stage_timestamp) OVER (PARTITION BY lead_id ORDER BY stage_timestamp) AS next_stage_timestamp
    FROM marketing_leads
)
SELECT
    channel,
    COUNT(DISTINCT CASE WHEN stage = 'Lead Created' THEN lead_id END) AS leads_created,
    COUNT(DISTINCT CASE WHEN stage = 'Qualified Lead' THEN lead_id END) AS leads_qualified,
    COUNT(DISTINCT CASE WHEN stage = 'Opportunity' THEN lead_id END) AS opportunities_created,
    COUNT(DISTINCT CASE WHEN stage = 'Closed Sale' THEN lead_id END) AS sales_closed
FROM lead_stages
GROUP BY channel
ORDER BY channel;
Explanation:
- This query tracks each lead's progress through the marketing funnel stages. By using the LEAD function, it prepares data that shows when a lead moves to the next stage. It then counts distinct lead IDs at each stage, grouped by marketing channel, to evaluate the effectiveness of each channel in moving leads through the funnel towards a sale.
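To turn the raw counts into funnel rates, divide adjacent stages; a sketch that reuses the same marketing_leads table to compute each channel's overall lead-to-sale rate (intermediate stage-to-stage rates follow the same pattern):
WITH funnel AS (
    SELECT
        channel,
        COUNT(DISTINCT CASE WHEN stage = 'Lead Created' THEN lead_id END) AS leads_created,
        COUNT(DISTINCT CASE WHEN stage = 'Closed Sale' THEN lead_id END) AS sales_closed
    FROM marketing_leads
    GROUP BY channel
)
SELECT
    channel,
    -- NULLIF guards against channels with no created leads
    ROUND(100.0 * sales_closed / NULLIF(leads_created, 0), 1) AS lead_to_sale_pct
FROM funnel
ORDER BY lead_to_sale_pct DESC NULLS LAST;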
Question: How do you extract and aggregate information from a JSON document stored in a PostgreSQL table?
CREATE TABLE customer_orders (
    id SERIAL PRIMARY KEY,
    customer_id INT,
    order_info JSONB
);
-- Example JSONB column data: { "items": [{"product": "Book", "quantity": 2}, {"product": "Pen", "quantity": 5}] }
-- Query to extract product names and their total quantities ordered across all orders
SELECT
    item->>'product' AS product_name,
    SUM((item->>'quantity')::int) AS total_quantity
FROM customer_orders
CROSS JOIN LATERAL jsonb_array_elements(order_info->'items') AS item
GROUP BY product_name;
Explanation:
- This query expands the JSON array in order_info->'items' into one row per element using jsonb_array_elements in a CROSS JOIN LATERAL, then extracts the product and quantity fields from each element, casts quantity to an integer, and aggregates the total quantity ordered for each product across all orders. (Calling jsonb_array_elements directly inside SUM() would fail: aggregate function calls cannot contain set-returning function calls.)
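As a related trick, the JSONB containment operator @> can filter orders that contain a given item without unnesting at all; a sketch against the same table:
-- Find orders whose items array contains an object with "product": "Book"
SELECT id, customer_id
FROM customer_orders
WHERE order_info @> '{"items": [{"product": "Book"}]}';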
Question: How do you account for Daylight Saving Time when converting between time zones?
-- Example of converting a stored TIMESTAMP WITH TIME ZONE to a zone that observes DST (e.g., US Eastern)
SELECT
    event_time AT TIME ZONE 'America/New_York' AS event_time_et
FROM global_events;
Explanation:
- PostgreSQL applies Daylight Saving Time rules automatically for region names like 'America/New_York' in AT TIME ZONE conversions, so the result reflects the correct local wall-clock time, including any DST shift. Because event_time is a TIMESTAMP WITH TIME ZONE (see the table definition below), a single AT TIME ZONE suffices; chaining AT TIME ZONE 'UTC' first is only needed for a TIMESTAMP WITHOUT TIME ZONE column that stores UTC values.
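To see the DST behavior concretely, here is a self-contained example around the US spring-forward (March 10, 2024, when local clocks jump from 02:00 to 03:00):
SELECT
    TIMESTAMPTZ '2024-03-10 06:30:00+00' AT TIME ZONE 'America/New_York' AS before_shift,  -- 2024-03-10 01:30:00 (EST, UTC-5)
    TIMESTAMPTZ '2024-03-10 07:30:00+00' AT TIME ZONE 'America/New_York' AS after_shift;   -- 2024-03-10 03:30:00 (EDT, UTC-4)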
Question: How do you perform a join between two tables based on time intervals, such as finding records within the same hour?
CREATE TABLE user_logins (
    user_id INT,
    login_time TIMESTAMP WITHOUT TIME ZONE
);
CREATE TABLE user_actions (
    user_id INT,
    action_time TIMESTAMP WITHOUT TIME ZONE,
    action VARCHAR(255)
);
-- Example query to find actions within an hour of login
SELECT
    l.user_id,
    l.login_time,
    a.action,
    a.action_time
FROM user_logins l
JOIN user_actions a
    ON l.user_id = a.user_id
    AND a.action_time BETWEEN l.login_time AND l.login_time + INTERVAL '1 hour';
Explanation:
- This query joins the user_logins and user_actions tables on user_id, using a condition that matches actions (action_time) occurring within an hour (INTERVAL '1 hour') after the login time (login_time).
Question: How do you convert stored timestamps from one time zone to another?
CREATE TABLE global_events (
    id SERIAL PRIMARY KEY,
    event_name VARCHAR(255),
    event_time TIMESTAMP WITH TIME ZONE
);
-- Example query to convert stored event times to US Eastern Time
SELECT
    id,
    event_name,
    event_time AT TIME ZONE 'America/New_York' AS event_time_et
FROM global_events;
Explanation:
- The AT TIME ZONE clause converts event_time to US Eastern Time ('America/New_York' covers both EST and EDT). Because the column is TIMESTAMP WITH TIME ZONE, PostgreSQL stores the instant internally normalized to UTC, and a single AT TIME ZONE returns the local wall-clock time. The double-conversion pattern (AT TIME ZONE 'UTC' followed by the target zone) applies only when the column is a TIMESTAMP WITHOUT TIME ZONE holding UTC values.
-- Extract the domain from a URL (BigQuery syntax)
SELECT
    REGEXP_EXTRACT(url, r'https?://([^/]+)/') AS domain
FROM webpage_visits;
-- Keep only rows whose product_id is purely alphanumeric (BigQuery syntax)
SELECT *
FROM inventory
WHERE REGEXP_CONTAINS(product_id, r'^[A-Za-z0-9]+$');
-- Case-insensitive search for 'eco-friendly' in the product name (Postgres)
SELECT *
FROM products
WHERE name ILIKE '%eco-friendly%';
-- Return the email only when it matches a basic email pattern, otherwise NULL (Postgres)
SELECT
    SUBSTRING(email FROM '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$') AS valid_email
FROM users;
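String conversion belongs to this category too; a small sketch using Postgres's REGEXP_REPLACE to strip non-digits from a hypothetical phone column on the same users table (the 'g' flag replaces every match, not just the first):
-- phone is an assumed column; '[^0-9]' matches each non-digit character
SELECT REGEXP_REPLACE(phone, '[^0-9]', '', 'g') AS digits_only
FROM users;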
CREATE TABLE user_activities (
    id STRING NOT NULL,
    user_id INT64 NOT NULL,
    activity_type STRING NOT NULL,
    activity_details STRUCT<
        detail1 STRING,
        detail2 INT64,
        -- assumed array field, referenced by the query below; add further fields as needed
        someArray ARRAY<STRING>
    >,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
) PARTITION BY DATE(created_at);
SELECT user_id, activity_type, ARRAY_LENGTH(activity_details.someArray) AS detailCount
FROM user_activities
WHERE activity_type = 'view';
Explanation:
- Benefits for BigQuery: BigQuery's approach is well-suited for analytics at scale. The use of partitioning (PARTITION BY DATE(created_at)) enhances query performance and cost efficiency for time-based queries. Structured data types (STRUCT, ARRAY) allow for complex data organization within each activity record.
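Array fields inside the STRUCT are expanded with UNNEST; a sketch that produces one row per element of the assumed someArray field:
SELECT user_id, detail
FROM user_activities,
    UNNEST(activity_details.someArray) AS detail  -- comma join correlates UNNEST with each row
WHERE activity_type = 'view';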
CREATE TABLE user_activities (
    id INT AUTOINCREMENT PRIMARY KEY,  -- Snowflake requires a data type alongside AUTOINCREMENT
    user_id INT NOT NULL,
    activity_type STRING NOT NULL,
    activity_details VARIANT,
    created_at TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
);
SELECT activity_type, COUNT(*)
FROM user_activities
GROUP BY activity_type;
Explanation:
- Benefits for Snowflake: Snowflake's VARIANT type is ideal for activity_details, accommodating various structured and semi-structured formats. Snowflake's architecture also allows handling large volumes of data efficiently.
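To query inside the VARIANT column, Snowflake pairs path notation with LATERAL FLATTEN; a sketch assuming activity_details holds an object with an items array:
SELECT
    u.user_id,
    f.value:product::STRING AS product  -- each flattened element exposed as f.value
FROM user_activities u,
    LATERAL FLATTEN(input => u.activity_details:items) f;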
Question: Identify products with an average rating above 4.5, where more than 10 ratings have been submitted.
SELECT
    product_id,
    AVG(rating) AS avg_rating,
    COUNT(rating) AS total_ratings
FROM product_reviews
GROUP BY product_id
HAVING AVG(rating) > 4.5 AND COUNT(rating) > 10;
Scenario: Calculate the total sales for each product, ensuring that sales with null values are treated as 0.
SELECT
    product_id,
    SUM(COALESCE(sales_amount, 0)) AS total_sales
FROM sales_data
GROUP BY product_id;
Scenario: Group sales data by quarter and year, then identify quarters with total sales exceeding $100,000.
SELECT
    EXTRACT(YEAR FROM sale_date) AS sale_year,
    EXTRACT(QUARTER FROM sale_date) AS sale_quarter,
    SUM(sales_amount) AS total_sales
FROM sales_data
GROUP BY sale_year, sale_quarter
HAVING SUM(sales_amount) > 100000;
Scenario: Find the average salary in departments that have more than 5 employees.
SELECT
    department_id,
    AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
HAVING COUNT(*) > 5;
Question: How can you find the average salary of employees in each department and list departments with an average salary above a certain threshold, say $50,000?
WITH department_salary AS (
    SELECT
        department_id,
        AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
)
SELECT
    department_id,
    avg_salary
FROM department_salary
WHERE avg_salary > 50000;
Question: How can you retrieve a list of employees and their managers, assuming the employees table has a self-referencing manager_id column?
WITH RECURSIVE employee_hierarchy AS (
    SELECT
        employee_id,
        name,
        manager_id
    FROM employees
    WHERE manager_id IS NULL -- Assuming top-level managers have a NULL manager_id
    UNION ALL
    SELECT
        e.employee_id,
        e.name,
        e.manager_id
    FROM employees e
    INNER JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT
    eh.name AS employee_name,
    m.name AS manager_name
FROM employee_hierarchy eh
LEFT JOIN employees m ON eh.manager_id = m.employee_id;
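A common extension of the same recursive CTE is to carry each employee's depth in the reporting chain, something a flat self-join cannot provide; a sketch on the same employees table:
WITH RECURSIVE employee_hierarchy AS (
    SELECT
        employee_id,
        name,
        manager_id,
        1 AS depth  -- top-level managers sit at depth 1
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    SELECT
        e.employee_id,
        e.name,
        e.manager_id,
        eh.depth + 1  -- each step down the chain increments the depth
    FROM employees e
    INNER JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT name, depth
FROM employee_hierarchy
ORDER BY depth, name;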
1. Ecommerce:
- Products
- Orders
- Order Items
- Customers
- Cart Items
- Product Reviews
- Product Categories
- Payment Transactions
- Shipping Methods
- Inventory
2. Fintech and Financial Services:
- Accounts
- Transactions
- Loans
- Payment Methods
- Investment Portfolios
- Credit Cards
- Customer Profiles
- Expense Reports
- Account Balances
- Currency Exchange Rates
3. Digital Marketing and Website Analytics:
- Page Views
- User Sessions
- Events
- Conversions
- Ad Clicks
- Sources
- Landing Pages
- Geographies
- Campaign Performance
- Email Campaigns
4. Marketplace (General and Niche):
- Listings
- Users
- Transactions
- Reviews
- Favorites
- Categories
- Messages
- Bids
- Shipping Options
- Payment Logs
5. SaaS (Software as a Service):
- Users
- Subscriptions
- Activity Logs
- Features Accessed
- Invoices
- User Settings
- Permissions
- Service Integrations
- Application Errors
- Billing Info
6. Delivery and Logistics:
- Orders
- Deliveries
- Routes
- Vehicles
- Drivers
- Tracking Events
- Warehouses
- Inventory Transfers
- Delivery Exceptions
- Customer Addresses
7. Content and Digital Media:
- Articles
- Users
- Comments
- Categories
- Media Assets
- Subscriptions
- Authors
- Views Stats
- Advertising Slots
- Likes/Favorites
8. Social Networks and Platforms:
- Users
- Posts
- Comments
- Friendships
- Groups
- Messages
- Notifications
- Page Follows
- User Activities
- Privacy Settings
9. EdTech and Online Education:
- Courses
- Enrollments
- Students
- Assignments
- Quiz Attempts
- Discussion Posts
- Teacher Profiles
- Course Materials
- Progress Tracks
- Certificates Earned
10. Automotive and Transportation:
- Vehicles
- Reservations
- Fleet Maintenance Records
- Driver Logs
- Trip Histories
- Fuel Logs
- Parts Inventory
- Customer Feedback
- Insurance Records
- Route Optimizations
11. Real Estate and Property Management:
- Properties
- Leases
- Tenants
- Maintenance Requests
- Property Listings
- Viewings Appointments
- Financial Transactions
- Owners
- Building Amenities
- Insurance Policies
12. Gaming and eSports:
- Users
- Games
- Matches
- Player Stats
- Teams
- Tournaments
- Leaderboards
- In-game Purchases
- Game Servers
- Event Logs
13. Analytics and Data Services:
- Datasets
- Data Queries
- User Reports
- Analytics Models
- Execution Logs
- Data Sources
- Visualization Settings
- User Preferences
- Scheduled Tasks
- Access Permissions
14. Insurance Tech:
- Policies
- Claims
- Policy Holders
- Risk Assessments
- Insurance Products
- Agent Profiles
- Coverage Options
- Payment Histories
- Claim Adjustments
- Customer Inquiries
15. Charities and Non-profits:
- Donors
- Donations
- Projects
- Beneficiaries
- Volunteer Profiles
- Events
- Fundraising Campaigns
- Grants
- Financial Reports
- Feedback Surveys