#Prepare for SQL interview
-
General: where / group by / order by / left join / right join / inner join / null / not null /insert/ having / distinct / like / union / avg / sum / min / max / substring / REGEXP(RLIKE) / limit+top / CONCAT
-
Arithmetic: ROUND / POWER / ABS / concat / CEIL / FLOOR / LN / MOD = % / SQRT / EXP
-
Date operations: NOW() / CURDATE() / CURTIME() / DATE() / DATE_ADD() / DATE_SUB() / DATEDIFF() / DATE_FORMAT() /EXTRACT
- example 1: SELECT EXTRACT(MONTH FROM '2012-02-01')
- exapmle 2: select last 7 days in mysql
SELECT * FROM table WHERE table.date >= DATE(NOW()) - INTERVAL 7 DAY -- WHERE table.date BETWEEN DATE_SUB(NOW(), INTERVAL 7 DAY) AND NOW()
-
Missingness handling: COALESCE(var1, var2, var3, var4, ..)
- COALESCE() function returns the first non-NULL value of a list, or NULL if there are no non-NULL values.
-
Syntax
window_function (expression) OVER ( [ PARTITION BY expr_list ] [ ORDER BY order_list ][ frame_clause ] )
-
Value: FIRST_VALUE()/LAG()/LAST_VALUE()/LEAD()
-
Aggregate: AVG()/COUNT()/MAX()/MIN()/SUM()
-
Ranking: CUME_DIST()/DENSE_RANK()/NTILE()/PERCENT_RANK()/RANK()/ROW_NUMBER()
-
Special: unbound PROCEEDING and unbound FOLLOWING, RANGE (details+visual )
There are five types of boundaries, which are UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING, CURRENT ROW, PRECEDING, and FOLLOWING. UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING represent the first row of the partition and the last row of the partition, respectively. For the other three types of boundaries, they specify the offset from the position of the current input row and their specific meanings are defined based on the type of the frame. There are two types of frames, ROW frame and RANGE frame.
SQL windows function examples (ref)
- Aggregation function as window function
SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time)
AS running_total
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
- LAG()/LEAD() as window function
SELECT *
FROM (
SELECT start_terminal,
duration_seconds,
duration_seconds -LAG(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS difference
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal, duration_seconds
) sub
WHERE sub.difference IS NOT NULL
LAG() pulls from previous rows and LEAD() pulls from following rows.
- ROW_NUMBER()/RANK() as window function
SELECT start_terminal,
start_time,
duration_seconds,
ROW_NUMBER() OVER (PARTITION BY start_terminal
ORDER BY start_time)
AS row_number
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
RANK() is slightly different from ROW_NUMBER(). If you order by start_time, for example, it might be the case that some terminals have rides with two identical start times. In this case, they are given the same rank, whereas ROW_NUMBER() gives them different numbers. You can use NTILE(*# of buckets*)
, for example, NTILE(100), to identify what percentile a given row falls into.
-
Bounded and Unbounded ROWS (ref)
- Bounded
--ROWS PRECEDING AND FOLLOWING SELECT Year, DepartmentID, Revenue, sum(Revenue) OVER (PARTITION by DepartmentID ORDER BY [YEAR] ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as BeforeAndAfter FROM REVENUE ORDER BY departmentID, year;
- Unbounded
-- ROWS UNBOUNDED FOLLOWING -- http://stevestedman.com/?p=1485 SELECT Year, DepartmentID, Revenue, min(Revenue) OVER (PARTITION by DepartmentID ORDER BY [YEAR] ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as MinRevenueBeyond FROM REVENUE ORDER BY departmentID, year;
-
Create window function alias If you’re planning to write several window functions in to the same query, using the same window, you can create an alias.
SELECT start_terminal,
duration_seconds,
NTILE(4) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS quartile,
NTILE(5) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS quintile,
NTILE(100) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
ORDER BY start_terminal, duration_seconds
can be rewritten as
SELECT start_terminal,
duration_seconds,
NTILE(4) OVER ntile_window AS quartile,
NTILE(5) OVER ntile_window AS quintile,
NTILE(100) OVER ntile_window AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < '2012-01-08'
WINDOW ntile_window AS
(PARTITION BY start_terminal ORDER BY duration_seconds)
ORDER BY start_terminal, duration_seconds
- Spark2.0 SQL guide
- window functions in Spark
- SQL optimization: link1, link2
(1)Q: what is the difference between WHERE and HAVING
(2)Q: what is the difference between GROUP BY and ORDER BY
- A: ORDER BY allows you to sort the result set according to different criteria, such as first sort by name from a-z, then sort by the price highest to lowest. GROUP BY will aggregate records by the specified columns which allows you to perform aggregation functions on non-grouped columns (such as SUM, COUNT, AVG, etc). ORDER BY can be used within the over() clause
(3)Q: how to create a view
- A: The basic CREATE VIEW syntax is as follows: CREATE VIEW view_name AS SELECT column1, column2..... FROM table_name WHERE [condition]; You can include multiple tables in your SELECT statement in very similar way as you use them in normal SQL SELECT query. In SQL, a view is a virtual table based on the result-set of an SQL statement. A view contains rows and columns, just like a real table. The fields in a view are fields from one or more real tables in the database.
(4)Q: What is the difference between aggregation function used together with group by
and aggregation function used as window function
?
- A: A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result. You can’t use window functions and standard aggregations in the same query. More specifically, you can’t include window functions in a
GROUP BY
clause.
(5)Q: What is the difference between ROWS
and RANGE
(ref)
- A: ROWS means the specific row or rows specified, and RANGE refers to those same rows plus any others that have the same matching values.ROW frames are based on physical offsets from the position of the current input row, which means that CURRENT ROW, PRECEDING, or FOLLOWING specifies a physical offset. RANGE frames are based on logical offsets from the position of the current input row, and have similar syntax to the ROW frame. A logical offset is the difference between the value of the ordering expression of the current input row and the value of that same expression of the boundary row of the frame.