#Prepare for SQL interview

Basic

General: where / group by / order by / left join / right join / inner join / null / not null /insert/ having / distinct / like / union / avg / sum / min / max / substring / REGEXP(RLIKE) / limit+top / CONCAT
Arithmetic: ROUND / POWER / ABS / concat / CEIL / FLOOR / LN / MOD = % / SQRT / EXP
Date operations: NOW() / CURDATE() / CURTIME() / DATE() / DATE_ADD() / DATE_SUB() / DATEDIFF() / DATE_FORMAT() /EXTRACT
- example 1: SELECT EXTRACT(MONTH FROM '2012-02-01')
- exapmle 2: select last 7 days in mysql
```
 SELECT * 
 FROM table
 WHERE table.date >= DATE(NOW()) - INTERVAL 7 DAY
 -- WHERE table.date BETWEEN DATE_SUB(NOW(), INTERVAL 7 DAY) AND NOW()
```
Missingness handling: COALESCE(var1, var2, var3, var4, ..)
- COALESCE() function returns the first non-NULL value of a list, or NULL if there are no non-NULL values.

Window functions

Syntax

window_function (expression) OVER (
 [ PARTITION BY expr_list ]
 [ ORDER BY order_list ][ frame_clause ] )

Value: FIRST_VALUE()/LAG()/LAST_VALUE()/LEAD()
Aggregate: AVG()/COUNT()/MAX()/MIN()/SUM()
Ranking: CUME_DIST()/DENSE_RANK()/NTILE()/PERCENT_RANK()/RANK()/ROW_NUMBER()
Special: unbound PROCEEDING and unbound FOLLOWING, RANGE (details+visual )

There are five types of boundaries, which are UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING, CURRENT ROW, PRECEDING, and FOLLOWING. UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING represent the first row of the partition and the last row of the partition, respectively. For the other three types of boundaries, they specify the offset from the position of the current input row and their specific meanings are defined based on the type of the frame. There are two types of frames, ROW frame and RANGE frame.

SQL windows function examples (ref)

Aggregation function as window function

SELECT start_terminal,
       duration_seconds,
       SUM(duration_seconds) OVER
         (PARTITION BY start_terminal ORDER BY start_time)
         AS running_total
  FROM tutorial.dc_bikeshare_q1_2012
 WHERE start_time < '2012-01-08'

LAG()/LEAD() as window function

SELECT *
  FROM (
    SELECT start_terminal,
           duration_seconds,
           duration_seconds -LAG(duration_seconds, 1) OVER
             (PARTITION BY start_terminal ORDER BY duration_seconds)
             AS difference
      FROM tutorial.dc_bikeshare_q1_2012
     WHERE start_time < '2012-01-08'
     ORDER BY start_terminal, duration_seconds
       ) sub
 WHERE sub.difference IS NOT NULL

LAG() pulls from previous rows and LEAD() pulls from following rows.

ROW_NUMBER()/RANK() as window function

SELECT start_terminal,
       start_time,
       duration_seconds,
       ROW_NUMBER() OVER (PARTITION BY start_terminal
                          ORDER BY start_time)
                    AS row_number
  FROM tutorial.dc_bikeshare_q1_2012
 WHERE start_time < '2012-01-08'

RANK() is slightly different from ROW_NUMBER(). If you order by start_time, for example, it might be the case that some terminals have rides with two identical start times. In this case, they are given the same rank, whereas ROW_NUMBER() gives them different numbers. You can use NTILE(*# of buckets*), for example, NTILE(100), to identify what percentile a given row falls into.

Bounded and Unbounded ROWS (ref)

Bounded

--ROWS PRECEDING AND FOLLOWING
SELECT Year, DepartmentID, Revenue,
sum(Revenue) OVER (PARTITION by DepartmentID
ORDER BY [YEAR]
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) as BeforeAndAfter
FROM REVENUE
ORDER BY departmentID, year;

Unbounded

-- ROWS UNBOUNDED FOLLOWING
-- http://stevestedman.com/?p=1485
SELECT Year, DepartmentID, Revenue,
min(Revenue) OVER (PARTITION by DepartmentID
ORDER BY [YEAR]
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as MinRevenueBeyond
FROM REVENUE
ORDER BY departmentID, year;

Create window function alias If you’re planning to write several window functions in to the same query, using the same window, you can create an alias.

SELECT start_terminal,
       duration_seconds,
       NTILE(4) OVER
         (PARTITION BY start_terminal ORDER BY duration_seconds)
         AS quartile,
       NTILE(5) OVER
         (PARTITION BY start_terminal ORDER BY duration_seconds)
         AS quintile,
       NTILE(100) OVER
         (PARTITION BY start_terminal ORDER BY duration_seconds)
         AS percentile
  FROM tutorial.dc_bikeshare_q1_2012
 WHERE start_time < '2012-01-08'
 ORDER BY start_terminal, duration_seconds

can be rewritten as

SELECT start_terminal,
       duration_seconds,
       NTILE(4) OVER ntile_window AS quartile,
       NTILE(5) OVER ntile_window AS quintile,
       NTILE(100) OVER ntile_window AS percentile
  FROM tutorial.dc_bikeshare_q1_2012
 WHERE start_time < '2012-01-08'
WINDOW ntile_window AS
         (PARTITION BY start_terminal ORDER BY duration_seconds)
 ORDER BY start_terminal, duration_seconds

Other SQL related tools

Spark2.0 SQL guide
window functions in Spark
SQL optimization: link1, link2

Possible questions

(1)Q: what is the difference between WHERE and HAVING

A: HAVING is used to check conditions after the aggregation takes place.

(2)Q: what is the difference between GROUP BY and ORDER BY

A: ORDER BY allows you to sort the result set according to different criteria, such as first sort by name from a-z, then sort by the price highest to lowest. GROUP BY will aggregate records by the specified columns which allows you to perform aggregation functions on non-grouped columns (such as SUM, COUNT, AVG, etc). ORDER BY can be used within the over() clause

(3)Q: how to create a view

A: The basic CREATE VIEW syntax is as follows: CREATE VIEW view_name AS SELECT column1, column2..... FROM table_name WHERE [condition]; You can include multiple tables in your SELECT statement in very similar way as you use them in normal SQL SELECT query. In SQL, a view is a virtual table based on the result-set of an SQL statement. A view contains rows and columns, just like a real table. The fields in a view are fields from one or more real tables in the database.

(4)Q: What is the difference between aggregation function used together with group by and aggregation function used as window function?

A: A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result. You can’t use window functions and standard aggregations in the same query. More specifically, you can’t include window functions in a GROUP BY clause.

(5)Q: What is the difference between ROWS and RANGE (ref)

A: ROWS means the specific row or rows specified, and RANGE refers to those same rows plus any others that have the same matching values.ROW frames are based on physical offsets from the position of the current input row, which means that CURRENT ROW, PRECEDING, or FOLLOWING specifies a physical offset. RANGE frames are based on logical offsets from the position of the current input row, and have similar syntax to the ROW frame. A logical offset is the difference between the value of the ordering expression of the current input row and the value of that same expression of the boundary row of the frame.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SQL.md

SQL.md

Basic

Window functions

SQL windows function examples (ref)

Other SQL related tools

Possible questions

Files

SQL.md

Latest commit

History

SQL.md

File metadata and controls

Basic

Window functions

SQL windows function examples (ref)

Other SQL related tools

Possible questions