DS: Advance SQL

 Advance SQL

What are Window Functions?

Window functions in SQL are a type of analytical function that perform calculations

across a set of rows that are related to the current row, called a "window". A

window function calculates a value for each row in the result set based on a subset

of the rows that are defined by a window specification.

The window specification is defined using the OVER() clause in SQL, which specifies

the partitioning and ordering of the rows that the window function will operate

on. The partitioning divides the rows into groups based on a specific column or

expression, while the ordering defines the order in which the rows are processed

within each group.


Data Set:

CREATE TABLE marks (

student_id INTEGER PRIMARY KEY AUTO_INCREMENT,

name VARCHAR(255),

branch VARCHAR(255),

marks INTEGER

);



INSERT INTO marks (name, branch, marks) VALUES

('Nitish', 'EEE', 82),

('Rishabh', 'EEE', 91),

('Anukant', 'EEE', 69),

('Rupesh', 'EEE', 55),

('Shubham', 'CSE', 78),

('Ved', 'CSE', 43),

('Deepak', 'CSE', 98),

('Arpan', 'CSE', 95),

('Vinay', 'ECE', 95),

('Ankit', 'ECE', 88),

('Anand', 'ECE', 81),

('Rohit', 'ECE', 95),

('Prashant', 'MECH', 75),

('Amit', 'MECH', 69),

('Sunny', 'MECH', 39),

('Gautam', 'MECH', 51);



CREATE TABLE students (

name VARCHAR(255),

branch VARCHAR(255),

marks INTEGER

);


INSERT INTO students (name, branch, marks) VALUES

-- EEE branch

('Nitish', 'EEE', 82),

('Rishabh', 'EEE', 91),

('Anukant', 'EEE', 69),

('Rupesh', 'EEE', 55),


-- CSE branch

('Shubham', 'CSE', 78),

('Ved', 'CSE', 43),

('Deepak', 'CSE', 98),


-- ME branch

('Aman', 'ME', 72),

('Pratik', 'ME', 66),

('Harsh', 'ME', 88),


-- IT branch

('Siddharth', 'IT', 81),

('Nikhil', 'IT', 90),

('Karan', 'IT', 76);

Aggregate Function with OVER()

Find all the students who have marks higher than the avg marks of
their respective branch


# Window functions

select *, AVG(marks) over() from student.marks

select *, AVG(marks) over(partition by branch) from student.marks

# smallest marks and highest marks students

select *, AVG(marks) over(), min(marks) over(), max(marks) over(), min(marks) over( partition by branch), max(marks) over(partition by branch) from student.marks order by student_id

# get the students whose marks is greatere than their respective branch's average marks

select * from (

select *, avg(marks) over(partition by branch) as 'branch_avg'

from student.marks) t

where marks > t.branch_avg



RANK/DENSE_RANK/ROW_NUMBER

1. Find top 2 most paying customers of each month
2. Create roll no from branch and marks

#RANK/DENSE_RANK/ROW_NUMBER

#usecase- you want to rank all the student in your branch

# 1. Find top 2 most paying customers of each month

# 2. Create roll no. from branch and marks

select *, rank() over(order by marks desc)

from student.marks


# get the rank w.r.t branch

select *, rank() over(partition by branch order by marks desc)

from student.marks

#Dence ranks-> uses 2 if two student got the same rank 1,

# where normal rank gives 3

select *,

rank() over(partition by branch order by marks desc),

dense_rank() over(partition by branch order by marks desc)

from student.marks




# Row_number

select *, row_number() over(partition by branch) from marks;

select *, concat(branch, '_', row_number() over(partition by branch))

from marks



use zomato

select * from zomato.orders

select monthname(date), user_id, sum(amount) from zomato.orders

group by monthname(date), user_id

order by month(date)


FIRST_VALUE/LAST VALUE/NTH_VALUE

These functions return values from within a window frame, which is useful for analytics and reporting.
Use cases: 
1. Find the branch toppers
2. FRAME Clause
3. Find the last guy of each branch
4. Alternate way of writing Window functions
5. Find the 2nd last guy of each branch, 5th topper of each branch

Sample DataSet: 
sale_id employee month sales_amount
1 Alice Jan 500
2 Alice Feb 700
3 Alice Mar 600
4 Bob Jan 300
5 Bob Feb 800
6 Bob Mar 500

1. FIRST_VALUE()

Goal: Get each employee's first month's sales

SELECT
  employee,
  month,
  sales_amount,
  FIRST_VALUE(month) OVER (PARTITION BY employee ORDER BY month) AS first_month,
  FIRST_VALUE(sales_amount) OVER (PARTITION BY employee ORDER BY month) AS first_sales
FROM sales;


 Explanation:

  • PARTITION BY employee: Do the calculation per employee

  • ORDER BY month: Order by month to define the "first"


2. LAST_VALUE()

Goal: Get each employee’s last month’s sales

SELECT employee, month, sales_amount, LAST_VALUE(month) OVER ( PARTITION BY employee ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS last_month, LAST_VALUE(sales_amount) OVER ( PARTITION BY employee ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS last_sales FROM sales;

⚠️ Without ROWS BETWEEN ..., LAST_VALUE() may return incorrect values because the default window frame ends at the current row.


3. NTH_VALUE()

Goal: Get each employee’s 2nd month’s sales

SELECT employee, month, sales_amount, NTH_VALUE(month, 2) OVER ( PARTITION BY employee ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS second_month, NTH_VALUE(sales_amount, 2) OVER ( PARTITION BY employee ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS second_sales FROM sales;

Output Example:

employeemonthsales_amountsecond_monthsecond_sales
AliceJan500Feb700
AliceFeb700Feb700
AliceMar600Feb700
BobJan300Feb800
BobFeb800Feb800
BobMar500Feb800


Summary 

FunctionWhat It Returns
FIRST_VALUE()            First value in the window frame
LAST_VALUE()            Last value in the window frame
NTH_VALUE(n)            The n-th value in the window frame

Frames

A frame in a window function is a subset of rows within the partition that
determines the scope of the window function calculation. The frame is defined
using a combination of two clauses in the window function: ROWS and BETWEEN.
The ROWS clause specifies how many rows should be included in the frame
relative to the current row. For example, ROWS 3 PRECEDING means that the
frame includes the current row and the three rows that precede it in the partition.
The BETWEEN clause specifies the boundaries of the frame.
Examples
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW - means that the
frame includes all rows from the beginning of the partition up to and including the
current row.

•ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING: the frame includes the
current row and the row immediately before and after it.
•ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: the
frame includes all rows in the partition.
•ROWS BETWEEN 3 PRECEDING AND 2 FOLLOWING: the frame includes the
current row and the three rows before it and the two rows after it.

This is crucial in windowed aggregations (like SUM, AVG, etc.) to control how many rows are included in the calculation relative to the current row.

Sample Table: sales

idemployeesales
1Alice100
2Alice200
3Alice150
4Alice300
5Alice250

Goal: Use SUM(sales) with different frame definitions using ROWS BETWEEN ... AND ...

1. ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Frame: From the first row of the partition up to the current row (cumulative sum)

SELECT id, employee, sales, SUM(sales) OVER ( PARTITION BY employee ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS running_total FROM sales;

 Result:

idsalesrunning_total
1100100
2200300
3150450
4300750
52501000

2. ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING

Frame: Current row, one before, and one after (3-row moving sum)

SELECT id, employee, sales, SUM(sales) OVER ( PARTITION BY employee ORDER BY id ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING ) AS moving_sum FROM sales;

Result:

idsalesmoving_sum
1100300
2200450
3150650
4300700
5250550

3. ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

🔒 Frame: Includes all rows in the partition (like a full partition sum)

SELECT id, employee, sales, SUM(sales) OVER ( PARTITION BY employee ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS total_sales FROM sales;

Result:

idsalestotal_sales
11001000
22001000
31501000
43001000
52501000

4. ROWS BETWEEN 3 PRECEDING AND 2 FOLLOWING

Frame: 3 rows before, current row, and 2 rows after (6-row frame max)

SELECT id, employee, sales, SUM(sales) OVER ( PARTITION BY employee ORDER BY id ROWS BETWEEN 3 PRECEDING AND 2 FOLLOWING ) AS custom_frame_sum FROM sales;

 Result:

idsalescustom_frame_sum
1100100 + 200 + 150 + 300 = 750
2200100 + 200 + 150 + 300 + 250 = 1000
3150100 + 200 + 150 + 300 + 250 = 1000
4300200 + 150 + 300 + 250 = 900
5250150 + 300 + 250 = 700

LEAD & LAG in SQL

  • LEAD(column, offset, default) → Get value after current row.

  • LAG(column, offset, default) → Get value before current row.


Sample Table: sales

idemployeemonthsales
1AliceJan100
2AliceFeb150
3AliceMar200
4BobJan120
5BobFeb180
6BobMar160

Example: Using LEAD and LAG to Compare Sales Month-over-Month

SELECT employee, month, sales, LAG(sales) OVER (PARTITION BY employee ORDER BY month) AS previous_month_sales, LEAD(sales) OVER (PARTITION BY employee ORDER BY month) AS next_month_sales FROM sales;

Output:

employeemonthsalesprevious_month_salesnext_month_sales
AliceJan100NULL150
AliceFeb150100200
AliceMar200150NULL
BobJan120NULL180
BobFeb180120160
BobMar160180NULL

Advanced: Calculate Difference from Previous Month

SELECT employee, month, sales, LAG(sales) OVER (PARTITION BY employee ORDER BY month) AS previous_month_sales, sales - LAG(sales) OVER (PARTITION BY employee ORDER BY month) AS sales_diff FROM sales;

Output:

employeemonthsalesprevious_month_    salessales_diff
AliceJan100NULLNULL
AliceFeb15010050
AliceMar20015050
BobJan120NULLNULL
BobFeb18012060
BobMar160180-20

Parameters

You can use the offset and default parameters as well:

LEAD(sales, 2, 0) OVER (...) -- Look 2 rows ahead, return 0 if not found LAG(sales, 1, 0) OVER (...) -- Look 1 row behind, return 0 if not found

Use Cases

Use CaseUse LEAD / LAG
Compare current row with previous    LAG()
Compare current row with next    LEAD()
Detect changes in a sequence    LAG() + comparison
Calculate growth rate    LAG()


Comments

Popular posts from this blog

A Road-Map to Become Solution Architect

Pickle Vs M, VS ONNX vs SavedModel vs TorchScript

Module 3: Fine-Tuning and Customizing Generative AI Models