Window Functions -
Window function:
pySpark window functions are useful when you want to examine relationships within group of data rather than between group of data. It performs statictical operations like below explained.PySpark Window function performs statistical operations such as rank, row number, etc. on a group, frame, or collection of rows and returns results for each row individually. It is also popularly growing to perform data transformations.
There are mainly three types of Window function:
Analytical Function
Ranking Function
Aggregate Function
By using Window function, the query directly fetches data from the next row within the window, eliminating the need for an explicit self join that would create a temporary copy of the table and join it back to itself.
What is Split , apply and combine in window functions?
While not explicitly named as "Split, Apply, Combine" in the context of window functions, these concepts implicitly guide their usage:
1. Split:
Partitioning: Divides the data into groups based on specified criteria using the PARTITION BY clause. Each group serves as a distinct window for calculations.
Ordering: Arranges the rows within each partition using the ORDER BY clause. This ordering affects how window functions access and process data within the window.
2. Apply:
Window Function Application: The chosen window function(s) are applied to each row within its respective window, performing calculations or operations on the designated rows.
3. Combine:
Result Integration: The results of the window function calculations are seamlessly integrated into the output, generating new columns alongside original data without aggregation.
Concise Output: The final result retains all original rows while incorporating the derived values from the window functions, providing a comprehensive view of the data with added insights.
SELECT customer_id, order_date, order_amount,
SUM(order_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM orders;
Breakdown:
Split:
PARTITION BY customer_id: Splits data into groups based on customer ID.
ORDER BY order_date: Orders rows within each group by order date.
Apply:
SUM(order_amount) OVER (...): Calculates the running total of order amounts for each customer, applied to each row within its window.
Combine:
The original columns (customer_id, order_date, order_amount) are combined with the calculated running_total column, providing a comprehensive view of each order and its cumulative value within the customer's purchase history.