When I play with Kaggle datasets, my default programing language is Python. Once I was exploring Kaggle dataset and I wanted to determine the transaction number for each credit card in Python, my first thought was *It’s pretty easy in Transact SQL thanks to window functions. But can I use this concept in Python?*

Contents

# What is the window function?

The window function is the SQL concept where the dataset is split into several sets and some function is applied to each set separately. The window function always requires `OVER`

clause, sometimes followed by `PARTITION BY`

option. I won’t duplicate the internet content, so If you need a detailed explanation, this website seems to describe the concept straightforward.

# Real-life example

My aim was to determine the transaction number for each credit card number. The dataset size was significant (**~590 000 rows**, **2.7GB**). It consisted of two columns:

*TransactionDT*– transaction date, Timestamp format*credit _card_number*

The index represented a unique *TransactionID*. Below is a subset of the data frame:

## SQL

As mentioned above, this case would be easy to solve in SQL thanks to the window function. The `OVER(PARTITION BY credit_card_number ...)`

would temporarily create a subset for each value in credit_card_number. `(... ORDER BY TransactionDT ASC)`

would then order the entities in each subset by the transaction date. Then`ROW_NUMBER()`

function (which numbers the entities in the subset) would be applied for each subset. The query below shows how to use the window functions to get the transaction number for each credit card number:

select credit_card_number, TransactionDT, ROW_NUMBER() OVER(PARTITION BY credit_card_number ORDER BY TransactionDT ASC) AS card_number_occcurence2 from train

## Python

Although the SQL solution is easy, I have already written plenty of the code in Python. To avoid the mess, I wanted to have all the code in one programming language, Python. So how to write `OVER (PARTITION BY ...)`

in Python-like manner? I’ll present two approaches: using** groupby** and the **trivial for loop**. At first, we’ll order the dataset by credit card number and transaction date, as this step is required by both of the approaches:

train.sort_values(['credit_card_number','TransactionDT'],ascending=True, inplace=True)

### Groupby approach

The steps to imitate SQL window function in Python will be the following:

- split the dataset into subsets
- for each subset order the data by transaction date
- determine the number of transaction for each subset

To split the **dataset into subsets** we will use the `groupby`

function. It literally returns the dataset split by unique values of the argument column. **Be aware**, the Python’s `groupby`

performance **differs from SQL** `GROUP BY`

function, which needs to be used with **aggregate functions** (avg, max, sum, count, etc.).

The data is already ordered, so we can move to step three.

To get the number of transactions we can apply the following pseudo-code for each subset: `np.arange(len(subset))`

. It will return the vector with the row number (which is also a transaction number for given credit card). To use this function with `groupby`

, we’ll use **transform** function. According to documentation, `transform`

returns the DataFrame of the same shape as original, but with values transformed by the argument function. For instance, single-column dataset transformed by`transform(lambda x : np.arange(len(x)))`

will return single-column dataset. The only column will contain consecutive numbers starting from 0, of the same length as input data frame. Indeed, the output is basically the row number and the number of the transaction! The same transformation used with `groupby('credit_card_number')`

will return transaction number for each credit card! To sum up, the final function will be the following:

train["card_number_occcurence"] = train.groupby('credit_card_number').transform(lambda x : np.arange(len(x)))

What’s satisfying, the procedure takes only **20.1 sec** (for ~590 000 rows).

### For-loop approach

One could wonder, why not to use **for loop** to determine transaction number for each credit card. The answer is **poor performance**. The first approach uses pre-defined, optimized functions. The for loop could be slow. I decided to check the for loop performance time. It was performed on already ordered data. For each unique credit card number, the subset of its transactions is created. Then the transaction number is applied (`np.arange(len(x))`

):

for credit_card in train["credit_card_number"].unique(): train.loc[train["credit_card_number"] == credit_card, "card_number_occcurence2"] = np.arange(len(train.loc[train["credit_card_number"] == credit_card]))

The example is working anyhow, but slowly- it takes in **1971.7 sec** (**~33min**).

# Sum Up

It’s possible to imitate SQL `OVER(PARTITION BY ...)`

in Python using one-line code. What else, the computation time is acceptable.