When I play with Kaggle datasets, my default programing language is Python. Once I was exploring Kaggle dataset and I wanted to determine the transaction number for each credit card in Python, my first thought was It’s pretty easy in Transact SQL thanks to window functions. But can I use this concept in Python?
What is the window function?
The window function is the SQL concept where the dataset is split into several sets and some function is applied to each set separately. The window function always requires
OVER clause, sometimes followed by
PARTITION BY option. I won’t duplicate the internet content, so If you need a detailed explanation, this website seems to describe the concept straightforward.
My aim was to determine the transaction number for each credit card number. The dataset size was significant (~590 000 rows, 2.7GB). It consisted of two columns:
- TransactionDT – transaction date, Timestamp format
- credit _card_number
The index represented a unique TransactionID. Below is a subset of the data frame:
As mentioned above, this case would be easy to solve in SQL thanks to the window function. The
OVER(PARTITION BY credit_card_number ...) would temporarily create a subset for each value in credit_card_number.
(... ORDER BY TransactionDT ASC) would then order the entities in each subset by the transaction date. Then
ROW_NUMBER() function (which numbers the entities in the subset) would be applied for each subset. The query below shows how to use the window functions to get the transaction number for each credit card number:
select credit_card_number, TransactionDT, ROW_NUMBER() OVER(PARTITION BY credit_card_number ORDER BY TransactionDT ASC) AS card_number_occcurence2 from train
Although the SQL solution is easy, I have already written plenty of the code in Python. To avoid the mess, I wanted to have all the code in one programming language, Python. So how to write
OVER (PARTITION BY ...) in Python-like manner? I’ll present two approaches: using groupby and the trivial for loop. At first, we’ll order the dataset by credit card number and transaction date, as this step is required by both of the approaches:
The steps to imitate SQL window function in Python will be the following:
- split the dataset into subsets
- for each subset order the data by transaction date
- determine the number of transaction for each subset
To split the dataset into subsets we will use the
groupby function. It literally returns the dataset split by unique values of the argument column. Be aware, the Python’s
groupby performance differs from SQL
GROUP BY function, which needs to be used with aggregate functions (avg, max, sum, count, etc.).
The data is already ordered, so we can move to step three.
To get the number of transactions we can apply the following pseudo-code for each subset:
np.arange(len(subset)). It will return the vector with the row number (which is also a transaction number for given credit card). To use this function with
groupby, we’ll use transform function. According to documentation,
transform returns the DataFrame of the same shape as original, but with values transformed by the argument function. For instance, single-column dataset transformed by
transform(lambda x : np.arange(len(x)))will return single-column dataset. The only column will contain consecutive numbers starting from 0, of the same length as input data frame. Indeed, the output is basically the row number and the number of the transaction! The same transformation used with
groupby('credit_card_number') will return transaction number for each credit card! To sum up, the final function will be the following:
train["card_number_occcurence"] = train.groupby('credit_card_number').transform(lambda x : np.arange(len(x)))
What’s satisfying, the procedure takes only 20.1 sec (for ~590 000 rows).
One could wonder, why not to use for loop to determine transaction number for each credit card. The answer is poor performance. The first approach uses pre-defined, optimized functions. The for loop could be slow. I decided to check the for loop performance time. It was performed on already ordered data. For each unique credit card number, the subset of its transactions is created. Then the transaction number is applied (
for credit_card in train["credit_card_number"].unique(): train.loc[train["credit_card_number"] == credit_card, "card_number_occcurence2"] = np.arange(len(train.loc[train["credit_card_number"] == credit_card]))
The example is working anyhow, but slowly- it takes in 1971.7 sec (~33min).
It’s possible to imitate SQL
OVER(PARTITION BY ...) in Python using one-line code. What else, the computation time is acceptable.