import pandas as pd
import numpy as np
import zipfile
import matplotlib.pyplot as plt
import seaborn as snstitle: “RFM Analysis” format: html
RFM Segmentation
Outline
- Business Understanding
- Modelling Workflow
- Data Pipelines
- EDA
- Preprocessing
- Clustering
- Business Recommendation
1. Business Understanding
# 2. Modelling Workflow
3. Importing Data
Data Description
InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction.StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.Description: Product (item) name. Nominal.Quantity: The quantities of each product (item) per transaction. Numeric.InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.Price: Unit price. Numeric, Product price per unit in sterling.Customer ID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.Country: Country name. Nominal, the name of the country where each customer resides.
## Import Data
filepath = "data/raw/online+retail.zip"
with zipfile.ZipFile(filepath, 'r') as z:
print(z.namelist())['Online Retail.xlsx']
with zipfile.ZipFile(filepath, "r") as z:
z.extractall("data/raw/unzipped")df1 = pd.read_excel('data/raw/unzipped/Online Retail.xlsx')display(df1)| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 2011-12-09 12:50:00 | 0.85 | 12680.0 | France |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 2011-12-09 12:50:00 | 2.10 | 12680.0 | France |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 2011-12-09 12:50:00 | 4.15 | 12680.0 | France |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 2011-12-09 12:50:00 | 4.15 | 12680.0 | France |
| 541908 | 581587 | 22138 | BAKING SET 9 PIECE RETROSPOT | 3 | 2011-12-09 12:50:00 | 4.95 | 12680.0 | France |
541909 rows × 8 columns
filepath = "data/raw/online+retail+ii.zip"
with zipfile.ZipFile(filepath, 'r') as z:
print(z.namelist())['online_retail_II.xlsx']
with zipfile.ZipFile(filepath, "r") as z:
z.extractall("data/raw/unzipped")df2 = pd.read_excel("data/raw/unzipped/online_retail_II.xlsx")display(df2)| Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 525456 | 538171 | 22271 | FELTCRAFT DOLL ROSIE | 2 | 2010-12-09 20:01:00 | 2.95 | 17530.0 | United Kingdom |
| 525457 | 538171 | 22750 | FELTCRAFT PRINCESS LOLA DOLL | 1 | 2010-12-09 20:01:00 | 3.75 | 17530.0 | United Kingdom |
| 525458 | 538171 | 22751 | FELTCRAFT PRINCESS OLIVIA DOLL | 1 | 2010-12-09 20:01:00 | 3.75 | 17530.0 | United Kingdom |
| 525459 | 538171 | 20970 | PINK FLORAL FELTCRAFT SHOULDER BAG | 2 | 2010-12-09 20:01:00 | 3.75 | 17530.0 | United Kingdom |
| 525460 | 538171 | 21931 | JUMBO STORAGE BAG SUKI | 2 | 2010-12-09 20:01:00 | 1.95 | 17530.0 | United Kingdom |
525461 rows × 8 columns
Check Data Type
df1.dtypesInvoiceNo object
StockCode object
Description object
Quantity int64
InvoiceDate datetime64[ns]
UnitPrice float64
CustomerID float64
Country object
dtype: object
df1['CustomerID'] = df1['CustomerID'].astype(object)Check for duplicates
df1[df1.duplicated(keep='first')]| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 517 | 536409 | 21866 | UNION JACK FLAG LUGGAGE TAG | 1 | 2010-12-01 11:45:00 | 1.25 | 17908.0 | United Kingdom |
| 527 | 536409 | 22866 | HAND WARMER SCOTTY DOG DESIGN | 1 | 2010-12-01 11:45:00 | 2.10 | 17908.0 | United Kingdom |
| 537 | 536409 | 22900 | SET 2 TEA TOWELS I LOVE LONDON | 1 | 2010-12-01 11:45:00 | 2.95 | 17908.0 | United Kingdom |
| 539 | 536409 | 22111 | SCOTTIE DOG HOT WATER BOTTLE | 1 | 2010-12-01 11:45:00 | 4.95 | 17908.0 | United Kingdom |
| 555 | 536412 | 22327 | ROUND SNACK BOXES SET OF 4 SKULLS | 1 | 2010-12-01 11:49:00 | 2.95 | 17920.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541675 | 581538 | 22068 | BLACK PIRATE TREASURE CHEST | 1 | 2011-12-09 11:34:00 | 0.39 | 14446.0 | United Kingdom |
| 541689 | 581538 | 23318 | BOX OF 6 MINI VINTAGE CRACKERS | 1 | 2011-12-09 11:34:00 | 2.49 | 14446.0 | United Kingdom |
| 541692 | 581538 | 22992 | REVOLVER WOODEN RULER | 1 | 2011-12-09 11:34:00 | 1.95 | 14446.0 | United Kingdom |
| 541699 | 581538 | 22694 | WICKER STAR | 1 | 2011-12-09 11:34:00 | 2.10 | 14446.0 | United Kingdom |
| 541701 | 581538 | 23343 | JUMBO BAG VINTAGE CHRISTMAS | 1 | 2011-12-09 11:34:00 | 2.08 | 14446.0 | United Kingdom |
5268 rows × 8 columns
- There are 5268 rows of duplicated data
df1.drop_duplicates(keep='first', inplace=True)df1.duplicated().sum()np.int64(0)
4. Exploratory Data Analysis
eda_data = df1.copy()
# validate
print('Original data shape :', df1.shape)
print('EDA data shape :', eda_data.shape)Original data shape : (536641, 8)
EDA data shape : (536641, 8)
4.1 Check for data descriptions
eda_data.describe()| Quantity | InvoiceDate | UnitPrice | |
|---|---|---|---|
| count | 536641.000000 | 536641 | 536641.000000 |
| mean | 9.620029 | 2011-07-04 08:57:06.087421952 | 4.632656 |
| min | -80995.000000 | 2010-12-01 08:26:00 | -11062.060000 |
| 25% | 1.000000 | 2011-03-28 10:52:00 | 1.250000 |
| 50% | 3.000000 | 2011-07-19 14:04:00 | 2.080000 |
| 75% | 10.000000 | 2011-10-18 17:05:00 | 4.130000 |
| max | 80995.000000 | 2011-12-09 12:50:00 | 38970.000000 |
| std | 219.130156 | NaN | 97.233118 |
Summary - The Quantity and UnitPrice minimum value are negatives - There are potential outliers on the data
eda_data[eda_data['Quantity']<0]| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 141 | C536379 | D | Discount | -1 | 2010-12-01 09:41:00 | 27.50 | 14527.0 | United Kingdom |
| 154 | C536383 | 35004C | SET OF 3 COLOURED FLYING DUCKS | -1 | 2010-12-01 09:49:00 | 4.65 | 15311.0 | United Kingdom |
| 235 | C536391 | 22556 | PLASTERS IN TIN CIRCUS PARADE | -12 | 2010-12-01 10:24:00 | 1.65 | 17548.0 | United Kingdom |
| 236 | C536391 | 21984 | PACK OF 12 PINK PAISLEY TISSUES | -24 | 2010-12-01 10:24:00 | 0.29 | 17548.0 | United Kingdom |
| 237 | C536391 | 21983 | PACK OF 12 BLUE PAISLEY TISSUES | -24 | 2010-12-01 10:24:00 | 0.29 | 17548.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 540449 | C581490 | 23144 | ZINC T-LIGHT HOLDER STARS SMALL | -11 | 2011-12-09 09:57:00 | 0.83 | 14397.0 | United Kingdom |
| 541541 | C581499 | M | Manual | -1 | 2011-12-09 10:28:00 | 224.69 | 15498.0 | United Kingdom |
| 541715 | C581568 | 21258 | VICTORIAN SEWING BOX LARGE | -5 | 2011-12-09 11:57:00 | 10.95 | 15311.0 | United Kingdom |
| 541716 | C581569 | 84978 | HANGING HEART JAR T-LIGHT HOLDER | -1 | 2011-12-09 11:58:00 | 1.25 | 17315.0 | United Kingdom |
| 541717 | C581569 | 20979 | 36 PENCILS TUBE RED RETROSPOT | -5 | 2011-12-09 11:58:00 | 1.25 | 17315.0 | United Kingdom |
10587 rows × 8 columns
eda_data[eda_data['UnitPrice']<0]| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 299983 | A563186 | B | Adjust bad debt | 1 | 2011-08-12 14:51:00 | -11062.06 | NaN | United Kingdom |
| 299984 | A563187 | B | Adjust bad debt | 1 | 2011-08-12 14:52:00 | -11062.06 | NaN | United Kingdom |
Dropping Negative Quantity and UnitPrice
print('Shape before dropping negative quantity and unitprice: ', eda_data.shape)
eda_data = eda_data[(eda_data['UnitPrice']>0) & (eda_data['Quantity']>0)]
print('Shape after dropping negative quantity and unitprice: ', eda_data.shape)Shape before dropping negative quantity and unitprice: (536641, 8)
Shape after dropping negative quantity and unitprice: (524878, 8)
eda_data.isna().sum()InvoiceNo 0
StockCode 0
Description 0
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 132186
Country 0
dtype: int64
print("Shape before dropping Nan: ", eda_data.shape)
eda_data = eda_data.dropna()
print("Shape after dropping Nan: ", eda_data.shape)Shape sebelum drop Nan: (524878, 8)
Shape setelah drop Nan: (392692, 8)
# Validate
eda_data.isna().sum()InvoiceNo 0
StockCode 0
Description 0
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 0
Country 0
dtype: int64
4.2 Checking InvoiceNo
eda_data['InvoiceNo'].value_counts()InvoiceNo
576339 542
579196 533
580727 529
578270 442
573576 435
...
551888 1
536400 1
536405 1
581458 1
581459 1
Name: count, Length: 18532, dtype: int64
eda_data[eda_data['InvoiceNo'] == 576339]['StockCode'].nunique()541
There are 541 number of items in a single transaction in invoice 5766339.
4.3 Checking CustomerID
eda_data['CustomerID'].nunique()4338
Summary - From 536641 rows of transaction, there were only 4338 number of unique customer.
4.4 Create RFM+T features
Checking last transaction date
eda_data["InvoiceDate"].min(), eda_data['InvoiceDate'].max()(Timestamp('2010-12-01 08:26:00'), Timestamp('2011-12-09 12:50:00'))
Summary - This dataset contains data from 1 Desember 2010 until 9 Desember 2011 - Let’s take 11 Desember 2011 as the reference day.
ref_date = pd.to_datetime('2011-12-11')
ref_dateTimestamp('2011-12-11 00:00:00')
Define the RFM+T function
def recency(x):
"""Calculate days after customer's last transaction to the reference date
Args:
x(datetime): InvoiceDate
Returns:
integer: number of days """
return (ref_date - x.max()).days
def frequency(x):
"""Calculate the number of transactions done by each customer
Args:
x(object): InvoiceNo
Returns:
integer: number of transactions"""
return len(set(x))
def tenure(x):
"""Calculate the day difference between customer first and last purchase
Args:
x(datetime): InvoiceDate
Returns:
integer: number of days"""
return (x.max()-x.min()).days
def generate_rfmts(data):
""" Generate recency, frequency, monetary, tenure, and sales from data"""
# Create sales feature
data['Sales'] = data['Quantity']* data['UnitPrice']
#Grouping by CustomerID
agg_dict = {
'InvoiceDate' : [recency, tenure],
'InvoiceNo' : [frequency],
'Sales' : 'sum'
}
data_rfmts = data.groupby('CustomerID').agg(agg_dict)
#Rename columns
data_rfmts.columns = ['Recency', 'Tenure', 'Frequency', 'Monetary']
data_rfmts.index = data_rfmts.index.astype('int')
return data_rfmtsrfm_data = generate_rfmts(eda_data)rfm_data| Recency | Tenure | Frequency | Monetary | |
|---|---|---|---|---|
| CustomerID | ||||
| 12346 | 326 | 0 | 1 | 77183.60 |
| 12347 | 3 | 365 | 7 | 4310.00 |
| 12348 | 76 | 282 | 4 | 1797.24 |
| 12349 | 19 | 0 | 1 | 1757.55 |
| 12350 | 311 | 0 | 1 | 334.40 |
| ... | ... | ... | ... | ... |
| 18280 | 278 | 0 | 1 | 180.60 |
| 18281 | 181 | 0 | 1 | 80.82 |
| 18282 | 8 | 118 | 2 | 178.05 |
| 18283 | 4 | 333 | 16 | 2045.53 |
| 18287 | 43 | 158 | 3 | 1837.28 |
4338 rows × 4 columns
Checking RFM data Descriptions
rfm_data.describe()| Recency | Tenure | Frequency | Monetary | |
|---|---|---|---|---|
| count | 4338.000000 | 4338.000000 | 4338.000000 | 4338.000000 |
| mean | 93.059474 | 130.448594 | 4.272015 | 2048.688081 |
| std | 100.012264 | 132.039554 | 7.697998 | 8985.230220 |
| min | 1.000000 | 0.000000 | 1.000000 | 3.750000 |
| 25% | 18.000000 | 0.000000 | 1.000000 | 306.482500 |
| 50% | 51.000000 | 92.500000 | 2.000000 | 668.570000 |
| 75% | 142.750000 | 251.750000 | 5.000000 | 1660.597500 |
| max | 374.000000 | 373.000000 | 209.000000 | 280206.020000 |
Summary - There are no negative numbers in the data. - Frequency and Monetary seems to have an outlier.
Checking Tenure
- Tenure = 0 means customer last and first purchase are on the same day.
rfm_data[rfm_data['Tenure'] == 0]| Recency | Tenure | Frequency | Monetary | |
|---|---|---|---|---|
| CustomerID | ||||
| 12346 | 325 | 0 | 1 | 77183.60 |
| 12349 | 18 | 0 | 1 | 1757.55 |
| 12350 | 310 | 0 | 1 | 334.40 |
| 12353 | 204 | 0 | 1 | 89.00 |
| 12354 | 232 | 0 | 1 | 1079.40 |
| ... | ... | ... | ... | ... |
| 18276 | 43 | 0 | 1 | 335.86 |
| 18277 | 58 | 0 | 1 | 110.38 |
| 18278 | 73 | 0 | 1 | 173.90 |
| 18280 | 277 | 0 | 1 | 180.60 |
| 18281 | 180 | 0 | 1 | 80.82 |
1555 rows × 4 columns
Checking Frequency
- check if tenure = 0 means a one time buyer
rfm_data[(rfm_data['Tenure'] == 0) & (rfm_data['Frequency'] > 1)]| Recency | Tenure | Frequency | Monetary | |
|---|---|---|---|---|
| CustomerID | ||||
| 12365 | 292 | 0 | 2 | 641.38 |
| 12410 | 309 | 0 | 2 | 693.33 |
| 12424 | 163 | 0 | 2 | 1760.96 |
| 12452 | 17 | 0 | 2 | 430.57 |
| 12512 | 67 | 0 | 2 | 132.18 |
| ... | ... | ... | ... | ... |
| 18013 | 156 | 0 | 2 | 187.29 |
| 18024 | 153 | 0 | 2 | 389.78 |
| 18034 | 22 | 0 | 2 | 418.68 |
| 18048 | 204 | 0 | 2 | 172.29 |
| 18139 | 18 | 0 | 6 | 8438.34 |
62 rows × 4 columns
- not all of them are one time buyer, they do more than 1 transaction on one day.
- Our interest is to segment the repurchase customer.
- Exclude the one time buyer from the data
rfm_data = rfm_data[rfm_data['Frequency'] > 1]
rfm_data.describe()| Recency | Tenure | Frequency | Monetary | |
|---|---|---|---|---|
| count | 2845.000000 | 2845.000000 | 2845.000000 | 2845.000000 |
| mean | 59.386292 | 198.905448 | 5.989104 | 2907.985706 |
| std | 70.246203 | 113.866961 | 9.044247 | 10899.538240 |
| min | 1.000000 | 0.000000 | 2.000000 | 6.900000 |
| 25% | 12.000000 | 98.000000 | 2.000000 | 614.660000 |
| 50% | 31.000000 | 207.000000 | 4.000000 | 1154.470000 |
| 75% | 76.000000 | 299.000000 | 6.000000 | 2429.830000 |
| max | 374.000000 | 373.000000 | 209.000000 | 280206.020000 |
Plotting the Distribution
for col in rfm_data.columns:
x = rfm_data[col]
sns.histplot(x)
plt.title(col)
plt.show()Summary - The Monetary & Frequency data is highly skewed. - Need to drop these outliers.
Dropping Outliers
def drop_outlier(data, cols):
""""""
data_clean = data.copy()
for col in cols:
q1, q3 = np.quantile(data_clean[col], q=[0.25, 0.75])
iqr = q3 - q1
ub = q3 + 1.5*iqr # upper bound
lb = q1 - 1.5*iqr # lower bound
#Filter data
data_clean = data_clean[(data_clean[col]>=lb) & (data_clean[col]<=ub)]
return data_cleanrfm_data = drop_outlier(rfm_data, cols =['Monetary',
'Frequency'])
print('rfm data shape:', rfm_data.shape)
rfm_data.describe()rfm data shape: (2395, 4)
| Recency | Tenure | Frequency | Monetary | |
|---|---|---|---|---|
| count | 2395.000000 | 2395.000000 | 2395.000000 | 2395.000000 |
| mean | 66.670981 | 177.201670 | 3.768267 | 1251.421504 |
| std | 72.737947 | 107.313335 | 1.903081 | 980.459476 |
| min | 1.000000 | 0.000000 | 2.000000 | 6.900000 |
| 25% | 16.000000 | 82.000000 | 2.000000 | 547.485000 |
| 50% | 37.000000 | 182.000000 | 3.000000 | 942.340000 |
| 75% | 89.000000 | 266.000000 | 5.000000 | 1687.945000 |
| max | 374.000000 | 371.000000 | 9.000000 | 5126.760000 |
for col in rfm_data.columns:
x = rfm_data[col]
sns.histplot(x)
plt.title(col)
plt.show()Normalizing Data
from sklearn.preprocessing import MinMaxScaler
def fit_scaler(data):
"""
Function to fit a min-max scaler
Parameters
----------
data : pandas DataFrame
The sample RFM data
Returns
-------
scaler : object
The Min-Max scaler
"""
# Create & fit scaler
scaler = MinMaxScaler()
scaler.fit(data)
return scaler
def transform_scaler(data, scaler):
"""
Function to fit a min-max scaler
Parameters
----------
data : pandas DataFrame
The sample RFM data
scaler : object
The Min-Max scaler
Returns
-------
data_clean : object
The clean data
"""
# Transform scaler
data_clean = pd.DataFrame(scaler.transform(data))
data_clean.index = data.index
data_clean.columns = data.columns
# Validate
print("Data shape :", data.shape)
return data_clean# Fit a scaler
scaler = fit_scaler(data = rfm_data)
# Transform data
data_rfm_scaled = transform_scaler(data = rfm_data,
scaler = scaler)
data_rfm_scaled.head()Data shape : (2395, 4)
| Recency | Tenure | Frequency | Monetary | |
|---|---|---|---|---|
| CustomerID | ||||
| 12347 | 0.005362 | 0.983827 | 0.714286 | 0.840472 |
| 12348 | 0.201072 | 0.760108 | 0.285714 | 0.349685 |
| 12352 | 0.096515 | 0.700809 | 0.857143 | 0.488127 |
| 12356 | 0.058981 | 0.814016 | 0.142857 | 0.547775 |
| 12358 | 0.002681 | 0.401617 | 0.000000 | 0.226795 |
data_rfm_scaled.describe()| Recency | Tenure | Frequency | Monetary | |
|---|---|---|---|---|
| count | 2395.000000 | 2395.000000 | 2395.000000 | 2395.000000 |
| mean | 0.176062 | 0.477633 | 0.252610 | 0.243077 |
| std | 0.195008 | 0.289254 | 0.271869 | 0.191501 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.040214 | 0.221024 | 0.000000 | 0.105586 |
| 50% | 0.096515 | 0.490566 | 0.142857 | 0.182708 |
| 75% | 0.235925 | 0.716981 | 0.428571 | 0.328338 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
Create Preprocessing Pipeline
6. Clustering
6.1 Ranking Method
rfm_rank = data_rfm_scaled.copy()- Assume the company can only afford 5 marketing strategies, therefore maximum number of cluster would be 5.
- We will rank each RFM column the data into 5 (binning).
def binning_rfm(df):
"""
Function for binning the data for making ranking method
Parameter
---------
df : pandas dataframe
dataframe input for binning method
Return
------
df : pandas dataframe
output binned dataframe
"""
# Calculate quantiles for each variable
quantiles = df.quantile(q=[0.2, 0.4, 0.6, 0.8])
# Create bins for each variable
r_bins = [-np.inf] + list(quantiles['Recency']) + [np.inf]
f_bins = [-np.inf] + list(quantiles['Frequency']) + [np.inf]
m_bins = [-np.inf] + list(quantiles['Monetary']) + [np.inf]
# Create labels for the bins
labels = [1, 2, 3, 4, 5]
# label for recency (lower recency, higher ranking)
labels_r = [5, 4, 3, 2, 1]
# Bin each variable into five groups
df['R_Group'] = pd.cut(df['Recency'], bins=r_bins, labels=labels_r).astype(int)
df['F_Group'] = pd.cut(df['Frequency'], bins=f_bins, labels=labels).astype(int)
df['M_Group'] = pd.cut(df['Monetary'], bins=m_bins, labels=labels).astype(int)
# RFM score with average
df['RFM_Score'] = np.round((df['R_Group']+df['F_Group']+df['M_Group'])/3,2)
return dfrfm_rank = binning_rfm(rfm_rank)rfm_rank.head()| Recency | Tenure | Frequency | Monetary | R_Group | F_Group | M_Group | RFM_Score | |
|---|---|---|---|---|---|---|---|---|
| CustomerID | ||||||||
| 12347 | 0.005362 | 0.983827 | 0.714286 | 0.840472 | 5 | 5 | 5 | 5.00 |
| 12348 | 0.201072 | 0.760108 | 0.285714 | 0.349685 | 2 | 3 | 4 | 3.00 |
| 12352 | 0.096515 | 0.700809 | 0.857143 | 0.488127 | 3 | 5 | 5 | 4.33 |
| 12356 | 0.058981 | 0.814016 | 0.142857 | 0.547775 | 4 | 2 | 5 | 3.67 |
| 12358 | 0.002681 | 0.401617 | 0.000000 | 0.226795 | 5 | 1 | 3 | 3.00 |
rfm_rank.dtypesRecency float64
Tenure float64
Frequency float64
Monetary float64
R_Group int64
F_Group int64
M_Group int64
RFM_Score float64
dtype: object