title: “RFM Analysis” format: html

RFM Segmentation

Outline

Business Understanding
Modelling Workflow
Data Pipelines
EDA
Preprocessing
Clustering
Business Recommendation

import pandas as pd
import numpy as np
import zipfile
import matplotlib.pyplot as plt
import seaborn as sns

1. Business Understanding

# 2. Modelling Workflow

3. Importing Data

Data Description

InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction.
StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: The quantities of each product (item) per transaction. Numeric.
InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.
Price: Unit price. Numeric, Product price per unit in sterling.
Customer ID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country: Country name. Nominal, the name of the country where each customer resides.

## Import Data

filepath = "data/raw/online+retail.zip"

with zipfile.ZipFile(filepath, 'r') as z:
    print(z.namelist())

['Online Retail.xlsx']

with zipfile.ZipFile(filepath, "r") as z:
    z.extractall("data/raw/unzipped")

df1 = pd.read_excel('data/raw/unzipped/Online Retail.xlsx')

display(df1)

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	2.55	17850.0	United Kingdom
1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	2.75	17850.0	United Kingdom
3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom
...	...	...	...	...	...	...	...	...
541904	581587	22613	PACK OF 20 SPACEBOY NAPKINS	12	2011-12-09 12:50:00	0.85	12680.0	France
541905	581587	22899	CHILDREN'S APRON DOLLY GIRL	6	2011-12-09 12:50:00	2.10	12680.0	France
541906	581587	23254	CHILDRENS CUTLERY DOLLY GIRL	4	2011-12-09 12:50:00	4.15	12680.0	France
541907	581587	23255	CHILDRENS CUTLERY CIRCUS PARADE	4	2011-12-09 12:50:00	4.15	12680.0	France
541908	581587	22138	BAKING SET 9 PIECE RETROSPOT	3	2011-12-09 12:50:00	4.95	12680.0	France

541909 rows × 8 columns

filepath = "data/raw/online+retail+ii.zip"

with zipfile.ZipFile(filepath, 'r') as z:
    print(z.namelist())

['online_retail_II.xlsx']

with zipfile.ZipFile(filepath, "r") as z:
    z.extractall("data/raw/unzipped")

df2 = pd.read_excel("data/raw/unzipped/online_retail_II.xlsx")

display(df2)

	Invoice	StockCode	Description	Quantity	InvoiceDate	Price	Customer ID	Country
0	489434	85048	15CM CHRISTMAS GLASS BALL 20 LIGHTS	12	2009-12-01 07:45:00	6.95	13085.0	United Kingdom
1	489434	79323P	PINK CHERRY LIGHTS	12	2009-12-01 07:45:00	6.75	13085.0	United Kingdom
2	489434	79323W	WHITE CHERRY LIGHTS	12	2009-12-01 07:45:00	6.75	13085.0	United Kingdom
3	489434	22041	RECORD FRAME 7" SINGLE SIZE	48	2009-12-01 07:45:00	2.10	13085.0	United Kingdom
4	489434	21232	STRAWBERRY CERAMIC TRINKET BOX	24	2009-12-01 07:45:00	1.25	13085.0	United Kingdom
...	...	...	...	...	...	...	...	...
525456	538171	22271	FELTCRAFT DOLL ROSIE	2	2010-12-09 20:01:00	2.95	17530.0	United Kingdom
525457	538171	22750	FELTCRAFT PRINCESS LOLA DOLL	1	2010-12-09 20:01:00	3.75	17530.0	United Kingdom
525458	538171	22751	FELTCRAFT PRINCESS OLIVIA DOLL	1	2010-12-09 20:01:00	3.75	17530.0	United Kingdom
525459	538171	20970	PINK FLORAL FELTCRAFT SHOULDER BAG	2	2010-12-09 20:01:00	3.75	17530.0	United Kingdom
525460	538171	21931	JUMBO STORAGE BAG SUKI	2	2010-12-09 20:01:00	1.95	17530.0	United Kingdom

525461 rows × 8 columns

Check Data Type

df1.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

df1['CustomerID'] = df1['CustomerID'].astype(object)

Check for duplicates

df1[df1.duplicated(keep='first')]

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
517	536409	21866	UNION JACK FLAG LUGGAGE TAG	1	2010-12-01 11:45:00	1.25	17908.0	United Kingdom
527	536409	22866	HAND WARMER SCOTTY DOG DESIGN	1	2010-12-01 11:45:00	2.10	17908.0	United Kingdom
537	536409	22900	SET 2 TEA TOWELS I LOVE LONDON	1	2010-12-01 11:45:00	2.95	17908.0	United Kingdom
539	536409	22111	SCOTTIE DOG HOT WATER BOTTLE	1	2010-12-01 11:45:00	4.95	17908.0	United Kingdom
555	536412	22327	ROUND SNACK BOXES SET OF 4 SKULLS	1	2010-12-01 11:49:00	2.95	17920.0	United Kingdom
...	...	...	...	...	...	...	...	...
541675	581538	22068	BLACK PIRATE TREASURE CHEST	1	2011-12-09 11:34:00	0.39	14446.0	United Kingdom
541689	581538	23318	BOX OF 6 MINI VINTAGE CRACKERS	1	2011-12-09 11:34:00	2.49	14446.0	United Kingdom
541692	581538	22992	REVOLVER WOODEN RULER	1	2011-12-09 11:34:00	1.95	14446.0	United Kingdom
541699	581538	22694	WICKER STAR	1	2011-12-09 11:34:00	2.10	14446.0	United Kingdom
541701	581538	23343	JUMBO BAG VINTAGE CHRISTMAS	1	2011-12-09 11:34:00	2.08	14446.0	United Kingdom

5268 rows × 8 columns

There are 5268 rows of duplicated data

df1.drop_duplicates(keep='first', inplace=True)

df1.duplicated().sum()

np.int64(0)

4. Exploratory Data Analysis

eda_data = df1.copy()

# validate
print('Original data shape :', df1.shape)
print('EDA data shape      :', eda_data.shape)

Original data shape : (536641, 8)
EDA data shape      : (536641, 8)

4.1 Check for data descriptions

eda_data.describe()

	Quantity	InvoiceDate	UnitPrice
count	536641.000000	536641	536641.000000
mean	9.620029	2011-07-04 08:57:06.087421952	4.632656
min	-80995.000000	2010-12-01 08:26:00	-11062.060000
25%	1.000000	2011-03-28 10:52:00	1.250000
50%	3.000000	2011-07-19 14:04:00	2.080000
75%	10.000000	2011-10-18 17:05:00	4.130000
max	80995.000000	2011-12-09 12:50:00	38970.000000
std	219.130156	NaN	97.233118

Summary - The Quantity and UnitPrice minimum value are negatives - There are potential outliers on the data

eda_data[eda_data['Quantity']<0]

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
141	C536379	D	Discount	-1	2010-12-01 09:41:00	27.50	14527.0	United Kingdom
154	C536383	35004C	SET OF 3 COLOURED FLYING DUCKS	-1	2010-12-01 09:49:00	4.65	15311.0	United Kingdom
235	C536391	22556	PLASTERS IN TIN CIRCUS PARADE	-12	2010-12-01 10:24:00	1.65	17548.0	United Kingdom
236	C536391	21984	PACK OF 12 PINK PAISLEY TISSUES	-24	2010-12-01 10:24:00	0.29	17548.0	United Kingdom
237	C536391	21983	PACK OF 12 BLUE PAISLEY TISSUES	-24	2010-12-01 10:24:00	0.29	17548.0	United Kingdom
...	...	...	...	...	...	...	...	...
540449	C581490	23144	ZINC T-LIGHT HOLDER STARS SMALL	-11	2011-12-09 09:57:00	0.83	14397.0	United Kingdom
541541	C581499	M	Manual	-1	2011-12-09 10:28:00	224.69	15498.0	United Kingdom
541715	C581568	21258	VICTORIAN SEWING BOX LARGE	-5	2011-12-09 11:57:00	10.95	15311.0	United Kingdom
541716	C581569	84978	HANGING HEART JAR T-LIGHT HOLDER	-1	2011-12-09 11:58:00	1.25	17315.0	United Kingdom
541717	C581569	20979	36 PENCILS TUBE RED RETROSPOT	-5	2011-12-09 11:58:00	1.25	17315.0	United Kingdom

10587 rows × 8 columns

eda_data[eda_data['UnitPrice']<0]

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
299983	A563186	B	Adjust bad debt	1	2011-08-12 14:51:00	-11062.06	NaN	United Kingdom
299984	A563187	B	Adjust bad debt	1	2011-08-12 14:52:00	-11062.06	NaN	United Kingdom

Dropping Negative Quantity and UnitPrice

print('Shape before dropping negative quantity and unitprice: ', eda_data.shape)

eda_data = eda_data[(eda_data['UnitPrice']>0) & (eda_data['Quantity']>0)]

print('Shape after dropping negative quantity and unitprice: ', eda_data.shape)

Shape before dropping negative quantity and unitprice:  (536641, 8)
Shape after dropping negative quantity and unitprice:  (524878, 8)

eda_data.isna().sum()

InvoiceNo           0
StockCode           0
Description         0
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     132186
Country             0
dtype: int64

print("Shape before dropping Nan: ", eda_data.shape)

eda_data = eda_data.dropna()

print("Shape after dropping Nan: ", eda_data.shape)

Shape sebelum drop Nan:  (524878, 8)
Shape setelah drop Nan:  (392692, 8)

# Validate
eda_data.isna().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

4.2 Checking `InvoiceNo`

eda_data['InvoiceNo'].value_counts()

InvoiceNo
576339    542
579196    533
580727    529
578270    442
573576    435
         ... 
551888      1
536400      1
536405      1
581458      1
581459      1
Name: count, Length: 18532, dtype: int64

eda_data[eda_data['InvoiceNo'] == 576339]['StockCode'].nunique()

There are 541 number of items in a single transaction in invoice 5766339.

4.3 Checking `CustomerID`

eda_data['CustomerID'].nunique()

Summary - From 536641 rows of transaction, there were only 4338 number of unique customer.

4.4 Create RFM+T features

Checking last transaction date

eda_data["InvoiceDate"].min(), eda_data['InvoiceDate'].max()

(Timestamp('2010-12-01 08:26:00'), Timestamp('2011-12-09 12:50:00'))

Summary - This dataset contains data from 1 Desember 2010 until 9 Desember 2011 - Let’s take 11 Desember 2011 as the reference day.

ref_date = pd.to_datetime('2011-12-11')

ref_date

Timestamp('2011-12-11 00:00:00')

Define the RFM+T function

def recency(x):
    """Calculate days after customer's last transaction to the reference date
        Args:
            x(datetime): InvoiceDate
        
        Returns:
            integer: number of days """
    return (ref_date - x.max()).days

def frequency(x):
    """Calculate the number of transactions done by each customer
        Args:
            x(object): InvoiceNo
        
        Returns:
            integer: number of transactions"""
    return len(set(x))

def tenure(x):
    """Calculate the day difference between customer first and last purchase
        Args:
            x(datetime): InvoiceDate
        
        Returns:
            integer: number of days"""
    return (x.max()-x.min()).days

def generate_rfmts(data):
    """ Generate recency, frequency, monetary, tenure, and sales from data"""
    
    # Create sales feature
    data['Sales'] = data['Quantity']* data['UnitPrice']

    #Grouping by CustomerID
    agg_dict = {
        'InvoiceDate' : [recency, tenure],
        'InvoiceNo' : [frequency],
        'Sales' : 'sum'
    }
    data_rfmts = data.groupby('CustomerID').agg(agg_dict)

    #Rename columns
    data_rfmts.columns = ['Recency', 'Tenure', 'Frequency', 'Monetary']

    data_rfmts.index = data_rfmts.index.astype('int')

    return data_rfmts

rfm_data = generate_rfmts(eda_data)

rfm_data

	Recency	Tenure	Frequency	Monetary
CustomerID
12346	326	0	1	77183.60
12347	3	365	7	4310.00
12348	76	282	4	1797.24
12349	19	0	1	1757.55
12350	311	0	1	334.40
...	...	...	...	...
18280	278	0	1	180.60
18281	181	0	1	80.82
18282	8	118	2	178.05
18283	4	333	16	2045.53
18287	43	158	3	1837.28

4338 rows × 4 columns

Checking RFM data Descriptions

rfm_data.describe()

	Recency	Tenure	Frequency	Monetary
count	4338.000000	4338.000000	4338.000000	4338.000000
mean	93.059474	130.448594	4.272015	2048.688081
std	100.012264	132.039554	7.697998	8985.230220
min	1.000000	0.000000	1.000000	3.750000
25%	18.000000	0.000000	1.000000	306.482500
50%	51.000000	92.500000	2.000000	668.570000
75%	142.750000	251.750000	5.000000	1660.597500
max	374.000000	373.000000	209.000000	280206.020000

Summary - There are no negative numbers in the data. - Frequency and Monetary seems to have an outlier.

Checking Tenure

Tenure = 0 means customer last and first purchase are on the same day.

rfm_data[rfm_data['Tenure'] == 0]

	Recency	Tenure	Frequency	Monetary
CustomerID
12346	325	0	1	77183.60
12349	18	0	1	1757.55
12350	310	0	1	334.40
12353	204	0	1	89.00
12354	232	0	1	1079.40
...	...	...	...	...
18276	43	0	1	335.86
18277	58	0	1	110.38
18278	73	0	1	173.90
18280	277	0	1	180.60
18281	180	0	1	80.82

1555 rows × 4 columns

Checking Frequency

check if tenure = 0 means a one time buyer

rfm_data[(rfm_data['Tenure'] == 0) & (rfm_data['Frequency'] > 1)]

	Recency	Tenure	Frequency	Monetary
CustomerID
12365	292	0	2	641.38
12410	309	0	2	693.33
12424	163	0	2	1760.96
12452	17	0	2	430.57
12512	67	0	2	132.18
...	...	...	...	...
18013	156	0	2	187.29
18024	153	0	2	389.78
18034	22	0	2	418.68
18048	204	0	2	172.29
18139	18	0	6	8438.34

62 rows × 4 columns

not all of them are one time buyer, they do more than 1 transaction on one day.
Our interest is to segment the repurchase customer.
Exclude the one time buyer from the data

rfm_data = rfm_data[rfm_data['Frequency'] > 1]
rfm_data.describe()

	Recency	Tenure	Frequency	Monetary
count	2845.000000	2845.000000	2845.000000	2845.000000
mean	59.386292	198.905448	5.989104	2907.985706
std	70.246203	113.866961	9.044247	10899.538240
min	1.000000	0.000000	2.000000	6.900000
25%	12.000000	98.000000	2.000000	614.660000
50%	31.000000	207.000000	4.000000	1154.470000
75%	76.000000	299.000000	6.000000	2429.830000
max	374.000000	373.000000	209.000000	280206.020000

Plotting the Distribution

for col in rfm_data.columns:
    x = rfm_data[col]
    sns.histplot(x)
    plt.title(col)
    plt.show()

Summary - The Monetary & Frequency data is highly skewed. - Need to drop these outliers.

Dropping Outliers

def drop_outlier(data, cols):
    """"""
    data_clean = data.copy()

    for col in cols:
        q1, q3 = np.quantile(data_clean[col], q=[0.25, 0.75])
        iqr = q3 - q1
        ub = q3 + 1.5*iqr # upper bound
        lb = q1 - 1.5*iqr # lower bound

        #Filter data
        data_clean = data_clean[(data_clean[col]>=lb) & (data_clean[col]<=ub)]

    return data_clean

rfm_data = drop_outlier(rfm_data, cols =['Monetary',
                                        'Frequency'])

print('rfm data shape:', rfm_data.shape)
rfm_data.describe()

rfm data shape: (2395, 4)

	Recency	Tenure	Frequency	Monetary
count	2395.000000	2395.000000	2395.000000	2395.000000
mean	66.670981	177.201670	3.768267	1251.421504
std	72.737947	107.313335	1.903081	980.459476
min	1.000000	0.000000	2.000000	6.900000
25%	16.000000	82.000000	2.000000	547.485000
50%	37.000000	182.000000	3.000000	942.340000
75%	89.000000	266.000000	5.000000	1687.945000
max	374.000000	371.000000	9.000000	5126.760000

for col in rfm_data.columns:
    x = rfm_data[col]
    sns.histplot(x)
    plt.title(col)
    plt.show()

Normalizing Data

from sklearn.preprocessing import MinMaxScaler

def fit_scaler(data):
    """
    Function to fit a min-max scaler

    Parameters
    ----------
    data : pandas DataFrame
        The sample RFM data

    Returns
    -------
    scaler : object
        The Min-Max scaler
    """
    # Create & fit scaler
    scaler = MinMaxScaler()
    scaler.fit(data)

    return scaler

def transform_scaler(data, scaler):
    """
    Function to fit a min-max scaler

    Parameters
    ----------
    data : pandas DataFrame
        The sample RFM data

    scaler : object
        The Min-Max scaler

    Returns
    -------
    data_clean : object
        The clean data
    """
    # Transform scaler
    data_clean = pd.DataFrame(scaler.transform(data))
    data_clean.index = data.index
    data_clean.columns = data.columns

    # Validate
    print("Data shape :", data.shape)

    return data_clean

# Fit a scaler
scaler = fit_scaler(data = rfm_data)

# Transform data
data_rfm_scaled = transform_scaler(data = rfm_data,
                                   scaler = scaler)

data_rfm_scaled.head()

Data shape : (2395, 4)

	Recency	Tenure	Frequency	Monetary
CustomerID
12347	0.005362	0.983827	0.714286	0.840472
12348	0.201072	0.760108	0.285714	0.349685
12352	0.096515	0.700809	0.857143	0.488127
12356	0.058981	0.814016	0.142857	0.547775
12358	0.002681	0.401617	0.000000	0.226795

data_rfm_scaled.describe()

	Recency	Tenure	Frequency	Monetary
count	2395.000000	2395.000000	2395.000000	2395.000000
mean	0.176062	0.477633	0.252610	0.243077
std	0.195008	0.289254	0.271869	0.191501
min	0.000000	0.000000	0.000000	0.000000
25%	0.040214	0.221024	0.000000	0.105586
50%	0.096515	0.490566	0.142857	0.182708
75%	0.235925	0.716981	0.428571	0.328338
max	1.000000	1.000000	1.000000	1.000000

Create Preprocessing Pipeline

6. Clustering

6.1 Ranking Method

rfm_rank = data_rfm_scaled.copy()

Assume the company can only afford 5 marketing strategies, therefore maximum number of cluster would be 5.
We will rank each RFM column the data into 5 (binning).

def binning_rfm(df):
    """
    Function for binning the data for making ranking method

    Parameter
    ---------
    df : pandas dataframe
         dataframe input for binning method

    Return
    ------
    df : pandas dataframe
         output binned dataframe
    """
    # Calculate quantiles for each variable
    quantiles = df.quantile(q=[0.2, 0.4, 0.6, 0.8])

    # Create bins for each variable
    r_bins = [-np.inf] + list(quantiles['Recency']) + [np.inf]
    f_bins = [-np.inf] + list(quantiles['Frequency']) + [np.inf]
    m_bins = [-np.inf] + list(quantiles['Monetary']) + [np.inf]

    # Create labels for the bins
    labels = [1, 2, 3, 4, 5]

    # label for recency (lower recency, higher ranking)
    labels_r = [5, 4, 3, 2, 1]

    # Bin each variable into five groups
    df['R_Group'] = pd.cut(df['Recency'], bins=r_bins, labels=labels_r).astype(int)
    df['F_Group'] = pd.cut(df['Frequency'], bins=f_bins, labels=labels).astype(int)
    df['M_Group'] = pd.cut(df['Monetary'], bins=m_bins, labels=labels).astype(int)

    # RFM score with average
    df['RFM_Score'] = np.round((df['R_Group']+df['F_Group']+df['M_Group'])/3,2)

    return df

rfm_rank = binning_rfm(rfm_rank)

rfm_rank.head()

	Recency	Tenure	Frequency	Monetary	R_Group	F_Group	M_Group	RFM_Score
CustomerID
12347	0.005362	0.983827	0.714286	0.840472	5	5	5	5.00
12348	0.201072	0.760108	0.285714	0.349685	2	3	4	3.00
12352	0.096515	0.700809	0.857143	0.488127	3	5	5	4.33
12356	0.058981	0.814016	0.142857	0.547775	4	2	5	3.67
12358	0.002681	0.401617	0.000000	0.226795	5	1	3	3.00

rfm_rank.dtypes

Recency      float64
Tenure       float64
Frequency    float64
Monetary     float64
R_Group        int64
F_Group        int64
M_Group        int64
RFM_Score    float64
dtype: object