The integration of Google Analytics 4 and Big Query to understand customer segmentation with RFM

Working with an E-commerce selling supplements that launched in 2021 the growth in the category driven by e-commerce has accelerated post 2020. There have been multiple phases for the business where in 2021 it was reacting to the situation and going live ASAP. With strong performance by end 2022 launched a new Shopify e-commerce website and potentially in 2025 trailing out subscriptions.

Launching the new e-commerce website for Jan 2023, it meant investing in a better infrastructure around analytics and one of the areas of focus was getting more out of Google Analytics 4. They had a good understanding of their core metrics but do not have an advanced set-up to get more sophisticated with analytics. With Google Analytics 4 and the integration with Big Query it provided the perfect opportunity to generate rich insights.

The insights they were missing were a deeper understanding of customer purchase behaviour and customer segmentation. Google Analytics 4 provided insights on purchase behaviour that was focused on the purchase journey from visiting the website to making a purchase that provided good insights, but they wanted more.

(Image from Google Merch Shop GA4 demo account)

Some of the key questions they wanted to be able to answer:

Who are my most loyal customers.
Which of my customers are most likely to churn.
Who are the customers that cannot be lost.
Which customers are likely to be loyal.

What the E-commerce business required was RFM to understand customer segmentation.

This is where Google Analytics 4 event model and the integration with Big Query was invaluable being able to understand customer purchase behavior. The event_name = purchase (this is triggered by a tag built in Google Tag Manager) is what the code is built around. The core components of building the query:

user_pesudo_id – is generated when the user first visits the website. The user_pseudo_id will be recognized when the user visits the site again from the same device and browser.
event_date – the date on which the event was logged.
transaction_id – the unique order ID of the customer’s purchase
purchase_revenue – the revenue generated for that purchase associated to the ecommerce.transaction_id

The ID variable becomes the key data that RFM is built around.

Building the Query

To understand customer segmentation, it’s built around RFM:

Recency: how frequently a customer made a purchase
Frequency: how often a customer made a purchase
Monetary: how much has a customer spent

To be able to understand RFM, constructed a SQL query that employs a Common Table Expression (CTE) that provides a modular approach but also readability and efficiency.

In total there were 4 CTE’s used. Could have constructed less but in terms of workflow to be able to understand the code it became about efficiency.

First CTE

In the first CTE, for events that were purchases:

Queried all user ids.
Summed up the total revenue from all purchases.
Calculated the average revenue of all purchases and the total individual transactions made by each customer.

Using Big Query’s DATE_DIFF function to determine the difference between the latest order date of all customers and then calculate the latest order date of each customer in reference to the overall latest order date. For example, if the last order in our data was 10th March, each customer’s recency will be calculated relative to this date, so if a customer’s last order was made on 7th March, their recency value will be 3 days and if a customer’s last order was placed on 8th March, their recency value would be 2 days and so on and so forth.

WITH rfm_data AS (
SELECT
user_pseudo_id AS user_id,
SUM(ecommerce.purchase_revenue) AS total_revenue,
ROUND(AVG(ecommerce.purchase_revenue), 2) AS average_revenue,
COUNT(DISTINCT ecommerce.transaction_id) AS total_transactions,
MAX(DATETIME(TIMESTAMP_MICROS(event_timestamp), 'UTC')) AS most_recent_purchase,
(
SELECT
MAX(DATETIME(TIMESTAMP_MICROS(event_timestamp), 'UTC'))
FROM `enter.tablename_123456.events_*`
) AS max_order_date,
DATE_DIFF(
(
SELECT
MAX(DATETIME(TIMESTAMP_MICROS(event_timestamp), 'UTC'))
FROM `enter.tablename_123456.events_*`
),
MAX(DATETIME(TIMESTAMP_MICROS(event_timestamp), 'UTC')), day
) AS recency
FROM
`enter.tablename_123456.events_*`
WHERE
_TABLE_SUFFIX >= FORMAT_DATE('%Y%m', DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH))
AND
event_name = 'purchase'
GROUP BY
1
)

Second CTE

In the second CTE, the customer segmentation is executed based on the key RFM metrics based on a person recency, frequency, and monetary. It utilizes the NTILE function divides customers into five equally sized groups, ranking them from 1 (lowest) to 5 (highest) based on their relative performance in that metric. In theory, a perfect customer will score 5 across all metrics, indicating frequent purchases, at consistent intervals, with significant revenue generated from each transaction, whereas a terrible customer will score 1 for all the metrics indicating infrequent purchases at inconsistent intervals with very little revenue generated over these purchases.

The NTILE function is applied for the 3 RFM metrics and based on the time range specified in the first CTE, customers will be segmented into segments labelled 1-5 based on their spending habits.

While using the NTILE function, also included the user_id in the window function OVER, but why? Bucketing customers into segments without explicitly supplying the user_id leads to duplicates. A user_id could appear twice or three times based on the individual purchases made during the time-period being analyzed. By including user_id, it ties all transactions to a user and calculates the RFM metrics independently based on their aggregated habits not individual habits, which would have led to duplicates.

rfm_segments AS (
SELECT
user_id,
total_revenue,
recency,
frequency,
average_revenue as monetary,
NTILE(5) OVER (ORDER BY recency DESC, user_id) AS rfm_recency,
NTILE(5) OVER (ORDER BY frequency, user_id) AS rfm_frequency,
NTILE(5) OVER (ORDER BY average_revenue, user_id) AS rfm_monetary,
FROM rfm_data
)

Third CTE

The third CTE was splitting our data into quantiles produced any score from 1 to 5 for each customer for each of the rfm_recency, rfm_frequency and rfm_monetary columns, needed to combine the score for each column into one column so we can collectively label the segments. Using CONCAT function from SQL to combine these scores together, firstly since the quantiles were integers as a result of the previous subquery, we explicitly casted or converted them into strings before concatenating or pasting them together with other rfm metrics.

These are not the only labels we will assign to the next CTE but just for reference. Customers in the first quantile of any score are not good customers in general but those in the 5th or around thereabouts for any metric are good customers, we will assign a lot of combinations and the criteria for assigning customers to a category.

rfm_strings AS (
SELECT
*,
CONCAT(
CAST(rfm_recency AS STRING),
CAST(rfm_frequency AS STRING),
CAST(rfm_monetary AS STRING)
) AS rfm_string
FROM rfm_segments
)

Fourth CTE

In the fourth and final CTE it’s all about assigning labels to the rfm scores which was done by concatenating the rfm metrics calculating using quantiles.

The customer segments were based on the following RFM score logic:

rfm_mapping AS (
SELECT
*,
CASE
WHEN
rfm_string IN ('555', '554', '544', '545', '454', '455', '445')
THEN 'champions'
WHEN
rfm_string IN ('543', '444', '435', '355', '354', '345', '344', '335')
THEN 'loyal customers'
WHEN
rfm_string IN ('553', '551', '552', '541', '542', '533', '532', '531', '452', '451', '442', '441', '431', '453', '433', '432', '423', '353', '352', '351', '342', '341', '333', '323')
THEN 'potential loyalist'
WHEN
rfm_string IN ('512', '511', '422', '421', '412', '411', '311')
THEN 'recent customers'
WHEN
rfm_string IN ('525', '524', '523', '522', '521', '515', '514', '513', '425', '424', '413', '414', '415', '315', '314', '313')
THEN 'promising'
WHEN
rfm_string IN ('535', '534', '443', '434', '343', '334', '325', '324')
THEN 'customers needing attention'
WHEN
rfm_string IN ('331', '321', '312', '221', '213')
THEN 'about to sleep'
WHEN
rfm_string IN ('255', '254', '245', '244', '253', '252', '243', '242', '235', '234', '225', '224', '153', '152', '145', '143', '142', '135', '134', '133', '125', '124')
THEN 'at risk'
WHEN
rfm_string IN ('155', '154', '144', '214', '215', '115', '114', '113')
THEN 'cannot lose them'
WHEN
rfm_string IN ('332', '322', '231', '241', '251', '233', '232', '223', '222', '132', '123', '122', '212', '211')
THEN 'hibernating'
WHEN
rfm_string IN ('111', '112', '121', '131', '141', '151')
THEN 'lost'
ELSE 'NA'
END AS rfm_segment
FROM rfm_strings
)

Customer Segment Label	RFM Score
champions	‘555’, ‘554’, ‘544’, ‘545’, ‘454’, ‘455’, ‘445’
loyal customers	‘543’, ‘444’, ‘435’, ‘355’, ‘354’, ‘345’, ‘344’, ‘335’
potential loyalists	‘553’, ‘551’, ‘552’, ‘541’, ‘542’, ‘533’, ‘532’, ‘531’, ‘452’, ‘451’, ‘442’, ‘441’, ‘431’, ‘453’, ‘433’, ‘432’, ‘423’, ‘353’, ‘352’, ‘351’, ‘342’, ‘341’, ‘333’, ‘323’
recent customers	‘512’, ‘511’, ‘422’, ‘421’, ‘412’, ‘411’, ‘311’
promising	‘525’, ‘524’, ‘523’, ‘522’, ‘521’, ‘515’, ‘514’, ‘513’, ‘425’, ‘424’, ‘413’, ‘414’, ‘415’, ‘315’, ‘314’, ‘313’
customers needing attention	‘535’, ‘534’, ‘443’, ‘434’, ‘343’, ‘334’, ‘325’, ‘324’
about to sleep	‘331’, ‘321’, ‘312’, ‘221’, ‘213’
at risk	‘255’, ‘254’, ‘245’, ‘244’, ‘253’, ‘252’, ‘243’, ‘242’, ‘235’, ‘234’, ‘225’, ‘224’, ‘153’, ‘152’, ‘145’, ‘143’, ‘142’, ‘135’, ‘134’, ‘133’, ‘125’, ‘124’
cannot lose them	‘155’, ‘154’, ‘144’, ‘214’, ‘215’, ‘115’, ‘114’, ‘113’
hibernating	‘332’, ‘322’, ‘231’, ‘241’, ‘251’, ‘233’, ‘232’, ‘223’, ‘222’, ‘132’, ‘123’, ‘122’, ‘212’, ‘211’
lost	‘111’, ‘112’, ‘121’, ‘131’, ‘141’, ‘151’

Final Code

The final code shows:

Customers the total revenue they generated.
Customers number of transactions.
How many days has it taken since their last purchase.
The frequency which is their total transactions.
Their monetary value which is the average revenue they generated across all transactions,
The RFM score which is a combination of their recency, frequency and monetary scores and the segment to which they belong to based on our look up table.

WITH rfm_data AS (
SELECT
user_pseudo_id AS user_id,
SUM(ecommerce.purchase_revenue) AS total_revenue,
ROUND(AVG(ecommerce.purchase_revenue), 2) AS average_revenue,
COUNT(DISTINCT ecommerce.transaction_id) AS frequency,
MAX(DATETIME(TIMESTAMP_MICROS(event_timestamp), 'UTC')) AS most_recent_purchase,
(
SELECT 
MAX(DATETIME(TIMESTAMP_MICROS(event_timestamp), 'UTC'))
FROM `enter.tablename_123456.events_*`
) AS max_order_date,
DATE_DIFF(
(
SELECT 
MAX(DATETIME(TIMESTAMP_MICROS(event_timestamp), 'UTC'))
FROM `enter.tablename_123456.events_*`
),
MAX(DATETIME(TIMESTAMP_MICROS(event_timestamp), 'UTC')), day
) AS recency
FROM 
`enter.tablename_123456.events_*`
WHERE 
_TABLE_SUFFIX >= FORMAT_DATE('%Y%m', DATE_SUB(CURRENT_DATE(), INTERVAL 24 MONTH)) 
AND 
event_name = 'purchase'
GROUP BY 
1
),

rfm_segments AS (
SELECT
user_id,
total_revenue,
recency,
frequency,
average_revenue as monetary,
NTILE(5) OVER (ORDER BY recency DESC, user_id) AS rfm_recency,
NTILE(5) OVER (ORDER BY frequency, user_id) AS rfm_frequency,
NTILE(5) OVER (ORDER BY average_revenue, user_id) AS rfm_monetary,
FROM rfm_data
),

rfm_strings AS (
SELECT
*,
CONCAT(
CAST(rfm_recency AS STRING),
CAST(rfm_frequency AS STRING),
CAST(rfm_monetary AS STRING)
) AS rfm_string
FROM rfm_segments
),

rfm_mapping AS (
SELECT
*,
CASE
WHEN 
rfm_string IN ('555', '554', '544', '545', '454', '455', '445') 
THEN 'champions'
WHEN 
rfm_string IN ('543', '444', '435', '355', '354', '345', '344', '335') 
THEN 'loyal customers'
WHEN 
rfm_string IN ('553', '551', '552', '541', '542', '533', '532', '531', '452', '451', '442', '441', '431', '453', '433', '432', '423', '353', '352', '351', '342', '341', '333', '323') 
THEN 'potential loyalist'
WHEN
rfm_string IN ('512', '511', '422', '421', '412', '411', '311') 
THEN 'recent customers'
WHEN
rfm_string IN ('525', '524', '523', '522', '521', '515', '514', '513', '425', '424', '413', '414', '415', '315', '314', '313')
THEN 'promising'
WHEN
rfm_string IN ('535', '534', '443', '434', '343', '334', '325', '324') 
THEN 'customers needing attention'
WHEN
rfm_string IN ('331', '321', '312', '221', '213') 
THEN 'about to sleep'
WHEN
rfm_string IN ('255', '254', '245', '244', '253', '252', '243', '242', '235', '234', '225', '224', '153', '152', '145', '143', '142', '135', '134', '133', '125', '124')
THEN 'at risk'
WHEN
rfm_string IN ('155', '154', '144', '214', '215', '115', '114', '113')
THEN 'cannot lose them'
WHEN
rfm_string IN ('332', '322', '231', '241', '251', '233', '232', '223', '222', '132', '123', '122', '212', '211') 
THEN 'hibernating'
WHEN 
rfm_string IN ('111', '112', '121', '131', '141', '151')
THEN 'lost'
ELSE 'NA'
END AS rfm_segment
FROM rfm_strings
)

SELECT 
user_id AS customer_id,
total_revenue,
frequency AS total_transactions,
recency,
monetary,
rfm_recency,
rfm_frequency,
rfm_monetary,
rfm_string AS rfm_score,
rfm_segment AS segment
FROM rfm_mapping
ORDER BY rfm_score DESC;

Alternative Approach

An alternative approach was to use subqueries. In this example below, nested subqueries to replicate the CTE approach:

The innermost subquery calculated the base metrics as per the first CTE and then calculated the recency in the second inner subquery and the customer segmentation in the third inner query. The results and segments produced were labeled in the outermost query for each customer segment based on their recency, frequency, and monetary scores.

The alternative approach allowed us to replicate almost everything, but it did not have the readability and performance perks of what CTE’s can provide. This was heavily optimized to reduce processing time even further and lost the readability and modularity that was possible with CTE’s. With CTE’s we could control a lot of moving parts explicitly but with subqueries due to performance pitfalls, it was better to perform everything and the concatenation before labelling the segments. It’s far easier to debug if there is an issue because of the way CTE’s are divided into the CTE’s so it is easier to spot and fix a bug in the CTE’s but it is more difficult to do that with the queries.

SELECT
user_id,
rfm_string,
CASE
WHEN rfm_string IN ('555', '554', '544', '545', '454', '455', '445') THEN 'champions'
WHEN rfm_string IN ('543', '444', '435', '355', '354', '345', '344', '335') THEN 'loyal customers'
WHEN rfm_string IN ('553', '551', '552', '541', '542', '533', '532', '531', '452', '451', '442', '441', '431', '453', '433', '432', '423', '353', '352', '351', '342', '341', '333', '323') THEN 'potential loyalist'
WHEN rfm_string IN ('512', '511', '422', '421', '412', '411', '311') THEN 'recent customers'
WHEN rfm_string IN ('525', '524', '523', '522', '521', '515', '514', '513', '425', '424', '413', '414', '415', '315', '314', '313') THEN 'promising'
WHEN rfm_string IN ('535', '534', '443', '434', '343', '334', '325', '324') THEN 'customers needing attention'
WHEN rfm_string IN ('331', '321', '312', '221', '213') THEN 'about to sleep'
WHEN rfm_string IN ('255', '254', '245', '244', '253', '252', '243', '242', '235', '234', '225', '224', '153', '152', '145', '143', '142', '135', '134', '133', '125', '124') THEN 'at risk'
WHEN rfm_string IN ('155', '154', '144', '214', '215', '115', '114', '113') THEN 'cannot lose them'
WHEN rfm_string IN ('332', '322', '231', '241', '251', '233', '232', '223', '222', '132', '123', '122', '212', '211') THEN 'hibernating'
WHEN rfm_string IN ('111', '112', '121', '131', '141', '151') THEN 'lost'
ELSE 'NA'
END AS rfm_segment
FROM (
SELECT
user_id,
CONCAT(CAST(NTILE(5) OVER (ORDER BY recency) AS STRING),
CAST(NTILE(5) OVER (ORDER BY total_transactions) AS STRING),
CAST(NTILE(5) OVER (ORDER BY average_revenue) AS STRING)) AS rfm_string
FROM (
SELECT
user_id,
total_revenue,
average_revenue,
total_transactions,
most_recent_purchase,
DATE_DIFF(MAX(most_recent_purchase) OVER(), most_recent_purchase, DAY) AS recency
FROM (
SELECT
user_pseudo_id AS user_id,
SUM(ecommerce.purchase_revenue) AS total_revenue,
ROUND(AVG(ecommerce.purchase_revenue), 2) AS average_revenue,
COUNT(DISTINCT ecommerce.transaction_id) AS total_transactions,
MAX(PARSE_DATE('%Y%m%d', event_date)) AS most_recent_purchase
FROM `enter.tablename_123456.events_*`
GROUP BY user_pseudo_id
)
)
);

Modifying the query for the business model

The query is built are for an e-commerce business model, but any business would want to better understand customer segmentation through the lens of recency, frequency and monetary (RFM). The context of the insights will change depending on the business model. (link to GitHub to download SQL code)

For publishers where the goal may be to get users to access the app and read content, RFM may be focused on an event related to content engagement so within the query the purchase event would need to be replaced with the content_engaged
- It will provide insights; the recency and the frequency of the app is being used which will relate to a monetary value of the content engaged with.
For a specific business model, it may require the user to log-in the user_pseudo_id can be replaced with user_id.

Visualization

Having a rich data set is the first step, then it’s all about visualizations to provide insights to those burning questions. There are many ways to build out visualization using Big Query, there is a connector to Looker Studio (Data Studio) or Python or Tableau all depending on the ecosystem in place.

(Data in the charts has been modified for this article)

This chart looks at the % revenue generated by the different segments.

This chart looks at the % of customers in the different segments.

Looking ahead

To provide richer insights and bring the RFM to life:

Understand the value of the discount codes used?
What is the gross profit per customer?
Using the RFM insights to tailor marketing messages in different ways from a user who is using discounts heavily v highly valued customer.

Being able to understand customer segmentation is part of the data puzzle for brands. This is only possible with good quality long data, a sophisticated Google Tag Manager + Google Analytics 4 set-up and a well thought out event tracking in place and connecting to Big Query. Most importantly is a solid strategy in place with the right business questions being asked.

Worked on this project with Jeffrey Ohene who built the SQL code in Big Query.

The integration of Google Analytics 4 and Big Query to understand customer segmentation with RFM

You May Also Like

Building A Strategic Maturity Model

A Pizza Data Story from survival to revival and growth

Re-Building the Google Martech foundations