Skip to main content

In the first article explained the process for an E-commerce subscription model brand to developing an analytics plan. The insights would be built around:

  • Cohort Analysis
  • Time Lag Analysis
  • A top level Customer Segmentation

To deliver the analytics plan it became about data collection, data processing and building the code for the insights.

Data Collection

The data collection was the critical which had many challenges. Collecting the right data and the right structured data was critical to be able to deliver the insights.

The data that was needed:

  • Customer ID
  • Order ID
  • Date of Order
  • Revenue
  • Type of Purchase
  • Source & Medium (if possible)

The ideal dataset would look like this:

Customer ID Order ID Date of Order Revenue Type of Purchase
ABC123 74561 01/02/2022 £75.00 Membership
ABC123 75120 10/02/2022 £90.00 Product

There were multiple martech vendors that housed a lot of rich data to help develop the insights.

  • Google Analytics: provides a marketing channel and campaign view.
  • Big Commerce: provides a customer, orders, and product view.
  • Net Suite: source of truth for all orders & inventory.

The Challenges

The key data points were housed within BigCommerce which was seen as the data source to use which had all the data variables required.

Example dataset provided:

customer_id last_order_num last_order_date last_order_amount attribution type
ABC123 74561 01/02/2022 £75.00 Search New
ABC421 75120 10/02/2022 £90.00 Campaign (UTM) Returning
ABC317 77890 11/02/2022 £55.00 Direct New
ABC864 73470 05/02/2022 £110.00 Referrer New
ABC904 75234 22/02/2022 £45.00 Search New
ABC652 77902 17/02/2022 £95.00 Campaign (UTM) Returning

The dataset that was provided covered the data that was needed:

  • customer_id = Customer ID
  • last_order_number = Order ID
  • last_order_date = Date of Order
  • last_order_amount = Revenue
  • type = Type of Purchase
  • attribution = Source & Medium

There was a thorough validation of the data provided which required:

  • Summarising the data provided to understand:
    • New customers by month
    • Revenue by month
    • Splitting the type of purchases by month
  • Cross checking the data provided v Business reports

What was found the data that was provided directly from BigCommerce did not provide the granularity required:

  • It did not provide a breakdown of each order the customer made
  • Not able to breakdown which order related to a membership or product purchase

The insights required was not possible with the structure of the dataset.

The Solution

With the migration from Magento to Big Commerce a script was created for every transactional made on the website collecting a variety of key data that was stored locally. It had collected the data required to deliver the insights.

The data that was provided:

  • Customer ID
  • Order ID
  • Order Date
  • Purchase Type

The data could be merged with Google Analytics and Net Suite.

With Net Suite housing the final orders, when merging the data, it allowed to remove any order that did not progress.

Merging the different datasets to get a single view before building the code to develop the insights.

Customer_ID Order_ID Order_Date Purchase_Type GA_Source-Medium NetSuite

_Revenue

Data Requirements

To build the right insights it required having the right data and through a process of understanding the data and the business the following data was excluded:

  • Customers who received free memberships or discounts on purchases
  • Customers who had no membership purchase but made product purchases (customer could only make a purchase by purchasing a membership)
  • A small subset of customers was not assigned to a membership purchase (did not have a customer_id)

This ensured the insights provided value without additional noise and it’s actionable insights.

Data Management

With no data warehouse in place one of the options was to store data in a PostgreSQL database. The data could easily be managed within excel as it was not hitting the 1m rows limit. Without adding to the process, it made sense to work with using the csv files which would be imported into R.

Building  in R

R was the preferred tool for this project, due to its versatility in statistical analysis, data manipulation, and visualization. Leveraging Tidyverse packages like dplyr and ggplot2, the code demonstrated R’s powerful toolkit, enhancing both readability and efficiency. The expressive syntax and seamless flow of data manipulation tasks, aided by the %>% operator, underscored R’s proficiency. Additionally, R’s strength in graphical representation, exemplified through ggplot2, proved instrumental in creating our customized visualizations. The use of lubridate for time-based analysis showcased R’s simplicity when handling time-related data. Functions like arrange, filter, and mutate facilitated effective data preparation and their readability were also crucial.

R’s suitability in statistical analysis, data manipulation and readability, made it a relatively straightforward selection.

Script Dependencies

Using R allowed to use dplyr data manipulation, cleaning, and analysis tasks. In addition, ggplot2 was used to create charts and lubridate to manage the dates for tables. Also used the options function to set scipen to 999 to disable scientific notation for our plots.

Building the Code

The process of building the code.

Sharing below the key steps and the R code.

Building the base

Firstly, merging the 3 datasets to get the table below to build the base which would then allow to build the necessary code for the Cohort Analysis, Time Lag + Customer Segmentation.

Customer_ID Order_ID Order_Date Purchase_Type GA_Source-Medium NetSuite

_Revenue

Merging the datasets, it would firstly require an inner join of Netsuite and Big Commerce tables to return only unique order_id’s in both tables then join with Google Analytics. Next step was to create a new customer_status variable that indicates what type of purchase a transaction is.

Creating 3 labels:

  • When a customer purchases a membership
  • When a customer makes a product purchase
  • When a customer renews membership
# DATA PROCESSING

# Column renaming
colnames(netsuite_transactions) <- c("netsuite_order_id", "netsuite_revenue")
colnames(google_analytics) <- c("ga_order_id", "ga_source_medium")
colnames(purchases) <- c("purchase_type", "customer_id", "order_id", "order_date")

str(netsuite_transactions)
str(google_analytics)
str(purchases)

# Convert order_date to date object
purchases$order_date <- as.Date(purchases$order_date, format = "%m/%d/%Y") 

# Filter out non-customers
customer_purchases <- purchases %>%
filter(!is.na(customer_id))

# Convert factors
google_analytics$ga_source_medium <- as.factor(google_analytics$ga_source_medium)
customer_purchases$purchase_type <- as.factor(customer_purchases$purchase_type)

# Plots

# Boxplot showing customers by source and medium
google_analytics %>%
group_by(ga_source_medium) %>%
summarize(
customers = n()
) %>%
ggplot(aes(
x = reorder(ga_source_medium, customers),
y = customers)) +
geom_col(fill="#FCB321", col = "grey20") +
labs(
title = "Boxplot of Customers by Source & Medium",
x = "Google Analytics Source & Medium",
y = "# of Customers"
) +
coord_flip() +
theme_minimal()

# Line Chart showing product purchases over time
plot_data <- customer_purchases %>%
filter(purchase_type == "Product Purchase") %>%
group_by(order_date) %>%
summarise(
transactions = n())

avg_purchases <- mean(plot_data$transactions)

ggplot(plot_data, aes(x = order_date, y = transactions)) +
geom_line(col = "#FA7903") +
geom_hline(yintercept = avg_purchases, linetype = "dashed", col = "grey10") + # Add average line
labs(title = "Product Purchases Over Time",
x = "Date",
y = "# of Purchases") +
theme_minimal() +
annotate("text", x = max(plot_data$order_date) + 90, y = avg_purchases,
label = paste("Avg. Purchases/Day:", round(avg_purchases, 2)), size = 2.7,
vjust = -0.5, hjust = 1, color = "grey10")

# Histogram showing Transaction Revenue distribution
netsuite_transactions %>%
filter(netsuite_revenue > 0) %>%
ggplot(aes(x = netsuite_revenue)) +
geom_histogram(bins = 30, col= "grey20", fill = "#CB2332") +
scale_x_log10() +
labs(
title = "Histogram Showing Revenue Distribution",
x = "Revenue (£)",
y = "# of Transactions"
) +
theme_minimal()

# Joining all tables
df <- inner_join(netsuite_transactions,
customer_purchases,
by = c("netsuite_order_id" = "order_id"))

df <- inner_join(df,
google_analytics,
by = c("netsuite_order_id" = "ga_order_id"))

#
df <- df %>%
select(customer_id, netsuite_order_id, order_date,
purchase_type, ga_source_medium, netsuite_revenue)

colnames(df)[2] <- "order_id"

head(df)

# Assign a customer status to each purchase type
df <- df %>%
arrange(customer_id, order_date) %>%
group_by(customer_id) %>%
mutate(customer_status = case_when(
purchase_type == "Membership For You" ~ "New",
purchase_type == "Individual Membership Renewal" ~ "Renewal",
purchase_type == "Product Purchase" & lag(purchase_type) %in%
c("Membership For You", "Individual Membership Renewal") ~ "New",
TRUE ~ "Returning"
)) %>%
ungroup()

head(df)

# Separate years
df_2021 <- df %>% filter(year(order_date) == "2021")
df_2022 <- df %>% filter(year(order_date) == "2022")

# Adding month and year column - 2021
df_2021$month <- month.name[month(df_2021$order_date)]
df_2021$year <- year(df_2021$order_date)

# Adding month and year column: 2022
df_2022$month <- month.name[month(df_2022$order_date)]
df_2022$year <- year(df_2022$order_date)

Overview of monthly key metrics

The insights would be done at a monthly level across so being able to provide a view of key metrics.

  • Total customers
  • Total revenue
  • Membership revenue
  • Number of customers who made a first purchase
  • Total revenue from first purchases
  • Average order value for first purchases
  • Number of customers who made a product purchase
  • Product revenue
  • Total transactions
  • Average order value
# MONTHLY METRICS TABLE

# Define the order of months
month_order <- c(
"January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"
)

# Monthly summary 2021
monthly_summary_2021 <- df_2021 %>%
group_by(month) %>%
summarise(
total_customers = n_distinct(customer_id),
total_revenue = sum(netsuite_revenue),
membership_revenue = sum(netsuite_revenue[purchase_type %in% c("Membership For You", "Individual Membership Renewal")]),
product_revenue = sum(netsuite_revenue[purchase_type == "Product Purchase"]),
aov = round(mean(netsuite_revenue[purchase_type == "Product Purchase"]), 2)
) %>%
mutate(month = factor(month, levels = month_order)) %>%
arrange(month)

# Transactions
transactions <- df_2021 %>%
filter(purchase_type == "Product Purchase") %>%
group_by(month) %>%
summarise(total_transactions = n())

# customers who made a first purchase
cust_first_purchases <- df_2021 %>%
filter(purchase_type == "Product Purchase" & customer_status == "New") %>%
group_by(month) %>%
summarise(
first_purchases = n(),
first_purchase_revenue = sum(netsuite_revenue),
aov_first_purchase = mean(netsuite_revenue)
)

# customers who made a purchase
cust_purchases <- df_2021 %>%
filter(purchase_type == "Product Purchase") %>%
group_by(month) %>%
summarise(
customer_purchases = n_distinct(customer_id)
)

# new customers
nouvel_clients <- df_2021 %>%
filter(customer_status == "New") %>%
group_by(month) %>%
summarise(
new_customers = n_distinct(customer_id)
)

# Total customers
tot_cust <- df_2021 %>%
group_by(month) %>%
summarise(
total_customers = n_distinct(customer_id)
)

# Join final table
monthly_summary_2021_metrics <- monthly_summary_2021 %>%
left_join(transactions, by = "month") %>%
left_join(cust_first_purchases, by = "month") %>%
left_join(cust_purchases, by = "month") %>%
left_join(nouvel_clients, by = "month") %>%
left_join(tot_cust, by = "month") %>%
select(
month, total_customers.x, total_revenue, membership_revenue,
first_purchases, first_purchase_revenue, aov_first_purchase,
customer_purchases,new_customers, product_revenue,
total_transactions
)

monthly_summary_2021_metrics

# Monthly summary 2022
monthly_summary_2022 <- df_2022 %>%
group_by(month) %>%
summarise(
total_customers = n_distinct(customer_id),
total_revenue = sum(netsuite_revenue),
membership_revenue = sum(netsuite_revenue[purchase_type %in% c("Membership For You", "Individual Membership Renewal")]),
product_revenue = sum(netsuite_revenue[purchase_type == "Product Purchase"]),
aov = round(mean(netsuite_revenue[purchase_type == "Product Purchase"]), 2)
) %>%
mutate(month = factor(month, levels = month_order)) %>%
arrange(month)

# Transactions
transactions <- df_2022 %>%
filter(purchase_type == "Product Purchase") %>%
group_by(month) %>%
summarise(total_transactions = n())

# customers who made a first purchase
cust_first_purchases <- df_2022 %>%
filter(purchase_type == "Product Purchase" & customer_status == "New") %>%
group_by(month) %>%
summarise(
first_purchases = n(),
first_purchase_revenue = sum(netsuite_revenue),
aov_first_purchase = mean(netsuite_revenue)
)

# customers who made a purchase
cust_purchases <- df_2022 %>%
filter(purchase_type == "Product Purchase") %>%
group_by(month) %>%
summarise(
customer_purchases = n_distinct(customer_id)
)

# new customers
nouvel_clients <- df_2022 %>%
filter(customer_status == "New") %>%
group_by(month) %>%
summarise(
new_customers = n_distinct(customer_id)
)

# Total customers
tot_cust <- df_2022 %>%
group_by(month) %>%
summarise(
total_customers = n_distinct(customer_id)
)

# Join final table
monthly_summary_2022_metrics <- monthly_summary_2022 %>%
left_join(transactions, by = "month") %>%
left_join(cust_first_purchases, by = "month") %>%
left_join(cust_purchases, by = "month") %>%
left_join(nouvel_clients, by = "month") %>%
left_join(tot_cust, by = "month") %>%
select(
month, total_customers.x, total_revenue, membership_revenue,
first_purchases, first_purchase_revenue, aov_first_purchase,
customer_purchases,new_customers, product_revenue,
total_transactions
)

monthly_summary_2022_metrics

Cohort Analysis

A membership lasts 12 months, if a customer signs up to be a member in January 2022 and renews in a 12 month upto period the renewal would be assigned back to January 2022 the month the customer signed up. The focus was on membership purchases to understand when customers acquired their memberships then understand if those memberships made a membership renewal.

Filtered out purchase_type from the data to isolate data related to membership purchases. Then organised the data by month and calculate the number of new customers who purchased memberships each month. These calculations are conducted separately for each month using a loop, resulting in a list of data frames representing new customers by month.

After obtaining the counts of new customers by month, combine the results into a single data frame for each year. Then organize the months factorially and arrange them in chronological order to ensure we have the standard jan-dec format.

Calculate the number of membership renewals for each month of the years. This is done by segmenting the data into different groups based on customer IDs and filtering out specific customer IDs as required. Subsequently, sum up the renewal counts for each month to obtain the total renewals.

With the renewal counts obtained then calculate the retention rate for each month by dividing the number of renewals by the total number of new customers for that month and multiplying the result by 100 to express it as a percentage.

# RETENTION AND CHURN

# Filter out product purchases to have membership purchase data
nouvelle <- df_2021 %>%
filter(!purchase_type == "Product Purchase")

months <- c("January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"
)

new_customers_by_month <- lapply(months, function(month) {
df_2021 %>%
filter(month == month & purchase_type == "Membership For You") %>%
group_by(month) %>%
summarise(new_customers = n())
})

# Combine the results into a single data frame
new_customers_df <- new_customers_by_month[[1]]

new_customers_df

new_customers_df <- new_customers_df %>%
mutate(month = factor(month,
levels = c("January", "February", "March", "April",
"May", "June", "July", "August",
"September", "October", "November",
"December"))) %>%
arrange(month)

# Renewals

jan_renewals <- nouvelle %>%
filter(customer_id <= 209 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

feb_renewals <- nouvelle %>%
filter(customer_id >= 210 & customer_id <= 291 &
purchase_type == "Individual Membership Renewal" &
!customer_id %in% c(277, 279)) %>% # 279 and 277
group_by(month) %>%
summarise(renewals = n())

march_renewals <- nouvelle %>%
filter(customer_id >= 293 & customer_id <= 437 &
purchase_type == "Individual Membership Renewal" &
!customer_id %in% c(303, 413, 425, 431, 434)) %>% # 303, 413, 425, 431, 434
group_by(month) %>%
summarise(renewals = n())

apr_renewals <- nouvelle %>%
filter(customer_id >= 438 & customer_id <= 609 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

may_renewals <- nouvelle %>%
filter(customer_id >= 610 & customer_id <= 893 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

june_renewals <- nouvelle %>%
filter(customer_id >= 894 & customer_id <= 1187 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

july_renewals <- nouvelle %>%
filter(customer_id >= 1188 & customer_id <= 1372 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

aug_renewals <- nouvelle %>%
filter(customer_id >= 1373 & customer_id <= 1602 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

sep_renewals <- nouvelle %>%
filter(customer_id >= 1603 & customer_id <= 1849 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

oct_renewals <- nouvelle %>%
filter(customer_id >= 1850 & customer_id <= 2096 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

nov_renewals <- nouvelle %>%
filter(customer_id >= 2097 & customer_id <= 2612 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

dec_renewals <- nouvelle %>%
filter(customer_id >= 2613 & customer_id <= 3288 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

# Calculate renewals then assign to months
jan_renewals <- sum(jan_renewals$renewals)
feb_renewals <- sum(feb_renewals$renewals)
march_renewals <- sum(march_renewals$renewals)
apr_renewals <- sum(apr_renewals$renewals)
may_renewals <- sum(may_renewals$renewals)
june_renewals <- sum(june_renewals$renewals)
july_renewals <- sum(july_renewals$renewals)
aug_renewals <- sum(aug_renewals$renewals)
sep_renewals <- sum(sep_renewals$renewals)
oct_renewals <- sum(oct_renewals$renewals)
nov_renewals <- sum(nov_renewals$renewals)
dec_renewals <- sum(dec_renewals$renewals)

renewals <- c(jan_renewals, feb_renewals, march_renewals, apr_renewals,
may_renewals, june_renewals, july_renewals, aug_renewals,
sep_renewals, oct_renewals, nov_renewals, dec_renewals)

new_customers_df$renewals <- renewals

retention_churn_2021 <- new_customers_df %>%
mutate(
retention_rate = round((renewals/new_customers * 100), 2)
)

retention_churn_2021

# 2022
# Filter out product purchases to have membership purchase data

nouvelle <- df_2022 %>%
filter(!purchase_type == "Product Purchase")

months <- c("January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December"
)

new_customers_by_month <- lapply(months, function(month) {
df_2022 %>%
filter(month == month & purchase_type == "Membership For You") %>%
group_by(month) %>%
summarise(new_customers = n())
})

# Combine the results into a single data frame
new_customers_df <- new_customers_by_month[[1]]

new_customers_df

new_customers_df <- new_customers_df %>%
mutate(month = factor(month,
levels = c("January", "February", "March", "April",
"May", "June", "July", "August",
"September", "October", "November",
"December"))) %>%
arrange(month)

# Renewals

jan_renewals <- nouvelle %>%
filter(customer_id <= 3565 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

feb_renewals <- nouvelle %>%
filter(customer_id >= 3566 & customer_id <= 3786 &
purchase_type == "Individual Membership Renewal" &
!customer_id %in% c(277, 279)) %>% # 279 and 277
group_by(month) %>%
summarise(renewals = n())

march_renewals <- nouvelle %>%
filter(customer_id >= 3787 & customer_id <= 4059 &
purchase_type == "Individual Membership Renewal" &
!customer_id %in% c(303, 413, 425, 431, 434)) %>% # 303, 413, 425, 431, 434
group_by(month) %>%
summarise(renewals = n())

apr_renewals <- nouvelle %>%
filter(customer_id >= 4060 & customer_id <= 4251 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

may_renewals <- nouvelle %>%
filter(customer_id >= 4252 & customer_id <= 4420 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

june_renewals <- nouvelle %>%
filter(customer_id >= 4421 & customer_id <= 4584 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

july_renewals <- nouvelle %>%
filter(customer_id >= 4585 & customer_id <= 4724 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

aug_renewals <- nouvelle %>%
filter(customer_id >= 4725 & customer_id <= 4878 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

sep_renewals <- nouvelle %>%
filter(customer_id >= 4879 & customer_id <= 5010 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

oct_renewals <- nouvelle %>%
filter(customer_id >= 5011 & customer_id <= 5155 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

nov_renewals <- nouvelle %>%
filter(customer_id >= 5156 & customer_id <= 5586 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

dec_renewals <- nouvelle %>%
filter(customer_id >= 5587 & customer_id <= 6051 &
purchase_type == "Individual Membership Renewal") %>%
group_by(month) %>%
summarise(renewals = n())

Calculate renewals then assign to months
jan_renewals <- sum(jan_renewals$renewals)
feb_renewals <- sum(feb_renewals$renewals)
march_renewals <- sum(march_renewals$renewals)
apr_renewals <- sum(apr_renewals$renewals)
may_renewals <- sum(may_renewals$renewals)
june_renewals <- sum(june_renewals$renewals)
july_renewals <- sum(july_renewals$renewals)
aug_renewals <- sum(aug_renewals$renewals)
sep_renewals <- sum(sep_renewals$renewals)
oct_renewals <- sum(oct_renewals$renewals)
nov_renewals <- sum(nov_renewals$renewals)
dec_renewals <- sum(dec_renewals$renewals)

renewals <- c(jan_renewals, feb_renewals, march_renewals, apr_renewals,
may_renewals, june_renewals, july_renewals, aug_renewals,
sep_renewals, oct_renewals, nov_renewals, dec_renewals)

new_customers_df$renewals <- renewals

retention_churn_2022 <- new_customers_df %>%
mutate(
retention_rate = round((renewals/new_customers * 100), 2)
)

retention_churn_2022

Customer Segmentation

Product purchasing behaviour would be segmented into the following buckets:

No of Items Customer Segment
7+ Cannot lose them
6 Champions
4-5 Loyal Customers
2-3 Needs Attention
1 At Risk
0 (signed up member but not purchased) Hibernating

In addition, created a category called ‘Other’ for any cases falling out of the criteria set.

Filtered for data frames to include entries where purchase_type = ‘Product Purchase’.

  • Group the data be customer_id and calculate count of purchases made by each customer and total revenue which aggregates the revenue generated by each purchase
  • Assign each customer to a segment based on the number of transactions using a case_when statement. It was grouped as per the table above
# SEGMENTATION

# Script dependencies
library(dplyr)
library(ggplot2)
library(lubridate)

options(scipen = 999)

# Import datasets
purchases <- read.csv("data/All Data Tables/Customer ID and Order ID - Final_alt.csv")
netsuite_transactions <- read.csv("data/All Data Tables/NetSuite_Transactions.csv")
google_analytics <- read.csv("data/All Data Tables/GoogleAnalytics_Transactions.csv")

# Data Preprocessing
head(netsuite_transactions)
head(google_analytics)
head(purchases)

#purchases$order_date <- as.Date(purchases$order_date, "%d/%m/%y")
purchases$order_date <- as.Date(purchases$order_date, format = "%m/%d/%Y")

# Column renaming
colnames(netsuite_transactions) <- c("netsuite_order_id", "netsuite_revenue")
colnames(google_analytics) <- c("ga_order_id", "ga_source_medium")
colnames(purchases) <- c("purchase_type", "customer_id", "order_id", "order_date")

customer_purchases <- purchases %>%
filter(!is.na(customer_id))

# Joining tables
df <- inner_join(netsuite_transactions,
customer_purchases,
by = c("netsuite_order_id" = "order_id"))

df <- inner_join(df,
google_analytics,
by = c("netsuite_order_id" = "ga_order_id"))

#
df <- df %>%
select(customer_id, netsuite_order_id, order_date,
purchase_type, ga_source_medium, netsuite_revenue)

colnames(df)[2] <- "order_id"

################################################################################

# new returning

df <- df %>%
arrange(customer_id, order_date) %>%
group_by(customer_id) %>%
mutate(customer_status = case_when(
purchase_type == "Membership For You" ~ "New",
purchase_type == "Individual Membership Renewal" ~ "Renewal",
purchase_type == "Product Purchase" & lag(purchase_type) == "Membership For You" ~ "New",
TRUE ~ "Returning"
)) %>%
ungroup()

# Separate years
df_2021 <- df %>% filter(year(order_date) == "2021")
df_2022 <- df %>% filter(year(order_date) == "2022")

# Adding month and year column - 2021
df_2021$month <- month.name[month(df_2021$order_date)]
df_2021$year <- year(df_2021$order_date)

# Adding month and year column: 2022
df_2022$month <- month.name[month(df_2022$order_date)]
df_2022$year <- year(df_2022$order_date)

# SEGMENTS: 2021
segments <- df_2021 %>%
filter(purchase_type == "Product Purchase") %>%
group_by(customer_id, order_date) %>%
summarise(
number_bottles = n(),
total_revenue = sum(netsuite_revenue)
)

segments <- segments %>%
mutate(
segment = case_when(
number_bottles == 0 ~ "Hibernating",
number_bottles == 1 ~ "At Risk",
number_bottles >= 2 & number_bottles <= 3 ~ "Needs Attention",
number_bottles >= 4 & number_bottles <= 5 ~ "Loyal Customers",
number_bottles == 6 ~ "Champions",
number_bottles >= 7 ~ "Cannot Lose Them",
TRUE ~ "Other" # Add a catch-all condition for cases not covered
)
)

segment_summary_2021 <- segments %>%
group_by(segment) %>%
summarise(
total_customers = n(),
total_revenue = sum(total_revenue) # Calculate total revenue for each segment
) %>%
mutate(
percentage_of_revenue = round(total_revenue / sum(total_revenue) * 100, 2),
percentage_of_customers = round(total_customers / sum(total_customers) * 100, 2)
)

segment_summary_2021

# SEGMENTS: 2022
segments <- df_2022 %>%
filter(purchase_type == "Product Purchase") %>%
group_by(customer_id) %>%
summarise(
number_bottles = n(),
total_revenue = sum(netsuite_revenue)
)

segments <- segments %>%
mutate(
segment = case_when(
number_bottles == 0 ~ "Hibernating",
number_bottles == 1 ~ "At Risk",
number_bottles >= 2 & number_bottles <= 3 ~ "Needs Attention",
number_bottles >= 4 & number_bottles <= 5 ~ "Loyal Customers",
number_bottles == 6 ~ "Champions",
number_bottles >= 7 ~ "Cannot Lose Them",
TRUE ~ "Other" # Add a catch-all condition for cases not covered
)
)

segment_summary_2022 <- segments %>%
group_by(segment) %>%
summarise(
total_customers = n(),
total_revenue = sum(total_revenue) # Calculate total revenue for each segment
) %>%
mutate(
percentage_of_revenue = round(total_revenue / sum(total_revenue) * 100, 2),
percentage_of_customers = round(total_customers / sum(total_customers) * 100, 2)
)

segment_summary_2022

Time Lag

To understand customer behaviour over time looked at time lag for first and second purchases, the days taken to make a purchase + total revenue + average order value grouped by each day.

Tracking the first purchase – from membership to first purchase

Filtering the dataset include only entries where the purchase_type is either “Membership For You” or “Product Purchase”, as we want to calculate time lag between when customers first bought a membership and when they made their first product purchase.

Then group the filtered data by customer_id, ensuring that the subsequent calculations are performed for each unique customer. The data is arranged by customer_id and order_date to guarantee chronological order within each customer’s membership purchase and subsequent purchases. Subsequently, for each customer, we calculate the time difference in days between their first product purchase and their first membership purchase. This time difference, stored as days_to_first_product, provides insight into how quickly or slowly customers make product purchases after the initial membership purchase.

Also had to take into account problematic customer_id values which made product purchases without ever purchasing membership by filtering out any entries where days to first purchase was not a finite value. After this filtering step grouped the remaining data by days_to_first_product and count the number of customers by bucketing them into each distinct time lag.

Tracking the second purchase – from first purchase to second purchase

Tracking days to second purchase after first purchase, took a different route. Focusing on analysing time taken by customers after making a first purchase to make their second purchase, examining this phenomenon across the different years. The analysis begins by sorting the data by customer_id and order_date to ensure proper sequencing of customer transactions.

Next filtered the data to include only entries where the purchase type is “Product Purchase”, indicating subsequent purchases made by customers. Then group the data by customer_id and select the second purchase for each customer using the row_number() function. For each customer calculated the time difference between their first and second purchases, denoted as time_to_second_purchase. Additionally computed metrics such as the average revenue per purchase and the total revenue generated from these second purchases.

Subsequently aggregated the results for each year, grouping the data by time_to_second_purchase. Then calculated the total number of customers who made their second purchase within each time interval, along with the average revenue and total revenue associated with these purchases.

# Time Lag

#---------------- TIME LAG FROM MEMBERSHIP PURCHASE TO FRIST PURCHASE----------#
days <- df_2021 %>%
filter(purchase_type %in% c("Membership For You", "Product Purchase")) %>%
group_by(customer_id) %>%
arrange(customer_id, order_date) %>%
summarise(
days_to_first_product = as.numeric(
difftime(
min(order_date[purchase_type == "Product Purchase"]),
min(order_date[purchase_type == "Membership For You"]),
units = "days"
)
)
)

filtered_days <- days[is.finite(days$days_to_first_product), ]

days_taken_2021 <- filtered_days %>%
group_by(days_to_first_product) %>%
summarise(count = n())

days_taken_2021 <- days_taken_2021[-1,]
days_taken_2021

# Time Lag: 2022
new_cust <- df_2022 %>%
filter(customer_status == "New") %>%
group_by(month) %>%
summarise(
new_customers = n()
)

# Days taken
# Problematic Id's
days <- df_2022 %>%
filter(purchase_type %in% c("Membership For You", "Product Purchase")) %>%
group_by(customer_id) %>%
arrange(customer_id, order_date) %>%
summarise(
days_to_first_product = as.numeric(
difftime(
min(order_date[purchase_type == "Product Purchase"]),
min(order_date[purchase_type == "Membership For You"]),
units = "days"
)
)
)

filtered_days <- days[is.finite(days$days_to_first_product), ]

days_taken_2022 <- filtered_days %>%
group_by(days_to_first_product) %>%
summarise(count = n()) %>%
arrange(days_to_first_product) %>%
slice(5:n())

days_taken_2022

#----------------- TIME LAG TO SECOND PURCHASE --------------------------------#

# Days taken to second purchase
# Sort the data frame by customer_id and order_date

days_2021 <- df_2021 %>%
filter(purchase_type == "Product Purchase") %>%
arrange(customer_id, order_date) %>%
group_by(customer_id) %>%
filter(row_number() == 2) %>%
summarise(time_to_second_purchase = min(order_date) - min(df_2021$order_date),
avg_revenue = mean(netsuite_revenue[purchase_type == "Product Purchase"]),
total_revenue = sum(netsuite_revenue[purchase_type == "Product Purchase"]))

days_cust <- days_2021 %>%
group_by(time_to_second_purchase) %>%
summarise(
total_customers = n(),
avg_revenue = mean(avg_revenue),
total_revenue = sum(total_revenue)
)

# Sort the data frame by customer_id and order_date
days_2022 <- df_2022 %>%
filter(purchase_type == "Product Purchase") %>%
arrange(customer_id, order_date) %>%
group_by(customer_id) %>%
filter(row_number() == 2) %>%
summarise(time_to_second_purchase = min(order_date) - min(df_2021$order_date),
avg_revenue = mean(netsuite_revenue[purchase_type == "Product Purchase"]),
total_revenue = sum(netsuite_revenue[purchase_type == "Product Purchase"]))

days_cust <- days_2022 %>%
group_by(time_to_second_purchase) %>%
summarise(
total_customers = n(),
avg_revenue = mean(avg_revenue),
total_revenue = sum(total_revenue)
)

Worked on this project with Jeffrey Ohene who built the code in R