We worked with an e-commerce bakery that had run a physical store for 15 years. With consumer behaviours changing, they launched an e-commerce website in 2021, and in 2023 they repositioned as an e-commerce-only bakery: consumers could no longer dine in, and the store became a kitchen hub for all the baking. Any leftovers at the end of the day were sold to local consumers or distributed to shelters.
The bakery launched a new e-commerce website with a subscription-based model and needed help understanding their analytics and getting the best out of Google Analytics 4. With plans for rapid growth and a data-savvy team, they took advantage of the BigQuery integration to future-proof their data. Previously they had drawn plenty of good insights from GA3, but also found it a bit of a bottleneck, as they were unable to cut and slice the data in different ways to enrich those insights.
With a subscription model offering three different packages, one of the key questions they wanted to answer was the split between new v returning customers. GA3 did not provide a great view of new v returning.
(Image from Google Merch Shop GA3 demo account)
Some of the key questions the bakery wanted to answer:
- What is the split between new v returning customers?
- What is the impact of campaigns and seasonality on driving new v returning customers?
- Which channels are driving new v returning customers?
- What is the contribution of revenue between new v returning customers?
- What is the frequency of purchases?
This is where the Google Analytics 4 event model and the integration with BigQuery were invaluable: together they answer both the split of new v returning customers and the frequency of purchase. The query is built around the event_name = purchase event (triggered by a tag built in Google Tag Manager). The core components of the query:
- user_pseudo_id – generated when the user first visits the website; the same user_pseudo_id is recognized when the user visits the site again from the same device and browser.
- event_date – the date on which the event was logged.
- order_id – the unique order ID of the customer’s purchase.
- purchase_revenue – the revenue generated for the purchase associated with the order_id.
The output includes a column called customer_type, which is the key column for calculating new v returning status and frequency of purchase.
To enrich the data with greater context, we also added:
- source – name of the traffic source, e.g., facebook
- medium – name of the medium, e.g., paidsocial
- name – name of the campaign, e.g., summer2024
The caveat: a customer labelled “new” may not necessarily be making their first purchase with the brand, as user_pseudo_id only identifies a device and browser within the GA4 tracking period.
Building the Query
To understand new v returning and frequency of purchase, we constructed a SQL query that uses a Common Table Expression (CTE) named “first” to extract the pertinent data. The query looks only at events labelled ‘purchase’, enabling us to zero in on purchase-related activity.
Query 1 – New v Returning
We have identified and categorized customers into two primary segments: “new customers” and “returning customers.”
- Our main query processes the results from the “first” CTE.
- It selects all columns from the CTE except time_stamp.
- It uses the dense_rank() window function, partitioned by user_pseudo_id and ordered by time_stamp, to identify each customer’s earliest purchase.
- A case statement assigns the label “new customer” to rank 1 (first purchase) and “returning customer” to ranks greater than 1 (subsequent purchases).
```sql
with first as (
  select
    parse_date('%Y%m%d', event_date) as order_date,
    timestamp_micros(event_timestamp) as time_stamp,
    user_pseudo_id,
    ecommerce.transaction_id as order_id,
    ecommerce.purchase_revenue as revenue,
    traffic_source.source as source,
    traffic_source.name as campaign,
    traffic_source.medium as medium
  from `enter.tablename_123456.events_*`
  where event_name = 'purchase'
  group by 1,2,3,4,5,6,7,8
),
second as (
  select
    *,
    count(order_id) as transactions,
    case
      when dense_rank() over (partition by user_pseudo_id order by time_stamp) = 1
        then 'new customer'
      else 'returning customer'
    end as customer_type
  from first
  group by 1,2,3,4,5,6,7,8
)
select * except(time_stamp)
from second
order by order_date
```
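To see how the case statement behaves, here is a standalone toy version of the labelling step run over three hard-coded rows (the inline data stands in for the “first” CTE; the values are illustrative only):

```sql
-- Toy example (BigQuery syntax): hard-coded rows replace the "first" CTE
with first as (
  select date '2024-01-05' as order_date, timestamp '2024-01-05 10:00:00' as time_stamp,
         'user_a' as user_pseudo_id, 'T001' as order_id, 25.0 as revenue
  union all
  select date '2024-02-10', timestamp '2024-02-10 09:30:00', 'user_a', 'T002', 40.0
  union all
  select date '2024-02-12', timestamp '2024-02-12 14:15:00', 'user_b', 'T003', 15.0
)
select
  user_pseudo_id,
  order_id,
  -- rank each user's purchases chronologically; rank 1 = earliest purchase
  case
    when dense_rank() over (partition by user_pseudo_id order by time_stamp) = 1
      then 'new customer'
    else 'returning customer'
  end as customer_type
from first
order by user_pseudo_id, order_id
```

Each user’s earliest order (T001, T003) is ranked 1 and labelled new, while user_a’s second order (T002) is ranked 2 and labelled returning.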
Query 2 – Frequency of Purchase
We have identified and categorized customers into four primary segments: “first purchase”, “second purchase”, “repeat purchase”, “recurring customer”.
How frequency of purchase is categorized can be adapted to the specific business model.
- Our main query operates on the “first” CTE.
- It retrieves all columns except time_stamp.
- It uses the dense_rank() window function, partitioned by user_pseudo_id and ordered by time_stamp. This also ranks multiple purchases made on the same date.
- A case statement assigns labels based on purchase frequency:
- ‘first purchase’ for a customer’s first purchase.
- ‘second purchase’ for the second purchase.
- ‘repeat purchase’ for the third to fifth purchases.
- ‘recurring customer’ for purchase frequencies above five.
```sql
with first as (
  select
    parse_date('%Y%m%d', event_date) as order_date,
    timestamp_micros(event_timestamp) as time_stamp,
    user_pseudo_id,
    ecommerce.transaction_id as order_id,
    ecommerce.purchase_revenue as revenue,
    traffic_source.source as source,
    traffic_source.name as campaign,
    traffic_source.medium as medium
  from `enter.tablename_123456.events_*`
  where event_name = 'purchase'
  group by 1,2,3,4,5,6,7,8
)
select
  * except(time_stamp),
  case
    when dense_rank() over (partition by user_pseudo_id order by time_stamp) = 1 then 'first purchase'
    when dense_rank() over (partition by user_pseudo_id order by time_stamp) = 2 then 'second purchase'
    when dense_rank() over (partition by user_pseudo_id order by time_stamp) between 3 and 5 then 'repeat purchase'
    else 'recurring customer'
  end as customer_type
from first
order by order_date
```
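Once either query is saved as a view or table, the segments can be rolled up for charting. A sketch of a monthly revenue split by customer_type (the `customer_frequency` table name is a placeholder for wherever the query output is stored):

```sql
-- Reporting sketch: revenue split by customer_type per month.
-- `project.dataset.customer_frequency` is a placeholder name.
select
  format_date('%Y-%m', order_date) as order_month,
  customer_type,
  count(distinct order_id) as orders,
  sum(revenue) as total_revenue,
  -- each segment's share of that month's revenue
  round(100 * sum(revenue)
    / sum(sum(revenue)) over (partition by format_date('%Y-%m', order_date)), 1)
    as pct_of_month_revenue
from `project.dataset.customer_frequency`
group by 1, 2
order by 1, 2
```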
For a project focused on customer segmentation, Common Table Expressions (CTEs) offer a practical and efficient way to extract, transform, and categorize customer data. CTEs break complex queries into manageable steps, improving the readability and maintainability of the code. The “first” CTE in our implementation streamlines the extraction of purchase-related data before customers are classified into distinct segments.
Alternatives such as subqueries or temporary tables can achieve similar outcomes, but CTEs stand out for a few reasons. First, they enhance clarity by logically separating the data-extraction and transformation steps within the query, making the code easier to understand, troubleshoot, and modify. They also eliminate the need to create and manage temporary tables, reducing complexity and minimizing potential data-consistency issues.
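As an illustration, here is the same new-v-returning logic written with an inline subquery instead of the “first” CTE; it is functionally equivalent, but the extraction step is buried inside the outer query:

```sql
-- Subquery version of Query 1: same logic, extraction step nested inline
select
  order_date,
  user_pseudo_id,
  order_id,
  revenue,
  case
    when dense_rank() over (partition by user_pseudo_id order by time_stamp) = 1
      then 'new customer'
    else 'returning customer'
  end as customer_type
from (
  select
    parse_date('%Y%m%d', event_date) as order_date,
    timestamp_micros(event_timestamp) as time_stamp,
    user_pseudo_id,
    ecommerce.transaction_id as order_id,
    ecommerce.purchase_revenue as revenue
  from `enter.tablename_123456.events_*`
  where event_name = 'purchase'
  group by 1,2,3,4,5
)
order by order_date
```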
CTEs can also be lighter-weight than temporary tables, as nothing is materialized and the database optimizes the whole statement as a single unit. Their concise syntax promotes reuse and simplifies maintenance: the “first” CTE prepares the dataset once, so alternative categorizations, such as new v returning or frequency of purchase, only require a different outer query.
By leveraging CTEs we keep the code efficient and lay a foundation for future scalability and adaptability, helping us derive meaningful insights, optimize marketing strategies, and drive business growth.
Modifying the query for the business model
The two queries were built for an e-commerce business model, but any business model focused on acquisition and retention will want to understand new v returning customers and frequency of purchase. The context of the insights will change depending on the business model. (link to GitHub to download SQL code)
- For lead generation, the key action may be a user completing a registration form, so within the query the purchase event would be replaced with the registration event.
- This shows how many of those registrations come from new v returning users.
- It also shows how many times a registration has been completed against a given user_pseudo_id.
- For business models where the user logs in, user_pseudo_id can be replaced with user_id.
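A sketch of that lead-generation variant, assuming GA4’s recommended sign_up event is being collected (the event name and segment labels are placeholders to adapt to your setup):

```sql
-- Lead-gen adaptation sketch: swap 'purchase' for a registration event.
-- 'sign_up' is GA4's recommended event name; adjust to the actual tag.
with first as (
  select
    parse_date('%Y%m%d', event_date) as registration_date,
    timestamp_micros(event_timestamp) as time_stamp,
    user_pseudo_id
  from `enter.tablename_123456.events_*`
  where event_name = 'sign_up'
)
select
  * except(time_stamp),
  case
    when dense_rank() over (partition by user_pseudo_id order by time_stamp) = 1
      then 'new registrant'
    else 'returning registrant'
  end as registrant_type
from first
order by registration_date
```

Where users log in, swap user_pseudo_id for user_id in both the select list and the partition clause.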
Having a rich data set is the first step; then it’s all about visualizations that answer those burning questions. There are many ways to build visualizations on top of BigQuery, whether through the connector to Looker Studio (formerly Data Studio), Python, or Tableau, depending on the ecosystem in place.
(Data in the charts has been modified for this article)
This chart looks at the % revenue driven by new v returning customers by month.
This chart looks at the frequency of purchase and how the revenue is split.
This is an example of the kind of insights possible, but there is a wealth of data available that can be cut and sliced in different ways to answer the business questions.
Being able to understand the split of new v returning customers and frequency of purchase is part of the data puzzle for brands. It is only possible with good-quality data collected over a long period, a sophisticated Google Tag Manager + Google Analytics 4 set-up with well-thought-out event tracking, and a connection to BigQuery. Most important of all is a solid strategy with the right business questions being asked.
Worked on this project with Jeffrey Ohene, who built the SQL code in BigQuery.