
I worked with a brand in the wellbeing and health space that had recently re-branded and launched a new e-commerce website. Previously, the insights from its analytics (Google Analytics 3) were useful, but they could not answer key business and marketing questions. The new website was the perfect time to go live with Google Analytics 4, and the integration with BigQuery is a game changer.

One of the questions the brand struggled to answer was the time lag between a user's first visit to the site and their first purchase. Google Analytics 3 provided a time lag report, which was useful at best, but it did not provide any actionable insights or great visualizations.

(Image from Google Merch Shop GA3 demo account)

Some of the key questions the brand wanted to answer:

  • Which channels are driving the first visits that result in a first purchase?
  • From the first visit, how long does it take to make a first purchase?
  • How do seasonality or campaign periods impact first visit to first purchase?
  • What is the contribution to revenue, e.g., for February 2023, first-visit/first-purchase revenue versus total revenue?

This is where the Google Analytics 4 event model and the integration with BigQuery were invaluable in answering the question of time lag from first visit to purchase. The core components of the query were:

  • user_pseudo_id – generated when the user first visits the website; the user_pseudo_id is recognized when the user visits the site again from the same device and browser
  • event_date – the date on which the event was logged
  • first_visit – an event generated by Google Analytics 4 when someone visits the website for the first time
  • purchase – an event generated when someone buys a product, triggered by a tag built in Google Tag Manager.

This generates an output with a column called days_to_purchase, which is the key column in calculating how many days it took a user to make their first purchase after their first visit.
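The calculation itself uses BigQuery's DATE_DIFF function; a minimal illustration with hard-coded dates:

```sql
-- DATE_DIFF(later_date, earlier_date, DAY) counts the day
-- boundaries between the two dates.
SELECT DATE_DIFF(DATE '2023-02-10', DATE '2023-02-01', DAY) AS days_to_purchase;
-- returns 9
```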

To enrich the data and provide greater context, we also added:

• traffic_source.source – name of the traffic source, e.g., facebook
• traffic_source.medium – name of the medium, e.g., paidsocial
• traffic_source.name – name of the campaign, e.g., summer2024
    • ecommerce.purchase_revenue – the revenue generated for that purchase

One caveat: the first visit may not necessarily be the user's first touch point with the brand, and the first purchase may not necessarily be the user's first purchase.

    Building the Query

    There are two ways to build the query.

Query 1 – Use subqueries to work out how many days pass before a first purchase is made after the same user first visits our site.

The SQL code uses subqueries to extract and join the relevant data to find out how many days it took individual users to make their first purchase from their first visit date. It retrieves the first_visit and purchase events separately from the dataset, then joins them on user_pseudo_id. Joining this way correlates each user's first visit with their subsequent purchases, so the time difference between the two events can be calculated, providing insight into user behaviour and purchase patterns.

SELECT
  first_visit_date,
  first_visit_campaign,
  first_visit_medium,
  first_visit_source,
  'first_visit' AS event_name,
  user_pseudo_id,
  purchase_date,
  'purchase' AS second_event,
  revenue,
  DATE_DIFF(purchase_date, first_visit_date, DAY) AS days_to_purchase
FROM (
  SELECT
    first_visit_date,
    user_pseudo_id,
    first_visit_campaign,
    first_visit_medium,
    first_visit_source,
    purchase_date,
    revenue,
    ROW_NUMBER() OVER (
      PARTITION BY user_pseudo_id
      ORDER BY DATE_DIFF(purchase_date, first_visit_date, DAY)
    ) AS rn
  FROM (
    SELECT
      PARSE_DATE('%Y%m%d', event_date) AS first_visit_date,
      user_pseudo_id,
      traffic_source.name AS first_visit_campaign,
      traffic_source.medium AS first_visit_medium,
      traffic_source.source AS first_visit_source
    FROM `enter.tablename_123456.events_*`
    WHERE event_name = 'first_visit'
  ) AS first_event
  LEFT JOIN (
    SELECT
      PARSE_DATE('%Y%m%d', event_date) AS purchase_date,
      user_pseudo_id AS id,
      ecommerce.purchase_revenue AS revenue
    FROM `enter.tablename_123456.events_*`
    WHERE event_name = 'purchase'
  ) AS purchase
  ON first_event.user_pseudo_id = purchase.id
  WHERE purchase_date IS NOT NULL
) AS joint
WHERE rn = 1
ORDER BY days_to_purchase DESC;
    

Query 2 – The same result can be achieved with CTEs, short for Common Table Expressions, in place of the nested subqueries.

1. The first CTE retrieves the data for the first_visit event.
2. The second CTE retrieves the data for the purchase event.
3. The third CTE performs the left join, filters out rows where the purchase date is null, and ranks each user's purchases by days to purchase.
4. The final query then calculates days_to_purchase, selects the desired columns, and keeps only the first purchase (rn = 1) for each user.

WITH first_event AS (
  SELECT
    PARSE_DATE('%Y%m%d', event_date) AS first_visit_date,
    user_pseudo_id,
    traffic_source.name AS first_visit_campaign,
    traffic_source.medium AS first_visit_medium,
    traffic_source.source AS first_visit_source
  FROM `enter.tablename_123456.events_*`
  WHERE event_name = 'first_visit'
),
purchase AS (
  SELECT
    PARSE_DATE('%Y%m%d', event_date) AS purchase_date,
    user_pseudo_id AS id,
    ecommerce.purchase_revenue AS revenue
  FROM `enter.tablename_123456.events_*`
  WHERE event_name = 'purchase'
),
joint AS (
  SELECT
    first_visit_date,
    user_pseudo_id,
    first_visit_campaign,
    first_visit_medium,
    first_visit_source,
    purchase_date,
    revenue,
    ROW_NUMBER() OVER (
      PARTITION BY user_pseudo_id
      ORDER BY DATE_DIFF(purchase_date, first_visit_date, DAY)
    ) AS rn
  FROM first_event
  LEFT JOIN purchase
    ON user_pseudo_id = id
  WHERE purchase_date IS NOT NULL
)
SELECT
  first_visit_date,
  first_visit_campaign,
  first_visit_medium,
  first_visit_source,
  'first_visit' AS event_name,
  user_pseudo_id,
  purchase_date,
  'purchase' AS second_event,
  revenue,
  DATE_DIFF(purchase_date, first_visit_date, DAY) AS days_to_purchase
FROM joint
WHERE rn = 1
ORDER BY days_to_purchase DESC;
    

    The query organizes and combines data from two events — first_visit and purchase — to help analyze customer behavior on a website. It uses three CTEs to accomplish this.

    1. The first CTE selects and assigns data from the first_visit event for later use.
    2. The second CTE selects and assigns data from the purchase event for later use.
    3. The third CTE joins the data from the two events together using the user_pseudo_id and assigns row numbers based on the time between the first visit and first purchase.

    The query then selects the relevant columns from the third CTE, including the date of the first_visit, campaign name, the medium, the source of the traffic, user_pseudo_id, date of the first purchase, revenue generated and the number of days between the two events.  

    The window function is used to assign a unique rank to each row within a group defined by user_pseudo_id and ordered by days_to_purchase. The syntax partition by user_pseudo_id indicates that the ranking should be calculated separately for each unique user_pseudo_id.

The purpose of the window function in the joint CTE is to pick out the earliest purchase for each unique user_pseudo_id. The WHERE clause filters out all rows where purchase_date is null, and the rn = 1 condition selects only the row with the lowest days_to_purchase within each group. This means that for each user_pseudo_id, the query selects that user's first purchase, if one exists.
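The behaviour of this window function can be seen with a few hard-coded rows (hypothetical users A and B):

```sql
-- User A has two purchases (3 and 10 days after the first visit),
-- user B has one (7 days); ROW_NUMBER ranks them within each user.
SELECT
  user_pseudo_id,
  days_to_purchase,
  ROW_NUMBER() OVER (
    PARTITION BY user_pseudo_id
    ORDER BY days_to_purchase
  ) AS rn
FROM UNNEST([
  STRUCT('A' AS user_pseudo_id, 10 AS days_to_purchase),
  STRUCT('A' AS user_pseudo_id, 3 AS days_to_purchase),
  STRUCT('B' AS user_pseudo_id, 7 AS days_to_purchase)
]);
-- A/3 and B/7 get rn = 1, A/10 gets rn = 2, so the rn = 1
-- filter keeps only the earliest purchase for each user.
```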

    Finally, the order by days_to_purchase clause sorts the result set by days_to_purchase in descending order, which means that the users who took the longest time to make a purchase will be at the top of the list. This can be useful for identifying potential issues with the user experience or the sales process that may be causing customers to delay their purchases.

Overall, both queries accomplish the same task, and there is no significant difference in their complexity. However, the readability and precision of the CTE version make Query 2 the preferable option for breaking down the workflow to people who are not so familiar with SQL, and on larger projects CTEs keep complex query code easier to maintain.

    Modifying the query for the business model  

The query as built is for an e-commerce business model, but any business model will want to understand the time lag from a user first visiting the website to their first "conversion". The way the query is built means it can be modified for the use case. (link to GitHub to download SQL code)

• For lead generation it may be getting a user to complete a registration form, so within the query the purchase event would need to be replaced with the registration event.
• For a business model that requires the user to log in, user_pseudo_id can be replaced with user_id.
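For instance, the purchase branch of the query can be re-pointed at a registration event. A sketch, assuming the GA4 event is named sign_up (the actual event name depends on the Google Tag Manager set-up), with the revenue column dropped since a registration carries no purchase value:

```sql
-- Drop-in replacement for the `purchase` subquery / CTE above.
-- `sign_up` is an assumed event name for the registration event.
SELECT
  PARSE_DATE('%Y%m%d', event_date) AS purchase_date,  -- or rename to conversion_date
  user_pseudo_id AS id
FROM `enter.tablename_123456.events_*`
WHERE event_name = 'sign_up'
```

Keeping the purchase_date alias means the rest of the query works unchanged; rename it once every reference has been updated.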

    Visualization

Having a rich data set is the first step; then it's all about visualizations to provide insights into those burning questions. There are many ways to build visualizations from BigQuery: connectors to Looker Studio (formerly Data Studio), Python, or Tableau, depending on the ecosystem in place.
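The chart inputs can also be prepared in BigQuery itself. A sketch, assuming the full query above has been saved as a view with the hypothetical name first_visit_to_purchase, aggregating revenue and average time lag by purchase month:

```sql
-- `your-project.dataset.first_visit_to_purchase` is a hypothetical
-- saved view containing the output of the query above.
SELECT
  DATE_TRUNC(purchase_date, MONTH) AS purchase_month,
  SUM(revenue) AS first_purchase_revenue,
  ROUND(AVG(days_to_purchase), 1) AS avg_days_to_purchase
FROM `your-project.dataset.first_visit_to_purchase`
GROUP BY purchase_month
ORDER BY purchase_month;
```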

    (Data in the charts has been modified for this article)

This chart looks at when the revenue from first-visit, first-purchase users was generated.

    This chart looks at which first visit campaign generated the revenue.

    Looking ahead

Being able to understand the split of new vs. returning customers and the frequency of purchase is part of the data puzzle for brands. This is only possible with good-quality data collected over a long period, a sophisticated Google Tag Manager + Google Analytics 4 set-up, well-thought-out event tracking, and a connection to BigQuery. Most important of all is having a solid strategy in place, with the right business questions being asked.

I worked on this project with Jeffrey Ohene, who built the SQL code in BigQuery.