Unlocking the Power of Data: Analyzing Starbucks’ Mobile App to Increase Customer Loyalty and Profit

Ali Surmeli
10 min read · Jul 18, 2020

About the Starbucks Project

In today’s fast-paced world, convenience is key. And for many coffee lovers, nothing is more convenient than being able to purchase their daily fix with just a few taps on their mobile device. This is where Starbucks’ mobile app comes in. Not only does it save customers time during the payment process, but it also provides the company with valuable data on customer habits and usage patterns.

But what exactly can Starbucks do with this data? The answer is simple: use it to increase customer loyalty and profit. By analyzing customer behavior and preferences, the company can craft targeted promotions that are tailored to individual customers. This is where machine learning comes in. By building a model that predicts which offers will be most effective for each customer, Starbucks can make data-driven decisions that drive customer engagement and revenue.

The marketing team at Starbucks has identified three types of offers: BOGO, discounts, and informational offers. The BOGO offer is a customer favorite, allowing them to earn a free coffee when they purchase one. Discount offers also provide value to customers, but require them to spend a certain amount to qualify. And informational offers, while not providing a direct reward, still play an important role in building customer loyalty.

In this project, we will walk through how Starbucks can use data and machine learning to craft targeted promotions that increase customer loyalty and revenue. By the end of this post, you will have a better understanding of the power of data-driven decision-making and how it can benefit your business's mobile app.

Problems We Will Try to Solve

We aim to increase customer loyalty and profitability for Starbucks by analyzing data from their mobile app. To do this, we must first address several key questions:

  1. Is it possible to predict which individual customers have a high likelihood of completing offers?
  2. Which features are most important in determining which customers to target with offers?
  3. Is there a correlation between gender and offer completion rates, and if so, which gender should be targeted most?

To answer these questions, we will be utilizing data from three sources: portfolio.json, profile.json, and transcript.json. The portfolio.json file contains information about each offer, such as its type, duration, and required spend. The profile.json file contains demographic information about each customer, including their age, gender, and income. The transcript.json file contains information about customer transactions, offers received, viewed, and completed.
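
For reference, here is a minimal sketch of loading the three files with pandas (assuming they are line-delimited JSON, as in the original Starbucks dataset):

```python
import pandas as pd

# The three files are line-delimited JSON in the original Starbucks dataset,
# hence orient="records" and lines=True.
portfolio = pd.read_json("portfolio.json", orient="records", lines=True)
profile = pd.read_json("profile.json", orient="records", lines=True)
transcript = pd.read_json("transcript.json", orient="records", lines=True)

print(portfolio.shape, profile.shape, transcript.shape)
```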

With these datasets, we can begin our exploration and cleaning process to uncover insights and build a machine-learning model to predict which offers will be most effective for individual customers.

Preparing the Data for Machine Learning

It is crucial to ensure that the data we use for our machine-learning models is clean, readable, and well-prepared. In this section, we will discuss the steps we took to prepare each of the three datasets for our model.

Portfolio Dataset:

The portfolio dataset contains information on ten different offers spanning the three offer types described above. The dataset includes columns for each offer's reward, difficulty, and duration. In order to make this dataset more usable for our machine-learning model, we took the following steps:

  1. Created separate binary columns from the “channels” column, one for each channel an offer uses.
  2. Dropped the original “channels” column, as it was no longer necessary.
  3. Multiplied the “duration” column by 24 to convert it from days to hours, for consistency with the other datasets.
  4. Renamed the “id” column to “offer_id” to match the naming conventions of the other datasets.
  5. Encoded the hash-style “offer_id” values as short numeric codes for readability (see the sketch below).
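
A minimal sketch of these steps in pandas (the column names follow the original dataset; the numeric encoding of offer IDs is illustrative):

```python
# One binary column per channel; each row of "channels" is a list of strings.
channel_cols = (
    portfolio["channels"]
    .apply(lambda chs: pd.Series({ch: 1 for ch in chs}))
    .fillna(0)
    .astype(int)
)
portfolio = pd.concat([portfolio.drop(columns="channels"), channel_cols], axis=1)

# Convert duration from days to hours, to match the transcript's time unit.
portfolio["duration"] = portfolio["duration"] * 24

# Rename and encode the offer identifier (the encoding scheme is illustrative).
portfolio = portfolio.rename(columns={"id": "offer_id"})
offer_codes = {oid: i for i, oid in enumerate(portfolio["offer_id"])}
portfolio["offer_id"] = portfolio["offer_id"].map(offer_codes)
```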

By taking these steps, we were able to effectively clean and prepare the portfolio dataset for our machine-learning model. This is an important step in the data science process, as it ensures that the model is using accurate and usable data. With a well-prepared dataset, we can be confident in the results our model produces.

Profile Dataset:

Continuing on from the portfolio dataset, we also needed to clean and prepare the “profile” dataset for our machine-learning model. This dataset contains information on 17,000 customers, including their age, gender, income, membership start date, and customer ID.

In order to make this dataset usable for our model, we took the following steps:

  1. Encoded the “id” column for easier reading and tidier data.
  2. Inspected the dataset for null values and discovered that all customers with an age of 118 had null values in their “gender” column. We decided to drop these rows.
  3. Split the “became_member_on” column into separate “year”, “month”, and “day” columns.
  4. To capture how long each customer had been a member of the Starbucks app, which we believe has a significant impact on our results, we looked at the latest membership date and chose “2019-01-01” as the reference date. We created a new column named “member_since_days” that measures the difference between “2019-01-01” and each customer's “became_member_on” date.
  5. Dropped the original “became_member_on” column and renamed the “id” column to “customer_id” for consistency with the other datasets (see the sketch below).
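
Continuing the sketch, the profile cleaning might look like this (assuming “became_member_on” is stored as an integer date such as 20170812, as in the original dataset):

```python
# Drop the age-118 rows, which all have null gender values.
profile = profile[profile["age"] != 118]

# Split the membership date into year/month/day columns.
became = pd.to_datetime(profile["became_member_on"].astype(str), format="%Y%m%d")
profile["year"] = became.dt.year
profile["month"] = became.dt.month
profile["day"] = became.dt.day

# Membership length in days relative to the fixed reference date 2019-01-01.
profile["member_since_days"] = (pd.Timestamp("2019-01-01") - became).dt.days

profile = (
    profile.drop(columns="became_member_on")
    .rename(columns={"id": "customer_id"})
)
```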

With these steps complete, the profile dataset was clean and ready for our machine-learning model.

Transcript Dataset:

In addition to the portfolio and profile datasets, we also needed to clean and prepare the “transcript” dataset for our machine-learning model. This dataset contains information on customer transactions and offer events, with 306,534 rows and 4 columns.

To make this dataset usable for our model, we took the following steps:

  1. Created separate “offer_id” and “amount” columns from the existing “value” column.
  2. Dropped the original “value” column, as it was no longer necessary.
  3. Renamed the “person” column to “customer_id” for ease of merging with the other datasets.
  4. Dropped rows corresponding to customers for whom we had no demographic information (age, gender, etc.), as sketched below.
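
A sketch of the transcript cleaning, continuing from above (in the original data the “value” column holds a dict per row whose keys vary by event type):

```python
# "value" keys differ by event: "offer id" / "offer_id" for offer events,
# "amount" for purchase transactions.
transcript["offer_id"] = [
    v.get("offer id", v.get("offer_id")) for v in transcript["value"]
]
transcript["amount"] = [v.get("amount") for v in transcript["value"]]

transcript = (
    transcript.drop(columns="value")
    .rename(columns={"person": "customer_id"})
)

# Keep only customers that remain in the cleaned profile dataset.
transcript = transcript[transcript["customer_id"].isin(profile["customer_id"])]
```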

Merging the Datasets:

As the final step before building a model, we needed to merge the “profile”, “transcript”, and “portfolio” datasets for a comprehensive view of all data.

We started by merging the “profile” and “transcript” datasets, and separately the “portfolio” and “transcript” datasets, for later use. Merging these datasets allowed us to see which transactions were associated with which offers and customer profiles.

We then dropped rows corresponding to offers that were received but neither viewed nor completed, as these records did not provide useful information for our model. Next, we created separate datasets for each event type other than “offer received”.

After that, for each offer sent, we calculated how long the customer had been a member at the time the offer was sent, as well as the time elapsed between receiving the offer and making a transaction. We added these calculations as new columns and named the result the “customer_offers” dataset.

We then merged the “customer_offers” dataset with the “portfolio” dataset for a comprehensive view of all the data. We converted the offer IDs to numeric values for readability, merged the resulting “portfolio_new” dataset with “customer_offers”, and dropped the original “offer_id” column once it was no longer necessary. We also dropped the “customer_id” column, since this dataset now had a numeric user ID column (see the sketch below).
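
Continuing the sketch, the merges might look like this (the intermediate event filtering is simplified, and the dataset names follow the narrative above):

```python
# Join offer events with customer demographics.
customer_offers = transcript.merge(profile, on="customer_id", how="inner")

# Apply the same numeric offer encoding used in the portfolio sketch above,
# then join with the offer metadata (pure transactions have no offer_id and
# fall out of the inner merge).
customer_offers["offer_id"] = customer_offers["offer_id"].map(offer_codes)
model_df = customer_offers.merge(portfolio, on="offer_id", how="inner")

# Replace the long customer hash with a compact numeric user ID.
model_df["user_id"] = model_df["customer_id"].astype("category").cat.codes
model_df = model_df.drop(columns="customer_id")
```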

Model Building

With our dataset prepared and cleaned, we can now move on to the next step: building our machine-learning model.

  • Get dummy variables for X: Since we have columns in our dataset that contain categorical data such as gender and type of offer, we need to convert them into numerical data by creating dummy variables.
  • Set the types as float: We will cast all columns to float before fitting them with StandardScaler(), as sketched below.
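
A minimal sketch of these two steps (assuming all remaining columns in the merged dataset are numeric at this point):

```python
# One-hot encode the categorical columns, then cast everything to float so
# StandardScaler() receives a uniform numeric matrix.
model_df = pd.get_dummies(model_df, columns=["gender", "offer_type"])
model_df = model_df.astype(float)
```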

Next, we need to split the data into features (X) and target (y) datasets, setting the test size to 30%. We then build a pipeline that performs feature scaling with StandardScaler() and try five different models to see which performs best:

  • Split the data into train and test sets: we will split the data into train and test sets to evaluate the performance of our model.
  • Build a pipeline for finding the best classification algorithm: we will wrap each candidate model in a pipeline with StandardScaler() and compare their scores.
  • Choose the best test score: we will choose the model with the best test score, apply it to our data, and then analyze the results.
  • Build a pipeline with 100 estimators: we will build our pipeline with 100 estimators using the best estimator, which is RandomForestClassifier().
  • Build our pipeline with the best parameters: we will tune RandomForestClassifier() and rebuild the pipeline with the best parameters found, as sketched below.
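
Putting these steps together, a sketch might look like this (“offer_completed” is a hypothetical name for the 0/1 target column; three of the five candidate models are shown for brevity, and the parameter grid is illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# "offer_completed" is a hypothetical 0/1 target column name.
X = model_df.drop(columns="offer_completed")
y = model_df["offer_completed"]

# 70/30 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Compare candidate classifiers inside a scaling pipeline.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(),
    "forest": RandomForestClassifier(n_estimators=100),
}
for name, clf in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", clf)])
    pipe.fit(X_train, y_train)
    print(name, pipe.score(X_test, y_test))

# Tune the best estimator (RandomForestClassifier) with an illustrative grid.
grid = GridSearchCV(
    Pipeline([
        ("scaler", StandardScaler()),
        ("clf", RandomForestClassifier(n_estimators=100)),
    ]),
    param_grid={
        "clf__max_depth": [None, 10, 20],
        "clf__min_samples_split": [2, 10],
    },
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```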

With our model built and optimized, we can now use it to make predictions and analyze the results.

Questions:

Let’s revisit the first question we posed at the beginning of the article:

Is it possible to predict which individuals have a high percentage rate of completing an offer?

Yes. Through the implementation of machine-learning techniques and analysis of the provided datasets, we were able to predict which individuals have a high likelihood of completing an offer, with an accuracy of 88%. This demonstrates the effectiveness of our model in identifying individuals who are likely to engage with our offers, allowing us to tailor our marketing strategies accordingly.

Which features are most important in determining which users to target?

Our analysis revealed that transaction amount, membership duration, offer start time, and customer income are the most important features in determining which users to target. The transaction amount in particular has the highest predictive power. The offer start time also plays an interesting role, perhaps because offers are only valid within a specific window, so the timing of a customer's transactions determines whether an offer is completed. Income, in turn, reflects a customer's economic capacity. These importances can be read off the fitted model, as sketched below.
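
For reference, a sketch of extracting the importance scores from the tuned pipeline above:

```python
# Rank features by the tuned random forest's importance scores.
best_forest = grid.best_estimator_.named_steps["clf"]
importances = pd.Series(best_forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```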

Is there an effect of gender on the completion rate of offers and which gender should be targeted?

Our analysis revealed that men complete more offers in total, but when considering the conversion rate, female customers and customers of other genders are the more profitable targets for the company. The conversion-rate computation is sketched below.
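
A sketch of the per-gender conversion rate, using the cleaned transcript and profile data from above (the event labels follow the original dataset):

```python
# Offers received vs. completed per gender; their ratio is the conversion rate.
events = transcript.merge(profile[["customer_id", "gender"]], on="customer_id")
received = events[events["event"] == "offer received"].groupby("gender").size()
completed = events[events["event"] == "offer completed"].groupby("gender").size()
print(completed / received)
```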

How Can We Improve?

To improve the model, we could explore the potential impact of individual customer behavior on offer completion by splitting the channel data and matching it with customer ages. Prioritizing customers with the highest total income could also improve customer relations. To increase the model's accuracy, we could obtain additional datasets or merge them with the current ones. Finally, A/B testing could be implemented to understand the relationship between transaction behavior and offers.

Conclusion:

In this blog post, we have discussed the steps involved in preparing the data for machine learning, the process of building a model, and the results of our analysis. Our goal was to predict which individuals have a high likelihood of completing an offer, identify the most important features in determining which users to target, and understand the effect of gender on offer completion rates. Through machine-learning techniques and analysis of the provided datasets, we achieved an accuracy of 88% in predicting which individuals are likely to engage with our offers.

For those interested in the details of our model and analysis, the code and datasets used in this post can be found on my GitHub page. I hope you found this post informative and engaging. As always, feel free to reach out to me with any questions or comments.
