Instacart Market Basket Analysis
Instacart is an American company that operates a grocery delivery and pick-up service in the United States and Canada. The company offers its services via a website and mobile app. The service allows customers to order groceries from participating retailers, with the shopping being done by a personal shopper. Instacart aims to make it easy to fill the customer’s refrigerator and pantry with their personal favorites and staples when needed. After the customer selects products through the Instacart app, personal shoppers review the order and do the in-store shopping and delivery.
Instacart wants to predict, for each user, the items that the user will buy again. This is a multilabel classification problem, but we can also formulate it as a binary classification problem: “We predict whether the user will reorder a particular product in their next order or not.” We predict “None” in case the user doesn’t buy any previously bought products. In the end, a company would not want to recommend a product that is irrelevant to the buyer, and would not want to miss any product that the user would like to reorder.
The Instacart online grocery data set is an anonymized dataset that contains a sample of over 3 million orders from more than 200,000 users. The dataset for this competition is a relational set of files describing customers’ orders over time. For each user, we have between 4 and 100 orders, with the sequence of products purchased in each. The day and hour of each order are also given. The data set contains the following files:
- aisles.csv: This file contains the mapping between aisle_id and aisle (name).
- departments.csv: This file contains the mapping between department_id and department (name).
- products.csv: This file contains the mapping between product_id and product name, aisle_id and department_id.
- orders.csv: This file contains the order details at user_id level: order_id, user_id, eval_set (prior, train, test), order_number, order_dow, order_hour_of_day, days_since_prior_order.
- order_products__prior.csv and order_products__train.csv: These files contain the granular details of each order, i.e., the products added to the cart along with their cart position and whether each product is a reorder or not. A single row has the following information: order_id, product_id, add_to_cart_order (cart position), reordered. order_products__prior.csv holds the prior orders and order_products__train.csv holds the train orders.
Since it is a multi-label classification problem, we can use either the F1 score or AUC as the evaluation metric. The Kaggle challenge used the mean F1 score. We can choose log-loss as the loss function for training the model.
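To make the metric concrete, here is a minimal sketch of a per-order F1 score averaged over orders. It assumes an empty basket is represented by the single label "None", as described above; the function names, order ids and product ids are hypothetical and only serve the illustration.

```python
def order_f1(predicted, actual):
    """F1 score between predicted and actual product labels for one order."""
    # An order with no reordered products is represented by the label "None".
    predicted = set(predicted) or {"None"}
    actual = set(actual) or {"None"}
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, ground_truth):
    """Average the per-order F1 over every order in ground_truth."""
    scores = [order_f1(predictions.get(order_id, []), products)
              for order_id, products in ground_truth.items()]
    return sum(scores) / len(scores)


# Hypothetical orders: order_id -> list of reordered product ids.
truth = {1: [10, 20, 30], 2: [], 3: [40]}
preds = {1: [10, 20], 2: [], 3: [50]}
print(round(mean_f1(preds, truth), 4))  # 0.6
```

Order 1 scores 0.8, order 2 is a correct "None" prediction (1.0) and order 3 is a miss (0.0), which averages to 0.6.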
Exploratory Data Analysis
Personal care is the department with the largest number of unique products.
The number of orders per user ranges from 4 to 100.
Cart Size distribution in prior data:
The distribution of reorders is similar in the train data and the prior data, which tells us that, at the overall level, user reordering patterns are similar.
There are more orders on day 0 and day 1 of the week than on the other days (probably the weekend).
Most of the orders are placed between 7 AM and 8 PM.
Heat map of all the orders across a week (24x7). It looks like hours 13, 14 and 15 of day 0 and hours 9 and 10 of day 1 are the busiest. Let’s see what products are being ordered during these times, day by day.
For hours 13, 14 and 15 of day 0:
For hours 9 and 10 of day 1:
The products ordered during the peak hours of the two days are similar. They are regular items like fruits and vegetables.
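The 24x7 heat map boils down to counting orders per (day of week, hour of day) pair. A minimal sketch, using hypothetical (order_dow, order_hour_of_day) pairs in place of the real orders.csv rows:

```python
from collections import Counter

# Hypothetical (order_dow, order_hour_of_day) pairs, one per order.
orders = [(0, 13), (0, 14), (0, 14), (1, 9), (1, 10), (1, 10), (3, 18)]

counts = Counter(orders)

# 7 x 24 grid: rows are days of the week, columns are hours of the day.
grid = [[counts.get((dow, hour), 0) for hour in range(24)]
        for dow in range(7)]

# The busiest (day, hour) slots, most frequent first.
busiest = counts.most_common(2)
print(busiest)
```

The grid can be passed directly to any plotting library's heat map function; the busiest slots can then be used to filter the orders whose products we want to inspect.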
If we look at the days since the previous order column, there are 2 peaks, at 7 days and 30 days. That implies that many orders happen on a weekly or monthly basis. There are also small peaks (compared with neighboring values) at 14 and 21 days, which suggests that some orders happen every two or three weeks. However, a significant number of orders happen at intervals of less than 7 days as well.
What are the most ordered products weekly?
What are the most ordered products monthly?
All the regular orders (within 1–2 days), weekly orders and monthly orders are dominated by fruits and vegetables or dairy products. However, when we look at the overall reorder ratio, weekly orders are the highest, followed by regular orders.
Bananas are the most ordered item. The top 50 list is also mostly dominated by organic products. One more observation is that there are a few products for which both the organic and the regular variety are in the top 50 most ordered items, e.g., Banana and Bag of Organic Bananas, Organic Strawberries and Strawberries, Organic Blueberries and Blueberries.
Fruits and vegetables are the most ordered items, followed by dairy products.
After filtering out the products that have fewer than 40 orders, we plotted the 50 products with the highest reorder ratio. Almost all of them are food and beverages, along with 2–3 house-care products, which makes sense.
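The reorder-ratio ranking can be sketched as follows. This is a dependency-free illustration rather than the original analysis code: the function name and the sample rows are hypothetical, and the reorder ratio is taken to be the share of a product's order rows that have reordered = 1.

```python
from collections import defaultdict


def top_reorder_ratios(rows, min_orders=40, top_n=50):
    """rows: iterable of (product_id, reordered) with reordered in {0, 1}."""
    orders = defaultdict(int)
    reorders = defaultdict(int)
    for product_id, reordered in rows:
        orders[product_id] += 1
        reorders[product_id] += reordered
    # Drop products with fewer than min_orders orders before ranking.
    ratios = {p: reorders[p] / n for p, n in orders.items() if n >= min_orders}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical rows in the shape of order_products__prior.csv.
rows = ([("banana", 1)] * 45 + [("banana", 0)] * 5
        + [("soap", 1)] * 20 + [("soap", 0)] * 20
        + [("rare item", 1)] * 3)
print(top_reorder_ratios(rows))  # [('banana', 0.9), ('soap', 0.5)]
```

Without the minimum-order filter, "rare item" would top the list with a ratio of 1.0 from only 3 orders, which is why the threshold matters.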
Items that are added to the cart first are more likely to be reordered. Only cart positions with at least 1000 orders were considered.
The chance of a reorder is very high when a user places another order on the same day. The chance of a reorder on the 7th and 14th day is also high compared to the surrounding values (the difference is statistically significant, tested using a proportion test).
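The article does not say which proportion test was used. One common choice is the two-proportion z-test, sketched below under that assumption; the counts are hypothetical and only show how the comparison between, say, day 7 and a neighboring day could be set up.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: reordered rows out of all rows for day 7 vs. day 8.
z, p_value = two_proportion_z_test(7000, 10000, 6500, 10000)
print(z > 0, p_value < 0.05)
```

A small p-value suggests the reorder rates on the two days differ by more than chance would explain.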
Feature Engineering
Apart from the user id, product id, order number and order id, the raw data given to us has details about the order time, the days since the prior order and the cart position. I focused mainly on creating features at different levels, i.e., user, user-product, product, user-aisle, user-department etc., and then used these features to build the models. At each level, the following features are used:
User level: number of orders, number of different products ordered, reorder ratio (proportion of reorders to total orders), most frequent day of order, average time between orders, average time between reorders, user-product order streaks aggregated at user level, days since prior order for the user-product combination aggregated at user level, etc.
Product level: number of orders, reorder ratio, average cart position, most frequent day of order, user-product order streaks aggregated at product level, days since prior order for the user-product combination aggregated at product level, etc.
User-product level: reorder ratio, the order in which the user first bought the product, order range for that product, average time between orders of the product, number of times the user bought the product in the last 5 orders, how frequently the user bought the product after its first purchase, days since prior order for the user-product combination, chance of ordering the product in the last 2, last 3, last 4 orders etc., for all users.
User-aisle level: unique orders, reorder ratio, most frequent day of order, days since prior order for that user-aisle combination, etc.
User-department level: unique orders, reorder ratio, most frequent day of order, days since prior order for that user-department combination, etc.
I also tried to capture the behaviors in the last 3 and last 5 orders of each user to account for the recency bias (if any).
I have added the features based on item-item co-occurrences as well.
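As an illustration of the user-product level, here is a minimal sketch that derives a handful of these features from prior-order rows. The feature names are hypothetical, and the sketch assumes that order_number starts at 1 for each user and that each (user, order, product) row appears once.

```python
from collections import defaultdict


def user_product_features(rows, last_n=5):
    """rows: (user_id, order_number, product_id) tuples from the prior orders."""
    orders_per_user = defaultdict(int)   # user -> highest order_number seen
    order_numbers = defaultdict(list)    # (user, product) -> order numbers
    for user_id, order_number, product_id in rows:
        orders_per_user[user_id] = max(orders_per_user[user_id], order_number)
        order_numbers[(user_id, product_id)].append(order_number)

    features = {}
    for (user_id, product_id), numbers in order_numbers.items():
        total_orders = orders_per_user[user_id]
        first = min(numbers)
        features[(user_id, product_id)] = {
            "times_ordered": len(numbers),
            "first_order_number": first,
            "orders_since_last": total_orders - max(numbers),
            # How often the product appears in the user's last_n orders.
            "count_in_last_n": sum(n > total_orders - last_n for n in numbers),
            # Share of orders containing the product since its first purchase.
            "order_rate_since_first": len(numbers) / (total_orders - first + 1),
        }
    return features


# Hypothetical prior orders for a single user with 6 orders.
rows = [(1, 1, "milk"), (1, 2, "milk"), (1, 2, "bread"), (1, 3, "eggs"),
        (1, 4, "eggs"), (1, 5, "milk"), (1, 6, "milk")]
feats = user_product_features(rows)
print(feats[(1, "milk")])
```

The other levels follow the same pattern with a different grouping key (user, product, user-aisle, user-department).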
Train Data/ Test Data Preparation
This is the part where a lot of mistakes tend to happen, so I’m making it a separate section and explaining the steps involved in detail.
First, I created a user-product combination table, which is the source for all the user-product orders that happened in the prior dataset (let’s call this the User-Product Master).
Each user’s latest order is labeled as either train or test, and there is no overlap between the train and test users, so we can assume that the train and test orders (for which we need to predict the items that will be reordered) come after the prior orders.
So the train/test user-product pairs are created by merging the train/test users with the User-Product Master. (With this, we leave out all the items that train/test users ordered for the first time in their latest order.)
Now, we do multiple merges with the user level, product level, user-product level datasets etc., to form the final train and test datasets.
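The candidate-generation step can be sketched as below. This is a simplified, dependency-free illustration with hypothetical names and data: it builds the User-Product Master from prior pairs, keeps only the users in the given evaluation set, and attaches a label when the train basket is known.

```python
def build_candidates(prior_rows, eval_users, train_baskets=None):
    """
    prior_rows: (user_id, product_id) pairs seen in the prior orders.
    eval_users: users whose latest order is in the train (or test) set.
    train_baskets: user_id -> set of products in the train order; None for test.
    """
    master = sorted(set(prior_rows))  # the User-Product Master
    candidates = []
    for user_id, product_id in master:
        if user_id not in eval_users:
            continue
        row = {"user_id": user_id, "product_id": product_id}
        if train_baskets is not None:
            # Label is 1 if a previously bought product is in the train order.
            basket = train_baskets.get(user_id, set())
            row["reordered"] = int(product_id in basket)
        candidates.append(row)
    return candidates


# Hypothetical data: user 1 is a train user, user 2 is a test user.
prior = [(1, "a"), (1, "b"), (1, "a"), (2, "c")]
train_rows = build_candidates(prior, {1}, {1: {"a", "z"}})
test_rows = build_candidates(prior, {2})
print(train_rows)
print(test_rows)
```

Product "z" is in user 1's train order but never appears in the prior orders, so it produces no candidate row, which matches the note above about first-time items being left out. The user level, product level and other feature tables would then be joined onto these rows by their keys.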
Modeling Approach
With all the features created, I trained different models such as decision trees (DT), random forests (RF), logistic regression (LR) and LightGBM (I couldn’t actually run XGBoost because of my RAM constraint).
LightGBM gave me the best results among them. Since I had around 250–300 features, which was too much for my RAM, I removed the features that I felt were redundant, leaving a final feature set of around 197. With these, I built a LightGBM model and selected the top 80 features based on the feature importances.
I then built 5 models, each with 120 features, of which the 80 important features are common and the rest vary from model to model. My final prediction is the average of the predictions of all 5 models.
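The feature-subset scheme and the final averaging can be sketched without any model code. How the 40 non-common features were chosen for each model is not stated, so random sampling from the remaining features is an assumption here; the function names and feature names are hypothetical.

```python
import random


def make_feature_sets(ranked_features, n_common=80, n_total=120,
                      n_models=5, seed=0):
    """ranked_features: feature names sorted by importance, best first."""
    common = ranked_features[:n_common]
    rest = ranked_features[n_common:]
    rng = random.Random(seed)
    # Assumption: each model gets the common features plus a random sample
    # of the remaining ones.
    return [common + rng.sample(rest, n_total - n_common)
            for _ in range(n_models)]


def average_predictions(per_model_probs):
    """per_model_probs: equal-length lists of probabilities, one per model."""
    return [sum(values) / len(values) for values in zip(*per_model_probs)]


features = [f"f{i}" for i in range(197)]
feature_sets = make_feature_sets(features)
print([len(fs) for fs in feature_sets])  # [120, 120, 120, 120, 120]
print(average_predictions([[0.2, 0.8], [0.4, 0.6]]))
```

Each feature set would be used to train one model, and the per-row probabilities from the 5 models are averaged before the basket is generated.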
Generating reorder basket from the predicted probabilities
After having the predicted probabilities, now the challenge is to get the list of products that’ll be reordered. One direct approach is to decide based on a threshold which is common for all the users and the other approach is to have a user specific/row specific threshold to maximize the expected F1-score.
The idea is to take the predicted probabilities of all the labels and then select the top few labels based on a soft threshold (can skip this step as well). Now, from the predicted probabilities, we will calculate the expected F1 score for all the possible combinations and we select the combination which has the maximum expected F1 score. Thus, we optimize the performance of each instance individually. This method works well only if we have a good classifier in our hand since we are using the predicted probabilities to arrive at the output class labels.
Let’s consider an example where the predicted probabilities are A (0.7), B (0.3) and C (0.4). The calculated probability for the None prediction will be (1 - 0.7)*(1 - 0.3)*(1 - 0.4) = 0.126. The probabilities of the different outcomes are as follows:
Outcome: Probability
- Only A: 0.7*0.7*0.6 = 0.294
- Only B: 0.3*0.3*0.6 = 0.054
- Only C: 0.3*0.7*0.4 = 0.084
- A & B: 0.7*0.3*0.6 = 0.126
- B & C: 0.3*0.3*0.4 = 0.036
- A & C: 0.7*0.7*0.4 = 0.196
- A, B & C: 0.7*0.3*0.4 = 0.084
- None: 0.3*0.7*0.6 = 0.126
We’ll now calculate the expected F1 score for each possible recommendation, i.e., A, B, C, AB, BC, AC and ABC.
Since the expected F1 score for the A,C recommendation is the maximum, we’ll select A,C as the predicted labels.
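The worked example can be reproduced with a brute-force sketch. It assumes independent reorder probabilities and treats a "None" prediction as correct only when the true basket is empty, which is a simplification of how the competition scores "None"; the function names are hypothetical. Because it enumerates every subset, it is only practical for a handful of labels, which is why a soft threshold to shortlist labels first can help.

```python
from itertools import combinations


def set_probability(subset, probs):
    """Probability that exactly `subset` is reordered, assuming independence."""
    result = 1.0
    for label, p in probs.items():
        result *= p if label in subset else 1 - p
    return result


def f1(predicted, actual):
    if not predicted and not actual:
        return 1.0  # predicting "None" for an empty basket counts as a hit
    overlap = len(predicted & actual)
    return 2 * overlap / (len(predicted) + len(actual))


def best_basket(probs):
    labels = sorted(probs)
    subsets = [frozenset(c) for r in range(len(labels) + 1)
               for c in combinations(labels, r)]
    outcome_probs = {s: set_probability(s, probs) for s in subsets}
    # Expected F1 of each candidate basket over all possible true baskets.
    expected = {
        candidate: sum(p * f1(candidate, actual)
                       for actual, p in outcome_probs.items())
        for candidate in subsets
    }
    best = max(expected, key=expected.get)
    return best, expected


best, expected = best_basket({"A": 0.7, "B": 0.3, "C": 0.4})
print(sorted(best), round(expected[best], 4))  # ['A', 'C'] 0.5962
```

For comparison, under the same assumptions recommending only A gives an expected F1 of about 0.5507 and recommending A, B and C gives about 0.5864.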
This approach gave me a 0.2 lift on the Leaderboard when compared to the common threshold approach.
My final LB score is 0.39821.
A few things to improve in the current solution
Given enough RAM, I would love to train my models on much larger feature sets than the ones I used. I couldn’t use XGBoost either because of the same problem.
For some users we need to predict “None” as well, if they don’t reorder any item in their next order. Right now, for each user we calculate the probability of “None” from the predicted reorder probabilities of all the products:
P(None) = (1 - P(p0)) * (1 - P(p1)) * (1 - P(p2)) * … * (1 - P(pn))
But this holds true only when the reorders of the individual products are independent of each other, which may not be the case in practice. So having a separate model to predict whether or not the next basket contains no reordered products (None or Not) should help us get better results.
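Under the independence assumption, the probability that none of the products is reordered is the product of the individual "not reordered" probabilities. A minimal sketch, with a hypothetical function name, reusing the probabilities from the earlier example:

```python
import math


def prob_none(reorder_probs):
    """P(no product is reordered), assuming independent reorder events."""
    return math.prod(1 - p for p in reorder_probs)


print(round(prob_none([0.7, 0.3, 0.4]), 3))  # 0.126
```

A dedicated "None" model would replace this product with a directly predicted probability, which does not rely on independence.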
We can use deep learning methods to obtain Product, User and Aisle level embeddings to capture the respective information instead of only relying on the manual features that we created.
Will update this article with the github link soon…