Customer Segmentation using Machine Learning

Predicting potential customers from the general population

Batselem Jagvaral
Jun 17, 2020 · 6 min read

For a company that sells products directly to the public, it is important to identify the people who are most likely to become customers. Knowing the right group of people to target lets the company reach out only to those potential customers and save considerable resources. In this project, I perform an exploratory analysis of demographic data for the customers of a mail-order company and for the general population of Germany, and answer the following questions:

  • Which parts of the population should be the target of the company's marketing campaign?
  • Which individuals are most likely to become customers of the company?

Problem Statement

There are four data files associated with this project:

  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 features (columns).
  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 features (columns).

Each row of the demographics files represents a single person, but each row also includes attributes beyond the individual, such as information about their household, building, and neighborhood.

We will use the information from the first two files to figure out how customers (“CUSTOMERS”) are similar to or different from the general population at large (“AZDIAS”), and then use that analysis to make predictions on the other two files (“MAILOUT”), identifying which recipients are most likely to become customers of the mail-order company.

Data Exploration and Preprocessing

There are 366 features in our data that we need to consider for modeling. In the preprocessing stage, each of them is carefully investigated and processed. Examples from the customer data are shown here:

Looking at the customer data, we can see that most customers bought multiple items and visited the stores to buy them. Customers also tend to buy both cosmetic and food products.

Now, let’s clean up our data to prepare it for use in modeling. First, we can calculate the percentage of missing values for each column. In the graph below, we can see that our data contains lots of missing values:
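The per-column missing percentage behind this graph can be computed with a short pandas sketch, assuming the data has been loaded into a DataFrame named azdias:

# Percentage of missing values per column, sorted from most to least missing
missing_pct = azdias.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct.head(10))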

So, we dropped features with more than 25% NULL values, along with other irrelevant features, for example, high-cardinality features with many distinct values and features that were highly correlated with others. In addition, we dropped the extra features in the customer data mentioned above that are not part of the general population data.

Missing values in the remaining features were then filled with 0. Categorical values were also converted into numerical values using one-hot encoding, which greatly increases the number of features: after representing the categorical features as one-hot values, our data has 562 features.
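A minimal sketch of these cleaning steps, using the same assumed azdias DataFrame (the exact lists of dropped and correlated columns are not shown here):

import pandas as pd

# Drop columns with more than 25% missing values
missing_frac = azdias.isnull().mean()
azdias = azdias.drop(columns=missing_frac[missing_frac > 0.25].index)

# Fill remaining missing values with 0 and one-hot encode categorical features
azdias = azdias.fillna(0)
categorical_cols = azdias.select_dtypes(include='object').columns
azdias = pd.get_dummies(azdias, columns=categorical_cols)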

Customer Segmentation

Now, we will find out which parts of the population are most likely to be customers of the mail-order company. Even though we have dropped many irrelevant features, our data is still very high-dimensional, so we need to reduce its dimensionality before customer segmentation. We will use principal component analysis (PCA) for dimensionality reduction. PCA helps us identify patterns in the data and reduce its dimensionality based on the correlations between features.

from sklearn.decomposition import PCA

# Retain half of the original number of features as principal components
n_components = int(azdias_scaled.shape[1] / 2)
pca = PCA(n_components)
azdias_pca = pca.fit_transform(azdias_scaled)

We apply PCA to our data and retain half the number of features. The graph below shows the variance explained as a function of the number of principal components.

We evaluate the PCA results based on the cumulative explained variance, i.e., the sum of the variance ratios explained by the individual principal components. The cumulative explained variance ratio in our data was 0.8396 when half of the features were retained.
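The cumulative value can be read directly from the fitted PCA object, for example:

import numpy as np

# Cumulative explained variance ratio over the retained components
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
print(cumulative_variance[-1])  # ≈ 0.84 when half of the features are retained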

After dimensionality reduction, we can cluster our data into different segments. First, we need to determine the best number of clusters. In this project, we use the Calinski-Harabasz score to find the right number of clusters. The plot shows that the highest Calinski-Harabasz value occurs at 14 clusters, suggesting that the optimal number of clusters in our data is 14. Now, we will apply the K-means algorithm to cluster both the general population and customer data into 14 clusters.

from sklearn.cluster import MiniBatchKMeans
optimal_k = 14  # number of clusters chosen via the Calinski-Harabasz score
kmeans = MiniBatchKMeans(n_clusters=optimal_k, random_state=15)
model = kmeans.fit(azdias_pca)
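For reference, the scan over candidate cluster counts behind the Calinski-Harabasz plot might look roughly like the sketch below; the candidate range and the use of MiniBatchKMeans for the scan are assumptions:

from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import calinski_harabasz_score

# Score each candidate number of clusters (a subsample of azdias_pca may be used for speed)
scores = {}
for k in range(2, 21):
    labels = MiniBatchKMeans(n_clusters=k, random_state=15).fit_predict(azdias_pca)
    scores[k] = calinski_harabasz_score(azdias_pca, labels)

best_k = max(scores, key=scores.get)  # 14 in this project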

The clustering result can be found in the graph below:

Blue bars indicate the general population and green ones indicate current customers of the company. As we can see, the 3rd cluster shows a strong correspondence between customers and the general population. Therefore, we can focus on the people in the 3rd cluster of the general population and run an advertising campaign to acquire customers from this group.
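The comparison behind this chart can be reproduced with a sketch like the one below, assuming customers_pca holds the PCA-transformed customer data and model is the fitted K-means from above:

import pandas as pd

# Assign every person in each dataset to one of the 14 clusters
population_clusters = model.predict(azdias_pca)
customer_clusters = model.predict(customers_pca)

# Share of each cluster within each dataset
population_share = pd.Series(population_clusters).value_counts(normalize=True).sort_index()
customer_share = pd.Series(customer_clusters).value_counts(normalize=True).sort_index()
print(pd.DataFrame({'population': population_share, 'customers': customer_share}))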

Supervised Learning Model

Now we will determine which individuals are most likely to become customers of the company. We are given two datasets for this task, MAILOUT_TRAIN and MAILOUT_TEST. Each row in the “MAILOUT” files represents an individual who was targeted by a mailout campaign, and we will predict which of these individuals are most likely to become customers of the mail-order company.

We use the same preprocessing pipeline built for the AZDIAS and CUSTOMERS datasets. We first randomly split MAILOUT_TRAIN into training and validation sets with an 80/20 train/validation ratio. A random forest classifier was used to fit a classification model to our data, and GridSearchCV was used to compute accuracy metrics for the classifier on various combinations of parameters over a cross-validation procedure, which is useful for finding the best set of parameters for our classifier.
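The 80/20 split might look like the following sketch, where X and y stand for the preprocessed MAILOUT_TRAIN features and the RESPONSE labels, and the random seed is an assumption:

from sklearn.model_selection import train_test_split

# 80/20 train/validation split of the cleaned MAILOUT_TRAIN data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

The grid-search pipeline from the project is then built and trained as follows: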

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, StandardScaler  # Imputer is from scikit-learn < 0.22 (SimpleImputer in newer versions)
from sklearn.model_selection import GridSearchCV

scaler = StandardScaler()  # assumed to be the same scaler used for the AZDIAS data earlier

def build_model():
    """Build model.

    Returns:
        cv (sklearn.model_selection.GridSearchCV): grid search over the pipeline.
    """
    # Set up the machine learning pipeline
    clf = RandomForestClassifier(n_estimators=10, n_jobs=2)
    pipeline = Pipeline([
        ('imp', Imputer(missing_values='NaN', axis=0)),
        ('scale', scaler),
        ('clf', clf)
    ])

    # Set parameters for grid search
    parameters = {}
    parameters['imp__strategy'] = ['mean', 'median', 'most_frequent']
    parameters['clf__n_estimators'] = [5, 10]
    cv = GridSearchCV(pipeline,
                      parameters,
                      scoring='roc_auc',
                      n_jobs=1)
    return cv


print('Building model...')
model = build_model()

print('Training model...')
model.fit(X_train, y_train)

We fine-tune the classification algorithm with various hyperparameters, and the best-performing parameters are selected based on the validation set. As a result, the trained model obtained an accuracy of 0.995 on the test dataset.
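As a rough illustration of the final evaluation and prediction step (X_val, y_val, and X_test are assumed names for the validation split and the preprocessed MAILOUT_TEST features):

from sklearn.metrics import accuracy_score, roc_auc_score

# Evaluate the tuned model on the held-out validation split
val_pred = model.predict(X_val)
val_proba = model.predict_proba(X_val)[:, 1]
print('Validation accuracy:', accuracy_score(y_val, val_pred))
print('Validation ROC AUC:', roc_auc_score(y_val, val_proba))

# Rank individuals in MAILOUT_TEST by their predicted probability of responding
test_proba = model.predict_proba(X_test)[:, 1]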

Conclusion

In this project, given real-life demographic data provided by Arvato Financials, I tried to create a segmentation of potential customers within the general population and to predict which new individuals are likely to become customers in the future.

I would like to note that this was quite a challenging project for me. I spent most of my time cleaning up the datasets and analyzing such a large amount of data. In the supervised learning part, even though our experiment shows high accuracy, the ratio of positive to negative labels was not well balanced, and the results could be further improved with data balancing techniques.
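One simple balancing option, for example, would be to weight the classes inversely to their frequency when training the random forest; this is only a sketch, not something used in the project:

from sklearn.ensemble import RandomForestClassifier

# Give the rare positive (responder) class proportionally more weight
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced', n_jobs=2)
clf.fit(X_train, y_train)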

Acknowledgements

This project was done as part of the Data Scientist Nanodegree program. I’d like to thank Bertelsmann Arvato Analytics for providing the data and the Udacity instructors for their reviews.
