Welcome! to the project of Customer Segmentation.

Problem Statement

Grouping customers into ideal number of clusters.

Develop a model to visualize data from mall customers and cluster them into ages and potential products purchased.

Customer Segmentation

Customer segmentation is the process of dividing customers into groups based on common characteristics so companies can market to each group effectively and appropriately. In business-to-business marketing, a company might segment customers according to a wide range of factors, including:

Industry
Number of employees
Products previously purchased from the company
Location

Customer segmentation is basically identifying key differentiators that divide customers into groups that can be targeted. The main goal of segmenting customers is to decide how to relate to customers in each segment to maximize the value of each customer to the business.

Machine Learning Customer Segmentation Project

Earlier Customer segmentation was a challenging and time-consuming task. This is because for segmenting customers we need to perform hours of manual poring on different tables and querying the data in hopes of finding ways to group customers together. So to overcome this task we will use machine learning to segment customers.

So the following project By Coding Pandits helps user to segment customers using machine learning algorithms, so that it will be useful for businessmen in personalized marketing and provide their customers’ relevant offers and deals.

Unsupervised Learning

Unsupervised learning is a machine learning technique in which models are not supervised using training dataset. Instead, models itself find the hidden patterns and insights from the given data. It can be compared to learning which takes place in the human brain while learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained using unlabeled dataset and are allowed to act on that data without any supervision.

So our project works upon the principle of Unsupervised Learning beacuse It involves training of machine using information that is neither classified nor labelled and allowing the algorithm to act on that information without guidance.

K-Means Clustering

The idea behind k-means clustering is very simple, it arranges the data into clusters that are very similar.

To implement K-means clustering we have to follow certain steps that are:

In K-means clustering firstly we have to specify the number of clusters k.
And then initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
Then it keeps iterating until there is no change to the centroids, i.e assignment of data points to clusters isn’t changing.
Then it computes the sum of the squared distance between data points and all centroids.
Assign each data point to the closest cluster (centroid).
At last, Compute the centroids for the clusters by taking the average of all data points that belong to each cluster.

Tools and Libraries for the project

The project is made using the coding language Python.

The following libraries you are required to install in your system, they are:

Numpy (pip install numpy)
Pandas (pip install pandas)
Matplotlib (pip install matplotlib)
Seaborn (pip install seaborn)
Sklearn (pip install sklearn)
mpl_toolkits (pip install mpl_toolkits)

Code of the model

Let us code the model of customer segmentation.

First we will import all important libraries in our juypter notebook

Now we will import the dataset and once check if the file is imported

Before moving further we can see that there is a spelling mistake of gender which is written as Genre. So let’s change that and check if it has been corrected.

Now we are all set to move to visualization of data

Let’s create a distribution plot for the 3 categories i.e. Age , Annual Income and Spending Score.

In this program snippet, A far loop has been created [because we need to represent the 3 variables]. Then we have implemented a subplot function. This represents number of rows, columns and the index which we will implement. Now let us run the program.

So in the first data plot we can see that age group 20-40 perform the highest shopping among all the all the age groups.

In the 2nd plot we can see that the people who have their annual income between $50,000 and $70,000 perform more shopping.

In the 3rd plot we can see that the most people have the score 50.

Now let us compare the number of female and male customers.

We will use the countplot function of sns to represent the comparison

From this we can conclude that majority of customers is female.

Now let us create a violin plot of age, annual income , and spending score based on gender using the seaborn library.

In the program/code the x-axis will represent the age, annual income and spending score whereas the y-axis will represent the gender.

Let us now the run the code.

The above violin plots represents the distribution of age, annual income and spending score based on gender

In the first violin plot we can see that female density is more that of male in the age group 30

In the second violin plot we can see that female density is more at $60,000 p.a. and male density is also around $60,000-$65,000

In the third violin plot, the density of both male and female density is around 50.

Before moving ahead lets describe our data

As we can see from the description that minimum age group is 18, maximum is 70 and so on we can derive conclusions from this which will help us to provide efficiency in the visualization

Let us now divide the age group into five categories to understand which age group has more density of customers.

In the program , we have divided the age into 5 categories.

On x-axis we will represent the category of age

On y-axis we will represent the number of customers in each category by using the length function.

Also for visualization of this data we will use bar plot

Let us now run the code:

From this bar graph we can easily understand that the age group 26-35 has maximum customers

For better understanding of data let us look into the relationship between annual income and spending score.

From this plot we can see that customers who have an annual income of $40,000 to $60,000 have a spending score of 60.

Moving ahead let us divide the annual income into 5 categories as we did earlier with the age category.

In the program, we have divided the annual income into five categories.

The x-axis represents the range of income.

The y-axis will represent the number of customers.

Now let us run the code:

From the bar graph we can see that most customers have annual income of $60,000-$90,000

Similarly we will do with the spending scores i.e. dividing it into 5 categories

Let us run the code:

We can see from the bar graph that maximum customers have a spending score of 41-60

In this way we have analysied the data thoroughly.

From here clustering process starts.

Now first we will find the optimum number of clusters that we need to create.

We will first cluster the data of age and spending score

First import the Kmeans and now we will draw a graph to find the optimum number of clusters for the 2 variables we considered for a range of 1-11 in X1