Correlation Matrix in Python with Examples

Correlation Matrix:

A correlation matrix is a table of correlation coefficients for several variables. Each cell holds the correlation between two variables, a value that ranges from -1 to 1. A correlation matrix is used to summarise data, as a diagnostic for advanced analyses, and as an input to more complex analyses. A correlation coefficient has two main components:

Magnitude: the closer the absolute value is to 1, the stronger the correlation.

Sign: a positive sign indicates a direct relationship (the variables rise and fall together). A negative sign indicates an inverse relationship (one variable rises as the other falls).
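To make magnitude and sign concrete, here is a minimal sketch on made-up data. The column names x, y and z and their values are illustrative assumptions and are not part of the cereal dataset used later.

```python
import pandas as pd

# Toy data (hypothetical values): y rises with x, z falls as x rises.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # y = 2 * x
    "z": [10, 8, 6, 4, 2],   # z = 12 - 2 * x
})

# corr() computes pairwise Pearson coefficients by default.
toy_corr = toy.corr()
print(toy_corr.round(2))
```

Because y is an exact increasing linear function of x and z an exact decreasing one, the x/y cell should be 1 and the x/z cell should be -1. The diagonal is always 1, and the matrix is symmetric.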

In the field of Data Science and Machine Learning, we frequently encounter scenarios in which we must examine variables as well as perform feature selection. This is where correlation regression analysis comes in.

Correlation regression analysis allows programmers to investigate the relationship between continuous independent variables and a continuous dependent variable.

That is, the analysis describes the strength and direction of the relationships among the data set’s independent variables, as well as between the independent variables and the response (dependent) variable.

The correlation matrix is used in correlation regression analysis to depict the relationships between the variables in the data set.

The correlation matrix is a tabular format that helps programmers analyze the relationships between variables. Each entry is a correlation coefficient between -1 and 1.

A positive value indicates that the two variables tend to increase together, a negative value indicates that one tends to decrease as the other increases, and a value of zero (0) indicates no linear dependency between that pair of variables.
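The sketch below shows where such a coefficient comes from by computing the Pearson correlation directly from its definition (covariance divided by the product of the spreads). The function name pearson_r and the sample numbers are hypothetical and chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

base = [1, 2, 3, 4, 5]
print(pearson_r(base, [2, 4, 6, 8, 10]))  # moves together: positive
print(pearson_r(base, [10, 8, 6, 4, 2]))  # moves oppositely: negative
print(pearson_r(base, [3, 1, 4, 1, 3]))   # no linear trend: about zero
```

Note that this sketch divides by zero if either sequence is constant. In practice pandas handles that case by returning NaN.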

Regression analysis and the correlation matrix serve the following purposes:

  • They reveal the relationships between the data set’s independent variables.
  • They aid in selecting important, non-redundant variables from a data set.
  • They apply only to numeric/continuous variables.
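Since correlation applies only to numeric columns, one way to prepare a mixed table is to keep just the numeric columns before calling corr(). The following is a minimal sketch on a hypothetical four-row table. Its values are made up and only loosely modelled on the cereal columns.

```python
import pandas as pd

# Hypothetical mini-dataset mixing text and numeric columns.
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "mfr": ["N", "Q", "K", "K"],
    "protein": [4, 3, 4, 2],
    "fiber": [10.0, 2.0, 9.0, 1.0],
})

# Keep only integer and float columns; text columns are dropped.
numeric_only = sample.select_dtypes(include="number")
print(list(numeric_only.columns))

# The correlation matrix is then square, one row/column per numeric variable.
sample_corr = numeric_only.corr()
print(sample_corr)
```

Selecting columns by dtype avoids having to list them by hand, although an explicit list gives finer control over which variables are compared.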

Creation Of Correlation Matrix

Here we use a cereal dataset as an example. Let us import it and take a look at it first.

1) Importing the Dataset

Import the dataset into a Pandas Dataframe.

# Import the pandas module as pd
import pandas as pd
# Load the dataset with read_csv() by passing the file name as an argument
# and store the resulting DataFrame in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Display the first ten rows of the dataset
print(cereal_dataset.head(10))

Output:

name                       mfr  type  calories  protein  fat  sodium  fiber  carbo  sugars  potass  vitamins  shelf  weight  cups  rating
100% Bran                  N    C     70        4        1    130     10     5      6       280     25        3      1       0.33  68.402973
100% Natural Bran          Q    C     120       3        5    15      2      8      8       135     0         3      1       1     33.983679
All-Bran                   K    C     70        4        1    260     9      7      5       320     25        3      1       0.33  59.425505
All-Bran with Extra Fiber  K    C     50        4        0    140     14     8      0       330     25        3      1       0.5   93.704912
Almond Delight             R    C     110       2        2    200     1      14     8       -1      25        3      1       0.75  34.384843
Apple Cinnamon Cheerios    G    C     110       2        2    180     1.5    10.5   10      70      25        1      1       0.75  29.509541
Apple Jacks                K    C     110       2        0    125     1      11     14      30      25        2      1       1     33.174094
Basic 4                    G    C     130       3        2    210     2      18     8       100     25        3      1.33    0.75  37.038562
Bran Chex                  R    C     90        2        1    200     4      15     6       125     25        1      1       0.67  49.120253
Bran Flakes                P    C     90        3        0    210     5      13     5       190     25        3      1       0.67  53.313813

Correlation Matrix Creation Example:

Let’s start by looking at the cereal dataset that is used in this example. The example uses four of its continuous variables, as shown below:
  • protein
  • fat
  • sodium
  • fiber

Approach:

  • Import the os module.
  • Import the pandas module as pd.
  • Import the numpy module as np.
  • Import the seaborn module.
  • Load the dataset with the read_csv() function by passing the file name as an argument, and store it in a variable.
  • List the columns of interest and store the list in another variable.
  • Form the correlation matrix with the corr() function.
  • Print the correlation matrix.
  • Pass the correlation matrix and annot=True as arguments to the heatmap() function to visualize the matrix.
  • Exit the program.

Below is the implementation:

# Import the os module
import os
# Import the pandas module as pd
import pandas as pd
# Import the numpy module as np
import numpy as np
# Import the seaborn module
import seaborn
# Import pyplot from matplotlib so the heatmap can be displayed
import matplotlib.pyplot as plt

# Load the dataset with read_csv() by passing the file name as an argument
# and store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# List the columns of interest and store the list in another variable.
numeric_columns = ['protein', 'fat', 'sodium', 'fiber']

# Form the correlation matrix (Pearson coefficients by default)
correlation_matrix = cereal_dataset.loc[:, numeric_columns].corr()
# Print the correlation matrix.
print(correlation_matrix)
# Pass the correlation matrix and annot=True to heatmap() so that each cell
# of the plot is labelled with its coefficient.
seaborn.heatmap(correlation_matrix, annot=True)
# Show the figure (needed when running as a script rather than in a notebook)
plt.show()

Output:

[Figure: annotated heatmap of the correlation matrix]

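The following observations can be made from the matrix above:

Every diagonal entry is 1, because each variable is perfectly correlated with itself. For two different variables, such as ‘protein’ and ‘fat’, a coefficient close to 1 or -1 would mean that they are highly correlated.

In that case the two variables carry largely the same information, so we can remove one of them.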
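One common way to act on such an observation is to scan the upper triangle of the absolute correlation matrix and flag any column that is strongly correlated with an earlier one. The sketch below assumes a cut-off of 0.9 and uses made-up data. The helper name redundant_columns and the threshold are illustrative choices, not part of the original example.

```python
import numpy as np
import pandas as pd

def redundant_columns(df, threshold=0.9):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Strict upper triangle: each pair is checked once and the diagonal is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Hypothetical data: b is almost a multiple of a, c is only weakly related to both.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})

to_drop = redundant_columns(demo)
reduced = demo.drop(columns=to_drop)
print(to_drop)
print(list(reduced.columns))
```

Here only b should be flagged, since it nearly duplicates a. Which member of a correlated pair to drop is a judgment call. This sketch simply keeps the column that appears first.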