Linear Regression:
Today I learned about linear regression: the mathematical intuition behind it and its different types. Linear regression is a statistical method used to model the relationship between variables, typically between a dependent variable and one or more independent variables. It amounts to finding the best-fit line, that is, the line with the smallest residual error, where a residual is the difference between an actual data point and the value the line predicts for it.
The different types of Linear Regression which I came across today are:
- Simple Linear Regression: It is a statistical method that helps us understand the relationship between one independent variable and one dependent variable. The relationship is represented by the line equation Y = mX + c. Here Y is the dependent variable, X is the independent variable used to predict Y, m is the slope, and c is the intercept.
Example case of using Simple Linear Regression:
Suppose we want to predict an employee's salary from their years of experience. The salary is predicted from the experience factor, so salary is the dependent variable and years of experience is the independent variable.
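To make the salary example concrete, here is a minimal sketch in Python of fitting Y = mX + c by least squares. The data values and the function name `fit_simple_linear_regression` are made up for illustration (the salaries are chosen so that they lie exactly on a line), so treat this as a sketch under those assumptions rather than a real analysis.

```python
# Hypothetical data: years of experience (X) and salary in thousands (Y).
# The salaries are invented so that Y = 5*X + 30 holds exactly.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [35.0, 40.0, 45.0, 50.0, 55.0]


def fit_simple_linear_regression(xs, ys):
    """Return (m, c) that minimise the sum of squared residuals for Y = m*X + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance of X and Y divided by the variance of X.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # The best-fit line always passes through the point (x_mean, y_mean).
    c = y_mean - m * x_mean
    return m, c


m, c = fit_simple_linear_regression(years, salary)

# Residual = actual value minus the value predicted by the line.
residuals = [y - (m * x + c) for x, y in zip(years, salary)]

# Predict the salary for a (hypothetical) employee with 6 years of experience.
predicted_salary = m * 6.0 + c
print("slope:", m, "intercept:", c, "prediction:", predicted_salary)
```

With real data the points will not lie exactly on the line, so the residuals will be non-zero; the fitted line is the one that makes their squared sum as small as possible.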
- Multiple Linear Regression: It can be considered an extension of simple linear regression in which there are n independent variables instead of one. The relationship is represented by the equation Y = β0 + β1X1 + β2X2 + ... + βnXn. Here Y is the dependent variable, β0 is the intercept, β1 to βn are the regression coefficients, and X1 to Xn are the independent variables.
Example case of Multiple Linear Regression:
To predict the price of a car, several factors have to be considered, such as engine type, gear type, fuel type, body type, and many more. All of these factors are independent variables, and the car price is the dependent variable. In short, multiple regression is used when several different factors influence the target variable.
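The car price example can be sketched in the same way. The snippet below fits an intercept plus one coefficient per feature by solving the normal equations with plain Python. Everything in it is an assumption made for illustration: the three features (engine size in litres, and 0/1 flags for automatic gearbox and diesel fuel), the prices, and the helper names `solve_linear_system` and `fit_multiple_linear_regression`. The prices are generated from a known formula so that the fit can be checked.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_multiple_linear_regression(rows, ys):
    """Return [intercept, coef_1, ..., coef_n] minimising squared residuals."""
    # Prepend a constant 1.0 to each row so the first coefficient is the intercept.
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical cars: (engine size in litres, is_automatic, is_diesel).
features = [
    (1.0, 0, 0),
    (1.5, 1, 0),
    (2.0, 0, 1),
    (2.0, 1, 1),
    (1.2, 1, 0),
    (3.0, 0, 0),
]
# Invented prices (in thousands) built from: 10 + 4*engine + 2*automatic + 1.5*diesel.
prices = [10 + 4 * e + 2 * a + 1.5 * d for e, a, d in features]

coefficients = fit_multiple_linear_regression(features, prices)
intercept, engine_coef, automatic_coef, diesel_coef = coefficients

# Predict the price of a hypothetical 1.8 litre automatic petrol car.
predicted_price = intercept + engine_coef * 1.8 + automatic_coef * 1 + diesel_coef * 0
print(coefficients, predicted_price)
```

Categorical factors such as gear type or fuel type have to be turned into numbers before they can enter the equation; the 0/1 flags used here are one simple way of doing that.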
- Logistic Regression: This is used in binary classification problems, where the dependent variable falls into one of two categories. Although its name contains "regression", it predicts a category rather than a continuous value. Here the relationship is modeled by the logistic function, which produces probabilities between 0 and 1.
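The logistic function mentioned in the last item can be sketched in a few lines of Python. The weight, bias, and the 0.5 decision threshold below are hypothetical values picked for illustration, not fitted from any data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(x, weight, bias):
    """Probability of the positive class for one input feature x."""
    return sigmoid(weight * x + bias)


# Hypothetical model parameters (assumed, not learned from data).
weight, bias = 1.2, -3.0

probabilities = [predict_probability(x, weight, bias) for x in (0.0, 2.5, 5.0)]
# Turn each probability into a class label using a 0.5 threshold.
labels = [1 if p >= 0.5 else 0 for p in probabilities]
print(probabilities, labels)
```

The linear part `weight * x + bias` has the same form as Y = mX + c; the logistic function then squeezes that value into the range 0 to 1 so it can be read as a probability.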
Exploration of Dataset:
Along with linear regression, I also went through the CDC 2018 diabetes dataset. While studying the data I came across a few new terms, such as FIPS, which stands for Federal Information Processing Standards. The FIPS code uniquely identifies each county within a state, which makes it easier to work with the data. The dataset consists of three separate datasets: diabetes, inactivity, and obesity. To explore the data, I plotted a histogram to check the data distribution. Below is the image for that.
After creating the histogram of the %Diabetes feature in the diabetes dataset, I found that the data is not normally distributed. The bulk of the values sits toward the left side of the histogram, so there are more observations below the mean than above it. A distribution with this shape has its longer tail on the right and is conventionally called right-skewed (positively skewed). The histogram also suggests that there are a few outliers in the dataset.
After the diabetes data, I created a histogram for the ‘%Inactive’ feature in the inactivity dataset. Here the bulk of the values sits toward the right side of the histogram, so there are more observations above the mean than below it. A distribution with this shape has its longer tail on the left and is conventionally called left-skewed (negatively skewed). I also observed that the mean, median, and mode of the inactive feature are not equal, which suggests that the data is not normally distributed. The histogram also shows a few outliers. After creating the histogram for the obese feature in the obesity dataset, I observed characteristics similar to those of the inactivity dataset.
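The checks described above (comparing the mean with the median, counting observations on each side of the mean, and looking at the direction of the skew) can also be done numerically. Below is a minimal sketch using only the Python standard library. The two lists are hypothetical percentages invented for illustration and are NOT taken from the CDC files, and `skewness` and `describe` are helper names introduced here.

```python
import statistics


def skewness(values):
    """Moment-based skewness: third central moment divided by std**3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def describe(values):
    """Summarise where the observations sit relative to the mean."""
    mean = statistics.mean(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "below_mean": sum(v < mean for v in values),
        "above_mean": sum(v > mean for v in values),
        "skewness": skewness(values),
    }


# Hypothetical percentages (assumed values, not the real dataset).
long_right_tail = [6.0, 6.5, 7.0, 7.0, 7.5, 8.0, 8.5, 12.0, 16.0]
long_left_tail = [5.0, 9.0, 14.0, 15.0, 15.5, 16.0, 16.0, 16.5, 17.0]

right_summary = describe(long_right_tail)
left_summary = describe(long_left_tail)
print(right_summary)
print(left_summary)
```

A positive skewness value goes with a longer right tail, a mean above the median, and more observations below the mean; a negative value goes with the opposite pattern. For a roughly normal distribution the skewness is close to zero and the mean and median nearly coincide.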