September 20
A t-test is a statistical tool used mainly in hypothesis testing. It determines whether the difference in means between two groups reflects a real effect or is likely due to random chance, i.e., whether the difference is statistically significant. There are different types of t-test; the most commonly used are the independent-samples t-test and the paired-samples t-test. An independent-samples t-test compares the averages of two independent groups and assesses whether there is a significant difference between them. A paired-samples t-test compares the averages of two related groups (for example, the same subjects measured before and after a treatment).
Steps in a T-Test
The basic steps to perform a t-test are:
- Formulate the hypotheses: a null hypothesis (for example, that the means of the two samples are equal) and an alternative hypothesis (that they differ).
- Collect data from the two samples.
- Calculate the t-statistic using the means of the two samples, the standard deviations of the two samples, and the sample sizes.
- Calculate the p-value, which tells us whether the observed results are likely or unlikely to have occurred by random chance. If the p-value is small (typically < 0.05), the results are unlikely to have occurred by chance; if it is large (typically > 0.05), the results could easily have occurred by chance.
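The steps above can be sketched in Python using `scipy.stats.ttest_ind` for an independent-samples t-test; the two groups here are simulated data, just as a stand-in:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Step 1: hypotheses -- H0: the two group means are equal; H1: they differ.
# Step 2: collect data (simulated here as two independent samples).
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=55, scale=5, size=30)

# Step 3: compute the t-statistic (and its p-value) from the sample
# means, standard deviations, and sizes.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Step 4: compare the p-value to the 0.05 significance level.
significant = p_value < 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {significant}")
```

For two related groups (e.g., before/after measurements on the same people), `stats.ttest_rel` would be used instead.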
September 18
A quadratic model is a type of polynomial regression; it describes the relationship between a dependent variable and one or more independent variables using a quadratic equation. It is used when the relationship between the variables is not linear but instead follows a U-shaped (or inverted-U-shaped) curve; in that case a quadratic model is a good choice.
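A minimal sketch of fitting a quadratic model with `numpy.polyfit`, using simulated U-shaped data (the true coefficients 2, -3, 1 are made up for illustration):

```python
import numpy as np

# Simulated U-shaped data: y = 2x^2 - 3x + 1 plus noise.
rng = np.random.default_rng(1)
x = np.linspace(-5, 5, 50)
y = 2 * x**2 - 3 * x + 1 + rng.normal(scale=2.0, size=x.size)

# Fit a degree-2 polynomial: y ~ a*x^2 + b*x + c.
a, b, c = np.polyfit(x, y, deg=2)
print(f"fitted: y = {a:.2f}x^2 + {b:.2f}x + {c:.2f}")
```

The fitted coefficients land close to the true values, which is the sign that a quadratic form captures the curvature a straight line would miss.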
Overfitting is a common problem that occurs when a model is overtrained on the data, i.e., it learns the training data too well and captures the noise and random fluctuations present in it. The resulting model works very well on the training data but performs very poorly on testing or unseen data. Overfitting has several possible causes, such as using a model that is too complex for the data, insufficient data, or noisy data. There are a few ways to avoid it: cross-validation, a technique where the dataset is divided into multiple parts and the model is evaluated several times, can help detect overfitting; choosing the right (simpler) model is important, because a model that is too complex for the given dataset is likely to overfit; and collecting more data can help the model generalize better and reduce the risk of overfitting.
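The cross-validation idea can be sketched with scikit-learn's `cross_val_score`, which splits the data into folds and scores the model on each held-out fold (the data here is simulated for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=100)

# 5-fold cross-validation: train on 4 parts, score (R^2) on the held-out part.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("fold R^2 scores:", np.round(scores, 3))
print("mean R^2:", round(scores.mean(), 3))
```

If a model scored well on training data but the cross-validation scores were much lower, that gap would be a warning sign of overfitting.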
September 13
In today’s class the professor discussed the p-value. First, the professor showed us a video that explains the p-value with an example. The p-value is used to measure the strength of evidence against the null hypothesis. For example, a scientist in a lab experiments with a new drug by giving it to a few people and comparing their health with a group of people who did not take the medicine; after the experiment, he gets a p-value (a numerical value) that tells him whether the observed results are likely or unlikely to have occurred by random chance. If the p-value is small (typically < 0.05), the results are unlikely to have occurred by chance; if it is large (typically > 0.05), the results could easily have occurred by chance.
Project Update:
Today I tried to integrate the three datasets (Inactivity, Obesity, and Diabetes). After combining them, I saw that there were so many missing values that we could not proceed any further. So I checked the common-fips file for all three datasets; this file was a lead/clue given by the professor for dealing with the data. I used that Word document to extract the FIPS codes common to all three datasets and created a new dataset that does not contain any missing values. The image below shows how I combined all three datasets.
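The merge can be sketched with pandas; the tiny tables and column names below are stand-ins for the real CDC files, not the actual data:

```python
import pandas as pd

# Toy stand-ins for the three tables; the column names are assumptions.
diabetes = pd.DataFrame({"FIPS": [1001, 1003, 1005], "%Diabetes": [9.5, 8.2, 11.0]})
obesity = pd.DataFrame({"FIPS": [1001, 1005, 1007], "%Obese": [33.1, 35.4, 30.2]})
inactivity = pd.DataFrame({"FIPS": [1001, 1005], "%Inactive": [25.0, 28.3]})

# Inner joins keep only the FIPS codes present in all three tables,
# so the combined frame has no missing values from unmatched counties.
combined = diabetes.merge(obesity, on="FIPS").merge(inactivity, on="FIPS")
print(combined)
```

An outer join (`how="outer"`) would instead keep every county and fill the gaps with NaN, which is the missing-value problem the inner join avoids.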
September 11th 2023
Linear Regression:
Today I learned about linear regression, the mathematical intuition behind it, and its types. Linear regression is a statistical method used to model the relationship between two variables (typically a dependent and an independent variable). It is like finding the best-fit line, the one with the least residual error, i.e., the smallest gap between the actual data points and the predicted data points.
The different types of Linear Regression which I came across today are:
- Simple Linear Regression: a statistical method that models the relationship between one independent variable and one dependent variable. The relationship is represented by the line equation Y = mX + c, where Y is the dependent variable, X is the independent variable used to predict Y, m is the slope, and c is the intercept.
Example case of using Simple Linear Regression:
Suppose we want to predict an employee's salary based on years of experience. Since salary is predicted from experience, salary is the dependent variable and years of experience is the independent variable.
- Multiple Linear Regression: an extension of simple linear regression in which there are n independent variables. The relationship is given by the equation Y = b0 + b1X1 + b2X2 + ... + bnXn, where Y is the dependent variable, each bi is a regression coefficient, and each Xi is an independent variable.
Example case of Multiple Linear Regression:
To predict a car's price there are multiple factors to consider, like engine type, gear type, fuel type, body type, and many more; all of these factors are independent variables and the car price is the dependent variable. Basically, multiple regression is used when we have several different factors influencing the target variable.
- Logistic Regression: used in binary classification problems, where the dependent variable falls into one of two categories. The relationship is modeled by the logistic function, which produces probabilities between 0 and 1.
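The salary example above can be sketched as a simple linear regression with `numpy.polyfit`; the experience/salary numbers are hypothetical:

```python
import numpy as np

# Hypothetical data: years of experience (X) vs salary in $1000s (Y).
years = np.array([1, 2, 3, 5, 7, 10], dtype=float)
salary = np.array([45, 50, 58, 68, 80, 98], dtype=float)

# Fit Y = mX + c by least squares (minimizing the residual error).
m, c = np.polyfit(years, salary, deg=1)
predicted = m * 4 + c  # predicted salary at 4 years of experience
print(f"slope m = {m:.2f}, intercept c = {c:.2f}, prediction(4 yrs) = {predicted:.1f}")
```

The slope m is the expected salary increase per extra year of experience, which is exactly the "m" in the Y = mX + c equation above.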
Exploration of Dataset:
Along with linear regression, I also went through the CDC 2018 diabetes dataset. While understanding the data I found a few new terms, like FIPS, which refers to the Federal Information Processing Standards codes used to uniquely identify the different counties in the states and work with the data. The dataset actually contains three separate datasets: diabetes, inactivity, and obesity. To explore the data, I plotted a histogram to check the data distribution. Below is the image for that.
After creating the histogram of the %Diabetes feature in the diabetes dataset, I found that the distribution is not a normal distribution, and I observed that it is skewed to the left, which means the long tail is on the low side: the mean is pulled below the median, so most observations lie above the mean. The distribution also suggests there are a few outliers in the dataset.
After the diabetes data, I created a histogram for the %Inactive feature in the inactivity dataset. Observing it, I saw that the histogram is right-skewed, which means the long tail is on the high side: the mean is pulled above the median, so most observations lie below the mean. I also observed that the mean, median, and mode of the %Inactive feature are not equal, which implies the data is not normally distributed, and the histogram shows a few outliers. After creating the histogram for the %Obese feature in the obesity dataset, I observed characteristics similar to the inactivity data.
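The mean-vs-median check for skewness can be sketched numerically; the data here is a simulated right-skewed sample standing in for the %Inactive column, not the real CDC values:

```python
import numpy as np
from scipy import stats

# Simulated right-skewed values (gamma-distributed) as a stand-in
# for the %Inactive feature.
rng = np.random.default_rng(3)
inactive = rng.gamma(shape=2.0, scale=5.0, size=500)

# In a right-skewed distribution the long upper tail pulls the mean
# above the median, and the skewness statistic is positive.
skew = stats.skew(inactive)
print(f"mean = {inactive.mean():.2f}, median = {np.median(inactive):.2f}, skew = {skew:.2f}")
```

For a left-skewed feature like %Diabetes, the same check would show the mean below the median and a negative skewness value.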