KNN Algorithm

K-Nearest Neighbors (KNN) is a machine learning algorithm used for both classification and regression tasks. A new object is classified by a majority vote among its K nearest neighbors in the feature space, or, for regression, its value is predicted as the average of those neighbors' values. Here K is the number of neighbors considered when making the decision/prediction.

For example, suppose you have a new friend and want to gift him the latest video game. Instead of asking him directly whether he likes video games, you could ask the people closest to him: if most of his friends are into video games, the new friend is very likely to be as well.

Steps involved in the KNN Algorithm:

Determining the K Value: First, the value of K is chosen; this is the number of neighbors considered when classifying a new data point. This is the most important step, as it affects the performance of the algorithm: a small K makes the model sensitive to noise and prone to overfitting, while a very large K can lead to underfitting.

Data Preparation: Standard data preprocessing steps are performed before applying the KNN algorithm, including handling missing values, removing inconsistencies, and standardizing the data.

Distance Calculation: For each new data point, the distance to every training data point is calculated. Common distance metrics include Euclidean distance and Manhattan distance.

Selecting the K Nearest Neighbors: The calculated distances are sorted in ascending order and the K nearest neighbors are selected.

Determining the Class: The new data point is assigned to the class that is most prevalent among its K nearest neighbors.
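The steps above can be sketched in a few lines of plain Python. This is a minimal illustration with made-up toy data, not a production implementation:

```python
import math
from collections import Counter

def knn_classify(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Distance calculation: Euclidean distance to every training point
    distances = []
    for features, label in train:
        distances.append((math.dist(features, new_point), label))
    # Sort ascending and take the k nearest neighbors
    distances.sort(key=lambda pair: pair[0])
    k_labels = [label for _, label in distances[:k]]
    # Assign the most prevalent class among those neighbors
    return Counter(k_labels).most_common(1)[0][0]

# Toy data: (features, label) pairs
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.5), "B")]
print(knn_classify(train, (1.2, 1.5), k=3))  # → A
```

With k=3 the three closest points to (1.2, 1.5) are two "A"s and one "B", so the majority vote returns "A".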

Applications of the KNN Algorithm:

Medical Diagnosis

Recommender Systems

Spam Filtering

Image Classification

K-Means Clustering and DBSCAN Clustering

K-Means Clustering:

K-means clustering is a popular unsupervised learning algorithm whose goal is to group similar data points and discover underlying patterns in the data. The algorithm starts by randomly initializing K centroids, where K is defined by the user. Each data point is assigned to the cluster with the closest centroid; the centroids are then recomputed, and the assignment step is repeated. This process continues until the centroids no longer change or a maximum number of iterations is reached.
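The assign-then-update loop can be sketched in plain Python. The toy data and the fixed random seed here are assumptions for illustration only:

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means: returns final centroids and the cluster assignments."""
    rng = random.Random(seed)
    # Randomly initialize K centroids from the data points
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster with the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # Stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(points, k=2)
```

On this well-separated toy data the loop converges quickly, splitting the six points into the two obvious groups.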

DBSCAN Clustering:

DBSCAN is a popular unsupervised machine learning algorithm for discovering underlying patterns in data. It groups data points based on their density and proximity to each other, following the idea that points that are close together in space are more likely to belong to the same category.

Example working of the DBSCAN algorithm:

Suppose we have a group of people in a room and want to divide them into groups based on how close they stand to each other. First, we specify two parameters: Eps, the maximum distance between two people for them to be considered neighbors, and MinPts, the minimum number of neighbors a person must have to be considered a core point. DBSCAN identifies all the core points in the room, then iteratively expands each cluster by adding the core points' neighbors, including border points; points that belong to no cluster are treated as noise. In summary, it is a powerful tool for clustering data.
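The Eps/MinPts idea can be sketched roughly as follows. This is a simplified, illustrative implementation with made-up coordinates; points labeled -1 are noise:

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id; -1 marks noise."""
    labels = [None] * len(points)  # None = not yet visited
    cluster_id = 0

    def neighbors(i):
        # All points within eps of point i (including i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:      # not a core point: tentatively noise
            labels[i] = -1
            continue
        labels[i] = cluster_id       # start a new cluster from this core point
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:      # a border point previously marked as noise
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(j_nbrs)
        cluster_id += 1
    return labels

points = [(0, 0), (0.3, 0.2), (0.1, 0.4), (5, 5), (5.2, 5.1), (9, 9)]
print(dbscan(points, eps=1.0, min_pts=2))  # → [0, 0, 0, 1, 1, -1]
```

The two dense groups become clusters 0 and 1, while the isolated point at (9, 9) has too few neighbors to be a core point and is labeled noise.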

Logistic Regression

Today I learned about Logistic Regression. It is a statistical method for analyzing a dataset in which one or more independent variables help determine an outcome. In this type of regression, the dependent variable takes the values 0 or 1; these binary values typically represent a category. Another important characteristic of logistic regression is that it can have more than one independent variable helping to determine the output (dependent) variable, and the independent variables can be continuous or categorical. Logistic regression uses the logistic function to model the relationship between the independent variables and the probability of the binary outcome. The logit (log-odds) function is defined as

Logit(p) = ln(p/(1-p))

Here p represents the probability of the event occurring.
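A quick sketch in Python of the logit and its inverse, the logistic (sigmoid) function, which maps any real number back to a probability between 0 and 1:

```python
import math

def logit(p):
    """Log-odds of a probability p, i.e. ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps any real number to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# The two functions are inverses of each other
print(round(sigmoid(logit(0.8)), 6))  # → 0.8
```

In a fitted model, z is the linear combination of the independent variables, and sigmoid(z) gives the predicted probability of the outcome being 1.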

Logistic regression is widely used in areas such as healthcare, finance, and the social sciences. In summary, logistic regression is a tool for understanding the relationship between the independent variables and the dependent variable when the outcome is categorical.

Visualization of Latitudes and Longitudes:

The dataset contains 840 missing values across the latitude and longitude attributes. I removed all instances with missing values in these two attributes, and I used Python's Folium library to visualize the geospatial data.
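The missing-coordinate removal can be sketched in plain Python. The record layout and field names below are assumptions for illustration; the actual dataset's schema may differ:

```python
# Hypothetical records; None marks a missing coordinate
records = [
    {"state": "CA", "latitude": 34.05, "longitude": -118.24},
    {"state": "TX", "latitude": None,  "longitude": -97.74},
    {"state": "NY", "latitude": 40.71, "longitude": None},
]

# Keep only instances where both coordinates are present
complete = [r for r in records
            if r["latitude"] is not None and r["longitude"] is not None]
print(len(complete))  # → 1
```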

This method not only enhances our understanding of the data but also offers a more intuitive perspective compared to traditional tabular data. It allows us to uncover spatial trends and connections that might be less apparent when examining raw numbers or traditional charts.

After examining the map, I noticed an unusual data point that stands out: it is located in Canada. A predominant number of shootings occurred in the eastern region of the United States, particularly in the central and south-eastern areas. Significant numbers of shootings were also documented in California.

Investigating Ages

Exploring the Age attribute:

The most interesting attribute I found is 'Age'. I wanted to explore which age groups have been most often involved in the crimes, starting from the initial assumption that 'the age group 21-29 may be more frequently involved in crimes'.

I found 503 missing ages and 81 distinct age values in the dataset. I ignored the missing values and converted this continuous variable into a categorical one by introducing a new variable, 'age-groups'.

Upon investigation, I found that the age group '31-35' is most often involved as perpetrator or victim, followed by the groups '21-25', '26-30', and '36-40'.
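The continuous-to-categorical conversion can be sketched like this. The 5-year bucket boundaries are an assumption based on the group labels mentioned above, and the ages are made up:

```python
from collections import Counter

def age_group(age, width=5, start=21):
    """Map a raw age to a 5-year bucket label such as '21-25' (assumed scheme)."""
    if age is None:
        return None  # ignore missing ages
    if age < start:
        return f"under {start}"
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"

# Toy ages, including one missing value
ages = [23, 33, 34, 28, 45, None, 31, 38, 22, 35]
counts = Counter(g for a in ages if (g := age_group(a)) is not None)
print(counts.most_common(1)[0][0])  # → 31-35
```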

Analysis on Dataset

Missing Values:

I found a few missing values in the dataset; the image below gives the number of missing values in each attribute.

There are a total of 8002 instances with 17 attributes; most of the missing values are in the 'race' attribute. Going forward, I will either remove the instances with missing values or handle them with a few imputation techniques.

State-wise victims:

I went through the dataset and found a total of 51 different states. I started my analysis with the assumption that most of the victims would be in the states with the highest populations.

From the above visualization we can see that California has the highest number of police shootings and Rhode Island the lowest. The larger the population, the more crime and the greater the number of police stations.

Genders as victims:

After going through the state-wise victims, I found an interesting attribute, 'Gender'; I wanted to explore which gender has more frequently been the target or victim.

The analysis shows that males have far more frequently been the target or victim, accounting for almost 95.7% of the data, while there are fewer than 400 female instances.

Victims with Mental Illness:

Another interesting attribute I found is 'Mental Illness'. I started exploring this attribute with the assumption that a significant percentage of victims would show signs of mental illness.

In the above visualization, True indicates signs of mental illness and False indicates no signs of mental illness. From the visualization I can say that only about 20% of cases involve signs of mental illness, while the remaining 80% of victims showed no such signs.

Project 2, Observations on Data

Today I went through the dataset. Below are my observations:

There are 17 attributes and 8002 instances. Looking at the data, I noticed many inconsistencies and inaccuracies, and there are missing values in almost all of the attributes. I found a few interesting attributes that can be explored further, such as age, gender, and city or state; using these we can generate some interesting insights, and they are the attributes I will move forward with for further analysis. Another interesting attribute worth pursuing is mental illness: with it we can examine whether a significant proportion of the crimes may be linked to individuals with mental illness. Using attributes such as age, gender, and location, we can also determine whether a notable proportion of the crimes is associated with distinct age groups, specific cities or states, varying genders, or different ethnicities.