t-test and ANOVA

A t-test is a statistical method used to compare the means of two groups to assess whether there is a significant difference between them. It’s commonly employed when dealing with small sample sizes and assumes that the data follows a normal distribution. The t-test generates a t-statistic, and a low p-value associated with this statistic suggests that the means of the two groups are significantly different. There are different types of t-tests, such as the independent samples t-test (for comparing means of two independent groups) and the paired samples t-test (for comparing means of two related groups).
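As a minimal sketch of how such a test might be run in Python (assuming SciPy is available; the group arrays here are made-up example data):

    from scipy import stats
    import numpy as np

    # Made-up measurements for two independent groups
    group_a = np.array([23.1, 25.4, 22.8, 26.0, 24.7, 23.9])
    group_b = np.array([27.9, 29.2, 26.5, 30.1, 28.4, 27.2])

    # Independent-samples t-test; equal_var=False gives Welch's version,
    # which does not assume the two groups have equal variances
    t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
    print(t_stat, p_value)  # a small p-value suggests the two means differ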

ANOVA, on the other hand, is used when comparing means among three or more groups. It assesses whether there are statistically significant differences in the means of the groups by analyzing the variance within and between groups. ANOVA is applicable when dealing with multiple independent groups or multiple levels of a categorical variable. The F-statistic generated by ANOVA is used to determine whether the group means are significantly different. If ANOVA indicates significance, post-hoc tests may be employed to identify which specific groups differ.
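A similarly hedged sketch of a one-way ANOVA with SciPy (the three groups are again made-up example data):

    from scipy import stats

    # Made-up measurements for three independent groups
    group1 = [5.1, 4.9, 5.4, 5.0, 5.2]
    group2 = [6.2, 6.0, 5.8, 6.4, 6.1]
    group3 = [7.1, 6.9, 7.3, 7.0, 7.2]

    # One-way ANOVA: the F-statistic tests whether at least one group mean differs
    f_stat, p_value = stats.f_oneway(group1, group2, group3)
    print(f_stat, p_value)  # if significant, follow up with post-hoc tests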

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups data points according to how densely they are packed in the feature space. A point is a core point when at least a specified number of neighboring points lies within a given distance (ε) of it. Points that do not meet this density criterion themselves but lie within ε of a core point are designated as border points. Clusters are formed as the algorithm iteratively expands from core points through their density-connected neighborhoods, absorbing border points along the way. Points assigned to no cluster are labeled as noise. Since the number of clusters is not fixed in advance, DBSCAN can find clusters of different sizes and shapes.

Clusters labeled 4 and 2 in a DBSCAN result simply reflect the algorithm's exploration of the data's density distribution and the spatial relationships between points, with each cluster corresponding to its own density-connected region.
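A minimal sketch of DBSCAN with scikit-learn, assuming a generic 2-D feature matrix (the data and parameter values here are placeholders):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Placeholder 2-D feature matrix
    X = np.random.RandomState(0).rand(200, 2)

    # eps is the neighborhood radius; min_samples is the number of neighbors
    # a point needs within eps to count as a core point
    db = DBSCAN(eps=0.1, min_samples=5).fit(X)

    # Cluster labels are 0, 1, 2, ...; points labeled -1 are treated as noise
    print(np.unique(db.labels_))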

K-means and K-medoids

In today's class we discussed K-means and K-medoids.

K-means: K-Means is a popular clustering algorithm used in unsupervised machine learning. The algorithm aims to partition a dataset into K clusters, where each data point belongs to the cluster with the nearest mean. The number of clusters, K, is a parameter that needs to be specified before running the algorithm. The algorithm iteratively assigns data points to clusters based on their distances to cluster centroids and updates the centroids until convergence. K-Means is sensitive to the initial placement of centroids, and different initializations can lead to different final cluster assignments.
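A minimal scikit-learn sketch (placeholder data; n_init reruns the algorithm with several random initializations to reduce the sensitivity to the initial centroids mentioned above):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.RandomState(42).rand(300, 2)  # placeholder data

    km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(km.cluster_centers_)  # final centroids (the mean of each cluster)
    print(km.labels_[:10])      # cluster assignments of the first 10 points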

K-medoids: K-Medoids is a variant of K-Means that, instead of using cluster centroids, chooses actual data points within the clusters as representatives, or medoids. Medoids are more robust to outliers than centroids, making K-Medoids less sensitive to extreme values. The algorithm iteratively refines cluster assignments by selecting, as each cluster's medoid, the data point that minimizes the sum of dissimilarities (often measured with a distance metric such as Euclidean distance) to the other points in the same cluster. K-Medoids is particularly useful when the mean may not be a representative measure (e.g., in the presence of outliers) or when dealing with non-numeric data.
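A comparable sketch, assuming the scikit-learn-extra package (which provides a KMedoids estimator) is installed; the data are again placeholders:

    import numpy as np
    from sklearn_extra.cluster import KMedoids

    X = np.random.RandomState(0).rand(300, 2)  # placeholder data

    kmed = KMedoids(n_clusters=3, metric="euclidean", random_state=0).fit(X)
    print(kmed.cluster_centers_)  # the chosen medoids, which are actual rows of X
    print(kmed.labels_[:10])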

In summary, both K-Means and K-Medoids are clustering algorithms used to group similar data points together. K-Means uses cluster centroids, while K-Medoids uses actual data points as representatives, making it more robust to outliers. The choice between them depends on the nature of the data and the desired characteristics of the clusters.

 

October 18

In class today we discussed the age distribution of people killed by police, using a dataset from the Washington Post. According to the descriptive statistics, the age variable, recorded for every individual, has a slightly skewed distribution. Comparing the distribution to a normal distribution and counting the people who lie within one standard deviation of the mean (the −1 to +1 band), we find that about 67% of the individuals shot by police fall between the ages of 24 and 50.
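A rough sketch of that calculation, assuming the Washington Post data has been loaded into a pandas DataFrame with an "age" column (the file and column names are assumptions):

    import pandas as pd

    df = pd.read_csv("fatal-police-shootings-data.csv")  # hypothetical file name
    ages = df["age"].dropna()

    mean, std = ages.mean(), ages.std()
    lower, upper = mean - std, mean + std  # the -1 to +1 standard deviation band

    share_within = ages.between(lower, upper).mean()
    print(lower, upper, share_within)  # roughly 24, 50, and about 0.67 as described above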

We then repeat the same procedure separately for Black and White individuals. We find that about 68% of the Black individuals who were shot were between the ages of 21 and 44, while the corresponding range for White individuals was 27 to 53. Next, we ask whether there is a difference in the average age of Black and White individuals shot by police and, if so, how large it is. Using a Monte Carlo (permutation) approach, we find a gap of about 7 years and conclude that this disparity is not the result of chance.
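A minimal sketch of the Monte Carlo (permutation) idea, reusing the hypothetical file and column names above and assuming the race codes are "B" and "W":

    import numpy as np
    import pandas as pd

    df = pd.read_csv("fatal-police-shootings-data.csv")  # hypothetical file name
    black_ages = df.loc[df["race"] == "B", "age"].dropna().to_numpy()
    white_ages = df.loc[df["race"] == "W", "age"].dropna().to_numpy()
    observed_diff = white_ages.mean() - black_ages.mean()  # a gap of roughly 7 years

    # Shuffle the pooled ages many times and see how often a gap at least
    # this large appears by chance alone
    pooled = np.concatenate([black_ages, white_ages])
    rng = np.random.default_rng(0)
    extreme = 0
    for _ in range(10_000):
        rng.shuffle(pooled)
        perm_diff = pooled[:len(white_ages)].mean() - pooled[len(white_ages):].mean()
        if abs(perm_diff) >= abs(observed_diff):
            extreme += 1
    print(observed_diff, extreme / 10_000)  # a near-zero proportion suggests the gap is not chance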

We also use Cohen's d to measure the effect size of this difference across the entire dataset. The result, 0.577, falls into the "medium" category on the Cohen-Sawilowsky effect-size scale.
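Cohen's d is simply the mean difference divided by the pooled standard deviation; a hand-rolled sketch, continuing with the hypothetical age arrays above:

    import numpy as np

    def cohens_d(a, b):
        """Cohen's d for two independent samples, using the pooled standard deviation."""
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        na, nb = len(a), len(b)
        pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
        return (a.mean() - b.mean()) / np.sqrt(pooled_var)

    # e.g. cohens_d(white_ages, black_ages); per the analysis above this comes out
    # to about 0.577, a "medium" effect on the Cohen-Sawilowsky scale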

October 16

Today's lecture covered geocoding and geospatial analysis and introduced the robust Python package GeoPy. The professor demonstrated how to map and visualize data points on a map of the United States, constructing several kinds of plots, including a GeoHistogram. The students then used GeoPy to determine the locations of police shootings, demonstrating the practical application of geospatial data analysis in real-world scenarios, with examples extending to California.
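A minimal GeoPy sketch, assuming the free Nominatim geocoding service (the user_agent string is an arbitrary identifier you supply):

    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent="class-demo")  # hypothetical identifier

    # Forward geocoding: place name to coordinates
    location = geolocator.geocode("Boston, MA")
    print(location.latitude, location.longitude)

    # Reverse geocoding: coordinates back to an address
    print(geolocator.reverse("42.3601, -71.0589"))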

Additionally, DBSCAN, a density-based clustering method, was presented. Integrating the geocoded coordinates with DBSCAN, the professor showed that the dataset contains four distinct clusters. This method can be quite helpful for uncovering hidden spatial patterns in data.
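One common way to cluster latitude/longitude points with DBSCAN (not necessarily how it was done in class) is to use the haversine metric on coordinates converted to radians; the coordinates and eps value below are placeholders:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Placeholder (latitude, longitude) pairs
    coords = np.array([[34.05, -118.24], [34.06, -118.25],
                       [40.71, -74.01],  [40.72, -74.00]])

    # haversine distances are in radians; eps of ~0.008 rad is roughly 50 km on Earth
    db = DBSCAN(eps=0.008, min_samples=2, metric="haversine").fit(np.radians(coords))
    print(db.labels_)  # nearby points share a label; -1 would mark noise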
In addition, the class had an interesting conversation regarding the connection between crime rates and police shootings. We investigated the fascinating idea that, even in places with lower total crime rates, crime intensity might play a significant role. This raised a number of issues and suggested possible lines of inquiry for additional research.
In conclusion, today's class gave a thorough overview of geospatial analysis with GeoPy, presented a useful clustering technique in DBSCAN, and began a stimulating investigation into the relationship between crime and police shootings.

October 13

In today's class I started analyzing a dataset of fatal police shootings in the US. I began by closely examining a number of the dataset's variables. Notably, the numeric column of interest labeled "age" records the ages of the people involved in these tragic incidents. The dataset also includes geospatial information, namely latitude and longitude, which makes it possible to plot the shooting locations precisely.

During this initial review I found an "id" column that appeared to have little analytical use, so I thought it might be best to exclude it. I also checked thoroughly for missing values, which turned up null or missing data in a number of variables, including "name," "armed," "age," "gender," "race," "flee," "longitude," and "latitude." In addition, I discovered one duplicate entry in the dataset, which stood out because it lacked a "name" value.
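A short pandas sketch of those cleaning steps, assuming the data is in a CSV with the column names mentioned above (the file name is hypothetical):

    import pandas as pd

    df = pd.read_csv("fatal-police-shootings-data.csv")  # hypothetical file name

    # Drop the identifier column, which carries little analytical value
    df = df.drop(columns=["id"])

    # Count missing values per column (name, armed, age, gender, race, flee, longitude, latitude, ...)
    print(df.isna().sum())

    # Inspect and then remove duplicate rows
    print(df[df.duplicated(keep=False)])
    df = df.drop_duplicates()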

GeoHistograms and geospatial analysis were presented as essential methods for examining and visualizing geographic data. These techniques allow us to identify geographic hotspots, clusters, and spatial trends within the dataset.

October 11

Understanding the dataset of locations of police shootings – continental US

We are examining data on shootings by police over the past year. First, we gather the data, correct errors, and make sure it is accurate. Then, to better understand it, we use both numerical summaries and visualizations. We look at questions like who was harmed, where it occurred, and whether there is a pattern.

Additionally, we track the numbers over time to look for any variations. We want to know whether some demographics are more affected than others, as well as the outcomes for the police personnel involved. We also examine the public response to these incidents and whether they alter police policy. Finally, we present our results and obtain expert comments to ensure that we are conducting this research accurately and fairly.

A thorough report and visual aids are produced in order to efficiently convey and summarize the insights that are obtained from the data. While the report guarantees a thorough and educational presentation of the investigation, these graphics help to elucidate and support the key conclusions. It’s critical to treat the data with care at every stage of the procedure, to interpret the results objectively, and to take ethical issues into account when doing the analysis. To offer precise insights into this delicate and complex matter, meticulous data analysis coupled with ethical responsibility is necessary.

 

Cross-validation

In conducting cross-validation on a dataset comprising 354 data points related to obesity, inactivity, and diabetes, the process involves systematically partitioning the dataset into k = 5 subsets (folds). In each iteration of the cross-validation, a model is trained on k − 1 folds of the data and tested on the remaining fold, allowing for a comprehensive evaluation of the model's performance across different subsets of the dataset. By calculating both training and testing errors for each iteration and subsequently averaging these errors, the cross-validation approach provides a more robust and reliable estimate of the model's effectiveness, preventing potential biases that might arise from a single split of the data.

This iterative process ensures that each data point contributes to both the training and testing phases at least once, facilitating a more thorough understanding of the model’s ability to generalize to new data. In summary, cross-validation serves as a valuable tool for assessing model performance, particularly in the context of complex relationships between variables like obesity, inactivity, and diabetes, helping to uncover patterns and ensuring the model’s reliability across diverse subsets of the dataset.
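A hedged sketch of 5-fold cross-validation with scikit-learn; the synthetic X and y below merely stand in for the real obesity, inactivity, and diabetes values:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-ins: X holds obesity and inactivity rates, y the diabetes rate
    rng = np.random.default_rng(0)
    X = rng.random((354, 2))
    y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.05, size=354)

    # cv=5: each fold is held out once while the model trains on the other 4 folds
    scores = cross_val_score(LinearRegression(), X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(-scores)         # test MSE on each of the 5 folds
    print(-scores.mean())  # average MSE across the folds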

Average MSE

Comparison of pre-molt and post-molt histograms

In the crab analysis, the results section would start with descriptive statistics such as the minimum, maximum, median, mean, and standard deviation for the post-molt and pre-molt size variables.

The skewness is −1.396 for the post-molt size and −1.430 for the pre-molt size, which indicates that both distributions are negatively skewed. The kurtosis is 3.295 for post-molt and 3.354 for pre-molt. A distribution with kurtosis greater than 3 is called leptokurtic, which means it produces more outliers than a normal distribution would.
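A sketch of how those numbers could be computed with SciPy (the file and column names are assumptions; fisher=False gives the Pearson definition of kurtosis, under which a normal distribution has kurtosis 3):

    import pandas as pd
    from scipy.stats import skew, kurtosis

    crabs = pd.read_csv("crab_molt.csv")      # hypothetical file name
    post_molt = crabs["postsz"].dropna()      # hypothetical column names
    pre_molt = crabs["presz"].dropna()

    print(skew(post_molt), skew(pre_molt))    # about -1.40 and -1.43 here
    print(kurtosis(post_molt, fisher=False),  # > 3, i.e. leptokurtic
          kurtosis(pre_molt, fisher=False))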

Probability density function (PDF) histograms and smooth histograms would also be included to visually show the distribution of the data.
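Continuing the hypothetical crab sketch above, density-scaled histograms with a smooth (kernel density) curve could be drawn, for example, with seaborn:

    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
    sns.histplot(post_molt, kde=True, stat="density", ax=axes[0])
    axes[0].set_title("Post-molt size")
    sns.histplot(pre_molt, kde=True, stat="density", ax=axes[1])
    axes[1].set_title("Pre-molt size")
    plt.show()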