October 13

In today’s class I started analyzing a dataset of fatal police shootings in the US. I began by closely examining several of the dataset’s variables. Notably, the numeric column labeled “age” records the ages of the people involved in these terrible incidents. The dataset also includes geospatial information, namely latitude and longitude, which makes it possible to plot the shooting locations precisely.

During this initial evaluation I discovered an “id” column that appeared to have little analytical use, so I thought it best to exclude it. I also checked thoroughly for missing values, which turned up nulls in a number of variables, including “name,” “armed,” “age,” “gender,” “race,” “flee,” “longitude,” and “latitude.” In addition, I discovered one duplicate entry in the dataset, which stood out due to its absence of a “name” value.
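A minimal pandas sketch of these checks might look like the following (the file name is an assumption):

```python
import pandas as pd

# Load the fatal police shootings dataset (file name is an assumption).
df = pd.read_csv("fatal_police_shootings.csv")

# The "id" column adds little analytical value, so drop it.
df = df.drop(columns=["id"])

# Count missing values per column; "name", "armed", "age", "gender",
# "race", "flee", "longitude", and "latitude" all contain nulls.
print(df.isnull().sum())

# Inspect duplicate rows (one stood out for its missing "name"),
# then remove them.
print(df[df.duplicated()])
df = df.drop_duplicates()
```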

GeoHistograms and geospatial analysis were presented as essential methods for examining and visualizing geographical data. These techniques allow us to identify geographic hotspots, clusters, and spatial trends within the dataset.
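As a rough stand-in for a GeoHistogram, the shooting coordinates can be binned into a 2D histogram. This sketch assumes the cleaned DataFrame `df` from the snippet above, and the continental-US bounding box values are illustrative assumptions:

```python
import matplotlib.pyplot as plt

# Keep rows with valid coordinates inside a rough continental-US bounding box.
geo = df.dropna(subset=["longitude", "latitude"])
geo = geo[geo["longitude"].between(-125, -66) & geo["latitude"].between(24, 50)]

# Binning incidents over longitude/latitude highlights geographic hotspots.
plt.hist2d(geo["longitude"], geo["latitude"], bins=60, cmap="Reds")
plt.colorbar(label="Number of incidents")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("GeoHistogram of fatal police shootings")
plt.show()
```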

October 11

Understanding the dataset of locations of police shootings (continental US)

We are examining data on shootings by police over the past year. First, we gather and clean the data to ensure it is accurate. Then, to better understand it, we use both summary statistics and visualizations. We look at questions such as who was harmed, where incidents occurred, and whether there is a pattern.

Additionally, we track the numbers over time to look for variations. We want to know whether some demographic groups are more affected than others, and what the outcomes were for the police personnel involved. We also examine the public response to these incidents and whether they lead to changes in police policy. Finally, we present our results and seek expert feedback to ensure that we are conducting this research accurately and fairly.

A thorough report and visual aids are produced to summarize and communicate the insights obtained from the data. The graphics clarify and support the key conclusions, while the report ensures a complete and informative presentation of the investigation. It is critical to treat the data with care at every stage, to interpret the results objectively, and to take ethical considerations into account throughout the analysis. Offering precise insight into such a sensitive and complex matter requires meticulous data analysis coupled with ethical responsibility.


Cross-validation

In conducting cross-validation on a dataset comprising 354 data points related to obesity, inactivity, and diabetes, the process involves systematically partitioning the dataset into k = 5 subsets. For each iteration of the cross-validation, a model is trained on k − 1 folds of the data and tested on the remaining fold, allowing a comprehensive evaluation of the model’s performance across different subsets of the dataset. By calculating both training and testing errors for each iteration and averaging these errors, cross-validation provides a more robust and reliable estimate of the model’s effectiveness, avoiding the biases that can arise from a single split of the data.

This iterative process ensures that each data point is used for testing exactly once and for training in the other k − 1 folds, giving a more thorough picture of the model’s ability to generalize to new data. In summary, cross-validation is a valuable tool for assessing model performance, particularly in the context of complex relationships between variables like obesity, inactivity, and diabetes, helping to uncover patterns and ensuring the model’s reliability across diverse subsets of the dataset.
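A sketch of this 5-fold procedure with scikit-learn might look like the following; the file name and column names for the obesity/inactivity/diabetes data are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Load the 354-point dataset (file and column names are assumptions).
cdc = pd.read_csv("cdc_diabetes.csv")
X = cdc[["obesity", "inactivity"]].to_numpy()
y = cdc["diabetes"].to_numpy()

kf = KFold(n_splits=5, shuffle=True, random_state=0)
train_mse, test_mse = [], []

# Each point is tested exactly once and used for training in the other 4 folds.
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    train_mse.append(mean_squared_error(y[train_idx], model.predict(X[train_idx])))
    test_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("average train MSE:", np.mean(train_mse))
print("average test MSE:", np.mean(test_mse))
```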

Average MSE

Comparison of pre-molt and post-molt histograms

In the crab analysis, the results section would start with descriptive statistics such as the minimum, maximum, median, mean, and standard deviation for the post-molt and pre-molt size variables.

Skewness is −1.396 for the post-molt and −1.430 for the pre-molt size variables, which signifies that both distributions are negatively skewed. Kurtosis is 3.295 and 3.354, respectively. A distribution with kurtosis > 3 is called leptokurtic, and it tends to produce more outliers than the normal distribution.

Probability density function (PDF) histograms and smooth histograms would also be included to visually show the distribution of the data.
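A short sketch of how these statistics and plots could be produced with pandas and scipy. The file and column names are assumptions, and `fisher=False` requests the Pearson definition of kurtosis, under which a normal distribution scores 3:

```python
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

# Crab molt data (file and column names are assumptions).
crabs = pd.read_csv("crab_molt.csv")

for col in ["postsize", "presize"]:
    print(col, "min:", crabs[col].min(), "max:", crabs[col].max(),
          "median:", crabs[col].median(), "mean:", crabs[col].mean(),
          "std:", crabs[col].std())
    print(col, "skewness:", stats.skew(crabs[col]),
          "kurtosis:", stats.kurtosis(crabs[col], fisher=False))

    # PDF histogram plus a smooth (kernel density) version of the same curve.
    plt.hist(crabs[col], bins=30, density=True, alpha=0.4, label=f"{col} histogram")
    crabs[col].plot.density(label=f"{col} KDE")

plt.legend()
plt.show()
```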


Statistics of pre- and post-molt data

The descriptive statistics for the pre- and post-molt data are as follows.

For the post-molt data: median = 147.4, mean = 143.8, standard deviation = 14.67, variance = 214.347, skewness = −2.34, and kurtosis = 13.116.

For the pre-molt data: median = 132.8, mean = 129.2, standard deviation = 15.864, variance = 251.68, skewness = −2.003, and kurtosis = 9.766.


Crab molt model

The linear model is designed to predict the size of a crab shell before molting (pre-molt size) from the size of the shell after molting (post-molt size). In a linear model, the relationship between these two variables is assumed to be linear, meaning that changes in post-molt size correspond proportionally to changes in pre-molt size. We then asked whether the difference between the two states is statistically significant. By standard statistical inference, the p-value in this case was less than 0.05, which makes us reject the null hypothesis that there is no real difference.

We also learned about the t-test, a statistical hypothesis test used to determine whether there is a significant difference between the means of two groups or populations. For example, imagine Group A was taught using Method 1 and Group B using Method 2, and you want to determine whether the average test scores of the two groups differ significantly. Ideally the t-test would know the true population variance (a “scaling term” in the math), but it doesn’t, so it has to estimate it from the sample. If that estimate is sound (under some specific conditions, such as approximately normal data), the test can say whether the group means genuinely differ.
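A hedged sketch of both ideas with scipy: the linear model predicting pre-molt from post-molt size, and a t-test on the pre/post difference. The file and column names are assumptions, and the paired form of the t-test is one reasonable choice here since both sizes are measured on the same crabs:

```python
import pandas as pd
from scipy import stats

# Crab molt data (file and column names are assumptions).
crabs = pd.read_csv("crab_molt.csv")
post, pre = crabs["postsize"], crabs["presize"]

# Linear model: predict pre-molt size from post-molt size.
fit = stats.linregress(post, pre)
print(f"pre = {fit.slope:.3f} * post + {fit.intercept:.3f} (R^2 = {fit.rvalue**2:.3f})")

# Paired t-test: do the pre- and post-molt means really differ?
t_stat, p_value = stats.ttest_rel(post, pre)
print("t =", t_stat, "p =", p_value)  # p < 0.05 -> reject "no real difference"
```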

Chi-square distribution

The chi-square distribution is a probability distribution frequently used in statistics for hypothesis testing and constructing confidence intervals. It is characterized by a shape determined by its degrees of freedom (df).

We plotted the chi-square density for 1 and 2 degrees of freedom.
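A minimal sketch of that plot with scipy and matplotlib:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import chi2

# Start slightly above 0 because the df = 1 density diverges at x = 0.
x = np.linspace(0.01, 10, 500)
for df in (1, 2):
    plt.plot(x, chi2.pdf(x, df), label=f"df = {df}")

plt.xlabel("x")
plt.ylabel("Density")
plt.title("Chi-square probability density function")
plt.legend()
plt.show()
```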

September 25

Breusch-Pagan test

The Breusch-Pagan test is a statistical test used to check for heteroscedasticity in a regression analysis. Heteroscedasticity refers to the situation where the variance of the errors (residuals) in a regression model is not constant across all levels of the independent variable. In other words, it assesses whether the spread of the residuals changes systematically with the values of the independent variable.

We got a p-value for the Breusch-Pagan test of 4.15150076269579e-05.

So the p-value is about 0.0000415. This extremely low p-value provides strong evidence against the null hypothesis of the Breusch-Pagan test. In practical terms, it indicates a significant presence of heteroscedasticity: the variability of the residuals is not constant across levels of the independent variable. This finding prompts a closer examination of the regression model and potentially adjustments to account for the observed heteroscedasticity in the data.
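A sketch of how this test can be run with statsmodels on the diabetes-vs-inactivity regression discussed in this analysis; the file and column names are assumptions:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# CDC diabetes/inactivity data (file and column names are assumptions).
cdc = pd.read_csv("cdc_diabetes.csv")
X = sm.add_constant(cdc["inactivity"])
model = sm.OLS(cdc["diabetes"], X).fit()

# The test regresses the squared residuals on the explanatory variables.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan LM p-value:", lm_pvalue)  # very small -> heteroscedasticity
```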

Residuals

Residuals in the context of diabetes and inactivity refer to the differences between the observed values of diabetes prevalence and the values predicted by a linear regression model based on inactivity levels. When conducting a regression analysis, the model estimates the relationship between diabetes and inactivity, and the residuals represent how much the actual data points deviate from these predictions.

We then plotted a residual plot for diabetes and inactivity.

We also plotted a smooth histogram and a Q-Q plot of the residuals.

For the residuals we calculated standard deviation = 0.91, skewness = 0.51, and kurtosis = 4.08.
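Continuing from the statsmodels sketch above, these diagnostics might be produced as follows; using the Pearson kurtosis convention (normal = 3) is an assumption based on the reported value of 4.08:

```python
import matplotlib.pyplot as plt
from scipy import stats

# Residuals from the diabetes-vs-inactivity OLS fit above.
resid = model.resid

# Residual plot: residuals against fitted values should show no clear pattern.
plt.scatter(model.fittedvalues, resid, s=10)
plt.axhline(0, color="black", linewidth=1)
plt.xlabel("Fitted % diabetes")
plt.ylabel("Residual")
plt.show()

# Smooth histogram (kernel density estimate) of the residuals.
resid.plot.density()
plt.show()

# Q-Q plot of the residuals against the normal distribution.
stats.probplot(resid, dist="norm", plot=plt)
plt.show()

# Summary statistics reported above.
print("std:", resid.std(),
      "skewness:", stats.skew(resid),
      "kurtosis:", stats.kurtosis(resid, fisher=False))
```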