November 13

In today's class we discussed Principal Component Analysis and decision trees.

Principal Component Analysis (PCA) is a statistical method used for simplifying complex data sets. It aims to reduce the number of variables while retaining the key information present in the data.

PCA works by transforming a set of correlated variables into a smaller set of uncorrelated variables known as principal components. These components are computed so that the first principal component captures the most variance in the data, with each subsequent component capturing progressively less.

The primary goal of PCA is to find patterns and structures within the data, allowing for easier interpretation and analysis. It’s commonly used for dimensionality reduction, data visualization, and identifying the most critical factors influencing the data.
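
As a quick sketch of how this looks in practice (the toy dataset and the choice of two components below are my own assumptions, not something from class), scikit-learn's PCA can project correlated features onto a smaller set of components:

```python
# A minimal PCA sketch using scikit-learn on a small synthetic dataset.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # 100 samples, 5 features
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]    # inject correlation between two columns

pca = PCA(n_components=2)                  # keep the two directions of largest variance
X_reduced = pca.fit_transform(X)           # project onto the principal components

print(X_reduced.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)       # share of variance captured by each component
```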

Decision tree

A decision tree is a predictive modeling tool used in machine learning and data analysis. It is a tree-like structure that breaks down a dataset into smaller and more manageable subsets while recursively making decisions based on input features. The decision tree consists of nodes, where each node represents a feature or attribute, branches that represent the decision rules, and leaves that represent the outcomes or predictions.

The construction of a decision tree involves recursively splitting the dataset based on the most significant feature at each node. The goal is to create homogeneous subsets, meaning that the data within each subset is more similar in terms of the target variable (the variable to be predicted). Decision trees are often used for both classification and regression tasks.

Decision trees are advantageous because they are easy to understand, interpret, and visualize. They mimic human decision-making processes and are capable of handling both numerical and categorical data. However, they can be prone to overfitting, where the model performs well on the training data but poorly on new, unseen data. Techniques like pruning and setting constraints on tree depth help mitigate this issue. Popular algorithms for building decision trees include ID3 (Iterative Dichotomiser 3), C4.5, CART (Classification and Regression Trees), and Random Forests, which use an ensemble of decision trees for improved accuracy.
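
As a rough illustration (the Iris data and the max_depth=3 constraint are choices I made for the example), scikit-learn's CART-based classifier shows both the recursive splitting and one simple way to limit overfitting:

```python
# A minimal decision-tree sketch with scikit-learn's CART implementation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Limiting tree depth is one simple way to reduce overfitting.
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print(export_text(clf, feature_names=["sepal len", "sepal wid", "petal len", "petal wid"]))
```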

November 3

Linear Discriminant Analysis (LDA) is a statistical method used for dimensionality reduction and feature extraction, with applications in pattern recognition and classification. LDA finds the linear combination of features that best separates the classes or groups in a dataset.

Fundamentally, the goal of LDA is to maximize the ratio of between-class variance to within-class variance. Put another way, it looks for a projection of the data into a lower-dimensional space that maximizes the variance across classes while minimizing the variance within them. The result is a set of transformed features that emphasize class separability.
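
A minimal sketch of this in scikit-learn (the Iris data and the two-component choice are my own assumptions; with three classes, LDA allows at most two components):

```python
# LDA projects labeled data onto directions that maximize class separation.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)  # at most (n_classes - 1) components
X_lda = lda.fit_transform(X, y)                   # LDA is supervised, so y is required

print(X_lda.shape)                    # (150, 2)
print(lda.explained_variance_ratio_)  # class-separating variance per component
```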

Bayes' Theorem

Bayes' theorem is a foundational concept in probability theory that allows for the updating of probability estimates based on new evidence. Mathematically expressed as P(A|B) = P(B|A) · P(A) / P(B), the theorem is employed in Bayesian statistics to systematically combine prior probabilities with observed data, resulting in updated or posterior probabilities. It plays a crucial role in fields such as medical diagnostics, where it facilitates the adjustment of the probability of a condition given new test results. Bayes' theorem provides a powerful framework for reasoning about uncertainty and updating beliefs in light of fresh information.
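
A small worked example along the diagnostic-test lines mentioned above (the prevalence, sensitivity, and false-positive rate are made-up numbers for illustration):

```python
# Bayes' theorem for a diagnostic test:
# P(disease | positive) = P(positive | disease) * P(disease) / P(positive).
p_disease = 0.01              # prior: assumed 1% prevalence
p_pos_given_disease = 0.95    # assumed sensitivity
p_pos_given_healthy = 0.05    # assumed false-positive rate

# Total probability of a positive test (law of total probability).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior probability of disease given a positive result.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.161, despite the 95% sensitivity
```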

Bayes' theorem and the null hypothesis

Bayes' theorem is a foundational concept in probability theory, particularly in Bayesian statistics, where it facilitates the updating of probabilities based on new evidence. However, in classical hypothesis testing, null hypotheses (H0) are typically formulated independently of Bayes' theorem. The null hypothesis asserts no effect or no difference between groups and is central to frequentist statistical methods, relying on p-values and significance testing to make decisions about the observed data. While Bayes' theorem plays a crucial role in Bayesian statistics, the conventional framing of null hypotheses in frequentist statistics follows a different paradigm, emphasizing hypothesis testing within a set framework of assumptions and procedures.

t-test and ANOVA

A t-test is a statistical method used to compare the means of two groups to assess whether there is a significant difference between them. It’s commonly employed when dealing with small sample sizes and assumes that the data follows a normal distribution. The t-test generates a t-statistic, and a low p-value associated with this statistic suggests that the means of the two groups are significantly different. There are different types of t-tests, such as the independent samples t-test (for comparing means of two independent groups) and the paired samples t-test (for comparing means of two related groups).

ANOVA, on the other hand, is used when comparing means among three or more groups. It assesses whether there are statistically significant differences in the means of the groups by analyzing the variance within and between groups. ANOVA is applicable when dealing with multiple independent groups or multiple levels of a categorical variable. The F-statistic generated by ANOVA is used to determine whether the group means are significantly different. If ANOVA indicates significance, post-hoc tests may be employed to identify which specific groups differ.
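
A minimal sketch of both tests with SciPy (the synthetic groups, their means, and the sample sizes are arbitrary assumptions, just to show the calls):

```python
# Independent-samples t-test and one-way ANOVA on synthetic groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)
group_c = rng.normal(loc=12.0, scale=2.0, size=30)

# t-test: are the means of two independent groups different?
t_stat, t_p = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {t_p:.4f}")

# One-way ANOVA: are the means of three (or more) groups different?
f_stat, f_p = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {f_p:.4f}")
```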

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups data points according to their density in the feature space. Core points are those that have at least a given number of neighboring points within a given distance (ε). Points that lie within ε of a core point but do not themselves meet the density criterion are designated as border points. Clusters are created as the algorithm iteratively grows from core points to border points within a density-connected region. Points assigned to no cluster are labeled as noise. Since the number of clusters is not fixed in advance, DBSCAN can find clusters of different sizes and shapes.

Clusters labeled 4 and 2 in the DBSCAN output would result from the algorithm's exploration of the data's density distribution and the spatial relationships between points, with each cluster characterized by its own density-connected region.
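
A minimal sketch with scikit-learn's DBSCAN (the two-moons toy data and the eps/min_samples values are illustrative choices that would normally be tuned to the data):

```python
# Density-based clustering with DBSCAN; label -1 marks noise points.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", int(np.sum(labels == -1)))
```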

K-means and K-medoids

In today's class we discussed k-means and k-medoids.

K-means : K-Means is a popular clustering algorithm used in unsupervised machine learning. The algorithm aims to partition a dataset into K clusters, where each data point belongs to the cluster with the nearest mean. The number of clusters, K, is a parameter that needs to be specified before running the algorithm. The algorithm iteratively assigns data points to clusters based on their distances to cluster centroids and updates the centroids until convergence. K-Means is sensitive to the initial placement of centroids, and different initializations can lead to different final cluster assignments.

K-medoids : K-Medoids is a variant of K-Means that, instead of using cluster centroids, chooses actual data points within the clusters as representatives or medoids. Medoids are more robust to outliers than centroids, making K-Medoids less sensitive to extreme values. The algorithm iteratively refines cluster assignments by selecting the data point that minimizes the sum of dissimilarities (often measured using a distance metric like Euclidean distance) to other points in the same cluster. K-Medoids is particularly useful in situations where the mean may not be a representative measure (e.g., in the presence of outliers) or when dealing with non-numeric data.

In summary, both K-Means and K-Medoids are clustering algorithms used to group similar data points together. K-Means uses cluster centroids, while K-Medoids uses actual data points as representatives, making it more robust to outliers. The choice between them depends on the nature of the data and the desired characteristics of the clusters.
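
A rough sketch of both algorithms on toy blob data (K-Means comes from scikit-learn; the K-Medoids part assumes the optional scikit-learn-extra package is installed, and the data and K=3 are my own choices):

```python
# K-Means vs. K-Medoids on synthetic blob data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-Means: cluster centers are the means of the assigned points.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("K-Means centers:\n", km.cluster_centers_)

# K-Medoids: cluster centers are actual data points (more robust to outliers).
try:
    from sklearn_extra.cluster import KMedoids  # assumes scikit-learn-extra is available
    kmed = KMedoids(n_clusters=3, random_state=0).fit(X)
    print("K-Medoids medoids:\n", kmed.cluster_centers_)
except ImportError:
    print("scikit-learn-extra not installed; skipping the K-Medoids example")
```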


October 18

Using a dataset from the Washington Post, we talked in class today about the age distribution of those killed by police. According to the descriptive statistics, the data, which includes the age of every individual, has a slightly skewed distribution. Comparing the distribution to a normal distribution and counting the people who lie within one standard deviation of the mean (-1 to +1), we find that about 67% of the individuals shot by police fall between the ages of 24 and 50.

We then follow the same procedure separately for White and Black individuals. We find that 68% of the Black individuals who were shot were between the ages of 21 and 44, while for White individuals the range is 27 to 53. Next, we examine whether there is a difference in the average age of Black and White individuals shot by police, and if so, by how much. Using the Monte Carlo approach, we determine that the difference is about 7 years and that this disparity is not the result of chance.

We also use Cohen's d to measure the size of this difference across the entire dataset. The result, 0.577, falls into the medium range under the Cohen-Sawilowsky effect-size categories.
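
A sketch of the Monte Carlo (permutation) test and the Cohen's d calculation; the two age samples below are made-up stand-ins, not the Washington Post data we actually used in class:

```python
# Permutation test for a difference in mean age, plus Cohen's d.
import numpy as np

rng = np.random.default_rng(0)
ages_black = rng.normal(loc=32, scale=11, size=500)  # assumed stand-in sample
ages_white = rng.normal(loc=40, scale=13, size=500)  # assumed stand-in sample

observed_diff = ages_white.mean() - ages_black.mean()

# Shuffle the pooled ages many times and count how often a difference
# at least as large as the observed one arises by chance.
pooled = np.concatenate([ages_black, ages_white])
n = len(ages_black)
count = 0
for _ in range(10_000):
    rng.shuffle(pooled)
    diff = pooled[n:].mean() - pooled[:n].mean()
    if abs(diff) >= abs(observed_diff):
        count += 1
p_value = count / 10_000

# Cohen's d with a pooled standard deviation.
pooled_std = np.sqrt((ages_black.var(ddof=1) + ages_white.var(ddof=1)) / 2)
cohens_d = observed_diff / pooled_std

print(f"difference: {observed_diff:.2f} years, p = {p_value:.4f}, d = {cohens_d:.3f}")
```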

October 16

We covered geocoding and geospatial analysis in today's lecture, which introduced the robust Python package GeoPy. By building several kinds of plots, including Geohist, the professor demonstrated how to map and visualize data points on a map of the United States using GeoPy. We then used GeoPy to determine the locations of police shootings, demonstrating the practical application of geospatial data analysis to real-world scenarios, and extended the analysis to California.
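
A minimal sketch of GeoPy's Nominatim geocoder turning a place name into coordinates; the user_agent string and the example address are placeholders I chose, and the call needs network access to the OpenStreetMap service:

```python
# Geocode a place name to latitude/longitude with GeoPy.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="class-notes-example")  # placeholder app name
location = geolocator.geocode("Sacramento, California")   # placeholder address

if location is not None:
    print(location.latitude, location.longitude)
```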

Additionally, a density-based clustering method called DBSCAN was presented. Integrating GeoPy with DBSCAN, the professor demonstrated that there are four distinct clusters within the dataset. This method can be quite helpful for locating hidden spatial patterns in data.
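
For the geospatial case, here is a rough sketch of clustering latitude/longitude points with DBSCAN and the haversine metric; the sample coordinates, the roughly 50 km eps, and min_samples=2 are my own assumptions, not the exact settings used in class:

```python
# DBSCAN on geographic coordinates: haversine expects lat/lon in radians,
# and eps is a distance on the unit sphere (here ~50 km / Earth radius ~6371 km).
import numpy as np
from sklearn.cluster import DBSCAN

coords_deg = np.array([
    [34.05, -118.24], [34.10, -118.30], [34.00, -118.20],  # around Los Angeles
    [37.77, -122.42], [37.80, -122.41],                    # around San Francisco
])
coords_rad = np.radians(coords_deg)

eps_km = 50.0
db = DBSCAN(eps=eps_km / 6371.0, min_samples=2,
            metric="haversine", algorithm="ball_tree").fit(coords_rad)
print(db.labels_)  # points in the same metro area share a cluster label
```
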
In addition, the class had an interesting conversation regarding the connection between crime rates and police shootings. We investigated the fascinating idea that, even in places with lower total crime rates, crime intensity might play a significant role. This raised a number of issues and suggested possible lines of inquiry for additional research.
In conclusion, today's class gave a thorough overview of geospatial analysis with GeoPy, presented a useful clustering technique in DBSCAN, and started a stimulating investigation into the relationships between crime and police shootings.