Crab molt model

The linear model is designed to predict the size of a crab shell before molting (pre-molt size) from the size of the shell after molting (post-molt size). A linear model assumes that the relationship between these two variables is a straight line, so a given change in post-molt size corresponds to a proportional change in the predicted pre-molt size.

We tried to understand whether the difference between the two states is statistically significant. Using standard statistical inference, the p-value in this case was less than 0.05, which makes us reject the null hypothesis that there is no real difference.

We also got to know the t-test, a statistical hypothesis test used to determine whether there is a significant difference between the means of two groups or populations. For example, imagine you have two groups of students: Group A, taught using Method 1, and Group B, taught using Method 2. You want to determine whether there is a significant difference in the average test scores of the two groups. The t-test wishes it knew the value of a special secret ingredient in the math (let's call it the “scaling term”), but it doesn't, so it has to estimate it from the numbers. If the estimate behaves well (under some specific conditions), the test can use its rulebook to say whether the groups are different or not.
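
The two-group comparison above can be sketched in a few lines of Python. This is only an illustration: the scores are made up, the function name welch_t is hypothetical, and the sketch uses the Welch form of the t-test, which the notes do not specify. It reads the “scaling term” as the standard error of the difference in means, which is an interpretation and not something stated in class.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # Sample variances (n - 1 denominator) stand in for the unknown population variances.
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    # Estimated standard error of the difference in means (the "scaling term").
    se = math.sqrt(var_a / n_a + var_b / n_b)
    t = (mean_a - mean_b) / se
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (var_a / n_a + var_b / n_b) ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df

# Hypothetical test scores for the two teaching methods.
method_1 = [72, 85, 78, 90, 66, 81]
method_2 = [64, 70, 75, 68, 59, 73]
t_stat, dof = welch_t(method_1, method_2)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

The t statistic is then compared with a t distribution with the computed degrees of freedom to obtain a p-value.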

Chi-square distribution

The chi-square distribution is a probability distribution frequently used in statistics for hypothesis testing and for constructing confidence intervals. Its shape is determined by its degrees of freedom (df).

We plotted the distribution for 1 and 2 degrees of freedom.
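
As a rough sketch of what those two curves look like, the density can be evaluated directly from its formula. The function name chi2_pdf and the x values are chosen here for illustration only.

```python
import math

def chi2_pdf(x, df):
    """Density of the chi-square distribution with `df` degrees of freedom.

    Only x > 0 is handled; 0.0 is returned for x <= 0 to keep the sketch simple.
    """
    if x <= 0:
        return 0.0
    half = df / 2
    return x ** (half - 1) * math.exp(-x / 2) / (2 ** half * math.gamma(half))

# Tabulate the two curves at a few points instead of drawing them.
for x in (0.5, 1.0, 2.0, 4.0):
    print(f"x = {x}: df=1 -> {chi2_pdf(x, 1):.4f}, df=2 -> {chi2_pdf(x, 2):.4f}")
```

With 1 degree of freedom the density rises steeply as x approaches 0, while with 2 degrees of freedom it is a plain exponential decay.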

September 25

Breusch-Pagan test

The Breusch-Pagan test is a statistical test used to check for heteroscedasticity in a regression analysis. Heteroscedasticity refers to the situation where the variance of the errors (residuals) in a regression model is not constant across all levels of the independent variable. In other words, it assesses whether the spread of the residuals changes systematically with the values of the independent variable.

We got a p-value of 4.15150076269579e-05 for the Breusch-Pagan test.

So the p-value is about 0.0000415. This extremely low p-value is strong evidence against the null hypothesis of the Breusch-Pagan test, which is that the error variance is constant (homoscedasticity). In practical terms, it indicates a significant presence of heteroscedasticity: the variability of the residuals is not constant across different levels of the independent variable. This finding calls for a closer look at the regression model, and possibly for adjustments that account for the heteroscedasticity observed in the data.
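
For a single predictor, the mechanics of the test can be sketched as follows. This is a minimal version under stated assumptions: it uses the studentized (Koenker) form of the statistic, LM = n times the R-squared of a regression of the squared residuals on x, and the data are hypothetical. The notes do not say which software or variant produced the p-value above, so this is not a reproduction of that result.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(x, y):
    """Share of the variance of y explained by a straight line in x."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 0.0 if ss_tot == 0 else 1 - ss_res / ss_tot

def breusch_pagan(x, y):
    """Return (LM statistic, p-value) for one predictor (studentized form)."""
    slope, intercept = fit_line(x, y)
    squared_residuals = [(yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression: do the squared residuals depend on x?
    lm = max(len(x) * r_squared(x, squared_residuals), 0.0)
    # Chi-square tail probability with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value

# Hypothetical data whose scatter grows with x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.8, 6.5, 7.4, 11.0, 10.5, 16.0, 13.5, 21.0, 16.5]
lm, p_value = breusch_pagan(x, y)
print(f"LM = {lm:.3f}, p-value = {p_value:.4f}")
```

A small p-value from this function would be read the same way as above: the squared residuals vary systematically with x, so constant variance is doubtful.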

Residuals

Residuals in the context of diabetes and inactivity refer to the differences between the observed values of diabetes prevalence and the values predicted by a linear regression model based on inactivity levels. When conducting a regression analysis, the model estimates the relationship between diabetes and inactivity, and the residuals represent how much the actual data points deviate from these predictions.
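
A minimal sketch of that calculation, using the slope and intercept reported in the linear regression section of these notes. The three data points are hypothetical and only show the arithmetic.

```python
# Fitted values reported in the linear regression section of these notes.
SLOPE, INTERCEPT = 0.23, 3.77

def residuals(inactivity, diabetes, slope=SLOPE, intercept=INTERCEPT):
    """Observed % diabetes minus the value predicted from % inactivity."""
    return [d - (slope * i + intercept) for i, d in zip(inactivity, diabetes)]

# Hypothetical (% inactivity, % diabetes) pairs, not taken from the CDC data.
inactivity = [14.0, 16.5, 18.0]
diabetes = [6.8, 7.9, 7.5]
for i, r in zip(inactivity, residuals(inactivity, diabetes)):
    print(f"inactivity = {i}: residual = {r:+.3f}")
```

A positive residual means the observed value lies above the fitted line, and a negative one means it lies below.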

We drew a residual plot for diabetes and inactivity.

We also plotted a smooth histogram and a Q-Q plot of the residuals.
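
The points of a normal Q-Q plot can be computed without any plotting library. The sketch below assumes the plotting positions (i - 0.5) / n, which is one common convention and not necessarily the one used for the plot in class. The residual values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each sorted, standardized value with a standard normal quantile."""
    n = len(values)
    m, s = mean(values), stdev(values)
    ordered = sorted((v - m) / s for v in values)
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical residuals.
sample = [-1.2, -0.4, -0.1, 0.0, 0.3, 0.5, 2.1]
for theoretical_q, sample_q in qq_points(sample):
    print(f"{theoretical_q:+.3f}  {sample_q:+.3f}")
```

If the residuals were close to normal, the two columns would be nearly equal, which on the plot shows up as points lying along a straight line.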

For the residuals we calculated standard deviation = 0.91, skewness = 0.51 and kurtosis = 4.08.

Linear regression

In a linear regression analysis of diabetes and inactivity, the goal is to model the relationship between these two variables using a linear equation. The equation takes the form y = mx + b, where y represents the predicted value of diabetes, x is the level of inactivity, m is the slope (the change in predicted diabetes for a unit change in inactivity), and b is the y-intercept (the predicted level of diabetes when inactivity is zero).

For instance, if the linear regression gives a positive slope, the predicted level of diabetes increases as inactivity increases. Conversely, a negative slope implies that predicted diabetes decreases as inactivity increases.

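R-squared is the square of the correlation coefficient between the two variables. It measures the share of the variance in diabetes that is explained by inactivity.

We got slope = 0.23, intercept = 3.77, R-squared = 0.1915, p-value = 1.63 and standard error = 0.012. (The p-value is given as recorded. Since a p-value cannot exceed 1, the exponent of a value in scientific notation appears to be missing.)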
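
How these numbers come out of the data can be sketched with the usual least-squares formulas. The function name linear_fit and the five data points are hypothetical, so the output will not match the values above.

```python
def linear_fit(x, y):
    """Least-squares slope, intercept and R-squared for y = m * x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: 1 minus (unexplained variation / total variation).
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical (% inactivity, % diabetes) values.
pct_inactivity = [13.5, 15.0, 16.2, 17.4, 19.1]
pct_diabetes = [6.9, 7.1, 7.8, 7.4, 8.3]
m, b, r2 = linear_fit(pct_inactivity, pct_diabetes)
print(f"slope = {m:.3f}, intercept = {b:.3f}, R-squared = {r2:.3f}")
```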

September 18

Correlation between diabetes and inactivity

The correlation between diabetes and inactivity refers to the statistical relationship or association between these two variables. A positive correlation means that as levels of inactivity increase, the prevalence of diabetes also tends to increase. Conversely, a negative correlation would mean that as levels of inactivity increase, the prevalence of diabetes tends to decrease.

We got a correlation coefficient between diabetes and inactivity of 0.441706.
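
The coefficient is the Pearson correlation, which can be sketched as below. The data points are hypothetical and pearson_r is a name chosen for this example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (% inactivity, % diabetes) values.
pct_inactivity = [13.5, 15.0, 16.2, 17.4, 19.1]
pct_diabetes = [6.9, 7.1, 7.8, 7.4, 8.3]
print(f"r = {pearson_r(pct_inactivity, pct_diabetes):.4f}")

# Squaring the coefficient reported in these notes.
print(f"0.441706 squared = {0.441706 ** 2:.4f}")
```

Squaring the reported coefficient gives roughly 0.195, which is close to the R-squared of 0.1915 recorded in the September 25 entry.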

Q-Q plot

In the project we had 1370 data points for % inactivity and % diabetes, on which we carried out the statistical analysis.

Smooth histogram for % diabetes

For the 1370 diabetes data points that also have % inactivity data, we estimated the mean, median, standard deviation, skewness, and kurtosis of each column.

For % diabetes we got median = 7.45, mean = 7.62, standard deviation = 1.01, skewness = 0.658 and kurtosis = 4.1306.

Smooth histogram for %inactivity

For % inactivity we got median = 16.7, mean = 16.5, standard deviation = 1.92, skewness = -0.342 and kurtosis = 2.450.

 

September 13

In today's class we discussed the Breusch-Pagan test for heteroscedasticity and the p-value.

p-value: A statistical measure that helps researchers judge how strong the evidence is against a study's null hypothesis. It is the probability of observing data at least as extreme as the data actually obtained, assuming the null hypothesis is true. A low p-value, usually less than 0.05, indicates that the observed data would be unlikely if the null hypothesis were true, so the null hypothesis is rejected in favor of the alternative. A high p-value, on the other hand, means there is not enough evidence to reject the null hypothesis, since the observed data could plausibly occur when it is true. In hypothesis testing, the p-value helps in assessing the significance of results and in drawing conclusions.
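
A small worked example may make the definition concrete. It uses a coin-flip setting that was not part of the class discussion: the null hypothesis is that the coin is fair, and the p-value is the probability of a result at least as far from half heads as the one observed. The function name is hypothetical.

```python
from math import comb

def two_sided_binomial_p(heads, flips):
    """Exact two-sided p-value for the null hypothesis of a fair coin."""
    observed_gap = abs(heads - flips / 2)
    # Count every outcome at least as far from flips / 2 as the observed one.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_gap
    )
    return extreme / 2 ** flips

print(two_sided_binomial_p(9, 10))  # 22 of the 1024 equally likely outcomes
print(two_sided_binomial_p(6, 10))
```

Nine heads in ten flips gives a p-value of 22/1024, about 0.021, which is below 0.05, so the fair-coin hypothesis would be rejected at that level. Six heads in ten flips gives a much larger p-value and no grounds for rejection.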

September 11

I analyzed the CDC Diabetes 2018 data set, which consists of the variables % obesity, % inactivity and % diabetes. I removed the rows with missing (null) values and got a new data set. With that data set we can build a model that predicts % diabetes using both % inactivity and % obesity as factors.
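
The cleaning step can be sketched as follows, assuming the data have been read into a list of dictionaries. The rows and column names below are placeholders and not the actual headers or values of the CDC file.

```python
# Hypothetical rows; the column names are placeholders, not the CDC file's headers.
rows = [
    {"county": "A", "pct_diabetes": 7.4, "pct_inactivity": 16.2, "pct_obesity": 18.1},
    {"county": "B", "pct_diabetes": 8.1, "pct_inactivity": None, "pct_obesity": 17.5},
    {"county": "C", "pct_diabetes": 6.9, "pct_inactivity": 15.0, "pct_obesity": None},
    {"county": "D", "pct_diabetes": 9.0, "pct_inactivity": 17.8, "pct_obesity": 19.3},
]
REQUIRED = ("pct_diabetes", "pct_inactivity", "pct_obesity")

def drop_incomplete(rows, required=REQUIRED):
    """Keep only the rows that have a value for every required column."""
    return [row for row in rows if all(row.get(col) is not None for col in required)]

complete = drop_incomplete(rows)
print([row["county"] for row in complete])  # ['A', 'D']
```

Only rows with all three percentages survive, which is what a model with two predictors needs.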

I have learnt about

Skewness: A statistical measure of how asymmetrical a probability distribution is. It shows whether a data set is more concentrated on one side of the mean than the other. Positive skewness indicates a longer right (upper) tail, and negative skewness a longer left (lower) tail. A skewness of zero indicates perfect symmetry.

Kurtosis: This statistical measure describes the “tailedness” of a probability distribution, indicating whether the data have heavy or light tails in comparison to a normal distribution. Measured as excess kurtosis (kurtosis minus 3), a positive value (leptokurtic) indicates a high peak and heavy tails, whereas a negative value (platykurtic) indicates a flat peak and light tails. A normal distribution has an excess kurtosis of zero, which is the same as a kurtosis of 3 before the subtraction.

Heteroscedasticity: In regression analysis, heteroscedasticity means non-constant variance in the residuals. To put it another way, the spread of the errors differs between levels of the independent variable. Since heteroscedasticity violates the assumption of constant variance (homoscedasticity), detecting and treating it is essential to keeping regression models reliable. Sound model interpretation and accurate statistical analysis depend on this understanding.
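
The skewness and kurtosis definitions above can be sketched with moment-based formulas. These are the population versions, and the kurtosis is left without the subtraction of 3, so a normal distribution would give about 3. Other conventions exist, and the notes do not say which one produced the reported values. The data are hypothetical.

```python
def central_moment(values, k):
    """Average of (value - mean) ** k over the data."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** k for v in values) / n

def skewness(values):
    """Third central moment scaled by the standard deviation cubed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment scaled by the variance squared (normal gives 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

# Hypothetical percentages with one large value pulling the right tail out.
data = [6.1, 6.8, 7.0, 7.2, 7.5, 7.9, 11.4]
print(f"skewness = {skewness(data):.3f}, kurtosis = {kurtosis(data):.3f}")
print(f"excess kurtosis = {kurtosis(data) - 3:.3f}")
```

For this sample the skewness is positive, matching the longer right tail.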