Linear regression

In a linear regression analysis of diabetes and inactivity, the goal is to model the relationship between the two variables with a linear equation of the form y = mx + b, where
y is the predicted level of diabetes, x is the level of inactivity, m is the slope (the change in predicted diabetes for a unit change in inactivity), and b is the y-intercept (the predicted level of diabetes when inactivity is zero).

For instance, if the linear regression indicates a positive slope, it suggests that as inactivity increases, the predicted level of diabetes also increases. Conversely, a negative slope would imply that predicted diabetes decreases as inactivity increases.

In simple linear regression, R-squared is the square of the correlation coefficient between the two variables; it measures the proportion of the variance in diabetes explained by inactivity.

We got slope = 0.23, intercept = 3.77, R-squared = 0.1915, p-value = 1.63, and standard error = 0.012.
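A fit like this can be reproduced with scipy's linregress. This is a minimal sketch on simulated data (the slope and intercept used to generate it are borrowed from the results above; the sample itself is hypothetical, not the project dataset):

```python
import numpy as np
from scipy import stats

# Hypothetical sample: % inactivity (x) and % diabetes (y),
# generated with a linear trend plus noise
rng = np.random.default_rng(0)
x = rng.uniform(10, 25, 200)                  # % inactivity
y = 3.77 + 0.23 * x + rng.normal(0, 1, 200)   # % diabetes

res = stats.linregress(x, y)
print(f"slope={res.slope:.3f}, intercept={res.intercept:.3f}")
print(f"R-squared={res.rvalue**2:.4f}, p-value={res.pvalue:.3g}, stderr={res.stderr:.4f}")
```

linregress returns the slope, intercept, correlation (whose square is R-squared), the p-value for the slope, and its standard error in one call.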

September 18

Correlation between diabetes and inactivity

The correlation between diabetes and inactivity refers to the statistical relationship or association between these two variables. When we say there is a positive correlation, it means that as levels of inactivity increase, the likelihood or prevalence of diabetes also tends to increase. Conversely, a negative correlation would imply that as levels of inactivity increase, the prevalence of diabetes decreases.

We got a correlation coefficient between diabetes and inactivity of 0.441706.
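The Pearson correlation can be computed directly with scipy. A small sketch on simulated paired data (the generating numbers are assumptions for illustration, chosen so the correlation lands near the value reported above):

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations of % inactivity and % diabetes
rng = np.random.default_rng(1)
inactivity = rng.normal(16.5, 1.9, 300)
diabetes = 0.23 * inactivity + rng.normal(3.8, 0.9, 300)

# Pearson correlation coefficient and its p-value
r, p = stats.pearsonr(diabetes, inactivity)
print(f"Pearson r = {r:.4f} (p = {p:.3g})")
```

Note that squaring this r for a simple regression of diabetes on inactivity gives the R-squared of that fit.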

Q-Q plot

In the project we got 1370 data points that have both % inactivity and % diabetes values, and we performed the statistical analysis on them. A Q-Q plot compares the quantiles of the sample against the quantiles of a normal distribution, so it shows how closely the data follow a normal shape.
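The Q-Q comparison can be sketched with scipy's probplot, which computes the quantile pairs and the best-fit line without needing to draw anything (the sample here is simulated, using roughly the % diabetes mean and SD reported below):

```python
import numpy as np
from scipy import stats

# Hypothetical % diabetes sample of the same size as the project data
rng = np.random.default_rng(2)
pct_diabetes = rng.normal(7.6, 1.0, 1370)

# probplot returns the ordered quantile pairs and the fitted line;
# passing plot=plt.gca() instead would draw the usual Q-Q plot
(osm, osr), (slope, intercept, r) = stats.probplot(pct_diabetes, dist="norm")
print(f"Q-Q line: slope={slope:.3f}, intercept={intercept:.3f}, r={r:.4f}")
```

An r close to 1 means the points hug the reference line, i.e. the data are close to normal; curvature at the ends signals skewness or heavy tails.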

Smooth histogram for % diabetes

We have 1370 diabetes data points for which we also have % inactivity data, and we estimated the mean, median, standard deviation, skewness, and kurtosis for each column.

We got median = 7.45, mean = 7.62, standard deviation = 1.01, skewness = 0.658, and kurtosis = 4.1306.

Smooth histogram for %inactivity

We got median = 16.7, mean = 16.5, standard deviation = 1.92, skewness = -0.342, and kurtosis = 2.450.
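These summary statistics can be computed in a few lines with numpy and scipy. A sketch on a simulated % inactivity sample (hypothetical data, generated from the mean and SD above); since the kurtosis values reported here are near 3 for roughly normal data, they appear to use the Pearson convention, which `fisher=False` reproduces:

```python
import numpy as np
from scipy import stats

# Hypothetical % inactivity sample; run the same block on % diabetes
rng = np.random.default_rng(3)
pct_inactivity = rng.normal(16.5, 1.92, 1370)

summary = {
    "median": np.median(pct_inactivity),
    "mean": np.mean(pct_inactivity),
    "std": np.std(pct_inactivity, ddof=1),   # sample standard deviation
    "skewness": stats.skew(pct_inactivity),
    # fisher=False gives Pearson kurtosis, where a normal distribution = 3
    "kurtosis": stats.kurtosis(pct_inactivity, fisher=False),
}
print(summary)
```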

 

September 13

In today's class we discussed the Breusch-Pagan test for heteroscedasticity and the p-value.

p-value: A statistical measure called the p-value helps researchers judge how strong the evidence is against a study's null hypothesis. It measures the likelihood of observing the data, or more extreme outcomes, if the null hypothesis were true. A low p-value, usually less than 0.05, indicates that the observed data would be unlikely under the null hypothesis, which leads to the null being rejected in favor of the alternative. On the other hand, a high p-value suggests that there is not enough evidence to reject the null hypothesis, since the observed data are plausible if the null hypothesis is true. In hypothesis testing, the p-value essentially helps in assessing the significance of results and formulating conclusions.
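A toy illustration of the reject/fail-to-reject logic, using a one-sample t-test on simulated data whose true mean genuinely differs from the null value (the numbers are assumptions for demonstration only):

```python
import numpy as np
from scipy import stats

# Simulated sample with true mean 1.0; we test H0: mean = 0
rng = np.random.default_rng(4)
sample = rng.normal(1.0, 1.0, 50)

t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
reject = p_value < 0.05   # the usual significance threshold
print(f"t = {t_stat:.2f}, p = {p_value:.4g}, reject H0: {reject}")
```

Because the sample was generated far from the null mean, the p-value comes out small and H0 is rejected; rerunning with a true mean of 0 would typically give a large p-value instead.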

September 11

I have analyzed the given CDC Diabetes 2018 dataset, which consists of the variables % obesity, % inactivity, and % diabetes. I eliminated the rows with null values and got a new dataset, with which we can build a model to predict % diabetes using both % inactivity and % obesity as factors.
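A minimal pandas sketch of that null-filtering step. The column names and the tiny inline table are hypothetical stand-ins; the actual CDC file may label its columns differently:

```python
import pandas as pd

# Hypothetical stand-in for the CDC 2018 data (column names assumed)
df = pd.DataFrame({
    "% DIABETIC": [7.5, 8.1, None, 6.9],
    "% INACTIVE": [16.2, None, 15.0, 17.3],
    "% OBESE":    [30.1, 28.4, 31.0, None],
})

# Keep only the rows that have values for all three variables
clean = df.dropna(subset=["% DIABETIC", "% INACTIVE", "% OBESE"])
print(len(clean))  # rows usable for the two-predictor model
```

Only rows complete in all three columns survive, which is what the two-predictor regression needs.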

I have learnt about

Skewness: A statistical metric known as skewness assesses how asymmetrical a probability distribution is, which makes it easier to find out whether a dataset is more concentrated on one side of the mean than the other. Positive skewness indicates a longer right tail, and negative skewness a longer left tail. Perfect symmetry is indicated by a skewness value of zero.

Kurtosis: This statistical measure determines the "tailedness" of a probability distribution, indicating whether the data show heavy or light tails in comparison to a normal distribution. A high peak and heavy tails are indicated by positive excess kurtosis (leptokurtic), whereas a flat peak and light tails are indicated by negative excess kurtosis (platykurtic). An excess kurtosis of zero matches the normal distribution (mesokurtic); note that some software instead reports Pearson kurtosis, where the normal distribution scores 3 rather than 0.

Heteroscedasticity: Non-constant variance in the residuals is referred to as heteroscedasticity in regression analysis. To put it another way, it shows that the dispersion of the errors differs across levels of the independent variable. Since heteroscedasticity violates the assumption of constant variance, detecting and treating it is essential to maintaining the reliability of regression models. Robust model interpretation and accurate statistical analysis depend on this understanding.
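The Breusch-Pagan test mentioned above can be sketched in plain NumPy: fit OLS, regress the squared residuals on the predictor, and use LM = n x R-squared of that auxiliary regression, which is approximately chi-square with 1 degree of freedom under homoscedasticity. The data here are simulated with noise that grows with x, so the test should flag heteroscedasticity (all numbers are assumptions for the demo):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
n = 500
x = rng.uniform(10, 25, n)
y = 3.8 + 0.23 * x + rng.normal(0, 0.1 * x, n)  # error SD grows with x

# Ordinary least squares fit of y on x
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Auxiliary regression: squared residuals on the same predictors
e2 = resid ** 2
gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)
fitted = X @ gamma
r2_aux = 1 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)

lm_stat = n * r2_aux                 # Breusch-Pagan LM statistic
p_bp = chi2.sf(lm_stat, df=1)        # compare against chi-square(1)
print(f"LM = {lm_stat:.2f}, p = {p_bp:.3g}")
```

A small p here means the squared residuals are predictable from x, i.e. the constant-variance assumption fails; in practice statsmodels offers a ready-made version of this test.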