The code below estimates a logistic regression model using the glm (generalized linear model) function. First, we convert rank to a factor to indicate that rank should be treated as a categorical variable.
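As a minimal sketch, assuming the admissions data live in a data frame named mydata with columns admit, gre, gpa, and rank (the data-frame and outcome names are assumptions here, not shown in the excerpt above):

## treat rank as a categorical variable
mydata$rank <- factor(mydata$rank)

## fit the logistic regression; family = binomial requests a logit link
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")

## coefficients, standard errors, and p-values
summary(mylogit)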
Now we can say that for a one-unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 2.23. For more information on interpreting odds ratios, see our FAQ page How do I interpret odds ratios in logistic regression?. Note that while R produces it, the odds ratio for the intercept is not generally interpreted.
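A short sketch of how such odds ratios can be pulled from the fitted model (assuming it is stored as mylogit, as above); exponentiating the coefficients converts them from the log-odds scale to odds ratios:

## odds ratios: exponentiate the log-odds coefficients
exp(coef(mylogit))

## odds ratios together with 95% profile-likelihood confidence intervals
exp(cbind(OR = coef(mylogit), confint(mylogit)))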
These objects must have the same names as the variables in your logistic regression above (e.g. in this example the mean for gre must be named gre). Now that we have the data frame we want to use to calculate the predicted probabilities, we can tell R to create them. The first line of code below is quite compact, so we will break it apart to discuss what the various components do. newdata1$rankP tells R that we want to create a new variable called rankP in the dataset (data frame) newdata1; the rest of the command tells R that the values of rankP should be predictions made using the predict() function. The options within the parentheses tell R that the predictions should be based on the analysis mylogit, with values of the predictor variables coming from newdata1, and that the type of prediction is a predicted probability (type="response"). The second line of the code lists the values in the data frame newdata1. Although not particularly pretty, this is a table of predicted probabilities.
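Putting the pieces together, a sketch of the steps being described, assuming the mydata data frame and the fitted model mylogit from the example above:

## one row per level of rank, with gre and gpa held at their means
newdata1 <- with(mydata,
                 data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))

## add predicted probabilities as a new column, then print the table
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1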
In this revised and updated edition of their popular book, David Hosmer and Stanley Lemeshow continue to provide an amazingly accessible introduction to the logistic regression model while incorporating advances of the last decade, including a variety of software packages for the analysis of data sets. Hosmer and Lemeshow extend the discussion from biostatistics and epidemiology to cutting-edge applications in data mining and machine learning, guiding readers step by step through the use of modeling techniques for dichotomous data in diverse fields. Ample new topics and expanded discussions of existing material are accompanied by a wealth of real-world examples, with extensive data sets available over the Internet.
The focus in this Second Edition is again on logistic regression models for individual-level data, but aggregate or grouped data are also considered. The book includes detailed discussions of goodness of fit, indices of predictive efficiency, and standardized logistic regression coefficients, and examples using SAS and SPSS are included.
Phage-bacterium interactions have recently been modeled to facilitate phage therapy against Campylobacter jejuni (4). The influence of bacterial and phage concentrations on the inactivation of Campylobacter and Salmonella in culture medium has also been modeled (3). To our knowledge, however, predictive modeling procedures to determine bacterial behavior in foods when phages are used as biocontrol agents have not yet been developed. Consequently, in the present work, we have developed probabilistic models to facilitate the successful application of phages as biocontrol agents against S. aureus in milk. For this purpose, we have applied a survival/death interface model that describes the conditions limiting bacterial survival. This type of model is derived from a logistic regression procedure, which uses a binary response variable (S. aureus survival or death) and three independent variables (initial bacterial contamination, initial phage titer, and temperature of incubation). The temperature range was selected to account for both an accidental breakdown in the cold chain and the temperatures to which milk is subjected during dairy product processing. The bacterial inoculum range represents low, medium, and high contamination levels.
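A minimal sketch of how such a survival/death interface model could be fit in R. All variable names here (milk, survival, log_inoculum, log_phage, temp) are hypothetical illustrations and do not come from the original study; the outcome is coded 1 = survival, 0 = death, with the three predictors described above:

## hypothetical data frame 'milk' with a binary outcome and three predictors
## survival:     1 = S. aureus survived, 0 = killed
## log_inoculum: log10 initial bacterial contamination
## log_phage:    log10 initial phage titer
## temp:         incubation temperature (degrees C)
interface <- glm(survival ~ log_inoculum + log_phage + temp,
                 data = milk, family = binomial)
summary(interface)

## predicted probability of survival over a grid of conditions
## (grid values below are purely illustrative)
grid <- expand.grid(log_inoculum = c(3, 5, 7),
                    log_phage    = c(6, 8),
                    temp         = c(10, 25, 37))
grid$p_survival <- predict(interface, newdata = grid, type = "response")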
...low $R^2$ values in logistic regression are the norm, and this presents a problem when reporting their values to an audience accustomed to seeing linear regression values. ... Thus [arguing by reference to running examples in the text] we do not recommend routine publishing of $R^2$ values with results from fitted logistic models. However, they may be helpful in the model building stage as a statistic to evaluate competing models.
The only assumptions made in logistic regression are those of linearity and additivity (plus independence). Although many global goodness-of-fit tests (like the Hosmer & Lemeshow $\chi^2$ test, but see my comment to @onestop) have been proposed, they generally lack power. For assessing model fit, it is better to rely on visual criteria (stratified estimates, nonparametric smoothing) that help to spot local or global departures between predicted and observed outcomes (e.g. non-linearity or interaction), and this is largely detailed in Harrell's RMS handout. On a related subject (calibration tests), Steyerberg (Clinical Prediction Models, 2009) points to the same approach for assessing the agreement between observed outcomes and predicted probabilities.
I would have thought the main problem with any kind of $R^2$ measure for logistic regression is that you are dealing with a model which has a known noise value. This is unlike standard linear regression, where the noise level is usually treated as unknown. For we can write a glm probability density function as:

$$f(y_i \mid \theta_i, \phi) = \exp\!\left(\frac{y_i\theta_i - b(\theta_i)}{\phi} + c(y_i, \phi)\right)$$

The deviance residuals $d_i$ are defined so that, when the model holds, the residual deviance $D = \sum_i d_i^2$ has expectation roughly $N - p$.
Where $p$ is the dimension of $\beta$. For logistic regression we have $\phi=1$, which is known. So we can use this to decide on a definite level of residual that is "acceptable" or "reasonable". This usually cannot be done for OLS regression (unless you have prior information about the noise). Namely, we expect each squared deviance residual to be about $1$. Too many $d_i^2\gg1$ and it is likely that important effects are missing from the model (under-fitting); too many $d_i^2\ll1$ and it is likely that there are redundant or spurious effects in the model (over-fitting). (Either could also mean model misspecification.)
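As a rough sketch of this diagnostic in R (reusing the fitted model mylogit from earlier; any fitted glm object would do):

## deviance residuals; under a well-specified model each d_i^2 should be about 1
d <- residuals(mylogit, type = "deviance")
summary(d^2)

## residual deviance compared with its degrees of freedom (n - p)
c(deviance = deviance(mylogit), df = df.residual(mylogit))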
As well as criticising $R^2$, Hosmer & Lemeshow did propose an alternative measure of goodness of fit for logistic regression that is sometimes useful. It is based on dividing the data into (say) 10 groups of equal size (or as near as possible) by ordering on the predicted probability (or, equivalently, the linear predictor), then comparing the observed to the expected number of positive responses in each group and performing a chi-squared test. This 'Hosmer-Lemeshow goodness-of-fit test' is implemented in most statistical software packages.
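One readily available implementation in R is hoslem.test() from the ResourceSelection package; a sketch, again assuming the fitted model mylogit:

library(ResourceSelection)

## Hosmer-Lemeshow test with g = 10 groups: compares observed and expected
## positives within deciles of predicted risk
hoslem.test(mylogit$y, fitted(mylogit), g = 10)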
Although there is no general agreement on how to assess the fit of a logistic regression, there are some approaches. The goodness of fit of the logistic regression model can be expressed by some variants of pseudo $R^2$ statistics, most of which are based on the deviance of the model.
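For example, a sketch of three common variants computed from a fitted glm object (assumed here to be mylogit from the example above):

## McFadden's pseudo R-squared: for a binary outcome the deviance is -2*logLik,
## so this equals 1 - logLik(model)/logLik(null model)
r2_mcf <- 1 - mylogit$deviance / mylogit$null.deviance

n   <- nobs(mylogit)
ll1 <- as.numeric(logLik(mylogit))                # fitted model
ll0 <- as.numeric(logLik(update(mylogit, . ~ 1))) # intercept-only model

## Cox & Snell R-squared (its maximum is below 1 for binary outcomes)
r2_cs <- 1 - exp(2 * (ll0 - ll1) / n)

## Nagelkerke R-squared rescales Cox & Snell to a 0-1 range
r2_nag <- r2_cs / (1 - exp(2 * ll0 / n))

c(McFadden = r2_mcf, CoxSnell = r2_cs, Nagelkerke = r2_nag)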
The goal of logistic regression is to find the best fitting (yet biologically reasonable) model to describe the relationship between the dichotomous characteristic of interest (dependent variable = response or outcome variable) and a set of independent (predictor or explanatory) variables. Logistic regression generates the coefficients (and their standard errors and significance levels) of a formula to predict a logit transformation of the probability of presence of the characteristic of interest:

$$\text{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

where $p$ is the probability of presence of the characteristic of interest.
Rather than choosing parameters that minimize the sum of squared errors (like in ordinary regression), estimation in logistic regression chooses parameters that maximize the likelihood of observing the sample values.
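To make this concrete, here is a minimal sketch of maximizing the Bernoulli log-likelihood directly with optim(), reusing the admissions example names from above (mydata, admit); glm() does the same job with a better algorithm (iteratively reweighted least squares):

## design matrix and response for the admissions example
X <- model.matrix(~ gre + gpa + rank, data = mydata)
y <- mydata$admit

## negative log-likelihood of the logistic model
negll <- function(beta) {
  eta <- X %*% beta
  p <- 1 / (1 + exp(-eta))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

## maximize the likelihood (i.e. minimize the negative log-likelihood)
fit <- optim(rep(0, ncol(X)), negll, method = "BFGS")
fit$par # should closely match coef(mylogit)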
The option to plot a graph that shows the logistic regression curve is only available when there is a single independent variable.

Results

After you click OK, the following results are displayed:
Cox & Snell $R^2$ and Nagelkerke $R^2$ are other goodness-of-fit measures known as pseudo R-squareds. Note that Cox & Snell's pseudo R-squared has a maximum value that is not 1. Nagelkerke's $R^2$ adjusts Cox & Snell's so that the range of possible values extends to 1.

Regression coefficients

The logistic regression coefficients are the coefficients $b_0, b_1, b_2, \dots, b_k$ of the regression equation:

$$\text{logit}(p) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$
An independent variable with a regression coefficient not significantly different from 0 (P > 0.05) can be removed from the regression model (press function key F7 to repeat the logistic regression procedure). If P < 0.05, the variable contributes significantly to the prediction of the outcome variable.
The Hosmer-Lemeshow test is a statistical test of goodness of fit for the logistic regression model. The data are divided into approximately ten groups, defined by increasing order of estimated risk. The observed and expected number of cases in each group is calculated and a chi-squared statistic is calculated as follows:

$$\chi^2_{HL} = \sum_{g=1}^{G}\frac{(O_g - E_g)^2}{E_g\,(1 - E_g/n_g)}$$

where $O_g$ and $E_g$ are the observed and expected numbers of cases in group $g$ and $n_g$ is the number of observations in that group; the statistic is compared with a chi-squared distribution with $G-2$ degrees of freedom.
The classification table is another method to evaluate the predictive accuracy of the logistic regression model. In this table, the observed values for the dependent outcome and the predicted values (at a user-defined cut-off value, for example p = 0.50) are cross-classified. In our example, the model correctly predicts 74% of the cases.
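A sketch of building such a table in R at a cut-off of 0.50, assuming the fitted model mylogit as before:

## cross-classify observed outcomes against predictions at cut-off p = 0.50
predicted <- as.numeric(fitted(mylogit) > 0.50)
tab <- table(Observed = mylogit$y, Predicted = predicted)
tab

## overall percentage correctly classified
100 * sum(diag(tab)) / sum(tab)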
Another method to evaluate the logistic regression model makes use of ROC curve analysis. In this analysis, the power of the model's predicted values to discriminate between positive and negative cases is quantified by the Area under the ROC curve (AUC). The AUC, sometimes referred to as the C-statistic (or concordance index), is a value that varies from 0.5 (discriminating power not better than chance) to 1.0 (perfect discriminating power).
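One way to obtain the ROC curve and its AUC is the pROC package; a sketch, again assuming the fitted model mylogit:

library(pROC)

## ROC curve of the model's predicted probabilities against observed outcomes
roc_obj <- roc(mylogit$y, fitted(mylogit))
plot(roc_obj)
auc(roc_obj) # area under the curve (the C-statistic)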
Sample size calculation for logistic regression is a complex problem, but based on the work of Peduzzi et al. (1996) the following guideline for the minimum number of cases to include in your study can be suggested: let $p$ be the smaller of the proportions of positive or negative cases and $k$ the number of covariates; the minimum number of cases is then $N = 10\,k / p$. For example, with 3 covariates and 30% positive cases, at least $N = 10 \times 3 / 0.30 = 100$ cases are required.
The probability that the model assigns a higher predicted risk to a randomly chosen positive case than to a randomly chosen negative case. Used to compare the goodness of fit of logistic regression models, values for this measure range from 0.5 to 1.0. A value of 0.5 indicates that the model is no better than chance at predicting membership in a group, and a value of 1.0 indicates that the model perfectly identifies those within a group and those not. Models are typically considered reasonable when the C-statistic is higher than 0.7 and strong when C exceeds 0.8 (Hosmer & Lemeshow, 1989; Hosmer & Lemeshow, 2000).

References

Hosmer DW, Lemeshow S. Applied Logistic Regression. New York, NY: John Wiley & Sons; 1989.
Hosmer DW, Lemeshow S. Applied Logistic Regression (2nd Edition). New York, NY: John Wiley & Sons; 2000.