Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?

Minitab Blog Editor | 30 May, 2013

Topics: Regression Analysis

After you have fit a linear model using regression analysis, ANOVA, or design of experiments (DOE), you need to determine how well the model fits the data. To help you out, Minitab Statistical Software presents a variety of goodness-of-fit statistics. In this post, we’ll explore the R-squared (R²) statistic, some of its limitations, and uncover some surprises along the way. For instance, low R-squared values are not always bad and high R-squared values are not always good!

What Is Goodness-of-Fit for a Linear Model?

Definition: Residual = Observed value - Fitted value

Linear regression calculates an equation that minimizes the distance between the fitted line and all of the data points. Technically, ordinary least squares (OLS) regression minimizes the sum of the squared residuals.
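As a quick sketch of what OLS does, here’s a small example with made-up data (the data, seed, and the alternative line are all hypothetical; `np.polyfit` is just one of several ways to get the least-squares line):

```python
import numpy as np

# Hypothetical data: a noisy linear relationship.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# np.polyfit with degree 1 returns the least-squares slope and intercept,
# i.e. the line that minimizes the sum of squared residuals.
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
residuals = y - fitted              # residual = observed - fitted

# Any other candidate line yields a larger sum of squared residuals.
sse_ols = np.sum(residuals ** 2)
sse_other = np.sum((y - (2.1 * x + 0.5)) ** 2)
print(sse_ols < sse_other)          # True
```

No matter which competing line you try, the OLS fit will have the smallest sum of squared residuals; that is exactly what "least squares" means.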

In general, a model fits the data well if the differences between the observed values and the model's predicted values are small and unbiased.

Before you look at the statistical measures for goodness-of-fit, you should check the residual plots. Residual plots can reveal unwanted residual patterns that indicate biased results more effectively than numbers. When your residual plots pass muster, you can trust your numerical results and check the goodness-of-fit statistics.

What Is R-squared?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

The definition of R-squared is fairly straightforward; it is the percentage of the response variable variation that is explained by a linear model. Or:

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%:

  • 0% indicates that the model explains none of the variability of the response data around its mean.
  • 100% indicates that the model explains all the variability of the response data around its mean.
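The definition translates directly into a few lines of code. Here’s a sketch with made-up observed and fitted values, using the equivalent form R-squared = 1 - (unexplained variation / total variation):

```python
import numpy as np

# Made-up observed responses and fitted values from some model.
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
fitted = np.array([2.1, 4.0, 6.0, 8.0, 10.0])

ss_total = np.sum((y - y.mean()) ** 2)   # total variation around the mean
ss_resid = np.sum((y - fitted) ** 2)     # variation left unexplained
r_squared = 1 - ss_resid / ss_total      # = explained / total variation

print(round(r_squared, 4))               # 0.9972
```

Because the fitted values here sit very close to the observations, nearly all of the variation around the mean is explained and R-squared is close to 100%.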

In general, the higher the R-squared, the better the model fits your data. However, there are important conditions for this guideline that I’ll talk about both in this post and my next post.

Graphical Representation of R-squared

Plotting fitted values by observed values graphically illustrates different R-squared values for regression models.

[Figure: fitted versus observed values for two regression models, R-squared = 38.0% (left) and 87.4% (right)]

The regression model on the left accounts for 38.0% of the variance while the one on the right accounts for 87.4%. The more variance that is accounted for by the regression model the closer the data points will fall to the fitted regression line. Theoretically, if a model could explain 100% of the variance, the fitted values would always equal the observed values and, therefore, all the data points would fall on the fitted regression line.


Key Limitations of R-squared

R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.

R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!

The R-squared in your output is a biased estimate of the population R-squared.

Are Low R-squared Values Inherently Bad?

No! There are two major reasons why it can be just fine to have low R-squared values.

In some fields, it is entirely expected that your R-squared values will be low. For example, any field that attempts to predict human behavior, such as psychology, typically has R-squared values lower than 50%. Humans are simply harder to predict than, say, physical processes.

Furthermore, if your R-squared value is low but you have statistically significant predictors, you can still draw important conclusions about how changes in the predictor values are associated with changes in the response value. Regardless of the R-squared, the significant coefficients still represent the mean change in the response for one unit of change in the predictor while holding other predictors in the model constant. Obviously, this type of information can be extremely valuable.

See a graphical illustration of why a low R-squared doesn't affect the interpretation of significant variables.

A low R-squared is most problematic when you want to produce predictions that are reasonably precise (have a small enough prediction interval). How high should the R-squared be for prediction? Well, that depends on your requirements for the width of a prediction interval and how much variability is present in your data. While a high R-squared is required for precise predictions, it’s not sufficient by itself, as we shall see.
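To make the link between scatter and precision concrete, here is a sketch using the textbook standard-error formula for a new prediction in simple regression (the data and noise levels are hypothetical; the full prediction interval multiplies this standard error by a t quantile):

```python
import numpy as np

# Sketch: why a low R-squared makes prediction intervals wide.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)

def prediction_se(noise_sd, x0=5.0):
    # Simulate data with the given amount of scatter, fit a line,
    # and return the standard error for predicting a new y at x0.
    y = 3.0 + 0.5 * x + rng.normal(scale=noise_sd, size=x.size)
    b, a = np.polyfit(x, y, 1)
    resid = y - (a + b * x)
    n = x.size
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))        # residual std. error
    sxx = np.sum((x - x.mean()) ** 2)
    return s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)

# More scatter around the line (i.e. lower R-squared) -> wider intervals.
print(prediction_se(0.5) < prediction_se(3.0))       # True
```

The residual standard error `s` dominates the width, which is why a model with large unexplained scatter cannot give precise predictions no matter how significant its coefficients are.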

Are High R-squared Values Inherently Good?

No! A high R-squared does not necessarily indicate that the model has a good fit. That might be a surprise, but look at the fitted line plot and residual plot below. The fitted line plot displays the relationship between semiconductor electron mobility and the natural log of the density for real experimental data.

[Figure: fitted line plot and residuals-versus-fits plot for the electron mobility data]

The fitted line plot shows that these data follow a nice tight function and the R-squared is 98.5%, which sounds great. However, look closer to see how the regression line systematically over- and under-predicts the data (bias) at different points along the curve. You can also see patterns in the Residuals versus Fits plot, rather than the randomness that you want to see. This indicates a bad fit, and serves as a reminder as to why you should always check the residual plots.

This example comes from my post about choosing between linear and nonlinear regression. In this case, the answer is to use nonlinear regression because linear models are unable to fit the specific curve that these data follow.

However, similar biases can occur when your linear model is missing important predictors, polynomial terms, and interaction terms. Statisticians call this specification bias, and it is caused by an underspecified model. For this type of bias, you can fix the residual patterns by adding the proper terms to the model.
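Here is a sketch of specification bias with made-up data: a straight line fit to gently curved data still earns a high R-squared, but its residuals run positive at the ends of the range and negative in the middle, and adding the missing squared term removes the pattern:

```python
import numpy as np

# Hypothetical curved data: y depends on both x and x^2.
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 100)
y = 1.0 + 2.0 * x + 0.3 * x ** 2 + rng.normal(scale=1.0, size=x.size)

def r_squared(y, fitted):
    return 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

lin = np.polyval(np.polyfit(x, y, 1), x)    # underspecified: line only
quad = np.polyval(np.polyfit(x, y, 2), x)   # includes the x^2 term

# The straight line still scores a high R-squared ...
print(r_squared(y, lin) > 0.9)              # True
# ... but its residuals are systematically patterned: positive near the
# ends of the x range, negative in the middle.
resid_lin = y - lin
print(resid_lin[:10].mean() > 0, resid_lin[45:55].mean() < 0)  # True True
# Adding the proper term raises R-squared and removes the pattern.
print(r_squared(y, quad) > r_squared(y, lin))                  # True
```

The headline R-squared alone would never reveal the problem; only the residual pattern does.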

For more information about how a high R-squared is not always a good thing, read my post Five Reasons Why Your R-squared Can Be Too High.

Closing Thoughts on R-squared

R-squared is a handy, seemingly intuitive measure of how well your linear model fits a set of observations. However, as we saw, R-squared doesn’t tell us the entire story. You should evaluate R-squared values in conjunction with residual plots, other model statistics, and subject area knowledge in order to round out the picture (pardon the pun).

While R-squared provides an estimate of the strength of the relationship between your model and the response variable, it does not provide a formal hypothesis test for this relationship. The F-test of overall significance determines whether this relationship is statistically significant.

In my next blog, we’ll continue with the theme that R-squared by itself is incomplete and look at two other types of R-squared: adjusted R-squared and predicted R-squared. These two measures overcome specific problems in order to provide additional information by which you can evaluate your regression model’s explanatory power.

For more about R-squared, learn the answer to this eternal question: How high should R-squared be?

If you're learning about regression, read my regression tutorial!


FAQs

How do you interpret goodness of fit R-squared?

The most common interpretation of r-squared is how well the regression model explains observed data. For example, an r-squared of 60% reveals that 60% of the variability observed in the target variable is explained by the regression model.

What does R-squared tell you in regression?

R-squared is a statistical measure that indicates how much of the variation of a dependent variable is explained by an independent variable in a regression model.

How can you determine if a regression model is good enough?

The best fit line is the one that minimises the sum of squared differences between actual and estimated results. Taking the average of that minimum sum of squared differences is known as Mean Squared Error (MSE). The smaller the value, the better the regression model.

Which value of R2 indicates the best fitting regression line?

R-squared is a measure of how closely a regression line fits the data in the sample. The closer the R-squared value is to 1, the better the fit. An R-squared value of 0 indicates that the regression line does not fit the data at all, while an R-squared value of 1 indicates a perfect fit.

What is the R-squared value for best fit?

The R-squared value is a quantitative measure of how well the line of best fit (the linear regression model) fits your data. A value closer to 1 (100%) is usually good. The P value is the probability of finding the observed results when the null hypothesis of a statement is true.

What is R2 and how do you interpret it?

The coefficient of determination, or R2 , is a measure that provides information about the goodness of fit of a model. In the context of regression it is a statistical measure of how well the regression line approximates the actual data.

How do you interpret regression results?

Interpreting Linear Regression Coefficients

A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease.

Do you want a high or low R-squared in regression?

R-squared measures the goodness of fit of a regression model. Hence, a higher R-squared indicates the model is a good fit, while a lower R-squared indicates the model is not a good fit.

How do you test goodness of fit in linear regression?

The best way to take a look at regression data is by plotting the predicted values against the real values in the holdout set. In a perfect condition, we expect that the points lie on the 45-degree line passing through the origin (y = x is the equation). The nearer the points are to this line, the better the regression.

What are the measures for assessing goodness of fit in regression analysis?

Three statistics are used in Ordinary Least Squares (OLS) regression to evaluate model fit: R-squared, the overall F-test, and the Root Mean Square Error (RMSE). All three are based on two sums of squares: Sum of Squares Total (SST) and Sum of Squares Error (SSE).
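As a sketch, all three statistics can be computed from those two sums of squares. The data here are hypothetical, `p` is the number of predictors, and the RMSE shown uses the common convention of dividing by the residual degrees of freedom:

```python
import numpy as np

# Hypothetical simple-regression data (p = 1 predictor).
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 40)
y = 1.0 + 0.8 * x + rng.normal(scale=1.0, size=x.size)

b, a = np.polyfit(x, y, 1)
fitted = a + b * x
n, p = x.size, 1

sse = np.sum((y - fitted) ** 2)       # Sum of Squares Error
sst = np.sum((y - y.mean()) ** 2)     # Sum of Squares Total

r_squared = 1 - sse / sst                            # R-squared
f_stat = ((sst - sse) / p) / (sse / (n - p - 1))     # overall F-test statistic
rmse = np.sqrt(sse / (n - p - 1))                    # Root Mean Square Error

print(0 < r_squared < 1, f_stat > 0, rmse > 0)       # True True True
```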

What is the measure of goodness of fit in linear regression?

R squared, the proportion of variation in the outcome Y, explained by the covariates X, is commonly described as a measure of goodness of fit.

How do you know if a line of best fit is good in R?

Correlation Coefficient (r)

  • If r is close to 1 (or -1), the model is considered a "good fit".
  • If r is close to 0, the model is "not a good fit".
  • If r = ±1, the model is a "perfect fit" with all data points lying on the line.
  • If r = 0, there is no linear relationship between the two variables.

Is a higher R-squared a better fit?

In general, the higher the R-squared, the better the model fits your data.

Is goodness of fit R-squared or adjusted R-squared?

Which Is Better, R-Squared or Adjusted R-Squared? Many investors prefer adjusted R-squared because adjusted R-squared can provide a more precise view of the correlation by also taking into account how many independent variables are added to a particular model against which the stock index is measured.

What is R2 score in simple words?

One definition of the R2 score is "the proportion of the variance in the dependent variable that is predictable from the independent variable(s)." Another is "(total variance explained by model) / total variance." So if it is 100%, the two variables are perfectly correlated, i.e., with no unexplained variance at all.

How do you interpret R-squared and correlation?

How to Interpret Correlation and R-Squared
  1. A strong positive or negative correlation (i.e. a value close to +1 or -1) indicates a strong relationship between the variables.
  2. A weak positive or negative correlation (i.e. a value close to 0) indicates a weak relationship between the variables.

What does R-squared 0.5 mean?

Any R2 value less than 1.0 indicates that at least some variability in the data cannot be accounted for by the model (e.g., an R2 of 0.5 indicates that 50% of the variability in the outcome data cannot be explained by the model).

How do you interpret a regression line example?

Interpreting the slope of a regression line

The slope is interpreted in algebra as rise over run. If, for example, the slope is 2, you can write this as 2/1 and say that as you move along the line, as the value of the X variable increases by 1, the value of the Y variable increases by 2.

What is the best explanation of regression?

Regression allows researchers to predict or explain the variation in one variable based on another variable. The variable that researchers are trying to explain or predict is called the response variable. It is also sometimes called the dependent variable because it depends on another variable.

How do you calculate simple regression analysis and how do you interpret it?

The formula is Y = a + bX + E, where Y is the dependent variable, X is the independent variable, a is the intercept, b is the slope, and E is the residual. Regression is a statistical tool to predict the dependent variable with the help of one or more independent variables.
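For a concrete illustration with made-up data, the least-squares estimates in that formula have closed forms: b is the ratio of the X-Y cross-deviation sum to the X deviation sum of squares, and a follows from the means:

```python
import numpy as np

# Made-up data to illustrate Y = a + bX + E.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares estimates: b = Sxy / Sxx, a = mean(Y) - b * mean(X).
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)           # E: what the line leaves unexplained

print(round(b, 3), round(a, 3))       # 1.99 0.05
```

Interpretation follows the slope: each one-unit increase in X is associated with a roughly 1.99-unit increase in the mean of Y.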

Does a large R2 value mean that a regression is significant?

If your regression model contains independent variables that are statistically significant, a reasonably high R-squared value makes sense. The statistical significance indicates that changes in the independent variables correlate with shifts in the dependent variable.

What is the best measure for goodness of fit?

The adjusted R-square statistic is generally the best indicator of the fit quality when you add additional coefficients to your model. The adjusted R-square statistic can take on any value less than or equal to 1, with a value closer to 1 indicating a better fit. A RMSE value closer to 0 indicates a better fit.

What does an R-squared value of 0.3 mean?

R-squared usually ranges between 0 and 1. A value below 0.3 indicates a weak effect, a value between 0.3 and 0.5 a moderate effect, and a value above 0.7 a strong effect on the dependent variable.

What do you understand by the goodness of fit measures R-squared and adjusted R-squared?

R-squared measures the proportion of the variation in your dependent variable (Y) explained by your independent variables (X) for a linear regression model. Adjusted R-squared adjusts the statistic based on the number of independent variables in the model.

What does an R-squared value of 0.6 mean?

Generally, an R-squared above 0.6 makes a model worth your attention, though there are other things to consider. For example, any field that attempts to predict human behaviour, such as psychology, typically has R-squared values lower than 0.5.
