Excel is a powerful tool for data analysis, whether you're working with a simple relationship between one independent variable and a dependent variable or there are multiple independent variables to consider. Learning about how to perform a multivariate analysis in Excel – in the form of a multiple regression – and interpret the results is essential if you have complicated data to work through. The good news is that Excel is well set up to handle these tasks, and you only need learn how one function works to start making sense of your data.
What Is Multiple Regression?
Multiple regression is a way of relating multiple independent variables to a single dependent variable by finding an equation that describes how the variable in question changes with each. A more basic but similar tool is linear regression, which aims to investigate the link between one independent variable, such as obesity, on a dependent variable like the risk of cancer, but things are rarely this straightforward. Continuing with the example, the number of cigarettes smoked per day is also related to cancer risk, as is the amount of alcohol you drink. To come up with a reliable prediction of cancer risk for an individual, you'd need to take all these factors (and more) into account.
Video of the Day
The general form of the equation used for multiple regressions is:
Y^ = a + b1x1 + b2x2 + b3x3 …
So the Y^ is the expected value for the observation, the b1 and so on represent the slope of the straight-line relationship between x1 and Y^, and the x1 and so on are the variables included in the analysis. The a tells you the point of the y-intercept. A multiple regression involves choosing values of the coefficients (b1 and so on) that minimize the difference between the expected value Y^ and the observed value Y, giving you the best fit between the model and the data.
What Does a Multiple Regression Tell You?
Multiple regressions put numerical values on the association between a multitude of variables and an outcome, so you can use it for predictions, for estimating the relative contributions of the different variables to the result, or for a few other purposes like selecting the most relevant variables to use in a mathematical model.
For example, say you have data on the prices of houses in a certain city (your dependent variable), along with information like whether it has a pool, how many square feet it occupies, how many bedrooms it has, how many bathrooms it has, and how many garages it has. A multiple regression would enable you to look at how each of these factors is related to the price of the house, so – after you looked at how they relate to price – you could use your equation to predict the price of a house based on these points alone.
You can also use this type of regression analysis in Excel to look at how a specific factor of many – such as whether the house has a pool – affects the dependent variable (house prices) if all other variables remain constant. If you convert the coefficients (called "partial regression coefficients") into standard partial regression coefficients, which represent how many standard deviations Y would change by if you changed the corresponding variable by one standard deviation, then the equation also tells you which factors are more important in determining the result.
How to Do a Multiple Regression in Excel
You can perform a multivariate regression in Excel using a built-in function that is accessible through the Data Analysis tool under the Data tab and the Analysis group. Click Data Analysis and find the option for regression in the window that pops up, highlight it and click OK. Click on the select cells icon beside the Input Y Range field and then select the column containing the results for your dependent variable. Then, do the same for the Input X Range field but select the multiple columns for your independent variables. These columns have to be next to one another, so if they aren't, you have to move them before producing the regression.
The Regression window has a range of additional options you can select to tailor the process to suit your needs. For example, you can set a confidence level other than 95 percent if you like, choose to display residuals and specify where the output is placed in your workbook. This last option is automatically set to New Worksheet Ply, so the results show on a new sheet, but you can change this or any other option to suit your needs. Additionally, check the Labels box if the columns for your independent variables have labels at the top, so these display in the output.
Click OK to generate your regression analysis in Excel and be taken to the new sheet.
The Regression Output From Excel
There are three main sections to the output you're presented with after doing a multiple regression in Excel: regression statistics, ANOVA and detail on the estimated regression line. The regression statistics include the multiple correlation coefficient ("Multiple R") which shows the direction and strength of the correlation, from −1 to +1. The coefficient of determination, "R Square," tells you what percentage (as a decimal) of the variation in the dependent variable is explained by the independent variables. The "Adjusted R Square" gives you an indication of explanatory power, but it isn't straightforward to interpret, and the "Standard Error" gives you a measure of the variation between the observed results and your regression line.
The ANOVA section contains statistical information about the amount of the variation explained by the regression line, with "SS Regression" telling you the amount explained by the line, and "SS Residual" representing the amount not explained. The "MS" sections stand for "Mean Square," and the "F Statistic" is the test statistic used to test for a significant result, with the "Significance F" section giving you the P-value.
Finally, the last section tells you about the characteristics of the estimated regression line, in particular, the values of the coefficients, whether they're significantly linked to the dependent variable, and the amount of variation there could be in them. Positive coefficients show a positive relationship between the variable in question and the dependent variable, so when one increases, the other does too. Negative values mean the dependent variable decreases as the independent variable increases. So, if the "square footage" coefficient on a house prices multiple regression is 300, this means an additional square foot of space increases the cost of the house by $300 on average.
Assumptions and Limitations of Multiple Regression
It's important to remember that multiple regression is only a tool, and like most tools, you can only use it in some circumstances, and there are some things it just can't do.
One of the most important limitations is that concluding causality on the basis of the results is difficult. As an example, if you have a multiple regression with the damage done by a fire and many potentially relevant factors, you'd likely find a significant link between the number of firefighters present and the damage done. This doesn't mean that the firefighters caused the damage because another factor such as the size of the fire not included in the model could explain both of these observations.
Two important assumptions of a multivariate analysis in Excel of this type are the assumptions of linearity and normality. You're assuming a linear relationship between the dependent and independent variables, so you should check this is likely to be valid before performing the analysis. You can look at the relationship between each variable individually to check, but this isn't a perfect strategy. Similarly, the test assumes that the variables are normally distributed, so you should check the results for each for normality before conducting the test.