# How to Perform a Box-Cox Transformation in R (With Examples)

A Box-Cox transformation is a commonly used method for transforming a non-normally distributed dataset into a more normally distributed one.

The basic idea behind this method is to find some value for λ such that the transformed data is as close to normally distributed as possible, using the following formula:

• $y(\lambda) = (y^{\lambda} - 1) / \lambda$ if $\lambda \neq 0$
• $y(\lambda) = \log(y)$ if $\lambda = 0$
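To make the two cases of the formula concrete, here is a minimal sketch of the transformation written by hand. The helper name `box_cox` and its `tol` argument are hypothetical (they are not part of any package), and the sketch assumes every value of y is strictly positive, which the transformation requires.

```r
# Hypothetical helper (not from MASS): apply the Box-Cox formula for a given lambda
box_cox <- function(y, lambda, tol = 1e-8) {
  stopifnot(all(y > 0))        # the transformation is only defined for positive y
  if (abs(lambda) < tol) {
    log(y)                     # limiting case: lambda = 0
  } else {
    (y^lambda - 1) / lambda    # general case: lambda != 0
  }
}

y <- c(1, 2, 4, 8)
box_cox(y, 0)    # identical to log(y)
box_cox(y, 1)    # identical to y - 1, a simple shift
box_cox(y, 0.5)  # identical to 2 * (sqrt(y) - 1)
```

Common choices of λ therefore reproduce familiar transformations: λ = 0 is the log, λ = 0.5 is a rescaled square root, and λ = 1 leaves the shape of the data unchanged.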

We can perform a Box-Cox transformation in R by using the boxcox() function from the MASS package. The following example shows how to use this function in practice.

Refer to this paper from the University of Connecticut for a nice summary of the development of the Box-Cox transformation.

### Example: Box-Cox Transformation in R

The following code shows how to fit a linear regression model to a dataset, then use the boxcox() function to find an optimal lambda to transform the response variable and fit a new model.

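```
library(MASS)

#create data
y=c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
x=c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)

#fit linear regression model
model <- lm(y~x)

#find optimal lambda for Box-Cox transformation
#(boxcox() also plots the log-likelihood across candidate lambda values)
bc <- boxcox(y ~ x)
lambda <- bc$x[which.max(bc$y)]
lambda

#fit new linear regression model using the Box-Cox transformation
new_model <- lm(((y^lambda-1)/lambda) ~ x)
```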

The optimal lambda was found to be -0.4242424. Thus, the new regression model replaced the original response variable y with the transformed variable $(y^{-0.4242424} - 1) / (-0.4242424)$.
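Because the new model is fit to the transformed response, its predictions are on the transformed scale. The sketch below shows one way to map them back by inverting the formula, $y = (\lambda z + 1)^{1/\lambda}$. It reuses the example data and hard-codes the lambda reported above; the names `z`, `new_x`, `pred_z`, and `pred_y` are illustrative only.

```r
# Example data and the lambda value reported above
y <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
x <- c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)
lambda <- -0.4242424

# Transform the response and fit the model on the transformed scale
z <- (y^lambda - 1) / lambda
new_model <- lm(z ~ x)

# Predict on the transformed scale for a few illustrative x values
new_x <- data.frame(x = c(3, 5, 7))
pred_z <- predict(new_model, newdata = new_x)

# Invert the Box-Cox formula to return to the original scale of y
pred_y <- (lambda * pred_z + 1)^(1 / lambda)
pred_y
```

Note that back-transformed predictions are generally not the conditional mean of y on the original scale, since the transformation is nonlinear.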

The following code shows how to create two Q-Q plots in R to visualize the differences in residuals between the two regression models:

```
#define plotting area: square panels, one row with two plots
op <- par(pty = "s", mfrow = c(1, 2))

#Q-Q plot for original model
qqnorm(model$residuals)
qqline(model$residuals)

#Q-Q plot for Box-Cox transformed model
qqnorm(new_model$residuals)
qqline(new_model$residuals)

#restore the original plotting settings
par(op)
```

As a rule of thumb, if the data points fall along a straight diagonal line in a Q-Q plot then the dataset likely follows a normal distribution.

Notice how the Box-Cox transformed model produces a Q-Q plot with a much straighter line than the original regression model.

This is an indication that the residuals of the Box-Cox transformed model are closer to normally distributed, which better satisfies one of the assumptions of linear regression.
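The visual comparison can be complemented by a single number per model. As a rough sketch (an informal summary, not a formal normality test), the code below computes the correlation between the theoretical and sample quantiles that qqnorm() would plot; values closer to 1 correspond to points lying closer to a straight line. The helper name `qq_cor` is hypothetical, and the lambda value is hard-coded from the result reported earlier.

```r
# Example data and the lambda value reported earlier
y <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 6, 7, 8)
x <- c(7, 7, 8, 3, 2, 4, 4, 6, 6, 7, 5, 3, 3, 5, 8)
lambda <- -0.4242424

model <- lm(y ~ x)
new_model <- lm(((y^lambda - 1) / lambda) ~ x)

# Hypothetical helper: correlation between theoretical and sample quantiles
qq_cor <- function(res) {
  q <- qqnorm(res, plot.it = FALSE)  # returns the plotted coordinates without drawing
  cor(q$x, q$y)
}

r_orig <- qq_cor(residuals(model))
r_new  <- qq_cor(residuals(new_model))
c(original = r_orig, box_cox = r_new)
```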