
In statistics, an **influential observation** is an observation in a dataset that, when removed, dramatically changes the coefficient estimates of a regression model.

The most common way to measure the influence of observations is to use **Cook’s distance**, which quantifies how much all of the fitted values in a regression model change when the $i^{th}$ observation is deleted.
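For reference, the standard definition can be written as below. The notation is introduced here for illustration and does not come from the example that follows: $\hat{y}_j$ is the fitted value for observation $j$ from the full model, $\hat{y}_{j(i)}$ is the fitted value for observation $j$ when the model is refit without observation $i$, $p$ is the number of estimated coefficients (including the intercept), $MSE$ is the mean squared error of the full model, $e_i$ is the residual of observation $i$, and $h_{ii}$ is its leverage.

$$
D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p \cdot MSE} = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_{ii}}{\left(1 - h_{ii}\right)^2}
$$

The second form shows that an observation needs both a sizable residual and non-trivial leverage to produce a large Cook’s distance.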

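As a rule of thumb, any observation with a Cook’s distance greater than 1 is considered potentially influential. Note that influence is not the same as leverage: leverage measures only how unusual an observation’s predictor values are, while Cook’s distance combines leverage with the size of the residual.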

The following example shows how to calculate and interpret Cook’s distance for a given dataset to detect potential influential observations.

**Example: Detecting Influential Observations**

Suppose we have the following dataset with 14 values:

Now suppose we fit a simple linear regression model. The regression output is shown below:

Using statistical software, we can calculate the following values for Cook’s distance for each observation:
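As a rough illustration of what such software does, here is a minimal pure-Python sketch for simple linear regression. The data are hypothetical (14 made-up points, with the last one far from the trend) and are not the dataset from this example. The helper names `fit_line` and `cooks_distance` are also assumptions made for this sketch.

```python
# Hypothetical data, NOT the dataset from this article:
# 13 points close to a straight line plus one point far from the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 30]
y = [3, 4, 7, 8, 11, 12, 13, 17, 18, 19, 23, 24, 27, 10]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def cooks_distance(x, y):
    """Cook's distance for each observation in a simple linear regression."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Leverage of each observation in simple linear regression.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]


distances = cooks_distance(x, y)

# Apply the rule of thumb: flag observations with Cook's distance above 1.
flagged = [i for i, d in enumerate(distances) if d > 1]

for i, d in enumerate(distances):
    marker = "  <-- above 1" if i in flagged else ""
    print(f"observation {i + 1}: {d:.3f}{marker}")
```

For real work, a statistics library is usually the better choice. In statsmodels, for instance, a fitted OLS result offers `get_influence()`, whose `cooks_distance` attribute returns the distances along with p-values.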

Notice that the last observation has a Cook’s distance well above 1, which tells us that it’s an influential observation.

Suppose we remove this value from the dataset and fit a new simple linear regression model. The output for this model is shown below:

Notice that the regression coefficients for the intercept and x both changed dramatically. This tells us that removing the influential observation from the dataset completely changed the fitted regression model.
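A minimal sketch of this before-and-after comparison is shown below. It uses hypothetical data (not the dataset from this example), and the helper name `fit_line` is an assumption made for the sketch.

```python
# Hypothetical data, NOT the dataset from this article:
# the last point lies far from the trend of the other 13.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 30]
y = [3, 4, 7, 8, 11, 12, 13, 17, 18, 19, 23, 24, 27, 10]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Fit once with every observation, then again without the last one.
intercept_full, slope_full = fit_line(x, y)
intercept_reduced, slope_reduced = fit_line(x[:-1], y[:-1])

print(f"all observations:     intercept={intercept_full:.3f}, slope={slope_full:.3f}")
print(f"last observation out: intercept={intercept_reduced:.3f}, slope={slope_reduced:.3f}")
```

In this made-up dataset the slope is much steeper once the last point is dropped, which is the kind of shift a large Cook’s distance warns about.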

The following plots show the difference between these two fitted regression equations:

Notice how much the one influential observation changes the regression line. With this observation removed, the regression line fits the remaining data much more closely.

**Notes**

It’s important to note that Cook’s distance should be used as a way to *identify* potentially influential observations. However, just because an observation is influential doesn’t necessarily mean that it should be deleted from the dataset.

First, you should verify that the observation isn’t the result of a data entry error or some other odd occurrence. If it turns out to be a legitimate value, you can then decide to deal with it in one of the following ways:

- Delete it from the dataset.
- Leave it in the dataset.
- Replace it with an alternative value like the mean or median.

Depending on your specific scenario, one of these options may make more sense than the others.

**How to Calculate Cook’s Distance in Practice**

The following tutorials explain how to calculate Cook’s distance for a given dataset in Python and R:

How to Calculate Cook’s Distance in Python

How to Calculate Cook’s Distance in R