Correlation is a measure of a relationship (not necessarily linear) between two datasets.
Pearson coefficient
The Pearson correlation is a measure of linear correlation between two datasets.
$\rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X \sigma_Y}$
np.cov(a,b) returns the 2×2 covariance matrix, with the unbiased variance estimates (ddof=1) on the diagonal and the covariances off the diagonal.
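The structure of the matrix returned by np.cov can be checked directly (a small sketch; the sample values are arbitrary):

```python
import numpy as np

a = np.array([5, 2, 6])
b = np.array([18, 2, 5])

m = np.cov(a, b)  # 2x2 covariance matrix, default ddof=1
# Diagonal entries are the unbiased variance estimates of a and b.
print(np.allclose(np.diag(m), [np.var(a, ddof=1), np.var(b, ddof=1)]))  # True
# Off-diagonal entries are the (symmetric) covariances.
print(np.isclose(m[0, 1], m[1, 0]))  # True
```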
Several computation equivalences are shown below:
import numpy as np
import pandas as pd

a = pd.Series([5, 2, 6])
b = pd.Series([18, 2, 5])
print(a.corr(b)  # unbiased standard deviation estimators !!
      == np.corrcoef(a, b)[0, 1]
      == (np.cov(a, b)[0, 1] / np.sqrt(np.cov(a, b)[0, 0] * np.cov(a, b)[1, 1]))
      == np.cov(a, b)[0, 1] / (np.std(a, ddof=1) * np.std(b, ddof=1))
      != np.cov(a, b)[0, 1] / (np.std(a) * np.std(b)))  # mixing unbiased cov with biased std breaks the equality
# prints True
Note: when we compute these statistics numerically, we use empirical estimates.
Thus, $\mathbb{V}[X] = \mathbb{E}[(X-\mathbb{E}[X])^2]$ is computed as $var_n(x)=\frac{1}{n}\sum_i(x_i - \overline{x})^2$
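As a quick check (a sketch on arbitrary data), this empirical formula matches NumPy's default, which divides by $n$:

```python
import numpy as np

x = np.array([5, 2, 6, 18, 2, 5], dtype=float)

# Empirical (biased) variance: divide the sum of squared deviations by n.
var_n = np.sum((x - x.mean()) ** 2) / len(x)
print(np.isclose(var_n, np.var(x)))          # True: np.var defaults to ddof=0
print(np.isclose(var_n, np.var(x, ddof=1)))  # False: ddof=1 divides by n - 1
```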
In the graph below, the two samples are highly correlated (Pearson coefficient ≈ 0.98).
The linear trend appears when plotting one sample against the other.
Warning: the Pearson coefficient is NOT sensitive to scale (see relationship metrics).
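This scale invariance can be checked directly (a sketch reusing the series above; any positive affine transform of either series works):

```python
import numpy as np
import pandas as pd

a = pd.Series([5, 2, 6])
b = pd.Series([18, 2, 5])

# A positive affine transform of either series leaves the coefficient unchanged.
print(np.isclose(a.corr(b), (10 * a + 3).corr(b)))  # True
print(np.isclose(a.corr(b), a.corr(0.5 * b - 7)))   # True
```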
Autocorrelation (1)
$R_{k} = \frac{\mathbb{E}[(X_i-\mu_X)(X_{i+k}-\mu_X)]}{\sigma_X^2}$
$X_i$ is the dataset without the last $k$ values
$X_{i+k}$ is the dataset without the first $k$ values
$\mu_X$ is the mean on the whole dataset $X$
$\sigma_X^2$ is the variance on the whole dataset $X$
Autocorrelation (2)
$R_{k} = \frac{\mathbb{E}[(X_i-\mu_{X_i})(X_{i+k}-\mu_{X_{i+k}})]}{\sigma_{X_i}\sigma_{X_{i+k}}}$
$X_i$ is the dataset without the last $k$ values
$X_{i+k}$ is the dataset without the first $k$ values
$\mu_{X_i}$ is the mean on dataset $X_i$
$\sigma_{X_i}$ is the standard deviation on dataset $X_i$
statsmodels.tsa.stattools.acf uses formula (1).
pandas' Series.autocorr uses formula (2).
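As a check of formula (2), here is a hand-rolled version (a sketch; `autocorr_by_hand` is not a library function) compared against pandas:

```python
import numpy as np
import pandas as pd

def autocorr_by_hand(x, lag):
    # Formula (2): each slice uses its own mean and unbiased std estimate.
    y1 = np.asarray(x[:len(x) - lag], dtype=float)
    y2 = np.asarray(x[lag:], dtype=float)
    cov = np.sum((y1 - y1.mean()) * (y2 - y2.mean())) / (len(y1) - 1)
    return cov / (np.std(y1, ddof=1) * np.std(y2, ddof=1))

s = pd.Series([5, 2, 6, 18, 2, 5])
print(np.isclose(autocorr_by_hand(s, 3), s.autocorr(3)))  # True
```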
Below is the summary of equivalences:
import numpy as np
import pandas as pd
import statsmodels.tsa.stattools as sm

s = pd.Series([5, 2, 6, 18, 2, 5])
a = pd.Series([5, 2, 6])
b = pd.Series([18, 2, 5])
# Formula (2)
print(s.autocorr(3)  # unbiased standard deviation estimators !!
      == a.corr(b)
      == np.cov(a, b)[0, 1] / (np.std(a, ddof=1) * np.std(b, ddof=1)))
# prints True
# Formula (1)
def acf_by_hand(x, lag):
    # both slices share the overall mean and overall (biased) variance of x
    y1 = np.array(x[:(len(x) - lag)])
    y2 = np.array(x[lag:])
    sum_product = np.sum((y1 - np.mean(x)) * (y2 - np.mean(x)))
    return sum_product / (len(x) * np.var(x))

print(round(acf_by_hand(s, 3), 6)
      == round(sm.acf(s)[3], 6))  # biased covariance and standard deviation estimators !!
# prints True
Below a graphical comparison of both formulas:
import pandas as pd
import statsmodels.tsa.stattools as sm
s = pd.Series([5, 2, 6, 18, 2, 5])
corr_statsmodel = sm.acf(s)[1:4]
corr_pandas = [s.autocorr(i) for i in range(1,4)]
test_df = pd.DataFrame([corr_statsmodel, corr_pandas]).T
test_df.columns = ['Statsmodels Autocorr', 'Pandas Autocorr']
test_df.index += 1
test_df.plot(kind='bar', title='Comparison with different lags')
Note: the differences between the two formulas shrink for longer time series but can be sizeable for short ones.
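To illustrate this convergence, formula (1) can be recomputed by hand (as in acf_by_hand above) and compared with pandas' formula (2) on a short and a longer series; the seed and series length are arbitrary:

```python
import numpy as np
import pandas as pd

def acf_formula1(x, lag):
    # Formula (1): overall mean and overall (biased) variance.
    x = np.asarray(x, dtype=float)
    y1, y2 = x[:len(x) - lag], x[lag:]
    return np.sum((y1 - x.mean()) * (y2 - x.mean())) / (len(x) * np.var(x))

short = pd.Series([5.0, 2, 6, 18, 2, 5])
rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility
longer = pd.Series(rng.normal(size=5000))

diff_short = abs(acf_formula1(short, 1) - short.autocorr(1))
diff_long = abs(acf_formula1(longer, 1) - longer.autocorr(1))
print(diff_short, diff_long)  # the gap is much smaller for the long series
```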
Partial autocorrelation
Based on article understanding-partial-auto-correlation (towardsdatascience)
\[PR_k = \frac{cov(X_t | X_{t-1} \dots X_{t-k+1},\ X_{t-k} | X_{t-1} \dots X_{t-k+1})}{\sigma_{X_t | X_{t-1} \dots X_{t-k+1}} \sigma_{X_{t-k} | X_{t-1} \dots X_{t-k+1}}}\]
$X_t | X_{t-1} \dots X_{t-k+1}$ is the residual of the regression $X_t = \beta_0 + \beta_1 X_{t-1} + \dots + \beta_{k-1} X_{t-k+1}$
$X_{t-k} | X_{t-1} \dots X_{t-k+1}$ is the residual of the regression $X_{t-k} = \beta_0 + \beta_1 X_{t-1} + \dots + \beta_{k-1} X_{t-k+1}$
Thus, one can write:
$PR_k = \rho_{\epsilon_t,\epsilon_{t-k}}$, where $\epsilon_t$ and $\epsilon_{t-k}$ are the two regression residuals above.
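The residual-based definition can be implemented directly (a sketch; `pacf_by_hand` is a hypothetical helper, not a library function). For $k=1$ there are no intermediate lags, so the value reduces to the ordinary lag-1 autocorrelation:

```python
import numpy as np
import pandas as pd

def pacf_by_hand(x, k):
    # Partial autocorrelation at lag k via the regression-residual definition.
    x = np.asarray(x, dtype=float)
    n = len(x)
    y_t = x[k:]        # X_t
    y_tk = x[:n - k]   # X_{t-k}
    # Design matrix: intercept + intermediate lags X_{t-1} .. X_{t-k+1}
    cols = [np.ones(n - k)] + [x[k - j:n - j] for j in range(1, k)]
    A = np.column_stack(cols)
    # Residuals of both regressions on the intermediate lags
    res_t = y_t - A @ np.linalg.lstsq(A, y_t, rcond=None)[0]
    res_tk = y_tk - A @ np.linalg.lstsq(A, y_tk, rcond=None)[0]
    return np.corrcoef(res_t, res_tk)[0, 1]

s = pd.Series([5, 2, 6, 18, 2, 5])
print(np.isclose(pacf_by_hand(s, 1), s.autocorr(1)))  # True: no intermediate lags at k=1
```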
We use the partial autocorrelation to choose the order $p$ of an AR(p) model.
Based on this graph, we can use an AR(2) or even an AR(3) model ($k=3$ is just outside the 95% confidence interval).