Introduction to Correlation

In statistics, we often want to understand not just a single set of data, but how two different sets of data relate to each other. Correlation analysis is the tool we use to systematically examine the relationship between two variables.

Think about everyday situations:

When the temperature rises in the summer, more people visit hill stations and ice cream sales go up.
When a large harvest of tomatoes arrives at the local market (mandi), the price of tomatoes usually drops.

Correlation helps us answer key questions about these kinds of relationships:

Is there a relationship between two variables at all?
If the value of one variable changes, does the other one also change?
Do they change in the same direction or in opposite directions?
How strong is that relationship?

Types of Relationship

The relationship between two variables can be straightforward or complex. It's important to understand what might be behind a statistical connection.

Cause and Effect: Some relationships have a clear cause-and-effect link. For example, low rainfall (cause) is often related to low agricultural productivity (effect).
Coincidence: Sometimes, two things happen at the same time without any real connection. The relationship between the arrival of migratory birds at a sanctuary and the local birth rate is likely just a coincidence.
Influence of a Third Variable: A hidden, third variable can make two unrelated variables seem connected.

Example

Imagine we find a strong correlation between ice cream sales and the number of deaths by drowning. Does eating ice cream cause drowning? Of course not. The third variable at play here is temperature. When it gets hot, more people buy ice cream, and more people go swimming, which unfortunately increases the risk of drowning. The temperature is the real driver behind both trends.

What Does Correlation Measure?

Correlation analysis studies and measures the direction and intensity of the relationship between variables.

Note

A crucial point to remember is that correlation measures covariation, not causation. Just because two variables move together doesn't mean one causes the other. Correlation simply shows that as one variable changes, the other also changes in a predictable way—either in the same direction or in the opposite direction.

For simplicity, we often assume the relationship is linear, meaning it can be represented by a straight line on a graph.

Types of Correlation

Correlation is broadly classified into two types based on the direction of the relationship.

Positive Correlation: This occurs when two variables move together in the same direction.
- If one variable increases, the other also increases.
- If one variable decreases, the other also decreases.
Example: When a person's income rises, their consumption (spending) usually rises too. Similarly, as the temperature increases, the sale of ice cream also increases.
Negative Correlation: This occurs when two variables move in opposite directions.
- If one variable increases, the other decreases.
Example: When the price of apples falls, the demand for them generally increases. If you spend more time studying, your chances of failing decline.
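As a rough illustration of the two directions, the sketch below classifies a pair of series by the sign of the summed products of their deviations from the mean, which is the quantity that gives covariance its sign. The function name direction and all the figures are made up for illustration.

```python
def direction(xs, ys):
    """Return 'positive', 'negative' or 'none' from the sign of the
    summed products of deviations from the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "none"


# Hypothetical observations, invented for this sketch
temperature = [25, 28, 31, 34, 37]
ice_cream_sales = [110, 135, 160, 190, 230]
apple_price = [120, 110, 100, 90, 80]
apple_demand = [40, 48, 55, 61, 70]

print(direction(temperature, ice_cream_sales))  # positive
print(direction(apple_price, apple_demand))     # negative
```

This only reports the direction of the relationship; it says nothing about how strong it is.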

Techniques for Measuring Correlation

We use three main tools to study and measure correlation:

Scatter Diagrams
Karl Pearson's Coefficient of Correlation
Spearman's Rank Correlation

Scatter Diagram

A scatter diagram is a graph that provides a visual way to examine the relationship between two variables without needing any complex calculations.

The values of the two variables are plotted as points on a graph.
By looking at the pattern of these points, we can get a good idea of the nature and strength of the relationship.

How to interpret a scatter diagram:

Direction: If the points form a pattern that slopes upward from left to right, it indicates a positive correlation. If they slope downward, it indicates a negative correlation.
Strength: If the points are clustered closely together in a clear line, the correlation is strong. If they are widely scattered, the correlation is weak.
Perfect Correlation: If all the points lie exactly on a straight line, the correlation is perfect. This can be perfect positive or perfect negative.
No Correlation: If the points are scattered randomly with no clear pattern, there is no correlation.

Karl Pearson's Coefficient of Correlation

This method, also known as the product moment correlation coefficient, gives a precise numerical value for the strength and direction of a linear relationship between two variables, X and Y. This value is called the correlation coefficient, represented by the symbol r.

Note

It is very important to use Karl Pearson's method only when the relationship between the variables is linear. If the relationship is non-linear (curved), this method can be misleading. It's always a good idea to look at a scatter diagram first to check for linearity.

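The formula for Karl Pearson's coefficient is based on the covariance and standard deviations of the variables: r = Cov(X, Y) / (σ_X σ_Y), where Cov(X, Y) is the covariance of X and Y, and σ_X and σ_Y are their standard deviations.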

Covariance measures how two variables change together. Its sign (positive or negative) determines the sign of the correlation coefficient.
The most common formula used for calculation is: r = (NΣXY - (ΣX)(ΣY)) / [√(NΣX² - (ΣX)²) × √(NΣY² - (ΣY)²)], where N is the number of paired observations.
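The computational formula can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a statistics library; the function name pearson_r and the sample values are assumptions made for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from the raw-sums formula:
    r = (N*SXY - SX*SY) / (sqrt(N*SXX - SX^2) * sqrt(N*SYY - SY^2))."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when a variable is constant")
    return numerator / denominator


# Hypothetical paired observations, invented for this sketch
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Here N=5, SX=15, SY=20, SXY=66, SXX=55, SYY=86,
# so r = 30 / (sqrt(50) * sqrt(30)), roughly 0.7746
print(round(pearson_r(x, y), 4))
```

The guard on the denominator reflects that r cannot be computed when either variable has no variation.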

Properties of the Correlation Coefficient (r)

No Unit: r is a pure number and has no units. A correlation of 0.7 between height in feet and weight in kilograms is just 0.7.
Direction: A negative r indicates an inverse or negative relationship (as one goes up, the other goes down). A positive r indicates a positive relationship (both move in the same direction).
Range: The value of r always lies between -1 and +1. If your calculation results in a value outside this range, there is an error.
- -1 ≤ r ≤ 1
Perfect Correlation: If r = +1, there is a perfect positive linear relationship. If r = -1, there is a perfect negative linear relationship.
No Linear Correlation: If r = 0, the variables are uncorrelated, meaning there is no linear relationship between them. However, a non-linear relationship might still exist.
Strength: The closer r is to +1 or -1, the stronger the linear relationship. The closer r is to 0, the weaker the linear relationship.
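Unaffected by Change of Origin and Scale: The value of r does not change if you add or subtract a constant from all the values of X and/or Y (change of origin), or multiply or divide them by a positive constant (change of scale). Multiplying only one of the variables by a negative constant reverses the sign of r but leaves its magnitude unchanged. This property is used in the step deviation method to simplify calculations with large numbers.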
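Some of these properties can be checked numerically. The sketch below is only an illustration under assumed data: pearson_r repeats the computational formula so that the block runs on its own, and the constants 100, 10, 50 and 5 are arbitrary choices for the step deviations.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from the raw-sums computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical data with inconveniently large values
x = [110, 120, 130, 140, 150]
y = [60, 70, 75, 70, 75]

# Change of origin and scale using positive constants (step deviations)
u = [(a - 100) / 10 for a in x]
v = [(b - 50) / 5 for b in y]

r_original = pearson_r(x, y)
r_step = pearson_r(u, v)                   # same value as r_original
r_flipped = pearson_r([-a for a in x], y)  # negative scale flips the sign

# A perfect but non-linear (U-shaped) relationship gives r = 0
x_sym = [-2, -1, 0, 1, 2]
y_sq = [a * a for a in x_sym]
r_curve = pearson_r(x_sym, y_sq)

print(round(r_original, 4), round(r_step, 4), round(r_flipped, 4), r_curve)
```

The last case shows why r = 0 rules out only a linear relationship: Y is fully determined by X, yet the coefficient is zero.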

Interpreting the Value of r

An r value of 0.9 indicates a very strong positive relationship. For example, students with high marks in English tend, quite consistently, to have high marks in Statistics as well.
An r value of 0.1 indicates a very weak positive relationship. High marks in English tell us very little about a student's marks in Statistics.
An r value of -0.9 indicates a very strong negative relationship. A larger supply of vegetables in the market is almost always accompanied by lower prices.
An r value of -0.1 indicates a very weak negative relationship. A larger supply of vegetables is only loosely associated with lower prices.
Note that r describes how consistently the two variables move together, not how large the change in one is for a given change in the other.

Spearman's Rank Correlation

Developed by the psychologist C.E. Spearman, this method measures the linear association between the ranks of items, not their actual values. Spearman's rank correlation coefficient is represented by rₛ or rₖ.

It is particularly useful in the following situations:

When dealing with qualitative data (attributes) that can be ranked but not measured precisely, such as honesty, beauty, or intelligence.
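When the data contains extreme values (outliers), because ranks are far less affected by them than the actual values are, making this method more robust than Pearson's in such cases.
When the relationship between variables is clearly directional (one variable consistently rises or consistently falls as the other rises) but non-linear.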

The formula for Spearman's rank correlation is: rₛ = 1 - [ (6ΣD²) / (n³ - n) ]

D is the difference between the ranks for each pair of observations.
n is the number of observations.
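When ranks are already available and none are tied, the formula can be applied directly. The following is a minimal sketch; the function name spearman_from_ranks and the two judges' rankings are invented for illustration.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rank correlation for untied ranks:
    r_s = 1 - 6 * sum(D^2) / (n^3 - n)."""
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("need two equally long rank lists with at least 2 items")
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n ** 3 - n)


# Hypothetical rankings of five contestants by two judges
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]

# D = -1, 1, -1, 1, 0, so sum(D^2) = 4 and r_s = 1 - 24/120
print(round(spearman_from_ranks(judge_a, judge_b), 3))  # 0.8
```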

Calculation of Rank Correlation Coefficient

There are three main scenarios for calculating rank correlation.

Case 1: Ranks are given

If the data is already in the form of ranks (e.g., judges' rankings in a competition), you simply find the difference (D) for each pair, square it (D²), sum the squares (ΣD²), and apply the formula.

Case 2: Ranks are not given

If you have raw data (like student marks), you must first assign ranks to each variable. You can rank from highest to lowest or lowest to highest, but you must be consistent for both variables. Once ranks are assigned, the process is the same as in Case 1.
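The ranking step for raw data might look like the sketch below. It assumes there are no tied values and ranks from highest (rank 1) to lowest for both variables; the helper names and the marks are hypothetical.

```python
def assign_ranks(values):
    """Rank values from highest (rank 1) to lowest. Assumes no ties."""
    if len(set(values)) != len(values):
        raise ValueError("tied values need average ranks and a correction factor")
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_from_values(xs, ys):
    """Rank both series the same way, then apply 1 - 6*sum(D^2)/(n^3 - n)."""
    n = len(xs)
    rank_x = assign_ranks(xs)
    rank_y = assign_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n ** 3 - n)


# Hypothetical marks of five students, invented for this sketch
english = [78, 65, 90, 55, 82]
statistics_marks = [70, 60, 85, 58, 88]

print(assign_ranks(english))           # [3, 4, 1, 5, 2]
print(assign_ranks(statistics_marks))  # [3, 4, 2, 5, 1]
# sum(D^2) = 2, so r_s = 1 - 12/120
print(round(spearman_from_values(english, statistics_marks), 3))  # 0.9
```

Ranking both variables in the same order matters: ranking one from highest to lowest and the other from lowest to highest would reverse the sign of the result.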

Case 3: Ranks are repeated (tied ranks)

If two or more items have the same value, they are given the average of the ranks they would have occupied.

Example: If two students both score 90 and would have been ranked 2nd and 3rd, each is assigned the rank (2+3)/2 = 2.5. The next student would then receive rank 4.

When ranks are repeated, a correction factor is added to the formula: rₛ = 1 - [ 6(ΣD² + CF) / (n³ - n) ]

The Correction Factor (CF) is calculated for each set of tied ranks using the formula (m³ - m) / 12, where m is the number of times a rank is repeated. You sum the correction factors for all ties.
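A sketch of the tied-ranks calculation follows, assuming ranks run from highest (rank 1) to lowest and that correction factors are summed over the ties in both variables. The function names and the scores are invented for illustration.

```python
from collections import Counter


def average_ranks(values):
    """Rank from highest (rank 1) to lowest; tied values share the
    average of the rank positions they would have occupied."""
    ordered = sorted(values, reverse=True)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # The tied group occupies positions i+1 .. j; take their mean
        rank_of[ordered[i]] = (i + 1 + j) / 2
        i = j
    return [rank_of[v] for v in values]


def correction_factor(values):
    """Sum of (m^3 - m) / 12 over every group of m tied values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)


def spearman_with_ties(xs, ys):
    """r_s = 1 - 6 * (sum(D^2) + CF) / (n^3 - n), CF summed over both series."""
    n = len(xs)
    rank_x = average_ranks(xs)
    rank_y = average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    cf = correction_factor(xs) + correction_factor(ys)
    return 1 - 6 * (sum_d2 + cf) / (n ** 3 - n)


# Hypothetical scores; 90 is repeated in x and 75 is repeated in y
x = [95, 90, 90, 80, 70]
y = [88, 92, 75, 75, 60]

print(average_ranks(x))  # [1.0, 2.5, 2.5, 4.0, 5.0]
print(average_ranks(y))  # [2.0, 1.0, 3.5, 3.5, 5.0]
# sum(D^2) = 4.5 and CF = 0.5 + 0.5, so r_s = 1 - 6 * 5.5 / 120
print(round(spearman_with_ties(x, y), 3))  # 0.725
```

With the two students on 90 sharing rank 2.5, the next student receives rank 4, matching the averaging rule described above.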

Conclusion

Correlation is a powerful statistical tool for studying the relationship between two variables.

A scatter diagram offers a quick visual summary of the relationship.
Karl Pearson's coefficient provides a precise measure of linear relationships when data is numerical and accurate.
Spearman's rank correlation is useful for qualitative data that can be ranked, data with outliers, or relationships that are clearly directional but non-linear.

The knowledge of correlation helps us understand the direction and intensity of how one variable changes when another one does. However, it's essential to remember that this statistical relationship does not, by itself, prove that one variable causes the other.
