Correlation
In statistics, we often want to understand not just a single set of data, but how two different sets of data relate to each other. Correlation analysis is the tool we use to systematically examine the relationship between two variables.
Think about everyday situations:
Correlation helps us answer key questions about these kinds of relationships:
The relationship between two variables can be straightforward or complex. It's important to understand what might be behind a statistical connection.
Correlation studies and measures the direction and intensity of the relationship between variables.
For simplicity, we often assume the relationship is linear, meaning it can be represented by a straight line on a graph.
Correlation is broadly classified into two types based on the direction of the relationship.
Positive Correlation: This occurs when two variables move together in the same direction.
Negative Correlation: This occurs when two variables move in opposite directions.
We use three main tools to study and measure correlation: the scatter diagram, Karl Pearson's coefficient of correlation, and Spearman's rank correlation.
A scatter diagram is a graph that provides a visual way to examine the relationship between two variables without needing any complex calculations.
How to interpret a scatter diagram:
Karl Pearson's coefficient of correlation, also known as the product moment correlation coefficient, gives a precise numerical value for the strength and direction of a linear relationship between two variables, X and Y. This value is called the correlation coefficient, represented by the symbol r.
The formula for Karl Pearson's coefficient is based on the covariance and standard deviations of the variables.
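As a minimal sketch of this idea, the Python below computes r in two ways: as the covariance divided by the product of the standard deviations, and from the raw sums N, ΣX, ΣY, ΣXY, ΣX² and ΣY² used in the formula in this section. The function names and the marks data are hypothetical, chosen only for illustration, and the sketch assumes neither variable is constant (otherwise a standard deviation is zero and the division fails).

```python
import math

def pearson_r(x, y):
    """Pearson's r as covariance / (std_x * std_y), using divide-by-N moments."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)

def pearson_r_raw_sums(x, y):
    """The same coefficient computed from raw sums, without finding the means first."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_xx - sum_x ** 2) * math.sqrt(n * sum_yy - sum_y ** 2)
    return numerator / denominator

# Hypothetical marks of five students in two subjects
english_marks = [10, 20, 30, 40, 50]
stats_marks = [12, 25, 29, 45, 48]

print(round(pearson_r(english_marks, stats_marks), 4))
print(round(pearson_r_raw_sums(english_marks, stats_marks), 4))
```

Both functions should print the same value, since the two forms are algebraically equivalent. The raw-sums version is convenient for hand calculation, while the covariance version shows where the coefficient comes from.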
r = (NΣXY - (ΣX)(ΣY)) / [√(NΣX² - (ΣX)²) * √(NΣY² - (ΣY)²)]
r is a pure number and has no units. A correlation of 0.7 between height in feet and weight in kilograms is just 0.7.
A negative r indicates an inverse or negative relationship (as one goes up, the other goes down). A positive r indicates a positive relationship (both move in the same direction).
r always lies between -1 and +1, that is, -1 ≤ r ≤ 1. If your calculation results in a value outside this range, there is an error.
If r = +1, there is a perfect positive linear relationship. If r = -1, there is a perfect negative linear relationship.
If r = 0, the variables are uncorrelated, meaning there is no linear relationship between them. However, a non-linear relationship might still exist.
The closer r is to +1 or -1, the stronger the linear relationship. The closer r is to 0, the weaker the linear relationship.
r does not change if you add a constant to, or subtract a constant from, all the values of X and/or Y, or if you multiply or divide them by a positive constant. (Multiplying only one of the variables by a negative constant reverses the sign of r but leaves its magnitude unchanged.) This property is used in the step deviation method to simplify calculations with large numbers.
An r value of 0.9 indicates a very strong positive relationship. For example, students with high marks in English will almost certainly have high marks in Statistics.
An r value of 0.1 indicates a very weak positive relationship. Students with high marks in English might have only slightly higher marks in Statistics.
An r value of -0.9 indicates a very strong negative relationship. A large supply of vegetables in the market will be met with a sharp drop in prices.
An r value of -0.1 indicates a weak negative relationship. A large supply of vegetables will only cause a small drop in prices.
Spearman's rank correlation, developed by psychologist C.E. Spearman, measures the linear association between the ranks of items, not their actual values. Spearman's rank correlation coefficient is represented by rₛ or rₖ.
It is particularly useful in the following situations:
The formula for Spearman's rank correlation is:
rₛ = 1 - [ (6ΣD²) / (n³ - n) ]
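To see the formula at work, here is a minimal Python sketch that applies it to ranks that are already known and contain no ties. The function name and the two judges' rankings are hypothetical, made up for illustration.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's coefficient 1 - 6*sum(D^2) / (n^3 - n), assuming no tied ranks."""
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("both rank lists must have the same length (at least 2)")
    # D is the difference between the two ranks given to the same item
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d_squared) / (n ** 3 - n)

# Hypothetical ranks given by two judges to five contestants
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]

print(spearman_from_ranks(judge_a, judge_b))
```

For this data the differences D are -1, 1, -1, 1 and 0, so ΣD² = 4 and rₛ = 1 - 24/120 = 0.8, which indicates that the two judges largely agree.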
There are three main scenarios for calculating rank correlation.
Case 1: Ranks are given. If the data is already in the form of ranks (e.g., judges' rankings in a competition), you simply find the difference (D) for each pair, square it (D²), sum the squares (ΣD²), and apply the formula.
Case 2: Ranks are not given. If you have raw data (like student marks), you must first assign ranks to each variable. You can rank from highest to lowest or lowest to highest, but you must be consistent for both variables. Once ranks are assigned, the process is the same as in Case 1.
Case 3: Ranks are repeated (tied ranks). If two or more items have the same value, they are given the average of the ranks they would have occupied. For example, if two students both score 90 and would have been ranked 2nd and 3rd, each is assigned the rank (2+3)/2 = 2.5. The next student would then receive rank 4.
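The ranking step in Cases 2 and 3 can be sketched in Python as follows. This is one possible way to assign ranks from raw values, with tied values sharing the average of the positions they occupy. The function name, the `descending` parameter, and the marks are hypothetical.

```python
def average_ranks(values, descending=True):
    """Rank values from 1 to n; tied values share the average of their positions."""
    # Indices of the items, ordered from best to worst
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend `end` over the run of items tied with the item at `pos`
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions are 1-based, so this run occupies ranks pos+1 through end+1
        shared_rank = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks

# Hypothetical marks: two students tie on 90
marks = [95, 90, 90, 80]
print(average_ranks(marks))  # [1.0, 2.5, 2.5, 4.0]
```

The two students with 90 marks each receive rank 2.5 and the next student receives rank 4, matching the tied-rank example in Case 3.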
When ranks are repeated, a correction factor is added to the formula:
rₛ = 1 - [ 6(ΣD² + CF) / (n³ - n) ]
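A minimal Python sketch of the corrected formula follows. It assumes that the ranks have already been assigned with ties averaged, and it takes the correction for each group of m tied items to be (m³ - m) / 12, summed over all ties in both variables. The function names and the ranks are hypothetical.

```python
from collections import Counter

def tie_correction(ranks):
    """Sum of (m^3 - m) / 12 over every group of m items that share a rank."""
    return sum((m ** 3 - m) / 12 for m in Counter(ranks).values() if m > 1)

def spearman_with_ties(rank_x, rank_y):
    """Spearman's coefficient with the correction factor added to sum(D^2)."""
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("both rank lists must have the same length (at least 2)")
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    # Ties in either variable contribute to the total correction factor
    cf = tie_correction(rank_x) + tie_correction(rank_y)
    return 1 - 6 * (sum_d_squared + cf) / (n ** 3 - n)

# Hypothetical ranks for four students; two share rank 2.5 in the first subject
ranks_subject_1 = [1, 2.5, 2.5, 4]
ranks_subject_2 = [1, 2, 3, 4]

print(spearman_with_ties(ranks_subject_1, ranks_subject_2))
```

Here ΣD² = 0.5 and the single tie of two items gives CF = (2³ - 2) / 12 = 0.5, so rₛ = 1 - 6(1.0)/60 = 0.9. When no ranks are repeated, CF is zero and the result reduces to the uncorrected formula.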
The correction factor (CF) for each tie is (m³ - m) / 12, where m is the number of times a rank is repeated. You sum the correction factors for all ties.
Correlation is a powerful statistical tool for studying the relationship between two variables.
The knowledge of correlation helps us understand the direction and intensity of how one variable changes when another one does. However, it's essential to remember that this statistical relationship does not, by itself, prove that one variable causes the other.