Introduction to Measures of Central Tendency
In statistics, after collecting and organizing data, the next step is often to summarize it. Instead of looking at a long list of numbers, we can use a single value to represent the entire dataset. This single, representative value is called a measure of central tendency, or an average.
These averages help us make sense of large amounts of information and compare different datasets.
Example
Imagine a farmer named Baiju who owns one acre of land in a village with fifty other small farmers. To understand Baiju's economic condition, you can't just look at his one acre. You need to compare it to the landholdings of all the other farmers. You could ask:
- Is his land size above the average for the village? (This uses the Arithmetic Mean).
- Is his land size larger than what half the farmers own? (This uses the Median).
- Is his land size similar to the most common land size in the village? (This uses the Mode).
Each of these questions is answered by a different type of average, which helps summarize the data in a meaningful way.
The three most commonly used measures of central tendency are:
- Arithmetic Mean
- Median
- Mode
Arithmetic Mean
The Arithmetic Mean is the most common type of average. It is calculated by adding up all the values in a dataset and then dividing by the total number of values. It is usually represented by the symbol X̄.
The formula for the arithmetic mean is:
X̄ = ΣX / N
Where:
- ΣX is the sum of all the observations.
- N is the total number of observations.
How Arithmetic Mean is Calculated
The method for calculating the arithmetic mean depends on whether the data is grouped or ungrouped.
Arithmetic Mean for Ungrouped Data
Ungrouped data is simply a list of individual observations.
Direct Method
This is the simplest method: sum all the values and divide by the count of the values.
Example
To find the average marks of five students who scored 40, 50, 55, 78, and 58 in a test:
- Sum of marks (ΣX) = 40 + 50 + 55 + 78 + 58 = 281
- Number of students (N) = 5
- Arithmetic Mean (X̄) = 281 / 5 = 56.2 marks.
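The direct method can be sketched in a few lines of Python. This is only an illustration of the calculation above; the function name arithmetic_mean and the variable names are assumptions made for this example.

```python
# Marks of the five students from the example above.
marks = [40, 50, 55, 78, 58]

def arithmetic_mean(values):
    """Direct method: sum of the observations divided by their count."""
    return sum(values) / len(values)

mean_marks = arithmetic_mean(marks)
print(mean_marks)  # 56.2
```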
Assumed Mean Method
This method is a shortcut for cases where the observations are numerous or large in value, which makes direct calculation tedious.
The steps are:
- Assume a mean (A). This can be any value, but choosing a value in the middle of the data simplifies the math.
- Calculate the deviation (d) of each observation (X) from the assumed mean: d = X - A.
- Sum up all the deviations (Σd).
- Apply the formula: X̄ = A + (Σd / N)
Note
The purpose of the Assumed Mean method is to work with smaller, more manageable numbers (the deviations) to make calculations faster and reduce errors.
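As a rough sketch, the steps of the Assumed Mean method can be checked against the direct method in Python. The marks are the five from the earlier example; the choice of A = 55 and the function name assumed_mean_method are assumptions made for illustration.

```python
import math

marks = [40, 50, 55, 78, 58]

def assumed_mean_method(values, assumed_mean):
    """Mean computed as A + (sum of d) / N, where d = X - A."""
    deviations = [x - assumed_mean for x in values]
    return assumed_mean + sum(deviations) / len(values)

# With A = 55 the deviations are -15, -5, 0, 23, 3, which sum to 6.
result = assumed_mean_method(marks, assumed_mean=55)
direct = sum(marks) / len(marks)

# Both routes give the same mean (up to floating-point rounding).
print(math.isclose(result, direct))  # True
```

Any other value of A would give the same result; a central value simply keeps the deviations small.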
Step Deviation Method
This method simplifies calculations even further, especially when the deviations from the assumed mean have a common factor.
The steps are:
- Choose an assumed mean (A) and calculate the deviations d = X - A, as in the first two steps of the Assumed Mean method.
- Find a common factor (c) that can divide all the deviations (d).
- Calculate the step deviation (d') for each observation: d' = d / c = (X - A) / c.
- Sum up all the step deviations (Σd').
- Apply the formula: X̄ = A + (Σd' / N) × c
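The Step Deviation steps might look like this in Python. The data are hypothetical, chosen so that every deviation from A = 180 is a multiple of c = 30; the function name is likewise an assumption for this sketch.

```python
import math

# Hypothetical data: every deviation from A = 180 is a multiple of 30.
values = [120, 150, 180, 210, 270]

def step_deviation_method(values, assumed_mean, common_factor):
    """Mean computed as A + (sum of d') / N * c, where d' = (X - A) / c."""
    step_deviations = [(x - assumed_mean) / common_factor for x in values]
    return assumed_mean + sum(step_deviations) / len(values) * common_factor

result = step_deviation_method(values, assumed_mean=180, common_factor=30)

# Step deviations are -2, -1, 0, 1, 3, so the mean is 180 + (1 / 5) * 30 = 186.
print(math.isclose(result, 186))  # True
```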
Arithmetic Mean for Grouped Data
Grouped data is organized into categories or classes, often with frequencies.
Discrete Series
In a discrete series, each observation value (X) has a corresponding frequency (f), which tells us how many times that value appears.
- Direct Method: The formula is adjusted to account for the frequencies:
X̄ = ΣfX / Σf
Here, ΣfX is the sum of each value multiplied by its frequency, and Σf is the total number of observations (the sum of all frequencies).
- Assumed Mean Method: The formula is adapted for frequencies:
X̄ = A + (Σfd / Σf)
Here, Σfd is the sum of each deviation multiplied by its corresponding frequency.
- Step Deviation Method: The formula is also adapted for frequencies:
X̄ = A + (Σfd' / Σf) × c
Here, Σfd' is the sum of each step deviation multiplied by its corresponding frequency.
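To see that the three discrete-series formulas agree, here is a small Python sketch. The values, frequencies, assumed mean A = 20, and common factor c = 10 are all hypothetical figures picked for this example.

```python
import math

# Hypothetical discrete series: values X and their frequencies f.
x_values = [10, 20, 30, 40]
frequencies = [2, 5, 2, 1]
total_f = sum(frequencies)  # sum of f = 10

# Direct method: sum of fX divided by sum of f.
direct = sum(f * x for x, f in zip(x_values, frequencies)) / total_f

# Assumed mean method with A = 20: A + (sum of fd) / (sum of f).
A = 20
assumed = A + sum(f * (x - A) for x, f in zip(x_values, frequencies)) / total_f

# Step deviation method with c = 10: A + (sum of fd') / (sum of f) * c.
c = 10
step = A + sum(f * (x - A) / c for x, f in zip(x_values, frequencies)) / total_f * c

print(direct)  # 22.0
print(math.isclose(direct, assumed) and math.isclose(direct, step))  # True
```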
Continuous Series
In a continuous series, data is presented in class intervals (e.g., 0-10, 10-20).
To calculate the mean, we first need to find the mid-point (m) of each class interval. The mid-point then acts as the representative value (like X in a discrete series) for all observations in that class.
The formulas are the same as for a discrete series, but with m replacing X:
- Direct Method: X̄ = Σfm / Σf
- Assumed Mean Method: X̄ = A + (Σfd / Σf) (where d = m - A)
- Step Deviation Method: X̄ = A + (Σfd' / Σf) × c (where d' = (m - A) / c)
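A minimal Python sketch of the direct method for a continuous series follows. The class intervals and frequencies are hypothetical; the point is only to show the mid-point m standing in for X.

```python
# Hypothetical continuous series: (lower limit, upper limit) and frequency of each class.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
frequencies = [5, 12, 15, 8]

# The mid-point m of each class stands in for every observation in that class.
midpoints = [(lower + upper) / 2 for lower, upper in classes]  # 5.0, 15.0, 25.0, 35.0

# Direct method: sum of fm divided by sum of f.
grouped_mean = sum(f * m for m, f in zip(midpoints, frequencies)) / sum(frequencies)
print(grouped_mean)  # 21.5
```

Because each class is represented by its mid-point, the result is an approximation of the mean of the underlying observations.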
Properties of Arithmetic Mean
The arithmetic mean has two important properties:
- The sum of the deviations of all items from the arithmetic mean is always zero. Symbolically, Σ(X - X̄) = 0.
- The arithmetic mean is affected by extreme values. A single very large or very small value in the dataset can significantly pull the mean up or down.
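Both properties can be checked numerically. The sketch below reuses the five marks from the earlier example; the outlier of 780 is a hypothetical value introduced only to show the effect.

```python
import math

marks = [40, 50, 55, 78, 58]
mean = sum(marks) / len(marks)

# Property 1: deviations from the mean sum to zero (up to floating-point rounding).
sum_of_deviations = sum(x - mean for x in marks)
print(math.isclose(sum_of_deviations, 0, abs_tol=1e-9))  # True

# Property 2: one extreme value pulls the mean strongly.
# Here the top mark of 78 is replaced by a hypothetical 780.
marks_with_outlier = [40, 50, 55, 780, 58]
mean_with_outlier = sum(marks_with_outlier) / len(marks_with_outlier)
print(mean_with_outlier)  # 196.6
```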
Weighted Arithmetic Mean
Sometimes, not all items in a dataset have the same importance. In such cases, we can assign weights (W) to each item according to its significance. A higher weight means the item has a greater influence on the final average.
The formula for the Weighted Arithmetic Mean is:
X̄w = ΣWX / ΣW
Example
When calculating the average price increase of goods, you might want to give more importance (a higher weight) to items that make up a larger part of a family's budget, like potatoes, compared to less frequently purchased items, like mangoes.
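The weighted mean formula can be sketched in Python as follows. The price rises and weights are invented for illustration and do not come from any real budget survey; the function name weighted_mean is also an assumption.

```python
# Hypothetical price rises (in percent) and budget weights; none of these
# figures come from the text.
price_rise = {"potatoes": 10, "rice": 5, "mangoes": 40}
weights = {"potatoes": 5, "rice": 4, "mangoes": 1}

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum of WX divided by sum of W."""
    total_weight = sum(weights[item] for item in values)
    return sum(weights[item] * values[item] for item in values) / total_weight

print(weighted_mean(price_rise, weights))  # 11.0
```

With these figures, the unweighted mean of the three price rises would be about 18.3, so the low weight on mangoes pulls the weighted average well below it.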
Median
The Median is a positional average. It is the value that divides a dataset into two equal halves when the data is arranged in ascending or descending order.
- One half of the values are greater than or equal to the median.
- The other half of the values are less than or equal to the median.
Note
The median's main advantage is that it is not affected by extreme values. If the largest value in a dataset becomes even larger, the median will not change because it only depends on the position of the middle value.
Computation of Median
Ungrouped Data
- Arrange the data in ascending or descending order.
- Find the position of the median item using the formula: Position = [(N + 1) / 2]th item, where N is the number of items.
- The value at this position is the median.
- If N is an odd number, the median will be the single middle value.
- If N is an even number, there will be two middle values. The median is the arithmetic mean of these two values.
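The odd and even cases can be sketched in Python as below. The two datasets are hypothetical, and the function name median is simply a label for this example.

```python
def median(values):
    """Median of ungrouped data using the [(N + 1) / 2]th item rule."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd N: position (N + 1) / 2 is a whole number; its index is position - 1.
        return ordered[middle]
    # Even N: the position falls between two items, so average them.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([5, 7, 6, 1, 8, 10, 12, 4, 3]))  # 6
print(median([45, 55, 61, 66, 75, 78]))       # 63.5
```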
Discrete Series
- Arrange the data and find the cumulative frequency (c.f.).
- Find the position of the median using the [(N + 1) / 2]th item formula (where N = Σf).
- Locate the first cumulative frequency that is equal to or greater than this position. The corresponding value of the variable (X) is the median.
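One possible way to express these steps in Python is shown below. The series is hypothetical, and the sketch assumes the usual convention of taking the first value whose cumulative frequency reaches or exceeds the median position.

```python
from itertools import accumulate

# Hypothetical discrete series, already arranged in ascending order of X.
x_values = [10, 20, 30, 40]
frequencies = [2, 4, 10, 4]

def discrete_median(x_values, frequencies):
    """Value whose cumulative frequency first reaches the (N + 1) / 2 position."""
    position = (sum(frequencies) + 1) / 2  # here (20 + 1) / 2 = 10.5
    cumulative = accumulate(frequencies)   # running totals: 2, 6, 16, 20
    for x, cf in zip(x_values, cumulative):
        if cf >= position:
            return x

print(discrete_median(x_values, frequencies))  # 30
```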
Continuous Series
- Find the median class by locating the position of the (N/2)th item in the cumulative frequency column. (Note: We use N/2, not (N+1)/2.)
- Once the median class is identified, use the following formula to find the exact median value:
Median = L + [(N/2 - c.f.) / f] × h
Where:
- L = lower limit of the median class.
- c.f. = cumulative frequency of the class preceding the median class.
- f = frequency of the median class.
- h = class interval (magnitude) of the median class.
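The median formula for a continuous series can be sketched in Python as follows. The class limits and frequencies are hypothetical, and the sketch assumes exclusive classes of equal width listed in ascending order.

```python
# Hypothetical continuous series with exclusive classes of equal width.
lower_limits = [0, 10, 20, 30]
frequencies = [5, 12, 15, 8]
width = 10  # h

def continuous_median(lower_limits, frequencies, width):
    """Median = L + ((N/2 - c.f.) / f) * h, applied to the median class."""
    half_n = sum(frequencies) / 2
    cf_before = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, frequencies):
        if cf_before + f >= half_n:
            # This is the median class: its cumulative frequency reaches N/2.
            return lower + (half_n - cf_before) / f * width
        cf_before += f

print(continuous_median(lower_limits, frequencies, width))  # 22.0
```

Here N/2 = 20 falls in the 20-30 class (cumulative frequency 32), so L = 20, c.f. = 17, f = 15, and the median is 20 + (3 / 15) × 10 = 22.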
Quartiles and Percentiles
Just as the median divides data into two equal parts, other measures can divide it into more parts.
Quartiles
Quartiles are measures that divide a dataset into four equal parts. There are three quartiles:
- First Quartile (Q₁), or the lower quartile: 25% of the items are below it, and 75% are above it.
- Second Quartile (Q₂): This is the median. 50% of items are below it, and 50% are above it.
- Third Quartile (Q₃), or the upper quartile: 75% of the items are below it, and 25% are above it.
The range between Q₁ and Q₃ contains the central 50% of the data.
Percentiles
Percentiles divide a distribution into one hundred equal parts. There are 99 percentiles (P₁, P₂, ..., P₉₉). For example, P₅₀ is the median.
Example
If you score in the 82nd percentile on an exam, it means that you scored as well as or better than 82% of the test-takers, and about 18% scored higher than you.
Calculation of Quartiles
For individual and discrete series, the formulas to find the position of the quartiles are:
- Q₁ = size of the [(N + 1) / 4]th item
- Q₃ = size of the [3(N + 1) / 4]th item
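A hedged Python sketch of the quartile positions for ungrouped data is given below. The data are hypothetical. When a position is fractional, the sketch interpolates linearly between the two neighbouring items; that rule is an assumption of this example and is not stated in the text above.

```python
def value_at_position(ordered, position):
    """Item at a 1-based position in sorted data. Fractional positions are
    resolved by linear interpolation between neighbours (an assumption)."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])

def quartiles(values):
    """Q1 and Q3 from the (N + 1) / 4 and 3(N + 1) / 4 positions; needs N >= 3."""
    ordered = sorted(values)
    n = len(ordered)
    return (value_at_position(ordered, (n + 1) / 4),
            value_at_position(ordered, 3 * (n + 1) / 4))

# Hypothetical data with N = 11, so the positions are the 3rd and 9th items.
data = [11, 12, 14, 18, 22, 26, 30, 32, 35, 41, 45]
print(quartiles(data))  # (14, 35)
```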
Mode
The Mode (Mₒ) is the value that occurs most frequently in a dataset. It is the value with the highest concentration of items around it.
Example
A shoe manufacturer would be very interested in the mode of shoe sizes sold. This tells them which size is the most popular and should be produced in the largest quantity.
A dataset can have:
- One mode (unimodal).
- Two modes (bimodal).
- More than two modes (multimodal).
- No mode, if all values occur with the same frequency.
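These cases can be illustrated with a short Python sketch for ungrouped data. The function name modes and the sample lists are hypothetical; the function returns a list so that it can report one mode, several modes, or none.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency, or an empty
    list when all distinct values occur equally often (no mode)."""
    counts = Counter(values)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([7, 8, 8, 9, 8, 7, 10]))  # [8]     unimodal
print(modes([1, 1, 2, 3, 3]))         # [1, 3]  bimodal
print(modes([4, 5, 6]))               # []      no mode
```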
Computation of Mode
Discrete Series
In a discrete series, the mode is simply the value (X) that has the highest frequency (f).
Continuous Series
- First, identify the modal class, which is the class interval with the highest frequency.
- Use the following formula to calculate the mode:
Mₒ = L + [D₁ / (D₁ + D₂)] × h
Where:
- L = lower limit of the modal class.
- D₁ = frequency of the modal class minus the frequency of the preceding class.
- D₂ = frequency of the modal class minus the frequency of the succeeding class.
- h = class interval of the modal class.
Note
To calculate the mode for a continuous series, the class intervals should be equal and the series should be exclusive (e.g., 10-20, 20-30, not 10-19, 20-29).
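The mode formula for a continuous series might be sketched in Python like this. The series is hypothetical, with equal exclusive classes. Treating a missing neighbouring class as having frequency 0 when the modal class is first or last is an assumption of this sketch.

```python
# Hypothetical continuous series with equal, exclusive classes (0-10, 10-20, ...).
lower_limits = [0, 10, 20, 30, 40]
frequencies = [4, 12, 18, 14, 2]
width = 10  # h

def continuous_mode(lower_limits, frequencies, width):
    """Mode = L + (D1 / (D1 + D2)) * h, applied to the modal class."""
    i = frequencies.index(max(frequencies))  # first class with the highest frequency
    # Assumption: a missing neighbour (modal class at either end) counts as frequency 0.
    before = frequencies[i - 1] if i > 0 else 0
    after = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    d1 = frequencies[i] - before
    d2 = frequencies[i] - after
    return lower_limits[i] + d1 / (d1 + d2) * width

mode_value = continuous_mode(lower_limits, frequencies, width)
print(round(mode_value, 2))  # 26.0
```

Here the modal class is 20-30, with D₁ = 18 - 12 = 6 and D₂ = 18 - 14 = 4, so the mode is 20 + (6 / 10) × 10 = 26.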
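Relative Position of Mean, Median, and Mode
In a unimodal distribution that is moderately skewed, the mean, median, and mode usually fall in a fixed order, with the median lying between the arithmetic mean and the mode. In a symmetrical unimodal distribution, the three coincide.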
- For a positively skewed distribution: Mean > Median > Mode
- For a negatively skewed distribution: Mean < Median < Mode
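As a quick check of the positively skewed case, the sketch below uses Python's standard statistics module on a small hypothetical set of incomes with one large value. It illustrates the typical ordering for one dataset and is not a proof that the ordering holds for every skewed distribution.

```python
import statistics

# Hypothetical incomes with one large value, giving a long right tail.
incomes = [20, 22, 22, 25, 30, 35, 90]

mean = statistics.mean(incomes)      # 244 / 7, roughly 34.86
median = statistics.median(incomes)  # 25
mode = statistics.mode(incomes)      # 22

print(mean > median > mode)  # True
```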
Conclusion: Choosing the Right Average
Measures of central tendency summarize data with a single, representative value. The choice of which average to use depends on the nature of the data and the goal of the analysis.
- Arithmetic Mean: Best for data that is symmetrically distributed and does not have extreme values. It is simple to calculate and uses every value in the dataset.
- Median: A better choice when the data has extreme values (outliers) or is skewed. It is also useful for open-ended distributions (e.g., "income above $100,000").
- Mode: Most useful for describing qualitative data (e.g., most popular color, most common size) or for identifying the most typical value in a set.