Introduction
After data is collected, it needs to be organized. Just as a junk dealer, or kabadiwallah, sorts items to manage the trade, we classify raw data to bring order to it. Classification is the process of arranging things into groups or classes based on shared criteria. It saves time and effort, makes the data easier to understand, and prepares it for further statistical analysis.
Example
Think about how you arrange your schoolbooks. You might group them by subject: History, Geography, Mathematics, etc. When you need your history book, you only have to look in the "History" group, not through your entire collection. This is a simple form of classification. The kabadiwallah does the same by sorting junk into groups like "newspapers," "plastics," "glass," and "metals."
Raw Data
Data that has just been collected and is not yet organized or classified is called Raw Data. It's often highly disorganized, large, and difficult to handle. Drawing meaningful conclusions from raw data is a tedious task because it's just a collection of numbers without any structure.
For example, a table listing the mathematics marks of 100 students or the monthly food expenditure of 50 households is raw data. If you were asked to find the highest mark or the average expenditure from these tables, you would have to search through every single number, which is very time-consuming. The larger the dataset, the more difficult this becomes.
Note
The main purpose of classifying raw data is to summarize it and make it comprehensible. By grouping facts with similar characteristics, we can easily locate them, make comparisons, and draw inferences.
Classification of Data
Raw data can be classified in various ways, depending on the purpose of the study. The main types of classification are:
Chronological Classification
When data is grouped according to time, it is known as Chronological Classification. The data is arranged either in ascending or descending order with reference to time periods like years, months, or weeks. Data presented this way is called a Time Series.
Example
A table showing the population of India every ten years from 1951 to 2011 is a chronological classification. The variable being measured is population, and the basis of classification is time (years).
Spatial Classification
In Spatial Classification, data is classified based on geographical locations such as countries, states, cities, or districts.
Example
A table comparing the yield of wheat in different countries like Canada, China, India, and Pakistan is an example of spatial classification. Here, the data is organized by location.
Qualitative Classification
This type of classification is used for characteristics that cannot be expressed numerically. These characteristics are called Qualities or Attributes. Examples include nationality, literacy, gender, and religion. The classification is based on the presence or absence of an attribute.
Example
A population can be classified by gender (Male/Female). Each of these groups can be further classified by marital status (Married/Unmarried). This is a qualitative classification because the attributes (gender, marital status) are not numerical.
Quantitative Classification
When data for characteristics that can be measured numerically, such as height, weight, age, income, or marks, are grouped into classes, it is called Quantitative Classification.
Example
A table showing the marks of 100 students grouped into classes like 0-10, 10-20, 20-30, and so on, is a quantitative classification.
Variables: Continuous and Discrete
A variable is a measurable characteristic whose value changes from one observation to another. Variables are broadly classified into two types:
Continuous Variable
A continuous variable can take any numerical value within a given range, including integers, fractions, and irrational numbers.
- Examples: Height, weight, time, and distance are continuous variables. A student's height can be 150 cm, 150.5 cm, or even 150.512 cm. There are no "jumps" between values; the variable can take every possible value within its range.
Discrete Variable
A discrete variable can only take certain specific values, and its value changes in finite "jumps." It cannot take intermediate values between two specific points.
- Examples: The "number of students in a class" is a discrete variable. A class can have 25 or 26 students, but not 25.5 students. It "jumps" from one whole number to the next.
- A discrete variable can sometimes take fractional values, but there are still gaps between them. For instance, a variable that takes values like 1/8, 1/16, 1/32 cannot take any value between 1/8 and 1/16.
What is a Frequency Distribution?
A frequency distribution is a comprehensive way to classify the raw data of a quantitative variable. It is a table that shows how different values of a variable are distributed into different groups or classes, along with the number of observations that fall into each class.
- Class Frequency: The number of values or observations in a particular class. For example, if seven students scored marks between 30 and 40, the frequency of the class 30-40 is 7.
- Class Limits: These are the two ends of a class. The lowest value is the Lower Class Limit, and the highest value is the Upper Class Limit. For the class 60-70, 60 is the lower limit and 70 is the upper limit.
- Class Interval (or Class Width): The difference between the upper class limit and the lower class limit. For the class 60-70, the class interval is 10 (70 - 60).
- Class Mid-Point (or Class Mark): The middle value of a class, calculated as:
(Upper Class Limit + Lower Class Limit) / 2
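As a small illustration of these definitions, the Python sketch below computes the class interval and the class mark for the class 60-70 used above. The function names are hypothetical and chosen only for this example.

```python
def class_width(lower, upper):
    """Class interval: upper class limit minus lower class limit."""
    return upper - lower


def class_mark(lower, upper):
    """Class mid-point: (upper class limit + lower class limit) / 2."""
    return (upper + lower) / 2


width = class_width(60, 70)  # class interval of 60-70
mark = class_mark(60, 70)    # class mark of 60-70
```

For the class 60-70 this gives a class interval of 10 and a class mark of 65.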
Note
Once data is grouped into a frequency distribution, all further statistical calculations use the class mark to represent all the observations within that class. The individual values are no longer used.
A Frequency Curve is a graph that represents a frequency distribution. It is created by plotting the class marks on the X-axis and the frequencies on the Y-axis.
How to prepare a Frequency Distribution?
When creating a frequency distribution, several key decisions need to be made.
Should we have equal or unequal sized class intervals?
- Equal sized intervals are used in most cases for simplicity and ease of comparison.
- Unequal sized intervals are more appropriate in two situations:
  - When the range of the data is very wide (e.g., income, which can range from very low to extremely high). Using equal intervals would either create too many classes or suppress important information at the extremes.
  - When a large number of values are concentrated in a small part of the range.
How many classes should we have?
The number of classes is usually between six and fifteen. If you are using equal-sized class intervals, the number of classes can be found by dividing the range (the difference between the largest and smallest values) by the size of the class interval.
What should be the size of each class?
The size of the class (class interval) and the number of classes are interlinked. Once you decide on one, the other is often determined by the range of the data. For example, if the marks range from 0 to 100 (a range of 100) and you decide on ten classes, the class interval will be 10.
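The link between range, class size, and number of classes can be sketched in Python as follows. The helper names are hypothetical, and rounding up when the range is not an exact multiple of the class size is an assumption of this sketch, not a rule stated above.

```python
import math


def number_of_classes(smallest, largest, class_interval):
    """Range divided by class size (rounded up if it does not divide evenly)."""
    data_range = largest - smallest
    return math.ceil(data_range / class_interval)


def class_interval_for(smallest, largest, classes):
    """Class size when the range is split into a chosen number of equal classes."""
    return (largest - smallest) / classes


n_classes = number_of_classes(0, 100, 10)  # marks from 0 to 100, class size 10
interval = class_interval_for(0, 100, 10)  # marks from 0 to 100, ten classes
```

With marks ranging from 0 to 100, a class size of 10 gives ten classes, and choosing ten classes gives a class size of 10.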
How should we determine the class limits?
Class limits must be definite and clearly stated. There are two main methods for defining class intervals:
- Inclusive Method: In this method, values equal to both the lower and upper limits are included in the frequency of that class. There are gaps between the upper limit of one class and the lower limit of the next.
  - Example (for discrete data): 0-10, 11-20, 21-30, etc. A value of 10 falls in the 0-10 class, and a value of 20 falls in the 11-20 class.
- Exclusive Method: In this method, a value equal to the upper limit of a class is excluded from that class and included in the next one (where it becomes the lower limit). This method ensures continuity in the data.
  - Example (for continuous data): 0-10, 10-20, 20-30, etc. A value of 10 would be included in the 10-20 class, not the 0-10 class. This is the "upper limit excluded" convention.
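The two conventions described above differ only in how the upper limit is compared. A minimal Python sketch, with hypothetical function names and classes stored as (lower, upper) pairs, might look like this:

```python
def find_class_inclusive(value, classes):
    """Inclusive method: both the lower and the upper limit belong to the class."""
    for lower, upper in classes:
        if lower <= value <= upper:
            return (lower, upper)
    return None  # the value falls in a gap between classes


def find_class_exclusive(value, classes):
    """Exclusive method: the upper limit is excluded and goes to the next class."""
    for lower, upper in classes:
        if lower <= value < upper:
            return (lower, upper)
    return None


inclusive_classes = [(0, 10), (11, 20), (21, 30)]
exclusive_classes = [(0, 10), (10, 20), (20, 30)]

in_inclusive = find_class_inclusive(10, inclusive_classes)   # (0, 10)
in_exclusive = find_class_exclusive(10, exclusive_classes)   # (10, 20)
in_gap = find_class_inclusive(10.5, inclusive_classes)       # None
```

The last call shows why the inclusive method suits discrete data: a value such as 10.5 belongs to no class, because it lies in the gap between 10 and 11.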
Adjustment in Class Interval
When the inclusive method is used for a continuous variable, there is a discontinuity or "gap" between the upper limit of one class and the lower limit of the next (e.g., between 899 and 900 for the classes 800-899 and 900-999). To restore continuity, an adjustment is made:
- Find the difference between the lower limit of the second class and the upper limit of the first class (e.g., 900 - 899 = 1).
- Divide this difference by two (e.g., 1 / 2 = 0.5).
- Subtract this value from all lower limits and add it to all upper limits.
- For example, the class 800-899 becomes 799.5-899.5, and 900-999 becomes 899.5-999.5.
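The adjustment steps above can be sketched in Python. The function name is hypothetical, and the sketch assumes at least two classes with the same gap between every pair of neighbouring classes.

```python
def adjust_class_limits(classes):
    """Close the gaps between inclusive classes given as (lower, upper) pairs."""
    first_upper = classes[0][1]
    second_lower = classes[1][0]
    # Half the gap between the first upper limit and the second lower limit.
    correction = (second_lower - first_upper) / 2
    # Subtract it from every lower limit and add it to every upper limit.
    return [(lower - correction, upper + correction) for lower, upper in classes]


adjusted = adjust_class_limits([(800, 899), (900, 999)])
# adjusted == [(799.5, 899.5), (899.5, 999.5)]
```

After the adjustment, the upper limit of each class equals the lower limit of the next, so the classes are continuous.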
Finding class frequency by tally marking
The frequency for each class is found by going through the raw data and making a tally mark (/) for each observation that falls into that class. To make counting easier, tallies are bundled in fives: four marks are drawn side by side (////) and the fifth is drawn diagonally across them. The total number of tallies for a class gives its frequency.
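Tally marking amounts to counting, for each class, the observations that fall into it. The Python sketch below does this under the exclusive ("upper limit excluded") convention. The marks are made-up illustration data and the function name is hypothetical.

```python
def frequency_distribution(values, classes):
    """Count how many values fall in each (lower, upper) class, upper limit excluded."""
    frequencies = {cls: 0 for cls in classes}
    for value in values:
        for lower, upper in classes:
            if lower <= value < upper:
                frequencies[(lower, upper)] += 1  # one tally mark for this class
                break
    return frequencies


marks = [5, 12, 18, 20, 22, 25, 25, 25, 28, 31]  # made-up raw data
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
table = frequency_distribution(marks, classes)
# table == {(0, 10): 1, (10, 20): 2, (20, 30): 6, (30, 40): 1}
```

The frequencies add up to 10, the number of observations in the raw data.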
A major shortcoming of classifying data into a frequency distribution is the loss of information. While the summary becomes concise and comprehensible, the details of the original raw data are lost.
Note
Once data is grouped, an individual observation loses its identity. For example, if the class 20-30 has a frequency of 6, we only know that six values fall in this range. We no longer know their exact values (e.g., 20, 22, 25, 25, 25, 28). All statistical calculations from this point on will assume these six values are equal to the class mark (25), which is an approximation.
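To see the size of this approximation, the Python sketch below compares the mean of the six example values with the mean obtained when each value is replaced by the class mark. Using the mean here is only an illustration of "statistical calculations"; the variable names are hypothetical.

```python
raw_values = [20, 22, 25, 25, 25, 28]  # the six values from the example above
lower, upper = 20, 30                  # the class 20-30

class_mark = (lower + upper) / 2       # 25.0
frequency = len(raw_values)            # 6

# Mean computed from the original observations.
exact_mean = sum(raw_values) / frequency

# Mean computed after grouping: every observation is treated as the class mark.
grouped_mean = (class_mark * frequency) / frequency
```

The exact mean is 145 / 6, a little above 24.16, while the grouped mean is 25. The difference is the information lost by grouping.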
Frequency distribution with unequal classes
Sometimes, a frequency distribution with unequal class intervals is more appropriate. This is useful when observations are heavily concentrated in certain parts of the data range. By using smaller class intervals in these concentrated areas, the class marks become more representative of the actual data, reducing the loss of information.
Example
If most students scored between 40 and 70, we could split the classes 40-50, 50-60, and 60-70 into smaller classes like 40-45, 45-50, 50-55, etc. This would provide a more detailed picture of the performance in that specific range.
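A possible way to build such classes in Python is sketched below. The class size of 10 outside the 40-70 range, the overall range of 0 to 100, and the helper name are assumptions made for this example.

```python
def build_classes(start, stop, width):
    """Equal-width (lower, upper) classes covering start up to stop."""
    return [(lower, lower + width) for lower in range(start, stop, width)]


# Width 10 where marks are sparse, width 5 where they are concentrated (40-70).
classes = (
    build_classes(0, 40, 10)
    + build_classes(40, 70, 5)
    + build_classes(70, 100, 10)
)
```

This gives 13 classes instead of 10, with the extra detail placed only in the range where most students scored.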
Frequency array
For a discrete variable, the classification of its data is known as a Frequency Array. Since a discrete variable only takes specific values (usually integers), a frequency array is a table that lists each possible value of the variable and its corresponding frequency. It doesn't use class intervals.
Example
A table showing the size of households (1, 2, 3, 4, etc.) and the number of households for each size is a frequency array.
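A frequency array can be sketched in Python with `collections.Counter` from the standard library. The household sizes below are made-up illustration data.

```python
from collections import Counter

household_sizes = [1, 2, 2, 3, 3, 3, 4, 4, 5, 2]  # made-up data for 10 households

# Each distinct value of the variable paired with its frequency, in ascending order.
frequency_array = sorted(Counter(household_sizes).items())
# frequency_array == [(1, 1), (2, 3), (3, 3), (4, 2), (5, 1)]
```

Each pair reads as (size of household, number of households). No class intervals are needed because the variable takes only whole-number values.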
Bivariate Frequency Distribution
So far, we have discussed univariate distributions, which deal with a single variable. A Bivariate Frequency Distribution is a table that shows the frequency distribution of two variables simultaneously.
It is presented as a two-way table where the classes for one variable are in the rows and the classes for the other variable are in the columns. Each cell in the table shows the joint frequency for the corresponding row and column values.
Example
A table could show the sales and advertising expenditure of 20 companies. The rows might represent classes of advertising expenditure (e.g., Rs 62-64 thousand), and the columns might represent classes of sales (e.g., Rs 135-145 lakh). A number in a cell would show how many companies fall into both categories.
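A two-way table of this kind can be sketched in Python as a count of (row class, column class) pairs. The four companies, the second class in each list, and the use of the exclusive convention are assumptions made for this illustration; only the classes 62-64 and 135-145 come from the example above.

```python
from collections import Counter


def find_class(value, classes):
    """Return the (lower, upper) class containing value, upper limit excluded."""
    for lower, upper in classes:
        if lower <= value < upper:
            return (lower, upper)
    return None


ad_classes = [(62, 64), (64, 66)]         # advertising expenditure, Rs thousand
sales_classes = [(135, 145), (145, 155)]  # sales, Rs lakh

# Made-up (advertising, sales) pairs for four companies.
companies = [(62.5, 138), (63, 150), (65, 148), (63.5, 140)]

# Joint frequency: number of companies in each (advertising class, sales class) cell.
joint = Counter(
    (find_class(ad, ad_classes), find_class(sales, sales_classes))
    for ad, sales in companies
)
```

Here `joint[((62, 64), (135, 145))]` is 2, meaning two companies spent Rs 62-64 thousand on advertising and had sales of Rs 135-145 lakh. A cell that no company falls into has a count of 0.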