Dating Your Data: What is your type, dear data?
January 18, 2021, Data
We are continuously exposed to information of one sort or another. You know the sources! It is a kind of bombardment. I jokingly say that people are continuously and forcefully put on a date with information, and are overwhelmed by feelings of excitement, joy, sorrow, exhilaration, and anticipation. Although this is almost impossible to avoid in the digital age, you can ignore most of the information without much worry and keep only the part that is useful. Fortunately, in this type of dating one need not ask the information, “What is your type?” Or do we?
And then comes the life of data analysts and data scientists! They, too, are continuously bombarded with information, and they call it data (small, medium, or big!). The more interesting part is that they have to go on a serious date with the data they encounter. First of all, every data analyst needs to ask their data the question: what is your type, dear data? And there is a set of possible answers. Now you see what I mean!
Let’s explore the main types of data that data analysts must know.
The fundamental building block of any data analysis project is the data. If analytics is the engine, data is the fuel! Everything is built around the data, and the choice of analysis methods depends on the type of data at hand. A solid understanding of what data are, and what types they come in, is therefore very important.
So, let’s discuss the types of data in this post. In statistical terms, there are four scales on which data are measured. These are called measurement scales or data types: Nominal, Ordinal, Interval, and Ratio.
Nominal data have no quantitative value; they can simply be called “labels.” For most analysis techniques, nominal data can and should be re-coded as numbers, but these codes have no numerical meaning. A few examples of nominal data are race, gender, and hair color. Nominal values cannot be ordered. One can calculate only counts, proportions, and the mode of nominal data.
A sub-type of nominal data with only two categories is called Binary or Dichotomous data. Examples: Success/Failure, Yes/No, Male/Female, etc.
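As a small illustration of the summaries that make sense for nominal data, here is a minimal Python sketch. The hair-color sample is made up for this example, and the integer codes are arbitrary: they identify labels and carry no order or magnitude.

```python
from collections import Counter

# Hypothetical sample of a nominal variable (hair-color labels).
hair_color = ["black", "brown", "black", "blonde", "brown", "black"]

# Counts, proportions, and the mode are the meaningful summaries.
counts = Counter(hair_color)
proportions = {label: n / len(hair_color) for label, n in counts.items()}
mode = counts.most_common(1)[0][0]

# Re-code the labels as integers. The codes are identifiers only:
# "brown" getting a larger code than "black" means nothing.
codes = {label: i for i, label in enumerate(sorted(counts))}
encoded = [codes[label] for label in hair_color]

print(counts)
print(proportions)
print(mode)
print(encoded)
```

Averaging the values in `encoded` would run without error, but the result would depend entirely on how the codes happened to be assigned, which is why the mean is not used for nominal data.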
In the case of ordinal data, the order of the values is important and significant. However, the difference between the values is not really known. For example, one can order “Very Satisfied,” “Satisfied,” “OK,” “Dissatisfied,” and “Very Dissatisfied,” but one doesn’t really know if the difference between “Very Satisfied” and “Satisfied” is the same as the difference between “OK” and “Dissatisfied.”
A few other examples of ordinal data are “Level of Agreement (strongly disagree, disagree, neutral, agree, strongly agree)”, “Socioeconomic status (poor, middle class, rich)”, and “Education (under-grad, graduate, post-grad)”. One can calculate the Mode or Median of ordinal data, but the Mean (also called the Average) is not meaningfully defined, because the distances between the categories are unknown.
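The median of ordinal data can be found by using only the order of the categories. The sketch below assumes a made-up set of survey responses on the satisfaction scale mentioned above; the ranks are used for sorting, not as measurements.

```python
from statistics import median_low, mode

# Satisfaction levels listed from lowest to highest.
LEVELS = ["Very Dissatisfied", "Dissatisfied", "OK", "Satisfied", "Very Satisfied"]
RANK = {level: i for i, level in enumerate(LEVELS)}

# Hypothetical survey responses.
responses = ["Satisfied", "OK", "Very Satisfied", "Dissatisfied",
             "Satisfied", "OK", "Satisfied"]

# Convert to ranks, take the middle rank, and map it back to a label.
# median_low always returns an observed value, so the result is a real category.
ranks = [RANK[r] for r in responses]
median_response = LEVELS[median_low(ranks)]
mode_response = mode(responses)

print(median_response)
print(mode_response)
```

`median_low` is used rather than `median` so that, with an even number of responses, the result is still one of the observed categories instead of a value halfway between two ranks.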
Broadly speaking, Nominal and Ordinal data are observed/described and thus called Categorical Data.
Interval data are numeric data, which can be ordered and where the exact difference between the values is known. An example of interval data used everywhere is Celsius temperature. For example, the difference between 30 and 40 degrees is a measurable 10 degrees, as is the difference between 60 and 70 degrees.
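One can calculate the Mode, Median, Mean, and Standard Deviation of interval data. The problem with interval data is that zero is not a true zero. Zero degrees Celsius does not mean “no temperature at all”! As a result, ratios of interval values are not meaningful: 40 degrees Celsius is not “twice as hot” as 20 degrees Celsius.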
Ratio data are among the most widely used data types. One knows the order of the values and the exact difference between them, and the scale also has a true zero (zero means zero!). Some examples of ratio data are income, age, and experience. These can be called numeric data. Ratio data can be meaningfully added, subtracted, multiplied, or divided (ratios), and a wide range of descriptive and inferential statistical methods can be applied to them.
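To see why a true zero matters, compare the same two temperatures on the Celsius scale (interval) and on the Kelvin scale, which is not discussed in this post but is a standard example of a ratio scale because 0 K is an absolute zero. This is a minimal sketch; the two temperatures are arbitrary.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert a Celsius temperature to Kelvin."""
    return c + 273.15

a_c, b_c = 20.0, 40.0  # two arbitrary temperatures in degrees Celsius

# Differences are the same on both scales, so they are meaningful.
diff_c = b_c - a_c
diff_k = celsius_to_kelvin(b_c) - celsius_to_kelvin(a_c)

# Ratios disagree: the Celsius ratio is an artifact of where zero was placed.
ratio_c = b_c / a_c
ratio_k = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(diff_c, round(diff_k, 2))
print(ratio_c, round(ratio_k, 3))
```

The difference is 20 degrees on both scales, but the ratio is 2 in Celsius and only about 1.07 in Kelvin. For ratio data such as income or distance, the ratio does not change with the choice of unit.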
In practice, different domains use different terminology for data types. Software packages also have their own sets of data types, so users need to be careful when moving from one package to another.
Broadly, Interval and Ratio data are measured and thus called Numerical Data.
In summary, the different types of data have the following characteristics:
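|Data Type||Broad Type||Characteristics|
|Nominal||Categorical||Named or Labeled|
|Ordinal||Categorical||Named and ordered; differences between values not known|
|Interval||Numerical||Ordered, with known differences; no true zero|
|Ratio||Numerical||Ordered, with known differences; true zero|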
Illustration: We are given the following data columns, and the task is to identify the type of data in each of them. The columns contain the names of 10 cities (from India), their distance (in km) from a nearby hill station (Ooty), their average annual temperature in °C, and their total yearly rainfall in mm.
You can use the following tree diagram to identify the type of data. It is amazingly easy and fun! Enjoy!
|Column 1 (City)||Column 2 (Distance in km)||Column 3 (Temperature in °C)||Column 4 (Rainfall in mm)|
If you use the above tree diagram to identify the data type of each column in the city data, you will get the following. What about column 3?
|Column 1 (City)||Nominal (categorical)|
|Column 2 (Distance in km)||Ratio (numerical)|
|Column 3 (Temperature in °C)||Interval (numerical, zero not meaningful)|
|Column 4 (Rainfall in mm)||Ratio (numerical)|
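The same reasoning can be written as a short sequence of yes/no questions. The sketch below is not the tree diagram mentioned earlier; it is one possible way to encode the definitions given in this post. The function name `measurement_scale` and its three question parameters are hypothetical, and the answers for each column are supplied by hand.

```python
def measurement_scale(is_numeric: bool, is_ordered: bool, has_true_zero: bool) -> str:
    """Classify a variable from three yes/no answers about it."""
    if not is_numeric:
        # Categorical: the only remaining question is whether order matters.
        return "Ordinal (categorical)" if is_ordered else "Nominal (categorical)"
    # Numerical: differences are known; a true zero separates ratio from interval.
    return "Ratio (numerical)" if has_true_zero else "Interval (numerical)"

# Answers for the four columns of the city data, filled in by hand.
columns = {
    "City": dict(is_numeric=False, is_ordered=False, has_true_zero=False),
    "Distance in km": dict(is_numeric=True, is_ordered=True, has_true_zero=True),
    "Temperature in °C": dict(is_numeric=True, is_ordered=True, has_true_zero=False),
    "Rainfall in mm": dict(is_numeric=True, is_ordered=True, has_true_zero=True),
}

result = {name: measurement_scale(**answers) for name, answers in columns.items()}
for name, scale in result.items():
    print(f"{name}: {scale}")
```

The questions themselves still have to be answered by someone who knows what the column means; the stored values alone do not tell you whether a zero is a true zero.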