The Fundamentals of Distribution Plots
As a data science fresher, it can be tricky to identify the best plot type for a given dataset and scenario. To help, we've put together a few handy articles explaining what each type of data visualization can be used for.
Distribution plots are useful tools for visualizing data, and they give insight into how your data is distributed. Common types of distribution plots are histograms, density plots, box plots, and violin plots. Which one to choose depends on your data and the outcome you need.
For a single variable, a histogram is effective. For multiple variables, you can use either a box plot or a violin plot. The violin plot visualizes the densities of your variables, whereas the box plot just visualizes the median, the interquartile range, and the range for each variable.
First, we will take a deeper look at the histogram, and then go through the features, uses, and best practices of the density plot, the box plot, and the violin plot.
What is a Histogram and what is it used for?
A histogram visualizes the distribution of a single numerical variable. Each bar represents the frequency of values within a certain interval. Histograms help you estimate statistical measures at a glance: you can see where values are concentrated, and you can easily detect outliers.
You can either plot a histogram with absolute frequency values or, alternatively, normalize your histogram. If you want to compare distributions of multiple variables, you can use different colors for the bars.
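To make the difference between absolute and normalized frequencies concrete, here is a minimal sketch in plain Python. The helper names `histogram` and `normalize`, the bin edges, and the IQ-like sample are assumptions made up for illustration; a plotting library would normally do this counting for you.

```python
import random

def histogram(values, bins, low, high):
    """Count how many values fall into each of `bins` equal-width intervals."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # A value exactly on the top edge is counted in the last bin.
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts

def normalize(counts):
    """Convert absolute counts into relative frequencies that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

random.seed(0)  # fixed seed so the example is repeatable
scores = [random.gauss(100, 15) for _ in range(100)]  # hypothetical IQ-like data

counts = histogram(scores, bins=8, low=40, high=160)  # absolute frequencies
relative = normalize(counts)                          # normalized histogram

for i, (c, r) in enumerate(zip(counts, relative)):
    left = 40 + i * 15  # each bin is (160 - 40) / 8 = 15 units wide
    print(f"{left}-{left + 15}: {'#' * c} (count={c}, share={r:.2f})")
```

Both versions have the same shape; only the scale of the vertical axis changes, which is what makes normalized histograms easier to compare across groups of different sizes.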
Histograms are primarily used to gain insight into the underlying distribution of a dataset. The following diagram shows the distribution of the Intelligence Quotient (IQ) for a test group.
The dashed lines represent the standard deviation on each side of the mean (the solid line):
The above figure shows the distribution of IQ for a test group of a hundred adults.
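The positions of the solid and dashed lines in such a figure can be computed directly from the data. The sketch below uses a synthetic sample, since the data behind the figure is not available; the variable names are illustrative assumptions.

```python
import random
import statistics

random.seed(1)  # fixed seed so the example is repeatable
iq_scores = [random.gauss(100, 15) for _ in range(100)]  # hypothetical sample

mean = statistics.mean(iq_scores)
std = statistics.stdev(iq_scores)  # sample standard deviation

solid_line = mean                       # where the solid line would be drawn
dashed_lines = (mean - std, mean + std) # one standard deviation on each side

# Count how many values lie between the two dashed lines.
within_one_std = sum(dashed_lines[0] <= x <= dashed_lines[1] for x in iq_scores)

print(f"mean (solid line): {solid_line:.1f}")
print(f"mean +/- 1 std (dashed lines): {dashed_lines[0]:.1f}, {dashed_lines[1]:.1f}")
print(f"values between the dashed lines: {within_one_std} of {len(iq_scores)}")
```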
One of the best design practices is to try different numbers of bins (data intervals), since the overall shape of the histogram can vary significantly with the bin count.
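A small experiment shows why the bin count matters. This is a sketch under stated assumptions: `bin_counts` is a hypothetical helper, and the data is a made-up sample with two clusters, around 2 and around 8.

```python
def bin_counts(values, bins):
    """Split the data range into `bins` equal-width intervals and count values.

    Assumes the values are not all identical (the range must be non-zero).
    """
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is counted in the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical data with two clusters.
data = [1, 2, 2, 2, 3, 3, 7, 7, 8, 8, 8, 9]

for bins in (1, 2, 4, 8):
    print(bins, bin_counts(data, bins))
# 1 [12]
# 2 [6, 6]
# 4 [4, 2, 0, 6]
# 8 [1, 3, 2, 0, 0, 0, 2, 4]
```

With one or two bins the data looks flat; only from four bins upward does the gap between the two clusters become visible.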
What is a Density Plot and what is it used for?
A density plot shows the distribution of a numerical variable. It is a variation of a histogram that uses kernel smoothing, which produces a smoother curve. One advantage over histograms is that density plots reveal the shape of the distribution more reliably, since the shape of a histogram depends heavily on the number of bins (data intervals).
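Kernel smoothing can be sketched in a few lines. The version below assumes a Gaussian kernel with a hand-picked bandwidth, which is one common choice and not necessarily what a given plotting library uses by default; the helper name `gaussian_kde` and the height values are made up for illustration.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density as an average of Gaussian bumps."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Each sample contributes one bell curve centered on its own value.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

heights = [158, 162, 165, 167, 170, 171, 173, 176, 180, 185]  # hypothetical, in cm
density = gaussian_kde(heights, bandwidth=5.0)  # bandwidth chosen by hand

# Evaluate the smooth curve on a grid and check that the area is close to 1.
step = 0.5
grid = [130 + i * step for i in range(int((210 - 130) / step) + 1)]
area = sum(density(x) * step for x in grid)

print(f"density at 170 cm: {density(170):.4f}")
print(f"approximate area under the curve: {area:.3f}")
```

The bandwidth plays a role similar to the bin width of a histogram: a small value gives a bumpy curve, a large value gives a very smooth one.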
The density plot is primarily used for comparing the distributions of several variables by plotting their densities on the same axes in different colors. The following diagrams show a density plot for a single variable (height) and a density plot for multiple variables.
The above figure shows a density plot of height (density plotted against height).
The above figure shows a basic multi-density plot.
What is a Box Plot and what is it used for?
A box plot shows multiple statistical measurements. The box extends from the lower to the upper quartile values of the data, which lets us visualize the interquartile range (IQR). The horizontal line within the box denotes the median.
The lines extending from the box are called whiskers; they indicate the variability outside the lower and upper quartiles. There is also an option to show outliers, usually as circles or diamonds, past the ends of the whiskers.
Box plots are primarily used to compare statistical measures for multiple variables or groups. The following diagram shows a basic box plot of the heights of a group of people:
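The numbers behind a box plot can be computed by hand. The sketch below assumes the common convention that whiskers reach to the furthest data point within 1.5 times the IQR of the box and that anything beyond is drawn as an outlier; the article does not state which rule its figures use, and plotting libraries let you change it. The helper `box_plot_stats` and the height values are hypothetical.

```python
import statistics

def box_plot_stats(values):
    """Compute the quartiles, whisker ends, and outliers of a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: fences at 1.5 * IQR beyond the box.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),    # furthest point still inside the fence
        "whisker_high": max(inside),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

heights = [150, 160, 162, 165, 168, 170, 172, 175, 178, 210]  # hypothetical, in cm
stats = box_plot_stats(heights)

for name, value in stats.items():
    print(f"{name}: {value}")
```

For this sample the box spans 162.75 to 174.25 with the median at 169.0, the whiskers end at 150 and 178, and 210 is reported as an outlier.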
The above figure showcases a box plot for a single variable.
The following diagram shows a basic box plot for multiple variables. In this case, it shows heights for two different groups – adults and non-adults:
The above figure showcases a box plot for multiple variables.
What is a Violin Plot and what is it used for?
Violin plots are a combination of box plots and density plots. Both the statistical measures and the distribution are visualized. The thick black bar in the center represents the interquartile range, while the thin black line corresponds to the whiskers in a box plot. The white dot indicates the median. On both sides of the centerline, the density is visualized.
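As a rough illustration of how the two ingredients combine, the sketch below computes the quartiles (the box plot part) and a mirrored density profile (the density plot part) for one variable, then prints a text-only violin. It assumes a Gaussian kernel with a hand-picked bandwidth; the helper `kde` and the Math scores are made up for illustration.

```python
import math
import statistics

def kde(samples, bandwidth):
    """Gaussian kernel density estimate, returned as a function of x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return lambda x: norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )

math_scores = [35, 42, 45, 48, 50, 52, 55, 58, 60, 75]  # hypothetical scores

# Box plot part: median and interquartile range.
q1, median, q3 = statistics.quantiles(math_scores, n=4, method="inclusive")

# Density part: how wide the violin is at each score level.
density = kde(math_scores, bandwidth=6.0)
levels = range(90, 19, -10)  # score levels from 90 down to 20
peak = max(density(y) for y in levels)
rows = [(y, round(12 * density(y) / peak)) for y in levels]

for y, half_width in rows:
    # Mirror the density on both sides of the centerline "|".
    print(f"{y:>3} {'#' * half_width:>12}|{'#' * half_width}")
print(f"median={median}, IQR from {q1} to {q3}")
```

The printed shape is widest around a score of 50, where most of the sample lies, while the median and IQR give the same summary a box plot would.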
Violin plots are used to compare statistical measures and the density for multiple variables or groups. The following diagram shows a violin plot for a single variable and shows how students have performed in Math:
The above figure showcases a violin plot for a single variable (Math).
From the preceding diagram, we can see that most of the students scored around 40-60 on the Math test. The following diagram shows a violin plot for two variables and shows the performance of students in English and Math:
The above figure showcases the violin plot for multiple variables (English and Math).
From the preceding diagram, we can say that, on average, the students scored higher in English than in Math, but the highest score was achieved in Math.
The following diagram shows a violin plot for a single variable divided into three groups, and shows the performance of three divisions of students in English based on their scores:
The above figure showcases the violin plot with multiple categories for three groups of students.
From the preceding diagram, we can note that, on average, division C scored the highest, division B scored the lowest, and division A falls between divisions B and C.
A best design practice for violin plots is to scale the axes so that the distribution is clearly visible and not flat.
Conclusion:
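Histograms, density plots, box plots, and violin plots are the different types of distribution plots that can be used to visualize data effectively. A histogram or density plot shows the shape of a single distribution, while box plots and violin plots are better suited to comparing several variables or groups.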