Rui Qin
Data visualization is an important part of data analysis. In STATW 5702, we learned how to create graphs with ggplot2, a data visualization package in R. Python has libraries that do the same job. Seaborn is a data visualization library built on matplotlib. Like ggplot2 in R, seaborn can create many kinds of statistical graphs for exploratory and explanatory purposes. In this file I will show some examples of the graphs that we have learned in class and briefly explain them.
There are two ways to install seaborn: with pip (pip install seaborn) or with conda (conda install seaborn).
#Import libraries
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
#Load iris and penguins data
iris = datasets.load_iris(as_frame=True)
df = iris.data
df["class"] = iris.target.replace([0, 1, 2], ["Setosa", "Versicolour", "Virginica"])
df_2 = sns.load_dataset("penguins")
#Adjust size of plots
plt.rcParams["figure.figsize"] = (12,8)
Data scientists often want to know how a continuous variable is distributed. The histogram is one of the most widely used tools for presenting the distribution of data visually.
#Set style
#create histogram
sns.histplot(x = df["sepal length (cm)"])
<AxesSubplot:xlabel='sepal length (cm)', ylabel='Count'>
#create histogram
sns.histplot(y = df["sepal length (cm)"])
<AxesSubplot:xlabel='Count', ylabel='sepal length (cm)'>
#create histogram
sns.histplot(x = df["sepal length (cm)"], binwidth = 1)
<AxesSubplot:xlabel='sepal length (cm)', ylabel='Count'>
#create histogram
sns.histplot(x = df["sepal length (cm)"], cumulative = True)
<AxesSubplot:xlabel='sepal length (cm)', ylabel='Count'>
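The binwidth and cumulative options used above can be understood by counting by hand. Here is a minimal sketch with NumPy on a small made-up sample (the values and the bin edges are my own assumptions, not the iris data). Seaborn chooses its own bin edges, so the counts it draws for real data can differ from a hand-picked grid like this one.

```python
import numpy as np

# Hypothetical sample of lengths in cm (illustrative values, not the iris data)
values = np.array([4.3, 4.9, 5.0, 5.1, 5.8, 6.1, 6.4, 7.0, 7.7])

# Bin edges 4, 5, 6, 7, 8: every bin is 1 cm wide, similar in spirit to binwidth = 1
edges = np.arange(4, 9, 1)
counts, _ = np.histogram(values, bins=edges)

# Running total of the bin counts, which is what a cumulative histogram plots
cumulative_counts = np.cumsum(counts)

print(counts)             # [2 3 2 2]
print(cumulative_counts)  # [2 5 7 9]
```

The last cumulative count always equals the number of observations, which is a quick check that no value fell outside the bins.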
To show the relationship between two continuous variables in exploratory analysis, we use a scatterplot.
#create scatterplot
sns.scatterplot(x = df["sepal length (cm)"], y = df["sepal width (cm)"])
<AxesSubplot:xlabel='sepal length (cm)', ylabel='sepal width (cm)'>
#create scatterplot
sns.scatterplot(x = df["sepal length (cm)"], y = df["sepal width (cm)"], hue = df["class"])
<AxesSubplot:xlabel='sepal length (cm)', ylabel='sepal width (cm)'>
#create scatterplot
sns.scatterplot(x = df["sepal length (cm)"], y = df["sepal width (cm)"], style = df["class"])
<AxesSubplot:xlabel='sepal length (cm)', ylabel='sepal width (cm)'>
#create scatterplot matrix
sns.pairplot(df, hue = "class")
<seaborn.axisgrid.PairGrid at 0x27ac4264c48>
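A scatterplot shows the relationship visually, and a correlation coefficient can put a number on it. The sketch below is one possible way to do this with pandas, assuming the same scikit-learn iris data as above. The names iris_df, overall_r, and by_class_r are my own, and the class labels here come from scikit-learn's target_names rather than the labels used earlier.

```python
from sklearn import datasets

# Rebuild the iris frame so this sketch runs on its own
iris = datasets.load_iris(as_frame=True)
iris_df = iris.data.copy()
iris_df["class"] = iris.target_names[iris.target]

x, y = "sepal length (cm)", "sepal width (cm)"

# Pearson correlation over all rows
overall_r = iris_df[x].corr(iris_df[y])

# Pearson correlation within each class, mirroring the hue split in the plots
by_class_r = {name: group[x].corr(group[y]) for name, group in iris_df.groupby("class")}

print(f"overall: {overall_r:.2f}")
for name, r in by_class_r.items():
    print(f"{name}: {r:.2f}")
```

Comparing the overall value with the per-class values is a simple way to check whether the pattern in the uncolored scatterplot is driven by differences between classes.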
Compared with a histogram or a scatterplot, a boxplot is better at showing the median, the range, and the outliers.
#create boxplot
sns.boxplot(x = "class", y = "sepal length (cm)", data = df)
<AxesSubplot:xlabel='class', ylabel='sepal length (cm)'>
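The quantities a boxplot draws can also be computed directly. This is a minimal sketch on made-up numbers (not the iris data), assuming the common convention that points farther than 1.5 times the interquartile range from the box are flagged as outliers, which is the default whisker rule in seaborn and matplotlib.

```python
import numpy as np

# Hypothetical sample with one extreme value (illustrative only)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])

# The box spans the first to the third quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)  # 3.0 5.0 7.0
print(outliers)        # [30]
```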
A boxplot summarizes a variable but does not show the shape of its distribution. To see the shape as well, we can use a violin plot.
#create violin plot
sns.violinplot(x = "class", y = "sepal length (cm)", data = df)
<AxesSubplot:xlabel='class', ylabel='sepal length (cm)'>
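To illustrate the difference, here is a small sketch that draws both plots for the same sample. The data are synthetic and made up for this example: two groups of values centered at 2 and 6, so the distribution has two peaks. The boxplot only shows one box, while the violin plot should show both peaks.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Synthetic two-peaked sample (made-up data for illustration only)
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(2, 0.5, 200), rng.normal(6, 0.5, 200)])

# Draw the same values as a boxplot and as a violin plot, side by side
fig, (ax_box, ax_violin) = plt.subplots(1, 2, sharey=True)
sns.boxplot(y=values, ax=ax_box)
ax_box.set_title("boxplot")
sns.violinplot(y=values, ax=ax_violin)
ax_violin.set_title("violin plot")
```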
For categorical data, we can use a bar chart.
#create bar chart
sns.countplot(x = df_2.species)
<AxesSubplot:xlabel='species', ylabel='count'>
#Create a bar chart faceted by sex variable in penguins data
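grid = sns.FacetGrid(col = "sex", data = df_2, height = 5)
#Change value of height to adjust plot size
#Pass order explicitly so every facet lists the species in the same order
grid.map(sns.countplot, "species", order = ["Adelie", "Chinstrap", "Gentoo"])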
<seaborn.axisgrid.FacetGrid at 0x27ac69ec448>
#create bar chart
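#The hue variable is cut off in the source; sex is assumed here to match the faceted chart
sns.countplot(x = df_2.species, hue = df_2.sex)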
<AxesSubplot:xlabel='species', ylabel='count'>
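The bar heights in these charts are simply counts of rows per category. As a rough sketch, the same numbers can be tabulated with pandas. The small data frame below is made up for illustration (it is not the penguins data), and the names toy, species_counts, and species_by_sex are my own.

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the charts (not the penguins data)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo", "Chinstrap"],
    "sex": ["Male", "Female", "Male", "Male", "Female", "Female"],
})

# One count per species: the bar heights of a plain bar chart
species_counts = toy["species"].value_counts()

# Counts split by sex: the bar heights of a grouped or faceted bar chart
species_by_sex = pd.crosstab(toy["species"], toy["sex"])

print(species_counts)
print(species_by_sex)
```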
Seaborn has five styles to choose from: darkgrid, whitegrid, dark, white, and ticks. Depending on the exploratory or explanatory purpose, we can choose the style that keeps the graph clear and easy to understand.
#Present four styles used in seaborn
fig = plt.figure()
gs = plt.GridSpec(2, 2)
with sns.axes_style("darkgrid"):
    ax = fig.add_subplot(gs[0, 0])
    sns.scatterplot(x = df["sepal length (cm)"], y = df["sepal width (cm)"])
with sns.axes_style("dark"):
    ax = fig.add_subplot(gs[0, 1])
    sns.scatterplot(x = df["sepal length (cm)"], y = df["sepal width (cm)"])
with sns.axes_style("whitegrid"):
    ax = fig.add_subplot(gs[1, 0])
    sns.scatterplot(x = df["sepal length (cm)"], y = df["sepal width (cm)"])
with sns.axes_style("white"):
    ax = fig.add_subplot(gs[1, 1])
    sns.scatterplot(x = df["sepal length (cm)"], y = df["sepal width (cm)"])
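Each style name is a bundle of matplotlib settings, and sns.axes_style returns those settings as a dictionary. The short sketch below prints two of them for all five styles, including ticks, which is not drawn in the grid above. The choice of which settings to print, and the grid_on dictionary, are my own for illustration.

```python
import seaborn as sns

styles = ["darkgrid", "whitegrid", "dark", "white", "ticks"]

# Record whether each style turns the background grid on
grid_on = {}
for name in styles:
    params = sns.axes_style(name)
    grid_on[name] = params["axes.grid"]
    print(name, params["axes.grid"], params["axes.facecolor"])
```

To apply one style to every later plot instead of a single with block, sns.set_style can be called once with the style name.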