In this notebook, I present several data visualisations built with Altair, a declarative statistical visualisation library for Python that supports a wide range of interactive and customisable charts. I generally use matplotlib and other tools for basic visualisations during the exploration stage of a data science project. When presenting findings, however, interactive visualisations can come in handy.
The first example is a simple scatterplot, taken from Altair's example gallery, that shows Altair's basic functionality. Although it clearly shows the relationship between miles per gallon and horsepower, various elements can be changed to improve how the message is delivered.
%config Completer.use_jedi = False
import altair as alt
from vega_datasets import data
import numpy as np
import pandas as pd
source = data.cars()
fig = alt.Chart(source, title='This is a good title').mark_point().encode(
    alt.X("Horsepower:Q", title="Horsepower (hp)"),
    alt.Y("Miles_per_Gallon:Q", title="Miles per gallon (mpg)"),
    color='Origin',
    tooltip=["Name", "Year", "Origin", "Horsepower", "Miles_per_Gallon"]
).configure_range(category=["#1E88E5", "#FFC107", "#004D40"])
fig.interactive(
).configure_title(fontSize=18, anchor="start"
).configure_axis(titleFontSize=16, titleColor="gray")
The datasets used in the following examples are available from Opendata Transport NSW and contain the Sydney Cycling Survey (SCS) results for 2011 and 2012.
First, we present a simple bar chart that can be produced with just a few lines of code. Although there is nothing wrong with the chart itself, the customisations that Altair offers allow several improvements.
# Read excel files
colnames = ["residence", "cycling_participation2012"]
cycling_12 = pd.read_excel(
    "bts_sydney_cycling_survey_summary_results_2012_v1_2.xlsx",
    sheet_name="Cycling participation%",
    usecols="B,J",
    skiprows=range(0, 5),
)
cycling_12 = cycling_12.iloc[:17]
cycling_12.columns = colnames
# Cycling participation 2012
source = cycling_12.copy()
## Simple bar chart
simple = alt.Chart(source).mark_bar().encode(
    x=alt.X("cycling_participation2012"),
    y=alt.Y("residence", sort="-x")
)
simple
The second bar plot shows significant improvements, mainly in delivering the message and decluttering the visualisation. The changes include a meaningful title and subtitle, removal of the grid and bounding box, removal of the x-axis in favour of value labels next to the bars, and strategic use of colour to highlight the plot's essential parts. These changes add up quickly in terms of lines of code; however, the result is worth it, and the visualisation is now much easier to read.
## Improved
source = cycling_12.copy()
bars = alt.Chart(source).mark_bar().encode(
    x=alt.X("cycling_participation2012", axis=None),
    y=alt.Y("residence", sort="-x", title=""),
    # Highlight regions at or above the 0.5 threshold; grey out the rest
    color=alt.condition(alt.datum.cycling_participation2012 >= 0.5,
                        alt.value("#0cb5e8"),
                        alt.value("#abb1b3")),
    tooltip=["residence", "cycling_participation2012"]
)
# Value labels drawn next to each bar
text = bars.mark_text(dx=+16, dy=3).encode(
    text=alt.Text('cycling_participation2012')
)
(bars + text).properties(
    title={"text": ["Cycling Participation Sydney* 2012"],
           "subtitle": ["Illawarra was the region with the highest cycling participation."],
           "dx": +140, "dy": -5, "fontSize": 19},
    height=450
).configure_title(anchor="start"
).configure_axis(titleFontSize=12, titleColor='gray', grid=False
).configure_view(stroke=None)