10  Data visualization

In this session, we will learn to visualize data in R. Data visualization serves three primary purposes: data exploration, communication and aesthetics. First, when we have a new dataset and want to learn about the data, a few graphs can provide valuable insights into the distributions of variables (univariate plots) and their relationships with others (bivariate plots). Second, once we have analyzed our data and want to convey our findings, graphs can be highly effective in illustrating our argument. This notion is captured by the saying “A picture is worth a thousand words”. This is the case in research, where a good visualization is usually easier to read than a table (Kastellec and Leoni 2007). Information in the public debate is also increasingly visualized, as the growing use of graphs in the media shows. Think of the many graphs that tracked the evolution of the Covid-19 pandemic, or of the popularity of a website such as Our World in Data.

Third, aesthetics play a significant role in data visualization: we tend to appreciate visually pleasing graphs. R is capable of producing exceptionally beautiful graphs, and once you learn how to create them, you may develop a lasting preference for R graphs over those generated by tools like Excel.

10.1 Bad viz

At the same time, there are many aspects we should care about when creating graphs. Most of the data visualizations we see in the media or in reports are not great. So before showing you how to make graphs, here are a few recommendations on what you should not do. According to Healy, graphs can be bad for three main reasons: perceptual, substantive and aesthetic.

  • Avoid pie charts

  • Avoid dual axes

10.2 The grammar of graphics

There are three main plotting systems available for creating graphs in R. Base R provides functions such as plot(), hist() or boxplot(), and the lattice package offers another option. In this session, however, we will focus on a widely popular package called ggplot2. It is a component of the tidyverse, so there is no need to install or load it separately if the tidyverse package is already loaded.

The “gg” in ggplot represents the “grammar of graphics” (Wilkinson 2012). Graphs are constructed by adding various layers to a basic graph, allowing for progressively more complex modifications such as adjusting the title or adding annotations.

To make a graph with ggplot2, you need at least three things: data, aesthetics and a geom.

  • Data: a data frame in a tidy format, with observations as rows and variables as columns

  • Aesthetics: which variables we want to map to the x and y axes, and to which color, shape or size

  • Geoms: the geometry specifies how you want to represent your data, for example geom_point() or geom_line().

It is not always easy to decide which graph best represents the information we want to show. The data to viz website and the R graph gallery can help you choose.

Other layers can also be added, such as facets, statistics, coordinates and themes.

Graphs are built by adding these different layers with a + between each. The syntax is additive: within a ggplot2 call, we chain layers with + rather than with pipes. This might be quite abstract at this point, but you will see how it works in a few minutes.
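As a minimal sketch of this additive syntax, the example below builds a scatter plot step by step. It uses the built-in mtcars dataset rather than the data analysed in this chapter, so the variable names (wt, mpg, cyl) and the object names are only illustrative.

```r
library(ggplot2)

# Data + aesthetics: which variables go on which axis
canvas <- ggplot(mtcars, aes(x = wt, y = mpg))

# Add a geom: without it, the plot is an empty canvas
scatter <- canvas + geom_point()

# Add further layers: a color mapping, labels and a theme
final <- scatter +
  aes(color = factor(cyl)) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders") +
  theme_minimal()

final
```

Printing canvas, scatter and final in turn shows how each + adds something to the previous object without modifying it.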

10.3 Exploring party politics with graphs

To learn how to make graphs, we will use the Chapel Hill Expert Survey (CHES). If you are interested in political parties, it is definitely a dataset you should know. Basically, roughly every four years, hundreds of experts in different countries are asked to locate political parties on different scales (e.g. locating a party on a left-right scale from 0 to 10). The goal is to have a valid overview of where parties stand on different issues in different countries. To use the data, I read the CHES trend Stata file, which is available on the CHES website (the code below reads a local copy).

# Install and load packages with needs
# You should install the needs package before with install.packages("needs")

needs(tidyverse, ggrepel, gghighlight, ggridges, gganimate, labelled) 

# Load the CHES trend data : https://www.chesdata.eu/s/1999-2019_CHES_dataset_meansv3.dta

ches <- haven::read_dta("data/ches.dta") |> 
  # Convert all labelled values to their Stata labels
  mutate(across(everything(), labelled::unlabelled)) |> 
  # Treat the survey wave as a categorical variable
  mutate(year = as.factor(year))

colnames(ches)
 [1] "country"                "eastwest"               "eumember"              
 [4] "year"                   "expert"                 "party_id"              
 [7] "cmp_id"                 "party"                  "vote"                  
[10] "seat"                   "electionyear"           "epvote"                
[13] "family"                 "govt"                   "eu_position"           
[16] "eu_salience"            "eu_dissent"             "eu_blur"               
[19] "eu_benefit"             "eu_ep"                  "eu_fiscal"             
[22] "eu_intmark"             "eu_employ"              "eu_budgets"            
[25] "eu_agri"                "eu_cohesion"            "eu_environ"            
[28] "eu_asylum"              "eu_foreign"             "eu_turkey"             
[31] "lrgen"                  "lrecon"                 "lrecon_salience"       
[34] "lrecon_dissent"         "lrecon_blur"            "galtan"                
[37] "galtan_salience"        "galtan_dissent"         "galtan_blur"           
[40] "spendvtax"              "spendvtax_salience"     "deregulation"          
[43] "dereg_salience"         "redistribution"         "redist_salience"       
[46] "econ_interven"          "civlib_laworder"        "civlib_salience"       
[49] "sociallifestyle"        "social_salience"        "religious_principles"  
[52] "relig_salience"         "immigrate_policy"       "immigrate_salience"    
[55] "immigrate_dissent"      "multiculturalism"       "multicult_salience"    
[58] "multicult_dissent"      "urban_rural"            "urban_salience"        
[61] "environment"            "enviro_salience"        "cosmo"                 
[64] "cosmo_salience"         "protectionism"          "regions"               
[67] "region_salience"        "international_security" "international_salience"
[70] "us"                     "us_salience"            "ethnic_minorities"     
[73] "ethnic_salience"        "nationalism"            "russian_interference"  
[76] "anti_islam_rhetoric"    "people_vs_elite"        "antielite_salience"    
[79] "corrupt_salience"       "members_vs_leadership"  "mip_one"               
[82] "mip_two"                "mip_three"              "chesversion"           
n_distinct(ches$country) # How many different countries are there ? 
[1] 28
n_distinct(ches$year) # How many different years (waves) ? 
[1] 6
n_distinct(ches$party_id) # How many different parties ? 
[1] 424

The dataset covers 424 different political parties, across 6 waves and 28 countries. From the list of variables, we can see that a few of them identify the year, the country, the party, and its vote share and number of seats for a given wave, while many others measure positions on different issues.

For instance, I am interested in political parties’ preferences regarding the environment. There are two variables in the CHES for this: environment, which measures parties’ environmental positions, and enviro_salience, which measures the salience of the environment for each party.

10.4 Distributions

The first thing we would like to know is how these variables are distributed. To do this, we can first calculate some descriptive statistics.

summary(ches$environment)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.2857  4.1456  5.5455  5.2001  6.6187  9.4000     502 
summary(ches$enviro_salience)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.7143  3.0000  4.2792  4.6009  5.8000 10.0000     746 

The first thing we can see is that there are a lot of NAs. If you have not looked at the codebook carefully, you could wonder why. This is because our dataset combines waves from different years, and the items we are interested in are not available for all waves. One way to check this is to look at the group means for the different years. We see that for environment, we have values for 2010, 2014 and 2019, and for enviro_salience, only for 2010 and 2019.

ches |> 
  group_by(year) |>
  summarise(
    # Compute the mean of environmental salience for each year
    env_sal = mean(enviro_salience, na.rm = TRUE),
    # Compute the mean of environmental position for each year
    env_pos = mean(environment, na.rm = TRUE)
  )
# A tibble: 6 × 3
  year  env_sal env_pos
  <fct>   <dbl>   <dbl>
1 1999   NaN     NaN   
2 2002   NaN     NaN   
3 2006   NaN     NaN   
4 2010     4.48    5.16
5 2014   NaN       5.27
6 2019     4.70    5.16
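A more direct check is to count the non-missing observations per wave instead of reading NaN means. The sketch below does this on a small made-up tibble (the values and column names are invented for illustration, they are not CHES data).

```r
library(dplyr)

# Toy data: the salience item was not asked in the 2014 wave
toy <- tibble(
  year     = factor(c(2010, 2010, 2014, 2014, 2019, 2019)),
  salience = c(4.2, 5.1, NA, NA, 3.9, NA)
)

coverage <- toy |>
  group_by(year) |>
  summarise(
    n_rows   = n(),                    # rows in the wave
    n_valid  = sum(!is.na(salience)),  # rows with an observed value
    share_na = mean(is.na(salience))   # proportion of missing values
  )

coverage
```

A wave where n_valid is 0 is a wave in which the item was simply not available.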

But a visualization gives us a better understanding of how the values are distributed! For this, I create a histogram with ggplot2.

# Data
ches |> 
  # Aesthetics : what are the values we want in the plot
  ggplot(aes(x = enviro_salience)) + 
  # Geom : which kind of graphic do we want
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Note that this graph is highly customizable, with plenty of different arguments. Here is an example of how it could change. Do not hesitate to change the values to see how it works.

ches |> 
  ggplot(aes(x = enviro_salience)) + 
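  geom_histogram(bins = 10,          # Number of bins
                 alpha = 1,          # Opacity (0 = transparent, 1 = opaque)
                 fill = "pink",      # Color inside the bars
                 color = "yellow") + # Color of the bar outlines
  # Modify the axis labels
  labs(x = "Environmental Salience", y = "Number of parties") +
  # Change the theme of the graph
  theme_minimal() +
  # Add a title
  ggtitle("Environmental salience distribution")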
Warning: Removed 746 rows containing non-finite values (`stat_bin()`).
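The message printed by stat_bin() earlier suggests choosing a bin width explicitly rather than a number of bins. The sketch below uses simulated scores on a 0 to 10 scale (made-up data, not CHES): binwidth = 1 together with boundary = 0 gives one bar per unit of the scale.

```r
library(ggplot2)

set.seed(42)
# Simulated scores clamped to a 0-10 scale (illustrative only)
scores <- data.frame(x = pmin(pmax(rnorm(200, mean = 5, sd = 2), 0), 10))

hist_bw <- ggplot(scores, aes(x = x)) +
  # One bin per unit; boundary = 0 aligns bin edges on whole numbers
  geom_histogram(binwidth = 1, boundary = 0, fill = "grey70", color = "white")

hist_bw
```

With a bounded scale like this one, a bin width that matches the unit of the scale is often easier to read than the default of 30 bins.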

Another way to plot distributions is to use the geom_density() function. The “y” value at a specific point on the horizontal axis indicates the estimated probability density of the data at that point. In simpler terms, it tells you how likely data points are to occur around that particular value.

ches |>
  ggplot(aes(x = enviro_salience)) +
  geom_density()
Warning: Removed 746 rows containing non-finite values (`stat_density()`).

Most of the time, we are interested in how variables are distributed across groups. To do so with a histogram or a density plot, we can use the fill aesthetic to give a different color to each group of a given variable.

ches |> 
  ggplot(aes(x = enviro_salience, fill = eastwest)) + 
  geom_histogram(alpha = 0.8) +
  scale_fill_manual(values = c("blue", "yellow"))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 746 rows containing non-finite values (`stat_bin()`).

ches |> 
  ggplot(aes(x = enviro_salience, fill = family)) + 
  geom_density(alpha = 0.8)
Warning: Removed 746 rows containing non-finite values (`stat_density()`).

When there are many groups, overlapping densities become hard to read. It is then easier to use the ggridges::geom_density_ridges() function, which draws one density per row.

ches |>
  ggplot(aes(x = enviro_salience, y = family)) +
  ggridges::geom_density_ridges(alpha = 0.8) +
  theme_minimal()
Picking joint bandwidth of 0.584
Warning: Removed 746 rows containing non-finite values
(`stat_density_ridges()`).

10.5 Boxplots : relationships between 1 categorical and 1 continuous variable

ches |> 
  ggplot(aes(x = eastwest, y = enviro_salience)) +
  geom_boxplot()
Warning: Removed 746 rows containing non-finite values (`stat_boxplot()`).

ches |> 
  ggplot(aes(x = family, y = environment)) +
  geom_boxplot() +
  geom_jitter(color = "darkgreen", alpha = 0.1) +
  coord_flip() +
  scale_y_continuous("Env vs Econ growth", limits = c(0, 10)) +
  ggtitle("Env vs Econ growth by party family") +
  theme_minimal() 
Warning: Removed 502 rows containing non-finite values (`stat_boxplot()`).
Warning: Removed 502 rows containing missing values (`geom_point()`).
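To read these plots, it helps to know what geom_boxplot() draws: the box spans the first to the third quartile, the line inside is the median, and the whiskers reach the most extreme values within 1.5 times the interquartile range of the box. The sketch below, on a small invented data frame, extracts these statistics with layer_data() and compares them with quantile().

```r
library(ggplot2)

# Toy data: one continuous variable in two groups (values invented)
toy <- data.frame(
  group = rep(c("a", "b"), each = 5),
  value = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 30)
)

box <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot()

# Statistics computed by ggplot2 for each box (one row per group)
box_stats <- layer_data(box)[, c("ymin", "lower", "middle", "upper", "ymax")]
box_stats

# The same quartiles computed by hand for group "b"
quantile(toy$value[toy$group == "b"], probs = c(0.25, 0.5, 0.75))
```

In group "b", the value 30 lies beyond the upper whisker, so it is drawn as a separate outlier point rather than stretching the whisker.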

10.6 Plotting the evolution of EU positions over time

view(ches)

ches |> 
  mutate(party = str_to_lower(party)) |> 
  filter(country == "uk", party %in% c("cons", "ukip", "snp", "green", "libdem", "lab")) |> 
  ggplot(aes(x = year, y = eu_position, color = party, group = party)) +
  geom_point() +
  geom_line() +
  scale_color_brewer(palette = "Set2") +
  theme_minimal()

10.7 Scatter plots : relationships between two continuous variables

So far, we have plotted the values of environmental salience across categorical variables such as party family or the East-West divide. We have noticed that party families on the right are usually less pro-environment than families on the left. We could also visualise this by looking at the association between enviro_salience and the lrgen variable, which places political parties on the left-right scale.

# Are left right positions and environmental salience related ? 

ches |> 
  ggplot(aes(x = lrgen, y = enviro_salience)) +
  geom_point() +
  geom_smooth() +
  theme_light()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 746 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 746 rows containing missing values (`geom_point()`).

# Are environmental positions and salience related ?
ches |> 
  filter(year %in% c(2010, 2019)) |> 
  ggplot(aes(environment, enviro_salience, color = as.factor(year))) +
  geom_point() + 
  geom_smooth() +
  theme_minimal()  +
  scale_x_continuous("Environmental position", limits = c(0, 10)) +
  scale_y_continuous("Environmental salience", limits = c(0, 10)) +
  scale_color_brewer("Year", palette = "Set1") +
  # Dashed horizontal reference line (hard-coded value)
  geom_hline(
    yintercept = 4.67,
    linetype = "dashed",
    color = "black",
    linewidth = 0.5
  ) +
  # Dashed vertical reference line (hard-coded value)
  geom_vline(
    xintercept = 5.22,
    linetype = "dashed",
    color = "black",
    linewidth = 0.5
  ) +
  labs(caption = "Source: CHES")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing missing values (`geom_smooth()`).
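By default, geom_smooth() picks a flexible smoother (gam or loess, as the messages above show). If a straight regression line is preferred, method = "lm" can be passed instead. The sketch below illustrates this on the built-in mtcars dataset, which is an assumption made to keep the example self-contained.

```r
library(ggplot2)

lin <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Linear fit; se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

lin
```

Whether a linear or a flexible fit is more appropriate depends on the data, so it is worth trying both before settling on one.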

10.8 Faceting

One of the patterns identified in the literature is the so-called inverted U-curve relationship between the left-right economic placement of a party and its position towards the EU: parties at both economic extremes tend to be more Eurosceptic than those in the centre.

(eu_plot <- ches |>
  ggplot(aes(lrecon, eu_position, color = eastwest)) + # Aesthetics
  geom_point(alpha = 0.2) +
  geom_smooth() +
  theme_minimal() +
  scale_x_continuous("Left-Right position") +
  scale_y_continuous("EU position"))
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

To compare this relationship across many different categories, it is usually easier to use faceting with facet_wrap().

eu_plot +
  facet_wrap(~ year)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

eu_plot +
  facet_wrap(~ country)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
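facet_wrap() also takes arguments that control the layout of the panels, such as ncol and scales. A small sketch, assuming the mpg dataset that ships with ggplot2 (so the variables displ, hwy and class are illustrative, not CHES variables):

```r
library(ggplot2)

faceted <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  # One panel per vehicle class, laid out in 4 columns;
  # scales = "fixed" (the default) keeps axes comparable across panels
  facet_wrap(~ class, ncol = 4, scales = "fixed")

faceted
```

Setting scales to "free", "free_x" or "free_y" lets each panel use its own axis range, at the cost of making panels harder to compare.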

10.9 Adding annotations with ggrepel

ches |>
  # Keep only French observations
  filter(country == "fr") |>
  # Create a party-year variable by combining the text of both
  mutate(party_year = str_c(year, "-", party)) |>
  
  ggplot(aes(lrgen, eu_position, color = family, label = party_year)) +
  geom_point() +
  ggrepel::geom_text_repel()
Warning: ggrepel: 19 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
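The warning indicates that some labels were dropped because they overlapped too much. One option, suggested by the message itself, is to raise max.overlaps. The sketch below does so on the built-in mtcars dataset (an assumption made for illustration, with the row names used as labels).

```r
library(ggplot2)
library(ggrepel)

cars <- mtcars
cars$model <- rownames(mtcars)  # car names, used as labels

labelled_plot <- ggplot(cars, aes(x = wt, y = mpg, label = model)) +
  geom_point() +
  # max.overlaps = Inf asks ggrepel to keep every label, however crowded
  geom_text_repel(max.overlaps = Inf, size = 3)

labelled_plot
```

Keeping every label can make a dense plot unreadable, so labelling only a filtered subset of points is often a better alternative.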

10.10 Complex graphs : annotations + colors + facets

ches |> 
  filter(year == 2019, !family %in% c("no family", "confessional", "regional")) |> 
  ggplot(aes(lrecon, galtan, color = family, label = party)) + # Aesthetics
  geom_point() +
  geom_hline(yintercept = 5, linetype = "dotted") +
  geom_vline(xintercept = 5, linetype = "dotted") +
  ggrepel::geom_text_repel() +
  facet_wrap(~ eastwest) +
  theme_bw() +
  scale_x_continuous("Economic Left Right") +
  scale_y_continuous("Gal TAN") +
  scale_color_brewer(palette = "Set2")

ches |> 
  group_by(eastwest) |> 
  count(country) |> 
  view()

10.11 Save and export graphs

graphic <- ches |> 
  ggplot(aes(eastwest, enviro_salience)) +
  geom_boxplot()

graphic
Warning: Removed 746 rows containing non-finite values (`stat_boxplot()`).

ggsave(filename = "figures/env_eastwest.png", plot = graphic)
Saving 7 x 5 in image
Warning: Removed 746 rows containing non-finite values (`stat_boxplot()`).
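The message "Saving 7 x 5 in image" shows that ggsave() falls back on the size of the current graphics device when no dimensions are given. For reproducible figures, it can help to set the size and resolution explicitly. A sketch, assuming a plot built from mtcars and a temporary output file so that it runs anywhere:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Write to a temporary file (hypothetical file name)
out_file <- file.path(tempdir(), "example_boxplot.png")

ggsave(
  filename = out_file,
  plot     = p,
  width    = 16,   # size expressed in the unit given below
  height   = 10,
  units    = "cm",
  dpi      = 300   # resolution for raster formats such as PNG
)
```

The file extension determines the output format, so replacing ".png" with ".pdf" would produce a vector graphic instead.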

10.12 Animate graphs

You can also animate your graphs. Here is an example using the gapminder dataset (not about parties anymore) that plots the evolution over time of the relationship between GDP per capita and life expectancy across continents.

needs(gapminder)

p <- ggplot(gapminder,
            aes(
              x = gdpPercap,
              y = lifeExp,
              size = pop,
              colour = country
            )) +
  geom_point(show.legend = FALSE, alpha = 0.7) +
  scale_color_viridis_d() +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  labs(x = "GDP per capita", y = "Life expectancy")

p

p + 
  facet_wrap(~continent) +
  transition_time(year) +
  labs(title = "Year: {frame_time}")

10.13 To go further

  • R Graphics Cookbook

  • ggplot2: Elegant Graphics for Data Analysis

  • Kieran Healy has written a whole book on dataviz. See also his paper in ARS: Data viz and sociology

  • Another book on dataviz by Claus O. Wilke

  • Chapter on dataviz in the R for Data science book

  • Chapter by Irizarry in his book

  • A paper by Hadley Wickham explaining the idea of “grammar of graphics”

  • If you would also like to learn how to animate your graphs, you can consult the gganimate package.

  • Tufte: The Visual Display of Quantitative Information

  • A video by Chris Bail introducing dataviz

  • Other resources pointed out by François Briatte on GitHub

  • ggplot2 extensions: https://exts.ggplot2.tidyverse.org/gallery/

  • https://stackoverflow.com/questions/tagged/ggplot2

  • http://albertocairo.com/

  • https://visionscarto.net/hieroglyphes-isotype

  • https://visionscarto.net/la-semiologie-graphique-a-50-ans

11 References

Kastellec, Jonathan P, and Eduardo L Leoni. 2007. “Using Graphs Instead of Tables in Political Science.” Perspectives on Politics 5 (4): 755–71.
Wilkinson, Leland. 2012. The Grammar of Graphics. Springer.