At this point, you are now able to import a data file into Rstudio and quickly explore the structure of the dataset, which is usually the first step in any analysis. If the dataset is well-structured, we then need to carry out a very important operation, which we’ll see how to do in R: manipulate the data to obtain the information we want. In many cases, we’re not interested in the whole dataset; we want to select specific variables, analyze only certain groups and see how these variables vary according to these groups.
With R, there are several ways to perform these operations. Firstly, you can use R’s base functions, without having to load a specific package. An alternative approach is based on the tidyverse, the package suite developed by Hadley Wickham with a special syntax. Here, we’ll concentrate on the tidyverse functions, which I feel are easier and more intuitive to use.
The tidyverse relies extensively on what is called “the pipe” : |>. You will quickly understand what the pipe is, we use it to chain instructions just as in a recipe. Depending on the version of R, you might also use/see this %>%.
7.1 Introduction to dplyr functions
To manipulate data, we will use a package from the tidyverse called dplyr, comprising a comprehensive suite of exceptionally useful functions. To familiarize ourselves with its usage, we will explore the Quality of Government Environmental Indicators dataset. To do this, we will need first to load the tidyverse and import the data.
# Load the tidyverselibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
qog <- haven::read_dta("data/qog_env.dta")
7.1.1count()
From Exercise 1, we saw that the qog dataset has a country-year structure. For each country, we have a series of indicators with one value per year. the qog dataset has a year variable and a cname variable (for country).
qog |>count(year, cname) # Only one country-year observation
# A tibble: 11,722 × 3
year cname n
<dbl> <chr> <int>
1 1946 Afghanistan 1
2 1946 Albania 1
3 1946 Andorra 1
4 1946 Antigua and Barbuda 1
5 1946 Argentina 1
6 1946 Australia 1
7 1946 Belgium 1
8 1946 Bhutan 1
9 1946 Bolivia (Plurinational State of) 1
10 1946 Brazil 1
# ℹ 11,712 more rows
7.1.2filter() observations
We may want to use only a subset of countries and get a smaller versions of the dataset. And the tidyverse has one for this, which is called filter() We want to select specific rows/observations of the dataset based on a specific conditions. To filter, we will use extensively boolean operators.
Operator
Description
==
equal
!=
not equal
<
less than
<=
less than or equal
>
greater than
>=
greater than or equal
&
and
|
or
!
not
%in%
in the set
For instance, I might be only interested in environmental indicators for France. In that case I could filter the whole dataset with only the observations that has France as country variable.
Here, see that the tidyverse functions are a bit intuitive that what you could do directly with R base functions. To do the same, here an example of what we would have done :
qog[qog$cname %in%c("France"), ]
# A tibble: 75 × 414
cname ccode year cname_qog ccode_qog ccodealp ccodealp_year ccodecow
<chr> <dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
1 France 250 1963 France (1963-) 250 FRA FRA63 220
2 France 250 1964 France (1963-) 250 FRA FRA64 220
3 France 250 1965 France (1963-) 250 FRA FRA65 220
4 France 250 1966 France (1963-) 250 FRA FRA66 220
5 France 250 1967 France (1963-) 250 FRA FRA67 220
6 France 250 1968 France (1963-) 250 FRA FRA68 220
7 France 250 1969 France (1963-) 250 FRA FRA69 220
8 France 250 1970 France (1963-) 250 FRA FRA70 220
9 France 250 1971 France (1963-) 250 FRA FRA71 220
10 France 250 1972 France (1963-) 250 FRA FRA72 220
# ℹ 65 more rows
# ℹ 406 more variables: ccodevdem <dbl>, cname_year <chr>, version <chr>,
# act_act <dbl>, as_rifr <dbl>, as_ws <dbl>, bti_envc <dbl>, ccci_coop <dbl>,
# ccci_em <dbl>, ccci_fin <dbl>, ccci_kyoto <dbl>, ccci_rep <dbl>,
# ccci_unfccc <dbl>, cckp_rain <dbl>, cckp_temp <dbl>, ccl_exepp <dbl>,
# ccl_leglp <dbl>, ccl_lpp <dbl>, ccl_mitlpp <dbl>, ccl_nexep <dbl>,
# ccl_nlegl <dbl>, ccl_nlp <dbl>, ccl_nmitlp <dbl>, edgar_bc <dbl>, …
You could also filter by all of the countries that are NOT France and count how many observations we have for each.
qog |># Keep only countries that are not Francefilter(cname !="France") |># Count the number of observations for each countrycount(cname, sort =TRUE) # Use Sort = TRUE if you want to sort the results
# A tibble: 203 × 2
cname n
<chr> <int>
1 Afghanistan 75
2 Albania 75
3 Andorra 75
4 Antigua and Barbuda 75
5 Argentina 75
6 Australia 75
7 Belgium 75
8 Bhutan 75
9 Bolivia (Plurinational State of) 75
10 Brazil 75
# ℹ 193 more rows
Or you could try with a few countries. In that case, you can use %in%.
# A tibble: 4 × 2
cname n
<chr> <int>
1 Denmark 75
2 France 75
3 Norway 75
4 Spain 75
But we also want to store the data to manipulate it and so it is easier to create an object.
countries <-c("France", "Norway", "Spain")(qog_subset <- qog |># Keep only the following countriesfilter( cname %in%c("France","Norway","Spain","Denmark","Netherlands","Poland","Russia","Saudi Arabia","India" ),# Keep only the observations that are after 1990 year >1990 ))
qog_slice <- qog_subset |>slice_max(year, n =5) # Get the n max results of a variableqog_subset |>slice_min(year, n =3) # Get the n min results of a variable
While filter() is useful for keeping only certain groups of rows, select() is used, as its name suggests, to select certain variables (columns) from our dataframe.
Let’s say I’m interested in CO2 emissions per capita, their overall level and how they vary over time and by country. For this, I would need only three variables that are present in the dataset: year, cname and the variable wdi_co2 which comes from the World Bank’s World Development Indicators. To do this, I use select() and just specify which variables of the dataset I want to select.
The select() functions offer a lot of possibilities to select variable based on certain patterns by adding functions inside such as starts_with(), ends_with(), contains(). You can learn more by typing ?select.
qog_wdi <- qog_subset |># Select the country, the year and every variable that starts with wdi. select(cname, year, starts_with("wdi"))# Look at which columns we have in our datasetqog_wdi |>colnames()
qog_wdi |># Select only the variables that have numeric as typeselect_if(is.character) |># Look at which columns we have in our datasetcolnames()
[1] "cname"
7.4 Change column names with rename()
Occasionally, you may encounter datasets with columns named in a non-meaningful manner, such as v5, x45, or Q234, often representing variable identifiers from surveys. To enhance the understanding of the variables’ meanings, it can prove beneficial to rename these columns using the rename() function. In our current dataset, the variable names are already quite informative. However, for the purpose of this illustration, let’s rename the ‘cname’ variable as ‘country’, offering a more explicit label.
qog_renamed <- qog_wdi |># Rename the variable cname as countryrename(country = cname) qog_renamed |>colnames()
# More complex ! qog_renamed <- qog_renamed |># Remove wdi_ from all of the variables names that starts with wdirename_with(~str_remove(.x, "wdi_"), starts_with("wdi")) qog_renamed |>colnames()
However, we are often interested in how these indicators vary across groups such as country or year. To do so, use the summarise() function in R that allows you to compute new variables by groups. To choose grouping variable, we use first group_by() where we specify for which group we want to compute something. Here I use this function to compute the mean, the median, the standard deviation and the first and third quartile of co2 emissions per capita for each country. This gives us a new tibble with all of the summary information.
To get nice tables when you render your quarto document, you can use the knitr::kable() function.
qog_stats |> knitr::kable()
cname
mean_co2
median_co2
sd_co2
q1
q3
Denmark
9.306369
9.647554
2.1354587
8.1058071
8.1058071
France
5.815746
5.926466
0.4877268
5.6274657
5.6274657
India
1.154740
1.005567
0.3372283
0.9172273
0.9172273
Netherlands
10.945372
11.040789
0.4877648
10.5946611
10.5946611
Norway
8.940977
8.983080
1.0866764
8.1622779
8.1622779
Poland
8.291731
8.205371
0.5672529
7.8777194
7.8777194
Russia
11.508477
11.443988
0.9082599
10.8866443
10.8866443
Saudi Arabia
16.012722
16.656378
2.6470291
14.4207311
14.4207311
Spain
6.458592
6.149608
0.9813928
5.7886348
5.7886348
It is also possible to sort the results to gain a clearer idea of which countries have the highest average CO2 emissions per capita. To achieve this, we utilize the arrange() function, specifying the variable by which we intend to sort. By doing so, we observe that, on average, Saudi Arabia, Russia, and the Netherlands possess the highest carbon footprint per capita among the selected countries.
Now I want to know when emissions per capita levels were the lowest in Saudia Arabia. I can filter only this country, select the variables of interest and sort the dataframe. We observe that the CO2 emissions per capita were at their minimum towards the late 1990s (which is not good).
# A tibble: 9 × 3
cname wdi_co2 wdi_fossil
<chr> <dbl> <dbl>
1 Denmark 9.31 84.7
2 France 5.82 52.0
3 India 1.15 64.9
4 Netherlands 10.9 94.1
5 Norway 8.94 56.4
6 Poland 8.29 94.7
7 Russia 11.5 91.3
8 Saudi Arabia 16.0 99.9
9 Spain 6.46 78.9
7.5.1 Mutate
We can also create new variables based on other ones with mutate(). Let’s say I want to compute the growth rate of co2 emissions per capita every year. For this, I will use mutate() to create new variables.
Mutate() is also widely used to recode variable. Here we create a new trajectory variable and we use case_when() to recode whether there co2_emissions are growing or not for a given year.