7  Manipulating data

At this point, you are now able to import a data file into Rstudio and quickly explore the structure of the dataset, which is usually the first step in any analysis. If the dataset is well-structured, we then need to carry out a very important operation, which we’ll see how to do in R: manipulate the data to obtain the information we want. In many cases, we’re not interested in the whole dataset; we want to select specific variables, analyze only certain groups and see how these variables vary according to these groups.

With R, there are several ways to perform these operations. Firstly, you can use R’s base functions, without having to load a specific package. An alternative approach is based on the tidyverse, the package suite developed by Hadley Wickham with a special syntax. Here, we’ll concentrate on the tidyverse functions, which I feel are easier and more intuitive to use.

The tidyverse relies extensively on what is called “the pipe” : |>. You will quickly understand what the pipe is, we use it to chain instructions just as in a recipe. Depending on the version of R, you might also use/see this %>%.

7.1 Introduction to dplyr functions

To manipulate data, we will use a package from the tidyverse called dplyr, comprising a comprehensive suite of exceptionally useful functions. To familiarize ourselves with its usage, we will explore the Quality of Government Environmental Indicators dataset. To do this, we will need first to load the tidyverse and import the data.

# Load the tidyverse

library(tidyverse) 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
qog <- haven::read_dta("data/qog_env.dta") 

7.1.1 count()

From Exercise 1, we saw that the qog dataset has a country-year structure. For each country, we have a series of indicators with one value per year. the qog dataset has a year variable and a cname variable (for country).

qog |> # This is a pipe
  count(year)
# A tibble: 75 × 2
    year     n
   <dbl> <int>
 1  1946    75
 2  1947    76
 3  1948    82
 4  1949    85
 5  1950    88
 6  1951    88
 7  1952    89
 8  1953    89
 9  1954    91
10  1955    94
# ℹ 65 more rows
table(qog$year) # Alternative way to do it but the result is less nice

1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 
  75   76   82   85   88   88   89   89   91   94   96   97   97   98  103  118 
1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 
 119  125  127  131  135  138  141  143  144  146  149  149  151  154  160  160 
1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 
 160  165  166  167  168  168  170  170  170  172  172  172  172  173  189  192 
1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 
 192  193  193  193  193  193  193  193  194  194  194  194  195  195  195  195 
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 
 195  195  195  195  195  194  194  194  194  194  194 
qog |> 
  count(year, cname) # Only one country-year observation
# A tibble: 11,722 × 3
    year cname                                n
   <dbl> <chr>                            <int>
 1  1946 Afghanistan                          1
 2  1946 Albania                              1
 3  1946 Andorra                              1
 4  1946 Antigua and Barbuda                  1
 5  1946 Argentina                            1
 6  1946 Australia                            1
 7  1946 Belgium                              1
 8  1946 Bhutan                               1
 9  1946 Bolivia (Plurinational State of)     1
10  1946 Brazil                               1
# ℹ 11,712 more rows

7.1.2 filter() observations

We may want to use only a subset of countries and get a smaller versions of the dataset. And the tidyverse has one for this, which is called filter() We want to select specific rows/observations of the dataset based on a specific conditions. To filter, we will use extensively boolean operators.

Operator Description
== equal
!= not equal
< less than
<= less than or equal
> greater than
>= greater than or equal
& and
| or
! not
%in% in the set

For instance, I might be only interested in environmental indicators for France. In that case I could filter the whole dataset with only the observations that has France as country variable.

qog |> filter(cname %in% c("France", "Afghanistan"))
# A tibble: 150 × 414
   cname       ccode  year cname_qog   ccode_qog ccodealp ccodealp_year ccodecow
   <chr>       <dbl> <dbl> <chr>           <dbl> <chr>    <chr>            <dbl>
 1 Afghanistan     4  1946 Afghanistan         4 AFG      AFG46              700
 2 Afghanistan     4  1947 Afghanistan         4 AFG      AFG47              700
 3 Afghanistan     4  1948 Afghanistan         4 AFG      AFG48              700
 4 Afghanistan     4  1949 Afghanistan         4 AFG      AFG49              700
 5 Afghanistan     4  1950 Afghanistan         4 AFG      AFG50              700
 6 Afghanistan     4  1951 Afghanistan         4 AFG      AFG51              700
 7 Afghanistan     4  1952 Afghanistan         4 AFG      AFG52              700
 8 Afghanistan     4  1953 Afghanistan         4 AFG      AFG53              700
 9 Afghanistan     4  1954 Afghanistan         4 AFG      AFG54              700
10 Afghanistan     4  1955 Afghanistan         4 AFG      AFG55              700
# ℹ 140 more rows
# ℹ 406 more variables: ccodevdem <dbl>, cname_year <chr>, version <chr>,
#   act_act <dbl>, as_rifr <dbl>, as_ws <dbl>, bti_envc <dbl>, ccci_coop <dbl>,
#   ccci_em <dbl>, ccci_fin <dbl>, ccci_kyoto <dbl>, ccci_rep <dbl>,
#   ccci_unfccc <dbl>, cckp_rain <dbl>, cckp_temp <dbl>, ccl_exepp <dbl>,
#   ccl_leglp <dbl>, ccl_lpp <dbl>, ccl_mitlpp <dbl>, ccl_nexep <dbl>,
#   ccl_nlegl <dbl>, ccl_nlp <dbl>, ccl_nmitlp <dbl>, edgar_bc <dbl>, …
qog |> 
  filter(cname != "France") 
# A tibble: 11,647 × 414
   cname       ccode  year cname_qog   ccode_qog ccodealp ccodealp_year ccodecow
   <chr>       <dbl> <dbl> <chr>           <dbl> <chr>    <chr>            <dbl>
 1 Afghanistan     4  1946 Afghanistan         4 AFG      AFG46              700
 2 Afghanistan     4  1947 Afghanistan         4 AFG      AFG47              700
 3 Afghanistan     4  1948 Afghanistan         4 AFG      AFG48              700
 4 Afghanistan     4  1949 Afghanistan         4 AFG      AFG49              700
 5 Afghanistan     4  1950 Afghanistan         4 AFG      AFG50              700
 6 Afghanistan     4  1951 Afghanistan         4 AFG      AFG51              700
 7 Afghanistan     4  1952 Afghanistan         4 AFG      AFG52              700
 8 Afghanistan     4  1953 Afghanistan         4 AFG      AFG53              700
 9 Afghanistan     4  1954 Afghanistan         4 AFG      AFG54              700
10 Afghanistan     4  1955 Afghanistan         4 AFG      AFG55              700
# ℹ 11,637 more rows
# ℹ 406 more variables: ccodevdem <dbl>, cname_year <chr>, version <chr>,
#   act_act <dbl>, as_rifr <dbl>, as_ws <dbl>, bti_envc <dbl>, ccci_coop <dbl>,
#   ccci_em <dbl>, ccci_fin <dbl>, ccci_kyoto <dbl>, ccci_rep <dbl>,
#   ccci_unfccc <dbl>, cckp_rain <dbl>, cckp_temp <dbl>, ccl_exepp <dbl>,
#   ccl_leglp <dbl>, ccl_lpp <dbl>, ccl_mitlpp <dbl>, ccl_nexep <dbl>,
#   ccl_nlegl <dbl>, ccl_nlp <dbl>, ccl_nmitlp <dbl>, edgar_bc <dbl>, …

Here, see that the tidyverse functions are a bit intuitive that what you could do directly with R base functions. To do the same, here an example of what we would have done :

qog[qog$cname %in% c("France"), ]
# A tibble: 75 × 414
   cname  ccode  year cname_qog      ccode_qog ccodealp ccodealp_year ccodecow
   <chr>  <dbl> <dbl> <chr>              <dbl> <chr>    <chr>            <dbl>
 1 France   250  1963 France (1963-)       250 FRA      FRA63              220
 2 France   250  1964 France (1963-)       250 FRA      FRA64              220
 3 France   250  1965 France (1963-)       250 FRA      FRA65              220
 4 France   250  1966 France (1963-)       250 FRA      FRA66              220
 5 France   250  1967 France (1963-)       250 FRA      FRA67              220
 6 France   250  1968 France (1963-)       250 FRA      FRA68              220
 7 France   250  1969 France (1963-)       250 FRA      FRA69              220
 8 France   250  1970 France (1963-)       250 FRA      FRA70              220
 9 France   250  1971 France (1963-)       250 FRA      FRA71              220
10 France   250  1972 France (1963-)       250 FRA      FRA72              220
# ℹ 65 more rows
# ℹ 406 more variables: ccodevdem <dbl>, cname_year <chr>, version <chr>,
#   act_act <dbl>, as_rifr <dbl>, as_ws <dbl>, bti_envc <dbl>, ccci_coop <dbl>,
#   ccci_em <dbl>, ccci_fin <dbl>, ccci_kyoto <dbl>, ccci_rep <dbl>,
#   ccci_unfccc <dbl>, cckp_rain <dbl>, cckp_temp <dbl>, ccl_exepp <dbl>,
#   ccl_leglp <dbl>, ccl_lpp <dbl>, ccl_mitlpp <dbl>, ccl_nexep <dbl>,
#   ccl_nlegl <dbl>, ccl_nlp <dbl>, ccl_nmitlp <dbl>, edgar_bc <dbl>, …

You could also filter by all of the countries that are NOT France and count how many observations we have for each.

qog |> 
  # Keep only countries that are not France
  filter(cname != "France") |> 
  # Count the number of observations for each country
  count(cname, sort = TRUE) # Use Sort = TRUE if you want to sort the results
# A tibble: 203 × 2
   cname                                n
   <chr>                            <int>
 1 Afghanistan                         75
 2 Albania                             75
 3 Andorra                             75
 4 Antigua and Barbuda                 75
 5 Argentina                           75
 6 Australia                           75
 7 Belgium                             75
 8 Bhutan                              75
 9 Bolivia (Plurinational State of)    75
10 Brazil                              75
# ℹ 193 more rows

Or you could try with a few countries. In that case, you can use %in%.

qog |>
  filter(
    cname %in% c(
      "France",
      "Norway",
      "Spain",
      "Denmark")) |> 
  count(cname)
# A tibble: 4 × 2
  cname       n
  <chr>   <int>
1 Denmark    75
2 France     75
3 Norway     75
4 Spain      75

But we also want to store the data to manipulate it and so it is easier to create an object.

countries <- c("France", "Norway", "Spain")

(qog_subset <- qog |>
   # Keep only the following countries
  filter(
    cname %in% c(
      "France",
      "Norway",
      "Spain",
      "Denmark",
      "Netherlands",
      "Poland",
      "Russia",
      "Saudi Arabia",
      "India"
    ),
    # Keep only the observations that are after 1990
    year > 1990
  ))
# A tibble: 269 × 414
   cname   ccode  year cname_qog ccode_qog ccodealp ccodealp_year ccodecow
   <chr>   <dbl> <dbl> <chr>         <dbl> <chr>    <chr>            <dbl>
 1 Denmark   208  1991 Denmark         208 DNK      DNK91              390
 2 Denmark   208  1992 Denmark         208 DNK      DNK92              390
 3 Denmark   208  1993 Denmark         208 DNK      DNK93              390
 4 Denmark   208  1994 Denmark         208 DNK      DNK94              390
 5 Denmark   208  1995 Denmark         208 DNK      DNK95              390
 6 Denmark   208  1996 Denmark         208 DNK      DNK96              390
 7 Denmark   208  1997 Denmark         208 DNK      DNK97              390
 8 Denmark   208  1998 Denmark         208 DNK      DNK98              390
 9 Denmark   208  1999 Denmark         208 DNK      DNK99              390
10 Denmark   208  2000 Denmark         208 DNK      DNK00              390
# ℹ 259 more rows
# ℹ 406 more variables: ccodevdem <dbl>, cname_year <chr>, version <chr>,
#   act_act <dbl>, as_rifr <dbl>, as_ws <dbl>, bti_envc <dbl>, ccci_coop <dbl>,
#   ccci_em <dbl>, ccci_fin <dbl>, ccci_kyoto <dbl>, ccci_rep <dbl>,
#   ccci_unfccc <dbl>, cckp_rain <dbl>, cckp_temp <dbl>, ccl_exepp <dbl>,
#   ccl_leglp <dbl>, ccl_lpp <dbl>, ccl_mitlpp <dbl>, ccl_nexep <dbl>,
#   ccl_nlegl <dbl>, ccl_nlp <dbl>, ccl_nmitlp <dbl>, edgar_bc <dbl>, …

7.2 Subset rows with slice()

You can also subset rows by their positions using slice().

qog_subset |> slice(5:7) # Subset only rows 5 to 7
# A tibble: 3 × 414
  cname   ccode  year cname_qog ccode_qog ccodealp ccodealp_year ccodecow
  <chr>   <dbl> <dbl> <chr>         <dbl> <chr>    <chr>            <dbl>
1 Denmark   208  1995 Denmark         208 DNK      DNK95              390
2 Denmark   208  1996 Denmark         208 DNK      DNK96              390
3 Denmark   208  1997 Denmark         208 DNK      DNK97              390
# ℹ 406 more variables: ccodevdem <dbl>, cname_year <chr>, version <chr>,
#   act_act <dbl>, as_rifr <dbl>, as_ws <dbl>, bti_envc <dbl>, ccci_coop <dbl>,
#   ccci_em <dbl>, ccci_fin <dbl>, ccci_kyoto <dbl>, ccci_rep <dbl>,
#   ccci_unfccc <dbl>, cckp_rain <dbl>, cckp_temp <dbl>, ccl_exepp <dbl>,
#   ccl_leglp <dbl>, ccl_lpp <dbl>, ccl_mitlpp <dbl>, ccl_nexep <dbl>,
#   ccl_nlegl <dbl>, ccl_nlp <dbl>, ccl_nmitlp <dbl>, edgar_bc <dbl>,
#   edgar_ch4 <dbl>, edgar_co <dbl>, edgar_co2gdp <dbl>, edgar_co2pc <dbl>, …
qog_subset |> slice(-1) # Remove the first row
# A tibble: 268 × 414
   cname   ccode  year cname_qog ccode_qog ccodealp ccodealp_year ccodecow
   <chr>   <dbl> <dbl> <chr>         <dbl> <chr>    <chr>            <dbl>
 1 Denmark   208  1992 Denmark         208 DNK      DNK92              390
 2 Denmark   208  1993 Denmark         208 DNK      DNK93              390
 3 Denmark   208  1994 Denmark         208 DNK      DNK94              390
 4 Denmark   208  1995 Denmark         208 DNK      DNK95              390
 5 Denmark   208  1996 Denmark         208 DNK      DNK96              390
 6 Denmark   208  1997 Denmark         208 DNK      DNK97              390
 7 Denmark   208  1998 Denmark         208 DNK      DNK98              390
 8 Denmark   208  1999 Denmark         208 DNK      DNK99              390
 9 Denmark   208  2000 Denmark         208 DNK      DNK00              390
10 Denmark   208  2001 Denmark         208 DNK      DNK01              390
# ℹ 258 more rows
# ℹ 406 more variables: ccodevdem <dbl>, cname_year <chr>, version <chr>,
#   act_act <dbl>, as_rifr <dbl>, as_ws <dbl>, bti_envc <dbl>, ccci_coop <dbl>,
#   ccci_em <dbl>, ccci_fin <dbl>, ccci_kyoto <dbl>, ccci_rep <dbl>,
#   ccci_unfccc <dbl>, cckp_rain <dbl>, cckp_temp <dbl>, ccl_exepp <dbl>,
#   ccl_leglp <dbl>, ccl_lpp <dbl>, ccl_mitlpp <dbl>, ccl_nexep <dbl>,
#   ccl_nlegl <dbl>, ccl_nlp <dbl>, ccl_nmitlp <dbl>, edgar_bc <dbl>, …
qog_slice <- qog_subset |> slice_max(year, n = 5) # Get the n max results of a variable


qog_subset |> slice_min(year, n = 3) # Get the n min results of a variable
# A tibble: 8 × 414
  cname        ccode  year cname_qog   ccode_qog ccodealp ccodealp_year ccodecow
  <chr>        <dbl> <dbl> <chr>           <dbl> <chr>    <chr>            <dbl>
1 Denmark        208  1991 Denmark           208 DNK      DNK91              390
2 France         250  1991 France (19…       250 FRA      FRA91              220
3 India          356  1991 India             356 IND      IND91              750
4 Netherlands    528  1991 Netherlands       528 NLD      NLD91              210
5 Norway         578  1991 Norway            578 NOR      NOR91              385
6 Poland         616  1991 Poland            616 POL      POL91              290
7 Saudi Arabia   682  1991 Saudi Arab…       682 SAU      SAU91              670
8 Spain          724  1991 Spain             724 ESP      ESP91              230
# ℹ 406 more variables: ccodevdem <dbl>, cname_year <chr>, version <chr>,
#   act_act <dbl>, as_rifr <dbl>, as_ws <dbl>, bti_envc <dbl>, ccci_coop <dbl>,
#   ccci_em <dbl>, ccci_fin <dbl>, ccci_kyoto <dbl>, ccci_rep <dbl>,
#   ccci_unfccc <dbl>, cckp_rain <dbl>, cckp_temp <dbl>, ccl_exepp <dbl>,
#   ccl_leglp <dbl>, ccl_lpp <dbl>, ccl_mitlpp <dbl>, ccl_nexep <dbl>,
#   ccl_nlegl <dbl>, ccl_nlp <dbl>, ccl_nmitlp <dbl>, edgar_bc <dbl>,
#   edgar_ch4 <dbl>, edgar_co <dbl>, edgar_co2gdp <dbl>, edgar_co2pc <dbl>, …
view(qog_subset)

7.3 select() variables

While filter() is useful for keeping only certain groups of rows, select() is used, as its name suggests, to select certain variables (columns) from our dataframe.

Let’s say I’m interested in CO2 emissions per capita, their overall level and how they vary over time and by country. For this, I would need only three variables that are present in the dataset: year, cname and the variable wdi_co2 which comes from the World Bank’s World Development Indicators. To do this, I use select() and just specify which variables of the dataset I want to select.

qog_subset |> 
  select(cname, year,  wdi_co2)
# A tibble: 269 × 3
   cname    year wdi_co2
   <chr>   <dbl>   <dbl>
 1 Denmark  1991   11.7 
 2 Denmark  1992   10.5 
 3 Denmark  1993   11.0 
 4 Denmark  1994   11.7 
 5 Denmark  1995   10.9 
 6 Denmark  1996   13.7 
 7 Denmark  1997   11.6 
 8 Denmark  1998   11.2 
 9 Denmark  1999   10.4 
10 Denmark  2000    9.61
# ℹ 259 more rows

The select() functions offer a lot of possibilities to select variable based on certain patterns by adding functions inside such as starts_with(), ends_with(), contains(). You can learn more by typing ?select.

colnames(qog_subset)
  [1] "cname"               "ccode"               "year"               
  [4] "cname_qog"           "ccode_qog"           "ccodealp"           
  [7] "ccodealp_year"       "ccodecow"            "ccodevdem"          
 [10] "cname_year"          "version"             "act_act"            
 [13] "as_rifr"             "as_ws"               "bti_envc"           
 [16] "ccci_coop"           "ccci_em"             "ccci_fin"           
 [19] "ccci_kyoto"          "ccci_rep"            "ccci_unfccc"        
 [22] "cckp_rain"           "cckp_temp"           "ccl_exepp"          
 [25] "ccl_leglp"           "ccl_lpp"             "ccl_mitlpp"         
 [28] "ccl_nexep"           "ccl_nlegl"           "ccl_nlp"            
 [31] "ccl_nmitlp"          "edgar_bc"            "edgar_ch4"          
 [34] "edgar_co"            "edgar_co2gdp"        "edgar_co2pc"        
 [37] "edgar_co2t"          "edgar_n2o"           "edgar_nh3"          
 [40] "edgar_nmvoc"         "edgar_nox"           "edgar_oc"           
 [43] "edgar_pm10"          "edgar_pm25"          "edgar_so2"          
 [46] "edi_edi"             "edi_gaarr"           "edi_gadrei"         
 [49] "edi_gaerr"           "edi_gair"            "edi_gapc"           
 [52] "edi_gbs"             "edi_gee"             "edi_gepd"           
 [55] "edi_gepp"            "edi_gewi"            "edi_gftir"          
 [58] "edi_ggr"             "edi_gicm"            "edi_gip"            
 [61] "edi_gipirm"          "edi_gira"            "edi_gpajad"         
 [64] "edi_gper"            "edi_gppa"            "edi_gppc"           
 [67] "edi_gppr"            "edi_grpcspa"         "edi_gser"           
 [70] "edi_jp"              "edi_pati"            "edi_pp"             
 [73] "ef_bcpc"             "ef_bct"              "ef_bul"             
 [76] "ef_bul_bc"           "ef_bulp"             "ef_carb"            
 [79] "ef_carb_bc"          "ef_carbp"            "ef_crop"            
 [82] "ef_crop_bc"          "ef_cropp"            "ef_ef"              
 [85] "ef_efp"              "ef_eft"              "ef_eftp"            
 [88] "ef_fg"               "ef_fg_bc"            "ef_fgp"             
 [91] "ef_for"              "ef_for_bc"           "ef_forp"            
 [94] "ef_gl"               "ef_gl_bc"            "ef_glp"             
 [97] "em_envmin"           "emdat_damage"        "emdat_naffect"      
[100] "emdat_ndeath"        "emdat_ndis"          "emdat_nhome"        
[103] "emdat_ninj"          "emdat_ntotaff"       "engo_nengo"         
[106] "epc_bath"            "epc_car"             "epc_co2"            
[109] "epc_dete"            "epc_enef"            "epc_ener"           
[112] "epc_fors"            "epc_lead"            "epc_nois"           
[115] "epc_pawa"            "epc_soil"            "epc_watp"           
[118] "epcc_audi_ch2"       "epcc_audi_in2"       "epcc_bath_ch2"      
[121] "epcc_bath_in2"       "epcc_car_co"         "epcc_car_hc"        
[124] "epcc_car_nox"        "epcc_care_ch2"       "epcc_care_in2"      
[127] "epcc_cd_dwsum"       "epcc_cd_upsum"       "epcc_ch_kum"        
[130] "epcc_ch2"            "epcc_cont_ch2"       "epcc_cont_in2"      
[133] "epcc_cowa_ch2"       "epcc_cowa_in2"       "epcc_dete_ch2"      
[136] "epcc_dete_in2"       "epcc_ecol_ch2"       "epcc_ecol_in2"      
[139] "epcc_eias_ch2"       "epcc_eias_in2"       "epcc_enef_ch2"      
[142] "epcc_enef_in2"       "epcc_glas_ch2"       "epcc_glas_in2"      
[145] "epcc_glas2_s"        "epcc_intro_kum"      "epcc_land_ch2"      
[148] "epcc_lanr_in2"       "epcc_lcp_dust"       "epcc_lcp_nox"       
[151] "epcc_lcp_so2"        "epcc_lcpt_ch2"       "epcc_lcpt_in2"      
[154] "epcc_lead_ch2"       "epcc_lead_in2"       "epcc_lead_s"        
[157] "epcc_moto_ch2"       "epcc_moto_in2"       "epcc_moto_s"        
[160] "epcc_nois_ch2"       "epcc_nois_in2"       "epcc_nois_s"        
[163] "epcc_pact_ch2"       "epcc_pact_in2"       "epcc_pape_ch2"      
[166] "epcc_pape_in2"       "epcc_pape2_s"        "epcc_soil_ch2"      
[169] "epcc_soil_in2"       "epcc_sulp_ch2"       "epcc_sulp_in2"      
[172] "epcc_sulp_s"         "epcc_susp_ch2"       "epcc_susp_in2"      
[175] "epcc_wabo_s"         "epcc_waco_s"         "epcc_wacr_s"        
[178] "epcc_waef_ch2"       "epcc_waef_in2"       "epcc_wale_s"        
[181] "epcc_wapr_ch2"       "epcc_wapr_in2"       "epcc_wazi_s"        
[184] "epi_agr"             "epi_air"             "epi_ape"            
[187] "epi_bca"             "epi_bdh"             "epi_bhv"            
[190] "epi_cch"             "epi_cda"             "epi_cha"            
[193] "epi_ecs"             "epi_eh"              "epi_epi"            
[196] "epi_ev"              "epi_fct"             "epi_fga"            
[199] "epi_fsh"             "epi_fss"             "epi_ghp"            
[202] "epi_gib"             "epi_grl"             "epi_h2o"            
[205] "epi_had"             "epi_hmt"             "epi_lcb"            
[208] "epi_mpa"             "epi_msw"             "epi_mti"            
[211] "epi_noa"             "epi_nxa"             "epi_ozd"            
[214] "epi_par"             "epi_pbd"             "epi_pmd"            
[217] "epi_sda"             "epi_shi"             "epi_snm"            
[220] "epi_spi"             "epi_tbg"             "epi_tbn"            
[223] "epi_tcl"             "epi_usd"             "epi_uwd"            
[226] "epi_wmg"             "epi_wrs"             "epi_wtl"            
[229] "epi_wwt"             "ess_banhhap_m"       "ess_ccnthum_p"      
[232] "ess_ccrdprs_m"       "ess_clmchng_p"       "ess_clmthgt_m"      
[235] "ess_gvsrdcc_m"       "ess_impenv_m"        "ess_inctxff_m"      
[238] "ess_lklmten_m"       "ess_sbsrnen_m"       "ess_wrclmch_m"      
[241] "fao_luagr"           "fao_luagrara"        "fao_luagrcrop"      
[244] "fao_luagrirrac"      "fao_luagrirreq"      "fao_luagrirreqcrop" 
[247] "fao_luagrorg"        "fao_luagrpas"        "fao_luagrpcrop"     
[250] "fao_lucrop"          "fao_luforest"        "fao_luforplant"     
[253] "fao_luforreg"        "fao_lupas"           "gg_asew_pop"        
[256] "gg_asewp"            "gg_asews"            "gg_asewt"           
[259] "gg_buapc"            "gg_buapt"            "gg_ei"              
[262] "gg_envrd_gbaord"     "gg_envrd_gdp"        "gg_eoda"            
[265] "gg_erdgdp"           "gg_etp"              "gg_etpw"            
[268] "gg_ffrd"             "gg_frs"              "gg_fsmc"            
[271] "gg_iufr"             "gg_mao"              "gg_ml"              
[274] "gg_mpm"              "gg_mr"               "gg_mwgpc"           
[277] "gg_mwipt"            "gg_mwlpt"            "gg_mwrpt"           
[280] "gg_oda_ccm"          "gg_pm25ex10p"        "gg_pm25ex35p"       
[283] "gg_pm25exm"          "gg_pt"               "gg_re_tpes"         
[286] "gg_reperegen"        "gg_rerd_erd"         "gg_tbs"             
[289] "gg_tms"              "gg_tps"              "gg_wsa"             
[292] "gg_wsi"              "iead_eif1"           "iead_eif2"          
[295] "iead_eif3"           "iead_inforce"        "iead_inforce_noterm"
[298] "iead_rat"            "iead_sig"            "iead_term"          
[301] "iead_withdraw1"      "iead_withdraw2"      "issp_10am"          
[304] "issp_12ap"           "issp_12bp"           "issp_12cp"          
[307] "issp_13am"           "issp_13bm"           "issp_13em"          
[310] "issp_13gm"           "issp_15ap"           "issp_18p"           
[313] "issp_19am"           "issp_19bm"           "issp_1ap"           
[316] "issp_20am"           "issp_20ap"           "issp_20dm"          
[319] "issp_21p"            "issp_22ap"           "issp_22bp"          
[322] "issp_22cp"           "issp_6m"             "issp_8am"           
[325] "issp_8bm"            "issp_9am"            "nrmi_nrpi"          
[328] "oecd_cctr_gdp"       "oecd_cctr_tot"       "oecd_eampg"         
[331] "oecd_epea"           "oecd_eps"            "oecd_etr_gdp"       
[334] "oecd_etr_tot"        "oecd_pm25ex15p"      "oecd_pm25ex25p"     
[337] "oecd_polagdpg"       "ohi_aoacc"           "ohi_caacid"         
[340] "ohi_chp"             "ohi_csslr"           "ohi_csst"           
[343] "ohi_cuv"             "ohi_fah"             "ohi_fchb"           
[346] "ohi_fclb"            "ohi_hab"             "ohi_habcom"         
[349] "ohi_habeez"          "ohi_hdinter"         "ohi_hshb"           
[352] "ohi_hssb"            "ohi_lpai"            "ohi_lpao"           
[355] "ohi_maricul"         "ohi_npblast"         "ohi_npcyan"         
[358] "ohi_ohi"             "ohi_pc3"             "ohi_pchem"          
[361] "ohi_pn3"             "ohi_pnutrient"       "ohi_ptrash"         
[364] "ohi_saali"           "ohi_tjpt"            "ohi_tour"           
[367] "ohi_trsust"          "ohi_water"           "ross_gas_exp"       
[370] "ross_gas_netexp"     "ross_gas_netexpc"    "ross_gas_price"     
[373] "ross_gas_prod"       "ross_gas_value_2000" "ross_gas_value_2014"
[376] "ross_oil_exp"        "ross_oil_netexp"     "ross_oil_netexpc"   
[379] "ross_oil_price"      "ross_oil_prod"       "ross_oil_value_2000"
[382] "ross_oil_value_2014" "sgi_en"              "sgi_enen"           
[385] "sgi_enge"            "sgi_epe"             "sgi_ger"            
[388] "slaws_mit_ex_l3"     "slaws_mit_ex_lt"     "slaws_mit_l3"       
[391] "slaws_mit_leg_l3"    "slaws_mit_leg_lt"    "slaws_mit_lt"       
[394] "vparty_envseat"      "vparty_envvote"      "wdi_agrland"        
[397] "wdi_araland"         "wdi_area"            "wdi_areabelow"      
[400] "wdi_co2"             "wdi_forest"          "wdi_fossil"         
[403] "wdi_idpdis"          "wdi_piesr"           "wdi_precip"         
[406] "wdi_tpa"             "wvs_ameop"           "wvs_ceom"           
[409] "wvs_deop"            "wvs_epmip"           "wvs_epmpp"          
[412] "wvs_imeop"           "wvs_pedp"            "wvs_ploem"          
qog_wdi <- qog_subset |> 
  # Select the country, the year and every variable that starts with wdi. 
  select(cname, year, starts_with("wdi"))

# Look at which columns we have in our dataset

qog_wdi |> colnames()
 [1] "cname"         "year"          "wdi_agrland"   "wdi_araland"  
 [5] "wdi_area"      "wdi_areabelow" "wdi_co2"       "wdi_forest"   
 [9] "wdi_fossil"    "wdi_idpdis"    "wdi_piesr"     "wdi_precip"   
[13] "wdi_tpa"      
# Select all of the variable that do not starts with wdi

qog_wdi |> select(-starts_with("wdi")) |> colnames()
[1] "cname" "year" 
# Select column by position (from 1 to 5)

qog_wdi |> select(1:5) |> colnames()
[1] "cname"       "year"        "wdi_agrland" "wdi_araland" "wdi_area"   
qog_wdi |> colnames()
 [1] "cname"         "year"          "wdi_agrland"   "wdi_araland"  
 [5] "wdi_area"      "wdi_areabelow" "wdi_co2"       "wdi_forest"   
 [9] "wdi_fossil"    "wdi_idpdis"    "wdi_piesr"     "wdi_precip"   
[13] "wdi_tpa"      
qog_wdi |> 
  # Select only the variables that have numeric as type
  select_if(is.character) |> 
  # Look at which columns we have in our dataset
  colnames()
[1] "cname"

7.4 Change column names with rename()

Occasionally, you may encounter datasets with columns named in a non-meaningful manner, such as v5, x45, or Q234, often representing variable identifiers from surveys. To enhance the understanding of the variables’ meanings, it can prove beneficial to rename these columns using the rename() function. In our current dataset, the variable names are already quite informative. However, for the purpose of this illustration, let’s rename the ‘cname’ variable as ‘country’, offering a more explicit label.

qog_renamed <- qog_wdi |> 
  # Rename the variable cname as country
  rename(country = cname) 


qog_renamed |> 
  colnames()
 [1] "country"       "year"          "wdi_agrland"   "wdi_araland"  
 [5] "wdi_area"      "wdi_areabelow" "wdi_co2"       "wdi_forest"   
 [9] "wdi_fossil"    "wdi_idpdis"    "wdi_piesr"     "wdi_precip"   
[13] "wdi_tpa"      
# More complex ! 

qog_renamed <-  qog_renamed |> 
  # Remove wdi_ from all of the variables names that starts with wdi
  rename_with(~ str_remove(.x, "wdi_"), starts_with("wdi")) 

qog_renamed |> colnames()
 [1] "country"   "year"      "agrland"   "araland"   "area"      "areabelow"
 [7] "co2"       "forest"    "fossil"    "idpdis"    "piesr"     "precip"   
[13] "tpa"      

7.5 Calculating statistics by group with group_by() and summarise()

To compute a first set of descriptive statistics, we could look at the central indicators of the co2 emissions variable (wdi_co2).

mean(qog_wdi$wdi_co2, na.rm = TRUE) # Compute the mean
[1] 8.727978
median(qog_wdi$wdi_co2, na.rm = TRUE) # Compute the median
[1] 8.664999
sd(qog_wdi$wdi_co2, na.rm = TRUE) # Compute the standard deviation
[1] 4.133231
quantile(qog_wdi$wdi_co2, na.rm = T) # Compute the quantiles
        0%        25%        50%        75%       100% 
 0.7384827  6.0988783  8.6649986 11.1314308 20.4022516 

However, we are often interested in how these indicators vary across groups such as country or year. To do so, use the summarise() function in R that allows you to compute new variables by groups. To choose grouping variable, we use first group_by() where we specify for which group we want to compute something. Here I use this function to compute the mean, the median, the standard deviation and the first and third quartile of co2 emissions per capita for each country. This gives us a new tibble with all of the summary information.

(qog_stats <- qog_wdi |> 
   group_by(cname) |> 
  summarise(mean_co2 = mean(wdi_co2, na.rm = TRUE), 
            median_co2 = median(wdi_co2, na.rm = TRUE),
            sd_co2 = sd(wdi_co2, na.rm = TRUE),
            q1 = quantile(wdi_co2, 0.25, na.rm = TRUE),
            q3 = quantile(wdi_co2, 0.25, na.rm = TRUE)))
# A tibble: 9 × 6
  cname        mean_co2 median_co2 sd_co2     q1     q3
  <chr>           <dbl>      <dbl>  <dbl>  <dbl>  <dbl>
1 Denmark          9.31       9.65  2.14   8.11   8.11 
2 France           5.82       5.93  0.488  5.63   5.63 
3 India            1.15       1.01  0.337  0.917  0.917
4 Netherlands     10.9       11.0   0.488 10.6   10.6  
5 Norway           8.94       8.98  1.09   8.16   8.16 
6 Poland           8.29       8.21  0.567  7.88   7.88 
7 Russia          11.5       11.4   0.908 10.9   10.9  
8 Saudi Arabia    16.0       16.7   2.65  14.4   14.4  
9 Spain            6.46       6.15  0.981  5.79   5.79 

To get nice tables when you render your quarto document, you can use the knitr::kable() function.

qog_stats |> knitr::kable()
cname mean_co2 median_co2 sd_co2 q1 q3
Denmark 9.306369 9.647554 2.1354587 8.1058071 8.1058071
France 5.815746 5.926466 0.4877268 5.6274657 5.6274657
India 1.154740 1.005567 0.3372283 0.9172273 0.9172273
Netherlands 10.945372 11.040789 0.4877648 10.5946611 10.5946611
Norway 8.940977 8.983080 1.0866764 8.1622779 8.1622779
Poland 8.291731 8.205371 0.5672529 7.8777194 7.8777194
Russia 11.508477 11.443988 0.9082599 10.8866443 10.8866443
Saudi Arabia 16.012722 16.656378 2.6470291 14.4207311 14.4207311
Spain 6.458592 6.149608 0.9813928 5.7886348 5.7886348

It is also possible to sort the results to gain a clearer idea of which countries have the highest average CO2 emissions per capita. To achieve this, we utilize the arrange() function, specifying the variable by which we intend to sort. By doing so, we observe that, on average, Saudi Arabia, Russia, and the Netherlands possess the highest carbon footprint per capita among the selected countries.

qog_stats |> 
  arrange(-median_co2)
# A tibble: 9 × 6
  cname        mean_co2 median_co2 sd_co2     q1     q3
  <chr>           <dbl>      <dbl>  <dbl>  <dbl>  <dbl>
1 Saudi Arabia    16.0       16.7   2.65  14.4   14.4  
2 Russia          11.5       11.4   0.908 10.9   10.9  
3 Netherlands     10.9       11.0   0.488 10.6   10.6  
4 Denmark          9.31       9.65  2.14   8.11   8.11 
5 Norway           8.94       8.98  1.09   8.16   8.16 
6 Poland           8.29       8.21  0.567  7.88   7.88 
7 Spain            6.46       6.15  0.981  5.79   5.79 
8 France           5.82       5.93  0.488  5.63   5.63 
9 India            1.15       1.01  0.337  0.917  0.917
qog_stats |> 
  arrange(mean_co2)
# A tibble: 9 × 6
  cname        mean_co2 median_co2 sd_co2     q1     q3
  <chr>           <dbl>      <dbl>  <dbl>  <dbl>  <dbl>
1 India            1.15       1.01  0.337  0.917  0.917
2 France           5.82       5.93  0.488  5.63   5.63 
3 Spain            6.46       6.15  0.981  5.79   5.79 
4 Poland           8.29       8.21  0.567  7.88   7.88 
5 Norway           8.94       8.98  1.09   8.16   8.16 
6 Denmark          9.31       9.65  2.14   8.11   8.11 
7 Netherlands     10.9       11.0   0.488 10.6   10.6  
8 Russia          11.5       11.4   0.908 10.9   10.9  
9 Saudi Arabia    16.0       16.7   2.65  14.4   14.4  

Now I want to know when emissions per capita levels were the lowest in Saudia Arabia. I can filter only this country, select the variables of interest and sort the dataframe. We observe that the CO2 emissions per capita were at their minimum towards the late 1990s (which is not good).

qog_wdi |> 
  filter(cname == "Saudi Arabia") |> 
  select(cname, year, wdi_co2) |> 
  arrange(wdi_co2)
# A tibble: 30 × 3
   cname         year wdi_co2
   <chr>        <dbl>   <dbl>
 1 Saudi Arabia  1998    10.5
 2 Saudi Arabia  1997    11.1
 3 Saudi Arabia  1999    11.2
 4 Saudi Arabia  1995    12.6
 5 Saudi Arabia  1996    13.6
 6 Saudi Arabia  2001    14.0
 7 Saudi Arabia  2000    14.4
 8 Saudi Arabia  2003    14.6
 9 Saudi Arabia  2002    15.0
10 Saudi Arabia  2007    15.4
# ℹ 20 more rows

We can also compute descriptive statistics on several variables at the same time with the across() function.

qog_growth <- qog_wdi |> 
  group_by(cname) |> 
  summarise(across(c(wdi_co2, wdi_fossil),~ mean(.x, na.rm = T)))

qog_growth
# A tibble: 9 × 3
  cname        wdi_co2 wdi_fossil
  <chr>          <dbl>      <dbl>
1 Denmark         9.31       84.7
2 France          5.82       52.0
3 India           1.15       64.9
4 Netherlands    10.9        94.1
5 Norway          8.94       56.4
6 Poland          8.29       94.7
7 Russia         11.5        91.3
8 Saudi Arabia   16.0        99.9
9 Spain           6.46       78.9

7.5.1 Mutate

We can also create new variables based on other ones with mutate(). Let’s say I want to compute the growth rate of co2 emissions per capita every year. For this, I will use mutate() to create new variables.

qog_wdi |> 
  select(cname, year, wdi_fossil) |> 
  mutate(dataset = "QOG dataset")
# A tibble: 269 × 4
   cname    year wdi_fossil dataset    
   <chr>   <dbl>      <dbl> <chr>      
 1 Denmark  1991       94.1 QOG dataset
 2 Denmark  1992       90.7 QOG dataset
 3 Denmark  1993       91.8 QOG dataset
 4 Denmark  1994       94.6 QOG dataset
 5 Denmark  1995       92.4 QOG dataset
 6 Denmark  1996       98.5 QOG dataset
 7 Denmark  1997       94.4 QOG dataset
 8 Denmark  1998       92.7 QOG dataset
 9 Denmark  1999       90.9 QOG dataset
10 Denmark  2000       88.3 QOG dataset
# ℹ 259 more rows

To compute the growth rate, we need for each year the emissions per capita of that year and those from the previous year, which I access with lag().

qog_wdi <- qog_wdi |>
  select(cname, year, wdi_co2) |> 
  mutate(co2_lag = lag(wdi_co2),
        co2_growth = (wdi_co2 - co2_lag) / co2_lag*100)

7.6 Recode values with case_when

Mutate() is also widely used to recode variable. Here we create a new trajectory variable and we use case_when() to recode whether there co2_emissions are growing or not for a given year.

qog_wdi <- qog_wdi |> 
  mutate(
    trajectory = case_when(
      co2_growth > 0 ~ "Bad",
      co2_growth < 0 ~ "Good"
    )
  )

view(qog_wdi)

qog_wdi |> 
  count(trajectory)
# A tibble: 3 × 2
  trajectory     n
  <chr>      <int>
1 Bad          110
2 Good         112
3 <NA>          47

7.7 Small exercice

Find the countries that are the most affected by natural disasters (hint : look at the variables starting with emdat)