4  R basics

4.1 R as calculator

First of all, R is a fancy calculator that can be used to perform basic arithmetic operations.

3+7+10 # Addition
[1] 20
4-5 # Subtraction
[1] -1
3*9*10 # Multiplication
[1] 270
2/6 # Division
[1] 0.3333333
2^2 # Exponentiation
[1] 4
(2+2)-(4*4)/2^2 # Mix of operations
[1] 0

4.2 Objects

The results of these operations are shown in the console whenever the code is run. However, these results cannot be directly reused in further operations, which is often something we would like to do. To achieve this, we make use of objects. Objects are like containers that we assign values to and keep within our programming environment. Think of them as a way to store information for later use. To create an object in R, we use the assignment operator <-.

my_object <- 2
my_object
[1] 2

Let us consider an example from the recent Spanish elections held in July 2023. The mainstream right party emerged as the frontrunner in the election and had the potential to secure an absolute majority in the Congreso, the lower house of the Spanish parliament. This could have been achieved by forming a coalition with the radical right party Vox. However, they fell short of this goal due to a lower-than-expected seat count. The code below uses objects to store the number of seats obtained by these parties and to calculate how many seats they were missing to reach a majority in the Congreso.

majority <- 350/2+1 # Half of the seats in the Congreso + 1
pp_seats <- 137 # Seats gained by the mainstream right party (PP)
vox_seats <- 33 # Seats gained by the radical right party (Vox)
right_seats <- pp_seats + vox_seats
majority - right_seats
[1] 6

After executing these lines, the distinct objects are preserved within the environment pane located in the upper-right section of RStudio.
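
Because objects stay in the environment until they are removed or R is restarted, they can be inspected and reused later on. The lines below are a minimal sketch of this, repeating the values from above; `right_share` and `scratch` are names introduced only for this illustration, while exists() and rm() are base R functions.

```r
# Recreate the objects from the example above
majority <- 350 / 2 + 1
pp_seats <- 137
vox_seats <- 33
right_seats <- pp_seats + vox_seats

# exists() checks whether an object with that name is currently defined
exists("right_seats")

# Stored objects can be reused in further calculations
right_share <- right_seats / 350 * 100
round(right_share, 1)  # share of seats held by PP and Vox, in percent

# Assigning to an existing name overwrites the old value
scratch <- 1
scratch <- scratch + 1
scratch

# rm() deletes an object from the environment
rm(scratch)
exists("scratch")
```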

Exercise

Try to do the same with the two main left-wing parties, PSOE and Sumar, which obtained 121 and 31 seats respectively.

Solution:

psoe_seats <- 121
sumar_seats <- 31
left_seats <- psoe_seats + sumar_seats
majority - left_seats
[1] 24

Object names

Note that I have written the names of objects with underscores. There are different conventions for writing object names in R. I personally use snake case, which uses lowercase letters and underscores to separate words.

4.3 Vectors

The objects we have used so far contained only one numeric value. However, what we mostly manipulate in R are vectors, which are sequences of values on which we can perform operations. Vectors can be of different types (e.g. numeric, character, logical, date), but all the values within one vector have to be of the same type. Vectors are also one-dimensional, which means they contain a single sequence of values and not several, as matrices do.

We can generate vectors with c() which stands for “concatenate”.

coalition_seats <- c(right_seats, left_seats)
coalition_seats
[1] 170 152

We can also store vectors as objects and do operations on them.

majority - coalition_seats
[1]  6 24
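
The same-type rule and the element-by-element behaviour of operations can be sketched as follows. This is only an illustration: `mixed` is a made-up vector, and the seat totals are typed in directly so that the lines run on their own.

```r
# Mixing types inside c() forces every value to one common type
mixed <- c(1, "two", TRUE)
mixed
class(mixed)  # the number and the logical have been turned into strings

# Operations on a vector are applied element by element
coalition_seats <- c(170, 152)
coalition_seats / 350 * 100  # seat share of each bloc, in percent
coalition_seats + c(1, 2)    # pairs elements up: 170 + 1 and 152 + 2

# length() gives the number of elements in a vector
length(coalition_seats)
```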

4.3.1 Character vectors

So far, we’ve only used numerical vectors, made up of numbers. But we can also create character vectors, made up of strings written between quotes (either ' or "). For instance, I can create two different vectors of Spanish candidates, one for the left and one for the right.

left_candidates <- c("Sanchez", "Diaz")
right_candidates <- c("Feijoo", "Abascal")

As with other vectors, you can combine them into a single vector, which will contain the names of all the politicians.

candidates <- c(left_candidates, right_candidates)
candidates
[1] "Sanchez" "Diaz"    "Feijoo"  "Abascal"
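
A few base R functions are handy for a first look at character vectors. The sketch below retypes the vector so that it runs on its own; the "Candidate:" label is just an illustrative string.

```r
candidates <- c("Sanchez", "Diaz", "Feijoo", "Abascal")

length(candidates)  # number of elements in the vector
nchar(candidates)   # number of characters in each string

# paste() glues strings together element by element
paste("Candidate:", candidates)

# A number written between quotes is a string, not a number
class("137")
class(137)
```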

4.3.2 Logical vectors

Another type of vector in R is the logical vector, which is made up of Boolean values: TRUE or FALSE.

c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

These logical vectors are useful when we want to evaluate whether a condition is true or not. For instance, we could check whether each value of left_candidates is equal to “Diaz”, or check whether the PP obtained fewer seats than Vox.

left_candidates
[1] "Sanchez" "Diaz"   
left_candidates == "Diaz"
[1] FALSE  TRUE
pp_seats
[1] 137
vox_seats
[1] 33
pp_seats < vox_seats
[1] FALSE
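
Logical vectors can also be summarised and combined. Here is a small sketch, assuming a `seats` vector that holds the seats of PP, PSOE, Vox and Sumar in that order; the thresholds are arbitrary and only serve the illustration.

```r
seats <- c(137, 121, 33, 31)  # PP, PSOE, Vox, Sumar

big <- seats > 100  # one TRUE or FALSE per party
big

sum(big)   # TRUE counts as 1, so this counts the parties above 100 seats
mean(big)  # proportion of parties above 100 seats

any(seats >= 176)  # did any single party reach the majority of 176?
all(seats > 30)    # did every party win more than 30 seats?

# Conditions can be combined with & (and), | (or) and ! (not)
seats > 30 & seats < 130
```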

4.3.3 Indexing

When we manipulate vectors, we often want to access specific elements. This is called indexing, and it is done with square brackets []. You can index either by position or by condition. When I write candidates[3], I ask for the third element of the candidates vector: this is indexing by position. When I write candidates[candidates == "Abascal"], I index by condition, because I ask for the elements whose value is "Abascal".

candidates
[1] "Sanchez" "Diaz"    "Feijoo"  "Abascal"
candidates[4] # Get the fourth element of the vector
[1] "Abascal"
candidates[-3] # Get everything but the third element of the vector
[1] "Sanchez" "Diaz"    "Abascal"
candidates[c(1,4)] # Get the first and the fourth elements of the vector
[1] "Sanchez" "Abascal"
candidates[1:3] # Get elements from the first to the third
[1] "Sanchez" "Diaz"    "Feijoo" 
candidates[candidates == "Abascal"] # Which has Abascal as value
[1] "Abascal"
candidates[candidates != "Abascal"] # Which does not have Abascal as value
[1] "Sanchez" "Diaz"    "Feijoo" 
candidates[candidates %in% c("Abascal", "Feijoo")] # Which has Abascal or Feijoo as value
[1] "Feijoo"  "Abascal"
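
Vectors can also carry names, which allows indexing by name in the literal sense. The following is a minimal sketch; the named `seats` vector is an assumption made for the illustration and is not used elsewhere in this chapter.

```r
# A named numeric vector: each value is labelled with its party
seats <- c(PP = 137, PSOE = 121, Vox = 33, Sumar = 31)
names(seats)

seats["Vox"]           # index by name
seats[c("PP", "Vox")]  # several names at once
seats[seats > 100]     # index by condition; the names are kept

# which() returns the positions where a condition is TRUE
which(seats > 100)
```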

4.4 Functions

To manipulate vectors and conduct operations on them, we use functions. A function is a reusable block of code that performs a specific task: it takes one or several input values, called arguments, and produces an output. Let’s say you want to calculate the mean number of seats Vox has obtained across the Spanish regions. You could calculate the sum of the seats and divide it by the number of regions. But you could also just use the mean() function that exists in R and that takes a vector of numbers as argument.

vox_regions <- c(9, 1, 1, 1, 0, 1, 1, 1, 3, 2, 0, 1, 0, 0, 5, 2, 0, 5) 
mean(vox_regions)
[1] 1.833333
mean_vox <- mean(vox_regions)
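
The manual route described above (sum divided by the number of values) can be sketched as follows, to check that it matches mean(). The names `total`, `n` and `manual_mean` are introduced only for this illustration.

```r
vox_regions <- c(9, 1, 1, 1, 0, 1, 1, 1, 3, 2, 0, 1, 0, 0, 5, 2, 0, 5)

total <- sum(vox_regions)  # sum of all the values
n <- length(vox_regions)   # number of values
manual_mean <- total / n
manual_mean

# all.equal() compares two numbers while tolerating tiny rounding differences
all.equal(manual_mean, mean(vox_regions))

# Arguments can be passed by name, here the digits argument of round()
round(manual_mean, digits = 2)
```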

In R, functions often expect inputs of specific types. If you pass a character vector containing numbers stored as strings to a function that expects a numeric vector, it may not behave as expected. As shown below, the function returns NA, which means “Not Available”.

vox_regions2 <- c("9", "1", "1", "1", "0", "1", "1", "1", "3", "2", "0", "1", "0", "0", "5", "2", "0", "5") 
mean(vox_regions2)
Warning in mean.default(vox_regions2): argument is not numeric or logical:
returning NA
[1] NA

Similarly, computing the sum of the vox_regions vector will work as expected, but trying to calculate the sum of our candidates character vector, which contains the candidates’ names, will return an error.

sum(vox_regions)
[1] 33
sum(candidates) # This is an error
Error in sum(candidates): invalid 'type' (character) of argument

If you are not sure about the type of your vectors, you can check with the class() function that will give you the answer.

class(vox_regions)
[1] "numeric"
class(right_candidates)
[1] "character"

Sometimes, a vector does not have the right type for the operation we want to perform. To check the type of a vector, you can use the family of is. functions, such as is.numeric() and is.character(), which return a logical value (TRUE or FALSE). In case the vector is not of the right type for our purpose, we can try to coerce it with the family of as. functions, such as as.numeric() and as.character().

is.numeric(vox_regions2) # Check if numeric
[1] FALSE
vox_regions3 <- as.numeric(vox_regions2) # Coerce to numeric
is.numeric(vox_regions3) # Check again if numeric
[1] TRUE
mean(vox_regions3) # Compute the mean
[1] 1.833333
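
Coercion does not always succeed. As a hedged illustration with made-up input (`raw_seats` is hypothetical and not part of the election data), a string that cannot be read as a number becomes NA, and R emits a warning about it:

```r
raw_seats <- c("9", "1", "n/a", "3")  # hypothetical raw input

# "n/a" cannot be converted, so it becomes NA (with a warning, silenced here)
clean_seats <- suppressWarnings(as.numeric(raw_seats))
clean_seats
is.numeric(clean_seats)

# Coercion also works in other directions
as.character(c(137, 33))     # numbers to strings
as.numeric(c(TRUE, FALSE))   # TRUE becomes 1, FALSE becomes 0
```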

The functions that you will find in R have all been created by someone. You can also create your own functions in R. You usually start doing this when you are more advanced, so do not worry if you find it hard: it is just for you to know that it is possible. Here I create a simplified alternative function to calculate a mean in R.

compute_mean <- function(x) {
  # Divide the sum of the values by the number of values
  result <- sum(x) / length(x)
  paste0("The mean of this vector is ", result)
}
compute_mean(vox_regions)
[1] "The mean of this vector is 1.83333333333333"
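
Functions can take several arguments, and some of them can have default values. As another sketch, here is a hypothetical helper, `seats_to_majority()`, that reuses the majority calculation from section 4.2. Its name and its `total_seats` argument are my own choices for the illustration.

```r
# seats: number of seats held; total_seats: size of the chamber (350 by default)
seats_to_majority <- function(seats, total_seats = 350) {
  majority <- total_seats / 2 + 1
  majority - seats
}

seats_to_majority(170)          # right bloc, using the default of 350 seats
seats_to_majority(c(170, 152))  # works on a whole vector at once

# The default can be overridden, here for an imaginary 100-seat chamber
seats_to_majority(40, total_seats = 100)
```

Returning a number rather than a sentence, as compute_mean() does, makes the result easier to reuse in further calculations.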

4.5 Missing values

In R, a missing value is represented by the symbol NA, which stands for “Not Available.” Missing values can arise for a variety of reasons, such as data not being observed or recorded, errors in data collection, or intentional omissions. Understanding and handling missing values is crucial because they can influence the results of your analysis or even cause some functions to return errors. For instance, imagine I haven’t found any information about Vox’s score in one of the Spanish regions, but I want to retain this information in my vector. So, I add an NA to it.

vox_regions <- c(vox_regions, NA)
vox_regions
 [1]  9  1  1  1  0  1  1  1  3  2  0  1  0  0  5  2  0  5 NA

When analyzing data, it’s not uncommon to encounter NA values, and it’s important to be aware of them. To check if a vector contains NA values, you can use the is.na() function. This function returns a logical vector indicating whether each value is NA (TRUE) or not (FALSE).

is.na(vox_regions) # Check which values of a vector are NAs
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

This is important because certain functions will not operate properly if there are NA values in your data. For instance, the mean() function will return NA if the data contains any NA values.

mean(vox_regions)
[1] NA

To deal with NA values, the mean() function has an na.rm argument. When it is set to TRUE, the NA values are removed before the mean is computed.

mean(vox_regions, na.rm = TRUE) # Remove NA before computing the mean
[1] 1.833333
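
A few other base R tools help to locate and handle missing values. The sketch below retypes the vector so that it runs on its own; `complete` is a name introduced only for the illustration.

```r
vox_regions <- c(9, 1, 1, 1, 0, 1, 1, 1, 3, 2, 0, 1, 0, 0, 5, 2, 0, 5, NA)

anyNA(vox_regions)         # is there at least one NA?
sum(is.na(vox_regions))    # how many NAs are there?
which(is.na(vox_regions))  # at which positions?

# Keep only the values that are not missing
complete <- vox_regions[!is.na(vox_regions)]
length(complete)

# Like mean(), sum() also accepts na.rm
sum(vox_regions, na.rm = TRUE)

# Comparisons involving NA are themselves NA
NA > 1
```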

4.6 Packages and libraries

The functions we’ve discussed so far, such as sum() and mean(), come from base R. These are pre-loaded functions available immediately upon starting R. However, many functions you’ll encounter aren’t part of base R but instead belong to specific packages that individuals or groups have developed. You can think of packages as collections of functions crafted to simplify certain tasks or to introduce new capabilities to R. For example, there’s the tidyverse package, which I asked you to install before the class.

To install a package in R, you can use the install.packages() function, passing the name of the package in quotation marks (either single or double). I recommend doing this installation in the console since you don’t need to save this step; it’s a one-time action. However, every time you start your script or Quarto document, you’ll need to load the package. To do this, use the library() function, providing the package name as an argument, but without the quotation marks.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
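
Since installing is a one-time step while loading has to be repeated in every session, it can be useful to check whether a package is already installed. This is a minimal sketch, not a required step: `is_installed()` is a hypothetical helper name, built on base R's requireNamespace(), and the installation line is left as a comment so that nothing gets installed by accident.

```r
# requireNamespace() returns TRUE when a package is installed,
# without attaching it the way library() does
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")  # stats ships with R, so this is TRUE

# Install only when the package is missing (run once, in the console):
# if (!is_installed("tidyverse")) install.packages("tidyverse")

# search() lists the packages currently attached to the session
search()
```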

The tidyverse isn’t just a single package but rather a meta-package, meaning it bundles together several other packages, each with its own set of functions. For example, one of these bundled packages is stringr, which offers tools for manipulating character vectors. Since stringr is part of the tidyverse, if you’ve already loaded the tidyverse, there’s no need to load stringr separately. With it, you can perform tasks like converting strings in a vector to uppercase or lowercase.

str_to_lower(candidates) # Change strings to lower case
[1] "sanchez" "diaz"    "feijoo"  "abascal"
str_to_upper(candidates) # Change strings to upper case
[1] "SANCHEZ" "DIAZ"    "FEIJOO"  "ABASCAL"
str_detect(candidates, "inflation") # Detect whether each string contains "inflation"
[1] FALSE FALSE FALSE FALSE

We can also combine character vectors together with str_c().

parties <- c("PSOE", "Sumar", "PP", "Vox")

stringr::str_c(candidates, " is the candidate of ", parties)
[1] "Sanchez is the candidate of PSOE" "Diaz is the candidate of Sumar"  
[3] "Feijoo is the candidate of PP"    "Abascal is the candidate of Vox" 

Note that I have used the :: operator above. It lets you call a specific function from a package without attaching the whole package with library(). This is handy when two packages have functions with the same name, because it makes explicit which one you mean. It is also useful for one-off uses of a function, and it shows the reader which package each function comes from.
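
The :: operator works with any installed package, including those that ship with R. The sketch below uses only the stats and utils packages, so it runs without the tidyverse; the vector `x` is just the seat counts typed in again.

```r
x <- c(137, 121, 33, 31)

# package::function() states explicitly where each function comes from
stats::median(x)
utils::head(x, 2)

# Once dplyr is attached, filter() refers to dplyr::filter(),
# but the masked function can still be reached as stats::filter()
is.function(stats::filter)
```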

4.7 Dataframes and tibbles

In practice, however, we primarily interact with vectors through the manipulation of dataframes. A dataframe is a combination of vectors, which can be of different types. Dataframes are two-dimensional, with columns (or variables) and rows (or observations). This is what we use for manipulating data, computing statistics and producing visualizations. In this class, we will work with a specific form of dataframe coming from the tidyverse packages, called a tibble. Tibbles make dataframes easier to print and manipulate.

To understand what dataframes look like, let us continue with the results of the Spanish elections. I manually create a tibble with the tibble() function, with variables describing the parties, their seats, their vote share and their candidate.

elec <- tibble(
  party = c("PP", "PSOE", "Vox", "Sumar"), 
  seats = c(137, 121, 33, 31),
  vote_share = c(33.1, 31.7, 12.7, 12.3),
  candidate = c("Feijoo", "Sanchez", "Abascal", "Diaz")
)

elec
# A tibble: 4 × 4
  party seats vote_share candidate
  <chr> <dbl>      <dbl> <chr>    
1 PP      137       33.1 Feijoo   
2 PSOE    121       31.7 Sanchez  
3 Vox      33       12.7 Abascal  
4 Sumar    31       12.3 Diaz     

You can see that we now have a new object in our Environment pane, with 4 observations and 4 variables. If we want to access only one variable (one vector) of that dataframe, we use the $ sign. This will return a vector of the values of this variable. You can also get the same result by indicating the position of the column inside [[]].

elec$party # Select the party variable
[1] "PP"    "PSOE"  "Vox"   "Sumar"
elec[[1]] # Double brackets extract the first column itself, as a vector
[1] "PP"    "PSOE"  "Vox"   "Sumar"

We can also use indexing to get the value of a specific cell.

elec$candidate[4] # Get the fourth value of the candidate variable
[1] "Diaz"
elec[4, 1] # Get the value in row 4, column 1 (returned as a 1 × 1 tibble)
# A tibble: 1 × 1
  party
  <chr>
1 Sumar

Different functions are available to get an idea of the content and shape of a dataframe. They are useful when we load an unfamiliar dataset and want to understand its structure, that is, what the observations and variables are.

head(elec, 1) # Return the first n rows of an object (here n = 1)
# A tibble: 1 × 4
  party seats vote_share candidate
  <chr> <dbl>      <dbl> <chr>    
1 PP      137       33.1 Feijoo   
tail(elec, 2) # Return the last n rows of an object (here n = 2)
# A tibble: 2 × 4
  party seats vote_share candidate
  <chr> <dbl>      <dbl> <chr>    
1 Vox      33       12.7 Abascal  
2 Sumar    31       12.3 Diaz     
dplyr::glimpse(elec) # Get a glimpse of your data
Rows: 4
Columns: 4
$ party      <chr> "PP", "PSOE", "Vox", "Sumar"
$ seats      <dbl> 137, 121, 33, 31
$ vote_share <dbl> 33.1, 31.7, 12.7, 12.3
$ candidate  <chr> "Feijoo", "Sanchez", "Abascal", "Diaz"
colnames(elec) # Retrieve column names of the dataframe
[1] "party"      "seats"      "vote_share" "candidate" 
nrow(elec) # Return the number of rows present in the dataframe
[1] 4
ncol(elec) # Return the number of columns present in the dataframe
[1] 4
summary(elec) # Return a summary of the variables
    party               seats         vote_share     candidate        
 Length:4           Min.   : 31.0   Min.   :12.30   Length:4          
 Class :character   1st Qu.: 32.5   1st Qu.:12.60   Class :character  
 Mode  :character   Median : 77.0   Median :22.20   Mode  :character  
                    Mean   : 80.5   Mean   :22.45                     
                    3rd Qu.:125.0   3rd Qu.:32.05                     
                    Max.   :137.0   Max.   :33.10                     

You can also create new variables based on the existing ones. Here I create a new variable called seats_share by calculating the share of seats each party has in the Congreso (dividing their seats by the total number of seats and multiplying by 100). I also create a variable called disproportionality by computing the difference between the vote share and the seat share of each party. This gives us an idea of how parties are advantaged or disadvantaged by the electoral system: a negative value means that a party holds a larger share of seats than of votes. Here we see that the two biggest parties have a larger share of seats than of votes, while the reverse holds for the two smallest parties.

elec$seats_share <- elec$seats/350*100

elec$disproportionality <- elec$vote_share - elec$seats_share
elec
# A tibble: 4 × 6
  party seats vote_share candidate seats_share disproportionality
  <chr> <dbl>      <dbl> <chr>           <dbl>              <dbl>
1 PP      137       33.1 Feijoo          39.1               -6.04
2 PSOE    121       31.7 Sanchez         34.6               -2.87
3 Vox      33       12.7 Abascal          9.43               3.27
4 Sumar    31       12.3 Diaz             8.86               3.44
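
The indexing rules seen for vectors carry over to the rows and columns of a dataframe. The sketch below uses a base R data.frame rather than a tibble, so that it runs without loading the tidyverse. It assumes 121 seats for PSOE, the figure given in the exercise of section 4.2, and `elec_df` is a name introduced only for this illustration.

```r
# Base R equivalent of the tibble above (candidate column left out for brevity)
elec_df <- data.frame(
  party = c("PP", "PSOE", "Vox", "Sumar"),
  seats = c(137, 121, 33, 31),
  vote_share = c(33.1, 31.7, 12.7, 12.3)
)

# New columns are computed element by element, like any vector operation
elec_df$seats_share <- elec_df$seats / 350 * 100
elec_df$disproportionality <- elec_df$vote_share - elec_df$seats_share

# Index rows by condition and columns by name:
# parties whose seat share exceeds their vote share
elec_df[elec_df$disproportionality < 0, "party"]

# order() returns the row positions that sort a column
elec_df[order(elec_df$disproportionality), c("party", "disproportionality")]

# Columns are vectors, so the usual functions apply to them
sum(elec_df$seats)
```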