Data frames

Data frames are similar to matrices, but their columns can hold different types: one column can be numeric, another character, and another logical.

> quantity <- c(200, 300, 100)
> crop <- c("corn", "leek", "pea")
> subsidy <- c(TRUE, FALSE, TRUE)
> my_df <- data.frame(quantity, crop, subsidy)
> my_df

//Output Below

  quantity crop subsidy
1      200 corn    TRUE
2      300 leek   FALSE
3      100  pea    TRUE

Nice! Data frames are two-dimensional objects where variables are stored as columns and observations as rows.
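To make the "rows and columns" idea concrete, here is a minimal sketch that rebuilds the same data frame and asks R for its shape. It uses only base R functions (dim(), nrow(), ncol(), names()).

```r
# Rebuild the example data frame from the text.
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
subsidy <- c(TRUE, FALSE, TRUE)
my_df <- data.frame(quantity, crop, subsidy)

dim(my_df)    # rows, then columns: 3 3
nrow(my_df)   # number of observations (rows): 3
ncol(my_df)   # number of variables (columns): 3
names(my_df)  # column names: "quantity" "crop" "subsidy"
```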

Data frame structure

A great way to explore your data is by using the str() function. Let's apply it to our data frame.

> quantity <- c(200, 300, 100)
> crop <- c("corn", "leek", "pea")
> subsidy <- c(TRUE, FALSE, TRUE)
> my_df <- data.frame(quantity, crop, subsidy)
> str(my_df)

//Output Below

'data.frame':  3 obs. of  3 variables:
 $ quantity: num  200 300 100
 $ crop    : Factor w/ 3 levels "corn","leek",..: 1 2 3
 $ subsidy : logi  TRUE FALSE TRUE

See that? We can now look at the number of observations and variable types. Notice that crop is of type factor instead of character.
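If we only need the column types, one possible shortcut is to apply class() to every column with sapply(). This is a sketch, not the only way to do it; col_classes is just an illustrative name. The class reported for crop depends on the R version and the stringsAsFactors setting, so it is left without an expected value here.

```r
# Rebuild the example data frame from the text.
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
subsidy <- c(TRUE, FALSE, TRUE)
my_df <- data.frame(quantity, crop, subsidy)

# Apply class() to each column; the result is a named character vector.
col_classes <- sapply(my_df, class)

col_classes[["quantity"]]  # "numeric"
col_classes[["subsidy"]]   # "logical"
col_classes[["crop"]]      # "factor" or "character", depending on the setup
```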

Strings as factors

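We saw that strings were saved as factors in our data frame. This is the default in R versions before 4.0.0; from 4.0.0 onward, strings stay as character unless we set stringsAsFactors = TRUE. Either way, we can state our choice explicitly and store them just as strings.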

> my_df <- data.frame(quantity, crop, subsidy, stringsAsFactors = FALSE)
> class(my_df$crop)

//Output Below

[1] "character"

It can make sense to keep "crop" as a factor, since it takes values from a finite set of categories. Personal names, however, are better stored as character.
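One way to get both behaviours at once, sketched below, is to read everything in as character and then convert only the categorical column with factor(). The farmer vector is defined here just for the illustration.

```r
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
farmer <- c("Bob", "Sam", "Mike")
subsidy <- c(TRUE, FALSE, TRUE)

# Keep all strings as character first ...
my_df <- data.frame(quantity, crop, farmer, subsidy, stringsAsFactors = FALSE)

# ... then turn only the categorical column into a factor.
my_df$crop <- factor(my_df$crop)

class(my_df$crop)    # "factor"
levels(my_df$crop)   # "corn" "leek" "pea"
class(my_df$farmer)  # "character"
```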

Adding a new variable

We can add a new variable to our data frame and name it at the same time. Let's add the farmer column.

> my_df$farmer <- c("Bob", "Sam", "Mike")
> my_df

//Output Below

  quantity crop subsidy farmer
1      200 corn    TRUE    Bob
2      300 leek   FALSE    Sam
3      100  pea    TRUE   Mike

Yass! With the $ symbol, we can append a column and name it in one step. Adding a string vector this way stores it as character.
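The same $ assignment also works for a column computed from existing ones. In this sketch, quantity_doubled is a hypothetical column name chosen only for the example.

```r
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
subsidy <- c(TRUE, FALSE, TRUE)
my_df <- data.frame(quantity, crop, subsidy, stringsAsFactors = FALSE)

# Append a character column, as in the text.
my_df$farmer <- c("Bob", "Sam", "Mike")

# Append a derived column (hypothetical name) computed from quantity.
my_df$quantity_doubled <- my_df$quantity * 2

my_df$quantity_doubled  # 400 600 200
names(my_df)            # "quantity" "crop" "subsidy" "farmer" "quantity_doubled"
```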

Selecting variables

We can subset a particular variable column by typing the data frame name followed by $ and the variable name.

Let's see if we can also subset crop in this code snippet.

> my_df$quantity
> my_df$crop

//Output Below

[1] 200 300 100
[1] "corn" "leek" "pea"

Nice! The output is an atomic vector with the same type as the column.

Selecting variables II

Another operator we can use to select a data frame variable is the double square bracket [[ ]].

> my_df[["quantity"]]
> my_df[[1]]

//Output Below

[1] 200 300 100
[1] 200 300 100

Note that we can either type the column index or the column name inside the [[ ]].
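It may help to compare [[ ]] with single brackets. In this sketch, single brackets with a column name keep the data frame wrapper, while [[ ]] and $ both return the bare vector.

```r
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
subsidy <- c(TRUE, FALSE, TRUE)
my_df <- data.frame(quantity, crop, subsidy, stringsAsFactors = FALSE)

class(my_df["quantity"])    # "data.frame": still a one-column data frame
class(my_df[["quantity"]])  # "numeric": the column itself

# [[ ]] and $ return the same vector.
identical(my_df[["quantity"]], my_df$quantity)  # TRUE
```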

Subsetting using index

As with vectors and matrices, we can extract data frame subsets using single square brackets, written as [rows, columns].

How would we select all of the rows, but only 2 columns?

> my_df[ , 3:4]

//Output Below

  subsidy farmer
1    TRUE    Bob
2   FALSE    Sam
3    TRUE   Mike

Yass! We've just subsetted 2 columns. By using different index combinations we can extract single elements, whole rows, or smaller data frames.
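Here is a short sketch of those other index combinations on the same data: one element, one row, and a block of rows and named columns. The variable names one_row and block are only for illustration.

```r
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
subsidy <- c(TRUE, FALSE, TRUE)
farmer <- c("Bob", "Sam", "Mike")
my_df <- data.frame(quantity, crop, subsidy, farmer, stringsAsFactors = FALSE)

# Single element: row 2, column 1.
my_df[2, 1]  # 300

# Whole row: leave the column index empty.
one_row <- my_df[2, ]

# Rows 1 to 2, columns selected by name.
block <- my_df[1:2, c("quantity", "crop")]
dim(block)  # 2 2
```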

Selecting 1 column

What might be the best way to only select 1 column?

> my_df[ , 3]

//Output Below

[1]  TRUE FALSE  TRUE

Awesome! By leaving the row index empty and writing the index of one column, we get that column. Note that the result is a plain vector, not a data frame.
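If we would rather keep the result as a one-column data frame, base R's drop argument can be set to FALSE. A minimal sketch:

```r
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
subsidy <- c(TRUE, FALSE, TRUE)
farmer <- c("Bob", "Sam", "Mike")
my_df <- data.frame(quantity, crop, subsidy, farmer, stringsAsFactors = FALSE)

# Default: a single column is simplified to a vector.
is.data.frame(my_df[, 3])                # FALSE

# drop = FALSE keeps the data frame structure.
is.data.frame(my_df[, 3, drop = FALSE])  # TRUE
names(my_df[, 3, drop = FALSE])          # "subsidy"
```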

Subsetting with boolean

Let's select only the rows that are receiving subsidy. Remember, subsidy is a logical variable.

> my_df[my_df$subsidy, ]

//Output Below

  quantity crop subsidy farmer
1      200 corn    TRUE    Bob
3      100  pea    TRUE   Mike

Awesome! We see only the rows containing TRUE in the subsidy column.
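The row index does not have to be an existing logical column. Any expression that yields one TRUE or FALSE per row should work, as in this sketch; the threshold 150 and the names big and both are arbitrary choices for the example.

```r
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
subsidy <- c(TRUE, FALSE, TRUE)
farmer <- c("Bob", "Sam", "Mike")
my_df <- data.frame(quantity, crop, subsidy, farmer, stringsAsFactors = FALSE)

# Rows where quantity is above 150.
big <- my_df[my_df$quantity > 150, ]
big$crop   # "corn" "leek"

# Combine conditions with &: subsidised AND above 150.
both <- my_df[my_df$subsidy & my_df$quantity > 150, ]
both$crop  # "corn"
```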

Sorting

We can sort our data frame by a particular column using the order() function. Sort my_df by the quantity column.

> order(my_df$quantity)
> my_df[order(my_df$quantity), ]

//Output Below

[1] 3 1 2

  quantity crop subsidy farmer
3      100  pea    TRUE   Mike
1      200 corn    TRUE    Bob
2      300 leek   FALSE    Sam

See that? The order() function returns the indices that would put a vector in sorted order, and using them as row indices sorts the data frame. By default, it sorts in ascending order.
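For descending order, order() accepts a decreasing argument; for numeric columns, negating the values has the same effect. A minimal sketch, with sorted_desc as an illustrative name:

```r
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
subsidy <- c(TRUE, FALSE, TRUE)
farmer <- c("Bob", "Sam", "Mike")
my_df <- data.frame(quantity, crop, subsidy, farmer, stringsAsFactors = FALSE)

# Largest quantity first.
sorted_desc <- my_df[order(my_df$quantity, decreasing = TRUE), ]
sorted_desc$quantity    # 300 200 100

# Equivalent for numeric columns: order by the negated values.
order(-my_df$quantity)  # 2 1 3
```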