Data frames
Data frames are similar to matrices, but each column can hold a different type of data.
> quantity <- c(200, 300, 100)
> crop <- c("corn", "leek", "pea")
> subsidy <- c(TRUE, FALSE, TRUE)
> my_df <- data.frame(quantity, crop, subsidy)
> my_df
//Output Below
  quantity crop subsidy
1      200 corn    TRUE
2      300 leek   FALSE
3      100  pea    TRUE
Nice! Data frames are two-dimensional objects where variables are stored as columns and observations as rows.
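To make the rows-and-columns picture concrete, here is a small sketch that rebuilds the same data frame and asks R for its shape. The helper names n_obs, n_vars and var_names are made up for this illustration.

```r
# Rebuild the example data frame so this snippet runs on its own.
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
subsidy <- c(TRUE, FALSE, TRUE)
my_df <- data.frame(quantity, crop, subsidy)

n_obs <- nrow(my_df)       # number of observations (rows)
n_vars <- ncol(my_df)      # number of variables (columns)
var_names <- names(my_df)  # column names, taken from the vector names

print(dim(my_df))          # rows first, then columns
print(var_names)
```

Each vector passed to data.frame() becomes one column, so all of them need the same length.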
Data frame structure
A great way to explore your data is by using the str() function. Let's apply it to our data frame.
> quantity <- c(200, 300, 100)
> crop <- c("corn", "leek", "pea")
> subsidy <- c(TRUE, FALSE, TRUE)
> my_df <- data.frame(quantity, crop, subsidy)
> str(my_df)
//Output Below
'data.frame': 3 obs. of 3 variables:
 $ quantity: num 200 300 100
 $ crop    : Factor w/ 3 levels "corn","leek",..: 1 2 3
 $ subsidy : logi TRUE FALSE TRUE
See that? We can now look at the number of observations and variable types. Notice that crop is of type factor instead of character.
Strings as factors
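We saw that strings were saved as factors in our data frame. That was the default before R 4.0.0; from R 4.0.0 onward, data.frame() keeps strings as character unless stringsAsFactors = TRUE is set. Sometimes we may want to control this explicitly and store them just as strings.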
> farmer <- c("Bob", "Sam", "Mike")
> my_df <- data.frame(quantity, crop, farmer, subsidy, stringsAsFactors = FALSE)
> class(my_df$crop)
//Output Below
[1] "character"
It makes sense to leave crop as a factor since it takes values from a finite set of categories. Personal names, however, are better saved as character.
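One way to get both behaviours at once, sketched here under the assumption that we build the data frame with stringsAsFactors = FALSE, is to convert only the categorical column afterwards with factor(). The variables crop_class, farmer_class and crop_levels are illustrative names.

```r
crop <- c("corn", "leek", "pea")
farmer <- c("Bob", "Sam", "Mike")

# Keep every string column as character to begin with.
my_df <- data.frame(crop, farmer, stringsAsFactors = FALSE)

# Convert only the categorical column to a factor.
my_df$crop <- factor(my_df$crop)

crop_class <- class(my_df$crop)      # class of the converted column
farmer_class <- class(my_df$farmer)  # class of the untouched column
crop_levels <- levels(my_df$crop)    # the distinct categories

print(crop_class)
print(farmer_class)
print(crop_levels)
```

Setting stringsAsFactors explicitly also makes the result the same across R versions.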
Adding a new variable
We can add a new variable to our data frame and name it at the same time. Let's add the farmer column.
> my_df$farmer <- c("Bob", "Sam", "Mike")
> my_df
//Output Below
  quantity crop subsidy farmer
1      200 corn    TRUE    Bob
2      300 leek   FALSE    Sam
3      100  pea    TRUE   Mike
Yass! With the $ symbol, we can append and name a column in one step. Adding a string vector in this way saves it as character.
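The same $ assignment also works for a column computed from existing ones. A minimal sketch, where the column name large and the threshold of 200 are invented for the example:

```r
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
subsidy <- c(TRUE, FALSE, TRUE)
my_df <- data.frame(quantity, crop, subsidy, stringsAsFactors = FALSE)

# Derive a logical column from quantity; `large` is a made-up name.
my_df$large <- my_df$quantity >= 200

print(names(my_df))
print(my_df$large)
```

The new column is appended after the existing ones, and its type follows the vector assigned to it.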
Selecting variables
We can subset a particular variable column by typing the data frame name followed by $ and the variable name.
Let's see if we can also subset crop in this code snippet.
> my_df$quantity
> my_df$crop
//Output Below
[1] 200 300 100
[1] "corn" "leek" "pea"
Nice! The output is an atomic vector whose type matches the column's type.
Selecting variables II
Another operator we can use to select a data frame variable is the double square bracket [[ ]].
> my_df[["quantity"]]
> my_df[[1]]
//Output Below
[1] 200 300 100
[1] 200 300 100
Note that we can type either the column index or the column name inside the [[ ]].
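It may help to compare [[ ]] with single square brackets, which keep the data frame structure instead of extracting the column. A short sketch, with as_vector and as_df as illustrative names:

```r
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
my_df <- data.frame(quantity, crop, stringsAsFactors = FALSE)

as_vector <- my_df[["quantity"]]  # extracts the column itself
as_df <- my_df["quantity"]        # keeps a one-column data frame

print(class(as_vector))
print(class(as_df))
```

So [[ ]] and $ are the tools for pulling out the values, while [ ] is for building a smaller data frame.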
Subsetting using index
As with vectors and matrices, we can call various data frame subsets by using simple square brackets.
How would we select all of the rows, but only 2 columns?
> my_df[ , 3:4]
//Output Below
  subsidy farmer
1    TRUE    Bob
2   FALSE    Sam
3    TRUE   Mike
Yass! We've just subsetted 2 columns. By using different index combinations we can subset single elements, rows or two-dimensional arrays.
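A few of those index combinations, sketched on the same four-column data frame. The names one_value, one_row and by_name are only for illustration.

```r
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
subsidy <- c(TRUE, FALSE, TRUE)
farmer <- c("Bob", "Sam", "Mike")
my_df <- data.frame(quantity, crop, subsidy, farmer, stringsAsFactors = FALSE)

one_value <- my_df[2, 1]                  # single element: row 2, column 1
one_row <- my_df[2, ]                     # whole second row, still a data frame
by_name <- my_df[, c("crop", "farmer")]   # columns can be picked by name too

print(one_value)
print(one_row)
print(by_name)
```

The row index always goes before the comma and the column index after it.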
Selecting 1 column
What might be the best way to only select 1 column?
> my_df[ , 3]
//Output Below
[1] TRUE FALSE TRUE
Awesome! By leaving the row index empty and writing the index of a single column, we get that column back as a vector.
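If we would rather keep the result as a data frame, the drop argument of the square bracket can be set to FALSE. A small sketch comparing the two, with col_vec and col_df as illustrative names:

```r
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
subsidy <- c(TRUE, FALSE, TRUE)
my_df <- data.frame(quantity, crop, subsidy, stringsAsFactors = FALSE)

col_vec <- my_df[, 3]                # default: simplified to a vector
col_df <- my_df[, 3, drop = FALSE]   # stays a one-column data frame

print(col_vec)
print(col_df)
```

Keeping the data frame form is handy when later code expects column names to be present.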
Subsetting with boolean
Let's select only the rows that are receiving subsidy. Remember, subsidy is a logical variable.
> my_df[subsidy, ]
//Output Below
  quantity crop subsidy farmer
1      200 corn    TRUE    Bob
3      100  pea    TRUE   Mike
Awesome! We see only the rows containing TRUE in the subsidy column.
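The logical vector used for filtering can also come from a comparison. The sketch below refers to the columns through my_df$ so it does not depend on separate vectors in the workspace; the threshold of 150 and the result names are made up for the example.

```r
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
subsidy <- c(TRUE, FALSE, TRUE)
my_df <- data.frame(quantity, crop, subsidy, stringsAsFactors = FALSE)

subsidised <- my_df[my_df$subsidy, ]          # rows where subsidy is TRUE
big <- my_df[my_df$quantity > 150, ]          # rows matching a comparison
both <- my_df[my_df$subsidy & my_df$quantity > 150, ]  # conditions combined with &

print(subsidised)
print(big)
print(both)
```

Note that the original row names are kept, which is why the filtered output can show rows labelled 1 and 3.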
Sorting
We can sort our data frame by a particular column using the order() function. Sort my_df by the quantity column.
> order(my_df$quantity)
> my_df[order(my_df$quantity), ]
//Output Below
[1] 3 1 2
  quantity crop subsidy farmer
3      100  pea    TRUE   Mike
1      200 corn    TRUE    Bob
2      300 leek   FALSE    Sam
See that? The order() function returns the positions that would put a vector in sorted order, and using those positions as row indices sorts the data frame. By default, it sorts in ascending order.
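To sort the other way round, order() accepts decreasing = TRUE. The sketch below also sorts by a character column; desc and by_farmer are illustrative names.

```r
quantity <- c(200, 300, 100)
crop <- c("corn", "leek", "pea")
farmer <- c("Bob", "Sam", "Mike")
my_df <- data.frame(quantity, crop, farmer, stringsAsFactors = FALSE)

# Largest quantity first.
desc <- my_df[order(my_df$quantity, decreasing = TRUE), ]

# Character columns are ordered alphabetically.
by_farmer <- my_df[order(my_df$farmer), ]

print(desc)
print(by_farmer)
```

In both cases my_df itself is unchanged; the sorted copy only persists if we assign it to a name.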