Writing functions in R
R for palaeobiologists: Workshop and Hackathon
takes input –> does something –> returns output
A function needs a name, arguments in (), and a body in {}
subtract <- function(arg1, arg2) {
  arg1 - arg2
}
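For example:
subtract(3, 1)
[1] 2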
Imagine calculating the mean without standard functions like mean or sum:
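A minimal sketch of what that could look like, accumulating the total in a loop (the function name is just illustrative):
manual_mean <- function(x) {
  total <- 0
  for (value in x) {      # add up all elements one by one
    total <- total + value
  }
  total / length(x)       # divide by the number of elements
}
manual_mean(c(2, 4, 6))   # 4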
Arguments need to be provided in the correct order, or specified by name:
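For example, with the subtract() function from above:
subtract(10, 4)                   # arg1 = 10, arg2 = 4, returns 6
subtract(arg2 = 4, arg1 = 10)     # same result, arguments matched by name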
Default values for arguments make function use more convenient and can hide complexities:
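One way this might look, giving arg2 a default value (an illustrative choice):
subtract <- function(arg1, arg2 = 1) {   # arg2 defaults to 1
  arg1 - arg2
}
subtract(5)      # uses the default, returns 4
subtract(5, 3)   # returns 2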
Additional, optional arguments can be allowed by using ... as the last argument:
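A sketch of such a wrapper, which simply passes ... on to round():
subtract_rounded <- function(arg1, arg2, ...) {
  round(arg1 - arg2, ...)   # any extra arguments are handed to round()
}
subtract_rounded(10, 3.14159)               # 7
subtract_rounded(10, 3.14159, digits = 2)   # 6.86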
A function generally should return something, but this does not:
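A sketch of such a function (the name is illustrative); the assignment is the last expression, so the value is only returned invisibly:
subtract_silent <- function(arg1, arg2) {
  result <- arg1 - arg2   # assigned inside the function
}
subtract_silent(3, 1)     # nothing is printed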
Return explicitly with return(), or place the return value at the end of the function:
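For example:
subtract_a <- function(arg1, arg2) {
  return(arg1 - arg2)   # explicit return
}
subtract_b <- function(arg1, arg2) {
  arg1 - arg2           # the last evaluated value is returned
}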
This did not work as intended: R functions only return one object. To return several values, use a list or another data structure:
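A sketch of returning several values in a named list:
subtract_and_add <- function(arg1, arg2) {
  list(difference = arg1 - arg2,
       sum        = arg1 + arg2)
}
res <- subtract_and_add(3, 1)
res$difference   # 2
res$sum          # 4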
Custom binary operators – let’s define an operator for “not in”:
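One way to define it (the operator name is a common convention, not fixed):
`%notin%` <- function(x, table) {
  !(x %in% table)
}
3 %notin% c(1, 2, 4)   # TRUE
2 %notin% c(1, 2, 4)   # FALSE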
if does something when a condition is true. else instructs what to do when the if condition is not met.
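For example:
x <- -3
if (x >= 0) {
  print("x is not negative")
} else {
  print("x is negative")   # this branch runs, because x < 0
}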
Instead of many if and else statements, try switch:
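A sketch (the eras and ages serve only as an illustration):
era_age <- function(era) {
  switch(era,
         "Paleozoic" = "541-252 Ma",
         "Mesozoic"  = "252-66 Ma",
         "Cenozoic"  = "66-0 Ma",
         "unknown era")           # default when nothing matches
}
era_age("Mesozoic")   # "252-66 Ma"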
Loops are used for repeating similar actions multiple times. for loops iterate over a set of values. The iterator (i) changes with every iteration of the loop:
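For example:
for (i in c(2, 4, 6)) {
  print(i * 10)   # prints 20, 40, 60
}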
To generate sequences of integers, we can use seq_len(). Let's make a function:
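A sketch of such a function (name and body are illustrative):
print_squares <- function(n) {
  for (i in seq_len(n)) {   # seq_len(n) gives 1, 2, ..., n
    print(i^2)
  }
}
print_squares(3)   # prints 1, 4, 9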
while loops repeat a task until a condition is no longer met.
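For example:
i <- 1
while (i <= 3) {
  print(i)      # prints 1, 2, 3
  i <- i + 1
}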
Create a function that can sort a data.frame into hemispheres. That is, we want a new column that identifies the hemisphere of each entry of the data set. As an exemplary data set, we will use the reefs data from palaeoverse here and in the next exercises.
If you are an experienced R user, you can up the difficulty by writing a function that can sort data into any number of latitudinal bins of equal width.
sort_lat_bins <- function(df, n_bin) {
  bin_width <- 180 / n_bin
  df$lat_bin <- rep(NA, nrow(df))
  for (i in 1:nrow(df)) {
    bin_number <- floor((df$lat[i] + 90) / bin_width) + 1
    bin_number <- min(bin_number, n_bin)   # make sure lat = 90 falls into the last bin
    bin_start <- (bin_number - 1) * bin_width - 90
    bin_end <- bin_start + bin_width
    df$lat_bin[i] <- (bin_start + bin_end) / 2
  }
  return(df)
}
# The bin names are the midpoints of the latitudinal bins.
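Example use, assuming the reefs data from palaeoverse is loaded and has a lat column:
library(palaeoverse)
reefs_binned <- sort_lat_bins(reefs, n_bin = 6)   # six 30-degree-wide bins
head(reefs_binned$lat_bin)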
Give variables and functions consistent names. These are the two most common styles:
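The two styles referred to are most likely snake_case and camelCase, for example:
# snake_case
mean_reef_latitude <- 23.5
get_mean_latitude  <- function(df) mean(df$lat)

# camelCase
meanReefLatitude <- 23.5
getMeanLatitude  <- function(df) mean(df$lat)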
Internal functions in R packages are often prefixed with a dot, e.g. .my_internal_function. Don't prefix the file name with a dot.
Common practices:
Example of a detailed style guide: Tidyverse’s style guide
Automatically format R code by highlighting it and clicking on Code -> Reformat Code, or by pressing Ctrl+Shift+A.
The lintr package lets you check that your code conforms to your chosen standard.
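For example (the file name is a placeholder):
install.packages("lintr")
lintr::lint("my_script.R")   # lists style issues line by line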
Add some general information at the beginning of a large R script.
### Change point regression analysis
### July 2021
### Kilian Eichenseer
###
### Bayesian algorithm for finding a change point in
### the linear relationship between two variables.
### Uses JAGS (https://mcmc-jags.sourceforge.io/).
### Generate data
set.seed(10)
n <- 60 # total number of data points
Even better: add formal documentation.
install.packages("roxygen2")
roxygen2 comments start with #'. They can generate documentation, e.g. for the subtract function from earlier:
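A sketch of what that documentation could look like:
#' Subtract two numbers
#'
#' @param arg1 Numeric. The value to subtract from.
#' @param arg2 Numeric. The value to subtract.
#'
#' @return The difference of arg1 and arg2.
#'
#' @examples
#' subtract(3, 1)
#'
#' @export
subtract <- function(arg1, arg2) {
  arg1 - arg2
}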
Commonly used roxygen2 tags include @param, @return, @examples, and @export.
For more details, see https://r-pkgs.org/man.html
roxygen2
From RStudio, create a New Directory (File -> New Project...)
Save your functions in the R folder as .R files
Documentation is generated in the man folder
Delete the NAMESPACE file once to avoid warnings
We start by installing roxygen2 and the devtools package, if you haven't installed them yet:
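For example:
install.packages("roxygen2")
install.packages("devtools")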
To generate a documentation template for your function, click on Code -> Insert Roxygen Skeleton in RStudio.
To generate documentation from our roxygen2 comments, which are denoted by the #' tags, run devtools::document(), or press Ctrl + Shift + D.
We can now read the documentation of our function by calling ? followed by the function name, e.g. ?subtract.
#' Sort data into hemispheres
#'
#' This function sorts data into hemispheres based on latitude values.
#' Rows with latitudes \eqn{\geq} 0 are categorised as "north"; rows with
#' latitudes \eqn{<} 0 are categorised as "south".
#'
#' @param df A data frame with a column named 'lat' representing latitudinal values.
#'
#' @return A data frame with an additional column 'hemisphere', indicating the hemisphere of each row.
#'
#' @examples
#' data <- data.frame(lat = c(30, -45, 60, -10))
#' data <- sort_hemisphere(data)
#'
#' @export
sort_hemisphere <- function(df) {
  # initialise hemisphere column
  df$hemisphere <- rep(NA, nrow(df))
  # loop through df and assign hemispheres
  for (i in 1:nrow(df)) {
    if (df$lat[i] >= 0) {
      df$hemisphere[i] <- "north"
    } else {
      df$hemisphere[i] <- "south"
    }
  }
  return(df)
}
After running devtools::document(), ?sort_hemisphere should show:
Input checks are very helpful in complex functions. Check that the input is correct and display custom error messages:
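A sketch, using stop() with a custom message:
subtract <- function(arg1, arg2) {
  if (!is.numeric(arg1) || !is.numeric(arg2)) {
    stop("Both 'arg1' and 'arg2' must be numeric.")
  }
  arg1 - arg2
}
subtract("a", 1)   # Error: Both 'arg1' and 'arg2' must be numeric.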
Use tryCatch() if you anticipate an error but want the function to continue.
Let’s try to generate data from a multivariate normal distribution:
mvnfast::rmvn is fast but fails for some problematic sigma values. In case it fails, we use MASS::mvrnorm instead:
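A sketch of how this could look with tryCatch(); the sigma shown here is just one example of a problematic (singular) covariance matrix:
mu <- c(0, 0)
sigma <- matrix(c(1, 1, 1, 1), nrow = 2)   # singular covariance matrix

draws <- tryCatch(
  mvnfast::rmvn(100, mu = mu, sigma = sigma),
  error = function(e) {
    message("rmvn failed, using MASS::mvrnorm instead")
    MASS::mvrnorm(100, mu = mu, Sigma = sigma)
  }
)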
If something went wrong, find out where using traceback():
Break points allow you to look inside your function’s environment. Click next to a line of code in your function to activate a break point (a red dot appears):
You may need to save the script and source it (press Source in the RStudio toolbar).
We can now browse the function environment in the console, just as we normally browse the global environment. For example, we can look at sigma:
Press Stop to end the browsing. Don't forget to deactivate the break point by clicking on the red dot in the script.
Functions should be tested before they are used.
Sometimes, interactive testing may be enough.
For example, identical() tests whether two objects are exactly equal:
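For example:
identical(subtract(3, 1), 2)   # TRUE
identical(subtract(3, 1), 1)   # FALSE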
install.packages("testthat")
usethis::use_test()
In the test-subtract.R file, we can write tests, for example:
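A sketch of such a test:
test_that("subtract works", {
  expect_equal(subtract(3, 1), 2)
  expect_equal(subtract(1, 3), -2)
})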
If we run this test, we should get
We can also run all of our tests at once with test_local() or test_package("my_package"):
If you have the devtools package installed, you can also run devtools::test(), e.g. by pressing Test in the Build tab of RStudio:
Useful tests include:
For testing plot functions, the expect_doppelganger() function from the vdiffr package can be used:
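A sketch, assuming the reefs data is available (plot and test names are illustrative):
library(vdiffr)

test_that("latitude histogram looks as expected", {
  expect_doppelganger(
    "latitude histogram",
    function() hist(reefs$lat)
  )
})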
This will create an image in the tests/testthat/_snaps/function_name directory. When first calling this, inspect the image to check that it is as expected. Future tests will fail if the function call in the test produces a different image.
To check how much of your package is covered by tests, the covr package is helpful:
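For example:
install.packages("covr")
covr::package_coverage()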
The report() function lets you check test coverage line by line:
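For example:
covr::report()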
Add checks, error messages, and testthat functionality to the sort_hemisphere function. Checks and error messages:
sort_hemisphere <- function(df) {
  #
  # --- check that input has the correct format
  #
  if (!is.data.frame(df)) {
    stop("'df' is not a dataframe.")
  }
  if (!"lat" %in% names(df)) {
    stop("'df' does not contain a column named 'lat'.")
  }
  if (!is.numeric(df$lat)) {
    stop("Column 'lat' is not numeric.")
  }
  if (!(all(df$lat <= 90) && all(df$lat >= -90))) {
    stop("Column 'lat' may only contain latitudes from -90 to 90.")
  }
  #
  # --- assign hemispheres
  #
  # initialise hemisphere column
  df$hemisphere <- rep(NA, nrow(df))
  # loop through df and assign hemispheres
  for (i in 1:nrow(df)) {
    if (df$lat[i] >= 0) {
      df$hemisphere[i] <- "north"
    } else {
      df$hemisphere[i] <- "south"
    }
  }
  return(df)
}
testthat tests:
# Tests for sort_hemisphere function
test_that("Function rejects non-dataframe input", {
  expect_error(sort_hemisphere("not a dataframe"), "'df' is not a dataframe.")
})

test_that("Function rejects dataframe without 'lat' column", {
  df <- data.frame(lng = c(30, -45, 60, -10))
  expect_error(sort_hemisphere(df), "'df' does not contain a column named 'lat'.")
})

test_that("Function rejects non-numeric 'lat' column", {
  df <- data.frame(lat = c("30N", "45S", "60N", "10S"))
  expect_error(sort_hemisphere(df), "Column 'lat' is not numeric.")
})

test_that("Function rejects 'lat' values < -90 or > 90", {
  df <- data.frame(lat = c(30, -95, 100, -10))
  expect_error(sort_hemisphere(df), "Column 'lat' may only contain latitudes from -90 to 90.")
})

test_that("Function correctly assigns hemispheres", {
  df <- data.frame(lat = c(30, -45, 60, -10))
  result <- sort_hemisphere(df)
  expect_equal(result$hemisphere, c("north", "south", "north", "south"))
})
apply and related functions apply a function to elements of arrays, lists, etc. For example, to get the class of the first three columns of the reefs data:
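Presumably a call along these lines:
apply(reefs[, 1:3], MARGIN = 2, FUN = class)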
Let's have a look at what happened there. apply(X, MARGIN, FUN) takes an array X (our reefs data frame) and applies a function (FUN) to the elements of that array specified by MARGIN. MARGIN = 2 indicates columns, MARGIN = 1 would be rows. So here we applied the class function to every column of reefs.
apply simplifies the output, so here it returned a vector with one element for each column.
lapply is similar to apply, but for list or vector input. It returns a list with one element for each element of the data.
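For example (a data frame is a list of columns, so lapply() iterates over the columns):
lapply(reefs[, 1:3], class)   # a list with one element per column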
vapply is a safer version of sapply: it requires the user to specify the anticipated class and length of the elements of the output:
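For example, assuming each column has a single class:
vapply(reefs[, 1:3], class, FUN.VALUE = character(1))   # each result must be one character value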
do.call() takes a function and a list of arguments, and applies the function to the arguments. This can be useful in a variety of situations:
To combine data:
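For example, row-binding a (hypothetical) list of data frames:
list_of_dfs <- list(
  data.frame(lat = c(10, 20)),
  data.frame(lat = c(-30, -40)),
  data.frame(lat = c(50, 60))
)
combined <- do.call(rbind, list_of_dfs)   # same as rbind(df1, df2, df3)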
With dynamic arguments:
For example, we want to allow the user to specify additional arguments for a custom plot function, including xlim. If they don't specify it, we want to automatically generate xlim values:
my_plot <- function(x, ...) {
  args <- list(...)                 # save the additional arguments
  if ("xlim" %in% names(args)) {
    xlim <- args$xlim               # if the user provides xlim, use that
    args$xlim <- NULL               # to avoid argument duplication
  } else {
    xlim <- range(x) + c(-1, 1)     # automatic xlim
  }
  do.call(plot, c(list(x = x, xlim = xlim), args))
}
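For example:
my_plot(rnorm(20))                                   # automatic xlim
my_plot(rnorm(20), xlim = c(-5, 5), col = "red")     # user-supplied xlim, extra argument passed on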
An environment can be conceptualised as a place where objects with a name and a value exist.
Each function call creates its own environment. (Note that for loops, in contrast, do not create a new environment; they run in the environment they are called from.)
If we run the following function to assign the value of a to b, and then look for b in the global environment, we get an error, because b only existed within the function environment.
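A sketch of what that function probably looked like (names are illustrative):
a <- 5
assign_b <- function() {
  b <- a   # b is created inside the function's environment
}
assign_b()
b
# Error: object 'b' not found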
More on environments: adv-r.hadley.nz/environments.html
R uses scoping rules to look for variables (or functions). If a variable is not found in a function environment, R looks in the parent environment (i.e. the environment in which the function was created).
x is a free variable in the double_x function: it is not supplied to or defined in the function. Instead, it is looked up in the environment where double_x was created, the global environment.
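The double_x function was presumably something like this:
x <- 10
double_x <- function() {
  2 * x   # x is not an argument; R finds it in the global environment
}
double_x()   # 20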
This can get tricky, see here for more details: bookdown.org/rdpeng/rprogdatascience/scoping-rules-of-r.html
If you have large data sets and complex functions, you may want to enhance their performance.
The microbenchmark package performs an operation many times and measures the average time it takes. You can also compare different operations.
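For example:
library(microbenchmark)
microbenchmark(
  mean(1:10000),
  sum(1:10000) / 10000
)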
The profvis package lets you identify bottlenecks in your code:
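For example, profiling an arbitrary chunk of code:
library(profvis)
profvis({
  df <- data.frame(x = rnorm(1e5))
  df$y <- df$x^2 + rnorm(1e5)
  fit <- lm(y ~ x, data = df)
})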
Only spend time trying to make your code faster if the speed actually limits your work.
Here is a good overview on making R functions run faster: Best coding practices in R
Pre-allocating memory is faster than growing objects repeatedly.
Assume we have recorded the results of 1,000 dice rolls:
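The original data and functions are not shown here; a reconstruction might look like this, with is_six_1() growing the result vector and is_six_2() pre-allocating it:
set.seed(1)
data <- sample(1:6, 1000, replace = TRUE)

# grow the result vector in every iteration
is_six_1 <- function(data) {
  res <- c()
  for (i in seq_along(data)) {
    res <- c(res, data[i] == 6)
  }
  res
}

# pre-allocate the result vector
is_six_2 <- function(data) {
  res <- logical(length(data))
  for (i in seq_along(data)) {
    res[i] <- data[i] == 6
  }
  res
}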
Let’s check which approach is faster:
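For example:
microbenchmark(is_six_1(data), is_six_2(data), times = 1000)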
Unit: nanoseconds
expr min lq mean median uq max neval
is_six_1(data) 1300 1400 3892.4 1500 1600 2277800 1000
is_six_2(data) 800 900 3333.4 1000 1100 2261000 1000
is_six_2() is faster, as R doesn't have to grow the res object in every iteration.
R has many functions that are vectorised.
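A vectorised version (reconstructed; the comparison data == 6 works on the whole vector at once) and the corresponding benchmark:
is_six_3 <- function(data) {
  data == 6
}

microbenchmark(is_six_2(data), is_six_3(data), times = 1000)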
Unit: nanoseconds
expr min lq mean median uq max neval
is_six_2(data) 1200 1400 5263.4 1500 1600 3656300 1000
is_six_3(data) 300 400 1572.9 400 500 1088400 1000
The vectorised version is_six_3() is much faster.
Vectorised matrix functions like rowSums(), colSums() or rowMeans() can lead to impressive speedups:
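For example, comparing apply() with rowSums() on a moderately large matrix:
m <- matrix(rnorm(1e6), nrow = 1000)

microbenchmark(
  apply(m, 1, sum),   # sum() applied to each row
  rowSums(m),         # vectorised row sums
  times = 100
)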
The Rcpp package allows for integrating C++ code with R, which can make R functions a lot faster. It requires the installation of a C++ compiler (Rtools for Windows, Xcode for Mac, possibly "sudo apt-get install r-base-dev" on Linux).
Read more at Rcpp: Seamless R and C++ Integration or Advanced R.
I also highly recommend ChatGPT for help with creating C++ functions.
As an example, let’s compare a random walk implemented in R with one implemented with Rcpp.
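The R version probably looked something like this (the function name is illustrative):
random_walk_r <- function(n_steps) {
  walk <- numeric(n_steps)
  for (i in 2:n_steps) {
    walk[i] <- walk[i - 1] + rnorm(1)   # each step adds normally distributed noise
  }
  walk
}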
Next, the Rcpp version:
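A sketch of the Rcpp version, compiled directly from R with cppFunction():
library(Rcpp)

cppFunction('
NumericVector random_walk_Rcpp(int n_steps) {
  NumericVector walk(n_steps);
  for (int i = 1; i < n_steps; i++) {
    walk[i] = walk[i - 1] + R::rnorm(0.0, 1.0);
  }
  return walk;
}
')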
We now have the function random_walk_Rcpp in the global environment.
Let’s make sure both versions work:
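For example, plotting a short walk from each version:
plot(random_walk_r(1000), type = "l")
plot(random_walk_Rcpp(1000), type = "l")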
Now let us compare the speed:
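For example:
microbenchmark(
  random_walk_r(10000),
  random_walk_Rcpp(10000)
)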
Loops are much faster in C++!
Profile your sort_hemisphere function with profvis(). The result may not be very informative, as this was probably a short, rather quick function. Try to speed the function up and compare the approaches with microbenchmark().
Profvis doesn’t tell us much here:
Using vapply() instead of a for loop:
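A reconstruction of the vapply()-based version (the name matches the benchmark below):
sort_hemisphere_vapply <- function(df) {
  df$hemisphere <- vapply(df$lat,
                          function(lat) if (lat >= 0) "north" else "south",
                          FUN.VALUE = character(1))
  return(df)
}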
Taking advantage of vectorisation:
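A reconstruction of the vectorised version using ifelse():
sort_hemisphere_ifelse <- function(df) {
  df$hemisphere <- ifelse(df$lat >= 0, "north", "south")
  return(df)
}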
Comparing the approaches with microbenchmark:
library(microbenchmark)
microbenchmark(
  sort_hemisphere(reefs),
  sort_hemisphere_vapply(reefs),
  sort_hemisphere_ifelse(reefs)
)
## Unit: microseconds
## expr min lq mean median uq
## sort_hemisphere(reefs) 50641.0 57764.00 65826.721 62712.50 68810.45
## sort_hemisphere_vapply(reefs) 1882.0 1992.60 2683.510 2343.35 2969.55
## sort_hemisphere_ifelse(reefs) 733.7 788.25 951.854 828.80 1011.55
## max neval
## 145343.2 100
## 13024.0 100
## 2714.3 100
Comments
Good code shouldn't need a large number of comments, but comment enough that you can still use your code two years later.