Writing functions in R

R for palaeobiologists: Workshop and Hackathon

Function basics

What is a function

takes input –> does something –> returns output

mean(c(1, 2, 3))
[1] 2


A function needs a name, arguments in (), and a body in {}


subtract <- function(arg1, arg2) { 
  arg1 - arg2 
} 


subtract(2, 1)
[1] 1

Why do we need functions

  • Readability
  • Organisation
  • Modularity
  • Reusability

Imagine calculating the mean without standard functions like mean or sum:

  data <- c(1,2,3)
  total <- 0
  count <- 0
  for (value in data) {
    total <- total + value
    count <- count + 1
  }
  total/count
[1] 2

Arguments

Arguments need to be provided in the correct order, or specified by name:

subtract(2, 1)
[1] 1


subtract(1, 2)
[1] -1


subtract(arg2 = 1, arg1 = 2)
[1] 1

Default values

Make function use more convenient, can hide complexities.

subtract(2)
Error in subtract(2): argument "arg2" is missing, with no default


This will work if we set a default for arg2:

subtract <- function(arg1, arg2 = 1) {
  arg1 - arg2
}


subtract(2)
[1] 1


Ellipsis (‘…’)

Additional, optional arguments can be allowed by using ‘…’ as the last argument:

my_plot <- function(arg1, arg2, ...) {
  plot(arg1, arg2, ...)
}


my_plot(2, 1, col = "red", pch = 17, cex = 2)

Return

A function generally should return something, but this does not:

subtract <- function(arg1, arg2) {
  result <- arg1 - arg2
}
subtract(2,1)


Return explicitly with return, or place return value at the end of the function:

subtract <- function(arg1, arg2) {
  result <- arg1 - arg2
  return(result)
}
subtract(2,1)
[1] 1


Return multiple objects

just_return <- function(arg1, arg2) {
return(arg1)
return(arg2)
}
just_return(2, 1)
[1] 2

This did not work as intended. R functions only return one object. Instead use lists or other data structures:

just_return <- function(arg1, arg2) {
return(c(arg1, arg2))
}
just_return(2, 1)
[1] 2 1

Binary operators

Standard function syntax:

sum(c(1,2))
[1] 3

Operator syntax:

1 + 2
[1] 3

Most binary operators come in %:

3 %in% c(1,2,3)
[1] TRUE


Custom binary operators – let’s define an operator for “not in”:

`%!in%` <- function(x, y) !(x %in% y)
3 %!in% c(1,2,3)
[1] FALSE

Control structures – if

if a condition is true, do something.

if (1 + 1 == 2) print("True")
[1] "True"


add_or_subtract <- function(arg1, arg2, operation) {
 if (operation == "add") {
   result <- arg1 + arg2
 }
 if (operation == "subtract") {
   result <- arg1 + arg2
 }
 result
}
add_or_subtract(2,1,"add")
[1] 3

Control structures – else

else instructs what to do when the if condition is not met.

if (1 + 1 == 3) print("True") else print("False")
[1] "False"


add_or_subtract <- function(arg1, arg2, operation) {
 if (operation == "add") {
   result <- arg1 + arg2
 } else {
   result <- arg1 - arg2
 }
 return(result)
}
add_or_subtract(2,1,"subtract")
[1] 1

Control structures – switch

Instead of many if and else statements, try switch

fossil_description <- function(fossil) {
 switch(fossil,
  ammonite = "coiled shell",
  Tyrannosaurus = "serrated teeth",
  Lepidodendron = "scaly bark", 
  "not a fossil"
 )
}
fossil_description("Tyrannosaurus")
[1] "serrated teeth"
fossil_description("Lewis")
[1] "not a fossil"

Control structures – for loops

Loops are used for repeating similar actions multiple times. for loops iterate over a set of values. The iterator (i) changes with every iteration of the loop:

for (i in c(1,2,3)) print(i)
[1] 1
[1] 2
[1] 3

To generate sequences of integers, we can use seq_len. Let’s make a function:

print_repetitions <- function(n) {
 for (i in seq_len(n)) { 
   print(i)
 }
}
print_repetitions(2)
[1] 1
[1] 2

Control structures – while loops

while loops repeat a task until a condition is no longer met.

add_until_4 <- function(x) {
  while(x < 4) {
    x <- x + 1
    print(x)
  }
}
add_until_4(1)
[1] 2
[1] 3
[1] 4

Exercise 1 - Function for latitudinal binning

Create a function that can sort a data.frame into hemispheres. That is, we want a new column that identifies the hemisphere of each entry of the data set. As an exemplary data set, we will use the reefs data from palaeoverse here and in the next exercises.

If you are an experienced R user, you can up the difficulty by writing a function that can sort data into any number of latitudinal bins of equal width.

Here is what the result may look like when sorted into hemispheres:

reefs[72:73, c("name", "lat", "hemisphere")]
          name     lat hemisphere
72 Begunjscica 46.4333      north
73     W-Ceram -3.2500      south

Exercise 1 - A solution for hemispheres

Code
sort_hemisphere <- function(df) {
  df$hemisphere <- rep(NA, nrow(df))
  for (i in 1:nrow(df)) {
    if (df$lat[i] >= 0) {
      df$hemisphere[i] <- "north"
    } else {
      df$hemisphere[i] <- "south"
    }
  }
  return(df)
}

Exercise 1 - A solution for latitudinal bins

Code
sort_lat_bins <- function(df, n_bin) {
  bin_width <- 180 / n_bin
  df$lat_bin <- rep(NA, nrow(df))
  for (i in 1:nrow(df)) {
    bin_number <- floor((df$lat[i] + 90) / bin_width) + 1
    bin_start <- (bin_number - 1) * bin_width - 90
    bin_end <- bin_start + bin_width
    df$lat_bin[i] <- (bin_start + bin_end) / 2
  }
  return(df)
}
# The bin names are the midpoints of the latitudinal bins.

Best practices

Naming style

Give variables and functions consistent names. These are the two most common styles:

snake_case

used by Tidyverse’s style guide

CamelCase

used by Google’s R style guide

Internal functions in R packages are often prefixed with a dot, e.g. .my_internal_function. Don’t prefix the file name with a dot.

Clean coding

Common practices:

  • Avoid renaming existing functions and variables
mean <- mean(c(1, 2, 3))
mean
[1] 2
mean(c(1, 2, 3))
[1] 2

R is clever, in this case this still works.

  • Use <- for assignment, not =
data = c(1, 2, 3)
data <- c(1, 2, 3)

Example of a detailed style guide: Tidyverse’s style guide

Automatic formatting

Automatically format R code by highlighting it and clicking on Code –> Reformat Code or by pressing Ctrl+Shift+A


Unformatted code

subtract <- function(arg1,arg2 ){
arg1 -arg2}


Formatted code

subtract <- function(arg1, arg2) {
  arg1 - arg2
}

lintr

The lintr package let’s you check that your code conforms to your chosen standard.

Here, we choose the tidyverse style:

install.packages("lintr")
lintr::use_lintr(type = "tidyverse")

And now we check for style violations in our project directory:

lintr::lint_dir()

Comments

  • document purpose and usage of code
  • explain complex / non-intuitive code
# Calculate convex hull
tmp <- tmp[chull(x = tmp[, lng], 
                 y = tmp[, lat]), c(lng, lat)]
  • organise code into sections
#=== Set-up ===
  unique_taxa <- unique(occdf[, name])
  # Order taxa
  unique_taxa <- unique_taxa[order(unique_taxa)]
#=== convex hull  ===

Good code shouldn’t need a large amount of comments - but comment enough that you can still use your code two years later.

Comments

Add some general information in the beginning of a large R script.

### Change point regression analysis
### July 2021
### Kilian Eichenseer
###
### Bayesian algorithm for finding a change point in 
### the linear relationship between two variables. 
### Uses JAGS (https://mcmc-jags.sourceforge.io/).

### Generate data
set.seed(10) 
n <- 60 # total number of data points

Even better: add formal documentation.

Documentation

?mean

roxygen2

  • R package, install with install.packages("roxygen2")
  • Used to create documentation for R packages (functions, data, …)
  • Start every line of documentation with #'

This could generate documentation for the subtract function from earlier:

#' Subtraction
#' 
#' Subtracts `arg2` from `arg1`
#'
#' @param arg1 `Numeric`. First argument.
#' @param arg2 `Numeric`. Second argument.
#' @return A `numeric` containing the difference 
#' between `arg1` and `arg2`.
#' @examples
#' subtract(2,1)

?subtract

## roxygen2

Commonly used roxygen2 tags are:

  • @param for function parameters
  • @return for the function output
  • @details for additional
  • @example provide examples of the function
  • @seealso reference related functions or external resources

For more details, see https://r-pkgs.org/man.html

Exercise 2 - Document your function

  • Make sure your code follows the same style throughout
  • Add in-line comments if necessary
  • Create documentation with roxygen2

Exercise 2 - Create a package

From RStudio, create a New Directory (File -> New Project...)

Exercise 2 - Create a package

Exercise 2 - Create a package

  • Save functions in the R folder as .R files
  • Documentation will be automatically created in the man folder
  • You may need to delete the NAMESPACE file once to avoid warnings

Exercise 2 - use roxygen2

We start by installing roxygen2 and the devtools package, if you haven’t installed them yet:

install.packages("roxygen2")
install.packages("devtools")

To generate a documentation template for your function, click on Code --> Insert Roxygen Skeleton in Rstudio.

To generate documentation from our roxygen2 comments, which are denoted by the #' tags, run

devtools::document()

or press Ctrl + Shift + D.

We can now read the documentation of our function by calling

?my_function # use the name of your documented function

Exercise 2 - A solution for hemispheres

Code
#' Sort data into hemispheres
#'
#' This function sorts data into hemispheres based on latitude values.
#' Rows with latitudes \eqn{\geq} 0 are categorised as "north"; rows with latitudes \eqn{<} 0 are categorised as "south".
#'
#' @param df A data frame with a column named 'lat' representing latitudinal values.
#'
#' @return A data frame with an additional column 'hemisphere', indicating  the hemisphere of each row.
#'
#' @examples
#' data <- data.frame(lat = c(30, -45, 60, -10))
#' data <- sort_hemisphere(data)
#'
#' @export
sort_hemisphere <- function(df) {
  # initialise hemisphere column
  df$hemisphere <- rep(NA, nrow(df))
  # loop through df and assign hemispheres
  for (i in 1:nrow(df)) {
    if (df$lat[i] >= 0) {
      df$hemisphere[i] <- "north"
    } else {
      df$hemisphere[i] <- "south"
    }
  }
  return(df)
}

Exercise 2 - A solution for hemispheres

After running devtools::deocument(), ?sort_hemisphere should show:
Click to reveal image

Testing and Debugging

Error handling

Very helpful in complex functions

subtract(2, 1)
[1] 1
subtract("2", 1)
Error in arg1 - arg2: non-numeric argument to binary operator

Check that input is correct and display custom error messages:

subtract <- function(arg1, arg2) {
  if (is.numeric(arg1) == FALSE) {
    stop("`arg1` must be numeric") 
    }
  arg1 - arg2
}
subtract("2", 1)
Error in subtract("2", 1): `arg1` must be numeric

tryCatch

Use if you anticipate an error but want function to continue.

Let’s try to generate data from a multivariate normal distribution:

mvnfast::rmvn(1, mu = c(0,0), sigma = matrix(rep(1,4), 2))
Error in mvnfast::rmvn(1, mu = c(0, 0), sigma = matrix(rep(1, 4), 2)): chol(): decomposition failed

mvnfast::rmvn is fast but fails for some problematic sigma values. In case it fails, we use MASS::mvrnorm instead:

my_rmvn <- function(n, mu, sigma) {
  tryCatch(mvnfast::rmvn(n, mu, sigma),
           error = function(e) MASS::mvrnorm(n, mu, sigma))
}
my_rmvn(n = 1, mu = c(0,0), sigma = matrix(rep(1,4), 2))
[1] 1.161649 1.161649

Traceback

If something went wrong, find out where using traceback():

my_rmvn(n = 1, mu = 0, sigma = matrix(rep(1,4), 2))
Error in MASS::mvrnorm(n, mu, sigma): incompatible arguments
traceback()

Break points

Break points allow you to look inside your function’s environment. Click next to a line of code in your function to activate a break point (a red dot appears):

Now run the function…

my_rmvn(n = 1, mu = c(0,0), sigma = matrix(rep(1,4), 2))

You may need to save the script and source it (press source in the Rstudio toolbar)

Break points

Break points

We can now browse the function environment in the console like we normally can browse the global environment. For example we can look at sigma:

Press Stop to end the browsing. Don’t forget to deactivate the break point by clicking on the red dot in the script.

Tests

Functions should be tested before they are used.

Sometimes, interactive testing may be enough.

For example, identical() tests whether two objects are exactly equal:

identical(subtract(2,1),
          1)
[1] TRUE

Automated testing with the testthat R package is superior.


Read more on testing at r-pkgs.org.

testthat

  • Used to create unit tests for R packages that can be run automatically
  • R package, install with install.packages("testthat")
  • Set up infrastructure for a test with usethis::use_test()

Create a test file for the subtract function:

usethis::use_test("subtract")

This has created a test-subtract.R file in package_name/tests/testthat/

testthat

In the test-subtract.R file, we can write tests, for example:

test_that("subtraction works", {
  expect_equal(subtract(2,1),
               1)
})

If we run this test, we should get

testthat

We can also run all of our tests at once with test_local or test_package("my_package"):

testthat::test_local()

If you have the devtools package installed, you can also use devtools::test() by pressing Test in the Build tab of Rstudio:

testthat

Useful tests include:

  • expect_equal
  • expect_true, expect_false
  • expect_error
  • expect_snapshot (for results that are difficult to describe with code)

For testing plot functions, the expect_doppelganger() function from the vdiffr package can be used:

test_that("plot function works", {
  plot_1 <- function() plot(1,2)
  vdiffr::expect_doppelganger("plot_1", plot_1)

This will create an image in the tests/testthat/_snaps/function_name directory. Upon first calling this, inspect the image to see if it is as expected. Future tests will fail if the function call in the test changes the image.

Test coverage

To check how much of your package is covered by tests, the covr package is helpful:

covr::package_coverage()

If you have loaded the package, you may need to unload the package before using the covr package, e.g. with

devtools::unload()

Test coverage

The report() function lets you check test coverage line by line:

covr::report()

Exercise 3 - Test your function

  • Add tests for your function using testthat functionality
  • Add custom error messages to your function, flagging inappropriate input

Setup for testing:

install.packages(testthat)
install.packages(usethis)
usethis::use_test("function_name")

Exercise 3 - A solution for hemispheres

Checks and error messages:

Code
sort_hemisphere <- function(df) {
  #
  # --- check that input has the correct format
  #
  if (!is.data.frame(df)) {
    stop("'df' is not a dataframe.")
  }
  if (!"lat" %in% names(df)) {
    stop("'df' does not contain a column named 'lat'.")
  }
  if (!is.numeric(df$lat)) {
    stop("Column 'lat' is not numeric.")
  }
  if (!(all(df$lat <= 90) && all(df$lat >= -90))) {
    stop("Column 'lat' may only contain latitudes from -90 to 90.")
  }
  #
  # --- assign hemispheres
  #
  # initialise hemisphere column
  df$hemisphere <- rep(NA, nrow(df))
  # loop through df and assign hemispheres
  for (i in 1:nrow(df)) {
    if (df$lat[i] >= 0) {
      df$hemisphere[i] <- "north"
    } else {
      df$hemisphere[i] <- "south"
    }
  }
  return(df)
}

Exercise 3 - A solution for hemispheres

testthat tests:

Code
# Tests for sort_hemisphere function

test_that("Function rejects non-dataframe input", {
  expect_error(sort_hemisphere("not a dataframe"), "'df' is not a dataframe.")
})

test_that("Function rejects dataframe without 'lat' column", {
  df <- data.frame(lng = c(30, -45, 60, -10))
  expect_error(sort_hemisphere(df), "'df' does not contain a column named 'lat'.")
})

test_that("Function rejects non-numeric 'lat' column", {
  df <- data.frame(lat = c("30N", "45S", "60N", "10S"))
  expect_error(sort_hemisphere(df), "Column 'lat' is not numeric.")
})

test_that("Function rejects 'lat' vales < -90 or > 90", {
  df <- data.frame(lat = c(30, -95, 100, -10 ))
  expect_error(sort_hemisphere(df), "Column 'lat' may only contain latitudes from -90 to 90.")
})

test_that("Function correctly assigns hemispheres", {
  df <- data.frame(lat = c(30, -45, 60, -10))
  result <- sort_hemisphere(df)
  expect_equal(result$hemisphere, c("north", "south", "north", "south"))
})

Advanced tools

apply

  • apply and related functions apply a function to elements of arrays, lists, …
  • more concise than loops and don’t change global environment

For example, to get the class of the first three columns of the reefs data:

reef_classes <- rep(NA,3)
for (i in 1:3) {
  reef_classes[i] <- class(palaeoverse::reefs[,i])
}
reef_classes
[1] "character" "character" "character"

Or with apply:

apply(palaeoverse::reefs[,1:3], MARGIN = 2, FUN = class)
   r_number        name   formation 
"character" "character" "character" 

apply

Let’s have a look what happened there. apply(X, MARGIN, FUN) takes an array X (our reefs dataframe), and applies a function (FUN) to elements of that array, specified by MARGIN.

MARGIN = 2 indicates columns, MARGIN = 1 would be rows. So here we applied the class function to every column of reefs.

apply simplifies the output, so here it returned a vector with one element for each column.

lapply

lapply is similar to apply but for list or vector input. It returns a list for each element of the data.

data <- list(c(1, 2, 3), c(4, 5))
lapply(data, mean)
[[1]]
[1] 2

[[2]]
[1] 4.5

sapply does the same, but also tries to simplify the output; here we get a vector:

data <- list(c(1, 2, 3), c(4, 5))
sapply(data, mean)
[1] 2.0 4.5

vapply

vapply is a safer version of sapply, it requires the user to specify the anticipated class and length of the elements of the output:

data <- list(c(1, 2, 3), c(4, 5))
vapply(data, FUN = mean, FUN.VALUE = numeric(1))
[1] 2.0 4.5

Here, we specified in FUN.VALUE that each element of the output should be a numeric of length 1.

do.call

do.call() takes a function and a list of arguments, and applies the function to the arguments. This can be useful in a variety of situations:


To combine data:

a <- list(c(1, 2, 3), c(4, 5, 6))
do.call(rbind, a)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6


When we have a function name as a string:

my_function <- "mean"
do.call(my_function, list(c(1, 2, 3)))
[1] 2

do.call

With dynamic arguments:

For example, we want to allow the user to specify additional arguments for a custom plot function, including xlim. In case he doesn’t specify it, we want to automatically generate xlim values:

my_plot <- function(x, ...) {
    args <- list(...) # save the additional arguments
  if("xlim" %in% names(args)) {
    xlim <- args$xlim # if user provides xlim, use that
    args$xlim <- NULL # to avoid argument duplication
  } else {
    xlim <- range(x) + c(-1, 1) # automatic xlim
  }
  do.call(plot, c(list(x = x, xlim = xlim), args))
}

do.call

par(mfrow = c(1,2))
my_plot(1:3, pch = 17, 
        main = "automatic xlim")
my_plot(1:3, pch = 17, xlim = c(1, 5), 
        main = "user-specified xlim")

Environment

Environment can be conceptualised as a place where objects with a name and value exist.

For example, when we run

x <- 1

x now exists in the global environment, and has the value 1.


Each function, for loop, …, creates its own environment.

Environment

If we run the following function to assign to b the value of a

a_to_b <- function(a) b <- a
a_to_b(a = 1)

and then look for b in the global environment

b
Error in eval(expr, envir, enclos): object 'b' not found

we get an error because b only existed within the function environment.


More on environments: adv-r.hadley.nz/environments.html

Scoping

R uses scoping rules to look for variables (or functions). If a variable is not found in a function environment, R looks in the parent environment (i.e. the environment in which the function was created).

x <- 1
double_x <- function(y) 2 * x + y
double_x(0)
[1] 2

x is a free variable in the double_x function – it is not supplied to or defined in the function. Instead, it’s looked up in the environment where double_x was created, the global environment.

This can get tricky, see here for more details: bookdown.org/rdpeng/rprogdatascience/scoping-rules-of-r.html

Measuring performance

If you have large data sets and complex functions, you may want to enhance their performance.


To check how long an operation takes, use system.time:

system.time(rnorm(10^6))
   user  system elapsed 
   0.00    0.00    0.05 

Measuring performance

The microbenchmark package performs an operation many times, and measures the average time it takes. You can also compare different operations.


What is faster, rnorm or rlnorm?

microbenchmark::microbenchmark(rnorm(10^4),
                               rlnorm(10^4))
Unit: microseconds
         expr   min     lq     mean  median      uq    max neval
  rnorm(10^4) 330.3 358.05  466.132  423.45  508.45 1114.8   100
 rlnorm(10^4) 865.1 909.05 1169.391 1066.15 1359.55 2446.1   100

Profiling

The profvis package lets you identify bottlenecks in your code:

short_pause <- function(x) Sys.sleep(0.1 * x)
long_pause <- function(x) Sys.sleep(0.2 * x)

time_waste <- function(x) {
  short_pause(x)
  long_pause(x)
}
profvis::profvis(time_waste(1))

Speeding up functions

Only spend time trying to make your code faster if

  • it works as intended
  • you have identified the parts that are slowing it down


Here is a good overview on making R functions run faster: Best coding practices in R

Memory allocation

Pre-allocating memory is faster than growing objects repeatedly.

Assume, we have recorded the results of 1,000 dice rolls:

data <- sample.int(6, n = 100, replace = TRUE)

Now, we want to check which ones show “6”. Compare these two approaches:

is_six_1 <- function(x) {
  res <- NULL
  for (i in seq_along(x)) {
    res[i] <- x[i] == 6
  }
  res
}
is_six_2  <- function(x) {
  res <- rep(NA,length(x))
  for (i in seq_along(x)) {
    res[i] <- x[i] == 6
  }
  res
}

Memory allocation

Let’s check which approach is faster:

microbenchmark::microbenchmark(is_six_1(data),
                               is_six_2(data), times = 1000)
Unit: nanoseconds
           expr  min   lq   mean median   uq     max neval
 is_six_1(data) 1300 1400 3892.4   1500 1600 2277800  1000
 is_six_2(data)  800  900 3333.4   1000 1100 2261000  1000


is_six_2() is faster, as R doesn’t have to grow the res object in every iteration.

Vectorisation

R has many functions that are vectorised.

Instead of running a loop to check each value of our dice rolls, we can check them all at once by taking advantage of the ability of == to take vector input:

is_six_2  <- function(x) {
  res <- rep(NA,length(x))
  for (i in seq_along(x)) {
    res[i] <- x[i] == 6
  }
  res
}
is_six_3 <- function(x) {
  x == 6
}

Vectorisation

microbenchmark::microbenchmark(is_six_2(data), 
                               is_six_3(data), times = 1000)
Unit: nanoseconds
           expr  min   lq   mean median   uq     max neval
 is_six_2(data) 1200 1400 5263.4   1500 1600 3656300  1000
 is_six_3(data)  300  400 1572.9    400  500 1088400  1000


The vectorised version is_six_3() is much faster.

Vectorisation

Vectorised matrix functions like rowSums(), colSums() or rowMeans()can lead to impressive speedups:

data <- matrix(rnorm(10^4), nrow = 100)

microbenchmark::microbenchmark(
  apply(data, 2, sum),
  colSums(data),
  times = 1000)
Unit: microseconds
                expr   min    lq     mean median     uq     max neval
 apply(data, 2, sum) 147.8 185.0 311.6260  226.5 286.35 15924.4  1000
       colSums(data)   6.6   7.6  10.6659    8.4  10.30   185.6  1000

Rcpp

The Rcpp package allows for integrating C++ code with R, which can make R functions a lot faster. It requires the installation of a C++ compiler (R tools for Windows, Xcode for Mac, possibly “sudo apt-get install r-base-dev” on Linux)



I also highly recommend ChatGpt for help with creating C++ functions.

Rcpp

As an example, let’s compare a random walk implemented in R with one implemented with Rcpp.

First, the R version:

random_walk_R <- function(steps) {
  walk <- numeric(steps)
  for (i in 2:steps) {
    walk[i] <- walk[i-1] + ifelse(runif(1) > 0.5, 1, -1)
  }
  return(walk)
}

Rcpp

Next, the Rcpp version:

library(Rcpp)
sourceCpp(
'#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector random_walk_Rcpp(int steps) {
  NumericVector walk(steps);
  for(int i = 1; i < steps; i++) {
    walk[i] = walk[i-1] + (R::runif(0, 1) > 0.5 ? 1 : -1);
  }
  return walk;
}')

We now have the function random_walk_Rcpp in the global environment.

Rcpp

Let’s make sure both versions work:

plot(random_walk_R(1000), type = "l", ylim = c(-50,50))
points(random_walk_Rcpp(1000),type = "l", col = "red")

Rcpp

Now let us compare the speed:

microbenchmark::microbenchmark(
  random_walk_R(1000),
  random_walk_Rcpp(1000),
  times = 100)


Loops are much faster in C++!

Exercise 4

  • Try profiling your code with profvis(). The result may not be very informative as this was probably a short, rather quick function
  • Can you use apply or similar functions to reduce the amount of code?
  • Can you use vectorisation or other tricks to speed it up?
  • Compare your old aproach and your updated function with microbenchmark()

Exercise 4 - A solution for hemispheres

Profvis doesn’t tell us much here:

Code
library(profvis)
profvis(sort_hemisphere(reefs))
Click to reveal image

Exercise 4 - A solution for hemispheres

Using vapply instead of a for loop:

Code
sort_hemisphere_vapply <- function(df) {
  df$hemisphere <- vapply(df$lat, function(l) {
    if (l >= 0) {
      return("north")
    } else {
      return("south")
    }
  }, FUN.VALUE = character(1))
  return(df)
}

Exercise 4 - A solution for hemispheres

Taking advantage of vectorisation:

Code
sort_hemisphere_ifelse <- function(df) {
  df$hemisphere <- ifelse(df$lat >= 0, "north", "south")
  return(df)
}

Exercise 4 - A solution for hemispheres

Comparing the approaches with microbenchmark:

Code
library(microbenchmark)
microbenchmark(
  sort_hemisphere(reefs),
  sort_hemisphere_vapply(reefs),
  sort_hemisphere_ifelse(reefs)
)
## Unit: microseconds
##                           expr     min       lq      mean   median       uq
##         sort_hemisphere(reefs) 50641.0 57764.00 65826.721 62712.50 68810.45
##  sort_hemisphere_vapply(reefs)  1882.0  1992.60  2683.510  2343.35  2969.55
##  sort_hemisphere_ifelse(reefs)   733.7   788.25   951.854   828.80  1011.55
##       max neval
##  145343.2   100
##   13024.0   100
##    2714.3   100