Brett Klamer

Use This, Not That

Get lengths along a list

length() is a fast primitive function, but to get the length of every element in a list, lengths() (note the trailing s) is faster than applying length() over the list with a function such as vapply().

# Generate some data
set.seed(1)
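# Elements are left unnamed so that lengths() and vapply() return
# identical (unnamed) integer vectors in the equality check below.
short_list <- list(
  1:2,
  1:3,
  1:4
)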
long_list <- replicate(
  1000, 
  sample(1:10000, size = sample(1000:10000, 1)), 
  simplify = FALSE
)

# Benchmark Times
check_equality <- function(values) {
  all(sapply(values[-1], function(x) identical(values[[1]], x)))
}
microbenchmark::microbenchmark(
  short_lengths = lengths(short_list),
  short_vapply = vapply(
    X = short_list, 
    FUN = length, 
    FUN.VALUE = integer(1), 
    USE.NAMES = FALSE
  ),
  check = check_equality
)
## Unit: nanoseconds
##           expr  min   lq    mean median   uq   max neval
##  short_lengths  441  485  671.74  639.0  733  4170   100
##   short_vapply 3249 3509 3959.05 3729.5 4032 15962   100
microbenchmark::microbenchmark(
  long_lengths = lengths(long_list),
  long_vapply = vapply(
    X = long_list, 
    FUN = length, 
    FUN.VALUE = integer(1), 
    USE.NAMES = FALSE
  ),
  check = check_equality
)
## Unit: microseconds
##          expr     min       lq      mean  median      uq      max neval
##  long_lengths   9.797  11.2365  14.30106  12.894  13.708   51.518   100
##   long_vapply 287.307 306.3855 337.12065 312.305 323.983 2365.191   100
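
One detail to keep in mind when replacing vapply() with lengths(): lengths() keeps the names of the list by default, while vapply() with USE.NAMES = FALSE drops them. The sketch below is a minimal illustration; named_list is a hypothetical example object and is not part of the benchmarks.

```r
# Hypothetical named list, for illustration only
named_list <- list(a = 1:2, b = 1:3, c = 1:4)

# lengths() returns a named integer vector by default
with_names <- lengths(named_list)

# use.names = FALSE drops the names
without_names <- lengths(named_list, use.names = FALSE)

# The vapply() form used in the benchmarks, applied to the named list
via_vapply <- vapply(
  X = named_list,
  FUN = length,
  FUN.VALUE = integer(1),
  USE.NAMES = FALSE
)

# Both unnamed results should match exactly
identical(without_names, via_vapply)
```

If the names are needed downstream, the default call is enough; if an exact drop-in replacement for the unnamed vapply() result is wanted, use.names = FALSE is the safer choice.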

Get data.frame row lengths

Here we compare ways of getting the number of rows in a data.frame. The benchmark also shows the penalty for calling a function every time the value is needed versus computing it once and reusing the stored object.

  • stored_value: The time it takes to read the row count when it has already been assigned to an object.
  • stored_subset1: The time it takes to extract the row count from a stored dim() result using single bracket ([) indexing.
  • stored_subset2: As above, using double bracket ([[) indexing.
  • internal: Using the internal function .row_names_info(). Although this is an internal function with a dot name, it appears to be safe to use because it is exported from base and documented alongside row.names().
  • primitive: Using the primitive function dim() along with single bracket ([) indexing.
  • convenience: Using the convenience function nrow() (which just calls dim()[1L]).
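
For the internal option, the type argument matters. As described in the row.names() documentation, type = 2L returns the number of rows, while the default type = 1L returns a negative count when the row names are automatic. A small sketch of that behavior follows; auto_df and named_df are hypothetical example objects.

```r
# Hypothetical data.frames, for illustration only
auto_df <- data.frame(a = 1:3)                                  # automatic row names
named_df <- data.frame(a = 1:3, row.names = c("x", "y", "z"))   # explicit row names

# type = 2L gives the row count in both cases
stopifnot(.row_names_info(auto_df, type = 2L) == nrow(auto_df))
stopifnot(.row_names_info(named_df, type = 2L) == nrow(named_df))

# type = 1L (the default) is negative for automatic row names
stopifnot(.row_names_info(auto_df, type = 1L) < 0)
```

This is why type = 2L is the form to use when the function stands in for nrow().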

# Generate data
data <- data.frame(a = 1)

# Store the dimensions and the row count (the first element of dim())
dims <- dim(data)
n_row <- dims[1L]

# Benchmark Times
check_equality <- function(values) {
  all(sapply(values[-1], function(x) identical(values[[1]], x)))
}
microbenchmark::microbenchmark(
  stored_value = n_row, 
  stored_subset1 = dims[1L], 
  stored_subset2 = dims[[1L]], 
  internal = .row_names_info(data, type = 2L), 
  primitive = dim(data)[1L], 
  convenience = nrow(data),
  check = check_equality
)
## Unit: nanoseconds
##            expr  min     lq    mean median     uq   max neval
##    stored_value   31   47.0   64.96   65.0   74.5   439   100
##  stored_subset1  103  133.5  182.89  176.0  207.5   793   100
##  stored_subset2  147  197.0  282.03  265.5  298.0  2505   100
##        internal  416  549.0  648.66  613.0  718.0  1642   100
##       primitive 2532 2795.0 2998.44 2931.5 3106.5  5507   100
##     convenience 2725 2952.0 3418.19 3172.5 3387.0 21887   100
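
The gap between stored_value and convenience matters mostly inside tight loops. The sketch below illustrates the idea under the assumption that the number of rows does not change while looping; count_with_nrow() and count_with_stored() are hypothetical helper names used only for this example.

```r
# Hypothetical data, for illustration only
df <- data.frame(a = 1:5)

# Calls nrow() on every iteration
count_with_nrow <- function(df, times) {
  total <- 0L
  for (i in seq_len(times)) {
    total <- total + nrow(df)
  }
  total
}

# Looks the row count up once and reuses the stored value
count_with_stored <- function(df, times) {
  n_row <- nrow(df)
  total <- 0L
  for (i in seq_len(times)) {
    total <- total + n_row
  }
  total
}

# Both versions compute the same total
identical(count_with_nrow(df, 100L), count_with_stored(df, 100L))
```

The saving per call is a few microseconds at most, so hoisting the lookup is only worth it when the call sits in code that runs many thousands of times.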