blog/content/posts/2024/duckplyr-performance/index.md

---
title: Performance experiments with duckplyr in R
description: I recently became aware of the 'duckplyr' library for R. Here are the results of my experimenting with it and benchmarking it against `dplyr`.
date: 2024-08-26T19:00:00+0200
draft: false
# ShowLastmod: true
toc: false
scrolltotop: true
tags:
  - R
  - statistics
---
I recently became aware of the [duckplyr][] library for R, which takes the place
of tidyverse's [dplyr][] library, but uses the [DuckDB] database under the hood.
Without really knowing anything about how dplyr works and if the use of DuckDB
would improve my workflow at all, I decided to perform an experiment. I am
currently analyzing two datasets, one with ~80k records and ~70 variables and
one with ~60k records and ~100 variables. Both datasets are wrangled with
[Tidyverse][]-foo in multiple ways and finally combined. The wrangling of the
data involves things like `rowwise()` and `c_across()`, which I know from
experience is quite an 'expensive' operation.

In order to get the execution times of my code, I did this repeatedly:

1. Restart R (by pressing <kbd>CTRL</kbd> <kbd>SHIFT</kbd> <kbd>F10</kbd>).
2. Run

   ```r
   system.time(rmarkdown::render("my_file.Rmd"))
   ```

3. Record the user time and the system time elapsed.
4. Repeat twice.

I did this with both the "balanced power mode" and the "performance mode" on my
[laptop][]. During execution of the code, I left the laptop alone in order not
to interfere with the timing.

This is the result of my benchmarking:

{{< figure src="benchmarking1.svg" >}}

The times are user times. I left out the system times, which are in the range of
2-3 seconds.

Not really mind-boggling, right? It occurred to me that I rather double-check
that `duckplyr` was really being used. Indeed, this was _not_ the case:

```r
> class(clinical_data)
[1] "tbl_df"      "tbl"         "data.frame"
```

`clinical_data` was missing the `duckplyr_df' class. How come?

I import the raw data from Excel files (don't ask...) into tibbles, and
evidently, this prevents `duckplyr` from seeing the data frames. So I piped the
data frames through `as_duckplyr_tibble()` explicitly, and this got me the right
classes:

```r
> class(clinical_data)
[1] "duckplyr_df" "tbl_df"      "tbl"         "data.frame"
```

However, this did not really speed up the execution either.

{{< figure src="benchmarking2.svg" >}}

I looked around my RMarkdown chunks and their outputs, but I did not find any
warning that `duckplyr` had to fall back to `dplyr`'s methods. This could have
explained the absence of a noticeable difference.

Here are the average times (in seconds) for the benchmarking runs.

```r
> runs_table
# A tibble: 6 × 4
# Groups:   library, power_mode [6]
  library                            power_mode   mean    sd
  <chr>                              <chr>       <dbl> <dbl>
1 dplyr                              balanced     31.8 0.722
2 dplyr                              performance  32.6 0.477
3 duckplyr                           balanced     31.4 1.10
4 duckplyr                           performance  31.3 0.495
5 duckplyr with `as_duckplyr_tibble` balanced     36.0 0.517
6 duckplyr with `as_duckplyr_tibble` performance  33.6 0.303
```

So at least for my (!!!) use case, the use of `duckplyr` instead of `dplyr` did
not make any practical difference, and I can also leave my laptop's performance
mode alone. When it comes to optimizing performance, you can't just buy a
solution off the shelf, you always have to try and find the best solution for
your specific problem.

Your mileage will vary, of course. The people who develop `duckplyr` are
brilliant, and the fact that it does not work for me tells more about me and my
work than it does about `duckplyr`.

## The duckplyr demo dataset

As a case in point, the [duckplyr demo repository][duckplyr-demo] contains a
taxi data set. The ZIP file alone is a ~1.7 GB download. Deflated, the files
take up 2.4 GB. With about 21 million records (24 variables), this dataset
is _considerably_ larger than mine.

Here are the results from running `dplyr/run_all_queries.R` and
`duckplyr/run_all_queries.R` on my Thinkpad P14s (performance mode in F40 KDE):

| Library  | q01   | q02   | q03   | q04    |
|----------|------:|------:|------:|-------:|
| dplyr    | 3.4 s | 3.9 s | 9.1 s | 14.3 s |
| duckplyr | 4.3 s | 4.4 s | 9.4 s | 14.8 s |

I should add that execution times vary with each run, but the big picture stays
the same.

Maybe I'm missing the point and it's not about execution times, after all.

`¯\_(ツ)_/`

[dplyr]: https:/dplyr.tidyverse.org
[duckdb]: https://duckdb.org
[duckplyr]: https://duckplyr.tidyverse.org
[duckplyr-demo]: https://github.com/Tmonster/duckplyr_demo
[laptop]: {{< relref "P14s" >}}
[tidyverse]: https://tidyverse.org