blog/content/posts/2024-08-26-duckplyr-performance/index.md

128 lines
4.7 KiB
Markdown
Raw Normal View History

2024-08-28 12:44:13 +00:00
---
title: Performance experiments with duckplyr in R
description: I recently became aware of the 'duckplyr' library for R. Here are the results of my experimenting with it and benchmarking it against `dplyr`.
date: 2024-08-26T19:00:00+0200
draft: false
# ShowLastmod: true
toc: false
scrolltotop: true
tags:
- R
- statistics
---
I recently became aware of the [duckplyr][] library for R, which takes the place
of tidyverse's [dplyr][] library, but uses the [DuckDB] database under the hood.
Without really knowing anything about how dplyr works and if the use of DuckDB
would improve my workflow at all, I decided to perform an experiment. I am
currently analyzing two datasets, one with ~80k records and ~70 variables and
one with ~60k records and ~100 variables. Both datasets are wrangled with
[Tidyverse][]-foo in multiple ways and finally combined. The wrangling of the
data involves things like `rowwise()` and `c_across()`, which I know from
experience is quite an 'expensive' operation.
In order to get the execution times of my code, I did this repeatedly:
1. Restart R (by pressing <kbd>CTRL</kbd> <kbd>SHIFT</kbd> <kbd>F10</kbd>).
2. Run
```r
system.time(rmarkdown::render("my_file.Rmd"))
```
3. Record the user time and the system time elapsed.
4. Repeat twice.
I did this with both the "balanced power mode" and the "performance mode" on my
[laptop][]. During execution of the code, I left the laptop alone in order not
to interfere with the timing.
This is the result of my benchmarking:
{{< figure src="benchmarking1.svg" >}}
The times are user times. I left out the system times, which are in the range of
2-3 seconds.
Not really mind-boggling, right? It occurred to me that I rather double-check
that `duckplyr` was really being used. Indeed, this was _not_ the case:
```r
> class(clinical_data)
[1] "tbl_df" "tbl" "data.frame"
```
`clinical_data` was missing the `duckplyr_df' class. How come?
I import the raw data from Excel files (don't ask...) into tibbles, and
evidently, this prevents `duckplyr` from seeing the data frames. So I piped the
data frames through `as_duckplyr_tibble()` explicitly, and this got me the right
classes:
```r
> class(clinical_data)
[1] "duckplyr_df" "tbl_df" "tbl" "data.frame"
```
However, this did not really speed up the execution either.
{{< figure src="benchmarking2.svg" >}}
I looked around my RMarkdown chunks and their outputs, but I did not find any
warning that `duckplyr` had to fall back to `dplyr`'s methods. This could have
explained the absence of a noticeable difference.
Here are the average times (in seconds) for the benchmarking runs.
```r
> runs_table
# A tibble: 6 × 4
# Groups: library, power_mode [6]
library power_mode mean sd
<chr> <chr> <dbl> <dbl>
1 dplyr balanced 31.8 0.722
2 dplyr performance 32.6 0.477
3 duckplyr balanced 31.4 1.10
4 duckplyr performance 31.3 0.495
5 duckplyr with `as_duckplyr_tibble` balanced 36.0 0.517
6 duckplyr with `as_duckplyr_tibble` performance 33.6 0.303
```
So at least for my (!!!) use case, the use of `duckplyr` instead of `dplyr` did
not make any practical difference, and I can also leave my laptop's performance
mode alone. When it comes to optimizing performance, you can't just buy a
solution off the shelf, you always have to try and find the best solution for
your specific problem.
Your mileage will vary, of course. The people who develop `duckplyr` are
brilliant, and the fact that it does not work for me tells more about me and my
work than it does about `duckplyr`.
## The duckplyr demo dataset
As a case in point, the [duckplyr demo repository][duckplyr-demo] contains a
taxi data set. The ZIP file alone is a ~1.7 GB download. Deflated, the files
take up 2.4 GB. With about 21 million records (24 variables), this dataset
is _considerably_ larger than mine.
Here are the results from running `dplyr/run_all_queries.R` and
`duckplyr/run_all_queries.R` on my Thinkpad P14s (performance mode in F40 KDE):
| Library | q01 | q02 | q03 | q04 |
|----------|------:|------:|------:|-------:|
| dplyr | 3.4 s | 3.9 s | 9.1 s | 14.3 s |
| duckplyr | 4.3 s | 4.4 s | 9.4 s | 14.8 s |
I should add that execution times vary with each run, but the big picture stays
the same.
Maybe I'm missing the point and it's not about execution times, after all.
`¯\_(ツ)_/`
[dplyr]: https:/dplyr.tidyverse.org
[duckdb]: https://duckdb.org
[duckplyr]: https://duckplyr.tidyverse.org
[duckplyr-demo]: https://github.com/Tmonster/duckplyr_demo
[laptop]: {{< ref "/posts/2024-08-05-linux-on-thinkpad-P14s-Gen-5" >}}
[tidyverse]: https://tidyverse.org