blog/content/posts/2024/duckplyr-performance/index.md

128 lines
4.7 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
title: Performance experiments with duckplyr in R
description: I recently became aware of the 'duckplyr' library for R. Here are the results of my experimenting with it and benchmarking it against `dplyr`.
date: 2024-08-26T19:00:00+0200
draft: false
# ShowLastmod: true
toc: false
scrolltotop: true
tags:
- R
- statistics
---
I recently became aware of the [duckplyr][] library for R, which takes the place
of tidyverse's [dplyr][] library, but uses the [DuckDB] database under the hood.
Without really knowing anything about how dplyr works and if the use of DuckDB
would improve my workflow at all, I decided to perform an experiment. I am
currently analyzing two datasets, one with ~80k records and ~70 variables and
one with ~60k records and ~100 variables. Both datasets are wrangled with
[Tidyverse][]-foo in multiple ways and finally combined. The wrangling of the
data involves things like `rowwise()` and `c_across()`, which I know from
experience is quite an 'expensive' operation.
In order to get the execution times of my code, I did this repeatedly:
1. Restart R (by pressing <kbd>CTRL</kbd> <kbd>SHIFT</kbd> <kbd>F10</kbd>).
2. Run
```r
system.time(rmarkdown::render("my_file.Rmd"))
```
3. Record the user time and the system time elapsed.
4. Repeat twice.
I did this with both the "balanced power mode" and the "performance mode" on my
[laptop][]. During execution of the code, I left the laptop alone in order not
to interfere with the timing.
This is the result of my benchmarking:
{{< figure src="benchmarking1.svg" >}}
The times are user times. I left out the system times, which are in the range of
2-3 seconds.
Not really mind-boggling, right? It occurred to me that I rather double-check
that `duckplyr` was really being used. Indeed, this was _not_ the case:
```r
> class(clinical_data)
[1] "tbl_df" "tbl" "data.frame"
```
`clinical_data` was missing the `duckplyr_df' class. How come?
I import the raw data from Excel files (don't ask...) into tibbles, and
evidently, this prevents `duckplyr` from seeing the data frames. So I piped the
data frames through `as_duckplyr_tibble()` explicitly, and this got me the right
classes:
```r
> class(clinical_data)
[1] "duckplyr_df" "tbl_df" "tbl" "data.frame"
```
However, this did not really speed up the execution either.
{{< figure src="benchmarking2.svg" >}}
I looked around my RMarkdown chunks and their outputs, but I did not find any
warning that `duckplyr` had to fall back to `dplyr`'s methods. This could have
explained the absence of a noticeable difference.
Here are the average times (in seconds) for the benchmarking runs.
```r
> runs_table
# A tibble: 6 × 4
# Groups: library, power_mode [6]
library power_mode mean sd
<chr> <chr> <dbl> <dbl>
1 dplyr balanced 31.8 0.722
2 dplyr performance 32.6 0.477
3 duckplyr balanced 31.4 1.10
4 duckplyr performance 31.3 0.495
5 duckplyr with `as_duckplyr_tibble` balanced 36.0 0.517
6 duckplyr with `as_duckplyr_tibble` performance 33.6 0.303
```
So at least for my (!!!) use case, the use of `duckplyr` instead of `dplyr` did
not make any practical difference, and I can also leave my laptop's performance
mode alone. When it comes to optimizing performance, you can't just buy a
solution off the shelf, you always have to try and find the best solution for
your specific problem.
Your mileage will vary, of course. The people who develop `duckplyr` are
brilliant, and the fact that it does not work for me tells more about me and my
work than it does about `duckplyr`.
## The duckplyr demo dataset
As a case in point, the [duckplyr demo repository][duckplyr-demo] contains a
taxi data set. The ZIP file alone is a ~1.7 GB download. Deflated, the files
take up 2.4 GB. With about 21 million records (24 variables), this dataset
is _considerably_ larger than mine.
Here are the results from running `dplyr/run_all_queries.R` and
`duckplyr/run_all_queries.R` on my Thinkpad P14s (performance mode in F40 KDE):
| Library | q01 | q02 | q03 | q04 |
|----------|------:|------:|------:|-------:|
| dplyr | 3.4 s | 3.9 s | 9.1 s | 14.3 s |
| duckplyr | 4.3 s | 4.4 s | 9.4 s | 14.8 s |
I should add that execution times vary with each run, but the big picture stays
the same.
Maybe I'm missing the point and it's not about execution times, after all.
`¯\_(ツ)_/`
[dplyr]: https:/dplyr.tidyverse.org
[duckdb]: https://duckdb.org
[duckplyr]: https://duckplyr.tidyverse.org
[duckplyr-demo]: https://github.com/Tmonster/duckplyr_demo
[laptop]: {{< relref "P14s" >}}
[tidyverse]: https://tidyverse.org