--- title: Performance experiments with duckplyr in R description: I recently became aware of the 'duckplyr' library for R. Here are the results of my experimenting with it and benchmarking it against `dplyr`. date: 2024-08-26T19:00:00+0200 draft: false # ShowLastmod: true toc: false scrolltotop: true tags: - R - statistics --- I recently became aware of the [duckplyr][] library for R, which takes the place of tidyverse's [dplyr][] library, but uses the [DuckDB] database under the hood. Without really knowing anything about how dplyr works and if the use of DuckDB would improve my workflow at all, I decided to perform an experiment. I am currently analyzing two datasets, one with ~80k records and ~70 variables and one with ~60k records and ~100 variables. Both datasets are wrangled with [Tidyverse][]-foo in multiple ways and finally combined. The wrangling of the data involves things like `rowwise()` and `c_across()`, which I know from experience is quite an 'expensive' operation. In order to get the execution times of my code, I did this repeatedly: 1. Restart R (by pressing CTRL SHIFT F10). 2. Run ```r system.time(rmarkdown::render("my_file.Rmd")) ``` 3. Record the user time and the system time elapsed. 4. Repeat twice. I did this with both the "balanced power mode" and the "performance mode" on my [laptop][]. During execution of the code, I left the laptop alone in order not to interfere with the timing. This is the result of my benchmarking: {{< figure src="benchmarking1.svg" >}} The times are user times. I left out the system times, which are in the range of 2-3 seconds. Not really mind-boggling, right? It occurred to me that I rather double-check that `duckplyr` was really being used. Indeed, this was _not_ the case: ```r > class(clinical_data) [1] "tbl_df" "tbl" "data.frame" ``` `clinical_data` was missing the `duckplyr_df' class. How come? I import the raw data from Excel files (don't ask...) into tibbles, and evidently, this prevents `duckplyr` from seeing the data frames. So I piped the data frames through `as_duckplyr_tibble()` explicitly, and this got me the right classes: ```r > class(clinical_data) [1] "duckplyr_df" "tbl_df" "tbl" "data.frame" ``` However, this did not really speed up the execution either. {{< figure src="benchmarking2.svg" >}} I looked around my RMarkdown chunks and their outputs, but I did not find any warning that `duckplyr` had to fall back to `dplyr`'s methods. This could have explained the absence of a noticeable difference. Here are the average times (in seconds) for the benchmarking runs. ```r > runs_table # A tibble: 6 × 4 # Groups: library, power_mode [6] library power_mode mean sd 1 dplyr balanced 31.8 0.722 2 dplyr performance 32.6 0.477 3 duckplyr balanced 31.4 1.10 4 duckplyr performance 31.3 0.495 5 duckplyr with `as_duckplyr_tibble` balanced 36.0 0.517 6 duckplyr with `as_duckplyr_tibble` performance 33.6 0.303 ``` So at least for my (!!!) use case, the use of `duckplyr` instead of `dplyr` did not make any practical difference, and I can also leave my laptop's performance mode alone. When it comes to optimizing performance, you can't just buy a solution off the shelf, you always have to try and find the best solution for your specific problem. Your mileage will vary, of course. The people who develop `duckplyr` are brilliant, and the fact that it does not work for me tells more about me and my work than it does about `duckplyr`. ## The duckplyr demo dataset As a case in point, the [duckplyr demo repository][duckplyr-demo] contains a taxi data set. The ZIP file alone is a ~1.7 GB download. Deflated, the files take up 2.4 GB. With about 21 million records (24 variables), this dataset is _considerably_ larger than mine. Here are the results from running `dplyr/run_all_queries.R` and `duckplyr/run_all_queries.R` on my Thinkpad P14s (performance mode in F40 KDE): | Library | q01 | q02 | q03 | q04 | |----------|------:|------:|------:|-------:| | dplyr | 3.4 s | 3.9 s | 9.1 s | 14.3 s | | duckplyr | 4.3 s | 4.4 s | 9.4 s | 14.8 s | I should add that execution times vary with each run, but the big picture stays the same. Maybe I'm missing the point and it's not about execution times, after all. `¯\_(ツ)_/` [dplyr]: https:/dplyr.tidyverse.org [duckdb]: https://duckdb.org [duckplyr]: https://duckplyr.tidyverse.org [duckplyr-demo]: https://github.com/Tmonster/duckplyr_demo [laptop]: {{< relref "P14s" >}} [tidyverse]: https://tidyverse.org