From b5118b2835f78644a4df480a348fdc8317828c24 Mon Sep 17 00:00:00 2001 From: Daniel Kraus Date: Wed, 28 Aug 2024 14:44:13 +0200 Subject: [PATCH] Add post about duckplyr performance. --- assets/scss/_predefined.scss | 12 ++ config.toml | 3 + .../benchmarking1.svg | 74 ++++++++++ .../benchmarking2.svg | 87 ++++++++++++ .../2024-08-26-duckplyr-performance/index.md | 127 ++++++++++++++++++ 5 files changed, 303 insertions(+) create mode 100644 content/posts/2024-08-26-duckplyr-performance/benchmarking1.svg create mode 100644 content/posts/2024-08-26-duckplyr-performance/benchmarking2.svg create mode 100644 content/posts/2024-08-26-duckplyr-performance/index.md diff --git a/assets/scss/_predefined.scss b/assets/scss/_predefined.scss index 06543c4..11738eb 100644 --- a/assets/scss/_predefined.scss +++ b/assets/scss/_predefined.scss @@ -7,6 +7,18 @@ $highlight-grey: #7d828a; $midnightblue: #2c3e50; $typewriter: hsl(172, 100%, 36%); +// Scroll to Top Default colors + +$stt-stroke:#CCC; +$stt-circle:#3b3e48; +$stt-arrow:#018574; + +kbd { + font-size: 0.9em !important; + color: inherit; + background-color: $midnightblue; +} + // Fonts $fonts: "IBM Plex Sans Light", "Segoe UI", Candara, sans-serif; $code-fonts: "IBM Plex Mono", Consolas, "Andale Mono WT", "Andale Mono", Menlo, Monaco, monospace; diff --git a/config.toml b/config.toml index fb67d8c..23f020e 100644 --- a/config.toml +++ b/config.toml @@ -46,6 +46,9 @@ expiryDate = ["expiryDate"] # Categories are disabled by default. 
# category = "categories" +[markup.goldmark.renderer] + unsafe = true + # Enable to get proper Mathjax support #[markup] # [markup.goldmark] diff --git a/content/posts/2024-08-26-duckplyr-performance/benchmarking1.svg b/content/posts/2024-08-26-duckplyr-performance/benchmarking1.svg new file mode 100644 index 0000000..60851d1 --- /dev/null +++ b/content/posts/2024-08-26-duckplyr-performance/benchmarking1.svg @@ -0,0 +1,74 @@ +[SVG markup not recoverable. Text labels: title "Time elapsed with dplyr vs. duckplyr"; x-axis "Library" (dplyr, duckplyr); y-axis "Time (s)" (ticks 0, 10, 20, 30); legend "Laptop power mode" (balanced, performance)] diff --git a/content/posts/2024-08-26-duckplyr-performance/benchmarking2.svg b/content/posts/2024-08-26-duckplyr-performance/benchmarking2.svg new file mode 100644 index 0000000..1e12564 --- /dev/null +++ b/content/posts/2024-08-26-duckplyr-performance/benchmarking2.svg @@ -0,0 +1,87 @@ +[SVG markup not recoverable. Text labels: title "Time elapsed with dplyr vs. duckplyr"; x-axis "Library" (dplyr, duckplyr, duckplyr with `as_duckplyr_tibble`); y-axis "Time (s)" (ticks 0, 10, 20, 30, 40); legend "Laptop power mode" (balanced, performance)] diff --git a/content/posts/2024-08-26-duckplyr-performance/index.md b/content/posts/2024-08-26-duckplyr-performance/index.md new file mode 100644 index 0000000..ca9ad7f --- /dev/null +++ b/content/posts/2024-08-26-duckplyr-performance/index.md @@ -0,0 +1,127 @@ +--- +title: Performance experiments with duckplyr in R +description: I recently became aware of the 'duckplyr' library for R. Here are the results of my experimenting with it and benchmarking it against `dplyr`. 
+date: 2024-08-26T19:00:00+0200 +draft: false +# ShowLastmod: true +toc: false +scrolltotop: true +tags: + - R + - statistics +--- +I recently became aware of the [duckplyr][] library for R, which takes the place +of tidyverse's [dplyr][] library, but uses the [DuckDB] database under the hood. +Without really knowing anything about how dplyr works or whether the use of DuckDB +would improve my workflow at all, I decided to perform an experiment. I am +currently analyzing two datasets, one with ~80k records and ~70 variables and +one with ~60k records and ~100 variables. Both datasets are wrangled with +[Tidyverse][]-foo in multiple ways and finally combined. The wrangling of the +data involves things like `rowwise()` and `c_across()`, which I know from +experience are quite 'expensive' operations. + +In order to get the execution times of my code, I did this repeatedly: + +1. Restart R (by pressing CTRL SHIFT F10). +2. Run + + ```r + system.time(rmarkdown::render("my_file.Rmd")) + ``` + +3. Record the user time and the system time elapsed. +4. Repeat twice. + +I did this with both the "balanced power mode" and the "performance mode" on my +[laptop][]. During execution of the code, I left the laptop alone in order not +to interfere with the timing. + +This is the result of my benchmarking: + +{{< figure src="benchmarking1.svg" >}} + +The times are user times. I left out the system times, which are in the range of +2-3 seconds. + +Not really mind-boggling, right? It occurred to me that I had better double-check +that `duckplyr` was really being used. Indeed, this was _not_ the case: + +```r +> class(clinical_data) +[1] "tbl_df" "tbl" "data.frame" +``` + +`clinical_data` was missing the `duckplyr_df` class. How come? + +I import the raw data from Excel files (don't ask...) into tibbles, and +evidently, this prevents `duckplyr` from seeing the data frames. 
+So I piped the +data frames through `as_duckplyr_tibble()` explicitly, and this got me the right +classes: + +```r +> class(clinical_data) +[1] "duckplyr_df" "tbl_df" "tbl" "data.frame" +``` + +However, this did not really speed up the execution either. + +{{< figure src="benchmarking2.svg" >}} + +I looked around my RMarkdown chunks and their outputs, but I did not find any +warning that `duckplyr` had to fall back to `dplyr`'s methods. Such a fallback +would have explained the absence of a noticeable difference. + +Here are the means and standard deviations of the times (in seconds) for the +benchmarking runs. + +```r +> runs_table +# A tibble: 6 × 4 +# Groups: library, power_mode [6] + library power_mode mean sd + +1 dplyr balanced 31.8 0.722 +2 dplyr performance 32.6 0.477 +3 duckplyr balanced 31.4 1.10 +4 duckplyr performance 31.3 0.495 +5 duckplyr with `as_duckplyr_tibble` balanced 36.0 0.517 +6 duckplyr with `as_duckplyr_tibble` performance 33.6 0.303 +``` + +So at least for my (!!!) use case, the use of `duckplyr` instead of `dplyr` did +not make any practical difference, and I can also leave my laptop's performance +mode alone. When it comes to optimizing performance, you can't just buy a +solution off the shelf; you always have to try and find the best solution for +your specific problem. + +Your mileage will vary, of course. The people who develop `duckplyr` are +brilliant, and the fact that it does not work for me says more about me and my +work than it does about `duckplyr`. + +## The duckplyr demo dataset + +As a case in point, the [duckplyr demo repository][duckplyr-demo] contains a +taxi data set. The ZIP file alone is a ~1.7 GB download. Unzipped, the files +take up 2.4 GB. With about 21 million records (24 variables), this dataset +is _considerably_ larger than mine. 
+ +Here are the results from running `dplyr/run_all_queries.R` and +`duckplyr/run_all_queries.R` on my Thinkpad P14s (performance mode in F40 KDE): + +| Library | q01 | q02 | q03 | q04 | +|----------|------:|------:|------:|-------:| +| dplyr | 3.4 s | 3.9 s | 9.1 s | 14.3 s | +| duckplyr | 4.3 s | 4.4 s | 9.4 s | 14.8 s | + +I should add that execution times vary with each run, but the big picture stays +the same. + +Maybe I'm missing the point and it's not about execution times, after all. + +`¯\_(ツ)_/¯` + +[dplyr]: https://dplyr.tidyverse.org +[duckdb]: https://duckdb.org +[duckplyr]: https://duckplyr.tidyverse.org +[duckplyr-demo]: https://github.com/Tmonster/duckplyr_demo +[laptop]: {{< ref "/posts/2024-08-05-linux-on-thinkpad-P14s-Gen-5" >}} +[tidyverse]: https://tidyverse.org