u/Medical-Common1034 — reddlx

Follow-up benchmark: where R data pipelines pay their cost: fread, readr, vroom, data.table and dplyr

I wrote a follow-up article after my previous post about benchmarking a Shiny dashboard pipeline for blog analytics.

The previous article was somewhat polarizing. Most feedback was positive, but some people disagreed with the benchmarking methodology. That criticism was useful, so I tried to make this new benchmark more transparent and reproducible.

The pipeline is still based on my real blog analytics use case:

read NGINX TSV logs
filter bot traffic
apply time-window filters
do ASN / Geo enrichment
compute read-time metrics

This time I benchmarked 6 pipeline configurations:

readr::read_tsv() + dplyr
vroom::vroom() + dplyr
fread() + dplyr
readr::read_tsv() + data.table
vroom::vroom() + data.table
fread() + data.table

The benchmark uses 20 generated log files, from 1× to 20× the original size, with increasing timestamp and IP cardinality instead of simply appending the same file repeatedly.

Methodology:

20 log files, from ~557k rows to ~11.1M rows
10 consecutive runs per file and per configuration (cold-start runs are distinguished from the other runs)
execution time measured per pipeline step with proc.time(), so elapsed, user, and system times are all recorded
memory tracked with gc() counters: current/max Ncells and Vcells (more on that in the article)
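
For readers who want to try the same kind of measurement, here is a minimal base-R sketch of per-step timing with proc.time() and gc() counters. It is not the code from the article: time_step() and run_once() are hypothetical helpers, the three steps run on synthetic data instead of real NGINX logs, and treating the first run in the process as the cold start is an assumption.

```r
# Hypothetical helper (not from the article): time one step and read gc() counters.
time_step <- function(step, expr) {
  gc(reset = TRUE)                  # reset the "max used" counters before the step
  t0 <- proc.time()
  expr                              # force the lazily evaluated step
  dt <- proc.time() - t0
  mem <- gc()                       # matrix with rows "Ncells" and "Vcells"
  data.frame(
    step = step,
    elapsed = dt[["elapsed"]],
    user = dt[["user.self"]],
    system = dt[["sys.self"]],
    ncells_max = mem["Ncells", "max used"],
    vcells_max = mem["Vcells", "max used"]
  )
}

run_once <- function(run_id, n = 1e5) {
  # Synthetic stand-ins for the real steps (read, filter, aggregate).
  s1 <- time_step("read", logs <- data.frame(ip = sample(1000L, n, replace = TRUE),
                                             bytes = runif(n)))
  s2 <- time_step("filter", logs <- logs[logs$bytes > 0.5, ])
  s3 <- time_step("aggregate", totals <- tapply(logs$bytes, logs$ip, sum))
  out <- rbind(s1, s2, s3)
  out$run <- run_id
  out$cold <- run_id == 1L          # assumption: the first run is the cold start
  out
}

# 10 consecutive runs in the same R process.
results <- do.call(rbind, lapply(1:10, run_once))

# Median elapsed time per step, with the cold start kept apart from warm runs.
summary_tbl <- aggregate(elapsed ~ step + cold, data = results, FUN = median)
print(summary_tbl)
```

Because the runs share one R process, the "max used" counters only describe what happened since the last gc(reset = TRUE), not a true per-operation allocation count.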

A few things I found interesting:

fread + data.table is still the most predictable path when the pipeline eventually works on fully materialized data.
readr + dplyr is closer than I expected in several parts of the pipeline.
data.table becomes more clearly advantageous in grouped / index-like filtering steps.
vroom + dplyr is very interesting: it looks extremely cheap at ingestion because columns stay lazy / ALTREP-backed.

But vroom does not make parsing free; it moves the cost to later operations that force materialization.

Nevertheless, in this specific pipeline, that can still be a big memory win, with no dramatic elapsed-time penalty, because the data frame is filtered before all columns are fully materialized, so later materialization happens on a smaller subset than in the other configurations.

gc()-based memory results are useful, but they are not perfect per-operation allocation measurements because the R process is reused across reloads.

Article:

https://julienlargetpiet.tech/articles/where-r-data-pipelines-pay-their-cost-data-table-dplyr-fread-readr-and-vroom.html

I would be curious to hear how people here would improve the methodology further.

(PS: Thanks for the comments! It’s pretty late here, so I’m going to sleep now. I’ll come back and answer questions in around 8 hours.)


u/Medical-Common1034 — 12 days ago

▲ 35 r/dataengineer+1 crossposts

I benchmarked dplyr vs data.table on my Shiny log dashboard

I wrote a small article after rewriting part of my Shiny dashboard for my blog analytics.

The app reads an NGINX TSV log file, filters bot traffic, does some ASN / Geo enrichment, then computes a few metrics and plots.

The benchmark is on a real log file:

725,832 rows
124 MB TSV
median of 9 runs per step
peak RSS measured with /usr/bin/time -v

A few things I found interesting:

fread() was the best ingestion path in this case
fread + dplyr was surprisingly close to fread + data.table for the first cleaning step
data.table became much better in the later grouped / index-based filtering steps
vroom was not a great fit here because the pipeline ends up touching most columns anyway
precomputing masks like keep <- condition; df <- df[keep] was often slightly faster
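
To make the mask point concrete, here is a small base-R sketch of the two forms. The column names and data are made up for illustration. It uses a plain data.frame, where the row subset is written df[keep, ]. The df[keep] form above is data.table syntax (on a plain data.frame, df[keep] would select columns instead).

```r
set.seed(1)
n <- 1e5

# Synthetic stand-in for parsed log rows (hypothetical columns).
df <- data.frame(
  status = sample(c(200L, 404L, 500L), n, replace = TRUE),
  agent  = sample(c("bot", "browser"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Inline form: the condition is written directly inside the subset call.
inline <- df[df$status == 200L & df$agent != "bot", ]

# Precomputed mask: build the logical vector once, then subset with it.
keep <- df$status == 200L & df$agent != "bot"
masked <- df[keep, ]

# Both forms select exactly the same rows.
stopifnot(identical(inline, masked))

# The mask can be reused without re-evaluating the condition.
n_kept <- sum(keep)
```

Whether the precomputed form is actually faster depends on the data and the backend, so it is worth timing both on your own pipeline.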

In the end, data.table seems to give deeper control over the execution path, which makes it easier to avoid unnecessary copies and use index-based filtering more efficiently.

Article:

https://julienlargetpiet.tech/articles/data-table-vs-dplyr-in-a-data-pipeline.html

Curious if people here would structure this pipeline differently, especially the data.table parts.


u/Medical-Common1034 — 29 days ago

▲ 46 r/haskell

I benchmarked Cartesian product implementations in Haskell, then compared them with C

I wrote a small article around implementing Cartesian products, starting from Haskell’s sequence.

The article goes through a naive Haskell implementation, a more idiomatic list-comprehension version, native sequence, then versions using Data.Vector.Unboxed, mutable vectors, runST, and unsafeFreeze to try a different memory representation.

The second half compares those designs with C implementations, mostly to look at what changes when the memory layout and allocation model are made explicit.

The most interesting result for me was that changing representation in Haskell reduced allocations a lot without automatically improving runtime. In some cases, fusion helped a bit (no temporary indices).

I’d be happy to get feedback on the Haskell side, especially the vector/ST implementations and whether there are more idiomatic or faster ways to express this.

Here is the article link:

https://julienlargetpiet.tech/articles/the-cartesian-product-disaster-tour-haskell-c-and-25gb-of-allocations.html

If you want to share any optimizations, you can open a PR on this repo:

https://github.com/julienlargetpiet/PerfLabs

It will be mentioned in the next article update.

PS: We have now found what is currently the best version, written in C, which runs in 160 ms for 100k iterations on 5^5 lists:

https://github.com/julienlargetpiet/PerfLabs/blob/main/CartesianProduct/contrib2/contrib2.c

I'm updating after trying some new Haskell implementations. Thanks, everyone, for the help and the intuition on where to dig!


u/Medical-Common1034 — 2 months ago