u/Fast_Dealer_6462

Hi all,

First time poster! Great sub 👍

TLDR: PySpark or R (SparkR is deprecated in the latest Spark version, but is sparklyr still relevant?) for a team with a base R focus?

I work in an Analytics group (public sector), and we are currently moving away from an on-prem setup that was pretty much custom-made for our purpose a few years ago. It involves lots of R scripts running SQL via a scheduler on various database systems and prepping datasets for BI and reporting. It works, but it's brittle, nuanced, etc.

We now have a modern data lake setup, supported by a good data engineering team. We need to move our process into Fabric.

One choice I'm struggling with is which language to use, since a lot of our workflows will run in Spark notebooks. My gut says PySpark, since it's what the data engineering team use (so code review and support would be a bit closer), but most of the team only have experience in R (and not a very current style at that, mostly base R) and would likely struggle with a move to Python-based workflows.

However, with the deprecation of SparkR and the general second-tier status of R in Spark, I am a bit concerned about going the R route.

Thoughts?

Codebase choice: SparkR or PySpark for Fabric