Codebase choice: SparkR or PySpark for Fabric
Hi all,
First time poster! Great sub 👍
TL;DR: PySpark or R (SparkR is deprecated in the latest Spark version, but is sparklyr still relevant?) for a team with a base R focus?
I work in an analytics group (public sector), and we are currently moving off an on-prem setup that was pretty much custom-built for our purposes a few years ago. It involves lots of R scripts that run SQL via a scheduler against various database systems and prep datasets for BI and reporting. It works, but it's brittle, full of nuances, etc.
We now have a modern data lake setup, supported by a good data engineering team. We need to move our process into Fabric.
One choice I'm struggling with is which language to use, since a lot of our workflows will run in Spark notebooks. My gut says PySpark, since it's what the data engineering team use (so code review and support would be a bit closer), but most of the team only have experience in R (and not a very current style at that, mostly base R) and would likely struggle with a move to Python-based workflows.
However, with the deprecation of SparkR and the general second-tier status of R in Spark, I am a bit concerned about committing to R.
Thoughts?