r/apachespark

Struggling to learn the Spark UI on Databricks; all tutorials are outdated. Any good resources?

Hey everyone, I'm fairly new to Spark and trying to understand how it actually executes jobs, specifically the DAG visualization, stages, task metrics, and executor stats in the Spark UI.

The problem I'm running into: almost every video tutorial I find was recorded on an older version of Databricks, and the UI looks completely different from what I see today. The gap is big enough that I can't follow along at all.

A few specific issues I've hit:

- `spark.databricks.io.cache.enabled` throws a CONFIG_NOT_AVAILABLE error on newer runtimes

- `spark.catalog.clearCache()` throws a NOT_SUPPORTED error because I'm on Serverless compute (Community Edition)

- The Spark UI itself looks different from what tutorials show

I'm using Databricks Community Edition (free tier), which I've now learned only provides Serverless compute, so some things just aren't available.

My questions:

  1. Is there a good up-to-date resource (video, blog, or docs) for understanding the Spark UI on the current Databricks version?

  2. For learning Spark internals (DAG, stages, task metrics), is it better to just use local Spark or Google Colab instead of the Databricks free tier?

  3. Any tips for following older Spark UI tutorials and mentally mapping them to the current UI?
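
For reference on question 2, here is a minimal sketch of what a local setup could look like, assuming `pyspark` is installed (`pip install pyspark`) and a Java runtime is available. The names `local_ui_conf` and `run_demo` are made up for illustration; the config keys are standard Spark settings. It runs one small aggregation with a shuffle, so the job should show two stages in the UI, and it turns off adaptive query execution so the plan is not rewritten at runtime.

```python
def local_ui_conf(cores=2, shuffle_partitions=4):
    """Settings that keep a local job small and its plan easy to read in the UI."""
    return {
        "spark.master": f"local[{cores}]",        # run everything in one local JVM
        "spark.ui.port": "4040",                  # Spark's default UI port, set explicitly
        "spark.sql.shuffle.partitions": str(shuffle_partitions),  # few tasks per shuffle stage
        "spark.sql.adaptive.enabled": "false",    # keep the plan static while learning
    }


def run_demo():
    # Imported lazily so the helper above can be inspected without pyspark installed.
    from pyspark.sql import SparkSession, functions as F

    builder = SparkSession.builder.appName("spark-ui-demo")
    for key, value in local_ui_conf().items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    print("Spark UI:", spark.sparkContext.uiWebUrl)

    df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
    counts = df.groupBy("bucket").count()  # groupBy needs a shuffle, i.e. a stage boundary
    counts.explain()                       # physical plan, to compare with the SQL tab
    counts.collect()                       # the action that actually triggers the job

    input("Press Enter to stop Spark and close the UI...")
    spark.stop()
```

Calling `run_demo()` from a Python shell keeps the session alive until Enter is pressed, which leaves time to click through the Jobs, Stages, SQL, and Executors tabs at the printed URL.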

Thanks in advance!

reddit.com
u/FlatTackle918 — 1 day ago

I'm good at Apache Spark, but I want to dive deeper. Which content do you recommend, and where can I try it for free? My laptop is not powerful enough.

reddit.com
u/dataengineer95 — 3 days ago

Wick: Type-Safe Spark API

I've played a bit with Wick, the new type-safe Spark API from Netflix. I've only tried the basics, but if you're a beginner interested in how it works, check out my latest article.

matejcerny.cz
u/matej_cerny — 5 days ago

Learning (Py)Spark the easy way

Hi guys, I'm starting a job as a Junior Data Engineer soon and I will be using a lot of PySpark, yet I have no experience with it. I want to grasp the basics and start my journey into the engine architecture and optimization, but I'm kind of lazy, so I'm looking for the easy way. I do have experience with Python and SQL, as I have worked as a SWE and DevOps Engineer before.

I was wondering if there are any good courses I can just go through that will teach me the basic commands and concepts, ideally something low-effort that I can put an hour into every now and then.

Also, I'm looking for a book that goes deeper into architecture and optimization so I can start to gain some deeper knowledge. I have read books like 'Designing Data-Intensive Applications' and am looking for something similar, where it mostly explains separate concepts so I can stop reading for a week without being lost when I start again.

YouTube channel recommendations with content I can tune out to while still learning just a little bit would also be appreciated. Or anything else for lazy engineers like me.

Thanks in advance!

reddit.com
u/Salt_Macaron_6582 — 8 days ago

G1GC garbage collector

Has anyone run their Spark jobs with the G1 garbage collector (G1GC)?

I got that recommendation from an automated performance scan tool a vendor sent us for testing.
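
For context, this kind of recommendation usually comes down to passing the JVM flag `-XX:+UseG1GC` through Spark's `extraJavaOptions` settings. Below is a minimal sketch, assuming the job is launched with `spark-submit`; `g1gc_submit_args` and `my_job.py` are hypothetical names. On JDK 9 and later, G1 is typically already the default collector, so the flag may change nothing there.

```python
import shlex


def g1gc_submit_args(app_path, extra_flags=()):
    """Build a spark-submit command line that enables G1 on driver and executors."""
    java_opts = " ".join(["-XX:+UseG1GC", *extra_flags])
    return [
        "spark-submit",
        # Executor JVMs start after submission, so a --conf entry reaches them.
        "--conf", f"spark.executor.extraJavaOptions={java_opts}",
        # In client mode the driver JVM is already running when the application's
        # SparkConf is read, so the driver flag goes on the command line instead.
        "--driver-java-options", java_opts,
        app_path,
    ]


args = g1gc_submit_args("my_job.py")
print(shlex.join(args))
```

Extra G1 tuning flags can be passed via `extra_flags`, for example `["-XX:InitiatingHeapOccupancyPercent=35"]`. Comparing the "GC Time" column in the Spark UI's Executors tab before and after is one way to check whether the change helps a given job.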

Thanks

reddit.com
u/oalfonso — 9 days ago

How are you guys handling Iceberg table maintenance in production?

We've been running Iceberg on Spark for a while, and the maintenance side keeps surprising me with how much glue code we end up writing: compaction schedules, snapshot expiration, orphan file cleanup, manifest rewrites, monitoring for when small-file counts blow up, etc. Can someone share how you handle maintenance in your organisation?

P.S. I'm asking this in several subreddits to gather more info.
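
To make the "glue code" concrete, here is a rough sketch of the kind of wrapper involved, built on Iceberg's Spark SQL procedures (`rewrite_data_files`, `expire_snapshots`, `remove_orphan_files`, `rewrite_manifests`). It assumes a Spark session with an Iceberg catalog and the Iceberg SQL extensions enabled. The catalog name `my_catalog`, the table `db.events`, the function names, and the retention windows are placeholders for illustration, not recommendations.

```python
from datetime import datetime, timedelta


def maintenance_statements(catalog, table, now,
                           snapshot_days=7, orphan_days=3, retain_last=10):
    """Return the CALL statements for one maintenance pass over a single table."""
    def cutoff(days):
        # Rendered as a Spark SQL timestamp literal
        return f"TIMESTAMP '{now - timedelta(days=days):%Y-%m-%d %H:%M:%S}'"

    return [
        # 1. Compaction: bin-pack small data files into larger ones.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # 2. Snapshot expiration: drop old snapshots but always keep the newest few.
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', "
        f"older_than => {cutoff(snapshot_days)}, retain_last => {retain_last})",
        # 3. Orphan cleanup: delete files no longer referenced by table metadata.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}', "
        f"older_than => {cutoff(orphan_days)})",
        # 4. Manifest rewrite: regroup manifests to speed up scan planning.
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
    ]


def run_maintenance(spark, catalog, table):
    """Execute the pass on an existing SparkSession (passed in by the caller)."""
    for stmt in maintenance_statements(catalog, table, datetime.now()):
        spark.sql(stmt).show(truncate=False)


for stmt in maintenance_statements("my_catalog", "db.events", datetime(2024, 1, 10, 12, 0)):
    print(stmt)
```

Keeping statement generation separate from execution makes the schedule easy to unit-test and to drive from whatever orchestrator is already in place. Small-file monitoring could be handled similarly by querying the table's `files` metadata table and counting rows below a `file_size_in_bytes` threshold.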

reddit.com
u/Remarkable-Ant-2473 — 9 days ago