u/Aggressive_Cash_7436

Question on Datalake Behaviour Reading Many Small Files versus Fewer Larger Files

Has anyone ever checked or know the difference in read transactions on a datalake when querying a table with lots of smaller files versus fewer larger files?

For example, if I have two tables with identical data contained does that mean:

  1. The table with the data spread across 100 files will have 100 read operations in the datalake each time the data is queried?
  2. The table with the data spread across 1,000 files will have 1,000 read operations each time the data is queried?
  3. Alternatively does the number of files not matter and the number of read operations in the datalake is the same regardless.

I know there will be pruning, skipping, use optimize etc. depending on the context but I'm hoping for clarity on the above in a simple scenario where all data has to be read to execute the query.

reddit.com
u/Aggressive_Cash_7436 — 12 days ago

How have you been able to keep Serverless costs under control?

Since enabling serverless we continue to see users selecting this option in notebooks and then running some queries that take hours to run and leading to high costs.

We have all purpose clusters and personal clusters but despite constantly raising these options with users we continue to see an ongoing issue where serverless is selected and then code is left to run for hours.

Unfortunately the majority of users and queries are fine but all it takes is a handful of queries that run longer than an hour to cause massive costs.

reddit.com
u/Aggressive_Cash_7436 — 19 days ago

Hi All.

Earlier this year a pipeline was turned on using Spark Declarative Pipelines in continuous mode. Immediately we noticed an explosion in storage write costs in the datalake.

We lowered these costs massively by making some configuration changes and I hope our learnings help someone in the future that has the same problem.

Two very important settings to configure are pipelines.trigger.interval and spark.sql.shuffle.partitions.

The fundamental issue is that there are changelog files that get updated with every refresh that a table has within the pipeline.

The partitions are important because it means if you have the default partitions set (default is 200) that there will be 200 changelog files and 200 changelog.crc files that get updated for every refresh for each table.

For every refresh there are also four write operations in the datalake to each of these changelog files: flush, append, create and rename.

This means that for every refresh there will be 1,600 write operations. (200 partitions x 4 write operations x 2 changelog types).

If you do not set a trigger interval this will be set at 5 seconds by default but we were seeing even less at times. This means if you do not make these configuration changes there will be ~27 million write operations per table per day if you have a pipeline that has frequent amounts of data being ingested throughout the day.

We have since updated these settings and although the partitions help just changing the pipeline interval to 60 seconds will see large savings if that sort of latency is acceptable. If you want very low latency then leave the interval as is but please look into partitions as a minimum.

Not only have our storage write costs decreased but we have also noticed significantly less compute power required.

Please keep in mind this was for an ingestion a few million records a day so the refreshes were frequent which may not always be the case depending on your data volumes. Either way I would highly recommend you look at your pipelines and storage write costs specifically to ensure the same is not happening to you.

____________

This is an update following this earlier post https://www.reddit.com/r/databricks/comments/1slgxmb/do_you_set_pipelinestriggerinterval_on_spark/8

reddit.com
u/Aggressive_Cash_7436 — 26 days ago