Question on Datalake Behaviour Reading Many Small Files versus Fewer Larger Files
Has anyone checked, or does anyone know, how the number of read transactions on a datalake differs when querying a table made up of many small files versus fewer larger files?
For example, if I have two tables containing identical data, does that mean:
- The table with the data spread across 100 files will have 100 read operations in the datalake each time the data is queried?
- The table with the data spread across 1,000 files will have 1,000 read operations each time the data is queried?
- Alternatively, does the number of files not matter, so that the number of read operations in the datalake is the same regardless?
I know pruning, skipping, use of optimize, etc. will come into play depending on the context, but I'm hoping for clarity on the above in a simple scenario where all of the data has to be read to execute the query.
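
To make the scenario concrete, here is a rough sketch of the two possibilities I'm trying to tell apart. It is only a back-of-envelope model, not a description of how any particular storage service actually counts reads: the function names, the 10 GiB table size, and the 4 MiB request size are all made-up assumptions, and files are assumed to be evenly sized.

```python
def reads_if_per_file(num_files: int) -> int:
    """Hypothesis A: one read operation per file, whatever its size."""
    return num_files


def reads_if_per_chunk(total_bytes: int, num_files: int, chunk_bytes: int) -> int:
    """Hypothesis B: each file is fetched in fixed-size requests,
    with at least one request per file."""
    file_bytes = total_bytes // num_files  # assumes evenly sized files
    requests_per_file = max(1, (file_bytes + chunk_bytes - 1) // chunk_bytes)
    return num_files * requests_per_file


TOTAL_BYTES = 10 * 1024**3  # hypothetical 10 GiB table
CHUNK_BYTES = 4 * 1024**2   # hypothetical 4 MiB per read request

for num_files in (100, 1000):
    per_file = reads_if_per_file(num_files)
    per_chunk = reads_if_per_chunk(TOTAL_BYTES, num_files, CHUNK_BYTES)
    print(f"{num_files} files: per-file={per_file}, per-chunk={per_chunk}")
```

Under hypothesis A the read count scales directly with the file count. Under hypothesis B it is driven mostly by the total data volume, so the two tables end up much closer together. What I'd like to know is which of these (if either) is closer to what really happens.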