u/xahyms10

Hi everyone,

I’m currently working on ingesting historical data from an API into Databricks, and I’d like to get some opinions on the best approach.

The API data volume is quite inconsistent by date. Some days have no records at all, some days only have around 100 records, some have 50k records, and the highest I’ve seen so far is more than 2 million records in a single day.

My current approach is:
- 1 day = 1 ingestion window
- Run ingestion for 1 month of historical data at a time

This works fine for most dates, but when a single day has more than 1 million records, the job fails with an out-of-memory (OOM) error.

One idea I’m considering is to check the record count for each day first. If a day has more than 1 million records, I would split that day into smaller hourly windows instead of ingesting the whole day at once.
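To make the idea concrete, here is a rough Python sketch of the window planning only. It assumes the API offers some cheap way to get a record count for a time range; `count_records` is a hypothetical callable standing in for that, and `plan_windows` and the 1,000,000 threshold are illustrative names and values, not anything from Databricks or the API itself. Skipping empty days is also just a choice made here.

```python
from datetime import date, datetime, timedelta
from typing import Callable, Iterator, Tuple

# A window is a half-open interval [start, end).
Window = Tuple[datetime, datetime]


def plan_windows(
    first_day: date,
    last_day: date,
    count_records: Callable[[datetime, datetime], int],  # hypothetical count lookup
    threshold: int = 1_000_000,  # assumed cutoff, taken from where the OOM shows up
) -> Iterator[Window]:
    """Yield one window per day, or 24 hourly windows for days above the threshold."""
    day = first_day
    while day <= last_day:
        start = datetime(day.year, day.month, day.day)
        end = start + timedelta(days=1)
        n = count_records(start, end)
        if n == 0:
            pass  # nothing to ingest for this day
        elif n > threshold:
            # Heavy day: split into hourly windows.
            for hour in range(24):
                hour_start = start + timedelta(hours=hour)
                yield hour_start, hour_start + timedelta(hours=1)
        else:
            yield start, end
        day += timedelta(days=1)
```

Each yielded window would then be passed to the existing ingestion step. One open point with this sketch: if the records of a heavy day are concentrated in a few hours, a single hourly window could still be too large, so the split might need to go finer than one hour.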

For those who have handled similar API ingestion scenarios in Databricks, how do you usually deal with this kind of volume spike?

Would you recommend dynamic windowing like this, or is there a better pattern for handling unstable historical data volumes from APIs?

Also curious if there are any best practices around avoiding OOM in this kind of API-to-Delta ingestion pipeline.

How do you handle API ingestion when historical data volume varies a lot and causes OOM?