u/Administrative_Bar46

I have a use case where I use PyMuPDF4LLM and layout to process PDF documents in batches. One document takes about 4 minutes to process, and I need to handle 2,000+ documents per day.
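For a sense of scale, here is a rough back-of-the-envelope estimate of the cores this needs. It assumes each document keeps exactly one core busy for the full 4 minutes and that the work scales perfectly across cores, neither of which is guaranteed. The function name `cores_needed` and the 8-hour window are hypothetical, chosen only for illustration.

```python
import math


def cores_needed(docs_per_day, minutes_per_doc, window_hours=24.0):
    """Minimum cores to finish the daily load inside the given window.

    Assumes one core per document and perfect parallel scaling.
    """
    total_core_minutes = docs_per_day * minutes_per_doc
    return math.ceil(total_core_minutes / (window_hours * 60))


# 2,000 docs x 4 min = 8,000 core-minutes (about 133 core-hours) per day.
print(cores_needed(2000, 4))      # spread over a full 24 hours -> 6
print(cores_needed(2000, 4, 8))   # squeezed into an 8-hour window -> 17
```

So under these assumptions the daily volume fits on one reasonably large node, before allowing for retries, I/O waits, or uneven document sizes.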
I tried two approaches:

1. Using a multiprocessing library on a single-node job cluster with a lot of CPU cores
2. Using Spark RDDs
Fundamentally, I don’t think this is a strong Spark use case. There is a lot of I/O (15% of task time) against a remote S3 path. When I tried Spark, I ran into several issues, so I’m currently focusing on the multiprocessing approach instead.
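A minimal sketch of what the multiprocessing approach could look like, using only the standard library. This is an illustration, not the actual job: `convert_pdf`, `safe_convert`, and `run_batch` are hypothetical names, and `convert_pdf` is a placeholder standing in for the real S3 download plus PyMuPDF4LLM conversion.

```python
import os
from concurrent.futures import ProcessPoolExecutor, as_completed


def convert_pdf(path):
    """Placeholder for the real per-document work (hypothetical).

    The real job would fetch the PDF from S3 and run the conversion;
    this stub only returns a dummy string.
    """
    return f"markdown for {path}"


def safe_convert(path, convert=convert_pdf):
    """Convert one document and capture any exception instead of raising,
    so a single bad PDF does not abort the whole batch."""
    try:
        return path, convert(path), None
    except Exception as exc:  # broad on purpose: PDFs fail in many ways
        return path, None, repr(exc)


def run_batch(paths, max_workers=None):
    """Fan the documents out over worker processes, one document per task.

    Returns a list of (path, result, error) tuples in completion order.
    """
    max_workers = max_workers or os.cpu_count()
    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(safe_convert, p) for p in paths]
        for fut in as_completed(futures):
            results.append(fut.result())
    return results
```

Submitting one document per task keeps the workers evenly loaded when document sizes vary, and returning errors as data makes it easy to collect the failed paths and retry only those.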
One approach I have not tried is the AI parser, but I don’t think it handles PDF collections and comments natively.

Do you think this is the right direction to go? If you’ve worked on something similar, could you share any recommendations, or technical artifacts/patterns that might help?

Any and all insight is helpful. This is my first time using multiprocessing in Databricks.

Scaling a CPU-bound process in Databricks