u/seacess

50 GB worth of Excel files, how to load?

Hi,

I got a task where I receive hundreds of Excel files, each 700-800 MB in size. I cannot influence what I get, so I am stuck with these files.

Things tried so far, starting with 6 files (4.5 GB in total):

- Notebook (Python) - One file takes 30 min; with all 6 files it times out.

- Copy job - I get a message that the file is too big for it :(

- Dataflow - All 6 files take 24 min, so to prevent a timeout I will probably need to build a few of them and then orchestrate them in a pipeline.
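
To put rough numbers on the "build a few and orchestrate" idea: 4.5 GB in 24 min is about 0.19 GB/min, so if throughput scaled linearly (an assumption, not something measured), 50 GB would take roughly 267 min in total. Below is a minimal sketch of how the files might be split into batches that each fit under a per-run time limit. The function name `plan_batches`, the 30-minute limit, and the file names and sizes are all hypothetical placeholders, not values from the actual environment.

```python
from typing import Dict, List


def plan_batches(file_sizes_gb: Dict[str, float],
                 gb_per_minute: float,
                 max_minutes_per_run: float) -> List[List[str]]:
    """Greedily group files so each batch's estimated runtime fits the limit."""
    # Size budget per run, derived from the assumed throughput and time limit.
    budget_gb = gb_per_minute * max_minutes_per_run
    batches: List[List[str]] = []
    current: List[str] = []
    current_gb = 0.0
    for name, size in sorted(file_sizes_gb.items()):
        if size > budget_gb:
            raise ValueError(f"{name} alone exceeds the per-run budget")
        # Close the current batch if adding this file would exceed the budget.
        if current and current_gb + size > budget_gb:
            batches.append(current)
            current, current_gb = [], 0.0
        current.append(name)
        current_gb += size
    if current:
        batches.append(current)
    return batches


# Throughput observed with Dataflow in the post: 6 files (4.5 GB) in 24 minutes.
GB_PER_MINUTE = 4.5 / 24  # 0.1875 GB/min

# Hypothetical inputs: 12 files of 0.75 GB each, and an assumed 30-minute limit per run.
files = {f"file_{i:03d}.xlsx": 0.75 for i in range(12)}
batches = plan_batches(files, GB_PER_MINUTE, max_minutes_per_run=30)

for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {len(batch)} files")
```

Each resulting batch could then be handed to one dataflow run, with the pipeline looping over the batches. The real per-run limit and throughput would have to be measured rather than taken from this sketch.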

Any suggestions on how to deal with this monster? Anything I am missing here? For now I am trying to put them all in one table in a lakehouse for further dataflow processing.

u/seacess — 14 hours ago