u/HitTheSonicWall — reddlx

100% Databricks newbie here, but pretty seasoned nerd.

I've been tasked with downloading a rather large dataset from Databricks. It's 15 files of various sizes, but the larger ones (300GB, 1.2TB and 2.7TB) are giving me trouble.

I started with the Databricks CLI, which worked fine at first, but the download consistently died after an hour or so. I then noted that the first line of the README says "This project is in Public Preview." Great.

I then moved to Firefox under Linux, where I was able to start the downloads. They seem to die after exactly 16.1GB, after which I am able to resume them, and they start where they left off. Yay, I only have to click resume 2.7TB/16.1GB=167 times to get my file.

Trouble is, after a while my session expires, and I can no longer resume the downloads.
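For reference, the click-resume routine is basically what an HTTP Range request does, so in principle it can be scripted. Below is a rough sketch in plain Python (standard library only), not a tested solution. It assumes the file is reachable at a direct URL whose server honors `Range` requests and answers `206 Partial Content`, which the browser resume behavior suggests but does not prove. The URL and any auth header passed in are placeholders, not known Databricks endpoints.

```python
import http.client
import os
import urllib.error
import urllib.request

BLOCK = 8 * 1024 * 1024  # bytes read per call


def range_header(offset):
    """Open-ended Range value: everything from byte `offset` onward."""
    return f"bytes={offset}-"


def total_from_content_range(value):
    """Parse the total size out of e.g. 'bytes 100-199/1000'; None if unknown."""
    total = value.rsplit("/", 1)[-1].strip()
    return int(total) if total.isdigit() else None


def download(url, dest, headers=None, max_attempts=500):
    """Keep appending to `dest`, resuming from its current size after each drop.

    `url` and `headers` are hypothetical inputs; supply whatever the server needs.
    """
    for _ in range(max_attempts):
        offset = os.path.getsize(dest) if os.path.exists(dest) else 0
        req = urllib.request.Request(url, headers=dict(headers or {}))
        req.add_header("Range", range_header(offset))
        total = None
        try:
            with urllib.request.urlopen(req, timeout=60) as resp:
                if resp.status != 206:
                    # Assumption violated: appending a full body would corrupt the file.
                    raise RuntimeError(f"server ignored Range (status {resp.status})")
                total = total_from_content_range(resp.headers.get("Content-Range", ""))
                with open(dest, "ab") as out:
                    while True:
                        block = resp.read(BLOCK)
                        if not block:
                            break
                        out.write(block)
        except urllib.error.HTTPError as err:
            if err.code == 416:  # offset is at or past the end of the file: done
                return offset
            raise  # e.g. 401/403 once credentials expire
        except (http.client.IncompleteRead, urllib.error.URLError,
                ConnectionError, TimeoutError):
            continue  # connection dropped; the next pass resumes from the new size
        if total is not None and os.path.getsize(dest) >= total:
            return total
    raise RuntimeError("gave up after too many attempts")
```

This only automates the clicking. It does nothing about the expiring session: once the credentials stop working, the server's 401/403 is re-raised and the partial file stays on disk, so the download can pick up again from the same offset with fresh credentials.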

I'm also getting pretty shit speeds (100Mbit/s or so combined) on a 1Gbit business fiber connection, but if I could at least get something stable, I'd be happy.
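To put the speed in perspective, here is a quick back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes, 1 Mbit = 10^6 bits) and a perfectly steady transfer with no overhead, neither of which will hold exactly in practice.

```python
def transfer_hours(size_tb, mbit_per_s):
    """Hours needed to move `size_tb` terabytes at a steady `mbit_per_s`."""
    bits = size_tb * 1e12 * 8          # terabytes -> bits (decimal units assumed)
    seconds = bits / (mbit_per_s * 1e6)
    return seconds / 3600


# The 2.7TB file at the observed ~100Mbit/s versus the full 1Gbit line rate.
print(f"{transfer_hours(2.7, 100):.1f} h at 100 Mbit/s")
print(f"{transfer_hours(2.7, 1000):.1f} h at 1 Gbit/s")
```

So the largest file alone is on the order of 60 hours at the current rate versus roughly 6 hours at line rate, which is why stability matters more than raw speed here.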

It should probably be mentioned that I'm on the freebie tier of Databricks.

Edit: People have asked for background as to why I'm doing this, which is a 100% legitimate question. A company in our line of work has released this very large dataset into the public domain. They picked Databricks, I didn't. We wish to download this dataset to our on-prem systems so we can process it using our fairly niche and highly resource-intensive algorithms. It's not really an option to run things on Databricks, for a number of good reasons.