Street view style navigation for real-estate
For quite some time, I have wanted to create a real-estate viewing experience that is both easy to capture on the input side and easy to use on the output side. I have been working on it on and off for a few years now, but it is only recently that the various pieces of the puzzle have fallen into place.
- Capture should be easy and quick ==> hand-held video with a fisheye lens in one take
- Handle both dim indoors and bright outdoors ==> can't lock exposure.
- The map (the streets and junctions of the "street-view") is automatically determined and restricted to where 360 views can be safely interpolated from training data.
- Sub-second loads. No one skimming through multiple RE properties in one sitting has time for > 10 second loads per property.
- Minimal requirements on viewing hardware.
IMHO the navigation modes inspired by the gaming world are going to be a hard sell for the casual user in the RE market.
Please let me know what you think about this mode of navigation, live demo here:
Currently tested on the desktop and on an iPhone 7 and 15 Pro with at least a fast 4G-level connection. Adaptive streaming is planned for the future.
Now onto some more details relevant to this sub:
The video was taken on an Osmo 360. However, I only use one of the lenses, as a test for future use of other cameras with a single fisheye lens (for example, a Panasonic GH5 with a 4mm fisheye). It also meant I didn't need to spend time masking myself out. Using just one lens did mean that I had to be careful when making sharp 180 turns, which happened twice in the above capture.
The whole two floors minus bedrooms took just under 9 minutes. For reliable 360 views, I had to repeat my trajectory through the house in opposite directions. Had I used a rig with a fisheye in front of me and one behind me, I could have done this in under 5 minutes! The only other option I know of that is equally time-efficient is the portal cam (no LIDAR in my case though). But you do have to plan ahead for the most time-efficient capture trajectory.
The exposure levels across the house vary by over 104 times, or just under 7 stops! No chance of locking exposure. But I do cap the exposure time at 1/250 s to keep all frames sharp. See my previous post from a few months ago on how I deal with this.
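As a quick sanity check on the exposure numbers, a stop is a factor of two, so a brightness ratio converts to stops with a base-2 logarithm. The function name below is made up for illustration:

```python
import math

def exposure_range_in_stops(ratio: float) -> float:
    """Convert a brightest/darkest exposure ratio into stops (factors of two)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return math.log2(ratio)

# The ratio quoted in the post: a bit over 104x between darkest and brightest areas.
stops = exposure_range_in_stops(104)
print(f"{stops:.2f} stops")  # lands between 6 and 7 stops

# For reference, a full 7 stops would be a ratio of 2**7 = 128x.
print(exposure_range_in_stops(128))
```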
I use SLAM instead of SFM (like COLMAP) since I am using videos. Although SFM can be run in sequential mode for videos, it lacks loop closure, which corrects for large-scale drift. Also, SLAM aims to be real-time, although you can trade off speed for quality as I have done here. Furthermore, SLAM chooses keyframes for you, and they naturally align with what is needed for training splats: neighbouring frames with the right tradeoff of parallax vs overlap. The video had over 12,600 frames at 3K x 3K resolution, which SLAM whittled down to just over 2,100 keyframes. In SFM, you have to do the keyframe selection by other means.
The fisheye lens from the Osmo 360 was calibrated during SLAM over its whole field of view, approximately 210 degrees. You have to give it an initial guess for the FOV.
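To illustrate what an initial FOV guess translates to, here is a minimal sketch under the assumption of an ideal equidistant fisheye (r = f * theta), which may not be the model the SLAM system actually uses. The 1500 px image-circle radius (half of the 3K frame) and both function names are assumptions for illustration only:

```python
import math

def initial_focal_guess(image_circle_radius_px: float, fov_degrees: float) -> float:
    """Equidistant model: r = f * theta, so f = r_max / theta_max (theta in radians)."""
    half_fov_rad = math.radians(fov_degrees) / 2.0
    return image_circle_radius_px / half_fov_rad

def project_equidistant(point_xyz, focal_px, cx, cy):
    """Project a 3D point in camera coordinates (z forward) to pixel coordinates."""
    x, y, z = point_xyz
    theta = math.atan2(math.hypot(x, y), z)  # angle from the optical axis
    phi = math.atan2(y, x)                   # direction around the axis
    r = focal_px * theta                     # radial distance from the image centre
    return cx + r * math.cos(phi), cy + r * math.sin(phi)

# Assumed: image circle fills a 3000 x 3000 frame, FOV guess of 210 degrees.
f = initial_focal_guess(1500.0, 210.0)

# A ray 105 degrees off-axis (behind the image plane) still projects, onto the circle edge.
edge_ray = (math.sin(math.radians(105)), 0.0, math.cos(math.radians(105)))
print(f, project_equidistant(edge_ray, f, 1500.0, 1500.0))
```

Rays beyond 90 degrees have negative z, which is why a pinhole-style projection cannot represent the outer part of a lens like this.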
The 2,100 keyframes were split into sets of at most 300 frames to train Gaussian splats. I chose to use ray tracing for training the splats: 3DGRT for now. Ray tracing approaches have no problems with extreme fisheye distortion. I have tried 3DGUT, but it has problems at the very edges of the fisheye because of the approximations it makes there. That is unfortunate, since 3DGUT is about 2-4 times faster to train.
Each training set is trained for only 3,000 iterations (no, I did not miss a zero there) :) For final delivery, I probably would train for more. But with good initialization (a whole other topic I might get into some other time) you don't really need many iterations for training splats!
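The post does not say how the sets are formed, so the following is only a sketch of one simple option: contiguous, non-overlapping, evenly sized chunks in trajectory order. The function name and the no-overlap choice are assumptions, not the actual pipeline:

```python
import math

def split_keyframes(num_keyframes: int, max_per_set: int = 300) -> list[range]:
    """Split keyframe indices into contiguous sets of at most max_per_set frames.

    Sizes are balanced so no set ends up much smaller than the others.
    """
    if num_keyframes <= 0:
        return []
    num_sets = math.ceil(num_keyframes / max_per_set)
    base, extra = divmod(num_keyframes, num_sets)
    sets = []
    start = 0
    for i in range(num_sets):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first sets
        sets.append(range(start, start + size))
        start += size
    return sets

# Exactly 2,100 keyframes fit into 7 sets of 300.
print([len(s) for s in split_keyframes(2100)])

# A count slightly over 2,100 (2,130 is a made-up example) needs an eighth set.
print([len(s) for s in split_keyframes(2130)])
```

In practice neighbouring sets would probably need some shared frames or shared geometry so the splats line up at the boundaries; that part is left out here.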
No culling of floaters despite wide exposure changes. No sharpening or post-processing of any kind.
Video has the potential to match or even surpass tripod-mounted photos in detail and crispness. Others have demonstrated this on a smaller scale, even on blurry input, if you are willing to spend more GPU time. They do this by assuming a simple physical model of blur and optimizing the poses as well during training. I don't know if Hyperscape does this, but regardless, think about the quality of the output from just video from relatively crappy sensors.