r/computervision
Ultralytics Just Added Semantic Segmentation Models & They Look INSANE
Just tested the new Ultralytics Semantic Segmentation models on video inference and honestly the results are super clean 👀
The new -sem models include:
• yolo26n-sem.pt
• yolo26s-sem.pt
• yolo26m-sem.pt
• yolo26l-sem.pt
• yolo26x-sem.pt
Big upgrades:
✅ Pixel-level scene understanding
✅ Semantic masks directly in inference outputs
✅ Cityscapes + ADE20K support
✅ PNG mask datasets supported
✅ Mosaic, MixUp, CutMix & perspective transforms now support semantic masks
✅ Real-time video inference performance 🚀
This feels like a huge step for:
🚗 Autonomous Driving
🤖 Robotics
📹 Smart Surveillance
🏙️ Smart City Applications
⚡ Edge AI
I tested it on video and shared the demo here:
https://youtu.be/swnAMHKZU20
Curious to know:
Do you think semantic segmentation will become the next major focus after object detection?
Shoplifting detection system
I want to use my old DVR and cameras to detect shoplifting in my store. What is the current state of the art on this, is it possible? Can I train YOLO to detect suspicious movements made by clients? Sorry if it's a basic question, I'm just starting.
Conferences for first solo author paper?
I have been building something for a while, and after one idea after another, I have finally come up with a genuinely novel algorithm for training models that works very well. As it should, because it's grounded in physics (if I explained the ideas behind my model, you'd agree that it should work better). These are the kind of ideas that are obvious but hidden in plain sight, or that people have thought about but no one has tried so far.
I have already filed a provisional patent application on it, and I am now looking to publish it.
I have published in other AI domains but never in CVPR or the like. And it's just my own work, completely solo. I am not a professor, nor do I have a PhD.
I'm now looking to get it published at a conference, but I also feel that going it alone might be tough because I'm not affiliated with any research lab or university. I know how to write papers, what kind of results are expected and so on, but I also know a lot of editors just send desk rejections to anyone without an affiliation. Sad but true. It depends on the scientific community and the editors.
What should I do? Target a second-tier conference or even a workshop first? I believe there is enough merit in the paper that it deserves better.
Running SAM3 on NVIDIA Jetson Nano
Real-time edge AI vision just got better.
We’ve released Embedl SAM3 for TensorRT, a fully reproducible, end-to-end deployment of facebook/sam3 on NVIDIA GPUs (Jetson AGX Orin, Orin Nano), with INT8 post-training quantization. It is built with Embedl Deploy, which bridges the gap between PyTorch and the hardware constraints of edge devices: https://huggingface.co/embedl/sam3
One script (https://docs.embedl.com/embedl-deploy/latest/auto_tutorials/sam3.html) requires only a single Python package whose only dependency is PyTorch. It takes you from a Hugging Face checkpoint to a running TensorRT engine, covering export, fusions, quantization, and compilation.
Use a smaller image size to get started faster.
The performance:
NVIDIA Jetson AGX Orin (image size → latency):
224×224 → 40.4ms / 24.7 FPS (real-time)
448×448 → 118.5ms INT8, 10% faster than FP16
672×672 → 187.6ms INT8, 27% faster than FP16
NVIDIA Jetson Orin Nano (image size → latency):
224×224 → 89.6ms / 11.2 FPS
448×448 → 262.6ms INT8, 20% faster than FP16
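For readers comparing these rows, latency and throughput are two views of the same number when frames are processed one at a time. A minimal sketch, assuming single-frame (batch size 1) inference with no pipelining; the helper name is hypothetical and not part of Embedl Deploy:

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Frames per second when each frame takes latency_ms and frames run back to back."""
    return 1000.0 / latency_ms


# Latencies quoted in the post, in milliseconds.
reported = {
    "AGX Orin 224x224": 40.4,
    "AGX Orin 448x448": 118.5,
    "AGX Orin 672x672": 187.6,
    "Orin Nano 224x224": 89.6,
    "Orin Nano 448x448": 262.6,
}

for name, latency_ms in reported.items():
    print(f"{name}: {latency_ms} ms -> {fps_from_latency_ms(latency_ms):.1f} FPS")
```

By this conversion the 448×448 rows work out to roughly 8.4 FPS on the AGX Orin and 3.8 FPS on the Orin Nano. The 224×224 AGX Orin row comes out at 24.75 FPS, which the post lists as 24.7.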
The speed-up isn’t the headline. Getting the model running reliably is. SAM3’s ViT backbone, window attention, RoPE embeddings, and FPN neck create real deployment issues: memory, quantization sensitivity, poor accuracy, export and compilation breaking down. Embedl Deploy handles all of it: hardware-aware, accuracy-preserving, out of the box. And PyTorch is the only dependency: no graph surgery, no ONNX simplification scripts, no extra calibration tooling to wrangle. PTQ and QAT in one unified workflow with only PyTorch and TensorRT.
This is not just for Jetson or NVIDIA GPUs. We are building Embedl Deploy for any edge hardware. Whatever device you’re deploying to, we solve the same problem: take your model from PyTorch to production without months of debugging.
Any comments are welcome. The same workflow applies to any Torchvision model, and more complicated models such as DinoV3 which we will release soon.
Other edge-friendly models can be found in https://huggingface.co/embedl
Interest in AI visual inspection for Aviation MRO (Maintenance, Repair and Overhaul)
Hi guys, I am trying to open a business offering automatic visual inspection services for MRO (Maintenance, Repair and Overhaul), using AI detection models like YOLO and computer vision. This is my site: www.AiVisualMRO.com
I see very little interest from businesses in using AI to detect defects such as corrosion, dents and scratches, or even for part detection and inspection and AI-automated report generation. I tried an ad on LinkedIn, but basically only word of mouth works.
QUESTION to the people who already use computer vision in a commercial environment: do you find it hard to advertise your services? How do you find your clients?
Street view style navigation for real-estate
For quite some time, I have wanted to create a real-estate viewing experience that is both easy to capture on the input side and easy to use on the output side. I have been working on it on and off for a few years now, but only recently have the various pieces of the puzzle fallen into place.
- Capture should be easy and quick ==> hand-held video with a fisheye lens in one take
- Handle both dim indoors and bright outdoors ==> can't lock exposure.
- The map (the streets and junctions of the "street-view") is automatically determined and restricted to where 360 views can be safely interpolated from training data.
- Sub-second loads. No one skimming through multiple RE properties in one sitting has time for > 10 second loads per property
- Minimal requirements on viewing hardware.
IMHO the navigation modes inspired by the gaming world are going to be a hard sell for the casual user in the RE market.
Please let me know what you think about this mode of navigation, live demo here:
Currently tested on desktop and on iPhone 7 and 15 Pro with at least fast 4G-level speeds. Adaptive streaming is planned for the future.
Now onto some more details relevant to this sub:
The video was taken on an Osmo 360. However, I only use one of the lenses, as a test for future use of other cameras with a single fisheye lens (for example, a Panasonic GH5 with a 4mm fisheye). Also, I didn't need to spend time masking myself out. Using just one lens did mean that I had to be careful when making sharp 180 turns, which happened twice in the above capture.
The whole two floors minus bedrooms took just under 9 minutes. For reliable 360 views, I had to repeat my trajectory through the house in opposite directions. Had I used a rig with a fisheye in front of me and one behind me, I could have done this in under 5 minutes! I only know of the portal cam that is equally time-efficient (no LIDAR in my case though). But you do have to plan ahead for the most time-efficient capture trajectory.
The exposure levels across the house vary by a factor of over 104, or just under 7 stops! No chance of locking exposure. But I do set a max limit on the shutter speed (1/250) to keep all frames sharp. See my previous post from a few months ago on how I deal with this.
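As a quick check of the stops figure: one stop is a doubling of exposure, so the number of stops is the base-2 logarithm of the exposure ratio. A small sketch (the function name is illustrative):

```python
import math


def stops_from_ratio(exposure_ratio: float) -> float:
    """Number of photographic stops spanned by an exposure ratio (1 stop = 2x)."""
    return math.log2(exposure_ratio)


print(f"{stops_from_ratio(104):.2f} stops")  # a ratio of 128 would be exactly 7 stops
```

A ratio of 104 therefore lands between 6 and 7 stops, consistent with "just under 7".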
I use SLAM instead of SFM (like COLMAP) since I am using videos. Although SFM can be run in sequential mode for videos, it lacks loop closure which corrects for large scale drift. Also SLAM aims to be real-time although you can trade-off speed for quality as I have done here. Furthermore, SLAM chooses keyframes for you that align with what is needed for training splats naturally - neighbouring frames with the right tradeoff of parallax vs overlap. The video had over 12,600 frames at 3K x 3K resolution that was whittled down to over 2,100 key-frames by SLAM. In SFM, you have to do the keyframe selection by other means.
The fisheye lens from the Osmo 360 was calibrated while doing SLAM over the whole approximately 210 degree view. You have to give it an initial guess for the FOV.
The 2100 keyframes were split into sets of at-most 300 frames to train gaussian splats. I chose to use ray-tracing for training the splats: 3DGRT for now. Ray tracing approaches have no problems with extreme fisheye distortion. I have tried 3DGUT, it has problems at the very edges of the fisheye coz of the approximations it makes at the edges which is unfortunate, since 3DGUT is about 2-4 times faster to train.
Each training set is trained for only 3,000 iterations (no I did not miss a zero there :) For final delivery, I probably would train for more. But with good initialization (whole other topic I might get into some other time) you don't really need many iterations for training splats!
No culling of floaters despite wide exposure changes. No sharpening or post-processing of any kind.
Video has the potential to be as detailed and crisp as tripod-mounted photos, and even to surpass them. Others have demonstrated this on a smaller scale, even on blurry input, if you are willing to spend more GPU time. They do this by assuming a simple physical model of blurring and optimizing the poses too while training. I don't know if Hyperscape does this, but regardless, think about the quality of the output with just video from relatively crappy sensors.
Did SAM3 change the Image Annotation game completely?
Recently auto-annotation has been commoditised: thanks to advances in foundation models like SAM3 and the Dino family, and also VLMs like Gemini 3.0 Flash, T Rex + Models from IDEA Research, it has become much easier to generate bounding boxes and use them to train domain-specific models. Review and QA of AI-generated annotations then becomes the bottleneck, as no model is 100% accurate in whatever it sees.
I annotated hundreds of images manually a couple of years ago, and it feels much easier than before to use AI to annotate, but the ChatGPT moment still seems really far away.
The importance of the following question will be felt by everyone in this sub and everyone who trains specialised models professionally or for hobby.
Just as LLMs leave huge scope for fine-tuning and pre-training specialised models for specific use cases, do vision models still have similar scope, where people will keep training object detection models for their use cases? Or will there be a time when some AI lab launches a model efficient enough to detect anything without any pretraining or finetuning?
Consider this an open discussion: suggest techniques or simply act on your insecurities of gradually becoming obsolete (hehe).
Free hosting for computer vision experiments
I am looking for a free platform to host a FastAPI app for heavy computer vision experiments, not production,
preferably with simple deployment for inference testing and minimal setup.
Any alternatives to platforms like Hugging Face Spaces, since its resources are not dedicated, would be appreciated.
I custom trained a pipeline of Computer Vision models to rate dicks (ratemydick.ai), and it works!
After scaling a startup in India to profitability, with a revenue of ~$12m USD ($56m factoring parity) and a valuation of $60m+, I'm launching something most people would never fund. Long story short, I thought of some "easy money" products, but then I started to do research, spoke to urologists, friends and randoms, and realized there are actually real problems to solve in this domain.
The first problem is that loneliness has become an epidemic: people have fewer friends and close people to confide in or ask questions. Even if they did, asking about some topics is so anxiety-inducing that they might just never ask, or worse, they ask the internet, which is full of trolls, scams and unverified data sources.
The second major issue, identified after speaking to doctors, is that people might not even know they have a medical concern, and by the time it becomes a serious impediment, the issue has worsened. So, if something is identified early while they're using the "fun" side of this tool, a recommendation to seek medical advice can be made.
What's launching today is the "fun" part of this tool. I spent 2.5 months (full-time) training, calibrating and ensuring >95% accuracy on zone identification, masking and result analysis. Over the course of time, I'll continue training on various aspects for a more robust report output and implementing user feedback.
help needed for finding datasets
I’m working on a student (beginner) project focused on vehicle speed estimation using YOLO + tracking (likely ByteTrack/OpenCV). I initially looked into BrnoCompSpeed, but the dataset is extremely large (~200GB+) and difficult for me to handle with limited storage and internet. I mainly need datasets on which I can run my code and check whether it gives correct answers.
Building a video stabilization pipeline for car inspection footage - hitting a wall
Looking for advice. I am building a video stabilization pipeline for a car inspection company. Technicians record short videos of car components (engine bay, undercarriage, door frames, trunk) using handheld smartphones.
The goal is to stabilize the raw footage to make damage detection easier and faster.
Recording environment
Engine bay: bright, overexposed in sunlight, lots of texture
Undercarriage: dim, technician on a creeper, vertical bounce and hand shake
Door frames: close up, mostly steady but with drift and tilt
What I have tried:
Approach 1: LK optical flow + RANSAC affine + adaptive Gaussian smoothing
1- Shi-Tomasi corner detection + pyramidal Lucas-Kanade optical flow
2- RANSAC-filtered estimateAffinePartial2D (4-DOF: translation + rotation + uniform scale)
3- Per-frame adaptive Gaussian sigma based on local shakiness in a 30-frame sliding window
4- OpenCV warpAffine (bicubic, BORDER_REFLECT_101) + FFmpeg H.264 encode
The sigma scales with local shake amplitude: shaky sections get high sigma (strong smoothing), stable sections get low sigma (light touch).
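The adaptive smoothing step can be sketched in plain Python on a single motion component. This is a minimal illustration, not the poster's code: the shakiness measure (standard deviation of per-frame motion in the window), the linear mapping onto `[sigma_min, sigma_max]`, and all function names are assumptions.

```python
import math
from itertools import accumulate
from statistics import pstdev


def local_shakiness(deltas, window=30):
    """Std. dev. of per-frame motion in a sliding window of roughly `window` frames."""
    half = window // 2
    shake = []
    for i in range(len(deltas)):
        chunk = deltas[max(0, i - half): i + half + 1]
        shake.append(pstdev(chunk) if len(chunk) > 1 else 0.0)
    return shake


def adaptive_sigmas(shake, sigma_min=1.0, sigma_max=10.0):
    """Assumed mapping: scale shakiness linearly onto [sigma_min, sigma_max]."""
    top = max(shake) or 1.0  # avoid dividing by zero on a perfectly still clip
    return [sigma_min + (sigma_max - sigma_min) * s / top for s in shake]


def smooth_trajectory(traj, sigmas):
    """Gaussian-weighted average around each frame, using that frame's own sigma."""
    n = len(traj)
    smoothed = []
    for i, sigma in enumerate(sigmas):
        radius = max(1, int(3 * sigma))
        idx = range(max(0, i - radius), min(n, i + radius + 1))
        weights = [math.exp(-0.5 * ((j - i) / sigma) ** 2) for j in idx]
        smoothed.append(sum(w * traj[j] for w, j in zip(weights, idx)) / sum(weights))
    return smoothed


# One motion component (e.g. per-frame dx in pixels): a still section, then tremor.
dx = [0.0] * 30 + [2.0 if i % 2 == 0 else -2.0 for i in range(30)]
trajectory = list(accumulate(dx))  # cumulative camera path
sigmas = adaptive_sigmas(local_shakiness(dx))
smoothed = smooth_trajectory(trajectory, sigmas)
corrections = [s - t for s, t in zip(smoothed, trajectory)]  # per-frame shift to apply
```

In a full pipeline the same smoothing would be applied to each component of the estimated transform (translation, rotation, scale), and the corrections would feed the warp step.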
The results were disappointing. Technicians noticed the stabilization was attempted but described the output as barely stable, you can tell something was done but the video still feels shaky and hard to read. Out of 12 test clips across different car zones, only about 2 looked genuinely stable.
Approach 2 - Inspired adaptive pipeline
After hitting the ceiling with Approach 1, I reverse engineered how production grade stabilizers handle this problem and identified four improvements to implement:
Phase 1 - Short-clip sigma cap
Cap the Gaussian smoothing window proportionally to clip length so it never spans more than ~10% of the video. Formula: max_sigma = min(10.0, n_frames / 30.0). This fixed over-smoothing on very short clips where sigma=10 was averaging across 28% of the entire video.
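The cap formula is easy to sanity-check in isolation. A small sketch; the function name is illustrative, while the constants 10.0 and 30.0 come straight from the formula above:

```python
def max_sigma(n_frames: int) -> float:
    """Upper bound on the Gaussian sigma, shrinking for short clips."""
    return min(10.0, n_frames / 30.0)


for n_frames in (60, 150, 300, 900):
    print(n_frames, "frames -> sigma cap", max_sigma(n_frames))
```

Clips of 300 frames or more keep the full sigma of 10, while anything shorter gets a proportionally smaller cap.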
Phase 2 - Laplacian blur gating in trajectory estimation
Detect blurry frames via Laplacian variance before running feature tracking. Skip them entirely and interpolate their transforms from neighboring sharp frames instead of zero-padding. Zero-padding creates staircase jumps in the cumulative trajectory; interpolation bridges smoothly.
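A rough sketch of both halves of this phase, in plain Python so it runs without OpenCV. With OpenCV, the blur score is commonly computed as `cv2.Laplacian(gray, cv2.CV_64F).var()`; the pure-Python version below uses a 4-neighbour Laplacian on a 2D list instead. Function names are illustrative, and any blur threshold would have to be tuned on real footage.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels of a 2D list."""
    h, w = len(gray), len(gray[0])
    lap = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    mean = sum(lap) / len(lap)
    return sum((v - mean) ** 2 for v in lap) / len(lap)


def interpolate_skipped(values, is_blurry):
    """Replace entries flagged blurry by linear interpolation between sharp neighbours."""
    sharp = [i for i, blurry in enumerate(is_blurry) if not blurry]
    if not sharp:
        raise ValueError("no sharp frames to interpolate from")
    out = list(values)
    for i, blurry in enumerate(is_blurry):
        if not blurry:
            continue
        left = max((s for s in sharp if s < i), default=None)
        right = min((s for s in sharp if s > i), default=None)
        if left is None:        # blurry run at the start: hold the first sharp value
            out[i] = values[right]
        elif right is None:     # blurry run at the end: hold the last sharp value
            out[i] = values[left]
        else:
            t = (i - left) / (right - left)
            out[i] = (1 - t) * values[left] + t * values[right]
    return out


# Per-frame dx where frames 1 and 2 were too blurry to track (stored as 0.0).
dx = [1.0, 0.0, 0.0, 4.0]
filled = interpolate_skipped(dx, [False, True, True, False])
print(filled)
```

Here the two skipped frames are bridged with values between 1.0 and 4.0 rather than left at zero, which is what avoids the staircase in the cumulative trajectory.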
Phase 3 - Blur-aware jitter validation
The quality metric was measuring HF variance using all frames including blurry ones. Blurry frames produce garbage optical flow that inflates the output variance artificially, making good outputs look like failures. Fix: determine blurry frame positions from the input video and apply the same skip mask to both input and output measurements.
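The shared-mask idea can be illustrated with a deliberately simple jitter proxy. In this sketch, plain variance of per-frame motion stands in for the HF variance metric (an assumption, as the exact metric is not given), and the function names are hypothetical:

```python
from statistics import pvariance


def masked_jitter(motion, skip_mask):
    """Variance of per-frame motion, ignoring frames flagged in skip_mask."""
    kept = [m for m, skip in zip(motion, skip_mask) if not skip]
    return pvariance(kept) if len(kept) > 1 else 0.0


def jitter_reduction(input_motion, output_motion, skip_mask):
    """Fraction of jitter removed, measured on the same frames before and after."""
    before = masked_jitter(input_motion, skip_mask)
    after = masked_jitter(output_motion, skip_mask)
    return 1.0 - after / before if before > 0 else 0.0


input_motion = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
# Frame 2 was blurry in the input, so its measured output motion is garbage.
output_motion = [0.1, -0.1, 9.0, -0.1, 0.1, -0.1]
blurry = [False, False, True, False, False, False]

print("with mask:   ", jitter_reduction(input_motion, output_motion, blurry))
print("without mask:", jitter_reduction(input_motion, output_motion, [False] * 6))
```

With the mask the stabilized clip scores as a large improvement; without it, the single bad measurement makes the output look worse than the input.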
Phase 4 - L1-optimal trajectory smoothing
Replace the per-frame Gaussian with a global LP solver across the entire clip (described in Approach 2 above).
The results after testing all four phases were still disappointing.
After trying dozens of approaches, these two got me the furthest.
I have run out of ideas on how to push stability further on this type of footage with a CPU-only constraint.
If anyone has tackled similar problems (handheld inspection footage, mixed intentional panning and tremor, high blur rates) I would genuinely appreciate any direction.
Resume worthy cv projects
Pls suggest some resume worthy cv projects.🙏🏻
AI Edit QGIS plugin Update: automatic segmentation feature to convert land cover rasters into vector polygons !
I dropped the AI Edit plugin a month ago. At the beginning, it was only for image generation, but users really just wanted a vectorization tool. It works great now, and I'm happier (:
If anyone has ideas for getting THE BEST polygons, I'm all ears.
Class occupancy analytics - what actually worked for you?
We had a mid-size East Coast gym chain trying to compare app bookings vs actual class attendance. Main reason: they didn’t want to pay trainers for “full” classes that were full only in the app 🙂
We tried counters, analytics, occupancy tricks… absolute circus. Staff walking in/out, people booking and ghosting, trainers arguing attendance numbers.
Turns out counting objects vaguely resembling human bodies (mops included) is the easy part. Counting unique participants is harder.
Curious if anyone here found a solution that actually worked in production - not just marketing fairy tales from video analytics vendors?
How to Prepare for Computer Vision Roles (Phd/Big Companies)
Hi! I am currently pursuing my master's in machine learning. I have explored computer vision in terms of reconstruction, depth estimation and deep learning. Now I want to build up my skills and my CV so that I can get into Google/Microsoft/Ivy League universities. What should I focus on? What is asked in interviews?
What to expect : Junior CV Engineer
I have a CV Engineer interview coming up at a small gantry sorting startup. I took a CV class, made a chess project using OpenCV, and trained AlexNet/ResNet in PyTorch, plus some other ML stuff (in my ML class).
The JD asks for knowledge of OpenCV, dd model understanding, detection and segmentation models, PyTorch (and other generic CS stuff). They are expecting 1 year of experience but I have none.
I took a gap year, so I am a bit rusty overall and barely remember anything specific to CV. What should I absolutely focus on for the interview? Oh, and I have 2 days.
Built a real-time facial recognition + emotion tracking system, looking for feedback
Hey everyone, I’ve been working on a computer vision project focused on real-time facial recognition and tracking.
Current features:
- Live webcam face detection
- Face identity recognition/database
- Emotion analysis
- Head/face tracking
- Profile cards/UI
- Real-time dashboard system
Right now I’m mainly focused on improving:
- tracking accuracy
- performance/latency
- UI polish
- scalability of the face database
I’m interested in robotics/security applications long term, so this is kind of my “entry point” project into that space.
Would love honest feedback on:
- the architecture
- code organization
- feature ideas
- performance optimization
- what you’d improve next
GitHub:
https://github.com/k-scurf/Auty/tree/main
Demo:
https://vimeo.com/1193621679?share=copy&fl=sv&fe=ci
Thanks — still learning and trying to improve fast.
How to get rejected by IEEE T-PAMI with 'Excellent' scores?[D]
Hello everyone. I am keeping my identity anonymous today to protect my professional career. I am a researcher in Computer Vision, and I am sharing this story because I have hit a devastating deadlock with IEEE T-PAMI and the IEEE Ethics Office.
Our Situation
In the decision letter, there were three highly positive reviews (Two EXCELLENT, One GOOD). However, the AE (who is one of T-PAMI associate EICs) rejected the paper by quoting comments from a "4th" reviewer.
>The most staggering part: We later accidentally met the actual 4th reviewer. He CONFIRMED having submitted a POSITIVE review, which was strangely withdrawn by the editor in the backend before the final decision was made.
The AE lied by saying: "... received 3 sets of comments, and one on the way ... ".
We have formally requested the IEEE (and Computer Society) to thoroughly investigate this issue, specifically asking them to check AE's backend activity logs in the submission system.
However, half a year has passed, and we have received no direct response.
We could have simply moved on and submitted elsewhere. But because this Associate EIC has such wide influence, we realized that staying silent means enabling them. If we don't expose this, they will continue to exploit the system and do this to us and other peers.
Has anyone experienced something similar with IEEE or other top venues? Any advice or help bringing visibility to this would be greatly appreciated.
Evidence:
Below is the report to IEEE Ethics (identifying information has been covered):