r/computervision

What if I told you we trained this PCB defect detector in plain English? (Open-sourced)

We used RailCompute to connect Codex and automate the full workflow: data prep, training, and model eval with no human in the loop except typing instructions in English.

The aim was to test whether a basic ML workflow could be driven by natural language rather than manually writing the training pipeline or setting up any infrastructure.

The GitHub repo with the detector code and trained model is in the comments.

u/Due-Guard221 — 6 hours ago

▲ 32 r/computervision+3 crossposts

RF-DETR CPP: TensorRT inference library for RF-DETR with GPU mask decode, CUDA Graph, and zero Python at runtime

Built a C++ inference library for RF-DETR, Roboflow's transformer-based object detection model. The motivation was simple: every RF-DETR deployment I came across was running Python and PyTorch at inference time, which is fine for experimentation but not great when you need consistent low latency in production.

The library runs the full pipeline in C++: a fused CUDA kernel for preprocessing, async H2D transfers via pinned memory, and CUDA Graph capture for low-overhead inference dispatch. On an RTX 5070 Ti at FP16 it hits around 2ms per frame for detection and 6ms for instance segmentation.

This release covers both object detection and instance segmentation. Segmentation masks are decoded on the GPU through a custom kernel that upsamples and thresholds all detections in parallel. 
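
As a rough CPU illustration of what "upsample and threshold" means for a single detection, here is a pure-Python sketch. It is not the library's CUDA kernel: the bilinear interpolation with half-pixel centers, the 0.0 logit threshold (equivalent to a sigmoid score of 0.5), and the name `decode_mask` are all assumptions made for the example.

```python
def decode_mask(logits, out_h, out_w, threshold=0.0):
    """Bilinearly upsample a low-res logit grid, then threshold it to a binary mask."""
    in_h, in_w = len(logits), len(logits[0])
    mask = []
    for y in range(out_h):
        # Map the output pixel center back into input coordinates (half-pixel convention).
        sy = min(max((y + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        row = []
        for x in range(out_w):
            sx = min(max((x + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = logits[y0][x0] * (1 - wx) + logits[y0][x1] * wx
            bottom = logits[y1][x0] * (1 - wx) + logits[y1][x1] * wx
            value = top * (1 - wy) + bottom * wy
            row.append(1 if value > threshold else 0)
        mask.append(row)
    return mask


# Hypothetical 2x2 mask logits for one detection, decoded to a 4x4 mask.
low_res = [[4.0, -4.0],
           [-4.0, -4.0]]
mask = decode_mask(low_res, 4, 4)
```

On a GPU each output pixel of each detection is independent, which is what makes this step easy to run in parallel.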

Happy to get any feedback or ideas.

u/fapa64 — 7 hours ago

▲ 49 r/computervision

3D viewer for gaze estimates from chess stream videos

I built "chess-gaze": feed it a chess stream clip; it writes per-frame records for face/eye/head-pose + gaze estimates, then builds a local 3D viewer.

Uses UniGaze, not pupil-center heuristics.

Code, demo.

u/legotin — 9 hours ago

▲ 0 r/computervision

Is 35 too late to enter CV?

I have been a SWE for 8 years, and I'm now planning a career pivot into ADAS or a similar CV field that combines hardware, AI, and software, through my master's degree.

I'm trying to be realistic about what awaits me after graduation (a pay cut, starting from zero, etc.). I'd also like to know how to pivot strategically into the best area of CV, and what to focus on while looking for an internship or a part-time job during my studies. Thanks! I'm in Germany and will graduate at 35.

u/Delicious_Crazy513 — 9 hours ago

▲ 5 r/computervision

[YOLO] Tracker ID keeps resetting when a vehicle passes under an overpass; I tried ByteTrack, StrongSORT, DeepOCSORT

Hi all, I am working on a dashcam-based rash-driver detection project. The pipeline is YOLOv8s → DeepOCSORT (with OSNet ReID); basically, it's a trajectory-based risk classification.

The problem: there's an overpass in my video. A vehicle I'm tracking as ID:3 passes under it, and the moment it comes out the other side it gets assigned a new ID (ID:17). Detection never actually drops; YOLO keeps the bounding box throughout. The tracker just decides it's a different vehicle.

The vehicle goes from bright daylight into the dark shadow under the overpass, then back into daylight, so the appearance embedding looks completely different on either side even though it's literally the same car.

Has anyone dealt with this? Is there an illumination-invariant ReID model that handles this better? Or is this just a fundamental limitation of appearance-based trackers on dashcam footage?
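
To make the failure mode concrete, the sketch below shows how a match decided purely on appearance can break when the embedding shifts, while a cost that also weighs box overlap keeps the pair together. This is not how DeepOCSORT, StrongSORT, or ByteTrack are implemented; the `fused_cost` helper, the weights, the 0.5 match threshold, and the toy embeddings are all illustrative assumptions.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def fused_cost(track_box, det_box, track_emb, det_emb, appearance_weight):
    """Blend overlap cost and appearance cost; lower means a better match."""
    motion_cost = 1.0 - iou(track_box, det_box)
    appearance_cost = cosine_distance(track_emb, det_emb)
    return (1 - appearance_weight) * motion_cost + appearance_weight * appearance_cost


MATCH_THRESHOLD = 0.5  # hypothetical: costs above this are rejected as "different vehicle"

track_box = (100, 100, 200, 180)  # last box of the existing track
det_box = (104, 101, 204, 181)    # next-frame detection, nearly the same place
daylight_emb = [0.9, 0.1, 0.4]    # toy embedding stored while in daylight
shadow_emb = [0.1, 0.9, 0.3]      # toy embedding of the same car in shadow

appearance_only = fused_cost(track_box, det_box, daylight_emb, shadow_emb, 1.0)
motion_heavy = fused_cost(track_box, det_box, daylight_emb, shadow_emb, 0.2)
print(appearance_only > MATCH_THRESHOLD, motion_heavy > MATCH_THRESHOLD)
```

Since the detector never loses the box, the frame-to-frame overlap stays high through the shadow, so down-weighting appearance during the transition is one direction worth testing alongside a more illumination-robust ReID model.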

u/Accomplished-Car9987 — 12 hours ago

▲ 11 r/computervision+1 crossposts

A small library for multi-dimensional image similarity. Looking for feedback.

For a downstream task, I had to extract unique frames only from videos. The tricky part was that "duplicate" covered two different things in the same video.
- back-to-back frames where nothing visibly moved, and
- frames showing the same scene a few seconds apart.

Perceptual hashing measures pixel-level similarity and embedding models measure content similarity, so neither alone matched what I meant by unique. I had to run both and look at the scores together.
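
A toy sketch of why the two scores have to be read together. It assumes the frames are already tiny grayscale grids, uses a plain average hash, and takes the semantic scores as given numbers (in practice they would come from an embedding model). The function names and thresholds are hypothetical and say nothing about how imageprism scores things internally.

```python
def average_hash(gray):
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hash_similarity(h1, h2):
    """Fraction of matching bits (1.0 means identical hashes)."""
    same = sum(1 for a, b in zip(h1, h2) if a == b)
    return same / len(h1)


def classify_pair(hash_sim, semantic_sim, hash_thresh=0.9, semantic_thresh=0.8):
    """Combine both scores into one of the two 'duplicate' kinds, or 'unique'."""
    if hash_sim >= hash_thresh:
        return "near-identical frame"
    if semantic_sim >= semantic_thresh:
        return "same scene"
    return "unique"


frame_a = [[10, 200], [220, 30]]
frame_b = [[12, 198], [215, 35]]   # back-to-back frame, almost the same pixels
frame_c = [[200, 10], [30, 220]]   # same scene seconds later, pixels moved

sim_ab = hash_similarity(average_hash(frame_a), average_hash(frame_b))
sim_ac = hash_similarity(average_hash(frame_a), average_hash(frame_c))

# The semantic scores 0.95 and 0.85 are made-up stand-ins for embedding similarity.
print(classify_pair(sim_ab, 0.95))
print(classify_pair(sim_ac, 0.85))
```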

Doing that with separate libraries meant separate preprocessing, separate score scales, and glue code to combine them. The glue was the reusable part, so I turned it into a library. You can pick the kinds of similarity that matter for your case (pixels, scene, object, face, style) and get a score for each in one call:

```python
from imageprism import ImagePrism

prism = ImagePrism(dimensions=["hash", "semantic"])
prism.compare("a.jpg", "b.jpg").scores # {"hash": 0.12, "semantic": 0.82}
```

It is CPU only, no PyTorch, no API keys. It's at 0.1.0 and still rough in places. For a single kind of similarity, the specialized libraries are the better choice: imagehash for hashing, CLIP directly for semantic search, insightface for faces. imageprism combines several of them behind one interface, so the value is the integration, not the models.

I don't know whether this is a common problem or just something I ran into once. If you've dealt with image similarity before, I'd appreciate hearing where this falls short. That feedback will tell me whether it's worth developing further. Please drop a star if you think this is useful.

GitHub: https://github.com/nebulaanish/imageprism

PyPI: https://pypi.org/project/imageprism/

u/NebulaAnish — 14 hours ago

▲ 13 r/computervision

Hello, how do you identify a good project?

Hello, I want to understand how people choose what type of projects to take on. Is it just from limitations noted in research papers? I know you have to try to solve a problem, but I just can't find a problem to solve. As a solo developer I don't have much compute; in this case, how do you choose ML projects?

u/Unhappy-Recipe6808 — 21 hours ago

▲ 0 r/computervision

What is missing from current CV dataset and annotation workflows?

I’m working on Daqa, a waitlist-stage workspace for teams preparing AI training datasets, and I’m trying to sanity-check the computer vision side with people who actually build image/video datasets.

The workflow I’m looking at is everything around annotation: sourcing or uploading data, profiling quality issues, cleaning/deduping, generating missing cases, labeling/reviewing, tracking provenance/license evidence, validating the dataset, and exporting in formats like COCO, YOLO, or image manifests.

I’d really value feedback on four things:

- What feature would you most want to see in a tool for this workflow?
- Does the pricing on https://daqa.ai/ make sense for CV dataset prep?
- What would you need to see before joining a waitlist or trying it?
- What tools do you use today for this use case, such as CVAT, Roboflow, Label Studio, FiftyOne, scripts/notebooks, etc., and what do they still lack?

I’m especially trying to understand whether the pain is annotation itself, or the surrounding workflow: source tracking, review, dataset versioning, validation, and clean export.

u/falaq-ai — 1 day ago

▲ 23 r/computervision

Is computer vision a good speciality to choose?

I've been a SWE for 8 years, and I've now gone back to do my master's and specialize in something. Would AI automate computer vision in the long run?

u/Delicious_Crazy513 — 1 day ago

▲ 2 r/computervision

Fave outdoor cameras for CV?

Anyone have good suggestions for outdoor (preferably PTZ) cams, like Ubiquiti or similar?

Looking to run some live object tracking on them.

u/beedunc — 24 hours ago

▲ 6 r/computervision+1 crossposts

VS Code extension for inspecting images

https://reddit.com/link/1uobh4t/video/u2liw3rnrgbh1/player

If you've ever added a temporary cv2.imshow() or plt.imshow() call just to see what's in a variable while debugging, this might save you some time.

What it does

CV Variable Preview hooks into the VS Code debugger so you can inspect Python variables as images without leaving — or modifying — your debug session:

- Right-click any numpy/torch/PIL/TF variable in the Variables or Watch panel → image opens in a side panel instantly
- Hover over a variable name in source → inline thumbnail, shape, dtype, min/max
- Zoom up to 16×, per-pixel value readout, per-channel histogram (32 bins)
- Pin multiple images to compare them side by side — useful for checking augmentation pipelines, comparing activations before and after a layer, etc.
- Live mode: panel refreshes automatically on each F10/F11 step

Supported types

numpy.ndarray, PIL.Image (all modes including palette), torch.Tensor (CPU/CUDA, with or without grad), TensorFlow eager tensors, pandas.DataFrame/Series, lists/batches of arrays (renders as a grid, capped at 64 items).

How it works under the hood

The Python conversion runs entirely inside the active debug frame via a DAP evaluate request — no subprocess, no sidecar process, no imports added to your script. The TS side just reads the result and renders it in a webview.
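
For readers unfamiliar with the Debug Adapter Protocol, this is roughly what an `evaluate` request looks like on the wire: a `Content-Length` header followed by a JSON body. The sketch builds one in Python purely for illustration; the expression `img.shape`, the sequence number, and the frame id are placeholders, not what the extension actually sends.

```python
import json


def build_evaluate_request(seq, expression, frame_id, context="repl"):
    """Serialize a DAP 'evaluate' request using the protocol's header + JSON framing."""
    body = {
        "seq": seq,
        "type": "request",
        "command": "evaluate",
        "arguments": {
            "expression": expression,  # evaluated inside the paused frame
            "frameId": frame_id,       # which stack frame to evaluate in
            "context": context,
        },
    }
    payload = json.dumps(body).encode("utf-8")
    header = f"Content-Length: {len(payload)}\r\n\r\n".encode("ascii")
    return header + payload


# Placeholder values; `img` stands for some image variable in the debugged script.
message = build_evaluate_request(seq=7, expression="img.shape", frame_id=1000)
print(message.decode("utf-8"))
```

Because the expression runs in the debuggee's own frame, it can see local variables directly, which is what lets this kind of tool avoid a sidecar process.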

GitHub: https://github.com/ariharasudhanm/cv-variable-preview

u/deep_vision_pirate — 1 day ago

▲ 10 r/computervision

I built an app to collect and annotate samples in place for domain-specific needs

Some time ago I started diving into ML with Andrew Ng's Coursera course, was quite interesting.

I decided to train a small handwriting model for Georgian (my native language) - current models don't recognize it well - just for fun.

So I started writing letters on paper, taking photos on my iPhone, copying them to my laptop, annotating there, and passing them to the training script. It was very tedious.

I looked for an app that could handle it on the phone - take a picture, annotate in place, export to whatever format. There were a few tools but none fit what I needed. So I thought, why not build the simple app myself.

It looked simple, and it was, nothing complex. But then I was on a train, no internet connection on parts of the route, and it hit me: ah, it should be offline-first. I had to redesign the whole approach - offline-first, sync, conflict resolution. Built that myself too, was an interesting challenge. I now use it for training Georgian handwriting with Kraken, which is an interesting part on its own and something I enjoy.

Then I thought, why not generalize it. Basic idea: a tool for easy collection of domain-specific data that doesn't exist yet. I started it as a single-user thing, then decided to make it multi-user/collaborative - so imagine students sitting in class occasionally photographing handwriting, and all the data gathered in a central place where an admin can review and export it to ML pipelines or other tools.

I also added on-device SAM (Segment Anything Model) to draw polygons around objects automatically, no internet needed.

The tool is simple, the idea's nothing fancy - I started building it for myself and had fun with it. Not sure if there's an alternative out there. I did look and didn't find exactly what I needed: simple tool, take a photo, annotate multiple regions, organize into spaces (organizations) and projects.

It's early alpha, iOS-only for now, up on TestFlight if anyone wants to poke at it. It's rough, critical feedback very welcome.

Curious if anyone else has had to build a dataset from scratch for a niche domain - how did you handle the collecting part?

u/AdmiralMontana — 1 day ago

▲ 2 r/computervision

Question about MVTec AD 2 wallplug ground truth masks

Hi all,

I was researching anomaly detection with MVTec AD 2 and got confused about the ground truth masks for the wallplug category, especially the overexposed defects.

I am trying to understand the annotation logic. Is the ground truth supposed to mark the visible anomaly spot itself, the whole affected object, or the missing or invalid part caused by the anomaly?

In some examples, the mask seems to mark the visible anomalous spot. In another case, the whole object part seems to be considered anomalous. In image 001, it looks like the mask may be highlighting a missing or hypothetical removed part, but I am not sure, because the shape does not seem to match the expected part very well.

Has anyone else worked with this category and noticed this? Is this a known annotation issue, or is there a logic behind these masks that I am missing?

Images are from the MVTec AD 2 dataset, licensed under CC BY-NC-SA 4.0. I am sharing only small examples for a noncommercial research question, with attribution to MVTec.

u/j_root_ — 1 day ago

▲ 12 r/computervision+1 crossposts

Inverse INSID3: Background-Guided Segmentation with DINOv3

I built a small computer vision project based on INSID3, the CVPR 2026 training-free in-context segmentation method using DINOv3.
My version flips the idea: instead of providing a foreground reference, you provide background or normal examples. The algorithm removes background-like regions and segments the remaining object/anomaly.
It supports multiple background sources and can also turn coarse boxes into more precise masks. Other applications are possible like zero-shot anomaly detection.
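
The background-guided idea can be sketched in a few lines: score every patch against the background examples and keep whatever resembles none of them. This is only a toy illustration and not the algorithm in the repo; the three-dimensional vectors stand in for real DINOv3 patch features, and `foreground_mask` and the 0.8 threshold are assumptions.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def foreground_mask(patch_feats, background_feats, sim_thresh=0.8):
    """Mark a patch as foreground (1) when no background reference feature resembles it."""
    mask = []
    for feat in patch_feats:
        best = max(cosine(feat, bg) for bg in background_feats)
        mask.append(0 if best >= sim_thresh else 1)
    return mask


# Toy stand-ins for patch features from the background/normal examples.
background_feats = [[1.0, 0.0, 0.1], [0.9, 0.2, 0.0]]
# Toy stand-ins for patch features from the query image.
patch_feats = [
    [1.0, 0.1, 0.1],   # background-like
    [0.0, 1.0, 0.2],   # unlike any background example
    [0.8, 0.1, 0.0],   # background-like
]
mask = foreground_mask(patch_feats, background_feats)
print(mask)
```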

Would love feedback or test cases: https://github.com/dimfot3/Inverse-INSID3

u/dimfot333 — 1 day ago

▲ 2 r/computervision

How to parse airplane HUD

Hi,

I am currently trying to parse the contents of the HUD of a virtual F/A-18C Hornet in DCS. For this I use OpenCV: with a green channel filter and some thresholding combined with an ROI, I am able to grab the displayed elements. Template matching is then used on the individual glyphs to extract the values. The only issue is that the pitch ladder rotates and sometimes overlaps, for example, the altitude value, like here: example

Is there a way to somehow separate the values using CV?

Thank you.

u/schnibbediSchmabb — 1 day ago

▲ 3 r/computervision

Publication advice

I'm doing research on a niche detection problem, not A* level novelty, and haven't decided where to publish it. Just interested in people's opinions here. Which is the best option for a future resume/industry job? CVPR workshop, regional IEEE conference (Europe), or a journal (let's say Springer Q2)?

u/whosaidoverfitting — 1 day ago

▲ 9 r/computervision

Thoughts?

Building a fly tipping detection system using YOLOv8/RF-DETR and Roboflow. 320 labelled images so far, retraining with 820 augmented images now.

First model hitting 95% on vehicle detection but struggling to generalise to unseen images — currently working on dataset variety and augmentation to fix overfitting.

Planning to add OCR for number plate reading and a behaviour sequence logic layer on top of the detections.

Happy to share what I’ve learned so far — any advice on improving generalisation with a small dataset?

u/NeuroDash — 2 days ago

▲ 24 r/computervision

Looking to Contribute to Open Source Computer Vision Projects

Hi!

I work as a Research Associate in Computer Vision and I'm looking for interesting open source CV projects where I can contribute while learning something new. If you know of any active projects or communities that welcome contributors, I'd appreciate your recommendations.

Thanks!

u/Ankitzanzmera — 3 days ago

▲ 52 r/computervision

Solving Cross-image object detection in SAM 3

Hey everyone,

We all know SAM 3 is incredible with its visual prompting features, but I recently ran into a pretty frustrating limitation while building out an object detection pipeline. If you want to use a specific visual prompt (like a bounding box of an object in a reference image) to detect that same object across a bunch of other, distinct images, SAM 3 doesn't natively support this.
You can use a text prompt, but some objects cannot be described in text.

I spent some time experimenting with workarounds and wanted to share the approach I landed on, plus see if anyone has tackled the next step I'm working towards.

The First Attempt: The "Video Frame" Approach

My initial thought was to hack the video segmentation feature. You can join the images as sequential frames and perform inference through them using the initial visual prompt.

The Problem: This only works well for actual video or highly sequential data. If your target images aren't extremely visually similar to the reference image, the model rapidly loses context, and the accuracy absolutely tanks.

Current Workaround:

I decided to take a completely different route to force the model to look at the reference and target at the exact same time.

Here is the flow:

- Take the reference image (containing the object's bounding box).
- Take the target image (where you want to find the object).
- Join them side-by-side into a single, unified image.
- Pass this combined image into SAM 3 along with the original visual prompt.

Because SAM 3 is analyzing it as one single image, it easily finds the related objects across the combined canvas. After inference, it's just a matter of running a quick script to adjust the detected bounding box coordinates back to the original target image's dimensions.
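
The coordinate adjustment could look something like the sketch below. It assumes the reference image sits on the left, the target on the right, neither was resized when joined, and boxes are `(x1, y1, x2, y2)` in pixels. The function name and example numbers are hypothetical and not taken from the linked code.

```python
def map_boxes_to_target(boxes, ref_width, target_width, target_height):
    """Shift boxes from the combined canvas into target-image coordinates."""
    mapped = []
    for x1, y1, x2, y2 in boxes:
        # Skip detections that lie entirely in the reference half (e.g. the prompt itself).
        if x2 <= ref_width:
            continue
        # Subtract the reference width, then clip to the target image bounds.
        nx1 = min(max(x1 - ref_width, 0), target_width)
        nx2 = min(max(x2 - ref_width, 0), target_width)
        ny1 = min(max(y1, 0), target_height)
        ny2 = min(max(y2, 0), target_height)
        if nx2 > nx1 and ny2 > ny1:
            mapped.append((nx1, ny1, nx2, ny2))
    return mapped


combined_boxes = [
    (50, 40, 150, 120),    # the prompted object in the reference half
    (700, 60, 820, 160),   # a match in the target half
    (630, 10, 660, 50),    # straddles the seam, so it gets clipped to the target
]
target_boxes = map_boxes_to_target(
    combined_boxes, ref_width=640, target_width=640, target_height=480
)
print(target_boxes)
```

If the two images are resized to a common height before joining, the mapped boxes would additionally need to be divided by the target's scale factor.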

The Results

The results so far have been surprisingly good! I've been running this inference on my current dataset, and it's doing exactly what I need it to. That being said, I still need to scale up my sample dataset size to truly benchmark how robust this is across edge cases.

What's Next: Getting to the Embeddings

While the concatenation approach works, it's undeniably inefficient for large-scale production pipelines. Rebuilding images on the fly adds overhead.

My next step is trying to extract the SAM 3 visual prompt embeddings, store them, and figure out a way to directly inject and reuse them across subsequent images, essentially brute-forcing the native support it currently lacks.

Has anyone here successfully extracted and reused SAM 3 embeddings for cross-image inference? Would love to hear if anyone is working on something similar or has ideas on optimizing this!

code: Link
video: Link

u/Full_Piano_3448 — 2 days ago

▲ 177 r/computervision

Padel Match - Built this for an Analytics Company using Open Source (Still in MVP)

u/Due-Guard221 — 3 days ago