u/FewConcentrate7283

What it actually takes to build an AR overlay on a physical object in real time.

Everyone loves a clean AR demo. You put on a headset, a beanbag lands on a cornhole board, and a beautifully rendered score badge floats effortlessly right above it. It looks like magic.

But behind the scenes, AR on physical objects is roughly 80% coordinate system problems. I just broke down the technical architecture of what we're building for Quantum Caddy (a real-time AR scoring system) and how we are shifting from a fixed-camera ecosystem to head-tracked, spatial AR glasses. If you are building anything in the computer vision or spatial computing space, these are the architectural hurdles no one warns you about in the demo videos:

1. The Core Issue: 2D Pixels vs. 3D Space

A camera sees a flat 2D image, but a physical object exists in 3D. If your coordinate math is off by even two centimeters, your AR asset floats over the wrong spot. In a precision scoring or training system, that's a broken product, not a cosmetic bug.

  • Phase 0 (Fixed): Right now, we use a static 2D homography via a fixed camera. We map four board corners at session start, compute a transformation matrix, and translate bounding boxes to zone coordinates. It works perfectly for screens, but it breaks the moment you move.
  • Phase 2 (Spatial AR): Moving to the Everysight Maverick AI glasses completely changes the architecture. The camera moves with the wearer's head while the physical object stays put. You can no longer rely on a static matrix; you need a live, continuous world-model updating from head pose in real time.
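
For anyone who hasn't written the Phase 0 step before, here is a rough, dependency-free sketch of it: fit a homography from four clicked corners, then push a pixel through it. The corner pixels, the 2 x 4 board frame, and the function names are all assumptions for illustration, not our production code.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner layout")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_homography(src, dst):
    """Return the 3x3 matrix (bottom-right entry fixed to 1) mapping 4 src points to 4 dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map one pixel to board coordinates, including the perspective divide."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical session-start calibration: clicked corners and an assumed board frame.
pixel_corners = [(100, 400), (540, 400), (420, 120), (220, 120)]
board_corners = [(0.0, 0.0), (2.0, 0.0), (2.0, 4.0), (0.0, 4.0)]
H = fit_homography(pixel_corners, board_corners)
```

`H` is only valid while the camera and the board both stay where they were at calibration, which is exactly why this approach stops working once the camera is on someone's head.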

2. The Architectural Blueprint

To tackle a dynamic environment with severe latency constraints (we need <400ms from bag-land to AR display), we mapped out a decoupled system design:

  • WorldState: Holds the canonical 3D position of the physical asset.
  • TrajectoryRuntime: Runs a Kalman filter on a front-facing camera to smooth out parabolic trajectory arcs.
  • GlassesAdapter: Translates system game events into hardware-specific HUD commands.
  • Continuous Gemma Loop: A background LLM loop that proactively generates "coaching chips" because AR glasses lack a keyboard, and voice commands fail in loud venues.
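
To make the TrajectoryRuntime bullet less abstract, here is a minimal Kalman filter sketch. It is deliberately simplified: one axis, a constant-velocity model, position-only measurements. The noise values and class name are assumptions; a real trajectory filter would track both image axes and likely add a gravity term for the arc.

```python
class ConstantVelocityKalman:
    """1D Kalman filter with state (position, velocity) and position measurements."""

    def __init__(self, process_var=1e-3, measurement_var=1.0):
        self.x, self.v = 0.0, 0.0
        # Large initial covariance: we start out knowing nothing about the state.
        self.p = [[100.0, 0.0], [0.0, 100.0]]
        self.q, self.r = process_var, measurement_var

    def step(self, z, dt=1.0):
        # Predict: advance the state and inflate the covariance.
        x = self.x + self.v * dt
        v = self.v
        (p00, p01), (p10, p11) = self.p
        p00, p01, p10, p11 = (
            p00 + dt * (p01 + p10) + dt * dt * p11 + self.q,
            p01 + dt * p11,
            p10 + dt * p11,
            p11 + self.q,
        )
        # Update: blend the prediction with the measurement z.
        s = p00 + self.r
        k0, k1 = p00 / s, p10 / s
        residual = z - x
        self.x = x + k0 * residual
        self.v = v + k1 * residual
        self.p = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
        return self.x, self.v
```

Feeding it one detection per frame gives a smoothed position plus a velocity estimate, which is the part a HUD needs in order to draw ahead of the object rather than behind it.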

3. Edge Cases That Will Break Your Model

If you take away one thing from our calibration refinement sprints, let it be this: Your math will look beautiful in the center of the frame and completely lie to you at the edges. Lens distortion and oblique camera angles mean that a homography or spatial anchor that boasts millimeter accuracy in the center can be an entire zone off near the corners. You have to aggressively account for non-planar surfaces and lens distortion drop-offs before you ever ship a line of production code.

For those building in spatial audio, CV tracking, or smart glasses development—how are you handling dynamic spatial anchoring without overloading your hardware's compute budget?

(Full engineering breakdown with our file notes over at TruPath Labs)

u/FewConcentrate7283 — 3 days ago

Computer vision is about to bring elite sports tracking to your rec league — and it's cheaper than you think

For years, the kind of tracking tech used in the NFL, FIFA, and MLB — multi-camera rigs, Hawk-Eye, Statcast — has been completely out of reach for amateur leagues and weekend tournaments. Player updates at the rec level still happen "in bits and pieces, some clips here, a few messages there."

But four things converged recently that are about to change that: monocular-to-3D tracking (one phone camera replacing a $500k motion capture lab), trackers that can handle occlusion, real-time object detection models, and edge compute boards like the NVIDIA Jetson Orin Nano for $249 running 100+ fps locally.

The results are already showing up in padel (95% tracking accuracy, match reports in 10 min instead of 3 hrs), pickleball (DUPR ratings from a single uploaded video), and even baseball bullpens getting Trackman-class pitch analysis from regular video.

The catch? Trust is hard. A system that's 96% right is still a dispute generator to the person on the wrong end of the 4%. And vision breaks fast in inconsistent environments — reflections, lighting changes, players changing shirts.

Really interesting breakdown of where this is heading and why the smart play is to start with a single sport on a fixed playfield.

🔗 https://trupathventures.net/labs/field-notes/cv-comes-for-rec-sports

u/FewConcentrate7283 — 5 days ago

Quantum Caddy's Vision System: Architecture for a Real-Time Scoring Engine

Quantum Caddy's vision system turns two commodity cameras into a frame-accurate, real-time scoring engine for a physical projectile-targeting sport, built on a strict separation between perception (which emits events), rules (which score), and narration (which explains).

Overview

The problem QC's vision system solves is deceptively narrow and genuinely hard: watch a physical projectile-targeting sport through ordinary cameras and produce a scoring record that a human referee would agree with, in real time, on hardware cheap enough to ship as a consumer product.

"Deceptively narrow" because the sport has a small rule set and a fixed playfield. "Genuinely hard" because the visual signal is adversarial in all the usual ways — motion blur on fast projectiles, occlusion when objects stack, lighting that swings from shade to direct sun, a playfield that physically shifts mid-session, and cameras that drop frames or disconnect. A scoring system that is 95% correct is not a scoring system; it is a dispute generator. The bar is human-referee parity, and the architecture is shaped almost entirely by the gap between "a model that detects objects well" and "a system you can trust to keep score."

The central design decision is a three-layer separation of concerns. Perception consumes camera frames and emits discrete, typed events — "a throw occurred," "an object settled in this zone." It never applies a rule. Rules consume events and produce score. This layer is fully deterministic — same events in, same score out, no model in the loop. Narration consumes the scored game state and produces coaching and commentary. This is where the language model lives.

This document covers the perception layer — the vision system proper. The separation matters here because it defines what the vision system is not allowed to do: it cannot guess at score, it cannot apply game logic, it cannot let a confident-but-wrong frame propagate into the record. Its only job is to emit events that are individually defensible.

System Shape

Two cameras, not one. A release-zone camera watches the area where the player throws; a target-zone camera looks down on the playfield where projectiles land. Each camera answers a question the other cannot: the release camera establishes that a throw happened and roughly when; the target camera establishes where the projectile ended up. Neither view alone is sufficient — the release camera cannot see the final resting position, and the target camera cannot reliably distinguish a thrown object from one nudged by hand. The architecture treats these as two halves of a single event that must be temporally correlated.

Both feeds run over standard RTSP from commodity 5MP IP cameras. All inference happens on a local Apple Silicon edge node — there is no cloud round-trip in the scoring path. This is a hard constraint, not a preference: a consumer product cannot depend on a venue's uplink, and a scoring decision that takes a network hop is too slow to narrate live.

Layer 1 — Detection

The detector is RT-DETRv2-S, a small anchor-free transformer detector. The choice over the more common YOLO family was driven by two things. First, architecture: RT-DETR is NMS-free — it emits a fixed set of object queries directly, which removes a class of post-processing tuning (non-max suppression thresholds) that is brittle under crowding, exactly the regime this system operates in when objects cluster on the playfield. Second, and decisively for a shipping product: licensing. The mainstream YOLO implementations are AGPL-3.0; RT-DETRv2 is Apache 2.0. The codebase keeps a YOLO backend strictly walled off for internal data-collection use and treats export to an Apache-licensed format as a hard gate before anything leaves an internal environment. This is the kind of decision that looks like a footnote and is actually load-bearing — an AGPL dependency in the inference path is a due-diligence failure waiting to happen.

The detector is wrapped in a backend abstraction that selects an inference engine by model file type — CoreML, ONNX Runtime, or the internal-only YOLO path — behind one unchanging API. The production path on the edge node is CoreML, which routes inference to the Apple Neural Engine. This is a meaningful systems choice: moving inference off the GPU and onto the dedicated ML accelerator drops per-frame latency from the ~15–30ms range into the ~3–6ms range, and frees the GPU for display and video encode. On an out-of-distribution holdout the detector scores an F1 above 0.99 — but the more important number is the confidence threshold discipline: the threshold is set deliberately high, because in this system a false positive (a phantom object) is far more expensive than a false negative. A missed detection is recovered on the next frame; a phantom detection can manufacture a score.
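
A minimal sketch of the two ideas in this paragraph, under stated assumptions: the file suffixes, engine labels, the `internal_env` flag, and the 0.8 threshold are all hypothetical stand-ins. The document only says that the backend is chosen by model file type and that the confidence threshold is deliberately high.

```python
from pathlib import Path

# Hypothetical suffix-to-engine table; the write-up names the engines, not the file types.
ENGINE_BY_SUFFIX = {
    ".mlpackage": "coreml",
    ".mlmodel": "coreml",
    ".onnx": "onnxruntime",
    ".pt": "yolo-internal",
}


def select_engine(model_path, internal_env=False):
    """Pick an inference engine from the model file's suffix."""
    suffix = Path(model_path).suffix.lower()
    if suffix not in ENGINE_BY_SUFFIX:
        raise ValueError(f"no backend registered for {suffix!r}")
    engine = ENGINE_BY_SUFFIX[suffix]
    # The AGPL-licensed path is walled off unless explicitly running internally.
    if engine == "yolo-internal" and not internal_env:
        raise PermissionError("YOLO backend is internal-only; export the model first")
    return engine


def keep_confident(detections, threshold=0.8):
    """Drop low-confidence detections: a phantom object costs more than a miss."""
    return [d for d in detections if d["score"] >= threshold]
```

Callers only ever see `select_engine` and a list of detections, so swapping CoreML for ONNX Runtime is a file change rather than a code change.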

The model detects three classes: the projectiles, the target surface, and the scoring aperture. The latter two matter because the system does not assume a fixed camera mount — it re-detects the playfield itself every frame.

Layer 2 — Tracking

Detection is per-frame and identity-free. Tracking adds persistent identity across frames via ByteTrack, a detection-association tracker that matches objects frame-to-frame by spatial overlap and keeps "lost" tracks alive for a buffer of frames so a brief occlusion does not spawn a new identity.

On top of ByteTrack sits a small per-object kinematic state machine: each tracked projectile moves through airborne → sliding → stationary, with a terminal settled-in-aperture state set externally by the scoring geometry. Transitions are driven purely by instantaneous speed — above a threshold the object is in flight, below it the object is on the surface, and a run of consecutive low-speed frames is required before "stationary" is declared. The separation here is intentional: ByteTrack answers which object is this, the kinematic FSM answers what is this object doing. Keeping those two questions in separate components means a tracker tuning change cannot silently alter the physics interpretation.
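
The kinematic state machine described above can be sketched roughly as follows. The two speed thresholds, the frame count, and the class name are illustrative assumptions; the terminal settled-in-aperture state is omitted because it is set externally by the scoring geometry.

```python
from enum import Enum


class Motion(Enum):
    AIRBORNE = "airborne"
    SLIDING = "sliding"
    STATIONARY = "stationary"


class KinematicFSM:
    """Per-track motion state driven only by instantaneous speed."""

    def __init__(self, flight_speed=300.0, rest_speed=5.0, rest_frames=3):
        self.flight_speed = flight_speed  # above this the object is in flight
        self.rest_speed = rest_speed      # below this a frame counts as low-speed
        self.rest_frames = rest_frames    # consecutive low-speed frames required
        self.low_speed_run = 0
        self.state = Motion.AIRBORNE

    def update(self, speed):
        if speed > self.flight_speed:
            self.low_speed_run = 0
            self.state = Motion.AIRBORNE
            return self.state
        # On the surface: count consecutive low-speed frames before declaring rest.
        self.low_speed_run = self.low_speed_run + 1 if speed < self.rest_speed else 0
        if self.low_speed_run >= self.rest_frames:
            self.state = Motion.STATIONARY
        else:
            self.state = Motion.SLIDING
        return self.state
```

Note that nothing here knows about track association: the tracker hands over a speed per identity, and this class only interprets it.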

Layer 3 — Calibration

A detector reports pixels. Scoring needs geometry. Calibration is the bridge: it maps the camera's pixel space to the playfield's coordinate space so that "this projectile center is at pixel (x, y)" becomes "this projectile is on-surface / in-aperture / off-target."

The calibration model is small and explicit — the four corners of the target surface and the center and radius of the scoring aperture. On-surface tests use a cross-product winding test against the calibrated quadrilateral; aperture tests use a radial distance check. Calibration can be set interactively (an operator clicks the corners) or recovered automatically from fiducial markers.
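
As a sketch of the two geometric tests named above, assuming a convex quadrilateral with corners listed in order around the surface (the coordinates and function names are made up for illustration):

```python
import math


def cross(o, a, b):
    """Z component of (a - o) x (b - o); its sign says which side of edge o->a point b is on."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def on_surface(point, quad):
    """Winding test: inside a convex quad means the same side of all four edges."""
    signs = [cross(quad[i], quad[(i + 1) % 4], point) for i in range(4)]
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)


def in_aperture(point, center, radius):
    """Radial test against the calibrated aperture circle."""
    return math.dist(point, center) <= radius


def classify(point, quad, center, radius):
    if in_aperture(point, center, radius):
        return "in-aperture"
    if on_surface(point, quad):
        return "on-surface"
    return "off-target"


# Hypothetical calibration in pixel space.
QUAD = [(100, 400), (540, 400), (420, 120), (220, 120)]
APERTURE_CENTER, APERTURE_RADIUS = (320, 180), 30
```

Checking for a consistent sign rather than a fixed one means the test works whether the operator clicked the corners clockwise or counter-clockwise.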

Calibration is also the system's most brittle seam, and the architecture is honest about it: if the physical playfield shifts — and it does, because players bump it — the pixel-to-geometry mapping is stale and every downstream zone classification is quietly wrong. A relock-without-full-recalibration mode is a known frontier item. Naming this plainly is part of the design philosophy: a vision system that hides its failure modes is not trustworthy; one that surfaces them can be engineered around.

The Event Boundary — Three Coupled State Machines

This is the heart of the vision system, and its most distinctive idea. The boundary between perception and rules is not a function call — it is three coupled finite state machines, because the act of deciding "an event occurred" is itself stateful and is where the hard bugs live.

Throw Detection FSM. Background subtraction flags motion in the release zone. A short run of consecutive motion frames is required before a "burst" is opened — this rejects single-frame noise. The burst captures a window of frames, and then the candidate throw must pass a set of physics gates: minimum flight time, minimum horizontal travel, a trajectory-confidence floor, a minimum arc height, and both upper and lower speed bounds. These gates are a kinematic plausibility filter. The upper speed bound rejects tracker jumps; the lower bound rejects loitering; the arc and travel minimums reject a hand reaching into frame. The point is to reject non-throws upstream, so the downstream layers never have to reason about body motion. A rejected candidate produces no event at all — it is logged for diagnostics and otherwise does not exist.
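
The physics gates lend themselves to a small, declarative filter. This is a sketch only: the field names, units, and every threshold value below are assumptions, chosen to show the shape of the check rather than the tuned numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThrowCandidate:
    flight_time_s: float
    horizontal_travel_px: float
    trajectory_confidence: float
    arc_height_px: float
    peak_speed_px_s: float


@dataclass(frozen=True)
class Gates:
    # Illustrative thresholds, not tuned values.
    min_flight_s: float = 0.3
    min_travel_px: float = 200.0
    min_confidence: float = 0.6
    min_arc_px: float = 40.0
    min_speed_px_s: float = 150.0
    max_speed_px_s: float = 3000.0


def failed_gates(c, g=Gates()):
    """Return the names of the gates a candidate fails; an empty list means accept."""
    checks = {
        "flight_time": c.flight_time_s >= g.min_flight_s,
        "horizontal_travel": c.horizontal_travel_px >= g.min_travel_px,
        "trajectory_confidence": c.trajectory_confidence >= g.min_confidence,
        "arc_height": c.arc_height_px >= g.min_arc_px,
        "speed_lower": c.peak_speed_px_s >= g.min_speed_px_s,
        "speed_upper": c.peak_speed_px_s <= g.max_speed_px_s,
    }
    return [name for name, ok in checks.items() if not ok]
```

Returning the failed gate names instead of a bare boolean gives the diagnostics log something useful to record while still emitting no event for a rejected candidate.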

Pair Window FSM. An accepted throw opens a time-bounded window. The target-zone camera must confirm a landing within that window for the throw to be scored as on-surface. If the window expires with no confirmation, the throw is scored as off-target — a real outcome, not an error. This FSM is the temporal correlation between the two cameras: it is what makes "the release camera saw a throw" and "the target camera saw a landing" into a single event.

Settlement FSM. The target-zone camera does not trust a single frame. When the object count on the surface increases, the new count must hold stable for a run of frames before the system commits — this rides out detection flicker. On commit, the system identifies which object is new by comparing against a frozen snapshot of the surface taken from the frame before the count changed, then classifies its zone and emits the scored event.
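
A rough sketch of the settlement logic, with one loud simplification: it identifies the new object by comparing track IDs against the frozen snapshot, which is an assumption made for brevity (a real system might compare positions instead). The class name and frame count are also hypothetical.

```python
class SettlementFSM:
    """Commit a new on-surface object only after the raised count holds steady."""

    def __init__(self, stable_frames=3):
        self.stable_frames = stable_frames
        self.last_frame = frozenset()  # IDs seen on the previous frame
        self.snapshot = None           # frozen IDs from the frame before the count rose
        self.pending_count = 0
        self.run = 0

    def update(self, ids):
        """Feed this frame's on-surface IDs; return newly committed IDs (often empty)."""
        ids = frozenset(ids)
        if self.snapshot is None:
            if len(ids) > len(self.last_frame):
                self.snapshot = self.last_frame
                self.pending_count = len(ids)
                self.run = 1
        elif len(ids) == self.pending_count:
            self.run += 1
        elif len(ids) > len(self.snapshot):
            # Count moved again while confirming: restart the stability run.
            self.pending_count = len(ids)
            self.run = 1
        else:
            # Count fell back to the snapshot level: treat it as detection flicker.
            self.snapshot = None
            self.run = 0
        self.last_frame = ids
        if self.snapshot is not None and self.run >= self.stable_frames:
            new_ids = sorted(ids - self.snapshot)
            self.snapshot = None
            self.run = 0
            return new_ids
        return []
```

A single-frame phantom detection raises the count for one frame, fails to hold, and is dropped without ever reaching the rules layer.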

These three FSMs are causally chained, and the governing insight is that a bug at any seam silently drops an event or double-counts one. That is why the event boundary is modeled this explicitly. The failures this system has actually hit were not bad detections; they were phantom pairs and invisible outcomes living in the coupling between these machines.

The Decoupling Lesson

One architectural fix is worth singling out because it generalizes. The Pair Window's expiry check — the tick that decides "this throw's window has elapsed, score it off-target" — was originally driven by the release camera's frame loop. The consequence: when the release camera went dark, disconnected, or restarted, the expiry tick stopped firing, and throws that should have resolved as off-target instead hung in memory indefinitely. The window could not expire because the thing that checked for expiry was coupled to a camera that was no longer running.

The fix was to move the expiry watcher onto a dedicated background timer, ticking at a fixed rate independent of any camera loop. The lesson is the transferable one: time-based logic must be driven by a clock, not by a data stream that can stall. Any vision system that correlates events across independent sensors will eventually meet this bug; the architecture now has it designed out.
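
Here is a minimal sketch of that pattern, assuming a pair window keyed by a hypothetical throw ID, a 3-second window, and a rule that a landing pairs with the oldest open throw. None of those specifics come from the system itself. The point it illustrates is that expiry reads a clock and is ticked by its own thread.

```python
import threading
import time


class PairWindow:
    """Open throws wait for a landing; expiry depends on a clock, not on camera frames."""

    def __init__(self, window_s=3.0, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock
        self.deadlines = {}  # throw_id -> deadline on the injected clock
        self.lock = threading.Lock()

    def open_window(self, throw_id):
        with self.lock:
            self.deadlines[throw_id] = self.clock() + self.window_s

    def confirm_landing(self):
        """Pair a landing with the oldest still-open throw, if there is one."""
        with self.lock:
            now = self.clock()
            live = [t for t, d in self.deadlines.items() if d > now]
            if not live:
                return None
            throw_id = min(live, key=lambda t: self.deadlines[t])
            del self.deadlines[throw_id]
            return (throw_id, "on_surface")

    def tick(self):
        """Resolve every expired window as off-target."""
        with self.lock:
            now = self.clock()
            expired = sorted(t for t, d in self.deadlines.items() if d <= now)
            for t in expired:
                del self.deadlines[t]
            return [(t, "off_target") for t in expired]


def run_expiry_watcher(window, on_event, stop, period_s=0.1):
    """Tick at a fixed period until `stop` (a threading.Event) is set."""
    while not stop.wait(period_s):
        for event in window.tick():
            on_event(event)
```

In use, `run_expiry_watcher` would run on a daemon `threading.Thread`, so a stalled or restarting camera loop has no way to stop windows from resolving. Injecting `clock` also makes the expiry path testable without sleeping.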

Why the Architecture Is Shaped This Way

Three principles fall out of the above, and they are what would carry to a different sport.

Perception emits events; it never scores. The determinism boundary is sacred. Everything probabilistic — the detector, the tracker, the FSMs — lives on the perception side. Everything that produces a number a player will argue about lives on the rules side and is fully deterministic. A model is never in the scoring path.

The event boundary is itself a state machine. Deciding that something happened is not a threshold; it is a stateful process with its own failure modes, and modeling it explicitly is what makes the failures findable.

Failure modes are named, not hidden. Calibration brittleness, occlusion, single-camera degradation — these are written down as frontier items, not papered over. A trustworthy system is one whose limits are legible.

Hard Problems / Open Frontiers

Calibration relock. A physical playfield shift currently forces full recalibration. A lightweight relock against fiducials, without operator intervention, is the highest-value open item.

Stacked-object occlusion. When projectiles physically stack, the overhead view cannot resolve the count. The path forward is a second overhead camera and/or a segmentation model that reasons about partial occlusion.

Single-camera degradation. The system should degrade gracefully — and legibly — when one of the two cameras is unavailable, rather than silently losing a class of outcomes.

Edge-timing coupling. The decoupling fix closed one instance of this; a general audit of "what logic is implicitly coupled to a frame rate that can stall" is worth doing once, deliberately.

u/FewConcentrate7283 — 8 days ago
▲ 2 r/RealityEngineered+1 crossposts

The Grade System

Every session gets a grade. F through B+. The brutality of honest assessment — and why it works.

One of the strangest things about building alone with AI is that nobody grades your work.

When you're on a team, there's ambient accountability. Someone reviews your PR. Your manager sees the sprint review. The engineer you're pairing with says "wait, that's not right" in real time. Failure has witnesses and witnesses have consequences.

Building alone, the only feedback loop is the one you build. And if you don't build one, you'll convince yourself everything is fine longer than it is.

I built one.

How It Works

Every work session — not every day, every session, which is sometimes 3 hours and sometimes 12 — gets a post-session review. One of the AI agents in my framework handles project oversight. At the end of each session, I ask for an assessment.

The assessment covers: what was planned, what shipped, what regressed, what's still blocked, what the root cause of any failures was. Then it gives a grade.

The scale runs F through B+. No plusses except B+. You can't get an A.

The reason you can't get an A is that perfect sessions don't exist. There's always a gap between what you planned to ship and what actually shipped. There's always something that revealed itself during the session that wasn't visible at the start. An A would mean you predicted the session perfectly in advance, and nobody does that. The ceiling is B+.

The reason there's no A- through C+ gradation is that granularity at the top is flattering noise. What matters is: was the session productive (B range), marginal (C range), or a failure (D or F)?

What an F Looks Like

One session in week three: grade F.

What happened: work had been done overnight to improve a major system component. The work was completed. When we came in to validate it, the improvements were there, but three things that had been working before were now broken. Classic regression. New code fixed the target problem and introduced new ones.

The grade was F because the session left the system in a worse state than it started. Not because the work was bad — the work was good. The failure was in validation. The new code wasn't tested against the existing behavior before it was merged. The regressions weren't caught until I ran the live system and reported what I saw.

The fix took most of the day. By evening, the grade was B.

The F wasn't punishment. It was a flag. Something in the process broke down, and we needed to figure out what before the next session started.

What a B+ Looks Like

One session in week four: grade B+.

What happened: a major evaluation pipeline shipped. The system could now test itself against scenarios it had never seen before. The eval ran clean. The results were informative. The code shipped without regressions.

The reason it's not an A: one planned item didn't land. A second component was started but not finished. The session was productive and left the system better than it found it, but the plan and the output weren't perfectly aligned.

B+ is, in practice, a great session. It happens maybe once or twice a week.

Why Brutal Honesty Works Here

The AI doesn't have an ego about the grade.

This sounds obvious but it's important. Human collaborators have good days and bad days. They get defensive about grades. They rationalize. They push back. You have to manage the relationship as well as the work.

The AI doesn't do any of that. If the session was an F, the session was an F. The post-mortem is clinical. The root cause analysis is honest. The plan for tomorrow is adjusted accordingly.

That absence of defensiveness is one of the most underrated things about this working arrangement. The feedback loop is clean because neither side has anything to protect.

The human side — me — has ego. I don't love seeing F days logged. But the log is honest and the log is useful, and over 25 days it's revealed patterns I wouldn't have seen otherwise. Which types of work produce regressions. Which sessions produce the best outcomes. What "good sequencing" actually looks like in practice.

The grade is a tool. Tools don't judge you. They just show you the shape of the work.

The Arc

If you stacked the session grades over 25 days, you'd see a noisy upward curve.

Days one through five: mostly C and D range. The architecture is being established, the data is wrong, the detection isn't working. Low productivity is expected.

Days six through fifteen: C to B range. Things start working. The feedback loop gets tighter. Regressions start getting caught faster.

Days fifteen through twenty-five: mostly B range with occasional F days where something regresses hard, followed by recovery sessions that sometimes hit B+.

The F days don't go away. They happen less. That's the best you can do.

Next: Post 05 — Why PM Isn't Overhead

Building Blind is a solo founder's journal. No tutorials. No frameworks. Just what actually happened.

u/FewConcentrate7283 — 12 days ago

Building an autonomous self-healing agent to monitor a live CV + LLM pipeline — what hierarchy, SOPs, and guardrails are you using in production?

Hey all — I'm setting up Hermes (NousResearch open-source agent) alongside Claude Code to autonomously monitor, audit, and self-heal a production computer vision + LLM system I've built for a sports tracking application.

Quick context on what I'm watching:

  • CV pipeline — RT-DETR model running CoreML on Apple Silicon, tracking throws from an overhead RTSP camera in real time
  • LLM layer — locally fine-tuned Gemma 4 E2B on FastAPI, fires on each throw event to generate scoring commentary and coaching
  • Supabase backend — structured throw events, round results, and game state emitted after each detection
  • Mission Control — Next.js dashboard that shows live game state

Right now when the system is running, I manually bring in Claude to audit whether the LLM did the right thing — did it hallucinate a score, did the FSM transitions fire correctly, did the physics filter catch what it should, did Supabase receive a clean event. That works but it's not sustainable and it's not real-time.

So I'm building Hermes to own that loop. Here's what I want it to do:

  1. Live audit while the pipeline runs — watch LLM outputs against ground truth (what the CV model actually detected), flag hallucinations, bad scores, FSM violations
  2. Security layer — catch prompt injection attempts, LLM outputs that leak internal state, responses that go out of bounds
  3. Self-healing — tiered response: minor issues get logged silently, service-level failures get an auto-fix attempt (restart the service, rerun the step, roll back a bad Supabase event), anything it can't fix pages me immediately
  4. Always-on — not just tied to QC runs. Long term I want it watching website security, business ops, cross-venture health
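
To make the tiered response in item 3 concrete, this is roughly the shape I have in mind. It is a sketch with hypothetical names (`Severity`, `respond`, the callbacks), not anything Hermes ships: a minor issue is logged, a service-level failure gets exactly one bounded fix attempt, and everything else pages a human.

```python
from enum import Enum


class Severity(Enum):
    MINOR = "minor"
    SERVICE = "service"
    UNKNOWN = "unknown"


def respond(severity, attempt_fix, log, page):
    """Route one incident through the tiers and return the action taken."""
    if severity is Severity.MINOR:
        log("minor issue recorded")
        return "logged"
    if severity is Severity.SERVICE:
        # Exactly one attempt: a retry loop is how an agent makes things worse.
        try:
            fixed = bool(attempt_fix())
        except Exception as exc:
            log(f"auto-fix raised: {exc!r}")
            fixed = False
        if fixed:
            log("auto-fix succeeded")
            return "fixed"
    page(f"needs a human: severity={severity.value}")
    return "paged"
```

`attempt_fix` would wrap something like a service restart or a step rerun. Keeping it as an injected callable means the routing can be tested without touching the live pipeline.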

The plan is Hermes running locally on a Mac Studio, connected to Claude Code (it auto-discovers credentials, so it uses my existing subscription — no separate API billing). Hermes watches the logs and Supabase events, detects anomalies, invokes Claude Code to audit and reason about them, then acts on the response and writes new skills as it learns.

What I'm trying to figure out:

  • Hierarchy / agent design — Did you build this as a single watchdog agent or a hierarchy? (e.g., one coordinator that dispatches to specialized sub-agents for CV vs LLM vs infra?) I'm leaning toward a tiered hierarchy but not sure where to draw the lines
  • SOPs for self-healing — What's your runbook look like when the agent decides to auto-fix something? How do you prevent it from making a bad situation worse? I've read that ~50% of production failures are infrastructure-level and the agent often doesn't even know the failure happened
  • Local vs VPS — The CV pipeline runs on a Mac Studio so Phase 1 is local. But for always-on monitoring of websites and business ops, I'm thinking VPS for Phase 2. Anyone run a hybrid where the local agent reports to a VPS-hosted coordination layer?
  • Silent failure detection — This seems to be the #1 killer. How do you instrument Hermes to know it itself is failing, not just the system it's watching?
  • Rate limiting and cost discipline — Since I'm using Claude subscription via headless mode, I need to be smart about how often I invoke it. What triggers do you use to decide when the AI brain needs to fire vs when a rule-based check is enough?

Nobody I've found has done exactly this for a CV + LLM real-time tracking system, so I'm building somewhat from first principles. But the self-healing agent pattern for Kubernetes and general infra is clearly proven — I'm trying to adapt those patterns to a perception + reasoning stack instead of a container cluster.

If you've built something in this space — even adjacent to it — I'd love to hear what your hierarchy looked like, what SOPs you put in place, what blew up, and what you'd do differently.

u/FewConcentrate7283 — 14 days ago
▲ 0 r/asl+1 crossposts

We trained an ASL recognition model 21 separate times—each time holding out a different deaf signer for testing and training on the other 20.

Despite using the same architecture, recipe, and 250-sign vocabulary across all 21 folds, the results reveal a massive disparity in user experience that "average" numbers usually hide.
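
For readers who want the evaluation loop spelled out, here is a minimal sketch of leave-one-signer-out folds plus the per-signer summary. The clip dictionary layout, the function names, and the 0.30 floor as a default are assumptions for illustration; the real notebook works on the ISLR landmark files.

```python
from statistics import mean


def leave_one_signer_out(clips):
    """Yield (held_out_signer, train_clips, test_clips), one fold per signer."""
    signers = sorted({c["signer"] for c in clips})
    for held_out in signers:
        train = [c for c in clips if c["signer"] != held_out]
        test = [c for c in clips if c["signer"] == held_out]
        yield held_out, train, test


def summarize(acc_by_signer, floor=0.30):
    """Report the distribution, not just the average."""
    accs = list(acc_by_signer.values())
    return {
        "best": max(accs),
        "worst": min(accs),
        "spread_pp": (max(accs) - min(accs)) * 100,
        "mean": mean(accs),
        "share_below_floor": sum(a < floor for a in accs) / len(accs),
    }
```

Each fold trains a fresh model on `train` and scores it on `test`; `summarize` then takes the resulting accuracy per held-out signer.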

The Headline Numbers

  • Best-served signer: 64.16% top-1 accuracy
  • Worst-served signer: 25.58% top-1 accuracy
  • The Spread: 38.57 percentage points
  • The "Mean": 41.74% (This aligns with typical literature, but hides the failure cases).

The Reality: 24% of the signers in the dataset scored below 30%. For these users, the model is effectively broken, despite "decent" average reports.

Why This Matters

Most published cross-signer ASL numbers report a single average. Our prior work reported a tiny standard deviation ($0.4467 \pm 0.0097$) because we only averaged two signers.

By spending 21× the compute to expose the full distribution, we found the standard deviation is actually 12× wider than a small split suggests. A field that stops at the average materially misrepresents the experience for at least a quarter of the population.

The Hypotheses (Pre-registered)

  • H1: Spread > 25 pp – PASS (38.57 pp)
  • H2: Worst signer < 0.30 – PASS (0.2558)
  • H3: Handshape complexity explains variance – REFUTED ($r^2 = 0.008$)

The Actionable Finding: Coarse sign-level tags (like "two-handed" or "face-adjacent") don't predict the performance gap. The signal is signer-level: likely regional dialects, signing speed, and individual kinematic styles—features currently missing from public datasets.

Methodology & Compute

  • Dataset: Google ISLR (asl-signs), 250 signs × 21 signers.
  • Architecture: FrameTransformer (4.85M params).
  • Hardware: ~80 min per fold on RTX 3090 (Total ~$13 on RunPod).
  • Determinism: Fully reproducible via torch.use_deterministic_algorithms(True).

What’s Next?

A 38 pp gap isn't a "bigger model" problem; it's a data diversity problem. Our Phase 4 plan focuses on partner-driven capture targeting 30+ signers across regional dialects, using consent infrastructure co-designed with deaf-community organizations.

Full Notebook (Open & Forkable):

Kaggle: Parley Notebook 03 - Signer Dialect Leave-One-Out

u/FewConcentrate7283 — 24 days ago
▲ 1 r/RealityEngineered+1 crossposts

Quick update for folks who saw the earlier posts on what we're doing at Parley (the open-research arm working on the deaf↔hearing conversation gap). We have just made two notebooks public on Kaggle:

Headline Findings

(Signer-holdout split, 3 seeds, error bars included)

From N00 (Exploratory Data Analysis):

  • Structural Missingness: Face/pose landmarks rarely go missing, but one-hand missingness dominates the dataset.
  • Temporal Leakage: Median clip length varies by over 2x across signers. Temporal models that don't normalize end up partly learning "who" is signing rather than "what" is being signed.
  • The Validation Trap: The Kaggle-default random split leaks signer identity. We recommend a Signer-holdout (17/2/2) split as the default for rigorous testing.
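
The signer-holdout split is simple to implement. This is a sketch, assuming each clip carries a signer ID; the function names and the seeded shuffle are illustrative choices, not necessarily what the notebook does.

```python
import random


def signer_holdout_split(signer_ids, n_val=2, n_test=2, seed=0):
    """Partition signers (not clips) into disjoint train/val/test sets."""
    signers = sorted(set(signer_ids))
    if len(signers) <= n_val + n_test:
        raise ValueError("not enough signers to hold any out")
    random.Random(seed).shuffle(signers)
    test = set(signers[:n_test])
    val = set(signers[n_test:n_test + n_val])
    train = set(signers[n_test + n_val:])
    return train, val, test


def partition_clips(clips, train, val, test):
    """Route each clip by its signer, so no signer appears in two splits."""
    out = {"train": [], "val": [], "test": []}
    for clip in clips:
        if clip["signer"] in train:
            out["train"].append(clip)
        elif clip["signer"] in val:
            out["val"].append(clip)
        elif clip["signer"] in test:
            out["test"].append(clip)
    return out
```

With 21 signers and the defaults this gives the 17/2/2 split. The key property is that the shuffle happens over signer IDs, never over clips.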

From N01 (Hand-Shape Baselines):

  • Linear probe (hand landmarks): 25.3% top-1
  • Hand-only MLP (single frame): 31.5% ± 1.7%
  • Temporal 1D-conv (full sequence): 36.4% ± 1.5%
  • The Gap: The jump from hand-only to temporal is only ~4.9 pp. This suggests most of the signal in isolated signs is raw hand separability. Current architectures are recovering ~5 points on top of priors—useful, but not where the "hard problem" actually lives.

The Implication

For anyone working in this space: published isolated-sign numbers tell us much less about real conversational ASL than the field currently treats them. The real challenges—continuous signing, signer variance, co-articulation, and occlusion—are still ahead of us.

What’s next on the roadmap:

  • N02: Real landmark-only ceiling with identical splits (identifying if the plateau is the model or the data).
  • N03: Cross-signer leave-one-out (which signs degrade most?).
  • N04: Isolated → Continuous (what breaks first?).
u/FewConcentrate7283 — 26 days ago
▲ 1 r/asl+1 crossposts

Sharing a research arm I'm running called Parley — long-term goal is bidirectional Deaf/hearing conversation on AR glasses, but right now we're just doing honest CV science in public.

The honesty problem: Most published ASL recognition papers report ~83% top-1 on word-level recognition. Most of those numbers come from random splits — train and test signers overlap. When you split by signer (held-out signers never seen during training), accuracy collapses to ~30–40% across architectures. That gap is the actual product gap.

Notebook 01 — Hand-shape baseline (public):
https://www.kaggle.com/code/truepathventures/parley-notebook-01-hand-shape-baseline

  • Dataset: Google ASL Signs (250 signs, 21 signers, ~94K MediaPipe-landmark clips)
  • Split: 17 train / 2 val / 2 test signers, no leak
  • Hand-only MLP: 32.1% ± 1.6 (3 seeds)
  • Temporal 1D-conv: 36.4% ± 1.5 (3 seeds)
  • Full confusion matrix + failure gallery published

The next training plan, now that the data is staged:

I just pulled four image datasets to run the next phase:

| Dataset | Size | Purpose |
|---|---|---|
| HaGRID 384p | 509K imgs, 18 gestures, COCO-annotated | Hand detector backbone |
| Kaggle ASL Alphabet | 87K imgs, A–Z + control | Static fingerspelling classifier |
| Sign Language MNIST | 35K imgs, A–Z grayscale | Robustness check |
| ayuraj/asl-dataset | 5K imgs, 0–9 + A–Z cropped | Backbone fine-tune |

Pipeline (each box is a separate model on its own dataset):

Camera frame
  → RT-DETRv2-S hand detector       (trained on HaGRID, single "hand" class)
  → MediaPipe landmark extraction
  → ConvNeXt-Tiny static classifier (trained on combined letter datasets)
  → Temporal 1D-conv / transformer  (Google ASL Signs, signer-holdout)
  → Sentence assembler              (later)

Why RT-DETRv2 and not YOLO: YOLOv5+ is AGPL-3.0. We need a permissive (Apache-2.0) detector for any commercial path. RT-DETRv2-S is the cleanest option that actually competes on edge silicon.

Honesty discipline I'm holding myself to (every notebook):

  • ≥3 seeds, mean ± std reported
  • Signer-holdout split or stratified-k-fold, never random when signers are involved
  • Baseline + best model both published
  • Failure gallery (not just confusion matrix)
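
The first rule is easy to enforce in code. A minimal sketch, assuming accuracies are collected as one percentage per seed; whether to use sample or population standard deviation is a choice, and this helper (a hypothetical name) uses the sample version from the standard library.

```python
from statistics import mean, stdev


def report(scores_pct, min_seeds=3):
    """Format per-seed accuracies as 'mean ± std', refusing to report on too few seeds."""
    if len(scores_pct) < min_seeds:
        raise ValueError(f"need at least {min_seeds} seeds, got {len(scores_pct)}")
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return f"{mean(scores_pct):.1f}% ± {stdev(scores_pct):.1f} ({len(scores_pct)} seeds)"
```

Raising on fewer than three seeds makes the discipline a property of the tooling rather than something to remember per notebook.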

Open questions I'd love feedback on:

  1. Is anyone training RT-DETRv2 specifically for fine-grained hand detection? Curious about anchor / query count tradeoffs at small object size.
  2. For the static handshape classifier — would you bet on a small ViT, ConvNeXt-Tiny, or a hand-pose-aware MLP head on top of MediaPipe landmarks?
  3. Is there a cleaner public continuous-signing benchmark than RWTH-PHOENIX-2014T that anyone uses with a signer-holdout?

Code, datasets, and methodology will keep landing on Kaggle as I go.

u/FewConcentrate7283 — 24 days ago
▲ 6 r/kaggle

I’ve been writing a slow-release research arc on ASL recognition, and before any modeling, I wanted to actually look at Google’s Isolated Sign Language Recognition dataset the way it should’ve been looked at before every Kaggle winner reported 83% accuracy on it.

Notebook 00 of a nine-phase project: What does the Google ASL Signs data actually look like?

https://www.kaggle.com/code/truepathventures/parley-notebook-00-islr-eda

The sharp opinion, drawn from the EDA itself:

The Kaggle-default random 80/10/10 split — which every public winning solution used — puts the same signer’s clips in train, val, and test. That’s measuring how well the model memorizes each signer’s specific missing-landmark pattern, not how well it generalizes. Three numerical reasons:

  1. Missing-landmark patterns are structural per-sign, not random. The sign × landmark-type heatmap shows clear one-hand-missing signatures for bilateral-handshape signs and face-adjacent signs. Fork the notebook and scroll to §3.

  2. Median clip length varies 2×+ across the 21 signers. Fixed-length padding normalizes away signer-specific timing the model won’t see at inference.

  3. Per-signer coverage of signs is high but not uniform. Leave-one-signer-out evaluation is feasible — the coverage histogram in §6 is how we know.

Recommended split: signer-holdout — 17 train / 2 val / 2 test. Notebook 01 (next month) quantifies the accuracy gap against random-split, with error bars across 3+ seeds.

This is notebook 1 of 9. Not a competition entry — a slow-release research project. Feedback welcome, especially from anyone who’s worked with ISLR before or runs signer-holdout evaluation in their own sign-language ML work.

u/FewConcentrate7283 — 30 days ago