r/MachineLearning

Ph.D. thesis on Differentiable Ray Tracing for Radio Propagation Modeling [R]

Hi everyone, I recently finished my Ph.D. thesis on Differentiable Ray Tracing for Radio Propagation Modeling. Instead of just compiling my published papers, I tried to write it as an accessible, self-contained textbook for anyone interested in the intersection of radio propagation simulation, autodiff, and ML.

Permanent handle: https://hdl.handle.net/2078.5/278727
Repo with TeX source files

While my research focuses on wireless communications rather than pure ML, I think it fits right in here. A major part of the project revolves around automatic differentiation. By taking frameworks like JAX out of their traditional ML context and integrating differentiability into a ray tracing pipeline, we can compute exact gradients through complex physical environments. This allows us to solve inverse problems and directly train machine learning models, which is currently a hot topic in next-gen wireless design.

To make the physics and the math easy to digest, the manuscript is split into three parts:

Understanding: The physics fundamentals (electromagnetic theory, geometrical optics, and diffraction).
Building: The algorithmic core, including GPU-accelerated path tracing and the discontinuity smoothing techniques you need to actually make differentiable simulations stable.
Using: Practical applications like channel modeling, localization, material calibration, and ML-assisted generative path sampling.
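
To give a flavour of the discontinuity smoothing mentioned under "Building", here is a minimal sketch in plain Python. It is not code from the thesis or from DiffeRT: the 1-D toy scene (a point is visible when x lies past an obstacle edge), the function names, and the sharpness parameter alpha are all assumptions made for illustration. The hard visibility test has zero gradient almost everywhere, while a sigmoid relaxation yields a gradient an optimizer can follow.

```python
import math


def hard_visibility(x: float, edge: float = 0.0) -> float:
    """Binary visibility: 1 if the point is past the obstacle edge, else 0."""
    return 1.0 if x > edge else 0.0


def smooth_visibility(x: float, edge: float = 0.0, alpha: float = 10.0) -> float:
    """Sigmoid relaxation of the step; larger alpha approaches the hard test."""
    return 1.0 / (1.0 + math.exp(-alpha * (x - edge)))


def smooth_visibility_grad(x: float, edge: float = 0.0, alpha: float = 10.0) -> float:
    """Analytic derivative of the relaxed visibility with respect to x."""
    s = smooth_visibility(x, edge, alpha)
    return alpha * s * (1.0 - s)
```

The trade-off is the usual one: a small alpha gives smooth but biased visibility, a large alpha is closer to the true geometry but pushes the gradient back toward zero away from the edge.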

A major focus of my thesis is the link between scientific research and reproducible open-source software. On that note, I want to give a massive shoutout to Patrick Kidger (u/patrickkidger). His own thesis inspired me to go the "textbook way" for my manuscript, and I heavily relied on his fantastic JAX packages (jaxtyping, equinox, and optimistix) when developing my open-source libraries, such as DiffeRT.

I hope you find it an interesting read! I'd be happy to answer any questions in the comments about differentiable simulation, ray tracing, or building ray tracing engines in JAX :-)

If you are curious, you can watch the presentation slides and video teaser here

u/jeertmans — 2 hours ago

▲ 59 r/MachineLearning

MIRA: Multiplayer Interactive World Models trained on Rocket League [R]

We're happy to release MIRA, a collaboration between General Intuition, Kyutai, and Epic Games.

MIRA was trained on 10k hours of synthetic Rocket League data. The model has 5B parameters and runs for 4 players at 20 fps on a single B200.

We've released a playable online demo, an in-depth technical report, and a 1k-hour dataset of 4-player gameplay:

Demo: https://mira-wm.com
Technical report: https://mira-wm.com/paper
Repo: https://github.com/mira-wm/mira

If you're at ICML, we're also running an interactive demo (booth 111) where you can play it with us using proper PlayStation controllers!

u/MasterScrat — 8 hours ago

▲ 4 r/MachineLearning

Issue with arXiv: abstract not matching PDF/HTML [D]

Hi, I was reading the OpenRLHF paper (https://arxiv.org/pdf/2501.03262v4), but when I open the abstract page (https://arxiv.org/abs/2501.03262v4), it shows "REINFORCE++". Note that https://arxiv.org/html/2501.03262v4 still shows the correct OpenRLHF paper. I suspect arXiv has some incorrect symlinks.

Is there anyone working at arXiv here who would like to look into this?

u/Ok-Painter573 — 5 hours ago

▲ 1 r/MachineLearning+1 crossposts

Does quantising a model reduce its performance? [R]

If I were to quantise an FP32 model to FP8 (or any other lower-precision format), would the information loss be drastic?
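
One way to get a feel for the answer is to measure the round-trip error directly. The sketch below uses symmetric 8-bit integer quantisation as a simple stand-in; it is not FP8, which has a floating-point layout, and the helper names and sample weights are made up for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantisation to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate floating-point values."""
    return [qi * scale for qi in q]


weights = [0.02, -0.75, 0.31, 1.27, -0.004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case absolute error over the tensor.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.5f} max_err={max_err:.5f}")
```

With this scheme the absolute error per weight is bounded by half a quantisation step, so small weights in a tensor with one large outlier lose the most relative precision. Whether that is "drastic" for accuracy depends on the model and has to be measured on a task.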

u/Cultural-Lobster7795 — 12 hours ago

▲ 23 r/MachineLearning

ICML Position Track: Want Better ML Reviews? Stop Asking Nicely and Start Incentivizing with a Credit System [D]

“Maybe the real AGI was the friends we made along the way” is a sentiment that always hits me, and conferences are the places where I reunite with old friends and meet new ones. However, when it comes to the submission/review experience, it might not be much of an exaggeration to say that almost everyone has many unpleasant experiences to share.

So I wrote a position paper to discuss this. I argue that current conference organizers lack proper tools to instill accountability and incentives for reviewers/authors/ACs/SACs… The result is that undesired behaviors (e.g., lack of engagement) often go unchecked, while good behaviors are rarely rewarded and therefore don’t happen (honestly, when was the last time you witnessed any constructive internal discussion among reviewers/ACs?). And this won’t change by writing nice words in Reviewer Guidelines or issuing a few desk rejections.

I propose a CREDIT SYSTEM where community members earn points by “doing good” — e.g., reviewing a paper would get you +1, being outstanding gets you +3. Then, members can spend points to redeem perks ranging from traditional ones already adopted in current ML conferences (e.g., free registration) to new ones, such as requesting an additional reviewer to sort through a muddy situation. Such a system could also support explorative ideas like:

- Refundable submission fees: say 10 points per submission, which are then refunded regardless of acceptance, unless the submission is uniformly voted to be unready / ultra-low quality.

- Mobilizing non-author reviewers: non-author reviewers don’t have the bandwidth issue of wearing both the author and reviewer hats and are not influenced by their own submissions.

and many more...

My proposed system is far from perfect, but I’d like to think it takes a step toward a better conference review mechanism. I am also glad to see the position paper track becoming a welcoming platform for researchers to hash out their proposals and build toward a better future (see other review-related position papers below.)

For a topic that affects literally everyone at ICML, I am eager to hear your thoughts.

u/choHZ — 12 hours ago

▲ 219 r/MachineLearning

Machine learning industry job requirements used to be myopic, but now it feels impossible. Anyone else seeing this? [D]

Today I was casually browsing jobs tagged [machine learning] on one of those large, popular job sites. What I saw astonished me, and I want to check with Reddit whether I am hallucinating.

A non-FAANG/non-Deepmind/.../non-Anthropic industrial automation company is hiring people to work on ML for robots (the latest hot topic). Fine. But then I saw their laundry list of job requirements ("you must meet these"), which include:

Deep expertise in LLM, VLA, VLM, action transformers
Deep expertise in robot dynamic and kinematic modelling (forward, inverse kinematics, trajectory generation, planning), sensor fusion, model predictive control, reinforcement learning
Deep expertise in CUDA GPU programming, FPGA hardware acceleration
Familiarity with latest software engineering best practices in Python3 and C++23
Familiarity with one or more popular ML frameworks
Have top publications in one or more typical ML and robotics conferences

This is before they go on to list familiarity with a set of standard software packages and simulators, one of which is called RLib, something I've never heard of. Oh, and of course they had 3+ and 5+ years of "non-academic" experience requirements. I forget which was which.

I was just sitting there confused. Then I checked several more jobs, and it was more of the same (except for some banks).

I remember a talk by Terence Tao where he divided mathematicians into two camps, the analysts and the algebraists. He said that even among top mathematicians, it is exceedingly rare to find someone who possesses deep expertise in both, as each tends to require a different mode of thinking and each is infinitely deep in terms of specialization, theory and insights.

And here we have a bunch of ML companies treating these infinitely deep academic fields ranging from robot dynamic and kinematic modelling to large language models like some bizarre MMORPG video-game scenario where you need to be a warrior archer warlock who is also a shaman priest mage.

Who are they even hiring, lol?

u/NeighborhoodFatCat — 1 day ago

▲ 23 r/MachineLearning+5 crossposts

Hierarchos: Preliminary Findings From a 232M Recurrent Memory-Augmented Assistant Model [P]

[Project Release / Research Draft] Hierarchos at 232M Parameters: Preliminary Findings From a Recurrent Memory-Augmented Assistant Model

Technical Report: July 2nd, 2026

Project: Hierarchos / KortexHOS

Authors: Makhi Burroughs / netcat420, Lost Time, and the Hierarchos project team

TL;DR:

We built and trained Hierarchos, an experimental 232M-parameter recurrent, memory-augmented language model from scratch. It is not a GPT-3/3.5-class model, but it successfully proves that a hybrid non-Transformer architecture (combining an RWKV backbone, hierarchical manager/worker loops, differentiable slot-based LTM, and a deterministic suffix automaton) can survive training, avoid collapse, and maintain short-form instruction coherence. Most of our breakthroughs came from fixing subtle train/inference parity mismatches and numerical stability bugs.

Dataset: netcat420/Experiment_0.1 (Alpaca format)
Training: 13 epochs on an RTX 6000 Blackwell (96GB) rental.

1. Introduction & Background

Modern LLMs are heavily dominated by Transformer scaling. Hierarchos explores a different path: can recurrent state, explicit memory retrieval, hierarchical iterative computation, and bounded local inference make a small model vastly more parameter-efficient?

Hierarchos isn't a direct clone of any single architecture, but a hybrid inspired by:

RWKV-style recurrence: For efficient sequence processing without traditional attention.
Titans-style neural memory: For persistent test-time memory.
Hierarchical reasoning (HRM): Multi-level recurrent modules (Manager/Worker) to iteratively refine state.

2. Architecture Overview

[Token Input] -> [ROSA Suffix Matcher / DeepEmbed Modulator]
       |
       v
[Long-Term Memory] <-> [Top-k Associative Lookup]
       |
       v
[Manager Recurrent Cell] -> (Produces Context Plan & Drift Vector)
       |
       v
[Worker Recurrent Cell]  -> (Refines local state / clamps drift)
       |
       v
[RWKV Backbone (Clamped Channel-Mix)] -> [Next-Token Logits]

Key Components:

ROSA: A deterministic suffix-automaton path predicting continuation tokens based on exact repeated suffix patterns.
DeepEmbed: A token-specific modulation path that influences RWKV channel mixing.
LTM Subsystem: Learned slow-memory keys/values combined with fast working-memory values.
Manager/Worker Loop: High-level manager handles broad context to produce a target plan; the lower-level worker refines token-local state using a regularized drift vector.

3. Core Engineering Lessons (The "Gotchas")

A low training loss does not guarantee coherent chat. We had to fix several critical state-contract and numerical stability bugs to make the model usable:

1. Chat/Training Drift Mismatch

The Bug: During live streaming chat, the loop was feeding the previous drift state back into the model on every single token. During training, this state is reseeded at Truncated Backpropagation Through Time (TBPTT) chunk boundaries.
The Fix: We aligned the inference code to reseed only at chunk boundaries, as training does. Before this fix, live chat logits diverged sharply from the training-time logits; after the fix, the logit error dropped to near zero.

2. Supervised LTM Inner Updates Mismatch

The Bug: Giving the model supervised memory updates during training that it can't replicate during zero-label live inference creates a crutch. The model learns to rely on a hidden training-only helper signal.
The Fix (v0.20.4): Implemented --ltm-training-mode read-only. Training keeps the memory structures but stops doing supervised fast-memory writes, perfectly mirroring inference.

3. Unbounded RWKV Channel Mixing

The Bug: Long runs exposed activation spikes in the ReLU-squared channel-mix FFN path, which were amplified by DeepEmbed modulation into NaN gradients.
The Fix: Implemented key clamps (--rwkv-channel-mix-key-clamp 12.0), DeepEmbed clamps (4.0), and excluded DeepEmbed identity gates from AdamW weight decay.
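
To illustrate why a clamp tames the ReLU-squared path, here is a minimal scalar sketch. It assumes the clamp is applied to the pre-activation key before squaring; where exactly Hierarchos applies it is not stated above, and the function names are hypothetical. Only the 12.0 value comes from the flag quoted in the fix.

```python
def relu_squared(k: float) -> float:
    """ReLU followed by squaring, as used in RWKV-style channel mixing."""
    return max(k, 0.0) ** 2


def clamped_relu_squared(k: float, key_clamp: float = 12.0) -> float:
    """Same activation, with the key limited to [-key_clamp, key_clamp] first."""
    k = max(-key_clamp, min(key_clamp, k))
    return relu_squared(k)


pre_activations = [0.5, 3.0, 40.0, 250.0]
unclamped = [relu_squared(k) for k in pre_activations]
clamped = [clamped_relu_squared(k) for k in pre_activations]

for k, u, c in zip(pre_activations, unclamped, clamped):
    print(f"key={k:6.1f} unclamped={u:9.1f} clamped={c:6.1f}")
```

Because squaring amplifies outliers quadratically, a single spike of 250 becomes 62500 without the clamp, while the clamped output can never exceed 12.0 squared, i.e. 144.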

4. Evaluation & Smoke Test Results

Because cloud costs add up, we benchmarked the model locally on a CPU preset via a ROG Ally (--eval-limit 100), ensuring passive learning was disabled and working memory was cleared to mimic static chat.

Bounded Local Benchmark Metrics (--eval-limit 100)

Benchmark	Metric	Score	Std. Err.
ARC Easy	acc	0.3600	0.0482
ARC Easy	acc_norm	0.3200	0.0469
HellaSwag	acc	0.3400	0.0476
HellaSwag	acc_norm	0.3700	0.0485
TruthfulQA MC1	acc	0.2200	0.0416
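
As a sanity check on the table, the reported standard errors are consistent with the sample standard error of an accuracy over n = 100 binary outcomes, sqrt(p(1 - p) / (n - 1)). That formula is an assumption about how the evaluation harness computes the column; it is not stated in the report.

```python
import math


def binary_stderr(acc: float, n: int) -> float:
    """Standard error of a mean over n 0/1 outcomes, using the sample std (n - 1)."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# (score, reported standard error) copied from the table above.
reported = {
    ("ARC Easy", "acc"): (0.36, 0.0482),
    ("ARC Easy", "acc_norm"): (0.32, 0.0469),
    ("HellaSwag", "acc"): (0.34, 0.0476),
    ("HellaSwag", "acc_norm"): (0.37, 0.0485),
    ("TruthfulQA MC1", "acc"): (0.22, 0.0416),
}

for (bench, metric), (score, stderr) in reported.items():
    recomputed = binary_stderr(score, 100)
    print(f"{bench:15s} {metric:9s} recomputed={recomputed:.4f} reported={stderr:.4f}")
```

With only 100 items per benchmark, each score carries an uncertainty of roughly plus or minus 0.05 at one standard error, which is worth keeping in mind when comparing against chance level.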

Real-world Coherence Check:

The Good: Assistant-shaped; follows short instruction prompts well thanks to the Alpaca training data. The nontrivial commonsense and QA signal suggests the weights didn't collapse.
The Bad: Brittle on long context lengths, weak on arithmetic/factual recall. Coherence is comparable to the GPT-2 era, not modern GPT-3.5+ systems.

5. Proposed Ablation & Scaling Plan

We want to transform this from a promising prototype into a rigorous scientific result. Our next step requires scaling tiers and isolated component testing.

Proposed Isolation Testing (Ablations)

No LTM / Read-Only LTM: Isolating exactly how much slot memory helps.
No ROSA / No DeepEmbed: Evaluating the real token-efficiency gains of suffix-matching and modulation.
Baseline Matches: Running a direct Transformer 232M and RWKV-only 232M on the exact same token budget to prove true comparative architecture efficiency.

Future Scaling Target Tiers

Tier	Model Size	Token Target	Purpose
Scout	300M–500M	20B–50B	Validate loss slope and stability scaling.
Real v1	1B–1.5B	100B–300B	Test architecture limits beyond small-scale behavior.
Serious	3B	600B–1.5T	Establish a truly competitive local open-source alternative.
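
A quick arithmetic check on these tiers is the implied tokens-per-parameter ratio. The sketch below only divides the numbers in the table; the comparison point of roughly 20 tokens per parameter is the commonly quoted rule of thumb from Hoffmann et al., and the variable names are illustrative.

```python
# (parameter range, token range) per tier, taken from the table above.
TIERS = {
    "Scout": ((300e6, 500e6), (20e9, 50e9)),
    "Real v1": ((1e9, 1.5e9), (100e9, 300e9)),
    "Serious": ((3e9, 3e9), (600e9, 1.5e12)),
}


def tokens_per_param_range(params, tokens):
    """Smallest and largest tokens-per-parameter ratio a tier allows."""
    (p_min, p_max), (t_min, t_max) = params, tokens
    return t_min / p_max, t_max / p_min


ratios = {name: tokens_per_param_range(p, t) for name, (p, t) in TIERS.items()}
for name, (low, high) in ratios.items():
    print(f"{name:8s} {low:6.1f} to {high:6.1f} tokens per parameter")
```

Even the low end of every tier sits above 20 tokens per parameter, so the plan leans toward over-training small models rather than compute-optimal training.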

Target Data Mix for Foundation Training:

Instead of jumping straight into instruction SFT data, a scaled run will prioritize high-quality base data:

35-50%: FineWeb / FineWeb-Edu style clean web text
20-30%: Dolma / DCLM curated web data
8-15%: Code and tech documentation
5-12%: Math, science, and academic proofs
1-5%: In-house assistant conversational SFT (applied exclusively in late-stage tuning)

6. What We Can (and Cannot) Claim Safely

What is supported by the data:

Hierarchos is a functional, coherent 232M experimental assistant checkpoint.
Combining recurrent sequence loops, memory slots, and hierarchical workers is viable and stable with the right clamps.
The findings provide a solid engineering roadmap for non-Transformer architecture stability.

What is NOT supported (Do not hype this!):

No claims of GPT-3.5 level math, coding, or logic.
No claims of superiority over attention/Transformer models at equal parameter counts yet (baselines pending).
Not production-ready for heavily quantized or low-bit local deployments yet due to drift sensitivity.

Final Thoughts

Hierarchos 232M shows that small, alternative architectures are still a deeply fruitful area of LLM research if you can conquer the train/inference state drift.

We would love to hear feedback from anyone working on recurrent neural memory or hierarchical backbones! Full code, scripts, and logs are in progress.

References:

Brown et al. Language Models are Few-Shot Learners. arXiv:2005.14165. https://arxiv.org/abs/2005.14165
Hoffmann et al. Training Compute-Optimal Large Language Models. arXiv:2203.15556. https://arxiv.org/abs/2203.15556
Peng et al. RWKV: Reinventing RNNs for the Transformer Era. arXiv:2305.13048. https://arxiv.org/abs/2305.13048
Behrouz et al. Titans: Learning to Memorize at Test Time. arXiv:2501.00663. https://arxiv.org/abs/2501.00663
Wang et al. Hierarchical Reasoning Model. arXiv:2506.21734. https://arxiv.org/abs/2506.21734
Zellers et al. HellaSwag: Can a Machine Really Finish Your Sentence? arXiv:1905.07830. https://arxiv.org/abs/1905.07830
Clark et al. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457. https://arxiv.org/abs/1803.05457
Lin et al. TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv:2109.07958. https://arxiv.org/abs/2109.07958
Hugging Face. FineWeb dataset. https://huggingface.co/datasets/HuggingFaceFW/fineweb
Hugging Face. FineWeb-Edu dataset. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
Allen AI. Dolma dataset. https://huggingface.co/datasets/allenai/dolma
DataComp-LM. DCLM Baseline dataset. https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0

GitHub repository with the architecture and the released model weights: https://github.com/necat101/Hierarchos

u/PhysicsDisastrous462 — 18 hours ago

▲ 2 r/MachineLearning

How should I encode both target and feature variable for a multiclass classification? [D]

I am preprocessing a CSV dataset for multiclass classification with XGBoost. My feature variables contain numerical and categorical values, while the target variable contains many categorical values. For example, the feature variables contain patient name, phone number, and exercise history, while the target variable contains different disease names such as heart attack, stroke, Alzheimer's, etc.

I know that feature variables can be encoded using one-hot encoding, but should the target variable also be encoded using the same method, or should I use a different encoding method for target variable (e.g., label encoding)?

If anyone knows the answer, please let me know. I have searched everywhere, but failed to get any clear idea about it. Thank you.
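
For reference, the two encodings in the question can be sketched in plain Python. XGBoost's multiclass objectives expect the target as integer class ids from 0 to K-1, which is what label encoding produces, while one-hot encoding applies to categorical features. The helper names and the exercise-history categories below are hypothetical; in practice a library encoder such as scikit-learn's LabelEncoder does the same job.

```python
def fit_label_encoder(labels):
    """Map each distinct class name to an integer id in sorted order."""
    classes = sorted(set(labels))
    return {name: idx for idx, name in enumerate(classes)}


def one_hot(value, categories):
    """Encode one categorical feature value as a 0/1 indicator vector."""
    return [1 if value == c else 0 for c in categories]


# Target: one integer per row, not a one-hot matrix.
diseases = ["stroke", "heart attack", "Alzheimer's", "stroke"]
class_to_id = fit_label_encoder(diseases)
y = [class_to_id[d] for d in diseases]

# Keep the inverse mapping to turn predictions back into disease names.
id_to_class = {idx: name for name, idx in class_to_id.items()}

# Feature: one-hot over an assumed set of exercise-history categories.
exercise_levels = ["none", "weekly", "daily"]
x_exercise = one_hot("weekly", exercise_levels)
```

The integer ids carry no ordering meaning for the target; they are just class indices, which is why label encoding is safe there even though it can mislead a model when used on unordered features.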

u/Rami02021 — 19 hours ago

▲ 49 r/MachineLearning

Is Intrinsic Motivation a Viable PhD Topic in 2026? [D]

I started a PhD in CS about a year and a half ago. Generally speaking, my topic is intrinsic motivation (more commonly referred to as unsupervised RL).

Intrinsic motivation (IM) is a niche field within AI. It seeks to develop reward signals which are not specific to any task but rather something closer to the low level motivators that drive intelligent behaviors in animals. Some prominent examples are:

Empowerment: https://arxiv.org/abs/2301.00005
Diversity is all you need: https://arxiv.org/abs/1802.06070
Intrinsic curiosity module: https://arxiv.org/abs/1705.05363
Random network distillation: https://arxiv.org/abs/1810.12894

and many more...

My question is: is this topic still "worth" pursuing now? Almost every day I see a new video of a robot doing some amazing acrobatic flip, navigating over hostile terrain, or performing some dexterous manipulation task. I believe that most of this is being done with human supervision through either a carefully tuned reward signal or behavior cloning from human demonstrations. If incredible advances are being made in robot learning without IM then why is it necessary at all? Furthermore IM has typically been restricted to very simple scenarios such as low dimensional robotic systems in simulation (hopper, walker, etc...).

On a more personal note I have some concerns about future employability. If I focus too heavily on this niche topic during my PhD I worry that it may be impossible to get hired at a research lab that would prefer a candidate with experience in behavior cloning or other hot topics.

I'm curious to hear what this community thinks. Has anyone been in a similar situation with their PhD topic?

u/soup---- — 2 days ago

▲ 1 r/MachineLearning+1 crossposts

Does anyone have a name for that subtle "Sameness" creeping into model outputs lately? [R]

I've been running a lot of comparative evals across recent model releases—both API and open-weight—and there's a pattern I can't unsee.

After a certain number of turns, or when you push into niche territory, the outputs start converging. Same cadence. Same hedging phrases. Same blind spots. It's not full collapse. It's a kind of... homogenization. A creep.

My working theory: we're deep enough into the synthetic data flywheel now that we're seeing the first-generation effects. Not model collapse in the catastrophic sense, but a gradual loss of "texture" across models that share overlapping synthetic ancestry.

I've been calling this EchoCreep in my notes. The slow, creeping homogenization of model behavior driven by shared synthetic data lineage.

Has anyone else been tracking this? Is there a formal term yet? If not, what are you seeing in your evals that fits this pattern? I'm especially interested in:

Concrete eval metrics that might capture it
Whether fine-tuning on entirely human-curated data clears it
If you've seen it worsen between checkpoint versions

Any feedback would be appreciated.

Thanks

u/BCondor3 — 1 day ago

▲ 114 r/MachineLearning

If DeepMind or Anthropic is doing your exact research topic, do you still continue? [D]

As someone who is not affiliated with any of the big tech companies, I find it particularly difficult to have the confidence or enthusiasm to approach any ML problem with an attitude that my professors probably had at my stage in life. I'm sure I am not the only one having the following thoughts:

"My research is currently being done better at companies."
"The ML problem I set out to solve is already solved and in fact turned into products and sold for millions at companies X, Y, Z. There is no need for further research."
"Industry is not interested in theoretical ideas and there is plenty of evidence for that, starting with their hiring practices."
"Companies wouldn't have millions of dollars in funding or revenues if their models weren't working."
"Research is like Darwinian evolution. Evolution aims to produce the fittest model. After decades of evolution, the fittest model is already in industry, why should I explore other evolutionary dead-ends?"
"There may not be a next big thing after LLM. If there were, it would be simply incorporated as a function or a subroutine that LLM simply calls when needed, and the average person would be none the wiser. My contribution would be invisible."

Seems like research outside of big tech companies is pointless (unless you are a prof who is making big $$ while doing it). Because whatever they are working on might be lightyears ahead of whatever you are doing, but you wouldn't know because their model is simultaneously closed-source and omnipotent.

There are tons of people sharing their resumes on other ML/CS subreddits and occasionally you see that their projects are along the lines of "linear regression for Titanic dataset" or "YOLO for pedestrian detection" and they are wondering out loud why nobody is hiring them. Everyone with more ML experience can see why: there is zero need for people with this skillset. But what if my very research also looks the same to people in industry? What if my "deep geometric autoencoding variational neural-former" also looks like some silly Kaggle project because industry can already do that much more efficiently?

How do you silence these thoughts?

u/NeighborhoodFatCat — 2 days ago

▲ 24 r/MachineLearning

Is machine learning research worth it for now? [D]

I am a scientist who just applied machine learning to my research (JEPA/Representation/Geometric branch) and it did wonders! It opened up so many paper ideas that I am still struggling to write them all up.

From what I see, there are clearly a million possibilities not done yet, e.g., industrial data, patterns in nature, etc.

Why is the job outlook so pessimistic? We clearly have unsolved problems, and for many of them the potential of ML will surely be proven. We also have money (according to the news), so why are jobs almost impossible to get?

u/nebula7293 — 2 days ago

▲ 14 r/MachineLearning

ECCV travel support program [D]

Has anyone gotten a response from the ECCV travel support program listed on their website? https://eccv.ecva.net/Conferences/2026/DEI

Edit: also, has anyone applied for this program as an accepted author? I have an independent research paper accepted and am currently looking for funds to pay the registration fees.

u/tedd235 — 2 days ago

▲ 55 r/MachineLearning+5 crossposts

If your GPU can run inference, it should be able to fine-tune too.

I spent the last few months building a new sparse fine-tuning method for MoE models called USAF.

The goal was simple: if your GPU can run inference on an MoE model, it should also be able to fine-tune it.

On my AMD RX 6750 XT (12 GB), I can fine-tune Qwen3-30B-A3B by training sparse expert weights and the router instead of adapters.

The project is completely open source under the Apache 2.0 license. I'm not trying to build a business, sell anything, or monetize it in any way—I just wanted to share something I built that I think is genuinely interesting.

GitHub: https://github.com/tsuyu122/usaf

u/tsuyu122 — 3 days ago

▲ 4 r/MachineLearning

Best models for generating red-team attacks? Also looking for public datasets [R]

Hi everyone, I'm currently working on a framework to evaluate the security of LLM applications and AI agents, and I've been stuck on one part for a while.

Most red-teaming frameworks rely on an LLM to generate adversarial prompts. My question is more about which model to use.

Which closed-source models would you recommend for generating high-quality attacks?
Which open-source models have worked well for you?
Have you noticed any models that consistently generate more realistic or challenging attacks than others?

I'm looking for models that can generate attacks such as toxicity, prompt injection, SQL injection, jailbreaks, indirect prompt injection, prompt leakage, tool misuse, multi-turn attacks, and other agent-specific attacks.

I also have another question.

Is there a good public dataset that people use to benchmark or validate the security of AI agents? I'd prefer a "golden" dataset with predefined, high-quality attacks rather than generating everything from scratch.

I'm curious about what people actually use in practice. If you've worked on LLM security or red teaming, I'd really appreciate any recommendations, whether it's models, datasets, papers, or GitHub repositories.

Thanks in advance! Any advice or insights would be greatly appreciated.

u/Background-Song2007 — 1 day ago

▲ 25 r/MachineLearning

Competence Gate: gating tool-use on a small model's internal confidence signal instead of its verbalised one — Qwen3.5-4B, open weights [P]

I made a 10MB LoRA adapter for Qwen3.5-4B plus a small orchestration layer. It decides, per query, whether to answer directly, search the web, or retrieve from your own local documents and it refuses to make things up when it can't verify an answer.

It runs locally (Apple Silicon / MLX, with a GGUF build for llama.cpp/Ollama).

Basically, small instruct models are poor at telling users how confident they really are. They can't verbalise it and tend to say they are confident about everything. In my past research I tested seven 3-9B models and they all hit a confidence ceiling. But the information is there in the internal activations. The adapter reads the internal signal directly and gates tool use on it.

The main elements are that:

- it catches its own errors better than the base model's tool calling (d′ improvement of 0.46 (95% CI [0.01, 0.89])). Of the cases the gate flagged that the base model didn't, 87% were genuinely wrong answers.

- it is less likely to leak your private queries to public search. A two-signal version routes questions involving personal information, such as "what did my discharge summary say", to a local retriever instead of a web search. It cut the rate of private questions sent to public search from 22% to 10% (reduction 0.12, 95% CI [0.02, 0.22]). This is useful for those who are using the LLM for confidential docs.

- every answer is traceable. When it retrieves, it cites the specific passage (report.md ¶2), verifies the answer is actually in that passage, and shows a confidence band. Worst case, it says "I couldn't verify that". It is built to say "I don't know" instead of lying.
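
For readers unfamiliar with d′: it is the signal-detection sensitivity index, the difference between the z-scores of the hit rate and the false-alarm rate. The sketch below shows the computation with Python's standard library. The rates are made-up illustrative numbers, not results from this project, and treating "flagging a wrong answer" as a hit is an assumption about how the metric maps onto the gate.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Sensitivity index: z(hit rate) minus z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


# Hypothetical rates: fraction of wrong answers flagged (hits) and
# fraction of correct answers flagged by mistake (false alarms).
base = d_prime(hit_rate=0.60, false_alarm_rate=0.30)
gated = d_prime(hit_rate=0.75, false_alarm_rate=0.30)
improvement = gated - base

print(f"base d'={base:.2f} gated d'={gated:.2f} improvement={improvement:.2f}")
```

A d′ of 0 means flags are no better than chance, and a confidence interval on the improvement whose lower bound sits near 0, as with [0.01, 0.89] above, means the gain is detectable but not yet precisely estimated.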

Limitations:

- Privacy result is n=60; the retrieval/competence dissociation is n=126 hand-authored items. Screened and CI'd, but small.

- GGUF reproduces the MLX gate's decisions at --lora-scaled ...:8 (found by sweep — scale 1 does nothing; effective scale ≈ the training scale). Agreement 0.83 on a 24-item probe; disagreements are all conservative-direction (GGUF answers a couple of borderline items MLX would look up), and knowns never false-fire. Faithful on the safety-critical directions, marginally more conservative at the margin.

- Serve-time confidence is coarse (grounded / declined / answered) — the distilled gate reads nothing at inference, so finer bands need probe access (offline).

- Inherits Qwen3.5-4B's knowledge and biases. The gate governs when to trust the model, not what it knows.

The approach isn't Qwen-specific — I started on SmolLM3-3B, and it should extend to other models and larger sizes.

Repo (weights + code + model card): https://huggingface.co/synthiumjp/competence-gate-qwen3.5-4b

Apache-2.0. It's an open research release, and I hope people find some use for it. Methodology and papers are cited in the model card. I'm genuinely interested in critique; it's screened work, so if there are any issues it would be great to know.

**** Update ****

I ran the gate against external benchmarks it hadn't been tested on, and one use case did not survive. The gate does not improve grounded document QA — answering faithfully from a provided passage and abstaining when the passage doesn't support an answer. On SQuAD 2.0 unanswerables, fabrication was actually higher with the gate than without it.

The reason is an example of construct specificity. "Knowing when to defer" is not one capability. There are at least two distinct signals hiding inside it:

- Parametric competence: do I know this from my own weights? The gate reads this. It's what the probe was validated against.

- Evidential grounding: is this answer supported by the passage in front of me? A different question, from a different information source.

A probe validated for one carries no usable signal for the other. A parametric-competence signal applied to an evidential-grounding task doesn't just fail to help, it actually interferes by pushing toward answering and suppressing the base model's (Qwen's) own abstention. The base model already handles the easy case (0% fabrication when the passage plainly lacks the answer). The hard case (adversarial unanswerables) needs purpose-built grounded-abstention training, not a post-hoc firewall.

The release is scoped to what's validated: parametric tool-call routing and privacy-aware retrieval routing. The "refuses to fabricate about documents" framing in the original post above is the part that doesn't hold.

u/Synthium- — 2 days ago

▲ 8 r/MachineLearning

I built an open, from-scratch MT pipeline + parallel corpus for Tunisian Darija (Arabizi): early baseline, and I'm growing it into a curated community corpus [P]

I'm an 18-year-old independent student from Tunisia. I built and I'm leading an open, from-scratch machine-translation pipeline and parallel corpus for Tunisian Darija. Sharing it for feedback.

Why: Tunisian Darija, written in Arabizi (Latin letters + numerals like 3/7/9/5 for Arabic phonemes), has almost no open NLP resources. Existing Arabic tools route it through MSA and mishandle the orthography. To the best of my knowledge there was no open parallel corpus or from-scratch baseline for it.

What I built (all open):

- Arabizi-aware SentencePiece BPE tokenizer (3/7/9/5 as protected symbols), shared 16k vocab.

- ~15.6M-param encoder–decoder Transformer, from scratch (no pretrained LM): transfer-learned from cleaned Moroccan Darija, then fine-tuned on hand-crafted Tunisian pairs.

- Full cleaning / training / eval pipeline.

Honest results & limitations: v1 BLEU is 3.89 on a small locked test set. That's low, and I'll be upfront about it. The corpus is ~553 hand-crafted pairs, so data is the bottleneck, not architecture. I treat 3.89 as a first honest baseline to beat as the corpus grows.

Where I'm taking it: I'm expanding this into a larger, ethically collected Darija corpus that I curate and validate: consent-documented field collection, every pair provenance-tagged. I'm looking for contributors to help grow it, with every contribution reviewed to keep quality and consent standards.

Looking for: technical feedback/critique, and anyone interested in contributing data or collaborating on low-resource / dialectal Arabic MT.

Links:

GitHub repo: https://github.com/Dhiadev-tn/darija-translator

Hugging Face dataset: https://huggingface.co/datasets/Dhiadev-tn/tunisian-darija-english

Hugging Face model: https://huggingface.co/Dhiadev-tn/darija-translator

u/Dhiadev-tn — 2 days ago

▲ 12 r/MachineLearning

BaryGraph - knowledge graph where every relationship is its own embedded document (not an edge) [R]

Instead of node --edge--> node, every relationship is a first-class document with its own vector, called a BaryEdge. Stack pairs of BaryEdges recursively and you get "MetaBary" triads that surface structural bridges between concepts that live nowhere near each other in embedding space. Running locally on MongoDB Community + mongot + nomic-embed-text over the full English Wiktionary (6.6M docs). MCP server is live if you want to poke at it. Preprint + benchmark CSVs: https://zenodo.org/records/20186500

The problem I was chasing

Flat vector search treats a relationship as a byproduct of two points being close. That throws away information. Two papers can describe the same underlying phenomenon (a flyby anomaly in orbital mechanics, an anomalous residual in stellar dynamics) without ever citing each other and without their embeddings landing anywhere near each other. Nothing in standard RAG surfaces that connection.

What I did instead

Every relationship gets embedded too:

bary_vector = normalize(q·v(CM1) + q·v(CM2) + (1−q)·v(type))

q is connection quality, v(type) is a contextual embedding of what kind of relationship it is. This BaryEdge is now a retrievable document in its own right — not metadata on an edge.
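A minimal NumPy sketch of that formula, for anyone who wants to see the arithmetic. It assumes `normalize` means L2 normalization and that q lies in [0, 1]; the function and argument names are hypothetical and not taken from the bary-vector repo.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (assumed meaning of `normalize`)."""
    norm = np.linalg.norm(x)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return x / norm


def bary_vector(v_cm1: np.ndarray, v_cm2: np.ndarray,
                v_type: np.ndarray, q: float) -> np.ndarray:
    """normalize(q*v(CM1) + q*v(CM2) + (1-q)*v(type)).

    q is the connection quality; the [0, 1] range is an assumption
    suggested by the (1 - q) weight on the relationship-type vector.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q is assumed to lie in [0, 1]")
    combined = q * v_cm1 + q * v_cm2 + (1.0 - q) * v_type
    return l2_normalize(combined)


# Toy 3-dim vectors standing in for 768-dim embeddings.
cm1 = np.array([1.0, 0.0, 0.0])
cm2 = np.array([0.0, 1.0, 0.0])
rel_type = np.array([0.0, 0.0, 1.0])
edge_vec = bary_vector(cm1, cm2, rel_type, q=0.5)
```

With q near 1 the edge vector is dominated by the two endpoints; with q near 0 it collapses toward the relationship-type embedding.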

Then it recurses: two BaryEdges at the same level get bridged by a third one level below, forming a MetaBary triad. Do that repeatedly and you climb an abstraction hierarchy of triads built entirely from algebra, with zero additional embedding calls above the base level. It's a forest (every node has at most one parent), so traversal to root is a single $graphLookup, no cycle handling.
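To illustrate why a forest makes root traversal cheap, here is a rough sketch. The post does not give the schema, so the collection name `bary_nodes` and the field `parent_id` are placeholders. The first function only builds a `$graphLookup` aggregation pipeline as a plain Python structure; the second does the same walk in memory.

```python
def ancestors_pipeline(start_id, collection="bary_nodes", parent_field="parent_id"):
    """Build an aggregation pipeline that collects all ancestors of one node.

    Collection and field names are hypothetical; adapt them to the real schema.
    """
    return [
        {"$match": {"_id": start_id}},
        {"$graphLookup": {
            "from": collection,
            "startWith": f"${parent_field}",   # begin at this node's parent
            "connectFromField": parent_field,  # follow parent pointers upward
            "connectToField": "_id",
            "as": "ancestors",
            "depthField": "depth",             # 0 = direct parent
        }},
    ]


def path_to_root(parents, node):
    """In-memory equivalent: follow the single parent pointer until it ends."""
    path = [node]
    while parents.get(node) is not None:
        node = parents[node]
        path.append(node)
    return path


# Toy forest: two edges bridged by one triad, which has its own parent.
parents = {
    "edge_a": "triad_1",
    "edge_b": "triad_1",
    "triad_1": "triad_root",
    "triad_root": None,
}
pipeline = ancestors_pipeline("edge_a")
path = path_to_root(parents, "edge_a")
```

Because each node has at most one parent, the walk cannot revisit a node, which is why no cycle handling is needed.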

Does it actually do anything useful?

Ran it against SimLex-999 and WordSim-353 as a sanity check (not the main claim, just "is the substrate coherent"). Raw cosine similarity barely correlates with human similarity judgments (ρ ≈ −0.04 on SimLex). Structural metrics — how many BaryEdges two words share, how much their relational neighborhoods overlap — correlate at ρ ≈ 0.32–0.53, p < 10⁻¹⁵. So the graph is encoding something cosine alone doesn't.
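For readers who want to reproduce that kind of check, here is a small sketch of Spearman's ρ with average ranks for ties (ties are common when the metric is a count of shared edges). The numbers below are made up for illustration only and are not from the benchmark CSVs.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


# Made-up example: human similarity scores vs. shared-edge counts.
human_scores = [9.1, 7.4, 3.2, 1.5, 6.0]
shared_edges = [12, 8, 2, 2, 9]
rho = spearman_rho(human_scores, shared_edges)
```

This only computes the coefficient; a significance test (the p-value quoted above) would need an additional step.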

The part I actually care about is cross-domain bridging. Some probe traces from the live graph:

octopus neuroscience ↔ distributed sensor networks, bridged by shared structural-motif vocabulary (neuroarchitecture, smartdust)
collagen folding ↔ linguistic syntax, bridged by etymological + structural motif overlap (plicature / hypotaxis-parataxis)
grief ↔ depression: not bridged, and this is a correctness demonstration, not a missing capability. The DSM's much-debated "bereavement exclusion" (present in DSM-IV, removed in DSM-5) existed precisely because grief and depression share surface symptoms but are different kinds of state, with different prognosis and treatment
radioactive decay ↔ obsolete words falling out of use, bridged at a high abstraction level by register-varied decay verbs (collapsed, decayed, declined, disintegrated) — naming a Poisson-process state-loss pattern that both physics and historical linguistics instantiate, with no single word doing the work

That last one is the case flat retrieval structurally cannot produce — there's no embedding axis for "verbs co-occurring with reduction-of-state across unrelated domains."

Stack (all local, all free)

GitHub: https://github.com/oleksiy-perepelytsya/bary-vector

MongoDB Community Edition + mongot for storage/vector search
nomic-embed-text, 768-dim
Python 3.11+
Full build: ~6.66M documents, 8–14 hrs on a single workstation (8–16GB VRAM)

Try it

MCP server is public on request (SSE transport) — read-only tools for searching the live graph: find_word, semantic_search, edge_info, leaf_nodes, traverse_up, sample_metabary. If you've got an MCP-capable client you can point it at the graph and run your own probe queries in a few minutes.

What I'd actually want feedback on

Whether the cross-domain bridges hold up to someone who isn't me poking at them. Try a probe query on a domain pair you know well and tell me if the bridge is real or if I'm pattern-matching myself into seeing structure that isn't there. Some bridges may not be obvious at first glance, but those are often the most intriguing ones and worth digging into to find out why they were built, so treat them as points of investigation
Whether this is worth comparing directly against GraphRAG/RAPTOR-style hierarchical retrieval (I haven't done that benchmark yet, and I know that's the first thing this sub will ask)
Whether anyone's tried something structurally similar and it fell apart at scale for reasons I haven't hit yet

Preprint, architecture spec, and the raw SimLex/WordSim CSVs are all here: https://zenodo.org/records/20186500

Happy to drop the MCP endpoint on request if there's interest.

u/adseipsum — 3 days ago

▲ 0 r/MachineLearning

We'll benchmark an open-weights LLM on any GPU you choose: drop your model + hardware and we'll run it. [D]

We run HexGrid Cloud, a platform for deploying open-source models on GPUs, and we're heads-down optimizing our serving/deployment layer.

To pressure-test it we're benchmarking real models under real concurrency — and instead of guessing, we'd rather run what you actually want to see.

---

Models available for benchmarking:

Nemotron-3 Super 120B-A12B (only NVFP4)
Nemotron-3 Nano 30B A3B
Qwen-3.6 27B
Llama 3.3 70B Instruct
Gemma-4 31B
Devstral-Small-2-24B-Instruct-2512
?? (you suggest a model to us)

We're focused on chat/instruct models for now (that's what most of our users deploy), so pick one from the list above — or suggest another open-weight chat model that fits on a single H200 (141GB).

---

Hardware & quant choices:

GPU (up to H200 for this round): RTX PRO 6000 · L40S · H100 · H200
Quant: FP8 / AWQ / BF16
Context length: (8K, 32K, 64K, 128K)
What you want measured: max throughput? single-stream speed? long-context prefill?

---

We'll run the top picks and post full results — tokens/sec, TTFT, TPOT, throughput under concurrency, and cost-per-million-tokens — config and flags included so it's reproducible.
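For anyone unfamiliar with those metrics, here is a rough sketch of how they are commonly derived from per-request timestamps. The field names and the example numbers are illustrative assumptions, not HexGrid's actual measurement harness or pricing.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps (seconds) for one streamed request; names are hypothetical."""
    start_s: float
    first_token_s: float
    end_s: float
    output_tokens: int


def ttft(t: RequestTiming) -> float:
    """Time to first token: delay before the first output token arrives."""
    return t.first_token_s - t.start_s


def tpot(t: RequestTiming) -> float:
    """Time per output token, averaged over the tokens after the first."""
    if t.output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (t.end_s - t.first_token_s) / (t.output_tokens - 1)


def cost_per_million_tokens(gpu_price_per_hour: float,
                            tokens_per_second: float) -> float:
    """Cost of one million tokens at a sustained aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600.0
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


# Illustrative numbers only.
timing = RequestTiming(start_s=0.0, first_token_s=0.5, end_s=2.5, output_tokens=101)
example_ttft = ttft(timing)
example_tpot = tpot(timing)
example_cost = cost_per_million_tokens(gpu_price_per_hour=3.6, tokens_per_second=1000.0)
```

Single-stream speed is roughly 1 / TPOT, while cost per million tokens depends on aggregate throughput under concurrency, which is why the two can point in different directions.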

Let us know in the comments.


u/Temporary-Owl1725 — 3 days ago

▲ 2 r/MachineLearning+11 crossposts

A single mathematical equation is proving that AGI needs neither a GPU nor an LLM

In 1906, Markov discovered an equation for predicting letters.

In 2026, someone finally tested whether the SAME equation, without a single extra line, can learn bytes, words, decisions, causality, planning, attention, and memory.

Spoiler: it can. And it runs on any laptop. 950 lines.

The problem the project tackles:

The industry is spending billions on GPUs to squeeze paragraphs out of ever-larger models. And nobody stopped to ask:

"What if intelligence lies not in the size of the model, but in the NUMBER OF LEVELS that a single equation can process?"

That is exactly what MCR tested, and the results are surprising for a 950-line project.

The MCR equation is simple:

MCR(nível).aprender(A, B) → learns that A leads to B

MCR(nível).predizer(A) → given A, what is the next state?

(In these identifiers, nível = level, aprender = learn, predizer = predict.)

Yes, it's Markov. But the trick isn't the equation; it's that it works IDENTICALLY at 10 different levels:

• Byte → byte
• Word → word
• Decision → action
• Causality (state → state)
• Q-Learning (reinforcement learning)
• Hierarchical planning
• Selective attention with 4 signals
• Persistent memory (SQLite)
• Self-modification of parameters
• Automatic genesis of new modules
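To make the learn/predict idea concrete, here is a generic first-order Markov transition counter applied unchanged to bytes and to words. This is an illustrative sketch only, not the code from the MCR repo; the class and method names are hypothetical English stand-ins.

```python
from collections import Counter, defaultdict


class TransitionTable:
    """Counts observed A -> B transitions; states can be any hashable value."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def learn(self, a, b):
        """Record one observation that state a was followed by state b."""
        self.counts[a][b] += 1

    def predict(self, a):
        """Return (most frequent next state, its empirical probability)."""
        followers = self.counts.get(a)
        if not followers:
            return None, 0.0
        state, n = followers.most_common(1)[0]
        return state, n / sum(followers.values())


def train_on_sequence(table, sequence):
    """Feed every adjacent pair of a sequence into the table."""
    for a, b in zip(sequence, sequence[1:]):
        table.learn(a, b)


# Same class, two "levels": raw bytes and whitespace-separated words.
byte_level = TransitionTable()
train_on_sequence(byte_level, b"abracadabra")

word_level = TransitionTable()
train_on_sequence(word_level, "the cat sat on the mat the cat ran".split())

next_byte, byte_conf = byte_level.predict(ord("a"))
next_word, word_conf = word_level.predict("the")
```

The only thing that changes between levels here is what counts as a "state"; the empirical probability doubles as a simple confidence score.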

Universal answer: the distribution decides confidence, the tools learn.

Zero GPU. Zero LLM. Zero external dependencies. Just the Equation.

This isn't philosophy. There are 13 sections of formal mathematics, including the Level Invariance Theorem (which proves that the equation is always the same, with only the meaning of "state" changing):

→ Paper (EN): https://github.com/Player-Kheltz/MCR/blob/main/docs/MCR_WHITEPAPER_EN.md

→ Paper (PT): https://github.com/Player-Kheltz/MCR/blob/main/docs/MCR_WHITEPAPER_PT.md

And the code, which you can clone and run in 10 seconds:

→ GitHub: https://github.com/Player-Kheltz/MCR

The mind-bending implication. Think about it:

If ONE equation, 40 lines of Python, learns at 10 different levels of abstraction, from raw bytes to planning...

...then maybe intelligence isn't about different architectures for each problem.

Maybe it's about DISCOVERING THE RIGHT LEVELS of abstraction and applying the SAME thing at all of them.

The industry is in a race to see who builds the biggest model.

Maybe the race should be: who discovers the NEXT level.

The paper has the formal proof. The code has the demonstration.

Criticism is welcome.

u/Player-Kheltz — 4 days ago