r/MachineLearning


Looking for arXiv endorsement + sharing a preprint on homeostatic cognitive architecture for AI companions [R]

Hey r/ML — I just posted a preprint on SSRN for PHI // DRIFT, a cognitive architecture that gives an AI companion persistent internal state, salience-weighted memory retrieval, and a falsifiable continuity metric (PEDI). Ablation testing confirmed the DMU memory system injects 14.8% more context per prompt than cosine-only RAG — a structural finding that holds on CPU-only consumer hardware.

Also looking for an arXiv endorsement for cs.AI if anyone's willing. Happy to answer questions on the architecture.

Here is my abstract:

I present PHI // DRIFT, a cognitive middleware architecture designed to address a fundamental limitation in current large language model deployments: the absence of persistent internal state that evolves across interactions with a specific user over time. Existing systems process each interaction as an isolated probabilistic event — competent, but stateless. We describe this gap as talking to the statistics of a mind. DRIFT introduces five architectural contributions: the Decision Memory Unit (DMU), the Persistence-Embodiment-Drift Index (PEDI), a homeostatic regulation layer, a security defense layer, and a logic chain reasoning trace system. All development and evaluation were conducted on consumer hardware with no GPU acceleration. Ablation testing confirmed DMU re-ranking injects 14.8% more context per prompt than cosine-only retrieval. Live stress testing at 50-thread concurrency produced 100% success rate with no breaking point found. We do not claim PHI // DRIFT is conscious. We claim it produces measurably more continuous, contextually coherent output than stateless alternatives — and we provide a framework for testing that claim.

reddit.com
u/Interesting_Time6301 — 5 hours ago

Could ML be used to automate C-suite organizational duties? [D]

We often see worry from workers that ML techniques will either fully replace them, or jostle them violently economically such that their earnings and well-being are impacted. Concurrently, many tech companies resist unionization/"guild" efforts to protect the careers of technically capable employees, software engineers in particular. And cynically we might suspect a trend towards "corporatism" as companies grow larger, even if they're initially established by well-meaning, competent, and technical-minded people.

While I acknowledge a tongue-in-cheek quality to this discussion - versus efforts to automate software engineering, where is the SoTA on automating logistical decisions made by CEOs/CFOs/CTOs?

(I'm envisioning, idealistically, a "cooperative" or guild formed by equal contributors of technical content where the business itself is generically managed in a decentralized way, specifically where ML facilitates centralized decision making when it becomes strictly necessary. Frankly, a core advantage of this would be an ideal robustness to "adversarial" takeover of the cooperative, if the ML agent were explicitly pre-designed both to 1) prioritize the productivity and welfare of the employees and 2) resist ML-space adversarial attacks trying to falsely incentivize it towards "selling out."

The human benefit to the employees here would be decision-making free of "The Mask of Sanity"-type behavioral failings, but perhaps also the facilitation of direct-democracy-at-scale. You could imagine teams electing representatives at only the scales they're comfortable with, and CEO-Bot managing the rest as a balanced-rewards problem.)

Intuitively, some might suspect C-suite employees are not meritorious, but I guess the question is, what functions do they perform that resist automation? Schmoozing, elicitation during funding rounds, having a keen eye to the business environment?

As silly as this is, humor me: the standard IMO wouldn't be to produce an ideal CEO, just a CEO-Bot that's less mercurial or self-centered than a CEO humans would prefer to avoid.

So: what concerns jump out at you? Biased hiring data, adversarial attacks, lack of capacity in XYZ direction?

reddit.com
u/RepresentativeBee600 — 6 hours ago

COLM 2026 Reviews Discussion [D]

Didn't see one, so I wanted to make one myself. Reviews are already out; curious what everyone thinks about their quality. I've heard it's a mixed bag, with a concerning number of AI-generated reviews for some people.

reddit.com
u/RandomMan0880 — 9 hours ago

NuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction (self-hostable) [P]

Disclaimer: I work for Numind, the company behind this open-weight model

We just released a 4B model based on Qwen3.5-4B, under Apache-2.0 license. The goal is to make information extraction from complex documents more practical with an open model: PDFs, screenshots, forms, tables, receipts, invoices, multi-page documents, and other visually structured inputs.

Try it: we have a Hugging Face Space that is completely free (you don't even have to sign up): https://huggingface.co/spaces/numind/NuExtract3

If you ever used NuMarkdown, NuExtract3 is the successor.

There are some examples to guide you. Feel free to re-use this model for any task.

https://preview.redd.it/pm2xbooyxn2h1.png?width=1672&format=png&auto=webp&s=1a8a7b262190c8325159496dae98c3d2dfab493c

https://preview.redd.it/b5z7ylfzxn2h1.png?width=1758&format=png&auto=webp&s=a07b3abd6e5065c2635de047bdf154357f903e4c

A few things it is designed for:

  • converting document images to Markdown
  • extracting structured data from documents using a target JSON template
  • handling tables, forms, and layout-heavy pages
  • working with both text and visual document inputs
  • serving as a local/open-weight alternative for document extraction pipelines
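
To make the template-based extraction item above more concrete, here is a minimal sketch of what a target template and a shape check on the model's JSON output could look like. The field names, the empty-string placeholders, and the `matches_template` helper are all hypothetical illustrations; they are not taken from the NuExtract3 documentation, so check the model card for the actual template syntax.

```python
import json

# Hypothetical target template: keys name the fields to extract,
# empty values are placeholders (not NuExtract3's documented type syntax).
TEMPLATE = {
    "invoice_number": "",
    "total": "",
    "line_items": [{"description": "", "amount": ""}],
}

def matches_template(value, template):
    """Return True if `value` has the same nested shape as `template`."""
    if isinstance(template, dict):
        return (
            isinstance(value, dict)
            and set(value) == set(template)
            and all(matches_template(value[k], template[k]) for k in template)
        )
    if isinstance(template, list):
        # A list template holds one example element describing every item.
        return isinstance(value, list) and all(
            matches_template(item, template[0]) for item in value
        )
    # Leaves: accept any scalar, including null for a missing field.
    return value is None or isinstance(value, (str, int, float, bool))

# Stand-in for raw model output; in practice this comes from the inference server.
raw_output = (
    '{"invoice_number": "INV-001", "total": "42.00", '
    '"line_items": [{"description": "Widget", "amount": "42.00"}]}'
)
parsed = json.loads(raw_output)
print(matches_template(parsed, TEMPLATE))
```

A check like this is only a guard against malformed output in a pipeline; it says nothing about whether the extracted values are correct.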

It was trained on a node of 8xH100 for 3 days, with as much context as we could fit, so it should perform fairly well even on long documents. For Markdown, we'd still recommend going page by page for the best results and inference speed, since you can parallelize better this way.

It's very easy to self-host, since we provide fairly extensive documentation, Safetensors, GGUF and MLX weights. With as little as 4GB of VRAM, you should be good to go. We provide multiple quantizations (GPTQ, W8A8, FP8, Q4, Q6...) so you should be able to run it anywhere.

We mostly tested with vLLM, SGLang, and llama.cpp.

We have a blog post and a pretty decent model card:

I'm currently writing a paper on this model, so I'll post it as soon as it's accepted. It's not on arXiv yet, as it has been submitted to a peer-reviewed journal/conference.

I'll try to answer as many questions as possible if you have any. We would really appreciate feedback from the community.

We also have a discord if you're interested
https://discord.com/invite/3tsEtJNCDe

reddit.com
u/Gailenstorm — 19 hours ago

One thing that's been bothering me lately: benchmark performance often tells me almost nothing about whether a workflow will survive production usage. [D]

I've seen systems score well internally and then immediately fail under:

  • ambiguous user intent
  • messy real-world context
  • contradictory instructions
  • long-running sessions

Feels like evaluation still heavily rewards clean-task optimization instead of behavioral robustness.

What are people using beyond standard eval pipelines?

reddit.com
u/Bladerunner_7_ — 23 hours ago

Novel Problems in VLA [R]

I'm currently doing a research internship, and my supervisor is constantly pushing me to come up with a novel idea. I've read about 15-20 papers on VLA, and I think most directions are saturated. I thought about an equivariant VLA based on equivariant CNNs (published in 2016) and successfully implemented it, then found that someone had already published that too. Do you guys have any advice on what I should do next? Any suggestions are welcome!

reddit.com
u/No_Mixture5766 — 1 day ago

Can liveness detection models generalise to synthetic media generation techniques they were never trained on? [D]

Most liveness detection systems in production today were built around a threat model where the attacker is submitting a static image or a basic replay video. The generation quality of current synthetic media is categorically different from what those training datasets captured.

The question I keep coming back to is whether a model trained on historical deepfake samples can generalise to generation techniques that did not exist when the training data was assembled. And if the answer is no, what does the update cycle look like for vendors claiming deepfake detection as a core capability.

I asked two identity verification vendors this directly and got answers that sounded confident without addressing the temporal gap between training data and current generation quality.

reddit.com
u/Unique_Buy_3905 — 1 day ago

Do VLMs in production still use fixed-patch ViTs for their vision capabilities? [D]

The research community has provided (already for some time) seemingly more efficient and effective tokenizations for vision. Do we have any hint on whether non-fixed-patches tokenization is being applied on the big player models?

I imagine not, and I'm trying to think why:

- marginal gains?

- pipelines needing a fixed number of tokens per image upfront for efficiency reasons (or even harder limitations)?

- scaling laws are not well understood for input-adaptive patching therefore big players do not bet on this?

or am I simply totally wrong and under the hood all the big players are doing dynamic tokenization for vision?
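
For reference on the fixed-token-budget point above, a fixed-patch ViT's token count is known from the image size alone. The sketch below assumes a 16-pixel patch and padding up to a whole number of patches; both are illustrative assumptions, not details of any particular production model.

```python
import math

def fixed_patch_tokens(height, width, patch=16):
    """Token count for a fixed-patch ViT: one token per non-overlapping patch."""
    return math.ceil(height / patch) * math.ceil(width / patch)

for h, w in [(224, 224), (448, 448), (896, 896)]:
    print(h, w, fixed_patch_tokens(h, w))
```

The count depends only on resolution, never on content, which is what makes batching and memory planning simple and what an input-adaptive tokenizer would give up.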

reddit.com
u/howtorewriteaname — 1 day ago

Live Human Detector on Outbound Phone Calls [R]

Goal
To save humans from wasting time sitting in call centre queues waiting to be answered.

To have a tool listen in on the audio stream of a live call, post IVR navigation, to determine whether the call has transitioned out of the queue and to a live person.

 Requirements

The tool must be able to classify the audio within a contextual window of under 1-2 seconds, with as high a confidence level as possible.

This is not a typical AMD (answering machine detection) tool; we are not just detecting machine audio vs human speech.

 

Assumed Challenges

  1. It may be difficult to distinguish between a pre-recorded RVA (Recorded Voice Announcement) and a human speaking.  RVAs are typically professionally recorded with distinct pitches and emotional cues, and have clean audio with no background noise or silence before and after the message.  This is not always the case, especially if announcements are recorded in house by the general staff.
  2. When a call is transitioning and 'Answered', there is usually a distinct soft click and/or some background noise before the agent starts speaking.  This silence period, whilst a good indication that a call has been answered, could be confused with quiet periods between music or RVA announcements in the queue.
  3. It may be difficult to determine if we have been answered by Voicemail - whilst there is usually a beep at the end, the message itself would also start with a silence period followed by audio sounding similar to an RVA.
  4. A single short beep tone could mean Voicemail or Answered, or it could mean the call is being recorded.
  5. Identifying that we are in a queue based on TTS audio may become difficult as TTS engines become more sophisticated.
  6. Telephony audio (G.711 A-law) is limited to the 300–3400 Hz frequency band, sampled at 8000 Hz, at 64 kbit/s.

 

Approach

To train, via machine learning on labelled data, an audio classification application that analyses the acoustics, waveform, or spectrogram (via Fast Fourier Transform) of the audio stream.
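
As a rough illustration of the spectrogram route described in this approach, the sketch below turns one second of 8 kHz audio into a log-magnitude spectrogram with a short-time FFT. The frame length (256 samples), hop (128 samples), and the synthetic 1 kHz test tone are assumptions chosen for the example, not values from the post; a real pipeline would feed frames like these to a classifier.

```python
import numpy as np

SAMPLE_RATE = 8000  # Hz, the telephony sampling rate mentioned above

def log_spectrogram(samples, frame_len=256, hop=128):
    """Short-time FFT magnitudes in dB; shape is (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(samples) - frame_len) // hop
    frames = np.stack([
        samples[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    return 20.0 * np.log10(magnitudes + 1e-8)  # small offset avoids log(0)

# One second of a synthetic 1 kHz tone stands in for a slice of call audio.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 1000 * t)

spec = log_spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())
print(spec.shape, peak_bin * SAMPLE_RATE / 256)  # strongest frequency in Hz
```

With these settings each frame covers 32 ms, so a 1-2 second decision window would span several dozen frames.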

At this stage I do not want to use STT to determine the phase or label, although this will likely be added at a later stage as an additional layer in the pipeline to increase confidence in some of these labels, such as RVA/TTS/Voicemail/Call Screening.

  

Phases and labels

  • Queuing: Music, TTS, RVA (Recorded Voice Announcement)
  • Transitioning: Ringback, Answered, Machine Beep
  • Connected: Human, Fax, Voicemail, Call Screening
  • Disconnected: Engaged Tone

 

References

https://www.mdpi.com/2076-3417/12/7/3293 - YOHO (You Only Hear Once)
https://www.vicidial.org/VICIDIALforum/viewtopic.php?t=42330

https://huggingface.co/learn/audio-course/chapter2/audio_classification_pipeline

https://www.youtube.com/watch?v=m3XbqfIij_Y&t=32s

https://google-ai-edge.github.io/mediapipe-samples-web/#/audio/audio_classifier

https://scikit-learn.org/stable/machine_learning_map.html

https://arxiv.org/pdf/2410.08235

 

Question

Seeking assistance on where to actually start.  Yes, I will be relying heavily on Claude Code to build this, so apologies in advance.

What is the best framework / algorithm / approach to start solving this problem?  I have seen existing frameworks like YAMNet work well and fast on classifying audio; however, others suggest Whisper and ASR.

What is the best way of tagging or labelling data?  Do I label existing full-length recordings with start/stop timestamps for each label, or do I need to split each label into its own file, resulting in a loss of context?
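
One way to picture the timestamp option: keep each recording whole, store `(start, end, label)` annotations next to it, and expand them into per-frame labels at training time. The sketch below assumes one-second frames and a made-up 8-second annotation; the `frame_labels` helper and the segment values are hypothetical.

```python
def frame_labels(segments, total_seconds, frame_seconds=1.0, default="Unlabelled"):
    """Expand (start, end, label) annotations into one label per fixed-length frame.

    A frame takes the label of the segment that covers its midpoint.
    """
    n_frames = int(total_seconds // frame_seconds)
    labels = []
    for i in range(n_frames):
        midpoint = (i + 0.5) * frame_seconds
        label = default
        for start, end, name in segments:
            if start <= midpoint < end:
                label = name
                break
        labels.append(label)
    return labels

# Hypothetical annotation of one 8-second recording (times in seconds).
segments = [
    (0.0, 3.0, "Music"),
    (3.0, 5.0, "RVA"),
    (5.0, 6.0, "Answered"),
    (6.0, 8.0, "Human"),
]
print(frame_labels(segments, total_seconds=8.0))
```

Because the recording stays intact, neighbouring frames remain available as context, which is the part that splitting each label into its own file would lose.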

Are there obvious existing datasets I should be using for some of my labels?

reddit.com
u/Bucky102 — 24 hours ago

Scaling LLMs horizontally: hidden-state coupling without weight modification [R]

Residual Coupling (RC) connects frozen language models in parallel using small, learned linear bridge projections. These bridges read hidden states from one model and inject additive updates into the residual stream of another at intermediate layers. In bilateral setups, simultaneous return bridges form a feedback loop that stabilizes both streams without altering base weights.

This architecture establishes a two-step paradigm where base models function as memorizers, while lightweight linear bridges handle cross-domain generalization. Constraining the bridges to purely linear maps prevents overfitting because they can only map existing geometric relationships between the frozen representation spaces. As the bridges are optimized against ground-truth target data, they have no incentive to map ungrounded features such as individual models' hallucinations.

Keeping the base weights completely frozen eliminates catastrophic forgetting. The system maintains operational closure, transforming inputs through its existing structure rather than changing to accommodate them.
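
The coupling step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the hidden sizes, the random stand-in hidden states, and the `couple` helper are hypothetical, and the bridge gates mentioned in the results are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_a, d_b, seq_len = 8, 6, 4  # hypothetical hidden sizes and sequence length

# Stand-ins for hidden states of two frozen models at some intermediate layer.
h_a = rng.normal(size=(seq_len, d_a))
h_b = rng.normal(size=(seq_len, d_b))

# The only trainable parameters: one linear map per direction (bilateral setup).
W_a_to_b = rng.normal(scale=0.01, size=(d_a, d_b))
W_b_to_a = rng.normal(scale=0.01, size=(d_b, d_a))

def couple(h_target, h_source, bridge):
    """Add a linear projection of the source hidden state to the target residual stream."""
    return h_target + h_source @ bridge

# Both updates read the pre-update states, so the two directions are simultaneous.
h_b_new = couple(h_b, h_a, W_a_to_b)
h_a_new = couple(h_a, h_b, W_b_to_a)
print(h_a_new.shape, h_b_new.shape)
```

A zero bridge leaves the target stream unchanged, which is one way to see why the frozen base models keep their original behaviour until the bridges are trained.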

Evaluating bilateral RC against Mixture-of-Experts (MoE) routing across the same frozen models shows these results:

  • Medical (3-model): Reduces perplexity to 11.02, compared to 56.80 for MoE and 57.08 for the frozen baseline. This represents an 80.7% reduction relative to the frozen baseline.
  • TruthfulQA Health (MC1): Improves accuracy by 9.1 percentage points over the baseline. Independent models have uncorrelated hallucinations, allowing the bridge gates to amplify consistent cross-model updates while suppressing individual errors.
  • Coding Test: CodeGPT-small-py and GPT-2 use different tokenizers, causing a 7-million baseline perplexity on mismatched text. MoE reaches 878, but RC achieves 5.91 by reading hidden states before the output projection collapses.

This framework introduces a horizontal scaling axis for multi-model systems, moving beyond vertical scaling via larger monolithic models. Latency remains bounded by the slowest single model. Specialists can be added or removed without retraining the remaining system. In some scenarios, this architecture could replace multi-turn text prompting in agentic workflows with a single parallel forward pass, allowing models and/or bridges to run on separate nodes or edge devices without a central bottleneck. By decoupling memorization from relational alignment, RC bridges provide a framework for scaling multi-model systems and offer a path toward native multi-modal integration.

Paper: https://ssrn.com/abstract=6746521

Code: https://github.com/pfekin/residual-coupling/

i.redd.it
u/kertara — 1 day ago

OpenAI claims a general-purpose reasoning model found a counterexample to Erdős's unit-distance bound [D]

OpenAI posted a math result today claiming that one of its general-purpose reasoning models found a construction disproving the conjectured n^{1+O(1/log log n)} upper bound in Erdős’s planar unit-distance problem.

Announcement:

https://openai.com/index/model-disproves-discrete-geometry-conjecture/

Proof PDF:

https://cdn.openai.com/pdf/74c24085-19b0-4534-9c90-465b8e29ad73/unit-distance-proof.pdf

Abridged reasoning writeup:

https://cdn.openai.com/pdf/1625eff6-5ac1-40d8-b1db-5d5cf925de8b/unit-distance-cot.pdf

The mathematical claim, as I understand it, is that there are finite planar point sets with more than n^{1+δ} unit distances for some fixed δ > 0 and infinitely many n. That would rule out the expected near-linear upper bound, though it does not determine the true asymptotic growth rate.
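
To make the counted quantity concrete, here is a small brute-force sketch that counts unit distances in a point set. The square integer grid used as input is only a familiar baseline chosen for illustration; it is not the construction from the proof PDF. A k×k grid with n = k² points has 2k(k-1) unit distances, which grows only linearly in n.

```python
from itertools import combinations

def count_unit_distances(points):
    """Count unordered pairs of integer-coordinate points at distance exactly 1."""
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 == 1
    )

for k in (2, 3, 10):
    grid = [(x, y) for x in range(k) for y in range(k)]
    print(len(grid), count_unit_distances(grid))
```

Squared distances are compared as integers, so there is no floating-point tolerance to tune.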

What seems especially relevant for this subreddit is the process claim: OpenAI says the solution was produced by a general-purpose reasoning model, then checked by an AI grading pipeline and reviewed/reworked by mathematicians. The proof PDF also includes the original prompt given to the model, but not the full experimental details: no model name, sampling setup, number of attempts, compute budget, hidden system prompt, or full grading pipeline.

Curious how people here read this as an ML result. Is this best viewed as evidence of frontier models doing genuine autonomous research, or as a cherry-picked but still important sample from a large search process? What kind of disclosure would you want before treating this as a reproducible AI-for-math milestone?

reddit.com
u/NutInBobby — 2 days ago

using .npy dataset with 3D models [R]

Hello guys, I am trying to reach 90% accuracy on the ADNI dataset, but it keeps getting stuck at 55%. Any tips to improve the results?

reddit.com
u/LahmeriMohamed — 1 day ago

How competitive are PhD admissions currently [D]

Hi,

How hard is it currently to get a PhD position in machine learning? What are the requirements to get into a decent mid-tier program (i.e., one that publishes regularly in respected journals and whose work gets read by some people)? How does it differ across regions, e.g. the US, Europe, etc.?

I am about to finish my master's and am wondering if I need to squeeze in an unpaid guided research project to extend my network.

reddit.com
u/strammerrammer — 2 days ago

Lisbon Machine Learning School (LxMLS 2026) [D]

Hi, did anyone apply to it or attend it previously?
How was the experience?

I got accepted but without a scholarship. Is it worth going self-sponsored?

reddit.com
u/Icy-Solid-4159 — 1 day ago

Should I accept a PhD offer in NeuroAI [D]

Hi everyone.

I am a recent CS grad and I have received a PhD offer from a school in the States. However, I am deeply confused about whether I should accept it.

My hesitation comes from the interdisciplinary nature of the program. It will be jointly supervised by two professors, one from the biomedical domain and one from the ML domain. I always wanted to work on the foundational aspects of AI and to publish in A* AI conferences, so I am not sure if it is the right choice.

The other option for me is to wait and work on enhancing my profile: get another paper or two published in respected venues and apply again. I have a decent profile, with a couple of internships and research papers, and a >90% CGPA.

Moreover, I believe I can do foundational work much better than applied work, so my biggest fear is that I accept the offer and later find out that the AI part is trivial and minimal. That might lead to frustration and lower productivity.

What should I do in this case? If anyone has been part of such an interdisciplinary program, please do share your experience.

Thanks!

reddit.com
u/ProfessionalDue369 — 2 days ago

Looking for real-world comparisons between WALL OSS, pi0.6, and OpenVLA [D]

I am choosing a baseline for a real manipulation stack and trying not to lose a month on setup that someone here has already done.

Shortlist is OpenVLA, pi0.6, and WALL OSS from X Square Robot. OpenVLA is still the easiest reference point with lots of reproductions. pi0.6 looks strong from recent public updates but I have not seen many fully transparent ablations. WALL OSS looks promising in LeRobot and I can run inference on UR5 plus parallel gripper without issues, around 70 ms on a 4090 in my local setup.

What I need is less paper score discussion and more deployment reality.
If you have run a controlled comparison on LIBERO or ManipArena style tasks, I would really value failure modes and data budget details.
If you have fine-tuned any of these on real hardware, which one was least painful in terms of demonstration volume?
If you run continuous updates, how often do you retrain, and how bad is drift over a few weeks?

I can post my own table once I finish, but if there is existing work I should read first that would save a lot of duplicated effort.

reddit.com
u/Dense-Sir-6707 — 2 days ago

Masked Diffusion Language Models are Strong and Steerable Text-Based World Models for Agentic RL [R]

Autoregressive LLM world models factorize next-state generation left-to-right, preventing them from conditioning on globally interdependent anchors (tool schemas, trailing status fields, expected outcomes) and yielding prefix-consistent but globally incoherent rollouts. MDLMs' any-order denoising objective sidesteps this by learning every conditional direction from the same training signal. Empirically, fine-tuned MDLMs (SDAR-8B, WeDLM-8B) surpass AR baselines up to 4x their total parameter count on BLEU-1, ROUGE-L, and MAUVE across in- and out-of-domain splits, with lower Self-BLEU and higher Distinct-N confirming reduced prefix mode collapse. GRPO training on MDLM-generated rollouts shows up to +15% absolute task-success gains over AR generated training on held-out ScienceWorld, ALFWorld, and AppWorld across 1.2B–7B backbones (LFM2.5, Qwen3, Mistral) in a zero-shot transfer setting.
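
For readers unfamiliar with the diversity metrics named above, Distinct-N is commonly computed as the ratio of unique n-grams to total n-grams across a set of generations. The sketch below uses whitespace tokenization and made-up rollout strings; the paper's exact tokenization and aggregation may differ.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams over whitespace-tokenized texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i : i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Hypothetical rollouts: one set repeats itself, the other does not.
collapsed = ["the door is locked", "the door is locked", "the door is locked"]
varied = ["the door is locked", "you pick up the key", "the lamp turns on"]
print(distinct_n(collapsed), distinct_n(varied))
```

Higher values mean less repetition across rollouts, which is why the metric is read as evidence against mode collapse.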

zenodo.org
u/MegixistAlt — 2 days ago

Machine Learning on Spherical Manifold [R]

Hi, I'm interested in geometric deep learning (due to Michael M. Bronstein's book and Maurice Weiler's PhD thesis), and in order not to write projects to nowhere, I decided to keep a technical blog. I started with a short note about machine learning on spherical manifolds, but it's a pretty simple thing.

Is there a list of open problems in GDL? Or, if any of you are working in this direction, can you suggest which GDL problems are relevant to the research community?

eesuck1.github.io
u/eesuck0 — 3 days ago

Open-source GPU observability with workload attribution - maps DCGM metrics to pods/jobs/teams (K8s + Slurm, OTLP)

A common pain point in multi-team GPU clusters: DCGM tells you a node is at 90% utilization. It doesn't tell you which team, pod, or job is driving that.

We open-sourced l9gpu to solve this. It's a node-level agent that emits GPU metrics via OTLP with full workload attribution baked in.

Kubernetes: maps metrics to pod, namespace, and deployment

Slurm: maps to job, user, and partition
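
Conceptually, the attribution step is a join between per-GPU metrics and a device-to-workload map. The sketch below shows that join on made-up data; the GPU IDs, namespaces, pod names, and the `utilization_by` helper are hypothetical and are not l9gpu's actual data model or API.

```python
from collections import defaultdict

# Hypothetical per-GPU utilization samples, as a node-level exporter might report them.
gpu_util = {"GPU-aaa": 90.0, "GPU-bbb": 35.0, "GPU-ccc": 60.0}

# Hypothetical device-to-workload map (a real agent would get this from the scheduler).
allocations = {
    "GPU-aaa": {"namespace": "team-vision", "pod": "train-0"},
    "GPU-bbb": {"namespace": "team-nlp", "pod": "serve-1"},
    "GPU-ccc": {"namespace": "team-vision", "pod": "train-1"},
}

def utilization_by(label, util, alloc):
    """Average GPU utilization grouped by a workload label such as 'namespace'."""
    groups = defaultdict(list)
    for gpu_id, value in util.items():
        owner = alloc.get(gpu_id, {}).get(label, "unattributed")
        groups[owner].append(value)
    return {owner: sum(vals) / len(vals) for owner, vals in groups.items()}

print(utilization_by("namespace", gpu_util, allocations))
```

GPUs with no matching allocation fall into an "unattributed" bucket, which is usually the first number worth watching in a chargeback report.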

What's included:

- NVIDIA, AMD MI300X, Intel Gaudi support

- LLM inference metrics (vLLM, SGLang, TGI)

- Vendor-neutral OTLP export

- Pre-built Grafana dashboards

- 17 Prometheus alert rules

- MIT licensed, derived from Meta's gcm project

https://github.com/last9/gpu-telemetry

How are others handling GPU cost attribution and chargeback in shared clusters?

u/bakibab — 2 days ago