u/MiroMindAI — reddlx

MiroThinker-1.7, an open-weight deep research agent (Qwen3 MoE base) — mini is 30B/3B active, curious what tok/s people get on consumer hardware

As usual, disclosure first: I'm on the team that built this.
Our MiroThinker-1.7-deepresearch and 1.7-mini-deepresearch APIs went live. MiroThinker-1.7 is a deep research agent built on Qwen3 MoE; mini is 30B total, 3B active. Weights on HuggingFace: huggingface.co/miromind-ai/MiroThinker-1.7
Posting here because the open-weight agent conversation mostly happens in this sub and I'd genuinely like feedback. Commenting and discussing on Reddit has gotten me some, but not enough. I also tried loading a GitHub app on our DC server to get notified of PRs faster, but realized there were not enough of those either, and one was a promo.
Benchmarks (arXiv paper, Table 1; a subset picked to fit a table, full comparison in the paper):

Model	BrowseComp	BrowseComp-ZH	HLE-Text	GAIA	xbench-DS	SEAL-0
MiroThinker-1.7	74.0	75.3	42.9	82.7	62.0	53.0
MiroThinker-1.7-mini (30B/3B active)	67.9	72.3	36.4	80.3	57.2	48.2
Qwen3.5-397B	78.6	70.3	48.3	–	–	46.9
DeepSeek-V3.2	67.6	65.0	40.8	–	–	49.5
GPT-5 (closed, for context)	54.9	65.0	35.2	76.4	75.0	51.4

Two things I'd specifically want this sub to push back on:

- The mini model is only 3B active params. Anyone tried running it locally yet? Curious what tok/s people are getting on consumer hardware.
- Our context management (sliding window K=5 + episode restarts) is opinionated. If you've run long-context agents locally you probably have opinions on this.

Paper: arXiv:2603.15726

See y'all in the comments, will reply tomorrow~ Please don't downvote me: for a genuinely good open-source project, we are NOT getting enough dev feedback, and Reddit has been a good source so far.

u/MiroMindAI — 6 days ago

▲ 10 r/huggingface+1 crossposts

MOOSE-Star (ICML 2026): 7B model + 108K-paper dataset for scientific hypothesis discovery — full collection on HF

Disclosure first: I work on community at MiroMind. One of our researchers just dropped the full MOOSE-Star collection on Hugging Face: a 7B model post-trained for scientific hypothesis discovery, plus the dataset behind it. Paper accepted at ICML 2026.

🤗 Collection: https://huggingface.co/collections/ZonglinY/moose-star-models-and-data

Inside:

- MS-IR-7B / MS-HC-7B / MS-7B: 7B models for inspiration retrieval, hypothesis composition, and joint use. Base: R1-Distilled-Qwen-7B.
- TOMATO-Star: 108,717 NCBI papers decomposed into (background, hypothesis, inspirations), every inspiration anchored to a real citation. Covers biology, chemistry, medicine, medical imaging, psychology, cognitive science. ~38,400 A800 GPU-hours of preprocessing went into building it.

Strict temporal split for evaluation: train ≤ Sep 2025, test = Oct 2025 (after the base model's knowledge cutoff).
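
For anyone who wants to reuse the split on their own corpus, here is a minimal sketch of a date-based split. It is not the MOOSE-Star pipeline: the record layout, the `published` field, and the `temporal_split` helper are all hypothetical, and only the Sep 2025 / Oct 2025 boundaries come from the post.

```python
from datetime import date

# Hypothetical records; only the publication date matters for the split.
papers = [
    {"id": "p1", "published": date(2025, 8, 14)},
    {"id": "p2", "published": date(2025, 9, 30)},
    {"id": "p3", "published": date(2025, 10, 2)},
    {"id": "p4", "published": date(2025, 11, 5)},
]

TRAIN_END = date(2025, 9, 30)    # train: published up to the end of Sep 2025
TEST_START = date(2025, 10, 1)   # test: published in Oct 2025
TEST_END = date(2025, 10, 31)


def temporal_split(records):
    """Split by publication date; records outside both windows are dropped."""
    train = [r for r in records if r["published"] <= TRAIN_END]
    test = [r for r in records if TEST_START <= r["published"] <= TEST_END]
    return train, test


train, test = temporal_split(papers)
```

Keeping the boundary as an explicit date makes it easy to check that nothing in the test window leaks into training.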

Numbers (Oct 2025 held-out papers, ~3K-paper inspiration corpus):

Model	IR accuracy
R1-Distilled-Qwen-7B (base)	28.4%
MS-7B (7B)	54.4%
GPT-5.4	51.5%
Gemini-3 Pro	54.9%
Claude Sonnet 4.6	45.0%
DeepSeek-R1	45.1%

MS-7B reaches 54.4% inspiration retrieval accuracy: it beats GPT-5.4 (51.5%), roughly matches Gemini-3 Pro (54.9%), and is up from 28.4% for the base model.

r/huggingface is where this release feels most at home — the entire collection lives on HF, and the temporal-split evaluation setup may be useful to anyone here training or fine-tuning on scientific corpora.
Questions welcome below. First author ZonglinYang is around to answer.

📄 https://arxiv.org/abs/2603.03756
💻 https://github.com/ZonglinY/MOOSE-Star

u/MiroMindAI — 9 days ago

▲ 6 r/AutoGPT+1 crossposts

How do you handle agents that need 200+ tool calls per task? We tried one approach, looking for critique

Working on agent chains here, so this is the first sub I wanted to bring this to. Disclosure: I work at MiroMind and this is our checkpoint, but I am posting because the design tradeoff is the interesting part, not the brand.

The problem we kept hitting on deep-research chains:

- Long horizons. Real research tasks routinely cross 100+ tool calls. Most agent frameworks degrade hard past 50 because of context drift and tool-result noise.
- Disconnects. A 20-minute run that dies on socket reset is an expensive way to learn your retry logic is broken.
- Trace amnesia. You finish a run, the answer is wrong, and you have no way to see at which tool call the chain went sideways.

What we tried with MiroThinker 1.7 deep-research:

- A single run can execute up to 300 tool interactions within a 256K context window, using recency-based retention (only the latest K tool results stay in-context). Not "everything must live in one fragile HTTP session."
- Submit / resume / cancel are first-class: the agent keeps executing on our side, and you reconnect to it.
- Every step is logged. Useful when a chain fails on step 187 of 240 and you need to know why.

Numbers if useful for the architecture choice.
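
To make the recency-based retention idea concrete, here is a minimal sketch of one way it could look. It is not MiroThinker's actual code: the `Step` type, the `build_context` helper, and the placeholder string are hypothetical. It also assumes that older tool calls stay visible while only their results are dropped, and `k=5` is just an illustrative default.

```python
from dataclasses import dataclass


@dataclass
class Step:
    call: str    # what the agent asked a tool to do
    result: str  # raw tool output, usually the bulky part


OMITTED = "[tool result omitted]"


def build_context(steps, k=5):
    """Render the history, keeping full results only for the latest k steps."""
    cutoff = len(steps) - k  # steps with index >= cutoff keep their result
    lines = []
    for i, step in enumerate(steps):
        result = step.result if i >= cutoff else OMITTED
        lines.append(f"call: {step.call}\nresult: {result}")
    return "\n".join(lines)


history = [Step(f"search q{i}", f"page text {i}") for i in range(8)]
context = build_context(history, k=5)
```

With 8 steps and `k=5`, the first three results are replaced by the placeholder, so the prompt grows with the number of calls rather than with the size of every tool output.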

Things I am still unsure about:

- Whether the 300 tool-call ceiling is actually the right shape, or whether most of you cap chains way before that and use sub-agents instead
- How you handle resumable execution today: are you rolling your own job queue, or is there a pattern I am missing?
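
As a baseline for the resumable-execution question, here is a minimal sketch of the simplest roll-your-own version: checkpoint after every step and skip completed steps on rerun. It says nothing about how the MiroMind API works; `run_chain`, the JSON checkpoint layout, and the `fail_at` crash simulation are all hypothetical.

```python
import json
import os
import tempfile


def run_chain(steps, checkpoint_path, fail_at=None):
    """Run steps in order, persisting progress after each one.

    A rerun with the same checkpoint_path skips steps already completed.
    fail_at simulates a crash (for example a socket reset) before that step.
    """
    state = {"done": 0, "log": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    for i in range(state["done"], len(steps)):
        if fail_at is not None and i == fail_at:
            raise ConnectionError(f"simulated disconnect before step {i}")
        state["log"].append(steps[i]())
        state["done"] = i + 1
        # Write to a temp file, then rename, so a crash never leaves
        # a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["log"]


steps = [lambda i=i: f"result {i}" for i in range(5)]
path = os.path.join(tempfile.mkdtemp(), "chain.json")

try:
    run_chain(steps, path, fail_at=3)
except ConnectionError:
    pass  # the first attempt dies after completing steps 0 to 2

log = run_chain(steps, path)  # resumes at step 3
```

The per-step log doubles as a trace, so a wrong answer can be followed back to the step where the chain went sideways. A real job queue adds cancellation and concurrency on top of this.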

Would love war stories from anyone running long chains in production.
BTW, API launch pricing is 25 percent off, and pre-freeze billing means that if the platform fails you do not pay.


u/MiroMindAI — 11 days ago