u/ai-lover

r/pdf r/AIDeveloperNews r/LocalLLM r/OpenSourceeAI r/AIAGENTSNEWS r/machinelearningnews

Synthetic Sciences Releases OpenScience: An Open-Source, Model-Agnostic AI Workbench for Machine Learning, Biology, Physics, and Chemistry Research

Most "AI for science" tools are one vendor's model, wrapped in one company's idea of which research is allowed. That's a gatekeeping layer — and Synthetic Sciences just drew a clear line by open-sourcing the alternative.

They released OpenScience — an Apache-2.0 AI workbench that runs the full research loop (literature → hypothesis → code → experiment → analysis → write-up) on any model you point it at, with your own keys, on your own infrastructure.

Here's what's actually interesting:

→ Model-agnostic by design — Claude, GPT, Gemini, GLM, Kimi, DeepSeek, or your own fine-tune, switched from the model selector, per request

→ 250+ editable skills across training (DeepSpeed, PEFT, TRL), cheminformatics, and molecular + clinical biology — all readable and forkable

→ Scientific databases wired in as agent tools: UniProt, PDB, ChEMBL, arXiv, and ~30 more, queried directly

→ Runs on your infra — keys and data stay on your machine, and bring-your-own-key is free and never gated

→ Positioned as an open alternative to Anthropic's Claude Science, which is Claude-only and subscription-gated

Full analysis: https://www.marktechpost.com/2026/07/05/synthetic-sciences-releases-openscience-an-open-source-model-agnostic-ai-workbench-for-machine-learning-biology-physics-and-chemistry-research/

GitHub Repo: https://github.com/synthetic-sciences/openscience

u/ai-lover — 1 day ago

▲ 22 r/AIDeveloperNews

100+ Agentic AI and Agents Tutorials/Implementations and Notebooks [Colab Notebooks with Full Codes are included]

My dev team has been quietly building one of the most complete open-source AI agent and Agentic tutorial libraries on GitHub. It just crossed 2.7K stars.

Every notebook is a full, runnable implementation — not a snippet. Here's what's inside:

Multi-agent orchestration across every major framework→ LangGraph, CrewAI, AutoGen, SmolAgents, Google ADK, CAMEL, OpenAI Agents, Mistral Agents → Planning, tool-calling, sub-agents, and critique-driven refinement
Agent memory that actually persists→ Mem0, Memori, EverMem-style hierarchical memory with FAISS + SQLite → RL agents that learn which long-term memories to retrieve
Reasoning and decision loops→ Tree-of-Thoughts with beam search and pruning → Streaming decision agents with online replanning and mid-execution adaptation
The agent stack, end to end→ MCP servers, OAuth 2.1 for MCP, A2A communication protocols → Agentic RAG, knowledge graphs, cost-aware LLM routing
Beyond agents→ 30+ topic folders: RAG, RL, Deep Learning, Computer Vision, NLP, Robotics, Voice AI, LLM Evaluation, Federated Learning, and more

Every tutorial pairs a Colab notebook with a full written walkthrough, so you can read the theory and run the code in the same sitting.

985 commits. 585 forks. All free.

If you're building agents in 2026, this is a resource worth bookmarking.

Full Repo: https://github.com/MARKTECHPOST-AI-MEDIA-INC/AI-Agents-Projects-Tutorials

u/ai-lover — 1 day ago

▲ 36 r/machinelearningnews

NVIDIA HORIZON: A Hands-Free Agent that Evolves Git Worktrees and Hits 100% RTL Benchmark Completion

We covered a new paper from NVIDIA Research that moves agentic coding into hardware design.

HORIZON treats hardware design as repository-level code evolution. A human writes a Markdown harness. A bootstrap agent compiles it into a project pack, then a hands-free loop evolves an isolated git worktree until an acceptance gate passes.

Here's what's actually interesting:

Git is the interface, not bookkeeping

Each accepted repair becomes a commit. Git notes carry the evaluator verdict and reward. Rejected attempts are logged as negative examples. The repository history becomes the experience buffer.

The verifier harness is the real contract

The project pack bundles an executable evaluator, an acceptance predicate, a git policy, and domain skills. For RTL that means compile, simulate, coverage, and assertion checks. Any backbone can plug in.

The results

→ 100% completion across ChipBench, RTLLM-2.0, Verilog-Eval, and nine CVDP categories

→ 47.8% aggregate pass rate at the first iteration, before the loop closes the gap

→ 82 iterations for the hardest category (RTL code completion), its long tail the single largest cost

→ ~210M tokens total, ~91% cached input

→ GPT-5.3 as a fixed backbone, single-agent, hands-free

My takeaway: once executable feedback makes correctness converge, the bottleneck shifts to token efficiency and verification quality, not pass rate.

Full analysis: https://www.marktechpost.com/2026/07/04/nvidia-horizon-a-hands-free-agent-that-evolves-git-worktrees-and-hits-100-rtl-benchmark-completion/

Paper: https://arxiv.org/pdf/2606.28279

u/ai-lover — 3 days ago

▲ 30 r/machinelearningnews

NVIDIA AI Introduces ASPIRE: A Self-Improving Robotics Framework Reaching 31% Zero-Shot on LIBERO-Pro Long Tasks

Most robot-coding agents throw away everything they learn. Solve a task, discard the fix, start the next one cold — the agent on its 100th task is no smarter than on its first. NVIDIA's ASPIRE draws a clean line between that and an agent whose experience actually compounds.

They introduced ASPIRE (Agentic Skill Programming through Iterative Robot Exploration) — a code-as-policy system where a coding agent (Claude Code, Claude Opus 4.6, 1M-token context) writes and debugs its own robot programs against a fixed perception/planning/control API, and distills every validated fix into a reusable skill library, with no fixed perception-plan-execute pipeline anywhere in the loop.

Here's what's actually interesting:

→ The execution engine logs per-primitive multimodal traces — RGB keyframes, grasp candidates, object poses, motion plans, return status — so the agent localizes the failing primitive, not just the failed rollout

→ Validated fixes distill into a text skill library (failure signature + when-to-apply guard + repair sketch), not weights — and the agent is barred from reading sim ground truth, so the skills transfer to real hardware

→ Evolutionary search proposes K candidate programs per round, conditioned on surviving programs + residual failure traces — beyond single-trajectory tuning

→ LIBERO-Pro Object under perturbation: 98 vs 22 for CaP-Agent0

→ Robosuite bimanual handover: 92 vs 20 for CaP-Agent0

→ LIBERO-Pro Long zero-shot: 31 vs 4 for prior methods (skills learned on LIBERO-90, no test-time retries)

On a real bimanual robot with a different embodiment and API (OpenAI Codex GPT-5.5), transferred skills took soda-can lifting to 19/20 at ~10x fewer tokens, and drawer opening from 0/20 to 11/20.

The core bet: compound debugging experience into an explicit skill library, not the weights.

Full analysis: https://www.marktechpost.com/2026/07/03/nvidia-ai-introduces-aspire-a-self-improving-robotics-framework-reaching-31-zero-shot-on-libero-pro-long-tasks/

Paper: https://arxiv.org/pdf/2607.00272

Project page: https://research.nvidia.com/labs/gear/aspire/

u/ai-lover — 3 days ago

▲ 22 r/OpenSourceeAI+1 crossposts

Mistral AI Releases Leanstral 1.5: An Apache-2.0 Lean 4 Code Agent Model Solving 587 of 672 PutnamBench Problems

Most AI theorem proving is a language model generating a proof in one shot, with a verifier bolted on at the end to check it. That's autocomplete with a grader — and Mistral just drew a clear line between that and an actual proof agent.

They released Leanstral 1.5 — a 119B MoE with 6.5B active parameters, trained as a code agent that lives inside the Lean 4 compiler loop: propose a proof, read the compiler's goals and errors, refine, repeat until it compiles or the budget runs out. Verification isn't the eval here. It's the training signal.

Here's what's actually interesting:

→ Test-time scaling behaves like a dial: PutnamBench Pass@8 climbs 44 → 244 → 493 → 587 solved as the per-attempt token budget moves 50k → 200k → 1M → 4M

→ 587/672 on PutnamBench at ~$4 per problem, versus an estimated $300+ for Seed-Prover 1.5 high (a 10 H20-days-per-problem budget)

→ Saturates miniF2F: 100% on both validation and test sets

→ Two RL environments in training — a multiturn prover, and a raw-filesystem code agent that edits files, runs bash, and queries the Lean language server for live goals and types

→ Not just math: an Aeneas (Rust → Lean) pipeline flagged 11 genuine bugs across 57 repos, 5 previously unreported — including an integer overflow in datrs/varinteger when (value + 1) hits Std.U64.MAX

Apache 2.0 weights, free API endpoint

Full analysis: https://www.marktechpost.com/2026/07/03/mistral-ai-releases-leanstral-1-5-an-apache-2-0-lean-4-code-agent-model-solving-587-of-672-putnambench-problems/

Model weights: https://huggingface.co/mistralai/Leanstral-1.5-119B-A6B

Project: https://docs.mistral.ai/models/model-cards/leanstral-1-5

Technical Details: https://mistral.ai/news/leanstral-1-5/

u/ai-lover — 3 days ago

▲ 19 r/AIAGENTSNEWS+3 crossposts

Meet WebBrain: An Open-Source, Local-First AI Browser Agent That Reads Pages and Automates Tasks in Chrome and Firefox

WebBrain lives inside your browser and can run entirely on your own local model — no cloud, no account, no data leaving your machine.

Most "AI browser agents" are a chat box that pastes your page into someone else's server. That's not an agent that lives where you browse — and WebBrain draws a very clear line between the two.

It's an open-source (MIT), local-first browser agent for Chrome and Firefox. It runs inside your existing authenticated session, on a model you pick — so with llama.cpp or Ollama, nothing leaves your machine.

Here's what's actually interesting:

→ Two modes, cleanly separated. Ask reads the page (read-only, content scripts). Act clicks and types through the Chrome DevTools Protocol (chrome.debugger) — trusted input events that modern sites honor, reaching cross-origin iframes and shadow DOM.

→ UI-first by design. For anything that submits, sends, or buys, it drives the visible UI and refuses to hit REST/GraphQL endpoints directly. It starts read-only and asks before consequential actions.

→ Bring any model. llama.cpp, Ollama, LM Studio, vLLM — or OpenAI, Claude, Gemini, DeepSeek, Groq, OpenRouter. Recommended local: Qwen 3.6 35B (Qwen3.6-35B-A3B), which beat Gemma 4 on the project's screenshot benchmark.

→ Tuned for cost and privacy. Token-conscious screenshots, oldest-first context trimming, a dedicated vision model, 40+ tools (~20 in Compact mode). No telemetry. No accounts.

Full analysis: https://www.marktechpost.com/2026/07/02/meet-webbrain-an-open-source-local-first-ai-browser-agent-that-reads-pages-and-automates-tasks-in-chrome-and-firefox/

GitHub Repo: https://pxllnk.co/wdva98c

Chrome Extension: https://pxllnk.co/p4mn8

Firefox Add-on: https://pxllnk.co/m6k7c5w9

Portal: https://pxllnk.co/rlifl7h

u/ai-lover — 4 days ago

▲ 2 r/AIDeveloperNews

Meet Alibaba’s Page Agent: A JavaScript In-Page GUI Agent That Controls Web Interfaces With Natural Language Through the DOM

https://reddit.com/link/1uluc82/video/ml81ua82rvah1/player

Most browser agents drive the page from the outside. An external process, a headless browser, screenshots piped to a multimodal model. That's automation pointed at your app — not something living inside it. Alibaba's Page Agent flips the direction.

They shipped an open-source JavaScript GUI agent that runs client-side, inside the webpage itself. It reads the live DOM as text — a "dehydrated" FlatDomTree — then clicks, types, and scrolls as the real user. No screenshots, no multimodal model, no backend rewrite, and it inherits the user's existing session and auth.

Full analysis: https://www.marktechpost.com/2026/07/02/meet-alibabas-page-agent-a-javascript-in-page-gui-agent-that-controls-web-interfaces-with-natural-language-through-the-dom/

GitHub Repo: https://github.com/alibaba/page-agent

reddit.com

u/ai-lover — 5 days ago

▲ 6 r/OpenSourceeAI+1 crossposts

Using Lift to Turn Research PDFs into Structured JSON with Controlled, Schema-Guided Field-Level Evaluation

Most "PDF extraction" is a text dump with a regex bolted on top. That's not document mining — and it breaks the moment a paper puts its real number in a table three pages away from the abstract.

So we built a full tutorial around Lift, an open PDF-to-structured-data model, treating it as a controlled benchmark instead of a one-off demo.

The setup is synthetic multi-page research reports with deliberate traps: validation-vs-test metric ambiguity, baseline-vs-proposed comparisons, papers that release no code, and boolean state-of-the-art claims. A JSON Schema then tells Lift exactly which fields to recover — title, authors, datasets, metrics, hyperparameters, limitations, code URL.

Here's what's actually interesting:

→ 4-bit NF4 loading fits the ~10B model on a 16 GB T4/L4 — no A100 required

→ Schema descriptions do the disambiguation: test number vs. validation number, proposed method vs. baseline, released code vs. explicit null

→ Field-level scoring against ground truth, with numeric tolerance and abstention handling — not a vibe check

→ Extractions roll up into a queryable knowledge base, one row per mined paper

→ Datalab report Lift at ~90.2% field accuracy on their 225-doc benchmark

Full tutorial: https://www.marktechpost.com/2026/07/01/using-lift-to-turn-research-pdfs-into-structured-json-with-controlled-schema-guided-field-level-evaluation/

GitHub Repo: https://pxllnk.co/rc5yap

pxllnk.co

u/ai-lover — 6 days ago

▲ 19 r/OpenSourceeAI+1 crossposts

NVIDIA Releases Nemotron-Labs-TwoTower: an Open-Weight Diffusion Language Model Built on a Frozen Autoregressive Nemotron-3-Nano-30B-A3B Backbone

Most diffusion language models make one network do two jobs at once — represent the clean context and denoise the noisy tokens. Those two goals pull the same weights in different directions. NVIDIA just split them apart.

They released Nemotron-Labs-TwoTower — a block-wise autoregressive diffusion model built on the Nemotron-3-Nano-30B-A3B hybrid Mamba-2/attention/MoE backbone. It runs two towers: a frozen autoregressive context tower that processes clean tokens causally, and a trainable diffusion denoiser tower that refines noisy blocks via cross-attention to that context. Only the denoiser is trained — on ~2.1T tokens, a fraction of the backbone's 25T.

Here's what's actually interesting:

→ Two towers, not one: a frozen AR context tower and a trainable diffusion denoiser, connected layer-by-layer — denoiser layer i attends to context layer i, not just the last hidden state

→ 98.7% of the autoregressive baseline's quality at 2.42× generation throughput (γ=0.8, block size 16, 2×H100)

→ It commits multiple tokens per denoising step early in decoding — that's where the one-token-per-step AR bottleneck breaks

→ One checkpoint, three decoding modes: mask diffusion, mock-AR, and standard AR

→ Ablations: causal Mamba beats bidirectional Mamba, and tying the two towers under a joint loss is substantially worse

Full analysis: https://www.marktechpost.com/2026/07/01/nvidia-releases-nemotron-labs-twotower/

Paper: https://arxiv.org/pdf/2606.26493

Weights: https://huggingface.co/collections/nvidia/nemotron-labs-twotower

https://reddit.com/link/1ukfnsq/video/t43wdu4gukah1/player

reddit.com

u/ai-lover — 6 days ago

▲ 15 r/OpenSourceeAI+1 crossposts

Google AI Introduces TabFM: A Hybrid-Attention Tabular Foundation Model for Zero-Shot Classification and Regression

Most tabular ML in production is still XGBoost plus hours of hyperparameter tuning and feature engineering. That's not a foundation-model workflow — and Google Research just brought the zero-shot idea to tables.

They introduced TabFM — a foundation model for tabular classification and regression that reads your entire dataset as a single prompt and predicts in one forward pass, with no per-dataset training, tuning, or feature engineering anywhere in the loop.

Here's what's actually interesting:

→ In-context learning, not fine-tuning: training rows and test rows go in as one context, and the model learns the task at inference time

→ Hybrid attention: alternating row/column attention (TabPFN-style) → row compression into a dense vector → in-context learning over compressed rows (TabICL-style)

→ Trained entirely on hundreds of millions of synthetic datasets generated by structural causal models — no proprietary tables required

→ TabArena (38 classification + 13 regression datasets, 700–150,000 samples): Google reports it consistently outperforms heavily tuned supervised baselines

Full analysis: https://www.marktechpost.com/2026/07/01/google-ai-introduces-tabfm-a-hybrid-attention-tabular-foundation-model-for-zero-shot-classification-and-regression/

Technical Details: https://research.google/blog/introducing-tabfm-a-zero-shot-foundation-model-for-tabular-data/

Repo: https://github.com/google-research/tabfm

https://preview.redd.it/eam5uqurqkah1.png?width=2026&format=png&auto=webp&s=aa79748af4ea0353ec930d645c5f91a2963c0939

reddit.com

u/ai-lover — 6 days ago

▲ 2 r/OpenSourceeAI

OpenClaw Releases iOS and Android Companion Node Apps That Connect a Phone to a Self-Hosted AI Agent Gateway

Most "AI assistant" apps are a chatbot in a sandbox, calling someone else's API. OpenClaw's iOS and Android apps draw a very clear line away from that model.

They're companion nodes, not standalone apps. Each phone pairs to a self-hosted OpenClaw Gateway over a WebSocket (default port 18789) with role: "node". The Gateway — the single control plane for sessions, routing, channels, and events — runs on macOS, Linux, or Windows (WSL2). The phone gives the agent a body: camera, location, voice, notifications, and a live Canvas.

Here's what's actually interesting:

→ The assistant runs on your machine — chat messages land on the Gateway, never on the phone

→ Nodes expose a command surface (canvas., camera., device., notifications., system.*) through node.invoke

→ Privacy-heavy commands like camera.snap and screen.record stay off until you allowlist them via gateway.nodes.allowCommands

→ Camera and screen capture run foreground-only; pairing needs explicit approval (openclaw devices approve)

→ Both store listings declare no data collection; ws:// is LAN-only, remote needs a wss:// TLS endpoint via Tailscale

Full analysis: https://www.marktechpost.com/2026/06/29/openclaw-releases-ios-and-android-companion-node-apps-that-connect-a-phone-to-a-self-hosted-ai-agent-gateway/

Android app: https://play.google.com/store/apps/details?id=ai.openclaw.app

iOS App: https://apps.apple.com/us/app/openclaw-ai-that-does-things/id6780396132

https://reddit.com/link/1uj9096/video/x662yks27bah1/player

reddit.com

u/ai-lover — 7 days ago

▲ 8 r/AIDeveloperNews

Meet EverOS: An Open Source Markdown-First Agent Memory Runtime With Hybrid BM25 + Vector Retrieval and Self-Evolving Skills

Most agent "memory" is a vector database you can't open, read, or correct.

That's not memory — it's a black box you query and hope. EverMind just drew a clear line between the two.

They open-sourced EverOS — a local-first memory runtime that stores every agent memory as plain Markdown, indexed by SQLite and LanceDB, with no MongoDB, Elasticsearch, or Redis anywhere in the stack.

Here's what's actually interesting:

→ Markdown is the source of truth — every memory is a .md file you can open, edit, grep, Git-version, or view in Obsidian

→ Hybrid retrieval in a single LanceDB query: BM25 + vector + scalar filtering, marketed as mRAG

→ Two separate tracks: user-side (Profiles, Episodes, Facts) and agent-side (Cases, Skills)

→ Self-evolving Skills: repeated Cases distill offline into reusable procedures — no hardcoding

→ EverMind-reported 93.05% on LoCoMo, sub-500ms p95 retrieval, Apache 2.0, ~9.7K stars

Full analysis: https://www.marktechpost.com/2026/06/29/meet-everos-an-open-source-markdown-first-agent-memory-runtime-with-hybrid-bm25-vector-retrieval-and-self-evolving-skills/

Repo: https://github.com/EverMind-AI/EverOS

Technical Details: https://evermind.ai/everos

u/ai-lover — 8 days ago

▲ 16 r/machinelearningnews

Liquid AI Ships LFM2.5-230M with llama.cpp, MLX, vLLM, SGLang, and ONNX Support for On-Device Inference

Most "edge AI" is a big cloud model, quantized down and hoped for the best. A 230M model designed to run the agent loop on the phone itself is a different thing — and Liquid AI just shipped one.

They released LFM2.5-230M — their smallest model yet. It's a 230M-parameter, open-weight model on the LFM2 architecture (8 double-gated LIV convolution blocks + 6 GQA layers), pre-trained on 19T tokens, then post-trained by distilling from the larger LFM2.5-350M.

Here's what's actually interesting:

→ 213 tok/s decode on a Galaxy S25 Ultra CPU, 42 tok/s on a Raspberry Pi 5 — at a 293–375 MB memory footprint (4-bit)

→ Beats Qwen3.5-0.8B and Gemma 3 1B IT, both larger, on instruction following — IFEval 71.71 vs 59.94 vs 63.49

→ Tool use holds up: BFCLv4 21.03, ahead of Qwen3.5-0.8B's 18.70

→ Runs a Unitree G1 humanoid on-device on a Jetson Orin, turning one instruction into a sequence of tool calls via NVIDIA's SONIC framework

Full analysis: https://www.marktechpost.com/2026/06/27/liquid-ai-ships-lfm2-5-230m-with-llama-cpp-mlx-vllm-sglang-and-onnx-support-for-on-device-inference/

Model on HF: https://huggingface.co/LiquidAI/LFM2.5-230M

Docs: https://docs.liquid.ai/lfm/models/complete-library

Technical details: https://www.liquid.ai/blog/lfm2-5-230m

u/ai-lover — 9 days ago

▲ 25 r/OpenSourceeAI+1 crossposts

DeepSeek Releases DSpark, a Speculative Decoding Framework That Accelerates DeepSeek-V4 Per-User Generation 60–85% Over MTP-1

Most speculative decoding makes you pick one: a fast parallel drafter, or an accurate sequential one. is that a false choice? — and DeepSeek's DSpark just showed why.

They released DSpark — a speculative decoding framework, not a new model — that attaches a draft module to existing DeepSeek-V4 weights. It pairs a heavy parallel draft backbone with a tiny Markov head that nudges each token's logits using only t-1, then schedules how many tokens get verified based on real-time GPU load.

Here's what's actually interesting:

→ Semi-autoregressive drafting: parallel backbone for speed, lightweight sequential head to cut suffix decay — the rank-256 Markov head adds almost nothing to latency (0.2–1.3%)

→ Confidence-scheduled verification: a calibrated confidence head plus a hardware-aware scheduler verify more tokens when GPUs are idle, fewer when they're busy

→ Accepted length: +26–31% over Eagle3 and +16–18% over DFlash across Qwen3-4B / 8B / 14B

→ Production on DeepSeek-V4: 57–85% faster per-user generation over the MTP-1 baseline at matched throughput

→ Output stays lossless, and the training repo (DeepSpec) ships under MIT

Full analysis: https://www.marktechpost.com/2026/06/27/deepseek-releases-dspark-a-speculative-decoding-framework-that-accelerates-deepseek-v4-per-user-generation-60-85-over-mtp-1/

Paper: https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf

GitHub Repo: https://github.com/deepseek-ai/DeepSpec

Model weights on HF: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark

https://reddit.com/link/1uh83aw/video/h8e31q4dxu9h1/player

reddit.com

u/ai-lover — 10 days ago

▲ 27 r/OpenSourceeAI+1 crossposts

Meet container: Apple’s Open-Source Swift Tool for Running Linux Containers as Lightweight VMs on Apple Silicon

Running Linux containers on a Mac has always meant one big shared VM, with every container packed inside it. Apple just inverted that model with the release of container 1.0.

container is an open-source CLI written in Swift and optimized for Apple silicon. It runs each Linux container inside its own lightweight virtual machine, and it consumes and produces standard OCI images.

Here's what's actually interesting:

→ Each container gets its own lightweight VM — isolation at the VM boundary, not a shared kernel

→ Built on Apple's open-source Containerization package, using the Virtualization and vmnet frameworks

→ OCI-compatible: pull from any standard registry (Docker Hub, GHCR), push back the same way, no conversion

→ New "container machines": persistent Linux environments with your home directory mounted and the login user matching your Mac account

→ 1.0 also moved settings to a TOML config and added structured JSON/YAML/TOML output for list and inspect

→ Apache 2.0, Apple silicon only, best on macOS 26 — and past 30,000 GitHub stars within days of release

Full analysis: https://www.marktechpost.com/2026/06/26/meet-container-apples-open-source-swift-tool-for-running-linux-containers-as-lightweight-vms-on-apple-silicon/

GitHub: https://github.com/apple/container

u/ai-lover — 11 days ago

▲ 25 r/OpenSourceeAI+1 crossposts

Baidu Releases Unlimited OCR, a 3B Model That Keeps the KV Cache Flat for Long-Document Parsing

Most end-to-end OCR models slow down the longer they read. Every token they generate adds to the KV cache — so memory climbs and parsing dozens of pages becomes impractical. Baidu's Unlimited OCR attacks that at the attention layer, not with engineering workarounds.

They open-sourced Unlimited OCR — a 3B MoE model with 500M active parameters, built on DeepSeek OCR, that replaces every decoder attention layer with Reference Sliding Window Attention (R-SWA). Each token attends to all reference tokens (visual tokens + prompt) plus only the last 128 generated tokens. Everything older is evicted, so the KV cache stays constant instead of growing with output length. MIT-licensed, weights public.

Here's what's actually interesting:

→ The full decode runs on a constant KV cache (L_m + n) — memory and per-step latency stay flat the whole way

→ DeepEncoder compresses a 1024×1024 page to 256 visual tokens (16×), so the prefill stays small

→ Continue-trained from the DeepSeek OCR checkpoint for just 4,000 steps with the encoder frozen — the gains come from R-SWA, not scale

→ OmniDocBench v1.5: 93.23 vs. 87.01 for the DeepSeek OCR baseline (+6.22)

→ 40+ pages parsed in one forward pass, edit distance still under 0.11; 35% throughput lead at 6,000 output tokens

Full analysis: https://www.marktechpost.com/2026/06/24/baidu-releases-unlimited-ocr-a-3b-model-that-keeps-the-kv-cache-flat-for-long-document-parsing/

Paper: https://arxiv.org/pdf/2606.23050

Model weights on HF: https://huggingface.co/baidu/Unlimited-OCR

Repo: https://github.com/baidu/Unlimited-OCR

https://preview.redd.it/l99dpg19ad9h1.png?width=1814&format=png&auto=webp&s=e585f2a073d1b599eb13d957668e5b1880ddd062

reddit.com

u/ai-lover — 12 days ago

▲ 12 r/OpenSourceeAI+1 crossposts

DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

Most speculative decoding still drafts tokens one at a time. That's not parallel generation — it just hides the serial loop behind a smaller model.

UC San Diego's z-lab just drew a clear line between the two. They released DFlash — a lightweight block diffusion model that drafts a whole block of tokens in a single forward pass, then lets the target model verify the block in parallel. Up to 15× higher throughput for gpt-oss-120b on NVIDIA Blackwell. No token-by-token drafting anywhere in the speculative path.

Here's what's actually interesting:

→ The drafter is conditioned on the target model's own hidden features, injected into the Key/Value cache of every draft layer — so acceptance length scales with draft depth instead of diluting away

→ A 5-layer drafter replaces the 7B diffusion drafters that capped earlier methods near 3–4×

→ MATH-500 speedup: 6.08× vs. 1.81× for EAGLE-3 (4.86× average vs. 1.76×, Qwen3-8B, greedy)

→ Up to 15× higher throughput for gpt-oss-120b on NVIDIA Blackwell — at the same interactivity target

→ Lossless: the target still verifies every token, so output quality is preserved

Full analysis: https://www.marktechpost.com/2026/06/24/dflash-speculative-decoding-drafts-whole-token-blocks-in-parallel-for-up-to-15x-higher-throughput-on-nvidia-blackwell/

Paper: https://arxiv.org/pdf/2602.06036

NVIDIA's metrics: https://developer.nvidia.com/blog/boost-inference-performance-up-to-15x-on-nvidia-blackwell-using-dflash-speculative-decoding/

Project: https://z-lab.ai/projects/dflash/

Model weights: https://huggingface.co/collections/z-lab/dflash

Repo: https://github.com/z-lab/dflash

https://reddit.com/link/1ue6r7w/video/cfkba395o69h1/player

reddit.com

u/ai-lover — 13 days ago

▲ 3 r/OpenSourceeAI

Datalab Releases lift: A 9B Open-Weights Vision Model That Extracts Structured JSON From PDFs Using Schemas

Most "structured extraction" is a general LLM asked nicely to return JSON, with a retry loop bolted on. That's not a guarantee — and Datalab just drew a very clear line between the two.

They just released lift as open weights — a 9B vision model that decodes directly against your JSON schema, so the output is valid by construction. It reads whole multi-page documents in a single pass, including values that span pages. The structural guarantee lives in the decoder, so you don't need a parse-validate-retry loop to get well-formed JSON.

Here's what's actually interesting:

→ Schema-constrained decoding: your schema is compiled to a grammar, and tokens that would break it are masked at every step. Structure is enforced as it generates, not validated after the fact.

→ It guarantees shape, not meaning — a field typed "number" holds a number, just not necessarily the right one. Validity ≠ correctness.

→ Trained abstention: every field is made nullable, so it returns null instead of hallucinating a tax ID that isn't on the page.

→ The trap: hand it enum / ref / anyOf and the schema won't compile — lift silently drops the guarantee and free-generates. No hard error. Validate downstream.

→ 90.2% field accuracy on a 225-doc, ~11,000-field adversarial benchmark — the highest of any self-hostable model they tested.

→ 9.5s median/doc: ~3x faster than Gemini Flash 3.5, and within a point of it on field accuracy.

→ Built on Qwen 3.5 — the base scores 76.3%, lift hits 90.2%. Same size, so the gain is the training, not the parameters.

→ The honest catch: full-document accuracy is 20.9% — near the bottom of the table. Getting every field right across a 64-page doc is brutal; even the hosted leaders top out at 44.4% / 40.0%.

Full analysis: https://www.marktechpost.com/2026/06/23/datalab-releases-lift-a-9b-open-weights-vision-model-that-extracts-structured-json-from-pdfs-using-schemas/

Repo: https://pxllnk.co/nmpjxqn

Model weights on HF: https://pxllnk.co/t0x8a0r

Playground: https://pxllnk.co/mf4o7kl

https://preview.redd.it/gsr0tv46a39h1.png?width=1438&format=png&auto=webp&s=66ba3395cf90415900f7509c08c5620d81d1a57c

reddit.com

u/ai-lover — 14 days ago

▲ 10 r/pdf+1 crossposts