▲ 6 r/LLMeng+2 crossposts

Software Architecture In the Age of AI: Sessions, Workshops, and Roundtables from Industry Leaders

Hi Everyone,

This post is just to raise awareness about our upcoming flagship conference, which is on Software Architecture.

Our speakers include industry leaders from organizations like AWS, Netflix, Google, DeepMind, and Salesforce, along with bestselling authors, who will talk about their architectural approaches in the age of AI.

We have a special discount (discussed with Mod) for the community using the code: ARCH50

u/Opposite_Toe_3443 — 4 days ago

▲ 9 r/LLMeng+8 crossposts

[ Removed by moderator ]

[deleted]

reddit.com

u/Traditional_Honey858 — 6 days ago

▲ 57 r/LLMeng+8 crossposts

Built a 135M looped transformer with custom Muon+AdamW optimizer routing, per-sequence Poisson depth sampling, and truncated BPTT. Here's what the training code looks like.

Built a 135M dense looped LLM from scratch. Spent 2 weeks debugging Parcae's LTI stability mechanisms across 5 ablations. None of them beat the naive baseline at this scale. Trained for real anyway. SFT'd it. Shipped it. Here's the full honest story.

What I built

A 135M parameter looped transformer trained from scratch on FineWeb (4.6B tokens), inspired by the Parcae paper (arXiv:2604.12946 — "Scaling Laws For Stable Looped Language Models").

🤗 Base model: huggingface.co/harims95/LoopLM-135M-naive
🤗 SFT model: huggingface.co/harims95/LoopLM-135M-naive-sft
📂 Code: github.com/harims95/LoopLM
💰 Total cost: ~$51 (Modal H100s + free Lightning H200)

Architecture

Input → [Embedding] → [Prelude: 4 blocks] → e (injection)
     → [Loop block × T loops, T~Poisson(μ=6)] → [Coda: 2 blocks] → logits

d_model 1024, GQA 16/8 heads, RoPE, QK-norm, SwiGLU FFN 2816
Update rule: h_{t+1} = block(h + e) (naive) or with LTI stability (Parcae)
Muon + AdamW optimizers, truncated BPTT (μ_bwd=3), bf16
Trained on 2× H100 on Modal, ~3 hours wall clock
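A minimal sketch of how per-sequence Poisson depth sampling and truncated BPTT could be scheduled, using only the standard library. It assumes that μ_bwd=3 means only the last three loop iterations keep gradients, that depth is clamped to at least 1, and that the recurrent state starts at zero; the function names are hypothetical and not taken from the LoopLM repo.

```python
import math
import random
from typing import Callable, List


def sample_poisson(mu: float, rng: random.Random) -> int:
    """Knuth's method: count uniform draws until their running product drops to exp(-mu)."""
    threshold = math.exp(-mu)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def loop_schedule(batch_size: int, mu_depth: float = 6.0, bwd_steps: int = 3,
                  min_depth: int = 1, seed: int = 0) -> List[List[bool]]:
    """For each sequence, return one flag per loop iteration.

    True  = iteration keeps gradients (inside the truncated BPTT window).
    False = iteration runs without gradients (e.g. under torch.no_grad()).
    """
    rng = random.Random(seed)
    schedule = []
    for _ in range(batch_size):
        depth = max(min_depth, sample_poisson(mu_depth, rng))
        grad_steps = min(bwd_steps, depth)  # assumption: fixed-size gradient window
        schedule.append([False] * (depth - grad_steps) + [True] * grad_steps)
    return schedule


def run_naive_loop(e: float, depth: int, block: Callable[[float], float],
                   h0: float = 0.0) -> float:
    """Naive update rule from the post, h = block(h + e), on a scalar stand-in."""
    h = h0
    for _ in range(depth):
        h = block(h + e)
    return h
```

In a real training loop the flags would decide which iterations run with gradient tracking; here they only describe the schedule.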

The Parcae investigation (the interesting part)

The paper claims LTI stability constraints on the recurrent state dramatically improve looped LM training. I tried to reproduce it. Here's what actually happened:

Ablation	Description	Val loss
1. Naive looped	`h = block(h + e)`	3.84
2. + A matrix	LTI decay constraint	3.84 (tied)
3. + Input norm v1	Wrong arch flow	Diverged
4. + LTI before block	Fixed arch, B=identity	Worse
5. + B→AdamW, init=0.447	Matched official repo	Dramatically worse
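For context on what a decay constraint is meant to buy, here is a toy scalar illustration. It uses a generic linear time-invariant recurrence, h_{t+1} = a·h_t + b·e, which is an assumption for illustration and not Parcae's actual parameterization. With |a| < 1 the state settles at b·e / (1 − a); with |a| ≥ 1 it keeps growing as the loop count increases.

```python
def iterate_lti(a: float, b: float, e: float, steps: int, h0: float = 0.0) -> float:
    """Iterate the scalar recurrence h <- a*h + b*e for a fixed number of steps."""
    h = h0
    for _ in range(steps):
        h = a * h + b * e
    return h


def fixed_point(a: float, b: float, e: float) -> float:
    """Limit of the recurrence when |a| < 1."""
    return b * e / (1.0 - a)


stable = iterate_lti(a=0.5, b=1.0, e=2.0, steps=50)    # approaches fixed_point(0.5, 1.0, 2.0)
unstable = iterate_lti(a=1.1, b=1.0, e=2.0, steps=50)  # grows with the number of steps
```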

Every single "fix" — bringing my implementation closer to the official Parcae code — made things worse. After consulting:

The paper's Appendix Q (optimizer routing)
Official sandyresearch/parcae repo (injection.py)
Two rounds of ChatGPT + Gemini debugging sessions

My conclusion: Parcae's stability improvements are a large-scale phenomenon. The paper's 1.3B model trains for 170k+ steps before stability mechanisms kick in. At 135M / 17.5k steps, naive looped is competitive enough that the extra complexity hurts more than it helps.

Comparison with sibling MoE

My brother built HobbyLM — a 500M MoE on the same infrastructure. For apples-to-apples comparison, I ran naive looped 135M on the same FineWeb data:

Model	Architecture	Tokens	Val loss
LoopLM-135M (mine)	Dense looped	4.6B	3.95
HobbyLM-130M MoE (bro)	Sparse MoE	10B	3.30

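Dense looped loses to MoE at this scale/budget. That is consistent with sparse MoE being more sample-efficient, though the MoE run also saw about twice as many tokens (10B vs 4.6B), so the comparison is not fully matched. Not surprising, but now I have the data to confirm it.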
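A quick arithmetic check on the table above, assuming the parameter counts listed there (135M and 130M) are the right denominators. It only shows how different the tokens-per-parameter budgets of the two runs were.

```python
def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


# Figures copied from the comparison table.
looped = tokens_per_param(4.6e9, 135e6)  # roughly 34 tokens per parameter
moe = tokens_per_param(10e9, 130e6)      # roughly 77 tokens per parameter
ratio = moe / looped                     # the MoE run saw over 2x the tokens per parameter
```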

SFT results (bonus)

Fine-tuned on Alpaca 52k using Lightning AI's free H200. Took 6 minutes (bf16 on H200 is insane).

Before SFT:

After SFT:

Improvement in format, not in facts. At 135M / 4.6B tokens, SFT teaches format, not knowledge. The model still hallucinates — that's a base model capacity problem, not a fine-tuning problem.

What I learned

On Parcae: Small-scale reproductions of large-scale papers are dangerous. The paper's key contribution (stability at 170k+ steps) is invisible at hobby budgets. Naive looped is a legitimate architecture for anyone training sub-1B models.

On MoE vs looped: At roughly matched parameter count, MoE came out ahead, though with a larger token budget (10B vs 4.6B). Looped models need more tokens to show their advantage, or need to be much bigger to amortize the loop cost.

On debugging: When three independent sources (me, ChatGPT 5.5, Gemini) all agree on a fix and it makes things worse, the paper's regime assumption is probably wrong, not your code.

On SFT: The H200 on Lightning AI is free for 2 hours/month, which easily covers a 6-minute SFT run. Use it. Colab Free disconnects at 3 hours, so don't use it for long jobs.

On honest publishing: val 3.95 is not impressive. The architecture exploration is. Shipping anyway with full documentation of what failed is more valuable than hiding failures.

Stack

Training: Modal (H100s), Lightning AI (H200 for SFT)
Framework: PyTorch, HuggingFace Transformers
Optimizer: Muon (matrices) + AdamW (rest)
Data: FineWeb via kjj0/fineweb10B-gpt2 shards
Infra forked from: github.com/harishsg993010/HobbyLM (my brother's 500M MoE project)
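The "Muon (matrices) + AdamW (rest)" split can be sketched as a routing function over parameter names and shapes. This is an illustration under assumptions: the parameter names are hypothetical, and sending embeddings and the output head to AdamW is a common convention in Muon setups that I have not checked against the LoopLM repo.

```python
from typing import Dict, List, Tuple


def route_params(shapes: Dict[str, Tuple[int, ...]]) -> Dict[str, List[str]]:
    """Assign each parameter to an optimizer group by shape and name.

    2-D weight matrices go to Muon; everything else (norm gains, biases,
    and, by assumption, embeddings and the LM head) goes to AdamW.
    """
    groups: Dict[str, List[str]] = {"muon": [], "adamw": []}
    for name, shape in shapes.items():
        is_matrix = len(shape) == 2
        is_embedding_or_head = "embed" in name or "lm_head" in name
        if is_matrix and not is_embedding_or_head:
            groups["muon"].append(name)
        else:
            groups["adamw"].append(name)
    return groups


# Hypothetical parameter names using the dimensions quoted in the post.
example = route_params({
    "embed.weight": (50257, 1024),
    "loop.attn.q_proj.weight": (1024, 1024),
    "loop.ffn.up.weight": (2816, 1024),
    "loop.norm.weight": (1024,),
    "lm_head.weight": (50257, 1024),
})
```

Each group would then be handed to its own optimizer instance, with both stepped every iteration.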

Happy to answer questions about any part of this. The code is fully open, reproducible, and documented.

u/Hariharanms — 6 days ago

▲ 12 r/LLMeng+9 crossposts

Open handoff: Thought Tree, a markup/spec idea for modular LLM workflows

I’m releasing an open handoff draft of a framework I’ve been developing called the Thought Tree AI Framework.

At its core, the framework uses a simple pattern:

Data Units → Operations → Data Units

A Thought Tree program applies this recursively. Complex cognitive work is decomposed into named artefacts, transformations, contracts, modules and traces.
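To make the Data Units → Operations → Data Units pattern concrete, here is a minimal sketch. Every name in it (DataUnit, Operation, run) is hypothetical and does not come from the TTML schema in the repo; plain functions stand in for LLM calls or deterministic code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class DataUnit:
    name: str
    content: str


@dataclass
class Operation:
    name: str
    inputs: List[str]        # names of the data units consumed
    output: str              # name of the data unit produced
    fn: Callable[..., str]   # an LLM call or deterministic code in practice


def run(ops: List[Operation], units: List[DataUnit]) -> Tuple[Dict[str, DataUnit], List[str]]:
    """Apply operations in order, storing each result as a new named data unit."""
    store = {u.name: u for u in units}
    trace = []
    for op in ops:
        args = [store[n].content for n in op.inputs]
        store[op.output] = DataUnit(op.output, op.fn(*args))
        trace.append(f"{op.name}: {', '.join(op.inputs)} -> {op.output}")
    return store, trace


ops = [
    Operation("uppercase", ["brief"], "loud_brief", lambda text: text.upper()),
    Operation("combine", ["brief", "loud_brief"], "report", lambda a, b: f"{a} / {b}"),
]
store, trace = run(ops, [DataUnit("brief", "ship it")])
```

Because each operation's output is itself a data unit, a larger workflow can feed `report` into further operations, which is the recursive step the post describes.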

It came out of experiments with Auto-GPT-style agents, creative production pipelines and the need to separate what LLMs are good at from what deterministic code should handle.

I don’t currently have time to continue developing it properly, so I’m releasing it as an open handoff for anyone who wants to critique, fork, implement or reinterpret it.

The repo includes:

- a concise README;

- one-page summary;

- draft TTML schema;

- minimal example workflow;

- roadmap;

- original long-form explainer.

I’m especially interested in whether people see value in Thought Tree as:

- an intermediate representation for LLM workflows;

- a design vocabulary for structured AI production;

- a small open-source executor;

- or something that could map onto LangGraph / LlamaIndex / other orchestration tools.

Repo: https://github.com/RobertBateman/thoughttree-framework

Feedback, criticism, forks and maintainers welcome.

u/xavier1764 — 7 days ago

▲ 1 r/LLMeng

At what point do you think Anthropic and OpenAI will open-source their models?

wsj.com

u/Danny_simmon — 7 days ago

▲ 27 r/LLMeng+2 crossposts

What Local LLM are you using for simple tasks?

I've been using GPT-OSS-120B via Groq in a Chrome extension, and it's been working well so far.

I'm curious what local LLMs people are actually using day-to-day. If you had to pick a model for productivity tasks rather than coding or benchmarks, what would you choose?

My most common use cases are:

Fixing grammar and improving writing
Reading a job description and generating a tailored cover letter from my CV
Extracting action items from emails
Summarising documents and web pages
Rewriting text in different tones

For people running local models (Ollama, LM Studio, Open WebUI, etc.), what's your current go-to model and why?

Are there any models that are noticeably similar to GPT-OSS-120B for these kinds of tasks but run locally (Apple M4)?

reddit.com

u/AlbertoCubeddu — 12 days ago

▲ 0 r/LLMeng

Build 10 AI Agents in Just 5 Hours

My team at Packt is hosting a 5-hour workshop with a bestselling author and instructor on building 10 essential AI agents across domains like healthcare, finance, education, and beyond.

Building domain-specific agents is the future as agents become popular across industries. Led by the bestselling author of "30 AI Agents Every Engineer Must Build," this workshop teaches you to build agents based on different patterns.

There is an early bird discount running on it, so it's worth exploring!

u/Opposite_Toe_3443 — 10 days ago

▲ 2 r/LLMeng+2 crossposts

Teams think they are evaluating an agent when they are only evaluating the final answer

Many teams think they’re evaluating their AI agents when they’re really only evaluating the final answer. That works for chatbots. But agents are clearly different.

An agent plans, chooses tools, passes arguments, reads tool outputs, retries, and sometimes takes actions. A lot happens between the prompt and the answer.

The problem is that an agent can return a correct answer after calling the wrong tool, taking unnecessary steps, misreading a result, or recovering from an earlier failure.

If you’re only looking at the final output, you won’t see most of that.

Your assumption becomes: “The answer was correct, so the agent worked.”

But if an agent is going to run real workflows, the answer isn’t the only thing that matters. You also need to know whether the path it took was valid, efficient, grounded, and safe.
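A minimal sketch of what scoring the path as well as the answer could look like. The trace format, field names, and checks are assumptions for illustration, not any particular framework's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Step:
    tool: str   # name of the tool the agent called
    ok: bool    # whether the call succeeded


def evaluate_trace(steps: List[Step], final_correct: bool,
                   allowed_tools: Set[str], max_steps: int) -> Dict[str, object]:
    """Report the final-answer result alongside simple checks on the execution path."""
    unexpected = [s.tool for s in steps if s.tool not in allowed_tools]
    failed = sum(1 for s in steps if not s.ok)
    within_budget = len(steps) <= max_steps
    return {
        "final_correct": final_correct,
        "unexpected_tools": unexpected,
        "failed_steps": failed,
        "within_budget": within_budget,
        "path_ok": not unexpected and failed == 0 and within_budget,
    }


# A run that reaches the right answer after a wrong tool call and a retry.
report = evaluate_trace(
    steps=[Step("web_search", True), Step("sql_query", False), Step("sql_query", True)],
    final_correct=True,
    allowed_tools={"sql_query"},
    max_steps=2,
)
```

Here `final_correct` is true while `path_ok` is false, which is exactly the case that output-only evaluation misses.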

How are people here evaluating agents today? Are you looking at execution traces, or mostly the final output?

reddit.com

u/ExplorerRin — 12 days ago