r/pytorch

▲ 88 r/pytorch+3 crossposts

[VisualTorch] How to generate architecture diagrams from PyTorch models

I built a small tool to auto-generate architecture diagrams directly from PyTorch models. I originally wrote it for my own research paper.

It has 26k+ PyPI downloads and has already been used in publications (Nature, IEEE, MDPI). Check out some use cases here: https://visualtorch.readthedocs.io/en/latest/markdown/showcase/index.html

It traces an actual forward pass, so it correctly captures branching, skip connections, and multi-input models, not just flat sequential stacks.

import visualtorch
import torchvision.models as models

model = models.resnet18()
img = visualtorch.render(model, input_shape=(1, 3, 224, 224), style="graph", show_neurons=False, layer_spacing=60)
img.save("resnet18.png")
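
To illustrate why tracing a forward pass picks up skip connections, here is a toy sketch in plain Python. It is not VisualTorch's implementation: the `Node` class and the `conv`, `relu`, and `add` helpers are hypothetical names made up for this example, and each op simply records which nodes produced its inputs.

```python
# Toy tracer: every op records which nodes produced its inputs.
# Node, conv, relu, and add are hypothetical helpers, not VisualTorch APIs.


class Node:
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = tuple(parents)


def conv(x, name):
    return Node(name, [x])


def relu(x, name):
    return Node(name, [x])


def add(a, b, name):
    return Node(name, [a, b])


def residual_block(x):
    """Forward pass of a block with a skip connection."""
    out = conv(x, "conv1")
    out = relu(out, "relu1")
    out = conv(out, "conv2")
    return add(out, x, "add")  # x bypasses both convs


def collect_edges(output):
    """Walk back from the output and return the set of (src, dst) edges."""
    edges, stack, seen = set(), [output], set()
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        for parent in node.parents:
            edges.add((parent.name, node.name))
            stack.append(parent)
    return edges


edges = collect_edges(residual_block(Node("input")))
for src, dst in sorted(edges):
    print(f"{src} -> {dst}")
```

The `input -> add` edge only exists because the forward pass was actually executed; a walk over the module list alone would see a flat stack of layers.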

Three rendering styles depending on what you want to show:

graph: node/edge diagram, good for showing branching/skip connections clearly
flow: stacked volumetric boxes, closer to the classic CNN-paper look
lenet: the classic LeNet stacked-plane style

GitHub: https://github.com/willyfh/visualtorch | Docs: https://visualtorch.readthedocs.io/en/latest/

Open to feedback, especially if you hit a model it renders weirdly :)

u/LostDistance9365 — 6 hours ago

▲ 18 r/pytorch+4 crossposts

H64LM: A 249M-parameter Mixture-of-Experts Transformer built from scratch in PyTorch

Hi everyone,

I built H64LM, a research project to better understand modern LLMs by implementing one from scratch in PyTorch.

Instead of relying on high-level training frameworks, I implemented the core components myself: attention, MoE routing, normalization, and the training loop.

Features

249M-parameter Transformer
Grouped Query Attention (GQA)
Sparse Mixture-of-Experts (8 experts, Top-2 routing) with 3 auxiliary routing losses
SwiGLU, RoPE, RMSNorm
Sliding-window attention
Mixed-precision training, gradient accumulation
Custom training loop (no Trainer abstractions)
Checkpointing and resume support
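
As a rough illustration of the Top-2 routing listed above, here is a minimal sketch for a single token in plain Python. It is not taken from the H64LM code: the function name `top2_route` and the logits are made up, and renormalizing with a softmax over only the two winning logits is an assumption (some implementations softmax over all experts first).

```python
import math


def top2_route(gate_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for one token.

    Assumption: the two winning logits are renormalized with a softmax over
    just those two values.
    """
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:2]
    m = max(gate_logits[i] for i in top)          # subtract max for numerical stability
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Made-up router logits for 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.5, 0.3]
routes = top2_route(logits)
print(routes)
```

The token's output would then be the weighted sum of the two selected experts' outputs; the other six experts are never evaluated for this token, which is where the sparsity comes from.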

The included checkpoint was trained on a subset of WikiText-103 to validate the pipeline end-to-end, not to be a strong model; it's visibly overfit past epoch 10 (best val PPL ~40.5).
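
For readers who think in loss rather than perplexity, the conversion is a one-liner. This assumes the usual definition, perplexity = exp(mean per-token cross-entropy in nats); the post does not state which definition the repo uses. Under that assumption, a PPL of about 40.5 corresponds to a validation loss of about 3.70.

```python
import math

best_val_ppl = 40.5                  # figure quoted in the post
val_loss = math.log(best_val_ppl)    # mean cross-entropy in nats per token (assumed definition)
print(f"val loss ~ {val_loss:.2f}")
```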

Known limitations are documented in the README, including batch-size-1-only generation and no true DDP (falls back to DataParallel).

GitHub: https://github.com/Haiderkhan64/H64LM

Feedback on the implementation or architecture is very welcome.

u/Loose_Literature6090 — 6 hours ago

▲ 174 r/pytorch+20 crossposts

I would like to share my latest open-source local LLM inference tool, implemented in C#. It supports models such as Gemma4 and Qwen3.6, with multi-modal input (image, vision, audio), reasoning, and function/tool calling. It runs on Windows, macOS, and Linux and makes full use of the GPU. The API is fully compatible with the OpenAI and Ollama interfaces.

I'd really appreciate it if you could try it and give me some feedback. If you like it, a star would be a big thank-you. Thank you very much!

u/fuzhongkai — 1 day ago

▲ 0 r/pytorch+1 crossposts

I got tired of manually benchmarking ONNX vs CoreML vs PyTorch every project, so I built a CLI for it

Every time I ship a YOLO model I end up asking the same question: should this be ONNX, CoreML, or just PyTorch? Does FP16 actually help here, or is it just marginal?

I've answered this by hand, badly, on four different projects this year, and thrown the results away every time. First I have to optimize a model to my liking, and then figure out a way to reduce its size.

So I'm building exportrace - you run one command, it benchmarks your model across every export backend available on your actual machine (PyTorch, ONNX, CoreML, CUDA, TensorRT depending on your setup), and gives you FPS, latency, and accuracy delta vs FP32, plus a ranked recommendation.
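
The latency and FPS numbers come down to a timing loop like the one below. This is only a sketch of the general technique, not exportrace's code: `benchmark` and `fake_inference` are hypothetical names, and the stand-in workload is plain Python rather than a real exported model.

```python
import statistics
import time


def benchmark(run_once, warmup=5, iters=50):
    """Time a zero-argument callable; return mean latency (ms) and throughput (FPS)."""
    for _ in range(warmup):           # warm caches and lazy initialization before timing
        run_once()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    mean_s = statistics.mean(samples)
    return {"latency_ms": mean_s * 1e3, "fps": 1.0 / mean_s}


def fake_inference():
    # Stand-in for one forward pass of an exported model.
    sum(i * i for i in range(10_000))


result = benchmark(fake_inference)
print(result)
```

On GPU backends the timed call also has to wait for the device to finish (for example with an explicit synchronize) before the clock stops, otherwise the loop measures kernel launch time rather than inference time.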

Consumer hardware only - your laptop or dev box, not Jetson/Pi. It's open source (MIT), runs fully offline, no accounts. Still pre-launch, landing page + waitlist here if you want to see the concept and maybe kill the boredom of doing this by hand too: https://exportrace.vercel.app/

Curious if others hit this same wall, and what backends/hardware you'd actually want covered first.

u/Particular-Abies-123 — 1 day ago

▲ 61 r/pytorch+40 crossposts

Ask questions across your Markdown notes using a fully local Graph RAG engine. Built for Obsidian vaults, works with any folder of Markdown files. Extracts entity-relation triples from wikilinks & YAML frontmatter, retrieves answers via hybrid search (vector + BM25 + temporal). Multilingual. No cloud. Runs on Ollama.
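
A minimal sketch of what extracting triples from wikilinks could look like, in plain Python. This is not Kwipu's code: the `wikilink_triples` function and the `links_to` relation name are hypothetical, and the regex only covers the common `[[Target]]`, `[[Target|alias]]`, and `[[Target#heading]]` forms.

```python
import re

# Captures the link target; ignores an optional #heading and an optional |alias.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def wikilink_triples(note_title, markdown_text):
    """Return (subject, relation, object) triples, one per distinct link target."""
    triples = []
    seen = set()
    for match in WIKILINK.finditer(markdown_text):
        target = match.group(1).strip()
        if target and target not in seen:
            seen.add(target)
            triples.append((note_title, "links_to", target))
    return triples


text = "Compare [[Graph RAG]] with [[BM25|keyword search]] and [[Graph RAG#Retrieval]]."
triples = wikilink_triples("Search notes", text)
print(triples)
```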

https://github.com/benmaster82/Kwipu

u/WritHerAI — 2 days ago

▲ 16 r/pytorch+1 crossposts

We do everything in the terminal now — so why not look at TensorBoard there too?

https://preview.redd.it/w97sw1d9y0bh1.png?width=1690&format=png&auto=webp&s=e62fcad6f8a190879ab4c2f8777c0b93d74fd594

Open source (MIT), a solo side project: https://github.com/dongfangyixi/terminalboard
PyPI: terminalboard

These days I run basically my whole workflow in the terminal: vim/nvim, tmux, lazygit, k9s, btop, files, git, SSH into GPU boxes… everything. The one thing that kept kicking me out of it was TensorBoard: forward a port (ssh -L 6006:localhost:6006), switch to a browser, and open it there.

So I (with Claude Code, of course) built terminalboard: it reads the events.out.tfevents.* files directly and draws everything in the terminal as Unicode/braille text. No browser, no X11, no port-forwarding: a plain SSH session (or your local shell) is all you need.
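
To give a feel for drawing a scalar curve as text, here is a tiny sparkline sketch. It is not terminalboard's renderer: it uses the eight Unicode block characters rather than braille, and `sparkline` and the loss values are made up for the example.

```python
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Map each value to one of eight block characters by its position in [min, max]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return BLOCKS[0] * len(values)   # flat series: draw the lowest block throughout
    chars = []
    for v in values:
        idx = int((v - lo) / span * (len(BLOCKS) - 1))
        chars.append(BLOCKS[idx])
    return "".join(chars)


losses = [2.0, 1.5, 1.0, 0.5, 0.0]   # made-up training losses
line = sparkline(losses)
print(line)
```

Braille characters give finer resolution (each cell is a 2×4 dot grid), but the mapping from values to character cells follows the same idea.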

Optional LLM assistant (off until you set it up): press "a" to chat with your runs. It can analyze ("which run is overfitting?") and drive the dashboard ("show val losses, smoothed").

Bring-your-own-model via LiteLLM, including local Ollama/vLLM; the key stays on your machine and its actions are a fixed typed whitelist (no shell).

Try it:

pip install terminalboard

terminalboard path/to/logs (where your tensorboard logs save to)

Once it opens, type H (Shift + h) for the help document.

Hope you have fun in there. This is a new project, so feel free to fork it and open a pull request if you want more features.

It is still early — feedback very welcome:

- Does it handle your logs (weird tags, huge runs, many experiments)?

- What's missing for your terminal workflow?

- Is the AI part useful, or noise you'd turn off?

u/GrExplanation — 3 days ago

▲ 56 r/pytorch+8 crossposts

Built a 135M looped transformer with custom Muon+AdamW optimizer routing, per-sequence Poisson depth sampling, and truncated BPTT. Here's what the training code looks like.

Built a 135M dense looped LLM from scratch. Spent 2 weeks debugging Parcae's LTI stability mechanisms across 5 ablations. None of them beat the naive baseline at this scale. Trained for real anyway. SFT'd it. Shipped it. Here's the full honest story.

What I built

A 135M parameter looped transformer trained from scratch on FineWeb (4.6B tokens), inspired by the Parcae paper (arXiv:2604.12946 — "Scaling Laws For Stable Looped Language Models").

🤗 Base model: huggingface.co/harims95/LoopLM-135M-naive
🤗 SFT model: huggingface.co/harims95/LoopLM-135M-naive-sft
📂 Code: github.com/harims95/LoopLM
💰 Total cost: ~$51 (Modal H100s + free Lightning H200)

Architecture

Input → [Embedding] → [Prelude: 4 blocks] → e (injection)
     → [Loop block × T loops, T~Poisson(μ=6)] → [Coda: 2 blocks] → logits

d_model 1024, GQA 16/8 heads, RoPE, QK-norm, SwiGLU FFN 2816
Update rule: h_{t+1} = block(h + e) (naive) or with LTI stability (Parcae)
Muon + AdamW optimizers, truncated BPTT (μ_bwd=3), bf16
Trained on 2× H100 on Modal, ~3 hours wall clock
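
The loop schedule can be sketched in plain Python. This is an illustration under stated assumptions, not the LoopLM code: `sample_poisson`, `loop_schedule`, and `looped_forward` are hypothetical names, the depth is clamped to at least one loop, μ_bwd is treated as a fixed count of trailing loops that keep gradients, and h starts at zero. The block is a scalar stand-in for the real transformer block.

```python
import math
import random


def sample_poisson(mu, rng):
    """Knuth's algorithm for a Poisson(mu) draw; fine for small mu."""
    threshold = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def loop_schedule(mu_loops=6, mu_bwd=3, rng=None):
    """Return (T, flags), where flags[t] says whether loop t keeps gradients."""
    rng = rng or random.Random()
    T = max(1, sample_poisson(mu_loops, rng))     # assumption: at least one loop
    first_tracked = max(0, T - mu_bwd)            # assumption: last mu_bwd loops are tracked
    return T, [t >= first_tracked for t in range(T)]


def looped_forward(e, block, flags):
    """Naive update h = block(h + e), starting from h = 0 (an assumption)."""
    h = 0.0
    for keep_grad in flags:
        # In PyTorch, iterations with keep_grad == False would run under
        # torch.no_grad() (or h would be detached) so that backprop only
        # spans the trailing loops.
        h = block(h + e)
    return h


rng = random.Random(0)
T, flags = loop_schedule(rng=rng)                 # one draw per sequence
h = looped_forward(e=1.0, block=lambda x: 0.5 * x, flags=flags)
print(T, flags, h)
```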

The Parcae investigation (the interesting part)

The paper claims LTI stability constraints on the recurrent state dramatically improve looped LM training. I tried to reproduce it. Here's what actually happened:

Ablation	Description	Val loss
1. Naive looped	`h = block(h + e)`	3.84
2. + A matrix	LTI decay constraint	3.84 (tied)
3. + Input norm v1	Wrong arch flow	Diverged
4. + LTI before block	Fixed arch, B=identity	Worse
5. + B→AdamW, init=0.447	Matched official repo	Dramatically worse

Every single "fix" — bringing my implementation closer to the official Parcae code — made things worse. After consulting:

The paper's Appendix Q (optimizer routing)
Official sandyresearch/parcae repo (injection.py)
Two rounds of ChatGPT + Gemini debugging sessions

My conclusion: Parcae's stability improvements are a large-scale phenomenon. The paper's 1.3B model trains for 170k+ steps before stability mechanisms kick in. At 135M / 17.5k steps, naive looped is competitive enough that the extra complexity hurts more than it helps.

Comparison with sibling MoE

My brother built HobbyLM — a 500M MoE on the same infrastructure. For apples-to-apples comparison, I ran naive looped 135M on the same FineWeb data:

Model	Architecture	Tokens	Val loss
LoopLM-135M (mine)	Dense looped	4.6B	3.95
HobbyLM-130M MoE (bro)	Sparse MoE	10B	3.30

Dense looped loses to MoE at this scale/budget. Sparse MoE is more sample-efficient. Not surprising but now I have the data to confirm it.

SFT results (bonus)

Fine-tuned on Alpaca 52k using Lightning AI's free H200. Took 6 minutes (bf16 on H200 is insane).

Before SFT:

After SFT:

Improvement in format, not in facts. At 135M / 4.6B tokens, SFT teaches format, not knowledge. The model still hallucinates — that's a base model capacity problem, not a fine-tuning problem.

What I learned

On Parcae: Small-scale reproductions of large-scale papers are dangerous. The paper's key contribution (stability at 170k+ steps) is invisible at hobby budgets. Naive looped is a legitimate architecture for anyone training sub-1B models.

On MoE vs looped: At matched parameter count and token budget, MoE wins on sample efficiency. Looped models need more tokens to show their advantage, or need to be much bigger to amortize the loop cost.

On debugging: When 3 independent LLMs (me, ChatGPT 5.5, Gemini) all agree on a fix and it makes things worse — the paper's regime assumption is probably wrong, not your code.

On SFT: H200 on Lightning AI is free (2 hours/month) and runs 6 minutes of SFT for free. Use it. Colab Free disconnects at 3 hours. Don't use it for long jobs.

On honest publishing: val 3.95 is not impressive. The architecture exploration is. Shipping anyway with full documentation of what failed is more valuable than hiding failures.

Stack

Training: Modal (H100s), Lightning AI (H200 for SFT)
Framework: PyTorch, HuggingFace Transformers
Optimizer: Muon (matrices) + AdamW (rest)
Data: FineWeb via kjj0/fineweb10B-gpt2 shards
Infra forked from: github.com/harishsg993010/HobbyLM (my brother's 500M MoE project)
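
A minimal sketch of the "Muon (matrices) + AdamW (rest)" split, working on parameter names and shapes only. The rule shown (2-D hidden-layer matrices to Muon; embeddings, output head, norms, and biases to AdamW) is the common convention and an assumption here, not a copy of the repo's routing. The parameter names and the 50257 vocabulary size are hypothetical; 1024 and 2816 are the d_model and FFN width quoted above.

```python
def route_params(named_shapes):
    """Split parameter names into a Muon group and an AdamW group by shape.

    Assumed rule: 2-D weight matrices in hidden layers go to Muon; embeddings,
    the output head, norms, and biases go to AdamW.
    """
    muon, adamw = [], []
    for name, shape in named_shapes:
        is_matrix = len(shape) == 2
        is_embedding_or_head = "embed" in name or "lm_head" in name
        if is_matrix and not is_embedding_or_head:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw


params = [  # hypothetical parameter names
    ("embed.weight", (50257, 1024)),
    ("loop.attn.q_proj.weight", (1024, 1024)),
    ("loop.ffn.up.weight", (2816, 1024)),
    ("loop.norm.weight", (1024,)),
    ("lm_head.weight", (50257, 1024)),
]
muon_names, adamw_names = route_params(params)
print("Muon: ", muon_names)
print("AdamW:", adamw_names)
```

In a real training script each list would be used to build one optimizer, and both optimizers would be stepped every iteration.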

Happy to answer questions about any part of this. The code is fully open, reproducible, and documented.

u/Hariharanms — 6 days ago

▲ 1 r/pytorch+1 crossposts

Autonomous Brain Tumor MRI Classification (95% Accuracy / ResNet50 Backbone) - Deployed via PyTorch by a 14-Year-Old Independent Researcher

Hello Reddit Community,

I am excited to officially share my independent research and live deployment for autonomous neuro-oncology diagnostics. For the past two months, I have been engineering deep learning computer vision pipelines using PyTorch to tackle a major clinical challenge: Glioma and Meningioma look alike on MRI, and the two classes are frequently mistaken for each other in the confusion matrix.

By applying transfer learning with a ResNet50 backbone, combined with a custom multi-layer classifier head and cost-sensitive class weighting, the model successfully separated tissue classes with near-identical gray-level features.
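
One common way to set cost-sensitive class weights is inverse class frequency. The sketch below shows that scheme with hypothetical training counts; the post does not say which weighting formula or class counts were actually used.

```python
def balanced_class_weights(counts):
    """weight_c = total / (num_classes * count_c), so rarer classes weigh more."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


# Hypothetical training-set counts, not the author's dataset.
train_counts = {"glioma": 1000, "meningioma": 1500, "normal": 2500}
weights = balanced_class_weights(train_counts)
for label, w in weights.items():
    print(f"{label}: {w:.3f}")
```

Weights like these are typically passed to a weighted cross-entropy loss (in PyTorch, the `weight` argument of `torch.nn.CrossEntropyLoss`), in the same class order as the model's output logits.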

📊 Evaluation Metrics (Validated rigorously on 1,600 unseen clinical images):

- Overall Test Accuracy: 95%

- Glioma Precision: 1.00 (Zero false-positive alarms)

- Meningioma Recall: 0.99 (near-perfect sensitivity)

- Normal Tissue Protection: 0.99 Recall (Ensuring diagnostic patient safety against false negatives)
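
For anyone checking numbers like these, per-class precision and recall can be read off a confusion matrix as sketched below. The matrix is made up for illustration and is not the author's result; it is chosen to show that a precision of 1.00 for a class says nothing about how many samples of that class were missed.

```python
def precision_recall(confusion, labels):
    """confusion[i][j] = number of samples with true class i predicted as class j."""
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted_k = sum(row[k] for row in confusion)   # column sum
        actual_k = sum(confusion[k])                      # row sum
        metrics[label] = {
            "precision": tp / predicted_k if predicted_k else 0.0,
            "recall": tp / actual_k if actual_k else 0.0,
        }
    return metrics


labels = ["glioma", "meningioma", "normal"]
confusion = [          # made-up counts, not the author's results
    [80, 18, 2],
    [0, 99, 1],
    [0, 1, 99],
]
metrics = precision_recall(confusion, labels)
for label in labels:
    print(label, metrics[label])
```

In this toy matrix glioma has perfect precision but a recall of only 0.80, so both numbers need to be reported per class.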

I have packaged the finalized weights and successfully deployed a secure, live production environment on Hugging Face for instant expert audit and real-time medical image rendering.

As a 14-year-old independent researcher, my ultimate ambition is to push the boundaries of computational biomedicine and data science. I would be deeply grateful to receive your engineering feedback, architecture critiques, or any suggestions on how to further scale this pipeline!

🌐 Live Production System: https://github.com/IsraBouchentouf-researcher

💻 Open-Source Codebase Audit: https://github.com/IsraBouchentouf-researcher

Thank you for your time and professional guidance!

u/isra-bouchentouf — 8 days ago

▲ 8 r/pytorch+1 crossposts

ScratchTorch - PyTorch, but implemented from scratch using NumPy

I was just trying to learn about AI and thought the best way would be to learn by actually building and implementing, rather than just reading docs.

I have implemented most of the tensor operations and can build a CNN using the library alone. It's not optimised yet, and I was wondering if you have any suggestions on how I can make it better, and on what I could build next that would help me learn.

Here's the link. Open to suggestions and criticism, thanks!

https://github.com/rishit836/neural-network-from-scratch/tree/main/ScratchTorch

u/GamerMePro — 9 days ago

▲ 9 r/pytorch+1 crossposts

Is streaming LLM weights from SSD → RAM → GPU a practical way to train or run models larger than VRAM?

I came across a project called AethelStream that proposes virtualizing model weights by streaming them layer-by-layer from SSD to RAM to GPU instead of loading the entire model into VRAM.

The idea is to overlap I/O and computation so only the layer currently being executed lives in VRAM, while the rest stays on disk or in RAM. It also uses activation recomputation to reduce memory usage during training.
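
The overlap of I/O and computation can be sketched as a simple prefetch loop. This is not AethelStream's code: `load_layer` and `run_layer` are hypothetical stand-ins for reading weights from disk and running a layer on the GPU, and a background thread plays the role of the I/O engine.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(index):
    """Stand-in for reading one layer's weights from SSD or RAM (hypothetical)."""
    return {"index": index, "scale": index + 1}


def run_layer(weights, x):
    """Stand-in for the GPU computation of one layer."""
    return x * weights["scale"]


def streamed_forward(x, num_layers):
    """Run layers in order while prefetching the next layer's weights."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(num_layers):
            weights = pending.result()                     # wait for layer i to arrive
            if i + 1 < num_layers:
                pending = pool.submit(load_layer, i + 1)   # overlap I/O with compute
            x = run_layer(weights, x)
            del weights   # at most the current and the prefetched layer are held
    return x


result = streamed_forward(1, 4)
print(result)
```

The scheme only hides the I/O cost when loading a layer takes no longer than computing the previous one; otherwise the loop stalls on `pending.result()`.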

On paper, it sounds like an interesting way to make experimentation with larger models possible on consumer GPUs.

I'm curious what people here think:

- Is this technically feasible at scale?

- Would PCIe/NVMe bandwidth become the main bottleneck?

- How does this compare with approaches like DeepSpeed ZeRO, FSDP, or vLLM?

- Are there existing projects doing something similar?
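
For the bandwidth question, a back-of-envelope calculation helps frame the discussion. Everything here is an assumption for illustration: a hypothetical 7B-parameter model in fp16, roughly 7 GB/s sustained NVMe reads, and roughly 32 GB/s for PCIe 4.0 x16 at its theoretical peak. Real sustained throughput is usually lower.

```python
def stream_seconds(num_params, bytes_per_param, bandwidth_gb_s):
    """Time to move every weight once at a sustained bandwidth (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / (bandwidth_gb_s * 1e9)


params = 7e9        # hypothetical 7B-parameter model
fp16_bytes = 2

nvme_s = stream_seconds(params, fp16_bytes, 7.0)    # assumed ~7 GB/s NVMe read
pcie_s = stream_seconds(params, fp16_bytes, 32.0)   # assumed ~32 GB/s PCIe 4.0 x16 peak

print(f"SSD -> RAM: {nvme_s:.2f} s per full pass over the weights")
print(f"RAM -> GPU: {pcie_s:.2f} s per full pass over the weights")
```

Under these assumptions every forward pass that streams all weights from SSD costs about 2 seconds of pure I/O, so the approach is only competitive if compute per layer is long enough to hide that, or if most weights are served from RAM instead.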

I'd love to hear opinions from people who've worked on LLM infrastructure.

u/Adam161000 — 14 days ago

▲ 4 r/pytorch+1 crossposts

Simvascular-VMR-Numpy-Data-Processing-for-Machine-Learning

It extracts the features of the models in the SimVascular VMR database, together with their simulation results, via data mining, saves them, and adds further features computed with VMTK.

https://github.com/ix-46-S/Simvascular-VMR-Numpy-Data-Processing-for-Machine-Learning

u/ix46 — 11 days ago

▲ 19 r/pytorch+4 crossposts

I built (using Claude) a 35-stage course where you reimplement PyTorch from scratch: no autograd libraries allowed

I kept noticing that I could use PyTorch fine but couldn't actually explain what .backward() does under the hood. So I built a project-based curriculum to fix that for myself, and cleaned it up enough to share.

The idea: you rebuild a deep learning framework from zero, one concept at a time. The only libraries you're allowed are NumPy (for forward array math — never to compute a gradient for you), Matplotlib, and pytest. No torch, no autograd, no micrograd. The rule is: you don't get to import a concept until you've built it by hand in an earlier stage. You are the autodiff library.

How it's structured — 35 stages, each a folder with exactly 3 files:

README.md — the intuition, the key gradient equations, a video or two to watch, and one unambiguous exercise
code.py — a skeleton: full interfaces, docstrings, and TODOs, but no working bodies
test.py — pytest tests, including numerical gradient checks (central differences) so you know your backward pass is correct, not just plausible
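
A minimal sketch of a central-difference gradient check, in plain Python. This is not copied from the course's test files: the function names and the example function f(x, y) = x²y + y are made up for illustration.

```python
def numerical_grad(f, xs, eps=1e-5):
    """Central-difference estimate of df/dx_i for a scalar function of a list."""
    grads = []
    for i in range(len(xs)):
        plus = list(xs)
        plus[i] += eps
        minus = list(xs)
        minus[i] -= eps
        grads.append((f(plus) - f(minus)) / (2 * eps))
    return grads


def f(xs):
    x, y = xs
    return x * x * y + y


def analytic_grad(xs):
    """Hand-derived gradient: df/dx = 2xy, df/dy = x^2 + 1."""
    x, y = xs
    return [2 * x * y, x * x + 1]


point = [3.0, -2.0]
num = numerical_grad(f, point)
ana = analytic_grad(point)
max_err = max(abs(n - a) for n, a in zip(num, ana))
print(num, ana, max_err)
```

A test would assert that `max_err` stays below a small tolerance; a sign error or a missing chain-rule term in the hand-written backward pass shows up as a large mismatch.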

You fill in code.py until pytest goes green, then move to the next stage. Crucially, each stage imports and extends the code you wrote in earlier stages — so the framework genuinely grows under your hands instead of being 35 disconnected toy scripts.

The arc:
scalar backprop → a reverse-mode autodiff engine → an N-dimensional Tensor → layers, losses, optimizers (SGD/momentum/Adam) → a real training loop → BatchNorm/Dropout → CNNs (Conv2d via im2col, with the backward derived by hand) → attention → a full Transformer → a Vision Transformer → packaging it all into a small PyTorch-like framework → capstone projects.

By the end you've written every gradient and every chain-rule accumulation yourself.
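
To give a sense of where the first stages lead, here is a minimal scalar reverse-mode autodiff sketch supporting only addition and multiplication. It is written for this post as an illustration and is not the course's reference solution; the `Value` class and its attribute names are hypothetical.

```python
class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._local_grads = local_grads   # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Visit nodes in reverse topological order so that a node's grad is
        # complete before it is pushed on to its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad   # chain rule, accumulated


x, y = Value(3.0), Value(-2.0)
z = x * x * y + y     # dz/dx = 2xy = -12, dz/dy = x^2 + 1 = 10
z.backward()
print(z.data, x.grad, y.grad)
```

The `+=` in the inner loop is the chain-rule accumulation: `x` and `y` each feed into the result along more than one path, and their gradients are the sum over those paths.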

It's free and open source. Feedback very welcome — especially if you work through a stage and find something unclear or a test that feels off.

👉 https://github.com/roiamiel1/Build-Deep-Learning-From-Scratch

u/VeterinarianLow6908 — 13 days ago