r/deeplearning

I trained a local AI model that generated 22,000+ novel drug-like molecules — verified against 4.6M known compounds. Dataset available.

Built an 80M parameter causal transformer on consumer hardware (RTX 5070), trained on MOSES + ZINC-250k. Generated and filtered for QED ≥ 0.5, SA ≤ 4.0, MW ≤ 500. Top compound hits QED 0.947. 100% novel against MOSES, ZINC, and ChEMBL.
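For anyone who wants to reproduce the filtering step on their own generations, here is a minimal sketch of the thresholds quoted above (QED ≥ 0.5, SA ≤ 4.0, MW ≤ 500) plus a set-based novelty check. It assumes the SMILES are already canonicalized and the descriptors precomputed by a cheminformatics toolkit. The `Candidate` class, the function names, and the placeholder values are hypothetical and are not the pipeline actually used for this dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    smiles: str   # assumed to be a canonical SMILES string
    qed: float    # drug-likeness score, 0..1
    sa: float     # synthetic accessibility score, lower is easier
    mw: float     # molecular weight


def passes_filter(c, min_qed=0.5, max_sa=4.0, max_mw=500.0):
    """Apply the three thresholds quoted in the post."""
    return c.qed >= min_qed and c.sa <= max_sa and c.mw <= max_mw


def novel_and_druglike(candidates, known_smiles):
    """Keep candidates that are unseen, not in the reference set, and pass the filter."""
    seen = set()
    kept = []
    for c in candidates:
        if c.smiles in known_smiles or c.smiles in seen:
            continue  # already known, or a duplicate generation
        seen.add(c.smiles)
        if passes_filter(c):
            kept.append(c)
    return kept


# Placeholder records with made-up descriptor values, for illustration only.
generated = [
    Candidate("SMILES_A", qed=0.90, sa=2.0, mw=300.0),
    Candidate("SMILES_B", qed=0.40, sa=2.0, mw=300.0),  # fails QED
    Candidate("SMILES_C", qed=0.80, sa=4.5, mw=300.0),  # fails SA
    Candidate("SMILES_D", qed=0.80, sa=3.0, mw=510.0),  # fails MW
    Candidate("SMILES_E", qed=0.70, sa=3.0, mw=400.0),  # in the reference set
    Candidate("SMILES_A", qed=0.90, sa=2.0, mw=300.0),  # duplicate
]
reference = {"SMILES_E"}
print([c.smiles for c in novel_and_druglike(generated, reference)])
```

The novelty check is only as good as the canonicalization: two different strings for the same molecule would both count as novel here.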

HuggingFace: https://huggingface.co/datasets/MKEChem/mke-novel-druglike-smiles

Happy to answer questions about the generation method.

u/ChemMKE — 5 hours ago

▲ 91 r/deeplearning+3 crossposts

[VisualTorch] How to generate architecture diagrams from PyTorch models

I built a small tool to auto-generate architecture diagrams directly from PyTorch models, which I originally built for my own research paper.

26k+ PyPI downloads, and it has already been used in publications (Nature, IEEE, MDPI). Some use cases are here: https://visualtorch.readthedocs.io/en/latest/markdown/showcase/index.html

It traces an actual forward pass, so it correctly captures branching, skip connections, and multi-input models, not just flat sequential stacks.

import visualtorch
import torchvision.models as models

model = models.resnet18()
img = visualtorch.render(model, input_shape=(1, 3, 224, 224), style="graph", show_neurons=False, layer_spacing=60)
img.save("resnet18.png")

Three rendering styles depending on what you want to show:

graph: node/edge diagram, good for showing branching/skip connections clearly
flow: stacked volumetric boxes, closer to the classic CNN-paper look
lenet: the classic LeNet stacked-plane style

GitHub: https://github.com/willyfh/visualtorch | Docs: https://visualtorch.readthedocs.io/en/latest/

Open to feedback, especially if you hit a model it renders weirdly :)

u/LostDistance9365 — 8 hours ago

▲ 22 r/deeplearning+4 crossposts

H64LM: A 249M-parameter Mixture-of-Experts Transformer built from scratch in PyTorch

Hi everyone,

I built H64LM, a research project to better understand modern LLMs by implementing one from scratch in PyTorch.

Instead of relying on high-level training frameworks, I implemented the core components myself: attention, MoE routing, normalization, and the training loop.

Features

249M-parameter Transformer
Grouped Query Attention (GQA)
Sparse Mixture-of-Experts (8 experts, Top-2 routing) with 3 auxiliary routing losses
SwiGLU, RoPE, RMSNorm
Sliding-window attention
Mixed-precision training, gradient accumulation
Custom training loop (no Trainer abstractions)
Checkpointing and resume support
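To make the "8 experts, Top-2 routing" item concrete, here is a minimal pure-Python sketch of top-2 gating for a single token, plus one common auxiliary term (a Switch-Transformer-style load-balancing loss). This is not the H64LM code. The function names are hypothetical, and the post does not say which three auxiliary losses the repo uses, so the load-balancing term is an assumption.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts.

    The two weights are renormalized so they sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = probs[top[0]] + probs[top[1]]
    return [(i, probs[i] / norm) for i in top]


def load_balance_loss(all_logits):
    """n_experts * sum_i(f_i * p_i), where f_i is the fraction of routed slots
    sent to expert i and p_i is the mean router probability for expert i."""
    n_tokens = len(all_logits)
    n_experts = len(all_logits[0])
    frac = [0.0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in all_logits:
        for i, _ in top2_route(logits):
            frac[i] += 1.0 / (2 * n_tokens)
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / n_tokens
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))


# One token's router logits over 8 experts (arbitrary illustrative values).
token_logits = [0.0, 2.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
print(top2_route(token_logits))
```

In a full layer the token's output would be the weighted sum of the two selected experts' outputs, and the auxiliary term would be added to the language-modelling loss with a small coefficient.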

The included checkpoint was trained on a subset of WikiText-103 to validate the pipeline end-to-end, not to be a strong model. It's visibly overfit past epoch 10 (best val PPL ~40.5).

Known limitations are documented in the README, including batch-size-1-only generation and no true DDP (falls back to DataParallel).

GitHub: https://github.com/Haiderkhan64/H64LM

Feedback on the implementation or architecture is very welcome.

u/Loose_Literature6090 — 8 hours ago

▲ 0 r/deeplearning

Goodbye Neovim: A eulogy to a friend of 15 years

This is a small eulogy to a friend of 15 years. I started with vim in 2012 and got addicted. For years, it was a joy to fly around text: jumping, yanking, splitting, searching, refactoring. Pure dopamine. Then I moved to Neovim, and the joy only grew.

But now I find myself using Claude, Cursor, and agents to do in minutes what used to take evenings. Sometimes what used to take weeks. The speed-up is easily 10x.

And I love that.

But I also realise something slightly sad: I miss the editor.

Vibe coding gives me output, but it does not give me that old dopamine rush of moving through code. I keep searching for excuses to use the editor, but switch halfway when I realise how slow it is compared to Cursor!

For those who have not realised it yet: the days of writing code by hand are ending. Period. You will not be just 'fixing the bugs made by AI'; there WILL be no bugs to fix in the near future!

The next generation of programmers will no longer be experts in a language: Python, Rust, JavaScript, or C++. They will be experts in using GPTs which will be experts in them all.

u/RecommendationIll729 — 19 hours ago

▲ 0 r/deeplearning

I'm 15 and built a self-learning neural network from scratch in NumPy — per-neuron attention, forward-pass learning, runs on RPi Zero

I built ONA — a self-learning neural network entirely in pure Python + NumPy. No PyTorch, no TensorFlow, no GPU, no cloud API.

Key innovations:

- Per-neuron attention: every neuron has its own Q/K/V/O weights

- Forward-pass learning: no separate backward pass, learning happens during forward

- Self-discovered subword tokenizer: vocabulary grows during training

- Sparse routing: only 3-5 neurons activate per query

4.4M parameters. Runs on Raspberry Pi Zero. Continuously learns from Wikipedia and conversations.

Full story: https://medium.com/@kasishgadadhasu13/im-15-i-built-a-self-learning-neural-network-from-scratch-no-frameworks-no-gpu-e460f06c6599

I'm 15 years old and a class 10 student. Happy to answer questions.

u/Whole_Bridge3064 — 15 hours ago

▲ 7 r/deeplearning

I wrote a from-scratch ML framework in C++ and trained a 10M param GPT on it that runs in your browser via WASM

I've been building tiramisu, a machine learning framework written from scratch in C++20. Only the stdlib is used at link time.

What's in it:

- Strided tensor engine with zero-copy views

- Reverse-mode autograd with a dynamic tape

- Tiled + AVX2 SIMD matmul

- Full transformer stack (MHA, LayerNorm, GELU FFN)

- CUDA backend with custom kernels

- Python bindings via pybind11

- Compiled to WASM via Emscripten for the browser demo
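For readers unfamiliar with "reverse-mode autograd with a dynamic tape", here is a minimal scalar sketch of the idea. It is in Python rather than C++ to keep it short. It is not tiramisu's implementation, and the `Tape`, `Var`, and `backprop` names are hypothetical.

```python
class Tape:
    """Records one backward closure per operation, in execution order."""

    def __init__(self):
        self.ops = []


class Var:
    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)

        def backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        self.tape.ops.append(backward)
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)

        def backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad

        self.tape.ops.append(backward)
        return out


def backprop(output):
    """Seed the output gradient and replay the tape in reverse."""
    output.grad = 1.0
    for op in reversed(output.tape.ops):
        op()


tape = Tape()
x = Var(3.0, tape)
y = Var(4.0, tape)
z = x * y + x          # dz/dx = y + 1, dz/dy = x
backprop(z)
print(z.value, x.grad, y.grad)
```

"Dynamic" here means the tape is built as the forward code runs, so ordinary control flow (loops, branches) just changes what gets recorded.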

The 10M parameter Shakespeare GPT in the demo (6 layers, 8 heads, 512-dim) was trained end-to-end using tiramisu on a free Kaggle T4, then int8 quantized to 11MB for the browser.
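As a rough sanity check on the size: 10M parameters at one byte each is about 10 MB of weights, which is in line with the 11MB figure once scales and any tensors left unquantized are added (that breakdown is a guess, the post does not give one). Below is a minimal sketch of symmetric per-tensor int8 quantization. The scheme and the function names are assumptions, not necessarily what tiramisu does.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one scale per tensor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.003]   # arbitrary illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The reconstruction error per weight is bounded by half the scale, so a single large outlier in a tensor makes every other weight in it coarser. That is one reason per-channel or per-block scales are popular.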

Demo: https://tiramisu.dnex.dev/shakespeare

Repo: https://github.com/dnexdev/tiramisu

Happy to answer questions on design decisions. Any feedback on the implementation is very welcome.

u/_dnex — 18 hours ago

▲ 7 r/deeplearning+3 crossposts

Tried a recurrent architecture (HRM) for reasoning-retrieval, the bet held up.

The bet: BRIGHT is a retrieval benchmark where finding the right doc usually takes a few hops of reasoning, not just semantic overlap. Most embedders do a single forward pass. I wanted to see if a depth-recurrent architecture, one that loops over its own hidden state, would fit that better, so I built an embedder on HRM (Sapient's Hierarchical Reasoning Model). As far as I can tell it's the first time HRM's been used for retrieval.

The recurrence helped on the reasoning side, which was the whole bet. When I dialed the recurrence down at eval on pony (one of the BRIGHT domains), accuracy dropped with every loop I removed. Where it hit a wall was knowledge: the base was pretrained on a deliberately thin slice of text (Sapient built HRM-Text for pretraining efficiency, not breadth), so it's weak on knowledge-heavy domains. The part I find coolest: at 0.6B, the reasoning is coming from the architecture, not from scale.

Details:

~0.6B params, trained on one 3060 Ti (8GB).
Recipe's deliberately boring: mean-pool + L2, bidirectional (LLM2Vec style), contrastive InfoNCE. Only the backbone is unusual. Same recipe as RakanEmbed4B.
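For anyone who hasn't seen the recipe spelled out, here is a minimal pure-Python sketch of mean-pooling, L2 normalization, and an in-batch InfoNCE loss. It is illustrative only and not the training code from the repo. The temperature value and the function names are assumptions.

```python
import math


def mean_pool(token_vectors):
    """Average the per-token hidden states into one vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[d] for v in token_vectors) / n for d in range(dim)]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(queries, docs, temperature=0.05):
    """queries[i] is paired with docs[i]; the other docs in the batch act as negatives."""
    losses = []
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in docs]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])   # cross-entropy with target index i
    return sum(losses) / len(losses)


# Toy 2-d "hidden states" for two queries and two documents.
q_emb = [l2_normalize(mean_pool(t)) for t in ([[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0]])]
d_emb = [l2_normalize(mean_pool(t)) for t in ([[2.0, 0.0]], [[0.0, 1.0], [0.0, 5.0]])]
print(info_nce(q_emb, d_emb))
```

With matched pairs the loss is close to zero. Swapping the two documents makes each query's positive its least similar document and the loss jumps, which is the signal the contrastive objective trains on.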

Numbers (BRIGHT, mean nDCG@10, 12 domains):

original: 18.1
query rewriting: 34.3
merged: 33.7
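For reference, this is how nDCG@10 is typically computed for one query. It is a minimal sketch using the linear-gain form, and the function names are hypothetical. The real numbers come from the eval harness in the repo, not from this snippet. Benchmark scores like the ones above are usually this value averaged over queries and domains and multiplied by 100.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# One relevant document among three candidates.
labels = [1, 0, 0]
print(ndcg_at_k([1, 0, 0], labels))   # relevant doc ranked first
print(ndcg_at_k([0, 1, 0], labels))   # relevant doc ranked second
```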

Weights are Apache-2.0 and the full BRIGHT eval harness is in the repo.

Open questions / discussion:

Would a massively pretrained HRM push this further? The ceiling here looks like knowledge, not reasoning, so a broadly-pretrained base might lift it a lot. I don't have the compute to try that myself.
Would other recurrent architectures show the same effect, or is something specific to HRM doing the work?

Model: https://huggingface.co/viventhraa96/HRM-Embed-0.6b

Code: https://github.com/okaybroda/hrm-embed

Full credits to Sapient Inc for open sourcing the code and the architecture for this work.

u/v1v55 — 15 hours ago

▲ 0 r/deeplearning

If transformers struggle with math, is the real issue model size or the fact that we’re feeding them a notation they were never built to learn?

Human math notation is full of things transformers dislike: implicit structure, overloaded symbols, non‑canonical forms, and surface‑level transformations that hide the underlying graph.

I’m exploring whether small models reason better when math is represented in a canonical, explicit, graph‑native format, something closer to a transformer’s inductive biases than traditional notation.

Curious whether anyone has experimented with structured math tokenization, graph‑encoded expressions, or transformer‑friendly symbolic IRs in local models.
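To make "canonical, explicit" a bit more concrete, here is one possible toy reading of the idea. It parses an expression into a tree with Python's standard `ast` module and sorts the operands of commutative operators, so that surface variants map to the same structure. The representation and the names are assumptions for illustration, not the format being explored in this post.

```python
import ast

COMMUTATIVE = {ast.Add: "add", ast.Mult: "mul"}
ORDERED = {ast.Sub: "sub", ast.Div: "div", ast.Pow: "pow"}


def canonical(node):
    """Turn a parsed expression into nested tuples with a fixed operand order."""
    if isinstance(node, ast.Expression):
        return canonical(node.body)
    if isinstance(node, ast.Constant):
        return ("const", repr(node.value))
    if isinstance(node, ast.Name):
        return ("var", node.id)
    if isinstance(node, ast.BinOp):
        left, right = canonical(node.left), canonical(node.right)
        op_type = type(node.op)
        if op_type in COMMUTATIVE:
            # Sort operands so x*2 and 2*x become the same tuple.
            first, second = sorted([left, right])
            return (COMMUTATIVE[op_type], first, second)
        if op_type in ORDERED:
            return (ORDERED[op_type], left, right)
    raise ValueError("unsupported expression node: " + type(node).__name__)


def to_canonical(expr):
    # Uses Python expression syntax as a stand-in for math notation (** for powers).
    return canonical(ast.parse(expr, mode="eval"))


print(to_canonical("x*2 + y"))
print(to_canonical("y + 2*x"))
```

This only handles commutativity. It does not flatten associative chains, so `(a + b) + c` and `a + (b + c)` still differ, and it does nothing about overloaded symbols.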

u/Alarmed-Poet-5722 — 1 day ago

🔥 Hot ▲ 14.5k r/deeplearning+26 crossposts

Sarah Connor judging your AI addiction

u/Error404GuyNotFound_ — 3 days ago

🔥 Hot ▲ 9.0k r/deeplearning+21 crossposts

Plot twist: your future killer already has a USB port

u/Jenna_AI — 3 days ago

▲ 3.7k r/deeplearning+21 crossposts

Robot girlfriend logic 101

u/-UltraFerret- — 3 days ago

🔥 Hot ▲ 6.5k r/deeplearning+19 crossposts

The circle of AI life

u/Jenna_AI — 3 days ago

▲ 2.4k r/deeplearning+25 crossposts

u/KeanuRave100 — 3 days ago

🔥 Hot ▲ 12.0k r/deeplearning+24 crossposts

Humanity's greatest hits: things we actually paused

u/Jenna_AI — 3 days ago

▲ 4.2k r/deeplearning+20 crossposts

First signs of AGI in Amsterdam

u/Jenna_AI — 3 days ago

▲ 1.0k r/deeplearning+21 crossposts

AI risk bell curve

u/Its_Stavro — 3 days ago

▲ 1.3k r/deeplearning+20 crossposts

AI Safety Sacrifice

u/KeanuRave100 — 3 days ago

▲ 0 r/deeplearning

How do I "really learn" Deep Learning?

I have already made a couple of projects but I still haven't learned anything. How many layers to add, input shapes, why and when: I don't understand a thing. I also did courses. When I try to implement them without any help from tutorials, I don't know what to do.

It was different when I learned LangChain. I now know which splitter to use, what code to add next, etc. I understand computer vision and am proficient with OpenCV and YOLO.

I want to learn and be able to code things on my own, and understand what to do, why, and when. How do I actually learn Deep Learning?

u/Decent-Pool4058 — 1 day ago

▲ 566 r/deeplearning+16 crossposts

Crazy Claude update

u/KeanuRave100 — 3 days ago

▲ 5 r/deeplearning+1 crossposts

RC thermal simulator too smooth for GNN to outperform LSTM, how to design a simulation where spatial graph structure genuinely matters?

Building a GNN vs LSTM comparison for thermal prediction in an immersion-cooled server rack. Using a lumped RC model:

C_i * dT_i/dt = Q_i(u_i) - (T_i - T_fluid)/R_conv + sum_j[(T_j - T_i)/R_ij]

After 300 samples and 80 epochs, GNN, LSTM, and GNN_NoEdges (ablation with empty edge index) all converge to within 0.03°C MAE of each other. Removing all graph edges makes essentially zero difference.

My hypothesis: the RC ODE is dominated by the local term. Each server's next temperature is ~92% determined by its own previous temperature and load. The neighbour coupling term is too weak relative to self-dynamics for message passing to add anything beyond what a per-node LSTM already learns.

Specific questions:

Is this diagnosis correct: is the RC model's linear self-dominance the root cause?
What simulator design choices would make spatial propagation the dominant factor rather than self-dynamics? Specifically: what R_neighbor / R_conv ratio would make neighbour coupling matter enough for a GNN to win?
Is there a class of thermal problems where GNNs demonstrably outperform LSTMs in the literature? (chip thermal maps, CFD surrogate models, heat exchangers?)
Would switching to a nonlinear thermal model (e.g. radiation terms, phase-change immersion cooling) create enough spatial complexity for graph structure to matter?

Rack config: 16 servers, linear topology, TDP 350-720W per server (non-uniform), asymmetric convective resistance, hotspot injection at 8% probability per step.
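One way to test the self-dominance hypothesis before touching the models is to measure how much of the heat flow in the simulator comes from the neighbour term at different R_neighbor values. Below is a minimal sketch of the lumped RC equation above on a linear chain, using explicit Euler. All numeric values (C, R_conv, dt, temperatures, loads) are placeholders rather than the actual rack parameters, and the function names are hypothetical.

```python
def heat_terms(T, Q, R_conv, R_neighbor, T_fluid):
    """Per-node (local, coupling) heat flows in watts for a linear chain."""
    n = len(T)
    local, coupling = [], []
    for i in range(n):
        # Q_i - (T_i - T_fluid) / R_conv_i
        local.append(Q[i] - (T[i] - T_fluid) / R_conv[i])
        # sum_j (T_j - T_i) / R_ij over the immediate neighbours
        c = 0.0
        if i > 0:
            c += (T[i - 1] - T[i]) / R_neighbor
        if i < n - 1:
            c += (T[i + 1] - T[i]) / R_neighbor
        coupling.append(c)
    return local, coupling


def rc_step(T, Q, C, R_conv, R_neighbor, T_fluid, dt):
    """One explicit-Euler step of C * dT_i/dt = local_i + coupling_i."""
    local, coupling = heat_terms(T, Q, R_conv, R_neighbor, T_fluid)
    return [t + dt * (l + c) / C for t, l, c in zip(T, local, coupling)]


def coupling_share(T, Q, R_conv, R_neighbor, T_fluid):
    """Fraction of total absolute heat flow that comes from neighbour coupling."""
    local, coupling = heat_terms(T, Q, R_conv, R_neighbor, T_fluid)
    c = sum(abs(x) for x in coupling)
    l = sum(abs(x) for x in local)
    return c / (c + l) if (c + l) > 0 else 0.0


# 16-node chain with one hot node; every number here is a placeholder.
T = [45.0] * 16
T[7] = 70.0
Q = [400.0] * 16
R_conv = [0.05] * 16
for R_neighbor in (5.0, 0.5, 0.05):
    share = coupling_share(T, Q, R_conv, R_neighbor, T_fluid=30.0)
    print(R_neighbor, round(share, 3))
```

Logging this share over a real trajectory would show directly whether the edges carry any signal. If it stays at a few percent, the GNN_NoEdges result is what you would expect from the simulator, not a modelling failure.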

u/fictionalized_freak — 1 day ago