r/pytorch

▲ 4 r/pytorch+1 crossposts

Has anyone quantified the actual compute waste from training divergence at scale? Trying to understand how common rollback and restart really is in practice.

reddit.com
u/Radianis — 3 days ago
▲ 21 r/pytorch+17 crossposts

New Academic Research: “Zombies in Alternate Realities: The Afterlife of Domain Names in DNS Integrations”

Interesting paper on a fairly under-discussed issue in DNS: what happens to expired or repurposed domain names that remain embedded in DNS dependencies across systems. The core finding is that these “orphaned” or changed domains can persist in resolution paths and integrations long after their original context is gone, creating real security and reliability implications.

My take: this becomes even more relevant in modern AI systems, where agents, tools, plugins, and third-party APIs are rapidly stitched together. In that environment, domain names and DNS-level dependencies can quietly extend the AI supply chain attack surface in ways that are easy to overlook.

Paper: https://arxiv.org/abs/2605.06880

u/VincentADAngelo — 4 days ago
▲ 22 r/pytorch+1 crossposts

Built an open source GPU bottleneck analyzer for PyTorch/CUDA. Looking for honest feedback

I've been building an open source tool called Fournex that turns Nsight Compute output into specific, evidence-backed optimization suggestions for CUDA kernels.

What it does

You give it an NCU profile (or a PTX file), and it:

  • classifies bottlenecks from hardware-counter evidence
  • ranks issues by severity
  • generates concrete optimization recommendations tied directly to the metrics that triggered them

What it currently detects

  • Uncoalesced global memory access (sectors/request ratio)
  • L1/L2 cache thrashing
  • Memory bandwidth saturation
  • Tensor core underutilization
  • Warp stall patterns:
    • barrier stalls
    • memory throttle
    • scoreboard stalls
  • Low issue-slot utilization
  • Register pressure / spills (via PTX static analysis)

Concrete example

I tested it on a deliberately broken GEMM kernel with four planted flaws:

  1. stride-K uncoalesced access
  2. no shared memory tiling
  3. FP32 only execution (tensor cores idle)
  4. unnecessary __syncthreads() calls inside the reduction loop

It correctly identified all four and recommended:

  • improving memory coalescing
  • adding shared memory tiling
  • enabling AMP / tensor core usage
  • removing unnecessary barriers

Each recommendation includes:

  • the exact metric that triggered it
  • why the metric matters
  • numbered remediation steps

Workflow

# Analyze existing Nsight Compute CSV output
frx profile --ncu profile.csv

# Or let frx run NCU for you
# (Linux only, may require sudo for hardware counters)
frx profile -- ./my_binary

# Static PTX analysis
frx profile --ptx kernel.ptx

On Windows, you can export the CSV from Nsight Compute and pass it to:

frx profile --ncu profile.csv

No GPU is required at analysis time.

One thing I'm intentionally trying not to do

I don't want this to become an LLM wrapper that generates plausible sounding optimization advice.

Every recommendation is triggered by explicit thresholds on measured hardware counters. If the metric evidence isn't present, the recommendation doesn't fire.
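As a rough illustration of what "explicit thresholds on measured hardware counters" can look like, here is a minimal sketch. The metric names, threshold values, and the `Rule`/`diagnose` helpers are all hypothetical; they are not taken from Fournex or from the Nsight Compute CSV schema. The only hardware fact assumed is that a fully coalesced 32-bit load by one warp touches 4 sectors of 32 bytes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    metric: str                     # counter the rule depends on (hypothetical name)
    fires: Callable[[float], bool]  # explicit threshold test
    advice: str


# Hypothetical rules and thresholds, for illustration only.
RULES: List[Rule] = [
    Rule(
        name="uncoalesced_global_access",
        metric="sectors_per_request",
        # 32 threads x 4 bytes = 128 bytes = 4 sectors when fully coalesced.
        fires=lambda v: v > 8.0,
        advice="Reorder indexing so adjacent threads read adjacent addresses.",
    ),
    Rule(
        name="tensor_core_underutilization",
        metric="tensor_pipe_utilization_pct",
        fires=lambda v: v < 5.0,
        advice="Check whether the kernel can use FP16/BF16 tensor core paths.",
    ),
]


def diagnose(metrics: Dict[str, float]) -> List[dict]:
    """Return one finding per rule whose metric is present and over threshold."""
    findings = []
    for rule in RULES:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # no measured evidence, so the rule cannot fire
        if rule.fires(value):
            findings.append(
                {"rule": rule.name, "metric": rule.metric,
                 "value": value, "advice": rule.advice}
            )
    return findings


profile = {"sectors_per_request": 31.5}  # made-up measurement
for finding in diagnose(profile):
    print(finding["rule"], finding["value"])
```

Because each finding carries the metric and the measured value, the recommendation can always be traced back to the evidence that triggered it, and a missing counter simply produces no finding.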

Repo

https://github.com/jorgevee/fournex

Would appreciate feedback from people who profile CUDA workloads, professionally or as a hobby:

  • What bottlenecks are hardest to diagnose today?
  • What’s missing from existing tooling?
  • Would you trust automated optimization suggestions? Under what conditions?
  • What would make something like this useful in your workflow?

And if the direction seems interesting, don't be shy about starring the repo.

u/jvbiz — 4 days ago
▲ 3 r/pytorch+1 crossposts

Personal continual learning for LLMs without GPU — position paper [OC]

I proposed two architectures for enabling LLMs to learn daily from personal interactions:

Internal KV-Sphere Architecture (IKSA)

Background Micro Fine-Tuning (BMFT)

Both work with zero GPU and zero catastrophic forgetting.

Full paper:

huggingface.co/spaces/Persak/continual_learning_position_paper

https://github.com/paras2l/Continual-Learning-in-Large-Language-Models-.git

https://zenodo.org/records/20234100?token=eyJhbGciOiJIUzUxMiIsImlhdCI6MTc3ODkzODg2NiwiZXhwIjoyNTM1NzUzNTk5fQ.eyJpZCI6IjY4OTMxZTBmLWM0YTQtNDg2ZC05OGJhLTk0ZDQ2ZTVjNDJkOSIsImRhdGEiOnt9LCJyYW5kb20iOiJkYmQwM2ExZjk4ZmZiNWM1NTFlNDZlN2QzNTY5ZTA0YiJ9.n5VgFWg5SsC5L6KvZGZhsSK\_lll4syeSnvghb6uyAKBAZiOyd15Ov\_Ps6awungKdfVsdEE0GuvOWggspQuQDfw

Twitter thread: [ https://x.com/ParasLashkarin/status/2055644988592247081?s=20 ]

Looking for researchers to validate or disprove these ideas! — Paras Lashkari

u/Early-Importance8582 — 5 days ago
▲ 3 r/pytorch+1 crossposts

Struggling with Overfitting on Medical Imaging Task

Hi everyone,

I’m working on a 2-class classification problem (LCA vs. RCA coronary arteries) using 2D X-ray angiograms. I’m currently stuck in a cycle of extreme overfitting and could use some advice on my training strategy.

The Setup:

  • Dataset: Small (~900 training frames from ~300 unique DICOMs).
  • Architecture: InceptionV3 (PyTorch).
  • Input: Grayscale .npy arrays converted to 3-channel, resized to 299x299.
  • Current Strategy: Transfer learning from ImageNet. I’ve tried full unfreezing and partial unfreezing (last blocks).

The Problem: My training accuracy hits ~95-99% within a few epochs, but validation accuracy peaks early (around 74-79%) and then collapses toward 30-40% as the model starts memorizing the specific textures of the training patients.

What I’ve Tried So Far:

  1. Normalization: Standard ImageNet mean/std (applied at load time).
  2. Class Weights: Handled 2:1 imbalance (LCA:RCA).
  3. Regularization: Added Dropout (tried 0.3 to 0.6) and Weight Decay (1e-4).
  4. Augmentation: Flips, 25deg rotations, and translation.
  5. Schedulers: ReduceLROnPlateau (factor 0.5, patience 8).
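The post does not say how the class weights were computed. One common heuristic is inverse-frequency weighting, sketched below under that assumption. The label counts (600 LCA, 300 RCA) are hypothetical, chosen only to match the stated 2:1 ratio and roughly 900 training frames, and `balanced_class_weights` is an illustrative helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label list with a 2:1 LCA:RCA imbalance.
labels = ["LCA"] * 600 + ["RCA"] * 300
weights = balanced_class_weights(labels)
print(weights)  # the minority class (RCA) receives the larger weight
```

In PyTorch, weights like these would typically be passed, in class-index order, as the `weight` tensor of `torch.nn.CrossEntropyLoss`.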

Would love any insights or papers you'd recommend for small-sample medical classification. Thanks!

u/Future-Structure-296 — 6 days ago

AMD VS Nvidia for ML training

Hello everyone,

I need opinions. In my country, a new RTX 5060 (8 GB) costs almost $350, a new RX 9060 XT (16 GB) costs almost $440, and a new RTX 5060 Ti (16 GB) costs almost $585. I'm planning to buy a GPU for ML training and inference, and I'm a bit confused. I know that CUDA is much more mature than ROCm, and I don't have the budget for the RTX 5060 Ti 16 GB, so I'm choosing between the 5060 and the 9060 XT. The 9060 XT has more VRAM than the 5060, but the 5060 has better support for ML. I will be training CNNs and small LLMs on a good amount of data. Which one should I choose? Is there any chance that ROCm becomes better optimized for ML in the future?

u/Specialist-Zone-8296 — 7 days ago
▲ 1 r/pytorch+1 crossposts

There's a specific kind of frustration that every developer knows. You're in the middle of something, you hit a wall, and you open the PyTorch docs. Twenty minutes later, you've read three pages, followed two rabbit holes, and you still haven't found the one line you needed.

I got tired of that. So I built something about it.

In four hours on a single GPU instance, I put together a system that lets you ask plain English questions and get answers pulled directly from real documentation — cited, grounded, no hallucination. Ask it "how do I move a model to GPU?" and it tells you .to(device), points you to exactly which file that came from, and moves on.

Here's how it went.

Result First

Before anything else, this is what it actually looks like in practice.

Q: How do I move a PyTorch model to a GPU?

https://preview.redd.it/vyl685rl8wzg1.png?width=1280&format=png&auto=webp&s=4e907f63366d625f716b68575fd5a42e269e04cb

Q: How do I use a tokenizer with Hugging Face Transformers?

https://preview.redd.it/qmf7j0sq9wzg1.png?width=1280&format=png&auto=webp&s=b733b846938047989e0f254b9fb1e411330c0cfa

Q: How do I use Dataloader in PyTorch?

https://preview.redd.it/cpny6k8r9wzg1.png?width=1280&format=png&auto=webp&s=da5f2bb7c9ed9e53d3e1ff4088f5c25156e6d8be

I also built a second version of this — the same architecture, but pointed at internal office documents instead of PyTorch. HR policies, IT procedures, and finance reimbursement guides.

An employee asks, "How do I request annual leave?" and gets a cited answer in under 2 seconds. Same idea, completely different world.

Q: How do I request annual leave?

https://preview.redd.it/5m6dedju9wzg1.png?width=1280&format=png&auto=webp&s=96b22caf0fb282af286421e08d594a3e7c7c9038

Q: How do I submit a travel reimbursement?

https://preview.redd.it/tivazy7v9wzg1.jpg?width=1280&format=pjpg&auto=webp&s=31d162aef9b12430ccd1c96c943021f6ac5f2a1c

Q: Who should I contact for IT support?

https://preview.redd.it/t0supqhx9wzg1.png?width=1280&format=png&auto=webp&s=86ba7fff79775a14e44793ea4a0da1936c49765a

Both versions. One afternoon. One GPU. This becomes genuinely useful anywhere people are tired of manually searching through documentation, whether that’s developers jumping between hundreds of pages to find a single method, teams building internal assistants that understand their own codebase or company policies, or new hires trying to onboard into an unfamiliar framework, tool, or organization without constantly asking someone else for help.

The Concept

This pattern is called RAG — Retrieval-Augmented Generation. The name is a mouthful, but the idea is simple: instead of asking a language model to answer from memory (where it might hallucinate), you first retrieve the relevant text from a real source, then ask the model to generate an answer based only on what was retrieved.

It's the difference between asking someone to guess an answer and handing them the right page of a textbook first.

Here's the full flow:

https://preview.redd.it/nqrma0jy9wzg1.png?width=628&format=png&auto=webp&s=121bb05e3a3040e9c8161d1339b968acb10c81f7

The key insight: the LLM never has to know PyTorch from memory. It only has to read what you hand it. That's what keeps the answers grounded and the sources honest.

Step by Step

1. Setup

Everything ran on a single instance from a cloud GPU platform. One GPU. No cluster. No expensive infrastructure. That matters: it means this is something you can actually replicate.

https://preview.redd.it/knx64osz9wzg1.png?width=1057&format=png&auto=webp&s=78cb2d609f26bdf929580804fa6d4d5516c5d662

GPU         NVIDIA RTX 5090, 32GB VRAM
CUDA        13.0
Framework   PyTorch
Cost        $0.38 / hour
Region      Singapore-A

2. Data

Developer Assistant: I pulled the actual source repositories — not a curated sample, the real thing:

git clone --depth 1 https://github.com/huggingface/transformers.git
git clone --depth 1 https://github.com/pytorch/pytorch.git

  • 884 corpus files across both repos
  • ~6.2 million characters of raw text
  • 9,192 chunks after splitting

That's a realistic knowledge base. Not a demo dataset with 20 files. The kind of scale where retrieval actually has to work.

https://preview.redd.it/6uwuycx0awzg1.png?width=499&format=png&auto=webp&s=97a84989b930eab5aa9f7359a44bfba459c4afa0

Office Knowledge Assistant: For the internal office version, I generated a structured synthetic dataset simulating a real company's internal documentation:

  • 300 documents across HR, IT, Finance, Operations, and Admin
  • Topics: leave policy, remote work, reimbursement, VPN access, onboarding, and more
  • ~600 chunks after splitting

Smaller scale, but deliberately structured to mirror the messy reality of how internal company knowledge actually lives — spread across departments, sometimes overlapping, never perfectly organized.

https://preview.redd.it/f6ivxnx1awzg1.png?width=547&format=png&auto=webp&s=82286915d69c1d43b9415a826064c142ec1ff161

3. Collect and Prepare the Documents

For the developer assistant, this meant cloning the repos. For the office assistant, generating the document set. Either way, the output is the same: a folder of raw text files representing your knowledge domain.

The preprocessing step strips noise, normalizes whitespace, and converts everything to clean UTF-8 text. Nothing fancy — just making sure the data is consistent before it gets split up.

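# prepare_corpus.py (simplified version)
import os
import re

def clean_text(text):
    # Assumed minimal cleaner; the original helper was not shown.
    text = text.replace('\r\n', '\n')
    text = re.sub(r'[ \t]+', ' ', text)      # collapse runs of spaces and tabs
    text = re.sub(r'\n{3,}', '\n\n', text)   # collapse runs of blank lines
    return text.strip()

def prepare_corpus(source_dir, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    files_processed = 0
    total_chars = 0

    for root, _, files in os.walk(source_dir):
        for fname in files:
            if fname.endswith(('.rst', '.md', '.txt')):
                src_path = os.path.join(root, fname)
                with open(src_path, 'r', encoding='utf-8', errors='ignore') as f:
                    text = f.read()

                # Clean and normalize
                text = clean_text(text)

                # Flatten the relative path into the output name so that files
                # sharing a basename (e.g. many README.md) do not overwrite each other
                rel_path = os.path.relpath(src_path, source_dir)
                out_path = os.path.join(output_dir, rel_path.replace(os.sep, '__'))
                with open(out_path, 'w', encoding='utf-8') as f:
                    f.write(text)

                files_processed += 1
                total_chars += len(text)

    print(f"Prepared corpus files: {files_processed}")
    print(f"Total characters: {total_chars:,}")
    print(f"Output directory: {output_dir}")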

Output when you run it:

Prepared corpus files: 884 
Total characters: 6,264,627
Output directory: /root/dev_doc_rag/corpus

3. Chunking

You can't embed an entire 50-page document as a single vector — the signal gets lost. You split it into chunks, small enough to be semantically focused, large enough to contain a complete thought.

The important detail: overlapping chunks. If you split cleanly every 512 units, you'll sometimes cut a sentence right in the middle of the answer. Overlap means each chunk shares some content with its neighbors, so nothing falls through the cracks. Note that the function below splits on whitespace, so its chunk_size and overlap count words, not model tokens.

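def chunk_text(text, chunk_size=512, overlap=50):
    # Sizes are counted in whitespace-separated words, not model tokens
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []

    for i in range(0, len(words), step):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)

    return chunks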

Result: 9,192 chunks from 884 files. Each chunk is one searchable unit.
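To make the overlap concrete, here is a toy version of the same sliding window on ten words, with a window of 4 and an overlap of 1. The helper name `chunk_words` and the tiny sizes are illustrative choices, not part of the original pipeline.

```python
def chunk_words(words, chunk_size, overlap):
    """Slide a window of chunk_size words, advancing chunk_size - overlap each step."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]


words = [f"w{i}" for i in range(10)]  # w0 .. w9
chunks = chunk_words(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
# ['w0', 'w1', 'w2', 'w3']
# ['w3', 'w4', 'w5', 'w6']
# ['w6', 'w7', 'w8', 'w9']
# ['w9']
```

Each window starts on the last word of the previous one, which is the shared content the overlap buys. The final window can be a short tail that its predecessor already covers, so the chunk count slightly overstates the amount of unique text.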

4. Embedding

Every chunk gets converted into a dense vector — a list of numbers that represents its semantic meaning. Chunks that mean similar things will have similar vectors, even if they use different words. That's what makes semantic search work.

FAISS (Facebook AI Similarity Search) stores all those vectors and makes it fast to find the closest matches to any new query.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed all chunks
embeddings = model.encode(chunks, show_progress_bar=True)
embeddings = np.array(embeddings).astype('float32')

# Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

print(f"Indexed {index.ntotal} chunks")

Indexing 9,192 chunks on the RTX 5090 took ~13.43 seconds. Once the index is built, it lives in memory and queries hit it in milliseconds.
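For intuition about what a flat index computes, here is a dependency-free sketch of exhaustive nearest-neighbor search using squared L2 distance, which is the quantity `IndexFlatL2` reports. This is an illustration of the idea rather than how FAISS is implemented, and the function names and 2-D vectors are made up for the example.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Compare the query against every stored vector and keep the k closest."""
    scored = ((squared_l2(vec, query), idx) for idx, vec in enumerate(vectors))
    best = heapq.nsmallest(k, scored)
    return [(idx, dist) for dist, idx in best]


# Toy 2-D "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dimensions.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = flat_l2_search(vectors, query=[0.9, 0.1], k=2)
print(hits)  # index 1 is closest, then index 0
```

The search is exact because nothing is approximated or skipped: every query is compared against every stored vector, which is perfectly workable at a few thousand chunks.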

5. Query Processing

When a user asks a question, the same embedding model converts it to a vector, FAISS finds the top-k most similar chunks, and those chunks get handed to the LLM as context.

def answer_question(query, index, chunks, chunk_sources, model, llm, k=5):
    # chunk_sources[i] is the source file of chunks[i]
    # llm is any object exposing generate(prompt) -> str

    # Embed the query
    query_embedding = model.encode([query]).astype('float32')

    # Retrieve top-k chunks; FAISS pads with -1 when fewer than k results exist
    distances, indices = index.search(query_embedding, k)
    hits = [i for i in indices[0] if i != -1]
    relevant_chunks = [chunks[i] for i in hits]

    # Build prompt
    context = "\n\n".join(relevant_chunks)
    prompt = f"""Answer the following question based only on the provided context.

Context:
{context}

Question: {query}

Answer:"""

    # Generate answer
    response = llm.generate(prompt)
    sources = [chunk_sources[i] for i in hits]

    return response, sources

The LLM doesn't browse the internet. It doesn't guess. It reads what FAISS found and answers from that. That's the whole trick.

The Performance

Developer Documentation Assistant

Metric          Result
Indexing time   13.43 seconds
Query latency   2.3 – 2.6 seconds
Files indexed   884
Total chunks    9,192

Office Knowledge Assistant

Metric          Result
Indexing time   8.35 seconds
Query latency   0.15 – 1.96 seconds
Files indexed   300
Total chunks    ~600

The office assistant is faster because its index is smaller, with fewer vectors to search. The developer assistant handles roughly 15x as many chunks (9,192 vs ~600) and still responds in under 3 seconds. Both are interactive. Neither requires a cluster.

https://preview.redd.it/a87mfw33awzg1.png?width=565&format=png&auto=webp&s=a0919280ac3a94404028af8eb48ae4533f7a11c8

Honest Caveats

  • Smaller models drift. When I used a lighter LLM for generation, the answers occasionally padded themselves with unnecessary detail or made small inferential leaps that weren't in the source text. Bigger models stay closer to the retrieved content.
  • Similar documents confuse retrieval. If you have 10 files that all describe the same leave policy with slightly different wording, FAISS might return 5 of them as top-k for one query. The answer might be fine, but the sources feel redundant.
  • Synthetic data has limits. The office assistant ran on documents I generated to simulate company policies. Real internal documents are messier — inconsistent formats, missing context, ambiguous wording. The system would need more careful preprocessing in a real deployment.
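One possible mitigation for the redundant-sources issue, not something the systems above implement, is to drop near-duplicate chunks after retrieval and before building the prompt. The sketch below uses the standard library's `difflib.SequenceMatcher`; the helper name, the 0.9 threshold, and the sample texts are arbitrary assumptions for illustration.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(ranked_chunks, threshold=0.9):
    """Keep chunks in rank order, skipping any that closely match one already kept."""
    kept = []
    for chunk in ranked_chunks:
        is_new = all(
            SequenceMatcher(None, chunk, other).ratio() < threshold
            for other in kept
        )
        if is_new:
            kept.append(chunk)
    return kept


# Made-up retrieval results: the first two differ by one character.
retrieved = [
    "Employees may request annual leave through the HR portal.",
    "Employees may request annual leave through the HR portal!",
    "VPN access requires a ticket to the IT service desk.",
]
print(drop_near_duplicates(retrieved))
```

Character-level similarity only catches near-verbatim copies. Policies that say the same thing in different words would need a comparison on the embeddings instead.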

Both systems I built are deliberately simple. A developer doc assistant. An office knowledge base. But the same four steps — collect, chunk, embed, query — apply to a much wider surface area than that.

Think about what "documents your team is tired of searching" looks like in different contexts:

  • Legal teams have contracts, clauses, and precedents. Instead of a lawyer spending an hour locating a specific indemnification clause across 200 past contracts, a RAG system retrieves it in seconds.
  • Support teams have ticket histories, resolution logs, and product manuals. A RAG assistant that indexes past resolved tickets can suggest answers to new ones automatically, cutting handling time.
  • Research teams have papers, notes, and literature reviews. Ask "what did the 2023 papers say about attention mechanisms in vision transformers?" and get a synthesized answer with citations, instead of manually rereading 40 PDFs.
  • Onboarding is a particularly compelling one. Every company has a mountain of documentation that new hires need to absorb in their first few weeks. Instead of burying them in Notion pages, give them a system they can just ask. The knowledge is already there — it just needs to be made queryable.

The architecture doesn't change. The embedding model doesn't change. FAISS doesn't change. What changes is the folder of documents you point it at.

That's the part I find genuinely interesting about this — it's a general-purpose tool dressed up as a specific solution. Once you understand the pipeline, you start seeing document retrieval problems everywhere.

Four hours. One GPU. Two working systems.

The developer documentation assistant handles 884 real files from PyTorch and Hugging Face, answers in under 3 seconds, and cites its sources. The office assistant handles 300 internal policy documents across 5 departments and responds in under 2 seconds.

Neither of these is rocket science. The pieces — FAISS, sentence transformers, a language model — are all open and well-documented. What this project is really about is putting them together in the right order and pointing them at a real problem.

If you're sitting on a pile of documentation that people in your team are tired of searching through manually, this is the pattern you want. The setup cost is one afternoon. The payoff is a system that keeps working after you've moved on to the next thing.

That's a trade I'll take every time.

u/Civil_Lingonberry678 — 9 days ago
▲ 14 r/pytorch+2 crossposts

I'm reproducing a published paper's hybrid Gabor + CNN architecture in PyTorch. The original implementation is in TensorFlow. My reproduction consistently lands ~4 pp below the paper's reported test accuracy on DermaMNIST (73-74% vs paper's 77.01%). I'd like to know which cross-framework differences are most likely to cause this gap.

Ahmed et al., "A Lightweight Hybrid Gabor Deep Learning Approach", IJCV 2026 (DOI: 10.1007/s11263-025-02658-2). The architecture is a fixed Gabor filter bank front-end followed by a small CNN with one SE block, one residual block, and three FC layers, ~340k parameters in total. So far I've tried different sigma_factor values (1.0 vs 1.2), multiple random seeds (42, 0, 123), and different sigma values for the LPF and HPF channels, but none of these closed the gap.

Any ideas on how to reach at least 76%, close to the paper's result? I want to add improvements on top and measure the difference, so I'd really appreciate advice on how to fix this or what to try next.

Here is one example epoch. I've also noticed that the test accuracy is lower than the validation accuracy. Am I doing something wrong?

[  47/100] Train: 75.70%  Val: 76.07%  Best: 76.97%  Loss: 0.6827

[paper] test acc = 0.7382

Code example:

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedGaborFrontEnd(nn.Module):
    def __init__(self, scales=(0.10, 0.20, 0.40), orientations=(4, 4, 4),
                 sigma_factor=1.0, input_size=224, output_size=56):
        super().__init__()
        # Build Gabor parameters (fixed buffers, not learnable)
        sigmas, thetas, freqs, kernel_sizes = [], [], [], []
        for f, o in zip(scales, orientations):
            sigma = sigma_factor / (math.pi * f)
            N = 2 * int(math.floor(3 * sigma)) + 1
            for k in range(o):
                sigmas.append(sigma)
                thetas.append(math.pi * k / o)
                freqs.append(f)
                kernel_sizes.append(N)
        # ... build real/imag kernels with zero-mean + L2 normalization ...

    def forward(self, x):
        # Convert RGB to grayscale
        if x.shape[1] != 1:
            x = 0.299 * x[:, 0:1] + 0.587 * x[:, 1:2] + 0.114 * x[:, 2:3]
        real = F.conv2d(x, self.real_kernels, padding=self.max_kernel_size // 2)
        imag = F.conv2d(x, self.imag_kernels, padding=self.max_kernel_size // 2)
        magnitude = torch.sqrt(real ** 2 + imag ** 2 + 1e-8)
        lpf = F.conv2d(x, self.lpf_kernel, padding=self.lpf_pad)
        hpf = F.conv2d(x, self.hpf_kernel, padding=self.hpf_pad)
        feats = torch.cat([magnitude, lpf, hpf], dim=1)
        feats = F.avg_pool2d(feats, 4, 4)  # 224 → 56
        return feats

# Standard backbone follows: SE → Conv-BN-ReLU → MaxPool → ResBlock → Dropout → GAP → FC × 3

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5
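
When comparing against the TensorFlow implementation, it may help to print the filter-bank geometry that the constructor's formulas imply, since kernel size and sigma are easy places for the two frameworks to drift apart. The sketch below only re-evaluates the two expressions from `__init__` (`sigma = sigma_factor / (pi * f)` and `N = 2 * floor(3 * sigma) + 1`); the helper name `gabor_bank_layout` is made up, and nothing here reconstructs the elided kernel-building code.

```python
import math


def gabor_bank_layout(scales=(0.10, 0.20, 0.40), orientations=(4, 4, 4),
                      sigma_factor=1.0):
    """Evaluate sigma and kernel size per scale, using the post's formulas."""
    layout = []
    for f, o in zip(scales, orientations):
        sigma = sigma_factor / (math.pi * f)
        size = 2 * int(math.floor(3 * sigma)) + 1
        layout.append({"freq": f, "sigma": sigma, "kernel_size": size, "filters": o})
    return layout


for factor in (1.0, 1.2):
    layout = gabor_bank_layout(sigma_factor=factor)
    sizes = [row["kernel_size"] for row in layout]
    total = sum(row["filters"] for row in layout)
    print(f"sigma_factor={factor}: kernel sizes {sizes}, Gabor filters {total}")
# sigma_factor=1.0: kernel sizes [19, 9, 5], Gabor filters 12
# sigma_factor=1.2: kernel sizes [23, 11, 5], Gabor filters 12
```

The forward pass pads with `max_kernel_size // 2`, which suggests the smaller kernels are embedded in the largest size, but that part of the code is elided, so treat it as an inference. If the reference implementation truncates at a different multiple of sigma or rounds differently, the kernel sizes will not match these.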
reddit.com
u/Plane_Stick8394 — 14 days ago

Free open cohort covering LLM inference, PyTorch training loops, DDP/FSDP, and capstone product — May–June 2026. No fees, no applications.

Sharing this for anyone in the community looking to go from conceptual understanding to hands-on implementation.
First Break AI is a structured, free cohort built around a public roadmap. The curriculum:
• Local model inference
• Tokenization, attention mechanisms, KV cache internals
• PyTorch training fundamentals
• Distributed training — DDP and FSDP
• Weights & Biases for experiment tracking
• Hugging Face ecosystem
• Shipping a complete AI product (capstone)
Tools used: PyTorch, Hugging Face, Modal, W&B, Cursor, Claude Code, GitHub.
Everything is in the open — roadmap, checklist, setup guide, lessons. Community-driven with weekly office hours (Fridays, 9–10 PM IST) on Discord.
Cohort runs 1 May – 30 June 2026. Already live.
For people tired of high-level tutorials and wanting to actually implement — this is worth your time. Search First Break AI on YouTube for the intro video.

u/ShinchanBoo08 — 13 days ago