r/pytorch
New Academic Research: “Zombies in Alternate Realities: The Afterlife of Domain Names in DNS Integrations”
Interesting paper on a fairly under-discussed issue in DNS: what happens to expired or repurposed domain names that remain embedded in DNS dependencies across systems. The core finding is that these “orphaned” or changed domains can persist in resolution paths and integrations long after their original context is gone, creating real security and reliability implications.
My take: this becomes even more relevant in modern AI systems, where agents, tools, plugins, and third-party APIs are rapidly stitched together. In that environment, domain names and DNS-level dependencies can quietly extend the AI supply chain attack surface in ways that are easy to overlook.
Built an open source GPU bottleneck analyzer for PyTorch/CUDA. Looking for honest feedback
I've been building an open source tool called Fournex that turns Nsight Compute output into specific, evidence-backed optimization suggestions for CUDA kernels.
What it does
You give it an NCU profile (or a PTX file), and it:
- classifies bottlenecks from hardware-counter evidence
- ranks issues by severity
- generates concrete optimization recommendations tied directly to the metrics that triggered them
What it currently detects
- Uncoalesced global memory access (sectors/request ratio)
- L1/L2 cache thrashing
- Memory bandwidth saturation
- Tensor core underutilization
- Warp stall patterns:
  - barrier stalls
  - memory throttle
  - scoreboard stalls
- Low issue-slot utilization
- Register pressure / spills (via PTX static analysis)
Concrete example
I tested it on a deliberately broken GEMM kernel with four planted flaws:
- stride-K uncoalesced access
- no shared memory tiling
- FP32 only execution (tensor cores idle)
- unnecessary __syncthreads() calls inside the reduction loop
It correctly identified all four and recommended:
- improving memory coalescing
- adding shared memory tiling
- enabling AMP / tensor core usage
- removing unnecessary barriers
Each recommendation includes:
- the exact metric that triggered it
- why the metric matters
- numbered remediation steps
Workflow
# Analyze existing Nsight Compute CSV output
frx profile --ncu profile.csv
# Or let frx run NCU for you
# (Linux only, may require sudo for hardware counters)
frx profile -- ./my_binary
# Static PTX analysis
frx profile --ptx kernel.ptx
On Windows, you can export the CSV from Nsight Compute and pass it to:
frx profile --ncu profile.csv
No GPU is required at analysis time.
One thing I'm intentionally trying not to do
I don't want this to become an LLM wrapper that generates plausible sounding optimization advice.
Every recommendation is triggered by explicit thresholds on measured hardware counters. If the metric evidence isn't present, the recommendation doesn't fire.
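To make the "thresholds on measured counters" idea concrete, here is a minimal sketch of a rule table where a recommendation fires only when its metric is present and crosses an explicit threshold. It is not Fournex's actual code: the metric names, threshold values, and the `Rule` and `evaluate` names are all hypothetical, chosen only for illustration. The one grounded number is that a fully coalesced 32-bit load by a warp touches 4 sectors of 32 bytes per request, so values far above that suggest uncoalesced access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    metric: str                     # key expected in the metrics dict
    fires: Callable[[float], bool]  # explicit threshold test
    advice: str


# Hypothetical metric names and thresholds, for illustration only
RULES = [
    Rule("uncoalesced_global_access", "sectors_per_request",
         lambda v: v > 8.0, "Improve memory coalescing"),
    Rule("low_issue_slot_utilization", "issue_slot_utilization_pct",
         lambda v: v < 30.0, "Increase occupancy or instruction-level parallelism"),
]


def evaluate(metrics: Dict[str, float]) -> List[dict]:
    """Return one finding per rule whose metric is present and over threshold."""
    findings = []
    for rule in RULES:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # no measured evidence, so the rule cannot fire
        if rule.fires(value):
            findings.append({"rule": rule.name, "metric": rule.metric,
                             "value": value, "advice": rule.advice})
    return findings


# Only one counter was measured, so only one rule can fire
findings = evaluate({"sectors_per_request": 32.0})
for f in findings:
    print(f"{f['rule']}: {f['metric']}={f['value']} -> {f['advice']}")
```

Each finding carries the triggering metric and its value, so a recommendation can always be traced back to the evidence behind it.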
Repo
https://github.com/jorgevee/fournex
Would appreciate feedback from people who profile CUDA workloads, whether professionally or as a hobby:
- What bottlenecks are hardest to diagnose today?
- What’s missing from existing tooling?
- Would you trust automated optimization suggestions? Under what conditions?
- What would make something like this useful in your workflow?
And if the direction seems interesting, don't be shy about starring the repo.
Personal continual learning for LLMs without GPU — position paper [OC]
I proposed two architectures for enabling LLMs to learn daily from personal interactions:
- Internal KV-Sphere Architecture (IKSA)
- Background Micro Fine-Tuning (BMFT)
Both work with zero GPU and zero catastrophic forgetting.
Full paper:
huggingface.co/spaces/Persak/continual_learning_position_paper
https://github.com/paras2l/Continual-Learning-in-Large-Language-Models-.git
Twitter thread: [ https://x.com/ParasLashkarin/status/2055644988592247081?s=20 ]
Looking for researchers to validate or disprove these ideas! — Paras Lashkari
Struggling with Overfitting on Medical Imaging Task
Hi everyone,
I’m working on a 2-class classification problem (LCA vs. RCA coronary arteries) using 2D X-ray angiograms. I’m currently stuck in a cycle of extreme overfitting and could use some advice on my training strategy.
The Setup:
- Dataset: Small (~900 training frames from ~300 unique DICOMs).
- Architecture: InceptionV3 (PyTorch).
- Input: Grayscale .npy arrays converted to 3-channel, resized to 299x299.
- Current Strategy: Transfer learning from ImageNet. I’ve tried full unfreezing and partial unfreezing (last blocks).
The Problem: My training accuracy hits ~95-99% within a few epochs, but validation accuracy peaks early (around 74-79%) and then collapses toward 30-40% as the model starts memorizing the specific textures of the training patients.
What I’ve Tried So Far:
- Normalization: Standard ImageNet mean/std (applied at load time).
- Class Weights: Handled 2:1 imbalance (LCA:RCA).
- Regularization: Added Dropout (tried 0.3 to 0.6) and Weight Decay (1e-4).
- Augmentation: Flips, 25deg rotations, and translation.
- Schedulers: ReduceLROnPlateau (factor 0.5, patience 8).
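For reference, this is how a 2:1 imbalance works out under inverse-frequency weighting. It is only a sketch: the 600/300 frame counts are assumed from the ~900 frames and the 2:1 ratio, and the `n_samples / (n_classes * class_count)` formula is a common heuristic, not necessarily what the training script in this post uses.

```python
def inverse_frequency_weights(counts):
    """Map class name -> weight using n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


# Assumed counts: ~900 frames split 2:1 between LCA and RCA
counts = {"LCA": 600, "RCA": 300}
weights = inverse_frequency_weights(counts)
print(weights)  # {'LCA': 0.75, 'RCA': 1.5}
```

Weights like these can be passed, in class-index order, as the `weight` tensor of `torch.nn.CrossEntropyLoss`.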
Would love any insights or papers you'd recommend for small-sample medical classification. Thanks!
AMD VS Nvidia for ML training
Hello everyone,
I need opinions. In my country, a new RTX 5060 (8 GB) costs about $350, a new RX 9060 XT (16 GB) about $440, and a new RTX 5060 Ti (16 GB) about $585. I'm planning to buy a GPU for ML training and inference, and the RTX 5060 Ti 16 GB is out of my budget, so I'm choosing between the 5060 and the 9060 XT. The 9060 XT has more VRAM, but the 5060 has better ML support, since CUDA is much more mature than ROCm. I will be training CNNs and small LLMs on a good amount of data. Which one should I choose? And is ROCm likely to become better optimized for ML in the future?
There's a specific kind of frustration that every developer knows. You're in the middle of something, you hit a wall, and you open the PyTorch docs. Twenty minutes later, you've read three pages, followed two rabbit holes, and you still haven't found the one line you needed.
I got tired of that. So I built something about it.
In four hours on a single GPU instance, I put together a system that lets you ask plain English questions and get answers pulled directly from real documentation — cited, grounded, no hallucination. Ask it "how do I move a model to GPU?" and it tells you .to(device), points you to exactly which file that came from, and moves on.
Here's how it went.
Result First
Before anything else, this is what it actually looks like in practice.
Q: How do I move a PyTorch model to a GPU?
Q: How do I use a tokenizer with Hugging Face Transformers?
Q: How do I use Dataloader in PyTorch?
I also built a second version of this — the same architecture, but pointed at internal office documents instead of PyTorch. HR policies, IT procedures, and finance reimbursement guides.
An employee asks, "How do I request annual leave?" and gets a cited answer in under 2 seconds. Same idea, completely different world.
Q: How do I request annual leave?
Q: How do I submit a travel reimbursement?
Q: Who should I contact for IT support?
Both versions. One afternoon. One GPU. This becomes genuinely useful anywhere people are tired of manually searching through documentation — whether that’s developers jumping between hundreds of pages to find a single method, teams building internal assistants that understand their own codebase or company policies, or new hires trying to onboard into an unfamiliar framework, tool, or organization without constantly asking someone else for help.
The Concept
This pattern is called RAG — Retrieval-Augmented Generation. The name is a mouthful, but the idea is simple: instead of asking a language model to answer from memory (where it might hallucinate), you first retrieve the relevant text from a real source, then ask the model to generate an answer based only on what was retrieved.
It's the difference between asking someone to guess an answer and handing them the right page of a textbook first.
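Here's the full flow: the question is embedded, FAISS retrieves the top-k most similar chunks, those chunks are placed in the prompt as context, and the LLM answers from that context and returns its sources.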
The key insight: the LLM never has to know PyTorch from memory. It only has to read what you hand it. That's what keeps the answers grounded and the sources honest.
Step by Step
1. Setup
Everything ran on a single instance on a cloud GPU platform. One GPU. No cluster. No expensive infrastructure. That matters: it means this is something you can actually replicate.
| Component | Value |
|---|---|
| GPU | NVIDIA RTX 5090, 32 GB VRAM |
| CUDA | 13.0 |
| Framework | PyTorch |
| Cost | $0.38 / hour |
| Region | Singapore-A |
2. Data
Developer Assistant: I pulled the actual source repositories — not a curated sample, the real thing:
git clone --depth 1 https://github.com/huggingface/transformers.git
git clone --depth 1 https://github.com/pytorch/pytorch.git
- 884 corpus files across both repos
- ~6.2 million characters of raw text
- 9,192 chunks after splitting
That's a realistic knowledge base. Not a demo dataset with 20 files. The kind of scale where retrieval actually has to work.
Office Knowledge Assistant: For the internal office version, I generated a structured synthetic dataset simulating a real company's internal documentation:
- 300 documents across HR, IT, Finance, Operations, and Admin
- Topics: leave policy, remote work, reimbursement, VPN access, onboarding, and more
- ~600 chunks after splitting
Smaller scale, but deliberately structured to mirror the messy reality of how internal company knowledge actually lives — spread across departments, sometimes overlapping, never perfectly organized.
3. Collect and Prepare the Documents
For the developer assistant, this meant cloning the repos. For the office assistant, generating the document set. Either way, the output is the same: a folder of raw text files representing your knowledge domain.
The preprocessing step strips noise, normalizes whitespace, and converts everything to clean UTF-8 text. Nothing fancy — just making sure the data is consistent before it gets split up.
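# prepare_corpus.py (simplified version)
import os
import re

def clean_text(text):
    # Stand-in for the real cleaning step (an assumption: the original helper is not shown).
    # Drops trailing spaces on each line and collapses runs of blank lines.
    text = re.sub(r'[ \t]+\n', '\n', text)
    return re.sub(r'\n{3,}', '\n\n', text).strip()

def prepare_corpus(source_dir, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    files_processed = 0
    total_chars = 0
    for root, _, files in os.walk(source_dir):
        for fname in files:
            if fname.endswith(('.rst', '.md', '.txt')):
                src_path = os.path.join(root, fname)
                with open(src_path, 'r', encoding='utf-8', errors='ignore') as f:
                    text = f.read()
                # Clean and normalize
                text = clean_text(text)
                # Flatten the relative path into the file name so files that share
                # a name (e.g. README.md) do not overwrite each other
                rel = os.path.relpath(src_path, source_dir)
                out_path = os.path.join(output_dir, rel.replace(os.sep, '__'))
                with open(out_path, 'w', encoding='utf-8') as f:
                    f.write(text)
                files_processed += 1
                total_chars += len(text)
    print(f"Prepared corpus files: {files_processed}")
    print(f"Total characters: {total_chars:,}")
    print(f"Output directory: {output_dir}")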
Output when you run it:
Prepared corpus files: 884
Total characters: 6,264,627
Output directory: /root/dev_doc_rag/corpus
3. Chunking
You can't embed an entire 50-page document as a single vector — the signal gets lost. You split it into chunks, small enough to be semantically focused, large enough to contain a complete thought.
The important detail: overlapping chunks. If you split cleanly every 512 words (the function below counts whitespace-separated words, not model tokens), you'll sometimes cut a sentence right in the middle of the answer. Overlap means each chunk shares some content with its neighbors, so nothing falls through the cracks.
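def chunk_text(text, chunk_size=512, overlap=50):
    # Sizes are in whitespace-separated words, not model tokens
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for i in range(0, len(words), step):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks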
Result: 9,192 chunks from 884 files. Each chunk is one searchable unit.
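To see the overlap concretely, here is a toy run of the same sliding-window idea with small numbers. The `chunk_words` helper and the 20 numbered "words" are made up for illustration. With `chunk_size=8` and `overlap=3` the window advances 5 words at a time, so each chunk starts with the last 3 words of the previous one.

```python
def chunk_words(words, chunk_size, overlap):
    """Split a list of words into windows of chunk_size that overlap by `overlap` words."""
    step = chunk_size - overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]


# Toy input: 20 numbered "words" so the overlap is easy to see
words = [f"w{i}" for i in range(20)]
chunks = chunk_words(words, chunk_size=8, overlap=3)

for n, chunk in enumerate(chunks):
    print(n, chunk[0], "...", chunk[-1], f"({len(chunk)} words)")

# The tail of one chunk is the head of the next
print(chunks[0][-3:], chunks[1][:3])
```

The final window is shorter than the others (5 words here) because it simply takes whatever is left at the end of the text.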
4. Embedding
Every chunk gets converted into a dense vector — a list of numbers that represents its semantic meaning. Chunks that mean similar things will have similar vectors, even if they use different words. That's what makes semantic search work.
FAISS (Facebook AI Similarity Search) stores all those vectors and makes it fast to find the closest matches to any new query.
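from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
# Load embedding model
# Note: all-MiniLM-L6-v2 truncates inputs longer than 256 word pieces by default,
# so text beyond that point in a 512-word chunk does not influence its vector
model = SentenceTransformer('all-MiniLM-L6-v2')
# Embed all chunks (`chunks` is the list produced by chunk_text above)
embeddings = model.encode(chunks, show_progress_bar=True)
embeddings = np.array(embeddings).astype('float32')
# Build an exact (brute-force) L2 index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
print(f"Indexed {index.ntotal} chunks")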
Indexing 9,192 chunks on the RTX 5090 took ~13.43 seconds. Once the index is built, it lives in memory and queries hit it in milliseconds.
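For intuition about what the index does at query time: `IndexFlatL2` is an exhaustive index that compares the query against every stored vector and ranks by squared Euclidean distance. The sketch below reimplements that idea in plain Python on made-up 3-dimensional vectors. The `l2_search` name and the data are illustrative only, and the real library does the same work in optimized native code.

```python
def l2_search(vectors, query, k):
    """Exact nearest-neighbour search by squared Euclidean distance."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Made-up "embeddings": vectors 0 and 1 point in almost the same direction
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]

distances, indices = l2_search(vectors, query, k=2)
print(indices)  # [0, 1]
```

The returned indices play the same role as `indices[0]` in the FAISS call: they are positions in the chunk list, which is why the chunk list and the index must stay in the same order.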
5. Query Processing
When a user asks a question, the same embedding model converts it to a vector, FAISS finds the top-k most similar chunks, and those chunks get handed to the LLM as context.
def answer_question(query, index, chunks, chunk_sources, model, llm, k=5):
    # chunk_sources[i] is the file that chunks[i] came from
    # Embed the query
    query_embedding = model.encode([query]).astype('float32')
    # Retrieve top-k chunks (FAISS pads with -1 when fewer than k vectors exist)
    distances, indices = index.search(query_embedding, k)
    hits = [i for i in indices[0] if i != -1]
    relevant_chunks = [chunks[i] for i in hits]
    # Build prompt
    context = "\n\n".join(relevant_chunks)
    prompt = f"""Answer the following question based only on the provided context.
Context:
{context}
Question: {query}
Answer:"""
    # Generate answer (llm is any object exposing a generate(prompt) method)
    response = llm.generate(prompt)
    sources = [chunk_sources[i] for i in hits]
    return response, sources
The LLM doesn't browse the internet. It doesn't guess. It reads what FAISS found and answers from that. That's the whole trick.
The Performance
Developer Documentation Assistant
| Metric | Result |
|---|---|
| Indexing time | 13.43 seconds |
| Query latency | 2.3 – 2.6 seconds |
| Files indexed | 884 |
| Total chunks | 9,192 |
Office Knowledge Assistant
| Metric | Result |
|---|---|
| Indexing time | 8.35 seconds |
| Query latency | 0.15 – 1.96 seconds |
| Files indexed | 300 |
| Total chunks | ~600 |
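The office assistant is faster. Its index is smaller, so there are fewer vectors to search, though with index lookups taking milliseconds in both systems, most of the gap likely comes from generation time. The developer assistant handles a roughly 15x larger set of chunks and still responds in under 3 seconds. Both are interactive. Neither requires a cluster.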
Honest Warnings
- Smaller models drift. When I used a lighter LLM for generation, the answers occasionally padded themselves with unnecessary detail or made small inferential leaps that weren't in the source text. Bigger models stay closer to the retrieved content.
- Similar documents confuse retrieval. If you have 10 files that all describe the same leave policy with slightly different wording, FAISS might return 5 of them as top-k for one query. The answer might be fine, but the sources feel redundant.
- Synthetic data has limits. The office assistant ran on documents I generated to simulate company policies. Real internal documents are messier — inconsistent formats, missing context, ambiguous wording. The system would need more careful preprocessing in a real deployment.
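One possible mitigation for the redundant-sources issue above is to filter near-duplicate chunks after retrieval. This is a sketch of an idea, not something either system described here does: the `jaccard` and `drop_near_duplicates` helpers, the 0.7 threshold, and the sample sentences are all made up for illustration.

```python
def jaccard(a, b):
    """Word-set overlap between two strings, from 0.0 (disjoint) to 1.0 (identical sets)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def drop_near_duplicates(ranked_chunks, threshold=0.7):
    """Keep chunks in rank order, skipping any that closely match one already kept."""
    kept = []
    for chunk in ranked_chunks:
        if all(jaccard(chunk, prev) < threshold for prev in kept):
            kept.append(chunk)
    return kept


# Made-up retrieval results, best match first
ranked = [
    "Employees may request annual leave through the HR portal at least two weeks in advance.",
    "Employees may request annual leave through the HR portal at least two weeks ahead.",
    "VPN access requires a ticket to the IT helpdesk.",
]
filtered = drop_near_duplicates(ranked)
for chunk in filtered:
    print(chunk)
```

Retrieving more than k candidates and then filtering down to k would keep the context window full while cutting the repetition.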
Both systems I built are deliberately simple. A developer doc assistant. An office knowledge base. But the same four steps — collect, chunk, embed, query — apply to a much wider surface area than that.
Think about what "documents your team is tired of searching" looks like in different contexts:
- Legal teams have contracts, clauses, and precedents. Instead of a lawyer spending an hour locating a specific indemnification clause across 200 past contracts, a RAG system retrieves it in seconds.
- Support teams have ticket histories, resolution logs, and product manuals. A RAG assistant that indexes past resolved tickets can suggest answers to new ones automatically, cutting handling time.
- Research teams have papers, notes, and literature reviews. Ask "what did the 2023 papers say about attention mechanisms in vision transformers?" and get a synthesized answer with citations, instead of manually rereading 40 PDFs.
- Onboarding is a particularly compelling one. Every company has a mountain of documentation that new hires need to absorb in their first few weeks. Instead of burying them in Notion pages, give them a system they can just ask. The knowledge is already there — it just needs to be made queryable.
The architecture doesn't change. The embedding model doesn't change. FAISS doesn't change. What changes is the folder of documents you point it at.
That's the part I find genuinely interesting about this — it's a general-purpose tool dressed up as a specific solution. Once you understand the pipeline, you start seeing document retrieval problems everywhere.
Four hours. One GPU. Two working systems.
The developer documentation assistant handles 884 real files from PyTorch and Hugging Face, answers in under 3 seconds, and cites its sources. The office assistant handles 300 internal policy documents across 5 departments and responds in under 2 seconds.
Neither of these is rocket science. The pieces — FAISS, sentence transformers, a language model — are all open and well-documented. What this project is really about is putting them together in the right order and pointing them at a real problem.
If you're sitting on a pile of documentation that people in your team are tired of searching through manually, this is the pattern you want. The setup cost is one afternoon. The payoff is a system that keeps working after you've moved on to the next thing.
That's a trade I'll take every time.
I'm reproducing a published paper's hybrid Gabor + CNN architecture in PyTorch. The original implementation is in TensorFlow. My reproduction consistently lands 3 to 4 pp below the paper's reported test accuracy on DermaMNIST (73-74% vs. the paper's 77.01%). I'd like to know which cross-framework differences are most likely to cause this gap.
The paper is Ahmed et al., "A Lightweight Hybrid Gabor Deep Learning Approach", IJCV 2026 (DOI: 10.1007/s11263-025-02658-2). The architecture is a fixed Gabor filter bank front-end followed by a small CNN with one SE block, one residual block, and three FC layers, about 340k parameters in total. I've already tried different sigma_factor values (1.0 vs. 1.2), multiple random seeds (42, 0, 123), and different sigma values for the LPF and HPF channels, but none of these closed the gap.
Does anyone have an idea how to reach at least 76%, close to the paper's number? I want to add improvements on top of the baseline and measure the difference, so I'd really appreciate any advice on how to fix this.
Here is an example from one epoch. I've also noticed that the test accuracy is lower than the validation accuracy. Am I doing something wrong?
[ 47/100] Train: 75.70% Val: 76.07% Best: 76.97% Loss: 0.6827
[paper] test acc = 0.7382
Code example:
python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedGaborFrontEnd(nn.Module):
    def __init__(self, scales=(0.10, 0.20, 0.40), orientations=(4, 4, 4),
                 sigma_factor=1.0, input_size=224, output_size=56):
        super().__init__()
        # Build Gabor parameters (fixed buffers, not learnable)
        sigmas, thetas, freqs, kernel_sizes = [], [], [], []
        for f, o in zip(scales, orientations):
            sigma = sigma_factor / (math.pi * f)
            N = 2 * int(math.floor(3 * sigma)) + 1
            for k in range(o):
                sigmas.append(sigma)
                thetas.append(math.pi * k / o)
                freqs.append(f)
                kernel_sizes.append(N)
        # ... build real/imag kernels with zero-mean + L2 normalization ...

    def forward(self, x):
        # Convert RGB to grayscale
        if x.shape[1] != 1:
            x = 0.299 * x[:, 0:1] + 0.587 * x[:, 1:2] + 0.114 * x[:, 2:3]
        real = F.conv2d(x, self.real_kernels, padding=self.max_kernel_size // 2)
        imag = F.conv2d(x, self.imag_kernels, padding=self.max_kernel_size // 2)
        magnitude = torch.sqrt(real ** 2 + imag ** 2 + 1e-8)
        lpf = F.conv2d(x, self.lpf_kernel, padding=self.lpf_pad)
        hpf = F.conv2d(x, self.hpf_kernel, padding=self.hpf_pad)
        feats = torch.cat([magnitude, lpf, hpf], dim=1)
        feats = F.avg_pool2d(feats, 4, 4)  # 224 → 56
        return feats

# Standard backbone follows: SE → Conv-BN-ReLU → MaxPool → ResBlock → Dropout → GAP → FC × 3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5
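For anyone comparing against the TensorFlow version, it may help to see what the two formulas in the constructor actually produce. The sketch below applies the same `sigma` and `N` expressions to the default scales for both `sigma_factor` values mentioned in the post. The `gabor_kernel_size` helper name is made up; the formulas are copied from the code above.

```python
import math


def gabor_kernel_size(freq, sigma_factor=1.0):
    """Return (sigma, N) using the same formulas as the constructor above."""
    sigma = sigma_factor / (math.pi * freq)
    n = 2 * int(math.floor(3 * sigma)) + 1
    return sigma, n


scales = (0.10, 0.20, 0.40)
for sf in (1.0, 1.2):
    sizes = [gabor_kernel_size(f, sf)[1] for f in scales]
    print(f"sigma_factor={sf}: kernel sizes {sizes}")

sizes_default = [gabor_kernel_size(f)[1] for f in scales]    # [19, 9, 5]
sizes_wide = [gabor_kernel_size(f, 1.2)[1] for f in scales]  # [23, 11, 5]
```

Moving `sigma_factor` from 1.0 to 1.2 enlarges the two coarser kernels but leaves the f = 0.40 kernel at 5 taps, so the finest scale keeps the same spatial support in both settings.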
Free open cohort covering LLM inference, PyTorch training loops, DDP/FSDP, and capstone product — May–June 2026. No fees, no applications.
Sharing this for anyone in the community looking to go from conceptual understanding to hands-on implementation.
First Break AI is a structured, free cohort built around a public roadmap. The curriculum:
• Local model inference
• Tokenization, attention mechanisms, KV cache internals
• PyTorch training fundamentals
• Distributed training — DDP and FSDP
• Weights & Biases for experiment tracking
• Hugging Face ecosystem
• Shipping a complete AI product (capstone)
Tools used: PyTorch, Hugging Face, Modal, W&B, Cursor, Claude Code, GitHub.
Everything is in the open — roadmap, checklist, setup guide, lessons. Community-driven with weekly office hours (Fridays, 9–10 PM IST) on Discord.
Cohort runs 1 May – 30 June 2026. Already live.
For people tired of high-level tutorials and wanting to actually implement — this is worth your time. Search First Break AI on YouTube for the intro video.