
r/deeplearning

Microsoft economist's hot take: Let it burn first
Fine-tuned RAG: teaching your retriever which embedding dimensions matter (+11% hit rate, +12% completeness, +9% faithfulness)
Hi all,
I developed a fine-tuned retrieval head (neural net) for RAG that transforms query embeddings before retrieval, so the system learns which embedding dimensions actually matter for your corpus — rather than weighting them all equally as standard cosine similarity does.
The problem
In any domain-specific corpus, some embedding dimensions are highly predictive for matching queries to the right passages, while others are effectively noise. Standard cosine similarity can't distinguish between the two, so retrieval gets pulled toward superficially similar but substantively irrelevant passages. The fine-tuned RAG is designed to prevent exactly that.
How it works
- Synthetic question generation — An LLM generates multiple questions per chunk in the corpus, for which the answers can be inferred from that chunk. This creates a dataset of question-chunk pairs (QA-pairs). These are embedded using an embedding model and divided into a training and validation set.
- Neural net training — A lightweight neural network using MNR loss is trained on the training QA-pairs. After each epoch, the model is evaluated on the validation set by measuring retrieval hit rate: the proportion of validation questions for which the correct chunk appears in the top-5 retrieved results. Retrieval works by embedding the question, passing it through the neural network to transform the embedding, and ranking all corpus chunks by cosine similarity to the transformed embedding.
Through this mechanism, the projection head learns which embedding dimensions are informative for finding the best chunks for these types of questions, and which are irrelevant.
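To make the two steps above concrete, here is a minimal NumPy sketch of the pieces involved: a query transform, an MNR-style loss, and hit rate at k. It is not the code from the repo. The function names, the toy 3-dimensional embeddings, the `scale` value, and the hand-picked `mask` matrix (standing in for a trained network) are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mnr_loss(queries, chunks, scale=20.0):
    """Multiple negatives ranking loss: chunk i is the positive for query i,
    and every other chunk in the batch acts as a negative.
    The scale value is an assumed temperature, not taken from the post."""
    sims = scale * l2_normalize(queries) @ l2_normalize(chunks).T
    sims = sims - sims.max(axis=1, keepdims=True)   # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def hit_rate_at_k(queries, corpus, gold_idx, k=5):
    """Fraction of queries whose gold chunk is among the k most similar chunks."""
    sims = l2_normalize(queries) @ l2_normalize(corpus).T
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (top_k == np.asarray(gold_idx)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy data (assumed): dimensions 0-1 carry signal, dimension 2 is misleading noise.
corpus = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [-1.0, 0.0, -2.0]])
queries = np.array([[1.0, 0.0, -2.0], [0.0, 1.0, -2.0], [-1.0, 0.0, 2.0]])
gold = [0, 1, 2]

# Hand-picked stand-in for the trained head: a linear map that drops the noisy
# dimension. A real head would learn its weights by minimizing mnr_loss.
mask = np.diag([1.0, 1.0, 0.0])
transformed = queries @ mask   # only the query is transformed, not the corpus

raw_hit = hit_rate_at_k(queries, corpus, gold, k=1)
tuned_hit = hit_rate_at_k(transformed, corpus, gold, k=1)
raw_loss = mnr_loss(queries, corpus)
tuned_loss = mnr_loss(transformed, corpus)
```

On this toy data the raw queries are pulled toward the wrong chunk by the noisy dimension, while the transformed queries rank the correct chunk first, and the MNR loss drops accordingly.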
Results
To validate the architecture, I used the Legal RAG Bench dataset as a proof of concept — evaluating on 100 held-out test questions.
Retrieval Hit Rate:
- The fine-tuned retriever achieves an 82% hit rate (k = 20), compared to 71% for the standard cosine retriever: an 11 percentage point improvement on the 100 test questions. The correct chunk appears in the top 20 results more often when the query embedding is first transformed through the fine-tuned retriever.
Answer quality (LLM-as-judge, 1–5 scale across 6 metrics):
- Outperforms traditional RAG (top-k cosine sim) on all 6 metrics
- Largest gains in completeness (+12%) and faithfulness (+9%)
- Consistent improvement across every metric — not just isolated gains — suggesting that retrieving more relevant context has a broad positive effect on answer quality
Code and full write-up available on GitHub: https://github.com/BartAmin/Fine-tuned-RAG
Could an AI 1000x smarter than us manipulate us?
Is this good for a pre-trained (it is training) model?
BTW the model is 15M params (not an ordinary transformer) and is pre-training as we speak. It has only done about 800 of its max 20k training steps.
No SFT or anything. I just wanted to see how the model's stability holds up. If you have ever worked on pre-training LLMs, is this what you see as well? (Mathematically it makes sense to me, as acc is about 0.17, but I want to be sure. This run is expensive in compute, and as an independent researcher I have more to lose if it fails, so it would be a big loss for me if it did not work out.)
When testing scaled-down 1k, 10k and 100k param versions of the architecture on set patterns, the model showed high intelligence. Trained for only a couple of steps (<500), the model learned the multiplication scheme taught to it at all test sizes. The 1k variant was perfect on what it was trained on but started failing as the model input was increased to data that was held out and never shown in the training run (it did 64/100 on those unseen tests, still good considering a vanilla Transformer with ~600k params did less than that). The 10k and 100k variants showed sparks of supreme intelligence per param, outperforming on held-out patterns by up to 10M digits more than they were ever trained on. The model was trained to multiply up to 10000, and it multiplied up to `10000-(12 zeros more)` with 100% accuracy, even surpassing CPU computation, which is off by some floating-point error. Both the 10k and 100k models scored 10k/10k. Idk how, but the 100k model somehow came up with a logical explanation for addition on its own: it was able to add using multiplication.
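A side note on the CPU comparison: the floating-point drift only shows up when the reference product is computed in float64. Integer arithmetic is exact, so it makes a safer ground truth when scoring the model. A minimal sketch in plain Python, where the function name and the operands are chosen purely for illustration:

```python
def float_product_error(a: int, b: int) -> int:
    """Difference between the exact integer product and the float64 product."""
    exact = a * b                        # Python ints are arbitrary precision
    approx = int(float(a) * float(b))    # float64 keeps only 53 bits of mantissa
    return exact - approx

small = float_product_error(1234, 5678)      # operands and product fit in 53 bits
large = float_product_error(2**53 + 1, 3)    # 2**53 + 1 is not representable in float64
```

Here `small` comes out as zero, while `large` is nonzero because the first operand is already rounded when it is converted to a float.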
I really see this as something: as we speak, this 15M param model outperforms Qwen-3-4B-base on the same training data under the same hyperparameter checks.
The training dataset is ~1.05B tokens of high-quality general-domain data: science, creative writing, maths, and general school knowledge.
From what I can see, the model is a pattern recognition beast. It learns like crazy and at crazy speed. I was training the 1M param version and, you will not believe it, it learned the entire TinyStories dataset, which has about 2M rows (repetitive and close to `Once upon a time` types, I know... since LLMs are normalised output machines, "generalization" is obvious once saturation is reached). Back to the experience: it learned the format in 500 steps (not accurate or too coherent), but dammit, the model was really close to training data it never even got to see (it even guessed the next character name perfectly). Those 500 steps covered 64k samples out of the 2M.
This is why I am trying to scale as much as my budget allows and test this model. If it fails I may be a fool; I can only find that out afterwards (I may already be a stupid fool) 😄
So if you see something strange, please help me. Don't be afraid to ask questions; apart from architecture details, I can share everything.
Paperclip energy: casual vs. doomer edition
Did anyone else underestimate how much random stuff there is to learn in Generative AI?
I started learning generative AI thinking most of my time would go into understanding models.
Ended up spending time on completely different things.
One day I was reading about prompts, then embeddings, then vector databases, then RAG, then trying to understand why a model was giving weird outputs even though everything looked fine.
I also realized building something yourself feels very different from watching tutorials. I'll watch a 20 minute video and think "okay that looks straightforward", then spend the next few hours trying to figure out why something isn't working.
Not complaining or anything, I actually like it. I just didn't expect the learning process to go like this.
Curious if anyone else had the same experience or if I just went down a weird path.
I need help with assignment. I don't know how to write an essay to make it sound good. Any tips?
I need to write an essay, and to be real, I suck at this kind of stuff. I’m more into technical fields, so writing pieces where you have to express your opinions is definitely not my vibe. Any tips on how to get better? My main problem is that my sentences feel completely disconnected, like they're from different papers. I have ideas in my head and want to blend them nicely, but the final result is just a mess. I also make a lot of grammar errors, but that’s an easy fix with a couple of rounds of proofreading.
My First Youtube Video - Explaining Linear Regression from Scratch, Spelled Out
Hey guys! I have been doing ML and AI stuff for almost a year now. I have always wanted to create a YouTube channel, and wanted to share this with all of you. I explain Linear Regression, the Mean Squared Error loss function, and Gradient Descent in excruciating detail. This is my first experience with video editing and content creation, so I would love feedback on what I can improve going forward. Here is the link to the video:
https://www.youtube.com/watch?v=rJdAvnocTMQ
PS: I tried to replicate 3b1b (3 blue 1 brown)'s style of teaching. Tell me if I succeeded somewhat.
Masked Diffusion Language Models are Strong and Steerable Text-Based World Models for Agentic RL [R]
Autoregressive LLM world models factorize next-state generation left-to-right, preventing them from conditioning on globally interdependent anchors (tool schemas, trailing status fields, expected outcomes) and yielding prefix-consistent but globally incoherent rollouts. MDLMs' any-order denoising objective sidesteps this by learning every conditional direction from the same training signal. Empirically, fine-tuned MDLMs (SDAR-8B, WeDLM-8B) surpass AR baselines up to 4x their total parameter count on BLEU-1, ROUGE-L, and MAUVE across in- and out-of-domain splits, with lower Self-BLEU and higher Distinct-N confirming reduced prefix mode collapse. GRPO training on MDLM-generated rollouts shows up to +15% absolute task-success gains over AR generated training on held-out ScienceWorld, ALFWorld, and AppWorld across 1.2B–7B backbones (LFM2.5, Qwen3, Mistral) in a zero-shot transfer setting.
Dear DL researchers: how do you design your neural networks?
Genuine question,
how do you make architectural decisions such as the size of the neural network and the whole set of hyperparameters?
I get that there's brute forcing and hyperparameter search (which sometimes, really, is a LOT), and some notes in the literature on choosing activations or losses based on context, but how would one target specific design choices when starting to explore efficiently, especially the number of layers and the latent space dimensions?
I appreciate your time and will take every tip into account.
Ignore the tentacles, blame the firefighters
Congress's AI awakening: doubling every 5.5 months
[D] PINN loss functions: why physics-informed networks often fail to train
Physics-Informed Neural Networks are interesting because they break the standard ML paradigm: instead of approximating an unknown function from data alone, they exploit a known PDE constraint that the solution must satisfy. In principle this should make them converge faster and generalize better.
In practice the loss function makes them notoriously hard to train. The loss is a weighted sum of multiple terms (PDE residual, boundary conditions, initial conditions, data), each with different scales and gradient magnitudes. Several papers have characterized what goes wrong:
Wang, Teng & Perdikaris (2021) showed empirically and theoretically that during training, the gradients from different loss components become severely imbalanced. The optimizer follows whichever loss has the loudest gradient, regardless of which one matters most.
Wang, Yu & Perdikaris (2022) used Neural Tangent Kernel theory to show that the PDE residual term has much smaller eigenvalues than the boundary loss. The network learns boundaries quickly and interior physics slowly — often it never catches up.
Krishnapriyan et al. (NeurIPS 2021) demonstrated that even on simple PDEs like the convection equation, PINNs systematically fail to converge as the convection coefficient grows. This is on textbook problems with reasonable hyperparameters.
Mitigations exist (adaptive loss weighting, causal training, curriculum approaches, architectural fixes that hard-code boundary conditions) but none has fully solved the problem.
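To illustrate the scale mismatch between loss terms, here is a toy sketch, not taken from any of the papers above. It assumes the 1D problem u''(x) = -π² sin(πx) on [0, 1] with u(0) = u(1) = 0, and a hypothetical two-parameter candidate u(x) = a·sin(πx) + c in place of a neural network, so the derivatives can be written analytically. The function names and the weights `w_residual` and `w_boundary` are illustrative.

```python
import numpy as np

def pinn_loss_terms(a, c, n_points=101):
    """Residual and boundary losses for the candidate u(x) = a*sin(pi*x) + c."""
    x = np.linspace(0.0, 1.0, n_points)           # collocation points
    f = -np.pi**2 * np.sin(np.pi * x)             # right-hand side of u'' = f
    u_xx = -a * np.pi**2 * np.sin(np.pi * x)      # analytic second derivative
    residual_loss = float(np.mean((u_xx - f) ** 2))
    # Boundary condition u(0) = u(1) = 0, evaluated on the candidate.
    u_boundary = np.array([a * np.sin(0.0) + c, a * np.sin(np.pi) + c])
    boundary_loss = float(np.mean(u_boundary ** 2))
    return residual_loss, boundary_loss

def total_loss(a, c, w_residual=1.0, w_boundary=1.0):
    """Weighted sum of the two terms, as in a standard PINN objective."""
    residual_loss, boundary_loss = pinn_loss_terms(a, c)
    return w_residual * residual_loss + w_boundary * boundary_loss

# Same-sized error (0.1) in each parameter, very different loss magnitudes.
residual_term, boundary_term = pinn_loss_terms(a=1.1, c=0.1)
ratio = residual_term / boundary_term
```

With equal weights, an error of 0.1 in `a` costs far more than an error of 0.1 in `c`, because the second derivative multiplies the residual by π². Which term ends up dominating in a real PINN depends on the PDE and the network; the sketch only shows why a fixed, equal weighting is rarely neutral.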
I wrote a longer version with full references and applications here: https://cristobalsantana.substack.com/p/the-pinn-loss-function-where-physics
Curious if anyone here has dealt with these training pathologies in production and what worked for you.