u/Amazing_Life_221 — reddlx

The typical interview style is a pretty bad filter for autistic people

I've done many interviews in the last few years. Most of them suck. I used to feel so sad about it, but now I understand it's not just my fault. The interview checks whether I can recall facts correctly. Very particular interviewers feel restrictive; usually, I can't judge what level of answer they want. Plus, I can't be highly technical verbally. I understand the concepts really well, so I explain them as if I'm talking to a 15-year-old kid, but this doesn't work with particular interviewers who want technical language and precise wording. Sometimes I freeze up, sometimes I yap about stuff I don't have any control over. I just don't understand why that happens.

But there are times when the interviewer is friendly and gives me a lot of data points to build my answer from. They understand my enthusiasm for the job, they understand my curiosity, and they also understand that I don't need to recall everything on the spot. Sadly, those interviewers are very few. I feel so frustrated that I suck at interviews even though I know a great deal about my field (that's just my interest).

u/Amazing_Life_221 — 11 days ago

▲ 16 r/computervision

I want to build/learn SLAM from scratch. Resources?

I want to build SLAM (learning on the fly). I have no clue where to start. Can you please point me to some resources?

u/Amazing_Life_221 — 12 days ago

▲ 5 r/computervision

Where am I going to use projective geometry in a real-life job?

I understand that it is important to learn it for 3D projection/reconstruction but aren’t these fields getting good DL models as well? So why should I invest so much time learning geometry?

This is a genuine question. It's pretty challenging to learn these concepts, and it requires significant time that could be invested in learning deep learning techniques (or experimentation).

Recently someone suggested to me that “classical” CV jobs would become obsolete in the next few years. That adds even more to the chaos.

I like learning these concepts and would like to invest my working time in them. But if the industry is shrinking, there would be no benefit for me (as there would be people much more experienced than me competing for the fewer jobs available).

Also a sub question:
Following the multiview geometry book is challenging, but doing only theory isn’t much fun. Can you please suggest some problems I can solve along with the learning to deepen my understanding?

u/Amazing_Life_221 — 12 days ago

▲ 7 r/computervision

How do two completely different models end up understanding the same (embedding) space?

To answer this question, I built CLIP (Contrastive Language–Image Pretraining) from scratch.

MobileNetV3 processes pixels, convolutions, spatial hierarchies, no concept of language. DistilBERT processes tokens, attention over word sequences, no concept of vision. Neither was designed with the other in mind. And yet, after training, you can encode a text query and an image into the same 256-dimensional space and they land near each other if they match. That's not obvious. That's forced.

Here's how it works:

- Every training step, both encoders project their outputs into the shared 256-dim space.
- Symmetric InfoNCE loss checks: does image_i land closest to text_i, and does text_i land closest to image_i? If not, both encoders get penalized.
- L2 normalization keeps embeddings on a unit hypersphere so dot products become cosine similarities.
- Learnable temperature controls how sharply the model separates correct pairs from wrong ones. Too soft and everything looks similar. Too sharp and gradients vanish.
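
The steps above can be sketched roughly like this. This is a minimal NumPy illustration, not the training code from the repo: the function names are made up for the example, the embeddings are random stand-ins for the projected encoder outputs, and the temperature is a fixed scalar (0.07 is just an assumed value) instead of a learned parameter.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Put every row on the unit hypersphere so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (N, D) arrays where row i of each is a matching pair.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # logits[i, j] = cos(image_i, text_j) / T
    diag = np.arange(logits.shape[0])
    # Image-to-text: each row should put its probability mass on the diagonal.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each column should put its probability mass on the diagonal.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Random stand-ins for projected embeddings (assumption: batch of 8, 256 dims).
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 256))
matched = text + 0.1 * rng.normal(size=(8, 256))  # images that sit near their text
shuffled = np.roll(matched, 1, axis=0)            # same images, wrong pairing

loss_matched = symmetric_info_nce(matched, text)
loss_shuffled = symmetric_info_nce(shuffled, text)
print(loss_matched, loss_shuffled)
```

The correctly paired batch should give a much lower loss than the shuffled one, which is the signal that pushes both encoders toward the same space.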

Both models converge on the same representation for meaning, not because they share weights or architecture, but because they're constrained by the same objective.

One thing that surprised me: removing the text-to-image direction from the loss noticeably degraded the embeddings. The symmetry isn't cosmetic. Same with temperature: it's a single learnable parameter, but it shapes the entire geometry of the space. And all of this runs on MobileNetV3 + DistilBERT on a laptop (Apple silicon, MPS)!
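
To make the temperature point concrete, here is a small sketch with made-up similarity values (none of these numbers come from the actual model). It applies a softmax at three assumed temperatures and looks at the cross-entropy gradient with respect to the logits, which is simply the softmax probabilities minus the one-hot target.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical cosine similarities of one image to four texts; index 0 is the true pair.
sims = np.array([0.9, 0.6, 0.5, 0.2])
target = np.array([1.0, 0.0, 0.0, 0.0])

def probs_and_grad(temperature):
    p = softmax(sims / temperature)
    grad = p - target  # gradient of cross-entropy with respect to the logits
    return p, np.abs(grad).max()

p_soft, g_soft = probs_and_grad(1.0)      # too soft: probabilities are nearly uniform
p_mid, g_mid = probs_and_grad(0.07)       # in between: confident, gradient still alive
p_sharp, g_sharp = probs_and_grad(0.001)  # too sharp: softmax saturates, gradient dies

for name, p, g in [("soft", p_soft, g_soft), ("mid", p_mid, g_mid), ("sharp", p_sharp, g_sharp)]:
    print(name, np.round(p, 3), g)
```

With the soft setting the correct pair barely stands out from the wrong ones, and with the very sharp setting the largest gradient entry collapses to essentially zero once the correct pair is already on top.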

Short Demo: type a text query at inference and it retrieves matching images zero-shot, on categories the model never explicitly saw during training.
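
The retrieval step itself is just a nearest-neighbour search in the shared space. A minimal sketch, assuming the image embeddings were computed ahead of time and the query has already been encoded; `top_k_images` is a hypothetical helper, and the arrays below are random stand-ins rather than real encoder outputs.

```python
import numpy as np

def top_k_images(query_emb, image_embs, k=3):
    # Normalize both sides so the dot product is cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ q                  # one similarity score per gallery image
    order = np.argsort(-scores)[:k]    # indices of the k best matches, best first
    return order, scores[order]

# Stand-in data (assumption): 100 gallery images in a 256-dim space,
# and a text query whose embedding lands close to image 42.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 256))
query = gallery[42] + 0.1 * rng.normal(size=256)

indices, scores = top_k_images(query, gallery)
print(indices, scores)
```

In a real setup, `gallery` would come from running the image encoder over the dataset once, and `query` from running the text encoder on whatever the user typed.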

Working code: https://github.com/Arshad221b/CLIP_from_scratch

u/Amazing_Life_221 — 24 days ago