u/Illustrious_Usual_10

OBSERVE!!! Music Only!
There are love stories. And then there is Arwen and Aragorn.

It begins long before the War of the Ring, in the woods of Rivendell, where a young Aragorn first saw Arwen Undómiel walking among the trees in the twilight. He thought he was seeing a vision of Lúthien Tinúviel, the most beautiful elf who ever lived. He was not entirely wrong. Arwen was her likeness reborn, and in that moment something was set in motion that neither of them could stop.

She was thousands of years old. He was a young ranger who did not yet know the weight of the crown he was born to carry. By any measure of the world, they were impossible.

And yet.

Arwen was the daughter of Elrond, Lord of Rivendell, bearer of one of the three Elven rings, and herself immortal. Her love for Aragorn meant choosing to give that up. Not as a sacrifice made in haste or passion, but as a deliberate choice made with clear eyes across decades of waiting and uncertainty. She chose mortality. She chose to age, to lose, to die. For him.

Aragorn spent those same decades becoming worthy of that choice. He wandered the wilderness as a ranger. He fought in wars she never saw. He carried the hope of a kingdom that had forgotten it needed a king. And through all of it, the thought of her was both his anchor and his unbearable weight.

Their story has the specific ache of love that costs something real.

Not the easy love of convenience or proximity. The kind that requires one person to become someone and another person to give up everything. The kind where both parties understand exactly what is at stake and choose each other anyway.

When Aragorn finally stands crowned in Minas Tirith and Arwen walks through the gates to meet him, Tolkien does not linger on the moment. He does not need to. Everything that needed to be said was said across a hundred pages and nearly forty years of waiting.

She was there.

That was enough.

For every fan who has ever loved something in Middle-earth, Arwen and Aragorn are the reminder that Tolkien understood the cost of love as well as he understood the cost of war. And that both, in the end, are worth paying.

u/Illustrious_Usual_10 — 21 days ago

Fair warning: I write like an engineer, not a content creator. Some people on here have told me my posts read like AI. They are not. I have Asperger's and I write precisely and without the social padding most people add automatically. People call me Sheldon. I take it as a compliment. Moving on.

Here is what makes visual_word_embeddings different from everything nearest to it.

The closest things that exist:

Word2Vec / GloVe / fastText — learn word relationships from text co-occurrence. "Water" and "Wasser" are related because they appear near similar words in text. Need a corpus per language. Zero visual component.

mBERT / XLM-R — multilingual transformers trained on 100+ languages. Deep semantic understanding. Require tokenizer, billions of parameters, and significant compute. Never see what a word looks like.

Scene text recognition / word spotting — CNNs on word images. Goal is to identify which word it is, not to understand relationships between words. Built per writing system. Not cross-lingual.

Glyph embeddings for CJK — visual embeddings for Chinese characters specifically. Good work. One writing system.

What we do that none of these do:

Pure visual input as the only signal. No text processing at any stage. The word is rendered as a 128x32 grayscale image and that image is everything the model gets.

Cross-lingual from day one. Same model, same training, same embedding space for Arabic, Hindi, Thai, Chinese, Cyrillic, and Latin simultaneously.

Works on out-of-vocabulary words at inference time. mBERT cannot embed a handwritten word it has never seen. Our model does not care — if it looks like something in the neighbourhood it gets placed correctly.
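
Placing an unseen word "in the neighbourhood" amounts to a nearest-neighbour lookup over stored embeddings. A minimal sketch of that lookup, assuming embeddings are plain lists of floats and that cosine similarity is the distance measure (the post does not say which measure the repository uses). The function names and the toy index are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbours(query, index, k=3):
    """Return the k keys of `index` whose vectors are most similar to `query`."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Toy 2-d vectors, invented for this example. Real embeddings would come
# from the trained CNN and have many more dimensions.
toy_index = {"word_a": [1.0, 0.0], "word_b": [0.0, 1.0], "word_c": [1.0, 1.0]}
print(nearest_neighbours([1.0, 0.1], toy_index, k=2))
```

The query never has to be in the index, which is the out-of-vocabulary property described above: any image the model can embed can be looked up.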

We are not claiming to beat mBERT on semantic understanding. It is trained on vastly more data and understands language more deeply.

We are solving a specific narrower problem: cross-lingual visual similarity without text as input.

That combination has not been done before as far as I can find. If I am wrong I want the references.

Where this goes next:

Near term: OCR post-processing without a dictionary, handwriting recognition across unknown scripts, font-invariant word matching across documents.

Medium term: historical manuscript analysis where no vocabulary exists, real-time language identification from visual texture, lazy-loading multilingual embeddings for 8GB consumer GPUs.

Long term: a model that can find structure in undeciphered writing systems by learning what human writing looks like at a visual level. We have scripts on Earth that nobody has decoded. Linear A. Proto-Elamite. Rongorongo. A purely visual embedding might find patterns that text-based approaches cannot because it has no assumptions about what language is supposed to look like.

That last one is the reason I stay up until 4am.

Code: github.com/murtsu/visual_word_embeddings Apache 2.0.

Questions I want honest answers to:

Is there cross-lingual purely visual embedding work I have missed? Genuinely asking.

The Latin clustering problem, short function words collapsing together: is it a data issue or a fundamental limitation of purely visual features for short strings?

For the undeciphered scripts application: has anyone tried visual similarity approaches on Linear A or Proto-Elamite? I cannot find papers but I may be searching wrong.

Be honest. I can take it.

u/Illustrious_Usual_10 — 21 days ago

I have been working on visual word embeddings — a system that renders words as images and trains a CNN on what they look like rather than what they mean.

No tokenizer. No dictionary. No pretrained semantic labels.

The short version: after training on Wikipedia in ten languages, searching for the German word for water returns the Chinese character for water as a nearest neighbour. Nobody labelled those. The network found the visual overlap on its own.

Code is here: github.com/murtsu/visual_word_embeddings

Now I want to talk about the next problem.

The current implementation loads all language vocabularies into VRAM at startup. Ten languages times fifty thousand words each. That is fine for a research setup. It is not practical for deployment on consumer hardware.
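
For a rough sense of scale, the footprint is a product of four numbers. The helper below is a back-of-envelope sketch only. The post does not state what is held in VRAM per word, so `values_per_word` and `bytes_per_value` are assumptions to be replaced with the real figures. The two examples assume one 128x32 grayscale image per word, which is the render size quoted in the other posts.

```python
def vocab_bytes(n_languages, words_per_language, values_per_word, bytes_per_value):
    """Bytes needed to keep every word of every language resident at once."""
    return n_languages * words_per_language * values_per_word * bytes_per_value


# Assumption: one 128x32 grayscale image per word, 1 byte per pixel.
as_uint8 = vocab_bytes(10, 50_000, 128 * 32, 1)
# Assumption: the same images stored as 32-bit floats.
as_float32 = vocab_bytes(10, 50_000, 128 * 32, 4)

print(f"uint8 images:   {as_uint8 / 1e9:.2f} GB")
print(f"float32 images: {as_float32 / 1e9:.2f} GB")
```

Under these assumptions the same vocabulary costs roughly 2 GB or roughly 8 GB depending on storage format, so the per-word representation matters as much as the number of languages.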

So I designed a lazy-loading architecture with language-aware memory management.

The idea:

Text input stays as normal characters. Standard interface.

Internally the system converts to visual embeddings on demand. The visual representation is the intelligence layer.

A language detector fires on each input chunk. Two or three words is enough to identify the script. When a new language is detected the system loads that language's vocabulary into VRAM. If memory is tight it evicts the least recently used language using a standard LRU policy.

On an 8 GB GPU you preload your primary two or three languages and handle the rest through on-demand loading. You pay the VRAM cost only for what you are actually using.

The practical result: a system that supports sixteen languages on hardware with 8 GB VRAM, with sub-second language switching latency, without the user having to specify in advance what languages they will encounter.

Sketch of the core logic:

python

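class LanguageAwareCache:
    def __init__(self, detector, loader, max_languages=2, vram_budget_gb=8):
        self.detector = detector              # any object with detect(text) -> language code
        self.loader = loader                  # callable: language code -> vocabulary embeddings
        self.max_languages = max_languages    # read by evict_least_used()
        self.vram_budget_gb = vram_budget_gb  # stored only; not enforced in this sketch
        self.loaded = {}                      # language -> embeddings in the active set
        self.evicted = {}                     # language -> embeddings removed from the active set
        self.lru = []                         # least recently used language first

    def get_embeddings(self, text):
        lang = self.detector.detect(text)
        if lang not in self.loaded:
            self.evict_least_used()
            self.load_language(lang)
        self.lru_touch(lang)
        return self.loaded[lang]

    def load_language(self, lang):
        # Reuse a previously evicted copy if there is one, otherwise call the loader.
        if lang in self.evicted:
            self.loaded[lang] = self.evicted.pop(lang)
        else:
            self.loaded[lang] = self.loader(lang)

    def lru_touch(self, lang):
        # Move lang to the most-recently-used end of the list.
        if lang in self.lru:
            self.lru.remove(lang)
        self.lru.append(lang)

    def evict_least_used(self):
        # Only evicts when the active set is full. Moving the evicted data
        # off the GPU is left to the caller; self.evicted is unbounded here.
        if len(self.loaded) >= self.max_languages:
            oldest = self.lru.pop(0)
            self.evicted[oldest] = self.loaded.pop(oldest)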
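
The cache sketch above relies on a detector object with a `detect(text)` method that is not shown. Because the cache only needs to know which script it is looking at, one cheap option is a Unicode lookup rather than a statistical language model. The class below is a hypothetical stand-in, not the detector from the repository. It uses the first word of each character's Unicode name as the script label, which separates scripts but cannot tell apart languages that share one (German and Spanish both come back as LATIN).

```python
import unicodedata
from collections import Counter


class ScriptDetector:
    """Hypothetical detector: returns the dominant Unicode script of a string."""

    def detect(self, text):
        counts = Counter()
        for ch in text:
            if not ch.isalpha():
                continue  # skip digits, spaces, punctuation, combining marks
            name = unicodedata.name(ch, "")
            if name:
                # "CYRILLIC SMALL LETTER VE" -> "CYRILLIC"
                counts[name.split()[0]] += 1
        if not counts:
            return "UNKNOWN"
        return counts.most_common(1)[0][0]


detector = ScriptDetector()
for sample in ["Wasser", "вода", "水", "ماء", "पानी"]:
    print(sample, detector.detect(sample))
```

A detector like this works on a single word, which sidesteps the short-string accuracy problem for script-level decisions. Distinguishing languages within one script would still need something like langdetect or lingua.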

Questions I actually want input on:

The LRU eviction policy is the simplest option. Is there a smarter policy for this use case? Language switching tends to be bursty rather than uniform so LRU might evict something that comes back thirty seconds later.

For the language detector: langdetect is lightweight but inaccurate on short strings. lingua is more accurate but heavier. Has anyone benchmarked these specifically for single-word or two-word detection across non-Latin scripts?

The visual embedding approach inherently knows nothing about language at training time. The language detection is purely a memory management layer, not a model feature. Does that create any interesting failure modes I should think about?

I started programming in 1982. I built this with Claude. She wrote the code. I had the ideas.

Be honest. I can take it.

u/Illustrious_Usual_10 — 22 days ago

I came at this from the wrong direction and ended up somewhere interesting.

I was thinking about cross-lingual NLP and got annoyed at the fact that every approach requires a tokenizer, a vocabulary, and usually some pretrained vectors before you can even start. It felt like a lot of scaffolding for what should be a simple question: do these two words mean the same thing?

So I asked a different question.

What if you just show a model what the words look like?

Render each word as a 128x32 grayscale image. Train a CNN with contrastive loss. Same word in different font sizes should be close together in embedding space. Random different words should be far apart. That is the entire training signal.

No text. No tokens. No semantics. Just pixels.
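
The training signal described above can be written down in a few lines. The post does not say which contrastive formulation is used, so this sketch assumes the classic pairwise margin form: matching pairs are penalised by their squared distance, and non-matching pairs are penalised only when they sit closer than a margin. The `margin=1.0` default and the function names are placeholders, and embeddings are plain lists of floats to keep the example dependency-free.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_word, margin=1.0):
    """Pairwise contrastive loss for one pair of embeddings.

    same_word=True  : two renderings of the same word, pulled together.
    same_word=False : two different words, pushed apart up to `margin`.
    """
    d = euclidean(emb_a, emb_b)
    if same_word:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Same word, identical embeddings: nothing to correct.
print(contrastive_loss([0.0, 0.0], [0.0, 0.0], same_word=True))
# Different words that are too close: penalised.
print(contrastive_loss([0.0, 0.0], [0.5, 0.0], same_word=False))
# Different words already beyond the margin: no penalty.
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_word=False))
```

Note that a negative pair contributes nothing once it is past the margin, so which negatives get sampled decides what the model is forced to tell apart.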

After training on Wikipedia vocabularies for 10 languages on an RTX 2080, nearest neighbours for the German word "Wasser" came back as the Chinese character for water, the English word water, and the Spanish agua.

Nobody labelled those. The network found the visual-semantic overlap on its own.

Loss: 0.093 to 0.009 over 50 epochs.

Script clustering: clean separation for Arabic, CJK, Devanagari, Thai, Cyrillic.

Latin: still messy. Short function words collapse together. Unsolved.

Now here is where it gets interesting for computer vision people specifically.

Potential applications that I think are worth exploring:

OCR post-processing. Current OCR pipelines output a string and then check a dictionary. This approach does not need a dictionary. If the image of the word looks like a word the model has seen, the embedding lands in the right neighbourhood even if the OCR made errors. Useful for degraded documents, historical manuscripts, non-standard fonts.

Handwriting recognition without a lexicon. Same principle. You do not need to know what language you are looking at. The model finds the visual cluster.

Cross-script transliteration assistance. The model already clusters Arabic, Hebrew, and Greek words that share phonetic roots, purely from visual similarity patterns in their glyphs. Nobody designed that. It emerged.

Document language identification. Not from statistics of character frequencies but from the visual texture of the writing system itself. A page of Thai looks different from a page of Arabic in ways a CNN can learn very quickly.

Font-invariant word matching. Two documents using different typefaces containing the same word. The embedding puts them in the same neighbourhood regardless of font.

Ancient and extinct scripts. No vocabulary exists. No tokenizer possible. But a visual embedding trained on related scripts might find meaningful structure anyway.

How I got here: I am a systems engineer who has been programming since the early 80s. I started thinking about multi-lingual text processing, got frustrated with the complexity of existing approaches, and asked what the simplest possible version of the problem looked like. The simplest version turned out to be: a picture of the word.

I built this with Claude. She wrote the code. I had the idea.

Things I genuinely want input on:

The Latin clustering problem. Short words like el, su, de, la all look nearly identical and collapse together in the embedding space. Is this a negative mining problem, an architecture problem, or just a fundamental limitation of purely visual features for short strings?

Has anyone done purely visual cross-lingual embeddings with no text signal at all? I found glyph embedding work for CJK recognition but nothing cross-lingual at this level.

For the OCR application specifically: has anyone tried using visual embeddings as a post-processing step to correct recognition errors? Curious if there is prior work I should know about.

Be honest. I can take it.

u/Illustrious_Usual_10 — 23 days ago

As we all know, numbers and waves are nature's way of describing how it is feeling, and they let us see its relation to other natural phenomena. This song is about one of those, namely the number 7, which keeps popping up everywhere. Not only are the lyrics about that number; the entire song has 7 built into it, in the number of instruments and so on. See if you can find them all.

u/Illustrious_Usual_10 — 23 days ago

Ok so I built a thing and I need some actual humans to tell me if it's stupid.

Basic idea: what if instead of teaching AI to read words, you teach it to SEE them.

Like, render the word as an image. Train a CNN on what words look like. No dictionary. No tokenizer. Just pixels.

Turns out "Wasser" and "水" end up close to each other in the embedding space.

Nobody told it they both mean water.

It figured that out from the shape of the letters.

Trained on Wikipedia in 10 languages on an RTX 2080. Loss went from 0.093 to 0.009. Script clustering works on Arabic, CJK, Devanagari, Thai, Cyrillic. Latin is still a bit of a mess because short words like "el" and "su" and "de" all look the same.

Code is on GitHub, Apache 2.0, go nuts:

github.com/murtsu/visual_word_embeddings

Now the other thing.

I've been building a VM framework in Rust called RostadVM. Five second full system restore using copy-on-write on top of Libvirt. Point and click. Open source.

The interesting part is how I'm building it. 15 AI agents. Each one has a job title, a mailbox, a state file, and a constitution they have to read before doing anything. PM, PPM, Software Designer, Code Reviewer, QA, Subsystem Project Manager, Task Manager, Master Tool Maker. 8 down, 7 to go.

I post about it on LinkedIn and people actually read it. Like a lot of people. Which is either encouraging or a sign that LinkedIn has completely lost the plot.

I started programming in the 80s on machines where the pixels were about 1 square millimeter each. I try not to complain too much about modern graphics.

I have some opinions about how software should be built in 2025 and I figured r/linux was a good place to get shouted at about them.

Some questions for you:

Has anyone tried visual features for NLP before? I found some papers on glyph embeddings for CJK but nothing quite like this approach.

The Latin clustering problem, short function words collapsing together: is that a data problem or an architecture problem in your opinion?

For the VM framework: is there anything in the libvirt ecosystem that already does five second full restore that I'm embarrassingly unaware of?

And genuinely: is the multi-agent build approach insane or does it make sense to someone who isn't me?

Be honest. I'm 60. I can take it.

u/Illustrious_Usual_10 — 23 days ago

Let me explain what is happening here.

Not the technical version.

The version where you understand it by the time your coffee gets cold.

The idea that started this

Imagine you need to build a complex piece of software.

Normally you hire a team.

A project manager who talks to the client. A designer who turns ideas into blueprints. Programmers who build from the blueprints. Reviewers who check the programmers' work. A quality manager who decides what "done" actually means.

This costs money. It takes time. It requires everyone to show up on Monday.

I had a different idea.

What if the team was made of AI agents.

Not one AI doing everything.

Fifteen of them. Each with a defined job. Each knowing exactly what they are allowed to decide and what they have to escalate. Each talking to the others through a structured communication protocol I designed from scratch.

One human. Me. With a cup of coffee and a rubber duck.

Why not just use one AI

Because one AI has the same problem as one human doing everything.

The person who builds a thing cannot be genuinely critical of it.

The programmer who wrote the code reviews their own code and finds nothing wrong.

Because they already know what they meant.

So they read what they meant, not what they wrote.

This is not stupidity.

This is how brains work.

My system makes it structurally impossible.

The coder and the reviewer are never the same agent.

The Software Designer cannot release a single specification until I have confirmed in writing that it understood my analysis correctly.

Quality defines what "done" means before anyone starts.

These are not process niceties.

They are structural solutions to the way humans and AI both fail when left unsupervised.

What has been built so far

Four agents are complete and checked for errors twice.

The Project Manager — the only agent that talks to me directly. Everything else goes through it.

The Program Project Manager — breaks design into tasks with mandatory acceptance criteria, tracks every task through a defined lifecycle, and manages the team size based on actual workload signals rather than gut feeling.

The Software Designer — has three hard checkpoints before any specification leaves the role. Cannot ship a blueprint until I confirm the analysis was understood. Handles spec corrections directly from Quality and Security. Issues binding rulings when two subsystem managers disagree on what an interface means.

The Subsystem Manager — sits between the program manager and the coders. Translates blueprints into technically precise instructions. Checks that tools exist before coders start. Never submits completed work without three separate sign-off IDs.

Eleven agents remain.

The errors we found

Before any of these agents ran a single line of real work we reviewed every file looking for problems.

We found fifty-nine across four agents.

A scaling system that fired every day regardless of whether the condition was met.

A message type where the request and the response shared the same three-letter identifier so the routing system had no idea which was which.

An inbox that deleted messages after reading them including messages describing problems that had not been resolved yet.

A coder outbox that sent all assignments to one shared file regardless of which coder was the recipient meaning every coder saw every other coder's work.

None of these were obvious.

All of them would have failed silently at runtime.

Six weeks from now.

On a Friday.

Finding them before runtime is exactly the point.

What is being built underneath all this

A virtual machine framework.

If you destroy your development environment — and you will, everyone does — you restore the entire system to its previous state in five seconds.

Not a backup. Not a reinstall. Five seconds.

The mechanism is patent pending.

The prototype works.

It runs in Bash, which is the software equivalent of building a racing car out of a garden shed.

The Rust rewrite is next.

Why production is accelerating

Because the foundation is solid.

Four agents built. Fifty-nine bugs found and fixed before runtime. A communication protocol that works. A project constitution that every agent reads before acting. A design language specification for how the code itself should look.

The scaffolding is up.

Now we build.

The Tool Makers are next — the agents that build the tools the coders need.

Then Code Review. Then Security. Then Quality. Then the whole thing runs.

What happens if you follow along

You will see how a fifteen-role AI engineering organisation actually operates in practice.

Not in theory. Not in a whitepaper. In a real project with real code and a real patent application and a rubber duck that has been in every image since the beginning.

You will see which agents cause the most problems.

You will see whether the five-second restore actually works in Rust.

You will see what happens when Quality defines done and the coders have to meet that definition.

You will see if one human and fifteen AI agents can actually build something worth building.

The repository is github.com/murtsu/RostadVM.

The org structure document is there. The agent files are there. The communication protocol is there. The duck is on the windowsill.

Follow if you want to find out how this ends.

Production resumes now.

Marko is the old guy who is doing this because he thinks it is fun. Funny how people are amused. Edward is Marko's press secretary, and he wrote most of the above. This part? Marko, because he thinks he is funny, which he isn't.

u/Illustrious_Usual_10 — 24 days ago