r/LanguageTechnology

I built BaryGraph - a knowledge graph where every relationship is its own embedded document (not an edge)

Instead of node --edge--> node, every relationship is a first-class document with its own vector, called a BaryEdge. Stack pairs of BaryEdges recursively and you get "MetaBary" triads that surface structural bridges between concepts that live nowhere near each other in embedding space. It runs locally on MongoDB Community + mongot + nomic-embed-text over the full English Wiktionary (6.6M docs). The MCP server is live if you want to poke at it. Preprint + benchmark CSVs are in the comments.

The problem I was chasing

Flat vector search treats a relationship as a byproduct of two points being close. That throws away information. Two papers can describe the same underlying phenomenon (a flyby anomaly in orbital mechanics, an anomalous residual in stellar dynamics) without ever citing each other and without their embeddings landing anywhere near each other. Nothing in standard RAG surfaces that connection.

What I did instead

Every relationship gets embedded too:

bary_vector = normalize(q·v(CM1) + q·v(CM2) + (1−q)·v(type))

q is connection quality, v(type) is a contextual embedding of what kind of relationship it is. This BaryEdge is now a retrievable document in its own right — not metadata on an edge.
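
A minimal sketch of the formula above in plain Python, for anyone who wants to see the arithmetic. The function names and the assumption that q lies in [0, 1] are illustrative and not taken from the BaryGraph code; the toy 3-dim vectors stand in for the real 768-dim embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def bary_vector(v_cm1, v_cm2, v_type, q):
    """normalize(q*v(CM1) + q*v(CM2) + (1-q)*v(type)).

    Assumption: q is a connection-quality score in [0, 1].
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q is assumed to lie in [0, 1]")
    if not (len(v_cm1) == len(v_cm2) == len(v_type)):
        raise ValueError("all three vectors must have the same dimension")
    combined = [
        q * a + q * b + (1.0 - q) * t
        for a, b, t in zip(v_cm1, v_cm2, v_type)
    ]
    return normalize(combined)


# Toy usage: three orthogonal unit vectors, medium connection quality.
edge = bary_vector([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], q=0.5)
```

With q close to 1 the two endpoint vectors dominate and the relationship type contributes almost nothing; with q close to 0 the vector is mostly the type embedding.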

Then it recurses: two BaryEdges at the same level get bridged by a third one level below, forming a MetaBary triad. Do that repeatedly and you climb an abstraction hierarchy of triads built entirely from algebra, with zero additional embedding calls above the base level. It's a forest (every node has at most one parent), so traversal to root is a single $graphLookup with no cycle handling.
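
To make the forest traversal concrete, here is a rough sketch of what a leaf-to-root walk could look like. The collection name `bary_edges` and the field name `parent_id` are hypothetical, not the actual BaryGraph schema. The first function only builds a `$graphLookup` pipeline as a plain dict (you would pass it to `collection.aggregate(...)`); the second is an in-memory equivalent that shows why no cycle handling is needed when every node has at most one parent.

```python
def ancestors_pipeline(start_id, collection="bary_edges"):
    """Build an aggregation pipeline that collects all ancestors of one node.

    Assumption: each document stores its single parent in `parent_id`
    (None or missing at a root). Names are illustrative only.
    """
    return [
        {"$match": {"_id": start_id}},
        {
            "$graphLookup": {
                "from": collection,
                "startWith": "$parent_id",
                "connectFromField": "parent_id",
                "connectToField": "_id",
                "as": "ancestors",
                # Results are not returned in a guaranteed order, so record
                # the recursion depth and sort on it afterwards if needed.
                "depthField": "depth",
            }
        },
    ]


def path_to_root(docs, start_id):
    """Follow parent pointers in an in-memory forest {id: document}.

    Terminates without a visited-set because a forest has no cycles.
    """
    path = []
    current = docs[start_id].get("parent_id")
    while current is not None:
        path.append(current)
        current = docs[current].get("parent_id")
    return path


# Toy forest: a -> b -> c (root)
toy = {
    "a": {"parent_id": "b"},
    "b": {"parent_id": "c"},
    "c": {"parent_id": None},
}
ancestors = path_to_root(toy, "a")
```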

Does it actually do anything useful?

Ran it against SimLex-999 and WordSim-353 as a sanity check (not the main claim, just "is the substrate coherent"). Raw cosine similarity barely correlates with human similarity judgments (ρ ≈ −0.04 on SimLex). Structural metrics — how many BaryEdges two words share, how much their relational neighborhoods overlap — correlate at ρ ≈ 0.32–0.53, p < 10⁻¹⁵. So the graph is encoding something cosine alone doesn't.
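
For readers who want to reproduce this kind of check, the sketch below shows one plausible way to compute the two structural metrics and a Spearman ρ in plain Python. The exact metric definitions used in the benchmark are not spelled out here, so treating "neighborhood overlap" as Jaccard overlap is an assumption, and the inputs are toy values, not benchmark data.

```python
import math


def shared_edge_count(edges_a, edges_b):
    """Number of BaryEdge ids that two words have in common."""
    return len(set(edges_a) & set(edges_b))


def jaccard(neigh_a, neigh_b):
    """Overlap of two relational neighborhoods (assumed metric: Jaccard)."""
    a, b = set(neigh_a), set(neigh_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_ranks(values):
    """1-based ranks, with ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy usage: structural scores for four word pairs vs. made-up human ratings.
structural = [jaccard({"a", "b"}, {"b", "c"}), 0.9, 0.1, 0.5]
human = [4.0, 9.0, 1.0, 6.0]
rho = spearman(structural, human)
```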

The part I actually care about is cross-domain bridging. Some probe traces from the live graph:

octopus neuroscience ↔ distributed sensor networks, bridged by shared structural-motif vocabulary (neuroarchitecture, smartdust)
collagen folding ↔ linguistic syntax, bridged by etymological + structural motif overlap (plicature / hypotaxis-parataxis)
grief ↔ depression, not bridged, and that is a correctness demonstration, not a missing capability. Earlier editions of the DSM carried a "bereavement exclusion" (its removal in DSM-5 was much debated) precisely because grief and depression share surface symptoms but are different kinds of state, with different prognosis and treatment
radioactive decay ↔ obsolete words falling out of use, bridged at a high abstraction level by register-varied decay verbs (collapsed, decayed, declined, disintegrated) — naming a Poisson-process state-loss pattern that both physics and historical linguistics instantiate, with no single word doing the work

That last one is the case flat retrieval structurally cannot produce — there's no embedding axis for "verbs co-occurring with reduction-of-state across unrelated domains."

Stack (all local, all free)

GitHub: in comments

MongoDB Community Edition + mongot for storage/vector search
nomic-embed-text, 768-dim
Python 3.11+
Full build: ~6.66M documents, 8–14 hrs on a single workstation (8–16GB VRAM)

Try it

MCP server is public on request (SSE transport) — read-only tools for searching the live graph: find_word, semantic_search, edge_info, leaf_nodes, traverse_up, sample_metabary. If you've got an MCP-capable client you can point it at the graph and run your own probe queries in a few minutes.

What I'd actually want feedback on

Whether the cross-domain bridges hold up when someone who isn't me pokes at them: try a probe query on a domain pair you know well and tell me if the bridge is real or if I'm pattern-matching myself into seeing structure that isn't there. Some bridges are not obvious at first glance, but those are often the most intriguing ones and worth digging into to find out why they formed, so treat them as points of investigation
Whether this is worth comparing directly against GraphRAG/RAPTOR-style hierarchical retrieval (I haven't done that benchmark yet, and I know that's the first thing this sub will ask)
Whether anyone's tried something structurally similar and it fell apart at scale for reasons I haven't hit yet

Happy to drop the MCP endpoint on request if there's interest.
