r/KnowledgeGraph

▲ 27 r/KnowledgeGraph+14 crossposts

Ask questions across your Markdown notes using a fully local Graph RAG engine. Built for Obsidian vaults, works with any folder of Markdown files. Extracts entity-relation triples from wikilinks & YAML frontmatter, retrieves answers via hybrid search (vector + BM25 + temporal). Multilingual. No cloud. Runs on Ollama.

https://github.com/benmaster82/Kwipu
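
To make the extraction step concrete, here is a minimal sketch of turning wikilinks and simple YAML frontmatter into triples. It is not Kwipu's actual code: the function names, the `links_to` relation label, and the flat `key: value` frontmatter handling are all assumptions for illustration, and a real vault would want a proper YAML parser.

```python
import re

# Matches [[Target]], [[Target|alias]] and [[Target#Heading]]; captures only the target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def parse_frontmatter(text):
    """Return (metadata dict, body). Handles flat 'key: value' lines only."""
    if not text.startswith("---\n"):
        return {}, text
    end = text.find("\n---", 3)
    if end == -1:
        return {}, text
    meta = {}
    for line in text[4:end].splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            meta[key.strip()] = value.strip()
    return meta, text[end + 4:]

def extract_triples(note_name, text):
    """Build (subject, relation, object) triples for one note."""
    meta, body = parse_frontmatter(text)
    # Each frontmatter key becomes a relation from the note to its value.
    triples = [(note_name, key, value) for key, value in meta.items()]
    # Each wikilink becomes a hypothetical 'links_to' relation.
    for match in WIKILINK.finditer(body):
        triples.append((note_name, "links_to", match.group(1).strip()))
    return triples

note = "---\nauthor: Ada\nstatus: draft\n---\nSee [[Graph RAG|this note]] and [[Ollama#Setup]].\n"
triples = extract_triples("Kwipu notes", note)
for triple in triples:
    print(triple)
```

Aliases and heading anchors are dropped so that every link to the same note resolves to the same graph node.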

u/WritHerAI — 1 day ago
▲ 144 r/KnowledgeGraph+14 crossposts

Glia – Local-first shared memory layer (SQLite-vec + FTS5 + Offline Knowledge Graph)

Hey everyone,

I wanted to share a project I've been working on called Glia. It is a 100% offline, local-first RAG and memory layer designed to connect your AI web chats (Claude, ChatGPT, DeepSeek) with your local developer tools (Claude Code, Cursor, Windsurf) using a unified local database.

I wanted something lightweight that did not require pulling heavy Docker containers or subscribing to third-party memory APIs. I settled on a Node.js + SQLite architecture running sqlite-vec (for 768-dim float32 embeddings) alongside SQLite FTS5 for hybrid search, powered completely by local Ollama instances.

We just launched a live website that outlines the details and demonstrates the features in action:

Technical Stack & Features:

  • Hybrid Search Retrieval: SQLite-vec (using nomic-embed-text locally) + FTS5 keyword prefix matching (porter stemmer).
  • Surgical Sentence-level Trimming: Chunks are sliced into sentences. When a prompt is intercepted, only the exact matching sentences are pulled out of the vector store instead of the whole paragraph. It cuts LLM prompt bloat by ~90-95% in my benchmarks.
  • Knowledge Graph Extraction: An offline task queue uses a local LLM (llama3.1:8b via Ollama) to extract entity triples (subject-relation-object). These are stored in a SQLite facts table (or Neo4j if you run the full Docker compose profile) and fused with the vector retrieval score.
  • HyDE (Hypothetical Document Embeddings): Queries are pre-processed to generate a hypothetical answer, which is embedded together with the original query to bridge semantic gaps.
  • Concurrency: Running SQLite in WAL (Write-Ahead Logging) mode allows the browser extension dashboard and active MCP sessions to read/write concurrently without locking.
  • PII Redaction: Aggressive scrubbing of JWTs, API keys, emails, and IPs in the extension before data is saved.
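
For readers wondering how a vector ranking and a keyword ranking can be fused at all, one common option is reciprocal rank fusion (RRF). The sketch below is only an illustration of that general technique: Glia's real scoring is described in its RAG_PIPELINE.md and may work differently. The sentence IDs and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each list contributes 1 / (k + rank) per item, so an item that
    ranks well in several lists rises to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

vector_hits = ["s3", "s1", "s7"]   # hypothetical sentence IDs from the vector index
keyword_hits = ["s1", "s9", "s3"]  # hypothetical sentence IDs from FTS5
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

RRF only needs ranks, not raw scores, which sidesteps the problem that cosine distances and BM25 scores live on different scales.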

The extension works on Claude.ai, ChatGPT, DeepSeek, Gemini, Grok, and Mistral. The MCP server runs out of the same backend database for your terminal agent or Cursor.

You can set it up with a single command: npx glia-ai-setup

Glia is completely open-source (MIT). If you like the local-first approach or want to contribute to the SQLite vector pipeline, PRs are very welcome, and a star on GitHub helps the project get discovered!

I would appreciate any feedback on the SQLite hybrid search scaling, the scoring fusion algorithm (RAG pipeline details are in RAG_PIPELINE.md), or local graph extraction performance!

u/Better-Platypus-3420 — 2 days ago

Complete beginner here... what is the best roadmap to learn Knowledge Graphs from scratch?

Hi everyone,

I am completely new to the world of Knowledge Graphs and looking for a solid learning path or roadmap to get started with the basics.

To give you some context:

  • My background: Minimal knowledge of KGs :/ I'm hoping to get some insight and possibly start a career in this area.
  • My goal: I want to understand how KGs work because I am interested in connecting them to LLMs/RAG and adding to my data engineering knowledge.

I am a bit overwhelmed by the different technologies and terminology (RDF, OWL, Property Graphs, Neo4j/Arango vs. Ontologies).

Could you recommend:

  1. The best beginner-friendly books, courses, or YouTube channels?
  2. A simple hands-on project idea to practice the core concepts?
  3. Whether I should focus on semantic web standards (W3C/RDF) or property graphs first?

Thank you in advance for any guidance!

u/VisionaryPond — 3 days ago
▲ 11 r/KnowledgeGraph+1 crossposts

In-process and in-memory graph database for large knowledge graphs - no server needed with TuringDB v1.31

Hey again! Adam from TuringDB, posted here a few months back when we launched the community version.

Quick update on something we just shipped: in-process mode.

You can now embed TuringDB directly in your script or pipeline - no separate server, no socket, no daemon to manage. Just instantiate and query:

In Python:

from turingdb import TuringDB

db = TuringDB()
db.load_graph('my_knowledge_graph')
db.set_graph('my_knowledge_graph')

df = db.query('MATCH (n)-->(m) RETURN n,m')
print(df)

Results back as a DataFrame, zero networking to manage.

Practically this means: if you're running a KG pipeline, a GraphRAG system, or just iterating locally on a large graph - you no longer need to spin up an instance of TuringDB to use it. It runs where your code runs.

Everything else from the previous post still applies - git-style versioning, zero-lock reads, vector search, Cypher. This just removes the last friction point for local and embedded workflows.

Docs at docs.turingdb.ai and source at github.com/turing-db/turingdb

Happy to answer questions 🙂

u/adambio — 3 days ago
▲ 0 r/KnowledgeGraph+1 crossposts

I'm really excited about knowledge graphs...

This is mostly about AI, but since all my samples are in Blazor I felt it was appropriate to post it here.

I mostly work on business applications and use a lot of retrieval-augmented generation. However, I recently discovered the power of knowledge graphs, wrote a blog post about it, and would love to hear feedback.

More Powerful than AI RAG: Building Lightweight Knowledge Graphs

Basically, if you have any data schema with related entities, you can easily build a knowledge graph, save it as simple in-memory JSON, and expose it to your AI through function (tool) calling.

Then allow your end users to ask questions and query that data in ways unachievable without a knowledge graph. This can also be used to update and make changes to your source data.
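
As a rough sketch of the idea (in Python rather than Blazor/C#, and not taken from the blog post), the graph can be a plain JSON-serializable dictionary plus one lookup function that is registered as a tool. The customer/order/product schema, the relation names `PLACED` and `CONTAINS`, and the `get_related` function are all made up for illustration.

```python
import json

# Hypothetical relational data with foreign keys between entities.
customers = [{"id": "c1", "name": "Acme"}]
products = [{"id": "p1", "name": "Widget"}, {"id": "p2", "name": "Gadget"}]
orders = [
    {"id": "o1", "customer_id": "c1", "product_id": "p1"},
    {"id": "o2", "customer_id": "c1", "product_id": "p2"},
]

def build_graph():
    """Turn rows into nodes and foreign keys into labeled edges."""
    nodes, edges = {}, []
    for c in customers:
        nodes[c["id"]] = {"type": "Customer", "name": c["name"]}
    for p in products:
        nodes[p["id"]] = {"type": "Product", "name": p["name"]}
    for o in orders:
        nodes[o["id"]] = {"type": "Order", "name": o["id"]}
        edges.append({"from": o["customer_id"], "rel": "PLACED", "to": o["id"]})
        edges.append({"from": o["id"], "rel": "CONTAINS", "to": o["product_id"]})
    return {"nodes": nodes, "edges": edges}

def get_related(graph, node_id, rel=None):
    """The function you would expose to the model as a tool."""
    return [
        {"rel": e["rel"], "id": e["to"], **graph["nodes"][e["to"]]}
        for e in graph["edges"]
        if e["from"] == node_id and (rel is None or e["rel"] == rel)
    ]

# Round-trip through JSON to show the graph is plain serializable data.
graph = json.loads(json.dumps(build_graph()))
placed = get_related(graph, "c1", "PLACED")
bought = [p["name"] for o in placed for p in get_related(graph, o["id"], "CONTAINS")]
print(bought)
```

A question like "what has Acme bought?" then becomes two tool calls that hop Customer to Order to Product, which is the kind of multi-hop lookup plain chunk retrieval struggles with.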

u/adefwebserver — 5 days ago
▲ 10 r/KnowledgeGraph+1 crossposts

Hi everyone,

I’ve spent the last few months building a custom knowledge graph extraction engine (which I call blAST) designed to map the architectural physics of massive software repositories.

Usually, extracting code into a graph requires an Abstract Syntax Tree (AST). The problem is ASTs are incredibly heavy, strictly monolingual, and fail if a repository doesn't compile. I wanted to map planetary-scale, multi-lingual enterprise systems, so I built a deterministic parser instead. It treats code like text and scans for keyword markers across 50+ languages to build the graph.

Here is how the graph ontology and analytics work:

1. The Ontology

  • Nodes: Files, Classes, and Functions.
  • Node Properties: 50+ dimensional vectors representing regex keyword hits (e.g., raw memory manipulation, state flux, etc.).
  • Edges: File (imports/dependencies) and functional execution paths (outbound calls/reachability).

2. Graph Analytics & Network Topology

Once the graph is built, the engine runs network math over the repository to find architectural bottlenecks. I calculate:

  • Modularity & Average Path Length to measure encapsulation.
  • Articulation Points to find the "God Nodes" (if these fail, the graph shatters).
  • Cyclic Loop Density to measure static friction in the architecture.
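
For anyone unfamiliar with articulation points, here is a small self-contained sketch using the classic depth-first low-link method. It is not the engine's code: the file names are invented, and treating import edges as undirected is an assumption made so the textbook definition applies.

```python
from collections import defaultdict

def articulation_points(edges):
    """Return nodes whose removal disconnects the (undirected) graph."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for nxt in sorted(graph[node]):
            if nxt == parent:
                continue
            if nxt in disc:
                # Back edge: node can reach an earlier-discovered node.
                low[node] = min(low[node], disc[nxt])
            else:
                children += 1
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                # The subtree under nxt cannot reach above node without node.
                if parent is not None and low[nxt] >= disc[node]:
                    points.add(node)
        # The DFS root is a cut vertex only if it has several subtrees.
        if parent is None and children > 1:
            points.add(node)

    for node in sorted(graph):
        if node not in disc:
            dfs(node, None)
    return points

# Hypothetical import graph: two entry points share core.py, and db.py
# sits between core.py and a models/cache cycle.
imports = [
    ("app.py", "core.py"), ("cli.py", "core.py"), ("core.py", "db.py"),
    ("db.py", "models.py"), ("db.py", "cache.py"), ("models.py", "cache.py"),
]
god_nodes = articulation_points(imports)
print(sorted(god_nodes))
```

The recursive version is fine for small graphs; on repositories with very deep dependency chains an iterative DFS would avoid Python's recursion limit.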

3. K-Means Clustering on 1.5M Nodes

Since all languages have keywords that mean roughly the same thing, I analyzed 1000 repos in different languages, took the regex count vectors of 1.59 million file nodes across 50 languages, and ran them through an unsupervised K-Means clustering algorithm. The graph converged into 10 distinct architectural "micro-species" (e.g., UI View Layers, Highly Concurrent State Managers, Unshielded Native Core). The clustering algorithm successfully grouped a complex Java service and a defensive Rust file into the same exact node category based purely on their physical edge/property behavior.
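
A toy version of that clustering step, to show why files in different languages can land in the same cluster: only the keyword-count vector matters, not the language. The file names, the three made-up feature dimensions, and the "first k vectors" initialization are assumptions for illustration; a real run would use many more dimensions and a better initializer such as k-means++.

```python
def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster index per vector."""
    # Deterministic (and naive) init: use the first k vectors as centroids.
    centroids = [list(v) for v in vectors[:k]]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignments = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments

# Hypothetical keyword-hit counts per file: [ui, concurrency, raw_memory]
files = {
    "View.java":    [9, 0, 0],
    "Worker.rs":    [0, 8, 1],
    "Screen.swift": [8, 1, 0],
    "Pool.java":    [1, 9, 0],
}
labels = dict(zip(files, kmeans(list(files.values()), k=2)))
print(labels)
```

Here `Pool.java` and `Worker.rs` end up together because both are dominated by concurrency keywords, even though one is Java and the other Rust.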

4. Graph Traversal Use Cases

I used this graph engine to tear down Google DeepMind's original AlphaFold repo. By traversing the graph, the engine instantly isolated the absolute heaviest bottleneck in the network: a single node (contacts_network.py) running an O(N^6) complexity loop holding up the entire pipeline.

code - https://github.com/squid-protocol/gitgalaxy

Example data from Google DeepMind's AlphaFold - https://squid-protocol.github.io/gitgalaxy/museum-of-code/alphafold_teardown.html

Population data from hundreds of repos - https://squid-protocol.github.io/gitgalaxy/03-04-claim-4-comparing-languages/

u/Chunky_cold_mandala — 7 days ago
▲ 21 r/KnowledgeGraph+3 crossposts

Using knowledge graphs, with the help of Mnemo, to stop searching code and documentation again and again

This is what your codebase actually looks like.

2032 nodes. 2878 edges. 7 relationship types.

Every service. Every dependency. Every API. Every owner. Every connection your team built over years — visualised in one graph.

Most AI coding assistants see none of this.

They see the file you have open.
Maybe the files you paste in.
Nothing else.

So when they generate code, they generate it blind.
No knowledge of what depends on what.
No knowledge of what breaks if you change something.
No knowledge of the relationships your team spent years building.

This is the real problem with AI in enterprise development.
It's not capability. The models are powerful.

It's context. AI operates on a fraction of the knowledge your senior engineers carry in their heads.

Mnemo builds this knowledge graph automatically from your codebase.

Services and their boundaries.
APIs and their consumers.
Dependencies and their blast radius.
Files and their owners.
Decisions and their history.

And then makes all of it available to your AI assistant — automatically, on every session.
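
"Blast radius" can be read as a reverse traversal of the dependency edges: start at the thing you changed and collect everything that depends on it, directly or transitively. The sketch below shows that reading only. It is not Mnemo's implementation, and the service names and the `blast_radius` function are hypothetical.

```python
from collections import defaultdict, deque

def blast_radius(depends_on, changed):
    """Return everything that directly or transitively depends on `changed`.

    `depends_on` is a list of (dependent, dependency) pairs.
    """
    dependents = defaultdict(set)
    for dependent, dependency in depends_on:
        dependents[dependency].add(dependent)

    affected, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, ()):
            if dep not in affected and dep != changed:
                affected.add(dep)
                queue.append(dep)
    return affected

# Hypothetical service graph: checkout -> payments -> ledger <- reports
edges = [
    ("checkout", "payments"),
    ("payments", "ledger"),
    ("reports", "ledger"),
    ("search", "catalog"),
]
affected = blast_radius(edges, "ledger")
print(sorted(affected))
```

Handing the assistant this set before it edits `ledger` is one way to give it the "what breaks if you change something" context described above.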

No more blind generation.
No more code that compiles but breaks something downstream.
No more AI that doesn't know why things are the way they are.

This is what AI-assisted development should actually look like.

🔗 github.com/Mnemo-mcp/Mnemo

Drop a comment if you've ever had AI break something it didn't know existed.

u/killerexelon — 9 days ago

Protégé Short Course at Stanford: hands-on OWL ontology development with Protégé

Hi r/KnowledgeGraph — I’m part of the Protégé team at Stanford, and I wanted to share that we’re running the Protégé Short Course this June.

It’s a hands-on introduction to ontology development with OWL 2 and Protégé. The course is aimed at beginners as well as intermediate users who want a deeper grounding in OWL ontologies, reasoning, querying, and practical ontology-engineering workflows.

Participants receive course materials, including a 221-page hands-on manual developed by the Protégé team, with walkthroughs, diagrams, quizzes, and more than 100 practical exercises.

Early-bird registration is available until May 23.

Details are here:

https://protege.stanford.edu/shortcourse/

Happy to answer questions about the course, the intended audience, or what topics are covered.

Matthew

u/MatthewH2 — 7 days ago
▲ 12 r/KnowledgeGraph+3 crossposts

NornicDB 1.1.0 preview - memory decay as declarative policy - MIT Licensed

hey guys so i wrote a database, NornicDB.

https://github.com/orneryd/NornicDB/releases/tag/v1.1.0-preview-1

it got mentioned in research last month. https://arxiv.org/pdf/2604.11364

the researcher actually commented on issue #100 here:

https://github.com/orneryd/NornicDB/issues/100#issuecomment-4296916032

and i’ve released a preview tag for people to play with. 1.1.0-preview. docker images, mac installer, or build it locally.

the idea is to convert memory decay into policy that can be declared in cypher. it started with the Ebbinghaus forgetting curve but, as the researcher pointed out, that alone is insufficient for agentic memory.

with the policies you can define the decay curve profiles. when you enable memory decay, it sets up policies to match the Ebbinghaus-Roynard model as he describes in the paper. that plus the “canonical graph ledger” bootstrap enables you to move a lot of glue code into the database using the primitives i provide. (cardinality, temporal no-overlap constraints, etc…)

the way it works is a visibility suppression layer in between Cypher and badger. on-access meta is stored in a separate index. there are functions to reveal/decay scoring functions in cypher for debugging queries or bypassing the visibility layer. having the layer there and the meta flushed separately from the data itself maintains negligible performance overhead for enabling it at the data layer.
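
to picture the mechanism, here is a toy sketch of a visibility layer over a key-value store, using the plain Ebbinghaus curve R = exp(-t / S). this is not NornicDB's API and it does not reproduce the Ebbinghaus-Roynard model from the paper: the class, the method names, and the threshold are all invented for illustration.

```python
import math

def retention(elapsed_seconds, stability_seconds):
    """Plain Ebbinghaus forgetting curve: R = exp(-t / S)."""
    return math.exp(-elapsed_seconds / stability_seconds)

class DecayingStore:
    """Toy store where stale entries are hidden, not deleted."""

    def __init__(self, stability_seconds, threshold):
        self.data = {}          # the stored values themselves
        self.last_access = {}   # access metadata, kept apart from the data
        self.stability = stability_seconds
        self.threshold = threshold

    def put(self, key, value, now):
        self.data[key] = value
        self.last_access[key] = now

    def score(self, key, now):
        return retention(now - self.last_access[key], self.stability)

    def get(self, key, now, bypass=False):
        if key not in self.data:
            return None
        if bypass:
            # Debug read: ignore decay and leave access metadata untouched.
            return self.data[key]
        if self.score(key, now) < self.threshold:
            return None  # suppressed by the visibility layer
        self.last_access[key] = now  # a normal read refreshes the memory
        return self.data[key]

store = DecayingStore(stability_seconds=3600, threshold=0.5)
store.put("fact", "x", now=0)
fresh = store.get("fact", now=600)                # 10 minutes later: visible
stale = store.get("fact", now=7800)               # 2 hours after last read: hidden
debug = store.get("fact", now=7800, bypass=True)  # still stored underneath
print(fresh, stale, debug)
```

keeping the access timestamps in their own map mirrors the point above about flushing metadata separately from the data, so a read does not have to rewrite the record it touched.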

it’s research backed. I’m writing my own research paper in response to 4 different papers converging on my database implementation.

726 stars and counting. MIT licensed. neo4j and qdrant driver compatible.

enjoy!

edit: clarity on performance overhead. the way i’ve built it and benchmarked it, the performance overhead is within noise tolerances. +/- <1% variance across runs and overhead measures in nanoseconds in tests.

u/Dense_Gate_5193 — 9 days ago
▲ 3 r/KnowledgeGraph+1 crossposts

We hit this while building an RFP automation system. The client had hundreds of documents: past RFPs, RFIs, proposal templates, internal reference files spanning years. When we asked for a single source of truth, they confessed that they had none. We had a hunch that this was going to lead to a funny outcome.

We ingested everything and started taking queries.

First real tests:

- "What's our pricing?" Three different numbers depending on which document you pull.

- "How many employees?" Four different answers.

- "What's our compliance certification status?" One doc says pending. Another says SOC 2 Type 1. The most recent one says HITRUST.

At cogniswitch, we take a neuro-symbolic approach, but the system still generated answers the team was not really stoked about. On a feedback call, the client's growth team mentioned that the answers were dated. Obviously. The documents were full of conflicts and contradictions.

We went back and asked for the source of truth. There wasn't one. These were live internal documents that had accumulated years of drift. Nobody had reconciled them because nobody needed to until an AI had to answer from all of them at once.

We ended up building a conflict detection layer before the answer generation layer. Scan the corpus for conflicting facts (pricing, headcount, certification status) with different stated values across documents. Flag them. A human resolves which is authoritative. Then you can build anything on top of this knowledge foundation.
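
A bare-bones sketch of what such a scan could look like once facts have been extracted as (document, attribute, value) rows. This is not cogniswitch's implementation: the document names, attribute names, and the `find_conflicts` function are hypothetical, and the hard part in practice (extracting and normalizing the facts) is assumed to have already happened.

```python
from collections import defaultdict

def find_conflicts(facts):
    """Group facts by attribute and keep attributes with more than one value.

    `facts` is a list of (document, attribute, value) tuples.
    Returns {attribute: {value: [documents stating it]}}.
    """
    by_attribute = defaultdict(lambda: defaultdict(list))
    for document, attribute, value in facts:
        by_attribute[attribute][value].append(document)
    return {
        attribute: dict(values)
        for attribute, values in by_attribute.items()
        if len(values) > 1
    }

# Hypothetical extracted facts.
facts = [
    ("rfp_2021.docx", "certification", "pending"),
    ("rfp_2023.docx", "certification", "SOC 2 Type 1"),
    ("template.docx", "certification", "HITRUST"),
    ("rfp_2021.docx", "hq_city", "Austin"),
    ("rfp_2023.docx", "hq_city", "Austin"),
]
conflicts = find_conflicts(facts)
for attribute, values in conflicts.items():
    print(attribute, values)  # queue these for human review
```

Keeping the source documents next to each conflicting value gives the human reviewer what they need to decide which one is authoritative.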

Lesson learnt the hard way: there is a gap in output-only evals. Your benchmark asks whether the AI answered correctly. But if your knowledge base has contradictions, "correct" doesn't have a stable meaning.

There is a clear need for context evals: checking whether your retrieval corpus is internally consistent before you ever run a query. Right now they are barely a discipline. I don't know of good tooling for it. Most teams discover this problem the same way we did.

Anyone building RAG on messy enterprise document sets running into this?

u/Ok_Gas7672 — 14 days ago