
I revamped an old project of mine to evaluate hallucination in popular deployed LLMs, and oh my god.
When I was an undergrad, I built a boolean algebra project with simple logic: it took the variables (distinct letters) as input, mapped them to boolean values, and generated the corresponding truth table.
With that truth table, it solved a boolean expression given in simple string notation.
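For anyone wondering what that looks like in practice, here's a minimal sketch of the idea in Python. It is not the original project's code, and the notation is an assumption on my part: single-letter variables combined with Python's own `and`, `or`, `not`. The function name `truth_table` is made up for illustration.

```python
import itertools
import re

def truth_table(expr):
    """Return (variables, rows) for a boolean expression string.

    Assumed notation (not the original project's): single-letter variables
    combined with Python's `and`, `or`, `not` and parentheses.
    """
    # Distinct single-letter names, sorted for a stable column order.
    variables = sorted(set(re.findall(r"\b[A-Za-z]\b", expr)))
    code = compile(expr, "<expr>", "eval")
    rows = []
    # Every combination of False/True across the variables: 2**n rows.
    for values in itertools.product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        # eval is fine for a sketch; don't feed it untrusted input.
        result = bool(eval(code, {"__builtins__": {}}, env))
        rows.append((values, result))
    return variables, rows

variables, rows = truth_table("(A and B) or not C")
for values, result in rows:
    print(dict(zip(variables, values)), "->", result)
```

Three variables give 8 rows, and the row count doubles with every extra variable, which is why exhaustive enumeration only works for small expressions.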
I was randomly browsing my GitHub repos, came across my old projects, and found that only this one has 2 stars, which made me kinda proud (because no other repo has any stars).
So I used Claude Code to see what we could do with this project.
And randomly I had a brainwave: plug it into LLMs to see if the slop is real (since boolean logic needs to work at scale).
So I ported the base code from Java to Python, experimented with a few use cases, and built a deterministic boolean algebra engine that evaluates expressions by exhaustive truth table enumeration, cross-verified with z3. Then I used it as an oracle to benchmark tinyllama and llama3.2:3b on satisfiability questions: can two rules ever be true simultaneously?
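To make the satisfiability question concrete, here's a rough sketch of the oracle side. It is not the package's actual API: `jointly_satisfiable` and the three rules are hypothetical names I made up, and I left the z3 cross-check out to keep it dependency-free.

```python
import itertools

def jointly_satisfiable(rule_a, rule_b, variables):
    """Return an assignment where both rules hold, or None if none exists."""
    for values in itertools.product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if rule_a(env) and rule_b(env):
            return env  # witness: both rules are true here
    return None  # checked every row, the rules can never both be true

# Hypothetical rules over variables A and B.
rule_1 = lambda v: v["A"] and not v["B"]
rule_2 = lambda v: v["A"] or v["B"]
rule_3 = lambda v: not v["A"]

witness = jointly_satisfiable(rule_1, rule_2, ["A", "B"])   # compatible pair
conflict = jointly_satisfiable(rule_1, rule_3, ["A", "B"])  # contradictory pair
print(witness, conflict)
```

The ground-truth label for each benchmark case is then just whether a witness exists, so the expected answer never depends on a model's opinion.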
Both scored 50%. Coin flip. But the failure mode is the interesting part:
- tinyllama always answered "yes" — constant output
- llama3.2:3b always answered "no" — constant output
Neither model is reasoning case by case. They're outputting a prior. The per-case strips in the chart make it obvious — uniform colour across every case, no variation.
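Here's a small sketch of why a constant answer lands on exactly 50%, assuming the benchmark is balanced between satisfiable and unsatisfiable cases. The data below is made up for illustration and is not my actual benchmark output; `summarize` is a hypothetical helper.

```python
def summarize(answers, truth):
    """Accuracy plus a flag for a model that gives one answer everywhere."""
    correct = sum(a == t for a, t in zip(answers, truth))
    return {
        "accuracy": correct / len(truth),
        "constant_output": len(set(answers)) == 1,
    }

# Made-up balanced set: half the cases satisfiable (True), half not (False).
truth = [True, False] * 5

always_yes = summarize([True] * 10, truth)   # "yes" on every case
always_no = summarize([False] * 10, truth)   # "no" on every case
print(always_yes, always_no)
```

Both constant strategies get half the cases right on a balanced set, so accuracy alone can't tell them apart from guessing. The `constant_output` flag is what exposes the failure mode.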
This isn't a small-model problem. It's an architectural one. Transformers aren't built to enumerate truth tables. The right fix isn't a bigger model; it's a deterministic layer that does the computation the model can't.
!pip install boolean-algebra-engine
Repo + benchmark: Check out here
Update 1: Since I got bashed on another sub, here's a test on Gemma.