u/Floppy_Muppet

New RSI Benchmark ATH! Looking for feedback on research pre-publish.

Hi all~ We just hit an all-time high (ATH) on our internal recursive self-improvement (RSI) benchmark, which we call COMB (Calibrated Observation Matching Benchmark). We created it to evaluate RSI agent harnesses, specifically ones that let the host agent learn from its own experience.

Each benchmark run takes 10-20 hours and simulates tens of thousands of interactions through 3 RSI harness-equipped host agents. It then evaluates how close the harness's belief state is to a blind corpus of 22 ground-truth learnings that are known only to the benchmark judge.

This has been a 7+ month journey. We are currently on benchmark run (and harness iteration) #53, and our recent ATH is 16 of 22 ground truths discovered, with a path to higher scores still ahead 🤞
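
For anyone who wants to picture the arithmetic behind a score like 16/22, here is a minimal sketch. It assumes the judge has already decided which ground truths the harness's belief state matched; the names `RunScore`, `score_run`, and the `GT-01` style IDs are hypothetical illustrations, not the actual COMB judge or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunScore:
    """Tally for one benchmark run (hypothetical structure)."""
    matched: int
    total: int

    @property
    def coverage(self) -> float:
        # Fraction of ground truths the harness discovered.
        return self.matched / self.total


def score_run(ground_truth_ids: set[str], matched_ids: set[str]) -> RunScore:
    """Count how many ground truths were matched.

    Assumption: `matched_ids` comes from a judge that has already compared
    the harness's belief state against the blind corpus.
    """
    if not ground_truth_ids:
        raise ValueError("ground_truth_ids must not be empty")
    # Ignore any ID that is not part of the ground-truth corpus.
    valid = matched_ids & ground_truth_ids
    return RunScore(matched=len(valid), total=len(ground_truth_ids))


# Placeholder IDs: 22 ground truths, 16 of them matched.
ground_truths = {f"GT-{i:02d}" for i in range(1, 23)}
matched = {f"GT-{i:02d}" for i in range(1, 17)}

score = score_run(ground_truths, matched)
print(f"{score.matched}/{score.total} = {score.coverage:.1%}")
```

With these placeholder inputs, the tally works out to 16/22, or roughly 73% coverage of the blind corpus.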

Anyways, reason for the post~ We are planning to start publishing more info and live results from our benchmark/research journey on our website so it's easier for folks to follow along. We would greatly appreciate any and all feedback/questions/reactions to the pre-publish version that we just put up on our dev site before it goes live: https://dev.honeynudger.ai/comb-benchmark

Thanks so much in advance for your time. We look forward to hearing from you all -- don't hold back! 🙌 🙏

P.S. As you'll see mentioned on the page, we're also planning to open source the COMB benchmark in the near future. We hope this helps advance the RSI agent space and gives devs a common rubric for choosing the right harness for their use case, if the self-learning/self-improving agent space balloons the way we think it might.

u/Floppy_Muppet — 5 days ago