u/MeetVege — reddlx

Swapped out Sonnet for GLM 5.1 and K2.6 in Claude Code for a week

The recent subsidy posts here got under my skin. Yeah the 5-hour limits went back up earlier this month but that didn't really answer the question, just made it less urgent. So last week I kept Claude Code but pointed ANTHROPIC_BASE_URL at a different provider and used GLM 5.1 plus K2.6 for the week. Both came out in April so I figured the early integration bugs would mostly be worked out.

It's a Go service I've been working on for a while. Normal week of refactors plus some test scaffolding and a couple new endpoints. Same stuff I'd usually have Sonnet do. Set GLM 5.1 as the default in the env vars, used K2.6 when I needed wider context across files. Went with one of the Anthropic-compatible aggregator routes rather than wiring two providers separately, because I didn't want to rewrite my session scripts.

GLM 5.1 surprised me. I'd written off the benchmark hype as PR but for the kind of day-to-day refactor work I do, the gap to Sonnet wasn't really noticeable after a couple days. It's more verbose than Sonnet. Double checks itself a lot more than I'd like. I can't really speak to the frontend agent stuff people are excited about because I don't do enough of it.

K2.6 was solid for the wide-context tasks. Fed it about 80k tokens for a migration across a few packages and references tracked correctly. The weak spot is the same one I hit with every open model: custom tools with three or four nested args. Sonnet handles those fine; K2.6 needs a retry maybe a quarter of the time.

Sonnet's hallucinations are sneaky. It'll invent a function signature that looks like something the library would have. GLM's are louder: the syntax is valid but the module it references isn't in your imports, so it fails as soon as you build. Bad in different ways but I'd rather have the loud kind in review.

One thing that tripped me up early. The model env var names in Claude Code are tied to Sonnet and Opus, so when I set ANTHROPIC_DEFAULT_SONNET_MODEL to GLM, I forgot Opus was still pointing at the Anthropic default and was silently falling back. Burned a chunk of the first morning before I noticed. Make sure you set every model env var, not just the obvious one.
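A small pre-flight check along these lines would catch that kind of silent fallback. It's only a sketch: ANTHROPIC_DEFAULT_SONNET_MODEL is the variable named above, while the Opus and Haiku names are assumed to follow the same pattern (check them against the Claude Code docs for your version), and the helper name is made up.

```python
import os

# ANTHROPIC_DEFAULT_SONNET_MODEL is the variable named in the post. The Opus and
# Haiku names are assumed to follow the same pattern; verify before relying on them.
MODEL_VARS = (
    "ANTHROPIC_DEFAULT_SONNET_MODEL",
    "ANTHROPIC_DEFAULT_OPUS_MODEL",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL",
)


def unset_model_vars(env=None):
    """Return the model variables that are missing or blank in env."""
    env = os.environ if env is None else env
    return [name for name in MODEL_VARS if not env.get(name, "").strip()]
```

Calling `unset_model_vars()` with no arguments checks the current environment. A non-empty result means at least one model tier would still resolve to the default.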

On cost. Can't give a clean comparison because subscription vs subscription is messy. But the same week of work that usually has me watching my Claude Code session burn down by Friday afternoon felt fine on the new setup. Not the meme-y "I saved 75%" story, but not a small difference either.

Latency is the one thing that hasn't really faded. Sonnet you don't notice, you just work. GLM is close. K2.6 has this little pause before each tool call, which fades in batch work but stands out when you're typing back and forth. Don't see that in any benchmark.

Anyway. Subsidy threads were what got me to actually try it instead of speculating.

u/MeetVege — 1 day ago

▲ 6 r/LLMDevs

Shared RAG index with metadata filters started cracking around 30 tenants

We've been doing customer-facing RAG for about a year. Each customer uploads their own docs, and they only see results from their own corpus.

Started in a single Pinecone index with namespaces per tenant. Worked fine through the first 10 or so customers, then namespace count itself became an ops headache, so we flipped to a single namespace and tenant_id metadata filter on every query. That carried us to maybe customer 18. Then a few things started getting weird.

Recall got noticeably worse for tenants with smaller corpora. I don't have a great theory for why, but my hunch is that hybrid scoring inside a giant shared index starts being dominated by the term distribution of larger tenants. If 80% of your docs are from three big customers, and a fourth customer searches a term that's common in their own docs but rare in the shared corpus, BM25 weights end up looking strange. The vector side was less obviously broken. With top-K retrieval and a metadata filter, small-corpus tenants were sometimes getting fewer than K candidates back at all, which then fed a reranker that didn't have enough to work with.
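Both effects are easy to reproduce on toy numbers. The sketch below is an illustration of the hunch, not a diagnosis: the corpus sizes and scores are invented, the IDF formula is the common non-negative BM25 form, and "take the top K, then filter" is just one way a filtered query can come back short, not necessarily what any particular vector store does.

```python
import math


def bm25_idf(total_docs, docs_with_term):
    """BM25 inverse document frequency (non-negative variant)."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


# Invented numbers: a small tenant has 200 docs and 150 of them contain the term.
# In a shared index of 50,000 docs, no other tenant uses that term.
idf_tenant_only = bm25_idf(200, 150)   # term is common here, so the weight is low
idf_shared = bm25_idf(50_000, 150)     # term looks rare here, so the weight is high


def top_k_then_filter(hits, tenant_id, k):
    """Post-filtering: take the global top k, then drop other tenants' hits."""
    top = sorted(hits, key=lambda hit: hit["score"], reverse=True)[:k]
    return [hit for hit in top if hit["tenant_id"] == tenant_id]


def filter_then_top_k(hits, tenant_id, k):
    """Pre-filtering: restrict to the tenant first, then take the top k."""
    own = [hit for hit in hits if hit["tenant_id"] == tenant_id]
    return sorted(own, key=lambda hit: hit["score"], reverse=True)[:k]


# Invented scores: a large tenant's docs outscore the small tenant's for this query.
hits = [{"tenant_id": "big", "score": 0.9 - i * 0.01} for i in range(20)]
hits += [{"tenant_id": "small", "score": 0.6 - i * 0.01} for i in range(12)]

starved = top_k_then_filter(hits, "small", k=10)
full = filter_then_top_k(hits, "small", k=10)
```

With these numbers the shared-index IDF comes out more than ten times the tenant-only IDF, so a term that barely discriminates inside the tenant's own corpus gets a very large weight. The post-filtered query returns nothing for the small tenant, while the pre-filtered one returns a full K.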

The other issue was operational. A reindex of any single tenant's docs meant reprocessing them inside the shared ingestion pipeline. Updates to one customer's content sometimes stalled because of an ingestion job from a different customer. Not a great look when the customer with the slow job is also the one paying the most. Granted, that one isn't really an index-topology problem. You could parallelize workers and keep the index shared. But the two failure modes started compounding, and the simplest fix for both at once was just per-tenant everything.
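For the ingestion side, one way to stop a big tenant's backlog from blocking everyone else while keeping the index shared is to let tenants take turns instead of draining one FIFO queue. This is a toy sketch of that scheduling idea only; the function name and job labels are hypothetical and it says nothing about how the actual pipeline is built.

```python
from collections import deque


def interleave_by_tenant(jobs):
    """Reorder (tenant_id, job) pairs so tenants take turns, round-robin."""
    queues = {}
    for tenant_id, job in jobs:
        queues.setdefault(tenant_id, deque()).append(job)

    ordered = []
    while queues:
        # Copy the keys so finished tenants can be removed during the pass.
        for tenant_id in list(queues):
            ordered.append((tenant_id, queues[tenant_id].popleft()))
            if not queues[tenant_id]:
                del queues[tenant_id]
    return ordered
```

With three queued jobs from a large tenant and one from a small tenant, the small tenant's job runs second rather than last.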

So now I'm trying to decide whether to flip to per-tenant isolated indexes. The downside is obvious. Thirty separate indexes to keep an eye on, plus you're paying for storage thirty times instead of once. You also lose the ability to do cross-tenant analytics, which we do use occasionally for product decisions.

What I keep going back and forth on is whether this is an architectural question or just a "your shared index needs better scoring" question. At 30 tenants both stories are plausible. At 100 I don't know which one breaks first, and the migration cost of switching topologies later is not small.

Mostly trying to figure out how other people drew the line.

u/MeetVege — 3 days ago

▲ 2 r/LLMDevs

Shared RAG index with metadata filters started cracking around 30 tenants

We've been doing customer-facing RAG for about a year. Each customer uploads their own docs, and they only see results from their own corpus.

So now I'm trying to decide whether to flip to per-tenant isolated indexes. The downside is obvious. Thirty separate indexes to keep an eye on, plus you're paying for storage thirty times instead of once. And you lose the ability to do cross-tenant analytics, which we do use occasionally.

Been prototyping with Denser Retriever for the last couple of weeks partly because its data model treats a knowledge base as a first-class resource with its own ID, and you create one through the same API you upload docs to. Per-tenant KB ends up being one POST per customer signup, which is the cleanest take on this I've come across. Not sure I've stress-tested it enough at scale yet to claim anything beyond "the ergonomics are easier."

The thing I'm still stuck on is whether this is an architectural question or just a "your shared index needs better scoring" question. At 30 tenants both stories are plausible. At 100 I don't know which one breaks first.

Mostly trying to figure out how other people drew the line.

u/MeetVege — 3 days ago

▲ 19 r/dataengineering

How are you actually monitoring a RAG pipeline in prod? Inherited one and there's basically nothing to look at

Maybe this is a known thing but it keeps catching me off guard.

We have a RAG service running an internal assistant. Lived with the AI/ML team for a year and change, just got moved to my team in the last reorg. Code runs, embeddings get computed on schedule, vector store updates. From a pipeline perspective it looks like a normal cron'd job, exit zero or paged.

Then I started asking the things I ask about every other data asset and the answers were either bad or didn't exist. Freshness SLA? "Whatever the cron is." So 6 hours, sometimes more if a batch hangs. Quality monitoring? Don't really have any. Users complain in Slack and that's the signal. If an embedding job half fails and leaves a doc in a weird state in the index, would anyone notice? Long pause, then "eventually I guess." No view on whether retrieval is trending differently this week vs last.
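A freshness check doesn't need much. The sketch below assumes you can get a "last successfully embedded at" timestamp per doc from somewhere (a metadata table, the index itself); that mapping and the function name are hypothetical, and the 6 hour figure is just the cron interval mentioned above.

```python
from datetime import datetime, timedelta, timezone


def stale_docs(last_embedded_at, sla, now=None):
    """Return doc ids whose last successful embedding is older than sla, oldest first."""
    now = now or datetime.now(timezone.utc)
    overdue = {
        doc_id: now - embedded_at
        for doc_id, embedded_at in last_embedded_at.items()
        if now - embedded_at > sla
    }
    return sorted(overdue, key=overdue.get, reverse=True)
```

Running it with `sla=timedelta(hours=6)` after every scheduled job and alerting on a non-empty result turns "whatever the cron is" into something that can actually page.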

Coming from dbt + Airflow + a few years of pushing Great Expectations on every team I can find, this feels like 2014. No row count equivalent. The closest thing in scope is "did the embedding job exit zero." That's it.
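The nearest thing to a row count check is probably reconciling chunk counts per doc between the source side and the index. A rough sketch, assuming both sides can be dumped to a `{doc_id: chunk_count}` mapping; the function and the mapping shape are made up for illustration.

```python
def reconcile(expected_chunks, indexed_chunks):
    """Compare per-doc chunk counts from the source against what the index holds."""
    report = {"missing": [], "partial": [], "orphaned": []}
    for doc_id, expected in expected_chunks.items():
        found = indexed_chunks.get(doc_id, 0)
        if found == 0:
            report["missing"].append(doc_id)   # never made it into the index
        elif found != expected:
            report["partial"].append(doc_id)   # half-finished or duplicated write
    # Docs the index still holds but the source no longer knows about.
    report["orphaned"] = [d for d in indexed_chunks if d not in expected_chunks]
    return report
```

A half-failed embedding job would show up as a `partial` entry instead of waiting for someone to complain.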

Started pulling raw logs into DuckDB to look and some of it was rough. Roughly 7% of live queries are serving an answer despite the top retrieval score being well below where I'd want it. No abstention, no flag, no escalation. The model just keeps talking and the user takes it on faith.
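The same number can be tracked as a metric, and the same cutoff can gate answers. A minimal sketch, assuming each logged query has a single top retrieval score; the threshold is a placeholder, since what counts as "well below" depends entirely on the retriever's score scale.

```python
def low_confidence_rate(top_scores, threshold):
    """Fraction of queries whose best retrieval score fell below threshold."""
    if not top_scores:
        return 0.0
    return sum(score < threshold for score in top_scores) / len(top_scores)


def should_abstain(top_score, threshold):
    """True when nothing was retrieved or the best hit is below threshold."""
    return top_score is None or top_score < threshold
```

`low_confidence_rate` is the monitoring side (chart it, alert when it moves), and `should_abstain` is the serving side (flag or escalate instead of answering).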

The other thing bugging me: none of the metrics the AI team uses are useful for this in production. They care about precision@k on an offline eval set they hand-curated almost a year ago. I care about how the thing is actually behaving in prod this week, which they have no view on.
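A week-over-week view can start from the same query logs. This sketch buckets top retrieval scores by ISO week and takes the median; the `(date, top_score)` row shape is an assumption about what the logs contain, and the median is just one reasonable summary.

```python
from collections import defaultdict
from statistics import median


def weekly_median_top_score(rows):
    """rows: iterable of (date, top_score). Returns {(iso_year, iso_week): median}."""
    by_week = defaultdict(list)
    for day, score in rows:
        iso = day.isocalendar()
        by_week[(iso[0], iso[1])].append(score)
    return {week: median(scores) for week, scores in sorted(by_week.items())}
```

Comparing the latest week against the previous one gives a rough "is retrieval drifting" signal without needing a labeled eval set.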

Tons written about RAG retrieval quality, almost nothing about RAG observability as a thing you actually operationalize. Would honestly be glad to hear from anyone who's built a real monitoring layer here. Otherwise it kind of feels like we're all running on user complaint driven signal.

u/MeetVege — 10 days ago

▲ 0 r/digitalnomad

About 60% of my invoices now settle in USDC because half my clients are in places where wire transfers are either expensive or just stuck for days. Cool in theory.

In practice the off-ramp is where everything breaks. Wise flagged my account twice in a quarter once it noticed regular crypto deposits. Revolut's crypto sell limits are fine until rent is actually due. I fall back on a local exchange in Tbilisi or Bangkok depending on where I am, but spread + withdrawal fees end up around 2-3% before I even get cash, and ATM fees bite again.

I keep hearing nomads say they just hold stables and "spend directly," which sounds great until I try to figure out what that actually means at a Tesco self-checkout.

Tried a couple of CEX-backed cards (Crypto.com, BitPay style) but the custody piece always made me uneasy after FTX. The self-custody side seems newer and I haven't really pressure-tested any of them.

Mostly the bit I keep getting stuck on is the last mile from chain to checkout without giving up custody or eating 3% twice.
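For what "eating 3% twice" adds up to: fees taken in sequence compound rather than add. A tiny sketch with illustrative rates only, not quotes from any exchange or card.

```python
def net_after_fees(amount, fee_rates):
    """Apply each percentage fee in sequence to what is left after the previous one."""
    for rate in fee_rates:
        amount *= 1 - rate
    return amount


def effective_fee(fee_rates):
    """Combined fee rate for a chain of percentage fees."""
    return 1 - net_after_fees(1.0, fee_rates)
```

Two 3% hops work out to an effective 5.91%, slightly under a flat 6% because the second fee applies to a smaller amount.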

u/MeetVege — 16 days ago

▲ 183 r/OsakaTravel

Just got back from my first osaka trip, 3 nights in feb. had a great time and dotonbori absolutely lived up to it the first night, the energy is real. did kuromon market, osaka castle, took the day in nara, ate way too much takoyaki, all good stuff.

but by night 2 i kept getting this feeling that the parts i was seeing were the very top layer and there was a whole other osaka i wasnt getting to. couple things made me think that:

we wandered into tenjinbashisuji one afternoon almost by accident and ended up eating at this small place off a side street where nobody spoke english and it was the best meal of the trip. completely different vibe from the main tourist drag. then on the last night we got tired of the dotonbori area and went looking for somewhere quieter, ended up in what i later learned people call ura-namba, tiny standing bars, alleys, locals after work. felt like we'd been one block away from this for 2 days without knowing.

and shinsekai i didnt make it to but everyone whose opinion i trust says i screwed up by skipping it.

so basically im realizing 3 nights wasnt enough and im already thinking about coming back. for people who know osaka well, what's the layer past the obvious stuff? specific neighborhoods, specific kinds of places, things that take a couple visits to find? happy to pin a list for whenever i make it back.

u/MeetVege — 25 days ago