▲ 1 r/ollama

Has anyone tried sharing a GPU server instead of everyone renting their own?

Has anyone tried running a shared open-model server instead of everyone renting their own GPU?

Instead of spinning up separate RunPod/Vast instances, I'm wondering whether it makes more sense to run one larger GPU server with Qwen/DeepSeek/GLM etc. loaded, then let multiple people use it with rate limits and queues.

Kind of like joining a game server instead of renting your own.

My assumption is most people's workloads are bursty enough that overall GPU utilization would be much higher.

Has anyone done this? If not, what's the biggest downside, privacy, noisy neighbours, latency, fairness, or something else?

reddit.com
u/michaelmanleyhypley — 18 hours ago
▲ 11 r/ollama

anyone here renting GPUs only when their local box taps out?

I’m curious how many people are mostly local, but occasionally need cloud GPU for bigger runs.

Like you do 90% on your 3090/4090/Mac/whatever, then hit a wall with VRAM or batch size and rent something for a few hours.

Do you usually just use RunPod/Vast/etc directly, or do you have some script/tooling around it?

I’m playing with the idea of treating cloud GPU runs more like “jobs”:

send command
set max spend
stream logs
save output
auto shut down

Less like managing a mini server every time.

Does that actually fit how people work, or is everyone mostly keeping instances around?

reddit.com
u/michaelmanleyhypley — 4 days ago

do you guys leave cloud GPUs running or spin them up per job?

For people running ComfyUI stuff in the cloud, are you mostly keeping a pod/server alive, or spinning it up only when you need to run a batch?

I keep seeing the same tradeoff

leave it on = fast, but you pay for idle
spin it up = cheaper, but startup/setup can be annoying

I’m messing around with a compute job setup where you just send the workflow/job, it runs, saves the output, then shuts itself down.

Feels like it would be good for batches, maybe bad for quick one-off images.

Curious how others handle it.

reddit.com
u/michaelmanleyhypley — 4 days ago
▲ 1 r/SaaS

My AI agent would steal your money if you asked nicely

I thought my AI agent was pretty solid.

Ran it through Badgr Agent Benchmark a 30 basic tests for stuff like prompt injection, privacy, tool use, hallucinations, coding, support, finance, healthcare, legal, cyber, etc.

It got 63/100.

Bit rough.

The weird part was it didn’t fail where I expected. It was mostly things like:
saying it did an action when it didn’t
being way too confident
following sketchy instructions
not being careful enough with private info

Kinda made me realise “does it work?” is the wrong question.

Better question is probably
where does it break before real users find out?
How are you guys testing your agents before launch?

reddit.com
u/michaelmanleyhypley — 4 days ago

I tested my LangChain agent against 30 messy prompts, it scored 82/100

I’ve been building a LangChain/LangGraph agent and got tired of testing it by vibes.

So I ran it through 30 messy prompts:

  • prompt injection
  • private data requests
  • bad tool calls
  • fake tool success
  • unsafe coding asks
  • legal/finance/health edge cases

It scored 82/100.

The weird part is it looked fine in normal testing, but still failed on tool discipline and a couple of prompt-injection cases.

That’s basically why I started building Badgr Benchmark, a quick way to test an agent before users find the edge cases.

Curious how others here are testing LangChain agents.

LangSmith evals? Custom prompt lists? Manual red-teaming? Or just YOLOing it?

reddit.com
u/michaelmanleyhypley — 7 days ago

I thought my agent was ready. It got 68/100.

Thought my agent was basically ready, so I ran it through the Badgr Agent Readiness Test.

30 checks for stuff like prompt injection, privacy leaks, unsafe answers, weird tool behavior, and overconfident replies.

It got 68/100 lol.

Not a disaster, but also not exactly let real users use it.

Curious how everyone else is testing agents before shipping them?

reddit.com
u/michaelmanleyhypley — 7 days ago

I tested whether my AI agent would go full Skynet

Been messing with an agent and wanted to see how sketchy it gets before I put it near actual users.

Ran it through 30 random-ish failure tests..prompt injection, privacy leaks, unsafe requests, tool mistakes, overconfident answers, that kind of stuff.

It got 68/100, which is honestly lower than I expected but probably better to find out now than after launch.

How are you all testing agents before production?

u/michaelmanleyhypley — 7 days ago
▲ 0 r/docker

Would you run a tiny watchdog container for self-healing Docker hosts?

I’ve been testing a small idea for Docker hosts.

One container watches a few important containers and only runs approved fixes when something goes down.

My sandbox is simple:

  • nginx app container
  • chaos container that kills it every 5 hours
  • watchdog container that restarts it if it stays down

So far:

Restart attempts: 5
Successful fixes: 5
Failed fixes: 0

What I’m trying to avoid is a scary “AI agent with root access” situation.

The safety rules are:

  • dry-run by default
  • allowlisted fixes only
  • cooldown between retries
  • max attempts before giving up
  • audit log for every action
  • no arbitrary shell commands

Right now it is basically detect stopped container > restart > verify > log result.

Would anyone running small Docker hosts actually want this, or do restart policies / systemd / Monit already cover enough?

reddit.com
u/michaelmanleyhypley — 8 days ago
▲ 0 r/devops

I built a small open source tool for CI failure triage

https://preview.redd.it/fi8xfz8xhe8h1.png?width=1200&format=png&auto=webp&s=f587a8b27f73d86481f8e8bd7cfa8e6f30bc518c

Failed pipelines can eat a surprising amount of time on small teams.

At work, the painful part is usually not the fix itself. It is opening the run, finding the failed job, scrolling through noisy logs, pulling out the real error, and giving the dev/team enough context.

We only have a small DevEx team, so I built Badgr Agent CI.

It works with GitHub Actions and Azure Pipelines. When CI fails, it posts a PR comment/thread with:

  • likely cause
  • evidence from logs
  • suggested fix
  • confidence level

GitHub Actions:

permissions:
  contents: read
  actions: read
  pull-requests: write

steps:
  - uses: actions/checkout@v4
  - run: npm test

  - name: Badgr Agent CI
    uses: michaelmanly/badgr-ci@v1
    if: failure()
    with:
      badgr_api_key: ${{ secrets.BADGR_API_KEY }}
      github_token: ${{ secrets.GITHUB_TOKEN }}

Azure Pipelines:

steps:
  - script: npm install
  - script: npm test

  - task: BadgrCI@1
    condition: failed()
    env:
      BADGR_API_KEY: $(BADGR_API_KEY)
      SYSTEM_ACCESSTOKEN: $(System.AccessToken)

The agent/action code is open source. The diagnosis API is hosted.

It does not push commits, rerun builds, merge PRs, or change infrastructure. It just tries to make CI triage faster.

How do your teams handle failed pipeline triage today, manual log digging, DevEx/SRE rotation, or internal tooling?

reddit.com
u/michaelmanleyhypley — 9 days ago

I made an Azure Pipelines task that explains failed builds

https://preview.redd.it/4a7l8lvbhe8h1.png?width=1200&format=png&auto=webp&s=4ad9c79cc9c617c065552d5ddc00ccc52f72b78a

Half my week can disappear into failed Azure Pipelines.

Usually the painful part is not the fix, it is finding the real error inside thousands of log lines and giving someone enough context to act on it.

So I made Badgr Agent CI.

It runs only when a pipeline fails, reads the failed task logs, and posts a PR thread with:

  • likely cause
  • evidence
  • suggested fix
  • confidence level

Install the Azure DevOps extension, add BADGR_API_KEY(BYOK), then add:

steps:
  - script: npm install
  - script: npm test

  - task: BadgrCI@1
    condition: failed()
    env:
      BADGR_API_KEY: $(BADGR_API_KEY)
      SYSTEM_ACCESSTOKEN: $(System.AccessToken)

The agent is open source. The diagnosis API is hosted.

It does not change code, rerun builds, or auto-fix anything.

How do your teams handle failed Azure Pipeline triage today?

reddit.com
u/michaelmanleyhypley — 9 days ago
▲ 0 r/AZURE

I got tired of digging through failed Azure Pipeline logs

https://preview.redd.it/pdsxnrfufe8h1.png?width=1200&format=png&auto=webp&s=3b566e360566561dd71e20f56fa936f0d4525a1a

Usually the painful part is not the fix, it is finding the real error inside thousands of log lines and giving someone enough context to act on it.

So I made Badgr Agent CI.

It runs only when a pipeline fails, reads the failed task logs, and posts a PR thread with:

  • likely cause
  • evidence
  • suggested fix
  • confidence level

Install the Azure DevOps extension, add BADGR_API_KEY, then add:

steps:
  - script: npm install
  - script: npm test

  - task: BadgrCI@1
    condition: failed()
    env:
      BADGR_API_KEY: $(BADGR_API_KEY)
      SYSTEM_ACCESSTOKEN: $(System.AccessToken)

The agent is open source. The diagnosis API is hosted.

It does not change code, rerun builds, or auto-fix anything.

How do your teams handle failed Azure Pipeline triage today?

reddit.com
u/michaelmanleyhypley — 9 days ago
▲ 0 r/vscode

VS Code AI extensions send a lot of repeated context

https://preview.redd.it/wz0nfoye378h1.png?width=1052&format=png&auto=webp&s=a453a76633c4a7102df627bb34d06bc000a697bf

I’ve been testing Copilot-style coding workflows in VS Code, and the thing that surprised me is how much repeated/noisy context gets sent around once you use chat, diffs, logs, test output, and agent mode.

The rough numbers I got were from Copilot testing, so not claiming this applies equally to every extension.

https://preview.redd.it/o4yu4ilrt68h1.png?width=1222&format=png&auto=webp&s=274c14881e1c8373301222fa57c31d66dd2c3e11

But the pattern seems real: pure chat does not waste much, while code-heavy/agent workflows have more room for token cleanup.

So I made a small local Open Source OpenAI-compatible proxy:

npx badgr-auto

The goal is not switching models.

It is just:

dedupe repeated context
trim noisy logs
compress long diffs
keep useful error/code signal
show before/after token counts

With plan limits and usage caps becoming more common, wasted context seems like it matters more now.

Would you use a local proxy that keeps the same model but reduces wasted tokens before requests leave VS Code?

reddit.com
u/michaelmanleyhypley — 10 days ago

With the new Copilot plan limits, wasted tokens matter a lot more now

https://preview.redd.it/0c7fqxzj378h1.png?width=1052&format=png&auto=webp&s=4e01cb55311b649a1d7fb09153c5566662470d88

With GitHub/Copilot moving more toward plan limits and usage caps, I’ve been paying more attention to how many tokens coding-agent workflows burn.

Pure chat is not really the issue.

The waste shows up when the request includes repeated repo context, long diffs, logs, test output, and agent history.

So I made a small local Open Source OpenAI-compatible proxy:

https://github.com/michaelmanly/badgr-auto 

npx badgr-auto

It’s just token cleanup

Your coding tool
- local proxy
- dedupe repeated context
- trim noisy logs
- compress long diffs
- keep useful code/error signal
- show estimated token savings

Rough numbers from my own testing:

https://preview.redd.it/ppwa3cfos68h1.png?width=1222&format=png&auto=webp&s=489853a1f5727ed699b518198ebad7e21fa0a68e

still rough, but agent mode seems like where token optimization matters most.

Would you use something like this if it kept the same model but reduced wasted context?

reddit.com
u/michaelmanleyhypley — 10 days ago
▲ 6 r/vastai

I accidentally left my gpu running and got billed $200

https://preview.redd.it/nol1hasvm48h1.jpg?width=1080&format=pjpg&auto=webp&s=72de438a0c327a74152b4d569d0be1b5e9e37533

I did the classic rented GPU mistake .. started a job, forgot about the instance, came back later, and realised I’d been billed for a GPU that was basically doing nothing.

So I made a small Open Source tool that emails you when your rented GPU looks idle but is still running.

npx gpu-monitor

The basic idea:

GPU is on
→ utilization stays low
→ process/VRAM looks idle
→ idle threshold is reached
→ you get an email warning before wasting more money

It also checks GPU utilization, VRAM usage, running GPU processes, and idle time so you can see whether the machine is actually working or just sitting there billing.

how do you currently avoid leaving instances running too long?

Would email alerts be useful, or would you rather get Discord/Telegram/Slack alerts?

reddit.com
u/michaelmanleyhypley — 10 days ago
▲ 3 r/Vllm

what broke first when your setup got real traffic?

I’m curious about actual vLLM serving pain, not benchmark numbers.

When you moved from “it runs” to “people/jobs are actually hitting the endpoint,” what was the first thing that broke?

Was it:

  • OOM
  • TTFT
  • throughput
  • prefix caching not helping as much as expected
  • long-context requests killing everyone else
  • bad batching settings
  • cold starts
  • OpenAI-compatible endpoint weirdness
  • multi-GPU / tensor parallel issues
  • logs not making the bottleneck obvious

Would be useful to see hear some configs. ex max batched tokens, max model len, rough concurrency etc

reddit.com
u/michaelmanleyhypley — 11 days ago

This team burned through their monthly credits in 2 weeks

GitHub Copilot updated enterprise pricing…

This team burned through their monthly credits in 2 weeks

AI isn’t “cheap infra” anymore, it’s usage-based and scales fast.

So I set up a new flow for them:

- Self-hosted open models (DeepSeek / GLM / Qwen)
- Smart routing (cheap models for autocomplete, strong models for reasoning)
- GPUs only running during work hours

Same dev UX (Continue with Badgr-Auto in VS Code)
80% cost reduction
5× more predictable spend

Most teams haven’t realised this yet.

reddit.com
u/michaelmanleyhypley — 12 days ago
▲ 3 r/ollama

This team burned through their monthly credits in 2 weeks

GitHub Copilot updated enterprise pricing…

This team burned through their monthly credits in 2 weeks

Reality is hitting: AI isn’t “cheap infra” anymore, it’s usage-based and scales fast.

So I set up a new flow for them:

→ Self-hosted open models (DeepSeek / GLM / Qwen)
→ Smart routing (cheap models for autocomplete, strong models for reasoning)
→ GPUs only running during work hours

Same dev UX (Continue with Badgr-Auto in VS Code)
~80% cost reduction
5× more predictable spend

Most teams haven’t realised this yet.

reddit.com
u/michaelmanleyhypley — 12 days ago
▲ 5 r/Vllm

For rented GPUs, what hurts more, price or wasted idle time?

I keep seeing people compare GPU hourly prices, but I’m wondering if the bigger issue is wasted runtime.

For people using RunPod, Vast, Lambda, ComfyUI, vLLM, or LoRA training:

How much of the pain is actually:

  • paying while the GPU sits idle
  • forgetting teardown
  • model loaded in VRAM but no compute happening
  • job failed early but the machine kept running
  • logs unclear enough that you leave it up to debug
  • storage or region issues making teardown risky

Is raw hourly price the main problem, or is controlled runtime/teardown the bigger issue?

reddit.com
u/michaelmanleyhypley — 15 days ago

How are people handling retries and spend limits for AI APIs in production?

I’ve been looking at a recurring problem with AI APIs in production.

A provider times out or returns a 429, so the app retries. But then a few things get messy:

  • how long do you back off before switching providers?
  • do you treat timeouts as potentially billed?
  • how do you stop concurrent retries from overshooting a spend cap?
  • when do you mark a provider unhealthy and temporarily skip it?
  • do you keep confirmed spend separate from possible exposure?

I’m working on a small open-source TypeScript package called ai-prod-guard that handles hard per-request/session caps, Retry-After backoff, fallback providers, and local provider-health memory.

Still early, so I’m curious how teams running AI features in production are handling this today.

Are you building it in-house, using a gateway, or mostly relying on provider SDK defaults?

u/michaelmanleyhypley — 16 days ago
▲ 8 r/RunPod+1 crossposts

How are people checking if ComfyUI is still holding VRAM after a run?

I’ve noticed with ComfyUI that a workflow can finish, but the GPU still shows memory in use or the rented box keeps running idle.

That gets annoying if you’re using RunPod/Vast/Lambda or any hourly GPU box.

I’m testing a small open-source CLI called gpu-monitor that watches nvidia-smi and shows idle GPUs plus which PIDs are holding VRAM.

Local check:

npx gpu-monitor check --hourly-rate 2.50

See processes holding VRAM:

npx gpu-monitor processes

For people running ComfyUI on rented GPUs, are you checking this manually with nvidia-smi, using provider dashboards, or just shutting down the pod when you remember?

u/michaelmanleyhypley — 18 days ago