u/clairedoesdata

Handling multiple MCP servers and multiple models together

To everyone who has connected to multiple MCP servers and multiple models from different providers: how do you maintain the infrastructure while keeping token usage in check? I use an OSS LLM gateway for this and it seems to work fine. I'm curious whether people are handling this in other or better ways. Share your infra in the comments.

reddit.com
u/clairedoesdata — 5 days ago
▲ 3 r/mcp


Primarily putting this up for those newer to the field who need help sifting through all the benchmarks.

OSWorld-V evaluates models on realistic desktop productivity tasks (multi-application use, file management, etc.). GPT-5.4 scored 75% on the benchmark this week, narrowly beating the 72.4% human baseline.

This benchmark is useful for learners because it gives a grounded, quantifiable measure of what most people mean by "AI agents". Many popular benchmarks (GSM8K, MMLU, HumanEval) measure highly specialized capabilities, so their scores can give a misleading picture of a model's actual utility.

To build an intuition for what each kind of benchmark tells you about which models are useful for what:

Reasoning benchmarks (arithmetic, programming, etc.) indicate narrow capabilities

Long-context benchmarks indicate retrieval capabilities, NOT reasoning with context

API correctness benchmarks (Berkeley Function Calling, ToolBench) measure how accurately a model calls APIs

OSWorld-V and similar agent benchmarks measure something closer to a model's actual usefulness

The failure mode for benchmarks like GSM8K is very different from that for OSWorld-V, so keep that in mind whenever you see capability claims.
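If it helps, here is the list above written as a small Python lookup table. This is only a rough sketch: the family keys, the `BENCHMARK_FAMILIES` name and the `what_it_tells_you` helper are made up for illustration, and the groupings are just the ones from this post, not any official taxonomy.

```python
# Rough lookup table for the groupings in this post.
# The keys and the helper name are hypothetical, chosen for illustration only.
BENCHMARK_FAMILIES = {
    "reasoning": {
        "examples": ["GSM8K", "HumanEval"],
        "indicates": "narrow capabilities (arithmetic, programming)",
    },
    "long_context": {
        "examples": [],  # the post names no specific long-context benchmark
        "indicates": "retrieval capabilities, not reasoning with context",
    },
    "api_correctness": {
        "examples": ["Berkeley Function Calling", "ToolBench"],
        "indicates": "API accuracy",
    },
    "agent": {
        "examples": ["OSWorld-V"],
        "indicates": "something closer to actual usefulness",
    },
}


def what_it_tells_you(benchmark: str) -> str:
    """Return a one-line reminder of what a benchmark score indicates."""
    for family, info in BENCHMARK_FAMILIES.items():
        if benchmark in info["examples"]:
            return f"{benchmark} ({family}): {info['indicates']}"
    # Anything not listed needs to be checked by hand.
    return f"{benchmark}: not in this table, check what it actually measures"


if __name__ == "__main__":
    for name in ["GSM8K", "ToolBench", "OSWorld-V", "SomeNewBenchmark"]:
        print(what_it_tells_you(name))
```

The point of the fallback line is the same as the point of the post: if a benchmark isn't one you already understand, look up what it measures before trusting the headline number.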

reddit.com
u/clairedoesdata — 18 days ago
