u/aistranin

Thinking Machines’ interaction models are more interesting than the benchmarks

Thinking Machines’ interaction models are more interesting than the benchmarks

The most important part here is not the benchmark numbers. It is the shift in product logic.

If this approach scales, a huge class of AI products may no longer need an external orchestrator.

Live translation, pronunciation tutors, an assistant that comments on code while you type, workout rep counting, navigation for blind users - a lot of this is currently built with awkward pipelines and noticeable latency.

Here, interactivity becomes a property of the model itself.

The limitations are real too. Long sessions fill up context fast. You need a stable connection. The current checkpoint is not their largest model. Their bigger models are still too slow for realtime use.

But the direction looks strong.

This is not just "ChatGPT with voice." It is an attempt to build AI that does not only answer after you finish. It is AI that can be present in the moment.

Link: https://thinkingmachines.ai/blog/interaction-models/

u/aistranin — 5 days ago

X published the updated For You algorithm on GitHub

X released an updated version of the For You algorithm on GitHub.

You can now look at how X builds and ranks the recommendation feed.

The repo xai-org/x-algorithm contains code for the system behind the For You feed, from candidate selection to final post ranking. There are two main content sources:

  • posts from accounts you follow
  • posts from the global corpus, found through ML retrieval

After that, everything goes through Phoenix, a transformer model based on Grok's architecture. It predicts the chance that a user will take actions like liking, replying, reposting, clicking, and other engagement signals.

The system then combines those signals into a final score and decides what gets shown in the feed.

Worth reading if you want to see which signals actually affect recommendations, how the ranking pipeline works, and where the platform filters content before showing it.

GitHub: https://github.com/xai-org/x-algorithm

u/aistranin — 5 days ago
▲ 2 r/PracticalTesting+1 crossposts

How to start with automated testing for Python projects

A few years ago, I was leading a Python team with a large legacy codebase and uneven testing habits. We wanted more confidence, but the usual advice of "just write more tests" was not enough.

What helped most was weekly shared learning:

  • 30-60 minutes discussing the next pytest/testing topic
  • about 60 minutes of paired practice on real code
  • rotating pairs so testing knowledge spread across the team
  • focusing on small improvements instead of huge rewrites

The practice part was the key. It helped turn pytest from an individual skill into a team habit.

I wrote up a longer article about it if someone is curious https://www.istranin.dev/blog/onboard-python-team-pytest-testing-ci-cd/

Looking forward to your feedback and your thoughts on testing today, especially with AI-generated tests. Personally, I do not think AI is good enough yet to handle tests completely on its own instead of developers, but I have heard very different experiences, to be honest.

u/aistranin — 5 days ago

Paper: production-derived benchmarks for coding agents are getting more serious

Paper worth reading: ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents

Short summary: the authors built a benchmark from real developer-agent sessions with a production AI coding assistant. Each sample includes the original prompt, the committed code change, and tests that should go from failing to passing. The benchmark spans seven programming languages. In their evaluation, model solve rates ranged from 53.2% to 72.2%.

Why this matters: a lot of coding benchmarks are useful, but they often miss how messy real work is. Production prompts are not always clean. Monorepos have weird test setups. Codebases have local conventions. The paper argues that benchmark design should reflect those conditions.

A few concepts in plain English:

"Fail-to-pass tests" means tests that fail before the agent’s change and pass after the correct fix. This gives a concrete signal that the change solved the intended problem.

"Multi-run stability checks" means running the same evaluation more than once to see if the result is reliable. Agents can be nondeterministic, so one lucky run is not enough.

"Harness design" means the environment around the model: tools, shell access, test commands, file editing, context loading, and rules. For coding agents, the harness can matter almost as much as the model.

My practical takeaway: if your team is evaluating coding agents, do not stop at public leaderboard scores. Build a small internal benchmark from real tickets, real tests, and real repo constraints.

reddit.com
u/aistranin — 7 days ago

f you want a structured way to learn agent development without starting from random blog posts, Hugging Face has a free AI Agents course:

https://huggingface.co/learn/agents-course/en/unit0/introduction

It covers the basics first, then moves into actual frameworks and projects.

The syllabus includes:

  • What agents are
  • How tools, actions, and observations work
  • Agent frameworks like smolagents, LlamaIndex, and LangGraph
  • Agentic RAG
  • A final project where you build, test, and certify an agent
  • Bonus material on observability, evaluation, and function-calling

I like this kind of resource because it does not treat agents as just "LLM plus loop."

For junior devs, the useful concept is the agent control loop:

  1. The model receives a goal and context
  2. It chooses an action
  3. A tool runs that action
  4. The result comes back as an observation
  5. The agent decides what to do next

That loop is the core of most agent systems. The framework changes, but the pattern keeps showing up.

If you are already comfortable with Python and basic LLM APIs, this seems like a good weekend learning path. Build the smallest possible agent first. Then add one tool. Then add logging. Then add a human approval step.

That progression teaches more than trying to build a giant "does everything" agent on day one.

u/aistranin — 20 days ago

Came across this article and thought it was worth sharing here: How to Build Production-Grade Generative AI Applications

It’s a good practical overview of what teams usually learn the hard way after the prototype phase. A few points it gets right:

  • not every problem should use an LLM
  • model selection should be based on task fit, latency, cost, context window, and safety, not just hype
  • prompt engineering matters, but structured inputs/outputs matter just as much
  • guardrails, QA, eval pipelines, and tracing are not “later” concerns
  • production failures usually come from accuracy drift, hallucinations, cost, and lack of observability

What I liked most is that it frames GenAI systems as engineered products, not prompt demos. That maps well to agentic dev too: once agents can use tools and run longer workflows, monitoring, constraints, and evaluation become first-class design problems.

u/aistranin — 27 days ago
▲ 2 r/PracticalAgenticDev+1 crossposts

A lot of teams now say they are “testing AI workflows,” but when you dig in, the actual approach is all over the place.

I’ve seen combinations like:

  • mocked unit tests around prompt builders / orchestration logic
  • deterministic tests with frozen model outputs
  • cheap-model integration tests in CI
  • full end-to-end runs nightly
  • eval pipelines before release
  • production monitoring plus human review

The hard part is balancing:

  • cost
  • runtime
  • brittleness
  • confidence
  • reproducibility

What I’m trying to understand is what people here do in practice.

Questions:

  • What do you test with classic software tests vs evals?
  • Where do you mock, and where do you insist on real model calls?
  • What runs on every PR vs nightly?
  • How do you catch regressions that are not binary failures but “quality drift”?
  • What looked promising at first but turned out to be low-value?

Would love concrete examples of test architecture, CI strategy, and lessons learned.

reddit.com
u/aistranin — 20 days ago

OpenAI published this on April 15: The next evolution of the Agents SDK.

The interesting part is not just “better agents.” It’s that the SDK is moving toward real execution infrastructure for systems that can inspect files, run commands, edit code, and work on longer-horizon tasks inside controlled environments.

That feels important for practical agentic development because the hard part is no longer just model quality. It’s whether the system can execute safely, repeatedly, and observably.

My take:

  • the center of gravity is moving from prompt tricks to runtime design
  • agent frameworks are becoming more like operating environments
  • the real moat is starting to look like execution, safety, evals, and observability rather than raw chat quality

Curious how people here see it:

  • Are you using vendor SDKs directly, or building your own orchestration layer?
  • What’s still missing most: evals, rollback, state handling, approvals, tracing?

Source: OpenAI Agents SDK update

reddit.com
u/aistranin — 1 month ago

35B parameters, ~3B active thanks to MoE.

Key points:

  • In agentic coding, it reaches the level of models with ~10× larger active parameter count
  • Outperforms Qwen3.5-27B (dense) and the previous Qwen3.5-35B-A3B
  • Natively multimodal architecture (text + vision)
  • In VLM benchmarks, comparable to Claude Sonnet 4.5, and in some tasks performs better
  • Strong metrics in spatial reasoning tasks

Benchmarks:

  • MMMU - 81.7 vs 79.6
  • MMMU-Pro - 75.3 vs 68.4
  • MathVista - 86.4 vs 79.8
  • RealWorldQA - 85.3 vs 70.3

Practical implications:

  • MoE provides a multiple reduction in compute without sacrificing quality
  • Well-suited for agent-based scenarios where sequential actions and planning matter
  • Can be used as a unified stack for both code and vision tasks

Apache 2.0 (no restrictions for production use)

https://huggingface.co/Qwen/Qwen3.6-35B-A3B

u/aistranin — 1 month ago

On one hand, planning is an incredibly powerful capability in AI systems. It opens the door to more autonomous, agent-like behavior and lets models tackle more complex, multi-step problems.

On the other hand, it’s also the part I trust the least right now.

In my experience, I’ve been able to get patterns like reflection and tool use to work quite reliably. They’re much easier to reason about, debug, and iterate on—and they consistently improve application performance.

Planning, though, feels different. It’s harder to predict what the model will actually do, especially ahead of time. Even with careful prompting and constraints, the outcomes can be inconsistent or surprising in ways that are tough to control.

That said, things are moving fast. The progress over the past year alone has been huge, so I’m pretty confident this gap will close sooner rather than later.

How do you evaluate planning? How to monitor?

reddit.com
u/aistranin — 1 month ago

Hey - glad you’re here 👋

This is a dev-first community of people actually building agentic systems.

We care about practical agentic development:

  • real architectures
  • real failures
  • real tradeoffs
  • real systems that (sometimes) work

Relevant Community Topics:

  • autonomous agents
  • multi-agent setups
  • tool use / orchestration
  • evals, debugging, reliability
  • production lessons
reddit.com
u/aistranin — 1 month ago

Robotic process automation (RPA) for repetitive e2e tests

Robotic Process Automation (RPA) in testing refers to the use of “software robots” to mimic and repeat the actions that human testers perform when interacting with an application.

Is RPA the same as an automated testing script? No - RPA is not the same as automated testing scripts. It uses the UI to mimic human actions and execute workflows, while automated testing scripts programmatically verify that software behaves correctly.

  • RPA = “Do what a user does”
  • Test automation = “Check if the system behaves correctly”

According to https://testfort.com/blog/test-automation-trends, RPA adoption in testing is expected to grow significantly as organizations use it to reduce manual labor costs and scale testing efforts alongside AI-driven automation. Something to look after in the industry 👀

u/aistranin — 1 month ago

LLMs for test case generation are promising - but reliability is still a major issue

Source: https://link.springer.com/article/10.1007/s10586-026-06021-z

A recent review explores how large language models (LLMs) are being used to generate test cases.

https://preview.redd.it/guardbfaeltg1.png?width=1280&format=png&auto=webp&s=fc2f3acdb6a97bfe7d87e7fa30e7ad1cf9cbf154

Key takeaways:

  • Software testing is critical but still time-consuming and labor-intensive
  • Traditional automated methods (search-based, constraint-based) often:
    • lack coverage
    • produce less relevant test cases
  • LLMs introduce a new approach:
    • understand natural language requirements
    • generate context-aware test cases and code
    • directly translate requirements to test cases
    • LLM-based approaches show promising performance vs traditional methods

Open issues:

  • Lack of standard benchmarks and evaluation metrics
  • Concerns about correctness and reliability of generated tests

In practice, reliability seems like the biggest blocker - LLMs generate tests that look correct but often miss edge cases or assert the wrong behavior. Or they focus on retesting some obvious scenarios multiple times ignoring actual unit responsibility in the surrounding system.

What is your experience generating tests with AI?

reddit.com
u/aistranin — 1 month ago
▲ 2 r/PracticalAgenticDev+1 crossposts

Are you into testing AI agents?

From https://devops.com/is-your-ai-agent-secure-the-devops-case-for-adversarial-qa-testing/

>The future belongs to organizations that recognize “sunny day” testing is no longer enough. The teams that build the “storm simulators” now will operate with a level of confidence and security that their competitors cannot match.

They suggest simulating network failures, ambiguous requirements and prompt injection to see if an agent maintains safe behavior. The message is that AI agents are part of our software stack now, and they need to be tested with creativity.

What do you think?

reddit.com
u/aistranin — 1 month ago
▲ 5 r/PracticalTesting+1 crossposts

My experience coding with AI has never been like 10× faster (more like 0.8× hehe). Sure, AI copilots can generate OK looking code, but for me it has mostly been a waste of time. The tech debt is leveraged, learning is slower, and you often end up spending more time fixing things than if you had just written the code by hand much more simply (without AI).

I tend to see more benefits from AI code generation when it’s used with Test-Driven Development (TDD), at least when starting with end-to-end or integration tests first. I also shared my thoughts on this on YouTube: https://youtu.be/Mj-72y4Omik

Some developers argue that TDD is too slow and that you should focus on end-to-end tests (writing them manually) and let AI generate unit tests. That kind of works. But when it comes to learning Python (especially for beginners), I see a lot of frustration from overusing AI. TDD seems like a nice approach to avoid just relying on AI.

What do you think?

u/aistranin — 1 month ago

New Dev Intros 🎉

Congrats on becoming a member of r/PracticalTesting community 🎉

Every great software community starts with people like you - developers who care about building, testing, and shipping great software products.

This space is all about practical testing: real-world approaches, useful tools, lessons learned, and honest discussions about what actually works (and what doesn’t).

Whether you’re here to learn, share your experience, or ask questions — you’re in the right place.

To get started:

  • Introduce yourself 👋
  • Share what you’re currently working on
  • (Optionally) Tell us more about your background/experience in testing

Let’s build a community where testing is not just theory, but something that truly helps us ship better code 🚀

reddit.com
u/aistranin — 2 months ago