u/gringobrsa

MLOps - observability at scale (agentic space)

Hi folks, has anyone here worked in the agentic AI space?

How are you handling observability for AI agents, especially around infrastructure, tracing, monitoring, debugging, and reliability at scale?

I’m particularly interested in learning from people who have experience with large-scale agentic deployments in Tier 1 tech companies. Experience from smaller implementations is still useful, but I’m mainly looking for insights from production environments with high scale and complexity.

Any tips, lessons learned, or recommended tooling/frameworks would be appreciated.

reddit.com
u/gringobrsa — 3 days ago
▲ 2 r/mlops+1 crossposts

What Is an AI Agent, and Why Deploying One Is Nothing Like Deploying an API

Just published a simple breakdown of what an AI Agent actually is and why deploying one is nothing like deploying a normal API.

Covers:

  • AI agents vs LLM calls
  • Why agents hallucinate more
  • RAG on Vertex AI
  • Observability and tracing
  • Real production failure examples
  • Why agent deployment is an infrastructure problem, not just prompting

Read the article here

u/gringobrsa — 9 days ago

Why Do Enterprises Still Choose AWS Over GCP?

I’ve worked with both AWS and GCP in enterprise environments, and honestly, as an engineer, I prefer a lot of things in GCP.

Things like:

  • Org hierarchy
  • UI/console
  • VPC setup
  • Kubernetes experience
  • Data & AI products

all feel cleaner and more modern to me compared to AWS.

But despite that, almost every large enterprise or big firm I work with still defaults to AWS first.

I understand part of it is the head start AWS had, but I think there’s more to it than technology.

AWS feels extremely enterprise-focused:

  • stable APIs/services
  • strong local presence worldwide
  • huge partner ecosystem
  • local language support
  • easier direct customer engagement
  • mature enterprise processes

Meanwhile with GCP, sometimes it feels harder to navigate internally or get connected to the right teams/escalations compared to AWS.

I’ve also noticed many executives still hesitate with GCP even when engineers like the platform technically.

Curious what others here think: What do you believe GCP still needs to improve to seriously compete with AWS in large enterprise adoption?

Is it:

  • support?
  • partner ecosystem?
  • executive trust?
  • long-term product consistency?
  • enterprise sales culture?
  • regional presence?

Would love to hear perspectives from people who have worked across multiple clouds in real enterprise environments.

reddit.com
u/gringobrsa — 12 days ago

Terraform plan says "No changes" but your infra is drifting. Here's why and how to fix it.

Most teams run a nightly terraform plan and call it drift detection. It's not.

Wrote a deep dive on the four blind spots that silently kill this approach in production: out-of-band resources that are invisible to tfstate, ignore_changes swallowing real security changes without a trace, -refresh=false comparing against stale state instead of reality, and actual drift buried inside attribute noise that engineers learn to ignore.
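
To make the first blind spot concrete, here is a minimal Python sketch (not taken from the article) of why out-of-band resources never show up in a plan: Terraform only compares what is in state, so anything created outside it has to be found by diffing a separate cloud inventory against the IDs recorded in tfstate. The sample state, the security group IDs, and the `LIVE_IDS` inventory list are all hypothetical; in practice the live list would come from your cloud provider's inventory or asset API.

```python
import json


def managed_ids(state: dict) -> set:
    """Collect the `id` attribute of every managed resource instance in a state document."""
    ids = set()
    for resource in state.get("resources", []):
        if resource.get("mode") != "managed":
            continue  # data sources are read-only lookups, not managed infrastructure
        for instance in resource.get("instances", []):
            rid = (instance.get("attributes") or {}).get("id")
            if rid:
                ids.add(rid)
    return ids


def unmanaged_ids(live_ids, state: dict) -> list:
    """Return IDs that exist in the cloud inventory but are absent from state."""
    return sorted(set(live_ids) - managed_ids(state))


# Hypothetical, heavily trimmed state file: one managed security group, one data source.
SAMPLE_STATE = json.loads("""
{
  "version": 4,
  "resources": [
    {"mode": "managed", "type": "aws_security_group", "name": "web",
     "instances": [{"attributes": {"id": "sg-0aaa"}}]},
    {"mode": "data", "type": "aws_vpc", "name": "main",
     "instances": [{"attributes": {"id": "vpc-0123"}}]}
  ]
}
""")

# Hypothetical inventory: sg-0bbb was created by hand in the console.
LIVE_IDS = ["sg-0aaa", "sg-0bbb"]

if __name__ == "__main__":
    print(unmanaged_ids(LIVE_IDS, SAMPLE_STATE))
```

A plan against this state reports nothing about `sg-0bbb`, because Terraform has no record of it; only the inventory diff surfaces it. Matching on the `id` attribute is a simplification, since some resource types are better matched on ARN or self-link.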

Full article: https://medium.com/@rasvihostings/terraform-drift-detection-in-production-why-plan-isnt-enough-f660af7e1029

reddit.com
u/gringobrsa — 12 days ago

Most teams celebrate deploy day.
But nobody warns you about Day 2.

The real work starts after launch:
• alert fatigue
• config drift
• surprise cloud bills
• database bottlenecks
• compliance gaps
• making decisions with incomplete information

Shipping is hard. Operating is harder.

Wrote about the reality of running systems in high-intensity environments across AWS/GCP/SaaS teams.

Would love to hear your biggest Day 2 lesson or outage story.

Read here: Medium Blog Post

reddit.com
u/gringobrsa — 14 days ago
▲ 1 r/SaaS

I’m selecting 1 company for a sponsored Google Cloud (GCP) project (migration or optimization).

This is part of my cloud consulting practice focused on GCP infrastructure, architecture, and cost optimization, with a focus on SMBs and growing companies.

What I offer:
• End-to-end migration to Google Cloud (from AWS, Azure, or on-prem)
• Or optimization of an existing GCP environment
• Production-grade architecture and best practices
• Cost, performance, and reliability improvements

Ideal candidate:
• Small to mid-sized business (SMB) or scaling startup
• Already running workloads on Google Cloud, AWS, or Azure, or actively planning a migration to GCP
• Has real production systems (not side projects)
• Has existing cloud spend and clear business needs

For companies already on GCP:
We’ll first define a focused scope based on your priorities, such as:
• Cost optimization
• Architecture improvements
• Reliability and scaling
• Security and best practices

Requirements:
• Clear scope agreed upfront (projects without defined scope won’t be considered)
• Collaborative and responsive team
• Willingness to be featured in a detailed public case study

Clarification:
• The sponsorship covers my consulting time and implementation work
• Cloud infrastructure costs (GCP/AWS/Azure and related services) remain the responsibility of the client

Notes:
• Limited to one company only
• Scope will be strictly defined to ensure quality delivery
• Work will be done primarily during EDT hours

Learn more: https://www.cloudrelo.com/

If this fits your situation, send a brief overview of your current setup, goals, and challenges.

reddit.com
u/gringobrsa — 23 days ago

Just published the final part of my series on building a PCI-DSS compliant GKE framework for financial workloads.

This one focuses on data protection, governance, and audit logging: how you actually protect card data and prove it to auditors.

If you're into cloud security / fintech / platform engineering, I'd love your thoughts, especially on how you’ve built similar frameworks for banks or regulated environments.

Read here: https://medium.com/@rasvihostings/building-a-pci-dss-compliant-gke-framework-for-financial-institutions-data-protection-governance-0deaa1b72893

reddit.com
u/gringobrsa — 26 days ago