Day 2: The CTO Advisor's Lab – Prototyping AIOps for the Modern Enterprise
A CTO's Guide to Vibe-Coding, AIOps, and the Future of Enterprise Workloads.
My "Day 0" post detailed the high-level architecture of the Virtual CTO Advisor, built on Google Cloud's Vertex AI. While that post focused on the strategic vision, the reality of Day 2 is a stark, hands-on lesson in the operational challenges of taking an AI application from a "vibe coded" prototype to a production-ready system. This post isn't about the job of an IT director; it's about the strategic responsibility of the CTO to understand the future of operations.
Note: So far, my GCP bill is ~$190, less $90 in promotional credits.
As a former IT executive who now advises leaders on strategy, my core skill lies in balancing vision with effective team management. I am not the hands-on SRE. But I am living a classic dilemma: how do you effectively advise a team of experts if you don't truly understand the challenges they face? This project is my laboratory—a real-world case study to gain the technical context required to build the right solutions for the teams I advise.
The Production Reality: A DevOps Model of One
My project is a DevOps model of one, with all components, from content ingestion to vector indexing, running on Cloud Run. This works for the initial build, but as I’ve said before, this model "gets to a point where it falls over." A pure DevOps approach lacks the specialization needed for production-level scale and stability. This is precisely why SRE teams exist—they handle the operational overflow and toil that a full-stack developer can’t or shouldn't be responsible for.
To be clear, my immediate challenge is the absence of a dedicated SRE function. The system has no proactive monitoring, no structured incident response, and no automated remediation. My Day 2 thinking is a direct response to this gap: I can’t afford to hire a team of SREs, so my AI must become my SRE agent. This isn't about me becoming a sysadmin; it's about me becoming a strategic platform architect.
The "RAG-ops" Framework: A Custom-Built AIOps Solution
I’ve been asked if I'm reinventing the wheel when there are established AIOps and managed observability solutions on the market. That's a fair question. My goal is not to replace platforms like Datadog or Splunk, which have extensive, off-the-shelf capabilities. Instead, this project serves as a real-world case study in how a CTO can prototype a highly customized AIOps solution, integrated deeply with a unique application and data. I'm leveraging the same Retrieval-Augmented Generation (RAG) model that answers user queries to also troubleshoot system issues. This is a platform engineering problem at its core, connecting observability data to an AI model for actionable insights.
Here is the proposed workflow, leveraging the existing architecture:
Observability & Telemetry: My application generates rich telemetry via Cloud Logging and Cloud Monitoring. This includes logs from the virtual-cto-api Cloud Run service, the process-and-embed and populate-vector-index jobs, and performance metrics from the Vertex AI Endpoints.
AI as the Alerting Engine: Instead of just sending an alert to a human, a custom metric or log-based alert in Cloud Monitoring triggers a Cloud Function. This function sends a structured prompt to the fine-tuned Gemini model.
Contextualized Troubleshooting: The prompt includes a summary of the incident and links to the relevant logs. The AI performs a "RAG lookup" on two distinct data sets:
Operational Data: Real-time logs and metrics from the system.
Knowledge Base: My corpus of blog posts, which includes my past troubleshooting methodologies and architectural principles, indexed in Vertex AI Vector Search.
Remediation and Analysis: The AI synthesizes the information from both sources. It can then generate:
A probable root cause analysis.
A step-by-step remediation plan (e.g., "Increase the memory allocation for virtual-cto-api," or "re-run the populate-vector-index Cloud Run job").
A summary for a human operator (me!) to review and approve.
The Unavoidable Future of Workloads
My personal struggle with operationalizing a single AI application is a microcosm of a much larger industry trend. The "vibe-coding" I’m doing—using generative AI to build functional applications quickly and with minimal hands-on knowledge—is a harbinger of things to come. I believe we are on the cusp of an explosion in "vibe-coded" applications.
While this promises a rapid increase in business value, it will unleash a torrent of new, often undocumented and operationally immature workloads into production. The burden of maintaining these systems will fall squarely on platform leaders and their engineering teams. The systems you’re managing today are the low-hanging fruit.
The "RAG-ops" framework I'm building isn't just for me; it’s a prototype for the kind of automated, AI-driven operational tools that platform teams will need to handle this new wave of demand. My pain today is your strategic problem tomorrow. This is the future of IT—a world where the platform team is the last line of defense against an army of rapidly deployed, AI-generated applications.
Kick the tires on Virtual CTO Advisor. Ask it some of your more challenging questions. There are two modes:
Advisory: This inspects my corpus of published knowledge and provides strategic advice in my voice, based on my data.
Research: You want to go deep in an area and get my take? Research uses Google Search to go deep and then provides a “Keith’s take.”