At Nutanix, we're redefining intelligent observability with Panacea.ai - an AI/ML-powered platform that automatically detects, explains, and correlates anomalies across logs and metrics. In version 1.0, we used regex-based filters along with historical data to identify anomalies. In version 2.0, we advanced to AI/ML capabilities that deliver deeper, context-rich anomaly detection. We are now building an enterprise-grade automated RCA (Root Cause Analysis) engine, powered by an agentic platform that integrates MCP servers as tools and a conversational interface that lets users query, explore, and discuss issues as naturally as in a chat. As a technical manager, you will own the architecture and AI/ML systems that power both log and metrics analysis, enabling automated diagnostics and reducing triage time for QA failures, regression runs, and customer issues. You'll also help define and drive the central AI charter at Nutanix, building reusable components, model infrastructure, and scalable ML services. This is a hands-on, high-impact role in which you will work directly on these AI/ML systems and shape Nutanix's central AI charter.

About the Team

The Panacea team is a passionate group of engineers across our India and US offices. We move fast, collaborate closely, and care deeply about quality and ownership. Our mission is to deliver AI/ML-powered developer productivity tools that solve real engineering and support pain points at scale.

Why Join Us

Work alongside a high-impact team delivering AI-first observability tools that directly improve engineering velocity and product quality. Tackle challenging technical and product problems at scale and speed. Shape the foundational AI platform and practices across Nutanix. Enjoy the flexibility of hybrid work, with a culture that values deep work, collaboration, and ownership. Be part of a startup-style team backed by the scale, reach, and stability of a global cloud leader.

Your Role

Auto RCA Engine: Deliver an AI-driven engine that correlates logs and metrics across distributed services and automatically surfaces explanations for incidents. This includes an agentic platform that integrates MCP servers as tools, alongside a chat-like conversational interface that enables engineers to query issues, run diagnostics, and collaborate on RCA in natural language. Here, LLMs will power interactive diagnostics and human-like discussions around problem-solving.
AI-Powered Observability Platform: Own the vision, architecture, and delivery of Panacea's ML-based log and metrics analyzer that reduces triage time and improves engineering efficiency. This includes leveraging LLMs for anomaly explanation, RCA summaries, and contextual recommendations to engineers and support teams.
Metrics Anomaly Detection: Drive development of models that detect anomalies in CPU, memory, disk I/O, network traffic, and service health, enabling proactive identification of performance regressions. LLMs will assist in summarizing anomalies and providing contextual recommendations for remediation.
Feedback Loop & Continuous Learning: Help build infrastructure that captures user interactions and feedback, using LLMs and ML pipelines to retrain and improve anomaly detection and RCA accuracy over time.
Central AI Charter: Collaborate with product and support teams to define foundational AI infrastructure, shared ML components, governance practices, and standards that scale across Nutanix's product ecosystem. Collaborate with cross-functional stakeholders (SRE, QA, Dev) to deeply understand pain points and translate them into intelligent tooling.
People Management: Lead and manage a high-performing team of engineers. Work collaboratively with engineers and provide technical and thought leadership. Handle team 1-on-1s, hiring, upskilling and coaching, performance management, and employee engagement. Drive team productivity through OKRs and new feature readiness. Ensure the engineers on the team are motivated and engaged.

What You Will Bring

Educational Background: B.Tech/M.Tech in Computer Science, Machine Learning, AI, or a related field.
8+ years in software engineering, with a track record of designing, developing, and deploying AI/ML systems at scale.
Expertise: Strong in time-series anomaly detection, statistical modeling, and supervised/unsupervised learning. Experience building ML models for metrics data (CPU, memory, IOPS, network, etc.) using models like Isolation Forest, Prophet, LSTM, or deep autoencoders. Experience with LLMs for downstream tasks like summarization, root cause reasoning, or intelligent Q&A. Experience designing and deploying agentic workflows is preferred.
Engineering Skills: Strong Python programming skills with proficiency in ML libraries (PyTorch, TensorFlow, Scikit-learn), time-series frameworks, and MLOps tools. Experience building and operating robust data pipelines and serving models at scale.
Observability Knowledge: Familiarity with logs, metrics, and traces, along with monitoring tools such as Prometheus, Grafana, and ELK.
Experience leading and managing high-performing teams.

Work Arrangement

Hybrid: This role operates in a hybrid capacity, blending the benefits of remote work with the advantages of in-person collaboration. In locations where our workplace policy applies (i.e. San Jose, Durham, Mexico City, Bangalore, Pune, Hoofddorp, Belgrade, Barcelona, Singapore, Sydney and Tokyo), employees are expected to work onsite a minimum of 3 days per week to foster collaboration, team alignment, and access to in-office resources. Workplace type may vary based on location and team requirements. Please speak with your recruiter for details. Additional team-specific guidance and norms will be provided by your manager.
