Lead Site Reliability Engineer

Ayasdi

This job is no longer accepting applications

Software Engineering

India · Bengaluru, Karnataka, India

Posted 6+ months ago

Introduction

SymphonyAI is at the forefront of innovation, leveraging cutting-edge artificial intelligence and machine learning technologies to transform industries and drive business growth. As a global leader in AI-powered solutions, we empower organizations to harness the full potential of data-driven insights. SymphonyAI enterprise applications rapidly deliver transformative business value across retail, CPG, financial services, manufacturing, media, enterprise IT, and the public sector. SymphonyAI combines unrivalled AI technology, vertical expertise, and industry-specific data and insights into applications that drive the highest value for customers. We are one of the largest and fastest-growing AI portfolios.

Job Description

As an SRI (Site Reliability & Intelligence) Engineer, you will operate at the intersection of reliability, data intelligence, and automation. You will monitor and optimize the health of Azure and AWS environments, use AI-driven insights to prevent incidents, and continuously improve system performance and scalability.

You’ll be part of a global 24×7 reliability team, collaborating closely with Cloud Infrastructure Engineering (CIE), DevOps, Customer Success, and Product teams to ensure SymphonyAI’s SaaS platforms consistently meet or exceed customer SLAs and availability targets.

This role is ideal for engineers who combine deep technical skill with curiosity, problem-solving, and a passion for customer excellence.

Key Responsibilities

Monitoring, Observability & Proactive Detection

Maintain real-time visibility of all customer environments using Datadog, CloudWatch, Azure Monitor, and Prometheus.
Develop advanced monitoring dashboards, synthetic checks, and trend analyses to detect early warning signals.
Use machine learning and anomaly detection (e.g., Datadog AIOps, Azure AI) to predict and prevent outages.
Continuously tune monitoring thresholds to reduce noise while maximizing incident insight.
Establish performance baselines and proactively address deviations before they cause an SLA breach.

Automation, AI & Operational Excellence

Build intelligent automation using Power Automate, Datadog Workflows, Azure Logic Apps, and Defender for Cloud to reduce manual interventions.
Design auto-remediation workflows that fix common or predictable issues in real time.
Contribute to the creation and enhancement of AI-driven playbooks that guide faster triage, root-cause identification, and resolution.
Partner with the internal GenAI and Automation teams to develop custom models improving incident response and capacity management.
Champion a “Zero Toil” mindset—automate repetitive tasks, streamline response, and free engineers for innovation.

Reliability Engineering & Continuous Improvement

Ensure all SaaS environments achieve or exceed defined SLOs (Service Level Objectives) and SLAs.
Use trend and correlation analysis to identify recurring issues, performance degradations, and potential bottlenecks.
Participate in game days and chaos testing to validate system resilience and recovery readiness.
Drive post-incident reviews (PIRs) that focus on learning, not blame—ensuring preventive measures are implemented.
Collaborate on release readiness reviews and verify reliability criteria before deployments and updates.

Cross-Functional Collaboration & Customer Engagement

Partner with Cloud Infrastructure Engineering (CIE) and DevOps to improve release pipelines, scaling, and observability.
Collaborate with Service Delivery Managers (SDMs) to interpret trends and communicate proactively with customers.
Participate in major incident bridges, communicating clearly, confidently, and constructively with both customers and executives.
Support customer onboarding, environment validation, and performance benchmarking.
Act as a reliability advocate internally—educating teams on best practices for operability and monitoring.

24×7 Global Operations

Work as part of a follow-the-sun model, ensuring continuous global coverage.
Take part in on-call rotations, providing leadership during major incidents and escalations.
Maintain accurate handovers and documentation between shifts, ensuring transparency and continuity.

Required Skills & Experience

Proven experience as a Site Reliability Engineer, Cloud Operations Engineer, or SaaS Support Engineer.
Strong proficiency in Azure and AWS monitoring and automation ecosystems.
Expert knowledge of observability platforms such as Datadog, CloudWatch, Azure Monitor, and Grafana.
Hands-on experience with automation frameworks (Power Automate, Terraform, Ansible, Azure Logic Apps, or similar).
Familiarity with AIOps and intelligent alerting platforms.
Strong grasp of ITIL practices, particularly Incident, Problem, Change, and Event Management.
Skilled in Kubernetes, AKS, and containerized environments.
Understanding of databases (PostgreSQL, Oracle, etc.) and performance tuning.
Excellent customer communication and presentation skills; able to explain technical issues to non-technical stakeholders confidently.
Experience working in 24×7 operational models.

Preferred Experience

Exposure to SymphonyAI products such as Sensa, NetReveal, InvestigationHub, or DataHub.
Experience building or integrating AI-based monitoring or predictive analytics solutions.
Experience in financial services or regulated industries (AML, KYC, Fraud, WLM).
Scripting in Python, Bash, or PowerShell.
Familiarity with DevOps, CI/CD, and Infrastructure-as-Code (IaC) practices.
Certifications: Azure Administrator/DevOps, AWS Solutions Architect, or Datadog Certified Professional.

What Success Looks Like

Customer environments consistently meet or exceed uptime, SLA, and availability goals.
Automation replaces manual effort, reducing incident MTTR (mean time to resolve) and speeding recovery.
Trends and AI insights drive proactive prevention rather than reactive firefighting.
Customers experience confidence and trust through transparent, data-driven communication.
The SRI team becomes recognized as a world-class reliability and intelligence function within SymphonyAI.

About Us

SymphonyAI Financial Services delivers world-leading AI-powered compliance and fraud prevention solutions to global financial institutions. As we expand into next-generation SaaS platforms—Sensa, InvestigationHub, and DataHub—our SRI (Site Reliability & Intelligence) team is the backbone, ensuring resilience, availability, and excellence in every customer experience.

Our mission is to build intelligent, self-healing systems that anticipate and prevent issues before they impact customers. The SRI team is not just about uptime—it’s about insight, innovation, and intelligent automation.
