SymphonyAI is looking for a Site Reliability Engineer with a strong background in designing, building, and maintaining scalable, reliable, and secure Azure and Kubernetes infrastructure to run applications and services in on-premise and cloud environments.

Apply now and help shape the future of AI products. #SRE #GenAi #LLM #BangaloreJobs 📊💡

Job Description

Highly Observable

Develop and maintain comprehensive system monitoring dashboards and alerting mechanisms using tools like Prometheus, Grafana, Jaeger, Zipkin, Site24x7 and DataDog.
Implement permanent solutions to eliminate false alerts and ensure seamless integration of new services with alerting systems.
Utilize PromQL for metrics querying and Grafana for visualization to analyse system performance and make data-driven decisions.

Robust Systems

Implement auto-scaling mechanisms for IRIS platform services within Kubernetes environments to ensure optimal resource utilization.
Ensure the IRIS platform achieves a 99.9% uptime SLA through proactive monitoring and incident management.
Manage Kubernetes node pools effectively and perform database/application tuning as required (such as requests/limits, affinities, HPA, VPA, node sizing, taints/tolerations, VM autoscaling, service mesh, storage classes).
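
As a rough illustration of what the 99.9% uptime target above implies in practice, the sketch below converts an availability percentage into an allowed-downtime budget. The function name and the 30-day default window are assumptions made for illustration; the posting does not say how the IRIS SLA window is defined.

```python
def allowed_downtime_minutes(availability_percent: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    Hypothetical helper: the measurement window is an assumption, not taken
    from the posting.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = window_days * 24 * 60
    # The budget is the share of the window that is allowed to be unavailable.
    budget = total_minutes * (100 - availability_percent) / 100
    return round(budget, 2)


if __name__ == "__main__":
    print(allowed_downtime_minutes(99.9))       # 30-day window
    print(allowed_downtime_minutes(99.9, 365))  # one-year window
```

Under these assumptions, 99.9% leaves roughly 43 minutes of downtime in a 30-day window and roughly 8.8 hours over a year.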

Incident Management

Actively participate in on-call rotations and respond promptly to production incidents.
Collaborate closely with development teams to troubleshoot and resolve issues in both on-premise and AKS environments.
Document all critical actions, convert findings into repeatable processes, and automate wherever possible.

Consistent Environments

Manage IRIS Kubernetes deployments using ArgoCD and Helm charts across multiple environments.
Automate all environments using IaC tools such as Ansible, Terraform and ARM templates.

Security

Enforce compliance with IRIS security policies and standards across all system operations.
Conduct security audits periodically and perform necessary actions.

Mandatory skills:

4+ years working with Linux, Docker and Kubernetes.
4+ years working as an SRE specialized in troubleshooting, RCAs and setting up highly observable systems. Should be well-versed in system and application metrics, metric types (such as counters, gauges, histograms and summaries), aggregations (such as averaging, percentiles, rates), alerting (static & dynamic thresholds), distributed tracing, latency analysis, SLA & SLO definitions and tracking, and searching & dashboarding with log tools such as Kibana.
Exposure to setting up applications in cloud and on-premise environments.
Knowledge of security best practices, vulnerability management and testing tools such as Acunetix, Snyk, Checkmarx or Trivy.
Able to work in an Agile environment.
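
For candidates gauging the level of metrics knowledge expected, here is a minimal sketch of two of the aggregations named above: a per-second rate derived from a counter, and a nearest-rank percentile over latency samples. The function names and sample data are hypothetical, and real monitoring systems handle details this sketch ignores, such as counter resets and bucketed histograms.

```python
import math
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, counter_value)


def per_second_rate(first: Sample, last: Sample) -> float:
    """Average per-second increase of a counter between two samples.

    Assumes the counter did not reset between the two samples.
    """
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of raw observations (for example, latencies)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Nearest-rank method: the smallest value with at least pct% of data at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # A request counter that grew from 100 to 400 over 60 seconds.
    print(per_second_rate((0.0, 100.0), (60.0, 400.0)))
    # Request latencies in milliseconds.
    latencies_ms = [120.0, 80.0, 95.0, 300.0, 110.0]
    print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
```

The contrast between the median and the 95th percentile in the sample data is the reason latency targets are usually stated as percentiles rather than averages.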

About Us

SymphonyAI leads the way in innovation, harnessing cutting-edge AI and machine learning to transform industries worldwide. As a global powerhouse in AI solutions, we enable organizations to unlock the full potential of data-driven insights. With a presence in 20 countries, our enterprise applications deliver rapid, transformative value across retail, CPG, finance, manufacturing, media, IT, and public sectors. Combining unmatched AI technology, vertical expertise, and industry-specific insights, we create applications that drive maximum value for our customers. Join us in shaping the future as we build a "World Class Engineering Team" committed to high performance. With over 3000 talented leaders, data scientists, and professionals, we incubate and develop groundbreaking solutions within the SymphonyAI Group.
