🚀 Exciting Opportunity Alert!
SymphonyAI is looking for a Site Reliability Engineer with a strong background in designing, building, and maintaining scalable, reliable, and secure Azure and Kubernetes infrastructure to run applications and services in on-premise and cloud environments.
Apply now and be part of shaping the future of AI Products. #SRE #GenAi #LLM #BangaloreJobs 📊💡
Job Description
Highly Observable
- Develop and maintain comprehensive system monitoring dashboards and alerting mechanisms using tools like Prometheus, Grafana, Jaeger, Zipkin, Site24x7 and DataDog.
- Implement permanent solutions to eliminate false alerts and ensure seamless integration of new services with alerting systems.
- Utilize PromQL for metrics querying and Grafana for visualization to analyse system performance and make data-driven decisions.
Robust Systems
- Implement auto-scaling mechanisms for IRIS platform services within Kubernetes environments to ensure optimal resource utilization.
- Ensure the IRIS platform achieves a 99.9% uptime SLA through proactive monitoring and incident management.
- Manage Kubernetes node pools effectively and perform database/application tuning as required (such as requests/limits, affinities, HPA, VPA, node sizing, taints/tolerations, VM autoscaling, service mesh, storage classes).
Incident Management
- Actively participate in on-call rotations and respond promptly to Production incidents.
- Collaborate closely with development teams to troubleshoot and resolve issues in both on-premise and AKS environments.
- Document all critical actions, convert findings into repeatable processes, and automate wherever possible.
Consistent Environments
- Manage IRIS Kubernetes deployments using ArgoCD and Helm charts across multiple environments.
- Automate all environments using IaC tools such as Ansible, Terraform and ARM templates.
Security
- Enforce compliance with IRIS security policies and standards across all system operations.
- Conduct periodic security audits and remediate the findings.
Mandatory skills:
- 4+ years working with Linux, Docker and Kubernetes
- 4+ years working as an SRE specializing in troubleshooting and RCAs and in setting up highly observable systems. Should be well versed in system and application metrics, metric types (such as counters, gauges, histograms and summaries), aggregations (such as averaging, percentiles, rates), alerting (static and dynamic thresholds), distributed tracing, latency analysis, SLA and SLO definitions and tracking, and searching and dashboarding with log tools such as Kibana.
- Exposure to setting up applications in Cloud and On-premise environments.
- Knowledge of security best practices, vulnerability management and testing tools such as Acunetix, Snyk, Checkmarx or Trivy.
- Able to work in an Agile environment
About Us
SymphonyAI leads the way in innovation, harnessing cutting-edge AI and machine learning to transform industries worldwide. As a global powerhouse in AI solutions, we enable organizations to unlock the full potential of data-driven insights. With a presence in 20 countries, our enterprise applications deliver rapid, transformative value across retail, CPG, finance, manufacturing, media, IT, and public sectors. Combining unmatched AI technology, vertical expertise, and industry-specific insights, we create applications that drive maximum value for our customers. Join us in shaping the future as we build a "World Class Engineering Team" committed to high performance. With over 3000 talented leaders, data scientists, and professionals, we incubate and develop groundbreaking solutions within the SymphonyAI Group.