People Matter

Site Reliability Engineer

Ayasdi

Software Engineering
Bengaluru, Karnataka, India
Posted 6+ months ago
Introduction

🚀 Exciting Opportunity Alert!

SymphonyAI is looking for a Site Reliability Engineer with a strong background in designing, building, and maintaining scalable, reliable, and secure Azure and Kubernetes infrastructure to run applications and services in on-premise and cloud environments.


Apply now and be part of shaping the future of AI products. #SRE #GenAi #LLM #BangaloreJobs 📊💡


Job Description

Highly Observable

  • Develop and maintain comprehensive system monitoring dashboards and alerting mechanisms
    using tools like Prometheus, Grafana, Jaeger, Zipkin, Site24x7 and DataDog.
  • Implement permanent solutions to eliminate false alerts and ensure seamless integration of new
    services with alerting systems.
  • Utilize PromQL for metrics querying and Grafana for visualization to analyse system performance
    and make data-driven decisions.

Robust Systems

  • Implement auto-scaling mechanisms for IRIS platform services within Kubernetes environments to
    ensure optimal resource utilization.
  • Ensure the IRIS platform achieves a 99.9% uptime SLA through proactive monitoring and incident
    management.
  • Manage Kubernetes node pools effectively and perform database/application tuning as required
    (such as requests/limits, affinities, HPA, VPA, node sizing, taints/tolerations, VM autoscaling, service
    mesh, storage classes).
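
For a sense of scale, the 99.9% uptime target above leaves only a small downtime budget. The sketch below works out that arithmetic in Python. The function name and the 30-day and 365-day windows are illustrative assumptions, not part of any IRIS SLA definition.

```python
def allowed_downtime_minutes(sla_percent: float, window_days: float) -> float:
    """Return the downtime budget, in minutes, for an uptime SLA over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    # The measurement windows are assumed here purely for illustration.
    for days in (30, 365):
        budget = allowed_downtime_minutes(99.9, days)
        print(f"{days} days at 99.9% uptime: {budget:.1f} minutes of downtime allowed")
```

Under these assumptions the budget is roughly 43 minutes per 30-day window, which is why proactive monitoring and fast incident response matter for a target at this level.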

Incident Management

  • Actively participate in on-call rotations and respond promptly to production incidents.
  • Collaborate closely with development teams to troubleshoot and resolve issues in both on-premise and AKS environments.
  • Document all critical actions, convert findings into repeatable processes, and automate wherever
    possible.

Consistent Environments

  • Manage IRIS Kubernetes deployments using ArgoCD and Helm charts across multiple
    environments.
  • Automate all environments using IaC tools such as Ansible, Terraform and ARM templates.

Security

  • Enforce compliance with IRIS security policies and standards across all system operations.
  • Conduct security audits periodically and perform necessary actions.

Mandatory skills:

  • 4+ years working with Linux, Docker and Kubernetes
  • 4+ years working as an SRE specializing in troubleshooting and RCAs and in setting up highly observable systems. Should be well-versed in system and application metrics, metric types (such as counters, gauges, histograms and summaries), aggregations (such as averaging, percentiles, rates), alerting (static and dynamic thresholds), distributed tracing, latency analysis, SLA and SLO definitions and tracking, and searching and dashboarding with log tools such as Kibana.
  • Exposure to setting up applications in Cloud and On-premise environments.
  • Knowledge of security best practices, vulnerability management, and testing tools such as Acunetix, Snyk, Checkmarx or Trivy.
  • Able to work in an Agile environment.
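
As a rough illustration of two of the aggregations named in the skills above, the sketch below computes a per-second rate from counter samples and a nearest-rank percentile from latency samples. The helper names and sample data are hypothetical. In practice these calculations would normally run in the metrics backend (for example via PromQL) rather than by hand.

```python
import math


def counter_rate(samples):
    """Per-second rate from (timestamp_seconds, counter_value) samples.

    A counter only goes up, so a decrease is treated as a reset to zero
    and the new value is counted as the increase for that interval.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical request counter scraped every 30 s, with one reset.
    requests_total = [(0, 100), (30, 160), (60, 20)]
    # Hypothetical request latencies in milliseconds.
    latencies_ms = [120, 80, 95, 300, 110]
    print(f"request rate: {counter_rate(requests_total):.2f} req/s")
    print(f"p50 latency: {percentile(latencies_ms, 50)} ms")
    print(f"p95 latency: {percentile(latencies_ms, 95)} ms")
```

The gap between the p50 and p95 values in the sample data shows why percentiles are usually preferred over averages for latency analysis: a single slow request dominates the tail without moving the median.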

About Us

SymphonyAI leads the way in innovation, harnessing cutting-edge AI and machine learning to transform industries worldwide. As a global powerhouse in AI solutions, we enable organizations to unlock the full potential of data-driven insights. With a presence in 20 countries, our enterprise applications deliver rapid, transformative value across retail, CPG, finance, manufacturing, media, IT, and public sectors. Combining unmatched AI technology, vertical expertise, and industry-specific insights, we create applications that drive maximum value for our customers. Join us in shaping the future as we build a "World Class Engineering Team" committed to high performance. With over 3000 talented leaders, data scientists, and professionals, we incubate and develop groundbreaking solutions within the SymphonyAI Group.