Site Reliability Engineering Manager
At F5, we strive to bring a better digital world to life. Our teams empower organizations across the globe to create, secure, and run applications that enhance how we experience our evolving digital world. We are passionate about cybersecurity, from protecting consumers from fraud to enabling companies to focus on innovation.
Everything we do centers around people. That means we obsess over how to make the lives of our customers, and their customers, better. And it means we prioritize a diverse F5 community where each individual can thrive.
Position Overview: We are seeking an experienced and highly motivated Site Reliability Engineering (SRE) Manager to lead our SRE team and ensure the reliability, availability, and performance of our mission-critical systems. As the SRE Manager, you will play a pivotal role in shaping and executing the overall SRE strategy, collaborating closely with cross-functional teams to enhance system resilience and scalability. Your leadership and technical expertise will drive the continuous improvement of our infrastructure, processes, and monitoring capabilities.
Responsibilities:
- Team Leadership: Lead, mentor, and manage a team of skilled site reliability engineers, fostering a culture of collaboration, innovation, and professional growth. Provide technical guidance and leadership to enable the team to meet operational goals effectively.
- Strategy and Planning: Develop and execute the SRE strategy in alignment with the company's overall goals, ensuring the reliability and scalability of our services. Define and implement best practices for incident response, capacity planning, and performance optimization.
- Operational Excellence: Oversee the operational health of critical systems, addressing incidents and outages swiftly while identifying root causes and implementing preventive measures. Streamline incident response processes and drive post-incident reviews to continuously enhance system reliability.
- Infrastructure and Automation: Collaborate with engineering teams to design and build scalable, reliable infrastructure. Drive the adoption of automation and infrastructure-as-code principles to manage and deploy systems efficiently.
- Monitoring and Alerting: Establish robust monitoring, logging, and alerting systems to proactively identify and address potential issues. Ensure that the team responds to alerts effectively and that monitoring tools are kept up to date.
- Performance Optimization: Continuously analyze system performance, identifying bottlenecks and areas for improvement. Implement solutions to optimize resource utilization and application response times.
- Capacity Planning: Work closely with engineering and product teams to forecast capacity needs and plan for scaling infrastructure. Develop strategies to ensure smooth scaling during peak usage periods.
- Collaboration: Foster strong partnerships with engineering, product, and DevOps teams to understand system requirements, deployment pipelines, and release processes. Collaborate on architectural reviews and contribute to the design of reliable, scalable systems.
- Vendor Relationships: Manage relationships with third-party vendors and service providers, evaluating their solutions and ensuring that they align with our reliability and performance goals.
- Continuous Improvement: Drive a culture of continuous improvement within the SRE team and across the organization. Identify opportunities to enhance processes, technologies, and practices to achieve greater efficiency and reliability.
Qualifications:
- Bachelor's or higher degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
- 10+ years of experience, with a proven track record of managing and leading SRE or operations teams in a fast-paced, technology-focused environment.
- Strong understanding of SRE principles, DevOps practices, and modern infrastructure technologies.
- Proficiency in cloud computing platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
- Demonstrated experience with infrastructure-as-code tools (e.g., Terraform, Ansible) and version control systems (e.g., Git).
- Solid understanding of monitoring and alerting tools (e.g., Prometheus, Grafana, Splunk) and incident management processes.
- Excellent problem-solving skills with the ability to analyze complex technical issues and provide effective solutions.
- Strong communication and interpersonal skills, with the ability to collaborate effectively with cross-functional teams and communicate technical concepts to non-technical stakeholders.
Preferred Qualifications:
- Master's degree in a relevant field.
- Relevant professional certifications (e.g., AWS Certified DevOps Engineer - Professional, Google Cloud Professional Cloud DevOps Engineer).
- Previous experience working in Agile or DevOps-oriented environments.
The Job Description is intended to be a general representation of the responsibilities and requirements of the job. However, the description may not be all-inclusive, and responsibilities and requirements are subject to change.
The annual U.S. base pay range for this position is: $150,960.00 - $226,440.00
F5 maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, geographic locations, and market conditions, as well as to reflect F5’s differing products, industries, and lines of business. The pay range referenced is as of the time of the job posting and is subject to change.
You may also be offered incentive compensation, bonus, restricted stock units, and benefits. More details about F5’s benefits can be found at the following link: https://www.f5.com/company/careers/benefits. F5 reserves the right to change or terminate any benefit plan without notice.
Please note that F5 only contacts candidates through an F5 email address (ending with @f5.com) or automated email notifications from Yello/Workday (ending with f5.com or @myworkday.com).
Equal Employment Opportunity