Title: Cloud Infrastructure Site Reliability Engineer
Location: NC, US; Waltham, MA, US, 02451; Cranberry Township, PA, US, 16066-5209; Wichita, KS, US, 67208; Boulder, CO, US, 80301
Job Summary
As a Cloud Infrastructure / Site Reliability Engineer, you will be operating at the intersection of development and operations. Your role will involve engaging in and enhancing the lifecycle of cloud services - from design through deployment, operation, and refinement. You will be responsible for maintaining these services by measuring and monitoring their availability, latency, and overall system health and building automation for efficient cloud operations management.
You will play a crucial role in sustainably scaling systems through automation and driving changes that improve reliability and velocity. As part of your responsibilities, you will administer cloud-based environments that support our SaaS/IaaS offerings, which are implemented on a microservices, container-based architecture (Kubernetes). In addition, you will oversee a portfolio of customer-centric cloud services (SaaS/IaaS), ensuring their overall availability, performance, and security. You will work closely with both NetApp and cloud service provider teams, including those from Azure, located across the globe in regions such as RTP, Reykjavík, Bangalore, Sunnyvale, Redmond, and more.
Due to the critical nature of the services we support, this position involves participation in a rotation-based on-call schedule as part of our global team. This role offers the opportunity to work in a dynamic, global environment, ensuring the smooth operation of vital cloud services. To be successful in this role, you should be a motivated self-starter and self-learner, possess strong problem-solving skills, and be someone who embraces challenges.
Key Responsibilities
- Incident Response and Troubleshooting: Address and perform root cause analysis (RCA) of complex live production incidents and cross-platform issues involving OS, Networking, and Database in cloud-based SaaS/IaaS environments. Implement SRE best practices for effective resolution.
- Monitoring, Analysis, and Infrastructure Maintenance: Continuously monitor, analyze, and measure system health, availability, and latency using tools like Prometheus, Stackdriver, Elasticsearch, Grafana, and SolarWinds. Develop strategies to enhance system and application performance, availability, and reliability. In addition, maintain and monitor the deployment and orchestration of servers, Docker containers, databases, and general backend infrastructure.
- Documentation: Document system knowledge as you acquire it, create runbooks, and ensure critical system information is readily accessible.
- Security Management: Stay updated with security protocols and proactively identify, diagnose, and resolve complex security issues.
- Automation and Efficiency: Identify tasks and areas where automation can be applied to achieve time efficiencies and risk reduction. Develop software for deployment automation, packaging, and monitoring visibility.
- Issue Tracking and Resolution: Use the Atlassian toolchain and first-party cloud service management tools to track and resolve issues based on their priority.
- Team Collaboration and Influence: Work in tandem with other Cloud Infrastructure Engineers and developers to ensure maximum performance, reliability, and automation of our deployments and infrastructure. Additionally, consult and influence developers on new feature development and software architecture to ensure scalability.
- Debugging, Troubleshooting, and Advanced Support: Undertake debugging and troubleshooting of service bottlenecks throughout the entire software stack. Additionally, provide advanced tier 2 and 3 support for NetApp's Cloud Data Services solutions.
- Directly influence the decisions and outcomes related to solution implementation: measure and monitor availability, latency, and overall system health.
Job Requirements
- 8+ years of experience in scripting and infrastructure automation using tools and languages such as Ansible, Python, Go, or Ruby.
- Deep working knowledge of Containers, Kubernetes, Serverless computing implementation, and distributed systems design patterns.
- Experience with DevOps/SRE development methodologies.
- Proficiency in Linux/Unix and CoreOS.
- Experience with cloud platforms such as AWS, Azure, or Google Cloud.
- Ability to lead a Scrum team, influence stakeholders to maintain a product backlog effectively, and manage sprints.
Education
- A Bachelor of Science degree in Computer Science, a master’s degree, or equivalent experience is required.
Compensation:
The target salary range for this position is 152,150 - 196,900 USD. The salary offered will be determined by the candidate's location, qualifications, experience, and education and may be outside of this range. Final compensation packages are competitive and in line with industry standards, reflecting a variety of factors, and include a comprehensive benefits package. This may cover Health Insurance, Life Insurance, Retirement or Pension Plans, Paid Time Off (PTO), various Leave options, Performance-Based Incentives, an employee stock purchase plan, and/or restricted stock units (RSUs), with all offerings subject to regional variations and governed by local laws, regulations, and company policies. Benefits may vary by country and region, and further details will be provided as part of the recruitment process.
Job Segment:
Cloud, Computer Science, Database, Unix, Linux, Technology