
Title:  Site Reliability Engineer


Research Triangle Park, NC, US, 27709

Requisition ID:  34294
Job Summary

Here at NetApp, we are developing a broad new portfolio of SaaS solutions that enable our customers to harness the power of their data in new and interesting ways. In support of that mission, we are rapidly expanding our Site Reliability Engineering (SRE) organization to run these SaaS offerings. We are looking for smart individuals who get things done to join our team and help deliver these capabilities to market.

As a Site Reliability Engineer, you’ll manage a portfolio of customer-facing cloud services (SaaS/IaaS), ensuring their overall availability, performance, and security. This is an office-based position at the NetApp RTP campus and reports directly to an SRE manager. You’ll work in a highly collaborative environment with NetApp and Google/Microsoft teams from all over the world (Reykjavik, Bangalore, Sunnyvale, Redmond, and more). Because of the critical nature of the services we support, this position includes rotational on-call work as part of a global team.

Job Requirements

You will administer the cloud-based environments that support our SaaS/IaaS offerings, which are implemented on a container-based microservices architecture (Kubernetes). You will automate repetitive and error-prone tasks and processes, using tools like Ansible, Python, PowerShell, and other scripting languages. You will ensure adequate monitoring is in place so that we are alerted quickly to any issues in the production environments, using tools like Prometheus, Observium, Elasticsearch, and Splunk. You will continuously measure availability, latency, and overall system health, using tools like Grafana, OpsRamp, and others. You’ll secure our environment against threats by deploying patches and least-privilege configurations. You’ll respond to incidents and drive changes that prevent issues from recurring. You’ll look for opportunities to automate recovery from incidents that we cannot prevent in advance. You will use Atlassian Jira to track issues to resolution based on their priority.

You’ll design and implement tools for the automated deployment of multiple environments. You’ll work deeply in one or more cloud environments (AWS, Google, Azure). You’ll document your findings and the solutions you create using Atlassian Confluence.


Key characteristics: You are generally curious and highly motivated, with a passion for ensuring scalable, performant, and highly available solutions. You are great at, and love, debugging and solving technical problems throughout a technology stack. You have the mindset that if you must do the same thing twice, you’re going to automate it! You are familiar with Linux and Windows operating systems. You can write automation and understand at least one language such as Bash, Python, PowerShell, Ansible, or Go. You have a basic understanding of public cloud vendors such as AWS, Azure, Google Cloud, or others.

Typically requires a minimum of 8 years of related experience with a Bachelor’s degree; or 6 years and a Master’s degree; or a PhD with 3 years of experience; or equivalent experience.

Nearest Major Market: Durham
Nearest Secondary Market: Raleigh

Job Segment: Engineer, Linux, Cloud, Software Engineer, Engineering, Research, Technology