Title:  Customer Reliability Engineer

Location: Research Triangle Park, NC, US, 27709

Requisition ID:  103681

Job Summary

As a Customer Reliability Engineer, you’ll manage a portfolio of customer-facing cloud services (SaaS/IaaS), ensuring their overall availability, performance, and security. You’ll work in a highly collaborative environment with NetApp and Google/AWS/Microsoft teams from all over the world (RTP, Reykjavík, Bangalore, Sunnyvale, Redmond, and more). Because of the critical nature of the services we support, this position includes rotational on-call work as part of a global team.

Job Requirements

You will be working in a hectic, fast-paced organization as an engineer on the Customer Reliability Engineering (CRE) team. This team is responsible for assisting NetApp Cloud Volume Services (CVS) and Astra customers in resolving complex technical issues in production environments.

We are looking for a CRE with a deep understanding of complex distributed system platforms and cloud technologies, and the ability to explain them simply to customers and to SREs within a customer organization.

You will have the opportunity to work with your teammates and our customers to support many new, leading-edge technologies that solve real challenges. You will work to provide robust feedback and guidance to our Product and Engineering teams while being a voice for our customers. You want to make our customers successful while strengthening their relationship with NetApp. You can make a huge impact and have real ownership for the work you do.

Essential Responsibilities

  • Work with external customers and partners to help make them successful
  • Respond to, troubleshoot, and drive root cause analysis (RCA) of complex live production incidents and cross-platform issues spanning OS, networking, and database layers in cloud-based SaaS/IaaS environments, following and implementing SRE best practices
  • Continuously monitor, analyze, and measure availability, latency, and overall system health using tools like Prometheus, Stackdriver, Elasticsearch, Grafana, and SolarWinds, and develop steps to improve system and application performance, availability, and reliability
  • Document your system knowledge as you acquire it over time, create runbooks, and ensure critical system information is readily available
  • Keep up to date with security developments and proactively identify, diagnose, and solve complex security issues
  • Maintain and monitor the deployment and orchestration of servers, Docker containers, databases, and general backend infrastructure
  • Automate tasks or parts of the system that are performed manually or would otherwise benefit from automation
  • Use Atlassian Jira to track issues to resolution based on their priority

 

Qualifications

  • Experience with incident management processes and the ability to resolve issues within the organization's agreed SLAs/SLOs
  • Advanced knowledge of Linux operating systems (Ubuntu, CentOS, etc.)
  • Working knowledge of container-based architecture (Kubernetes)
  • Intermediate knowledge of tools and languages such as Ansible, Python, Bash, Go, and PowerShell, or other scripting languages
  • Intermediate knowledge of algorithms, data structures, and databases (SQL/NoSQL)
  • Intermediate knowledge of networking concepts
  • Understanding and real-world hands-on experience with cloud environments such as GCP or AWS
  • General knowledge of site reliability engineering principles

Education

  • BS in computer science or equivalent, or 6+ years of professional experience

