Title:  Customer Reliability Engineer

Location: 

Research Triangle Park, NC, US, 27709

Requisition ID:  103701

Job Summary

As a Customer Reliability Engineer, you’ll manage a portfolio of customer-facing cloud services (SaaS/IaaS), ensuring their overall availability, performance, and security. You’ll work in a highly collaborative environment with NetApp and Google/AWS/Microsoft teams from all over the world (RTP, Reykjavík, Bangalore, Sunnyvale, Redmond, and more). Because of the critical nature of the services we support, this position includes rotational on-call work as part of a global team.

Job Requirements

  • You will be working in a hectic, fast-paced organization as an engineer on the Customer Reliability Engineering (CRE) team. This team is responsible for assisting NetApp Cloud Volume Services (CVS) and Astra customers in resolving complex technical issues in production environments.

    We are looking for a CRE with an understanding of complex distributed system platforms and cloud technologies, and the ability to explain them simply to customers and to SREs within a customer organization.

    You will have the opportunity to work with your teammates and our customers to support many new, leading-edge technologies that solve real challenges. You will provide robust feedback and guidance to our Product and Engineering teams while being a voice for our customers. You want to make our customers successful while strengthening their relationship with NetApp. You can make a huge impact and have real ownership of the work you do.

Essential Responsibilities

  • Work with external customers and partners to help make them successful
  • Respond to, troubleshoot, and drive root cause analysis (RCA) of complex live production incidents and cross-platform issues involving OS, networking, and databases in cloud-based SaaS/IaaS environments by following and implementing SRE best practices
  • Continuously monitor, analyze, and measure availability, latency, and overall system health using tools like Prometheus, Stackdriver, Elasticsearch, Grafana, and SolarWinds, and develop steps to improve system and application performance, availability, and reliability
  • Document your system knowledge as you acquire it over time, create runbooks, and ensure critical system information is readily available
  • Keep up to date with security developments and proactively identify, diagnose, and solve complex security issues
  • Maintain and monitor the deployment and orchestration of servers, Docker containers, databases, and general backend infrastructure
  • Apply automation to any tasks or parts of the system that are performed manually or would otherwise benefit from it
  • Utilize Atlassian Jira to track issues to resolution based on their priority

Qualifications

  • Knowledge of incident management processes and the ability to resolve issues within agreed organizational SLAs/SLOs
  • Intermediate knowledge of Linux operating systems (Ubuntu, CentOS, etc.)
  • Basic knowledge of container-based architecture (Kubernetes)
  • Basic knowledge of tools like Ansible, Python, Bash, Go, PowerShell, and other scripting languages
  • Basic knowledge of algorithms, data structures, and databases (SQL/NoSQL)
  • Basic knowledge of networking concepts
  • Understanding of cloud environments such as GCP or AWS
  • General knowledge of site reliability engineering principles

Education

  • A minimum of 1 year of incident management or operational work is required
  • Degree in Computer Science or related field


Nearest Major Market: Durham
Nearest Secondary Market: Raleigh

Job Segment: Database, Linux, SQL, Computer Science, Engineer, Technology, Engineering