Title: Site Reliability Engineer
Canberra, ACT, AU
Job Summary
Our TechOps Engineers are the frontline team keeping our large fleet of cloud-hosted Apache Kafka, Cassandra, OpenSearch, Cadence, Valkey, ClickHouse and PostgreSQL clusters up and running. Every day you will diagnose and solve challenging, interesting technical problems, providing a service that some of the leading global names in tech rely on to deliver for millions of end users.
This role is for an Australia-based Senior TechOps Engineer focusing primarily on open-source Cadence. It includes operating, maintaining, upgrading and continuously improving our Managed Service for Cadence (across AWS, Azure and GCP) to deliver a great customer experience.
Job Requirements
- Work with our Managed Service product development team to establish Cadence operational requirements and support procedures.
- Respond to customer queries and incidents, diagnosing and solving complex technical issues by liaising with customers’ engineers. This includes written communication via support tickets and occasional video-call support.
- Work extensively with Apache Cassandra, Apache Kafka, OpenSearch and PostgreSQL, as well as cloud providers such as AWS, GCP and Azure.
- Assist and mentor Level-1 team members to develop their technical capabilities in Cadence.
- Undertake complex cluster operations such as migrations, upgrades and maintenance.
- Provide expert operational support for our nodes running in the cloud (AWS, Azure and GCP) as well as on-premises, using technologies such as Linux (Debian) and Docker, and languages including Java, Python and Bash.
- Investigate issues and apply standard maintenance procedures to optimize the performance and stability of production systems.
- Liaise with the Development and Product Management teams through all stages of the development cycle to ensure proper release processes and procedures are followed.
- Develop and continually improve our suite of internal automation tools, applications and processes.
- Be a proactive, reliable and supportive member of the TechOps team, and participate in a rotating L2 shift roster.
Skills and Education
We're looking for smart engineers with exceptional communication skills, a positive attitude, and a passion for IT and learning new things. We expect you to be, or quickly become, proficient in the range of technologies we use.
You must have 3–5 years of working experience, along with:
- Experience managing production environments, including performance benchmarking and tuning at the application and kernel level.
- Strong Linux skills and experience in cloud environments (preferably AWS, GCP or Azure) are a must. You should be comfortable working from the command line; this is essential, as there are no GUIs here.
- Familiarity with installing and maintaining VMs and applications at scale, including upgrades, migrations and lifecycle management.
- Ability to debug applications using logs and metrics, and to replicate issues in a local environment.
- Experience with Ansible, Prometheus, Terraform, Grafana and Docker is preferred.
- Good fundamental computer science and software engineering skills and knowledge, particularly in operating system internals, memory management and networking.
- Ideally, programming skills in languages such as Go, Python, Java, Bash and SQL, and experience with source code control using Git.
- Exceptional ability to communicate clearly and professionally in written and verbal English (essential).
- Follow required processes and procedures.
- Work as part of a team and use your initiative to get things done.
- Passion for all things IT, and especially open source.
- Customer service experience is an advantage.