Title: Site Reliability Engineer (Kafka)
Canberra, ACT, AU
Job Summary
NetApp is looking for a Senior Techops Engineer to join our growing Instaclustr team in Australia. NetApp’s Instaclustr offering provides open source as a service, delivering reliability at scale. We manage cutting-edge open-source technologies (Cassandra, Kafka, PostgreSQL, Redis/Valkey, OpenSearch, ClickHouse and Cadence) for our customers around the world.
NetApp Instaclustr makes it easy for our customers to run powerful open-source applications at the highest levels of scale. We have developed a platform that takes care of the whole lifecycle: provisioning infrastructure, installing applications and, most importantly, keeping the applications running reliably in production. Since being founded in 2013, Instaclustr has grown strongly, with over 300 customers worldwide, and over 19,000 nodes under management.
Our Technical Operations Engineers are the frontline team keeping our large fleet of cloud-hosted open-source clusters up and running. Your work will ensure the security, reliability and performance of world-class systems and databases. You will collaborate with our customers’ technical teams, from globally recognised companies in the gaming, banking and logistics sectors, ranging from big multinationals to emerging start-ups.
The Role
If you have excellent operational knowledge of managing Kafka clusters, look no further!
As a Site Reliability Engineer (Kafka), you will be part of the frontline team keeping our large fleet of cloud-hosted Kafka clusters up and running. Every day, you will diagnose and solve interesting technical problems, providing Kafka as a managed service in a highly automated environment. Our service is relied on by some of the leading global names in banking and financial services, telecom, IoT and tech, companies that interact with millions of end users.
Skills & Experience
We're looking for smart engineers with exceptional communication skills, a positive attitude, and a passion for IT and learning new things. We expect you to be, or quickly become, proficient in a range of the technologies we use. Successful candidates for this role will:
- Have strong experience in Kafka, and a desire to learn more and develop to a true expert level.
- Ideally already have experience diagnosing operational issues through the analysis of logs and graphs.
- Ideally have past experience with upgrades and migrations of the above-mentioned technologies.
- Have good experience working with at least one public cloud provider such as AWS, Azure or GCP.
- Preferably have past IT customer service or support experience.
- Have good fundamental computer science and software engineering skills and knowledge, particularly operating system internals, memory management, and networking.
- Have strong knowledge of and experience with Linux, and be comfortable working from the command line (essential).
- Have an exceptional ability to communicate clearly and professionally in written and verbal English (essential).
- Work as part of a team and use their initiative to get things done.
- Be able to follow required processes and procedures.
- Be able to investigate and research issues by reviewing the source code.
- Ideally have programming skills in Python or Java, and experience with source code control using Git.
I'm interested. What else will I be doing?
- Provide expert operational support to our nodes running in the cloud (AWS, Azure and GCP), using technologies such as Linux (Debian) and Docker, and languages including Java, Python and bash.
- Participate in the on-call Level 2 roster.
- Liaise with our customers’ engineers in resolving interesting issues related to Kafka usage and other supported technologies.
- Undertake complex cluster operations such as migrations, upgrades and maintenance on our fleet.
- Develop and continually improve our suite of internal automation tools, applications, and processes.
Job Segment:
Cloud, Open Source, Computer Science, Developer, Linux, Technology