Site Reliability Engineering, Automation & Cloud Scaling - Competitive Rates, High-Impact Infrastructure Project
About the Role
Join a leading studio in the high-growth gaming industry, focused on building, operating, and scaling the reliable, high-performance systems that our global community of players and partners depends on. This role is a critical opportunity for a seasoned Site Reliability Engineer (SRE) to lead the complete reshaping of our infrastructure from the ground up.
You will apply software engineering principles to operations, architecting resilient cloud infrastructure, significantly reducing toil through automation, and championing SRE best practices across the engineering organization. This is an ideal role for someone passionate about performance, resilience, and leading transformative technical projects.
Key Responsibilities
- Design, build, and maintain highly reliable, scalable production systems in cloud environments (primarily AWS).
- Take an active part in incident response and on-call support, and conduct blameless post-incident reviews.
- Define and manage SLOs, SLIs, and error budgets, driving reliability improvements across the engineering teams.
- Improve system observability through robust metrics, monitoring, tracing, and alerting solutions.
- Develop automation and tooling to eliminate operational toil and streamline development workflows.
- Perform capacity planning and performance tuning, and conduct reliability-focused architecture reviews.
- Implement and manage infrastructure-as-code solutions (Terraform or equivalent).
- Perform chaos engineering, failure testing, and resilience validation.
- Work closely with development teams to ensure new services meet strict reliability and operational standards.
Skills & Experience Required
- Strong software engineering skills in at least one language (e.g., Python, Go, Java, or similar).
- Hands-on experience operating and scaling cloud-based production systems (preferably AWS).
- Solid understanding of distributed systems, networking, load balancing, and Linux internals.
- Experience with containerization and orchestration (Docker, Kubernetes, or ECS).
- Strong knowledge of observability tooling (metrics, logging, tracing) and formal incident management practices.
- Experience with Infrastructure-as-Code (Terraform, CloudFormation, or similar).
- Experience designing and maintaining CI/CD pipelines and deployment automation.
- Proven background in running services in production, including on-call participation.
- Demonstrated ability to troubleshoot complex systems under pressure.
- Experience with SLO design, error budgets, and SRE operational models is a plus.
What You’ll Bring
- A reliability-first mindset balanced with practicality and pragmatism.
- A strong passion for automation and aggressively reducing repetitive work.
- A strong sense of ownership, curiosity, and a willingness to dive deep into production issues.
- A highly collaborative approach, partnering with engineers to improve system quality for all.
Contract Details
- Duration: Initial 3- to 6-month contract, with a high likelihood of extension based on project phases.
- Location: Primarily based in central Auckland (hybrid work environment).
- Remuneration: Competitive hourly rates commensurate with your senior SRE experience and specialised skills in gaming and high-scale environments.
How to Apply
To apply for this position, please submit your CV and a cover letter detailing your relevant experience in SRE and cloud infrastructure.
Alternatively, contact Amaan on 0220607986 or email amaan.kazmi@randstaddigital.co.nz for a confidential discussion.
At Randstad, we are passionate about providing equal employment opportunities and embracing diversity to the benefit of all. We actively encourage applications from any background.
...