Gameball

Who We Are

We're a team that believes great products are built by empowered engineers who care deeply about reliability, performance, and continuous improvement. We value ownership, collaboration, and thoughtful engineering practices that allow us to build scalable systems and solve meaningful problems.

Our environment encourages engineers to explore new ideas, automate repetitive work, and improve how systems operate at scale. If you're passionate about building resilient infrastructure and improving production reliability, we'd love to work with you.

The Role

We are hiring a Platform Site Reliability Engineer II to help improve the reliability, performance, and operational maturity of our platform.

This role sits at the intersection of Site Reliability Engineering, platform engineering, and backend systems understanding. You will work on production reliability, incident response, observability, automation, bottleneck detection, and system performance across our cloud infrastructure and services.

This is not a traditional operations-heavy DevOps role. The focus is on engineering-led reliability, system performance, proactive issue detection, and automation, alongside a reasonable share of hands-on platform and DevOps operations work.

We welcome candidates from SRE, platform, infrastructure, or strong backend engineering backgrounds, provided they have solid exposure to production infrastructure, observability, incident handling, and system reliability.

Responsibilities

Improve the reliability, availability, and performance of production systems
Participate in and take ownership of incident response, troubleshooting, root cause analysis, and follow-up actions
Improve monitoring, alerting, dashboards, and observability, primarily using Datadog
Investigate production issues across infrastructure, Kubernetes workloads, ECS services, databases, caching layers, and application behavior
Identify system and application bottlenecks, and propose improvements in collaboration with backend engineers
Support capacity planning, scalability, resilience, and performance forecasting
Maintain and improve infrastructure running on AWS
Work with Kubernetes and ECS environments to improve runtime stability and deployment reliability
Use Terraform and infrastructure-as-code practices to increase consistency and reduce manual work
Contribute to hands-on DevOps and platform operations
Drive automation across monitoring, diagnostics, incident handling, and platform workflows
Use AI-assisted engineering tools (such as Claude Code or similar tools) to improve automation, troubleshooting, and engineering productivity

Qualifications

Required

3–5 years of experience in SRE, Platform Engineering, Infrastructure Engineering, or backend engineering with strong production infrastructure exposure
Strong hands-on experience with AWS
Experience working with Kubernetes and/or ECS
Experience with Terraform or similar infrastructure-as-code tools
Strong experience with monitoring, observability, and production troubleshooting, ideally using Datadog
Experience participating in or owning production incidents
Solid backend and systems understanding, including how infrastructure, runtime behavior, and application code interact
Ability to identify performance, scaling, reliability, and resource bottlenecks in production systems
Good understanding of PostgreSQL / RDS and Redis
Strong troubleshooting and analytical skills across infrastructure, services, dependencies, and application behavior
Ability to collaborate effectively with backend and platform engineering teams
Experience using AI tools such as Claude Code or similar assistants to support automation and engineering productivity

Preferred

Experience reading, debugging, or contributing to .NET services
Experience with load testing, performance testing, or capacity modeling
Familiarity with SLOs, SLIs, error budgets, and postmortem practices
Experience working in high-throughput or data-intensive production environments

Candidate Profile

The ideal candidate is an engineer who combines strong infrastructure and reliability foundations with solid backend systems understanding.

They should be comfortable:

Investigating production behavior
Identifying risks and reliability gaps early
Improving observability and diagnostics
Collaborating with engineering teams to improve system performance and resilience

They enjoy solving complex production problems and building systems that are stable, scalable, and continuously improving.

