Platform Site Reliability Engineer II

Hybrid
Full-time
Professional level

Who We Are

We're a team that believes great products are built by empowered engineers who care deeply about reliability, performance, and continuous improvement. We value ownership, collaboration, and thoughtful engineering practices that allow us to build scalable systems and solve meaningful problems.

Our environment encourages engineers to explore new ideas, automate repetitive work, and improve how systems operate at scale. If you're passionate about building resilient infrastructure and improving production reliability, we'd love to work with you.

The Role

We are hiring a Platform Site Reliability Engineer II to help improve the reliability, performance, and operational maturity of our platform.

This role sits at the intersection of Site Reliability Engineering, platform engineering, and backend systems understanding. You will work on production reliability, incident response, observability, automation, bottleneck detection, and system performance across our cloud infrastructure and services.

This is not a traditional operations-heavy DevOps role. The focus is on engineering-led reliability, system performance, proactive issue detection, and automation, though the role still includes a reasonable share of hands-on platform and DevOps operations work.

We welcome candidates from SRE, platform, infrastructure, or strong backend engineering backgrounds, provided they have solid exposure to production infrastructure, observability, incident handling, and system reliability.

Responsibilities

  • Improve the reliability, availability, and performance of production systems
  • Participate in and take ownership of incident response, troubleshooting, root cause analysis, and follow-up actions
  • Improve monitoring, alerting, dashboards, and observability, primarily using Datadog
  • Investigate production issues across infrastructure, Kubernetes workloads, ECS services, databases, caching layers, and application behavior
  • Identify system and application bottlenecks, and propose improvements in collaboration with backend engineers
  • Support capacity planning, scalability, resilience, and performance forecasting
  • Maintain and improve infrastructure running on AWS
  • Work with Kubernetes and ECS environments to improve runtime stability and deployment reliability
  • Use Terraform and infrastructure-as-code practices to increase consistency and reduce manual work
  • Contribute to hands-on DevOps and platform operations
  • Drive automation across monitoring, diagnostics, incident handling, and platform workflows
  • Use AI-assisted engineering tools (such as Claude Code or similar tools) to improve automation, troubleshooting, and engineering productivity

Qualifications

Required

  • 3–5 years of experience in SRE, Platform Engineering, Infrastructure Engineering, or backend engineering with strong production infrastructure exposure
  • Strong hands-on experience with AWS
  • Experience working with Kubernetes and/or ECS
  • Experience with Terraform or similar infrastructure-as-code tools
  • Strong experience with monitoring, observability, and production troubleshooting, ideally using Datadog
  • Experience participating in or owning production incidents
  • Solid backend and systems understanding, including how infrastructure, runtime behavior, and application code interact
  • Ability to identify performance, scaling, reliability, and resource bottlenecks in production systems
  • Good understanding of PostgreSQL / RDS and Redis
  • Strong troubleshooting and analytical skills across infrastructure, services, dependencies, and application behavior
  • Ability to collaborate effectively with backend and platform engineering teams
  • Experience using AI tools such as Claude Code or similar assistants to support automation and engineering productivity

Preferred

  • Experience reading, debugging, or contributing to .NET services
  • Experience with load testing, performance testing, or capacity modeling
  • Familiarity with SLOs, SLIs, error budgets, and postmortem practices
  • Experience working in high-throughput or data-intensive production environments

Candidate Profile

The ideal candidate is an engineer who combines strong infrastructure and reliability foundations with solid backend systems understanding.

You should be comfortable:

  • Investigating production behavior
  • Identifying risks and reliability gaps early
  • Improving observability and diagnostics
  • Collaborating with engineering teams to improve system performance and resilience

You enjoy solving complex production problems and building systems that are stable, scalable, and continuously improving.
