Who We Are
We're a team that believes great products are built by empowered engineers who care deeply about reliability, performance, and continuous improvement. We value ownership, collaboration, and thoughtful engineering practices that allow us to build scalable systems and solve meaningful problems.
Our environment encourages engineers to explore new ideas, automate repetitive work, and improve how systems operate at scale. If you're passionate about building resilient infrastructure and improving production reliability, we'd love to work with you.
The Role
We are hiring a Platform Site Reliability Engineer II to help improve the reliability, performance, and operational maturity of our platform.
This role sits at the intersection of Site Reliability Engineering, platform engineering, and backend systems. You will work on production reliability, incident response, observability, automation, bottleneck detection, and system performance across our cloud infrastructure and services.
This is not a traditional operations-heavy DevOps role. The focus is on engineering-led reliability, system performance, proactive issue detection, and automation, with a reasonable level of hands-on platform and DevOps operations.
We welcome candidates from SRE, platform, infrastructure, or strong backend engineering backgrounds, provided they have solid exposure to production infrastructure, observability, incident handling, and system reliability.
Responsibilities
- Improve the reliability, availability, and performance of production systems
- Participate in and take ownership of incident response, troubleshooting, root cause analysis, and follow-up actions
- Improve monitoring, alerting, dashboards, and observability, primarily using Datadog
- Investigate production issues across infrastructure, Kubernetes workloads, ECS services, databases, caching layers, and application behavior
- Identify system and application bottlenecks, and propose improvements in collaboration with backend engineers
- Support capacity planning, scalability, resilience, and performance forecasting
- Maintain and improve infrastructure running on AWS
- Work with Kubernetes and ECS environments to improve runtime stability and deployment reliability
- Use Terraform and infrastructure-as-code practices to increase consistency and reduce manual work
- Contribute to hands-on DevOps and platform operations
- Drive automation across monitoring, diagnostics, incident handling, and platform workflows
- Use AI-assisted engineering tools (such as Claude Code or similar tools) to improve automation, troubleshooting, and engineering productivity
Qualifications
Required
- 3–5 years of experience in SRE, platform engineering, infrastructure engineering, or backend engineering with strong production infrastructure exposure
- Strong hands-on experience with AWS
- Experience working with Kubernetes and/or ECS
- Experience with Terraform or similar infrastructure-as-code tools
- Strong experience with monitoring, observability, and production troubleshooting, ideally using Datadog
- Experience participating in or owning production incidents
- Solid backend and systems understanding, including how infrastructure, runtime behavior, and application code interact
- Ability to identify performance, scaling, reliability, and resource bottlenecks in production systems
- Good understanding of PostgreSQL / RDS and Redis
- Strong troubleshooting and analytical skills across infrastructure, services, dependencies, and application behavior
- Ability to collaborate effectively with backend and platform engineering teams
- Experience using AI tools such as Claude Code or similar assistants to support automation and engineering productivity
Preferred
- Experience reading, debugging, or contributing to .NET services
- Experience with load testing, performance testing, or capacity modeling
- Familiarity with SLOs, SLIs, error budgets, and postmortem practices
- Experience working in high-throughput or data-intensive production environments
Candidate Profile
The ideal candidate is an engineer who combines strong infrastructure and reliability foundations with solid backend systems understanding.
They should be comfortable:
- Investigating production behavior
- Identifying risks and reliability gaps early
- Improving observability and diagnostics
- Collaborating with engineering teams to improve system performance and resilience
They enjoy solving complex production problems and building systems that are stable, scalable, and continuously improving.


