Senior DevOps Engineer, Infrastructure & Reliability

Remote, USA • Full-time • Posted 2026-05-31

Conduct interviews with engineering teams to identify and remove operational friction in CI/CD, deployments, observability, and cloud environments.
Design and implement scalable infrastructure-as-code patterns using Terraform to standardize provisioning and reduce configuration drift.
Own and evolve the Kubernetes platform, including EKS or self-managed environments, so workloads are secure, scalable, and resilient.
Architect and optimize CI/CD pipelines to improve deployment frequency, reduce lead time, and increase release confidence.
Lead reliability initiatives such as incident response improvements, root cause analysis, and postmortem practices.
Design and enforce secure networking, IAM, and secrets management strategies across environments.
Improve observability through metrics, logs, and tracing using Datadog or similar tooling.
Optimize cloud costs through rightsizing, autoscaling, and architectural improvements.
Own disaster recovery planning, backup strategies, and multi-region resilience initiatives.
Refactor manual or brittle infrastructure into automated, testable, reproducible systems and drive adoption through documentation and hands-on support.

8+ years of experience in DevOps, SRE, or Infrastructure Engineering roles.
Proven experience designing and operating production Kubernetes environments at scale.
Deep hands-on expertise with AWS infrastructure and cloud networking.
Strong experience building and maintaining Terraform modules across large cloud environments.
Demonstrated ownership of CI/CD systems and measurable improvement of DORA metrics.
Experience leading incident response processes and driving meaningful postmortem outcomes.
Strong understanding of distributed systems, event-driven architectures with Kafka, and database performance with PostgreSQL.
Proven ability to modernize legacy infrastructure and eliminate manual operational toil.
Experience navigating high-ambiguity environments and translating operational friction into prioritized infrastructure roadmaps.
Nice to have: experience operating high-throughput Kafka clusters, tuning PostgreSQL or Redis, implementing autoscaling, building internal developer platforms, applying security best practices, working with multi-region systems, using Python for automation, or introducing SLO/error budget/chaos testing frameworks.
All remote hires must be able to travel to Orlando, Florida at least twice per year, as well as for orientation there.
