Summary

We are seeking a seasoned Senior Site Reliability Engineer (SRE) to join our dynamic team and play a pivotal role in driving our products and technology forward to accelerate business growth. This role is critical in ensuring the reliability, scalability, and performance of our infrastructure and applications. As a Senior SRE, you will collaborate with cross-functional teams to solve complex business challenges, from strategic planning to hands-on execution, ensuring our systems are resilient, secure, and highly available.

Responsibilities

Establish and maintain standards for reliability and resilience across infrastructure and application components to ensure consistent performance and uptime.
Identify and optimize redundancies, monitoring practices, and alerting patterns to prevent incidents and improve system health.
Design and build resilient, highly available distributed systems that support business-critical applications.
Develop and maintain infrastructure automation tools for cloud environments, enabling scalable and repeatable deployments.
Continuously monitor systems and services using advanced tools, provide rapid incident response, and triage and resolve system or client issues to minimize downtime.
Oversee the health and performance of the application ecosystem, ensuring seamless integration and operation.
Drive improvements in platform infrastructure and applications focusing on reliability, resiliency, performance, and quality.
Create and maintain comprehensive documentation, knowledge articles, and runbooks to support operational excellence and team enablement.
Design and implement SRE patterns and practices that comply with security guidelines and policies, ensuring data protection and system integrity.
Work closely with development, operations, and security teams to align on goals, share knowledge, and deliver robust solutions.

Requirements

Must-Have Skills

Datadog: Expertise in using Datadog for monitoring infrastructure and applications, setting up dashboards, alerts, and analyzing metrics to ensure system health and performance.
Splunk: Proficient in leveraging Splunk for log aggregation, searching, and troubleshooting to quickly identify and resolve issues.
PagerDuty & OpsGenie: Experience with incident management platforms like PagerDuty and OpsGenie to manage on-call rotations, alerting, and incident escalation effectively.
Azure DevOps: Strong skills in Azure DevOps for CI/CD pipeline creation, automation, and managing source control repositories.
Documentation: Ability to produce clear, detailed, and accessible documentation, runbooks, and knowledge base articles to support operational processes.
Collaboration: Excellent interpersonal and communication skills to work effectively across teams and departments.
Educational Background: Bachelor’s degree in Computer Science or a related field, or equivalent professional experience.
Kubernetes at Scale (AKS, EKS, GKE): Deep understanding and hands-on experience managing Kubernetes clusters at scale, particularly using Azure Kubernetes Service (AKS), Amazon EKS, or Google GKE.
Kubectl and Helm: Proficiency with the Kubernetes command-line tool (kubectl) and Helm for managing Kubernetes applications and deployments.
CI/CD Expertise: Strong experience designing, implementing, and maintaining continuous integration and continuous deployment pipelines.
Azure DevOps & GitHub Actions: Skilled in using Azure DevOps and GitHub Actions for automation, build, test, and deployment workflows.
Source Control Management (SCM): Proficient with SCM tools such as Git, including branching strategies, pull requests, and code reviews.

Nice-to-Have Skills

Infrastructure as Code Tools (Terraform, Pulumi): Experience with IaC tools like Terraform or Pulumi to automate cloud infrastructure provisioning and management.
Security Practices: Knowledge of security best practices, including encryption at rest and in transit, using tools such as Azure Key Vault, HashiCorp Vault, or Google Cloud KMS to safeguard sensitive data.
Containerization: Experience deploying Java (Spring Boot) microservices in Docker container environments, ensuring efficient and consistent application delivery.
Authentication/Authorization: Familiarity with authentication and authorization protocols such as OpenID Connect, OAuth 2.0, and SAML to secure applications and services.
Scripting and Programming: Proficiency in scripting or programming languages such as Python, PowerShell, Java, or Node.js to automate tasks and develop custom solutions.
Event-Driven Architectures: Understanding of event-driven and event sourcing patterns using platforms like Kafka, Azure Event Hubs, and RabbitMQ, as well as architectural patterns such as CQRS (Command Query Responsibility Segregation).

Job Type: Remote

Allowed Countries: Argentina, Brazil, Chile, Colombia, Costa Rica, Mexico, Paraguay, Peru, Uruguay
