Summary
We are seeking a seasoned Senior Site Reliability Engineer (SRE) to join our dynamic team and play a pivotal role in driving our products and technology forward to accelerate business growth. This role is critical in ensuring the reliability, scalability, and performance of our infrastructure and applications. As a Senior SRE, you will collaborate with cross-functional teams to solve complex business challenges, from strategic planning to hands-on execution, ensuring our systems are resilient, secure, and highly available.
Responsibilities
- Establish and maintain standards for reliability and resilience across infrastructure and application components to ensure consistent performance and uptime.
- Identify and optimize redundancies, monitoring practices, and alerting patterns to prevent incidents and improve system health.
- Design and build resilient, highly available distributed systems that support business-critical applications.
- Develop and maintain infrastructure automation tools for cloud environments, enabling scalable and repeatable deployments.
- Continuously monitor systems and services using advanced tools, and provide rapid incident response by triaging and resolving system or client issues to minimize downtime.
- Oversee the health and performance of the application ecosystem, ensuring seamless integration and operation.
- Drive improvements in platform infrastructure and applications, focusing on reliability, resiliency, performance, and quality.
- Create and maintain comprehensive documentation, knowledge articles, and runbooks to support operational excellence and team enablement.
- Design and implement SRE patterns and practices that comply with security guidelines and policies, ensuring data protection and system integrity.
- Work closely with development, operations, and security teams to align on goals, share knowledge, and deliver robust solutions.
Requirements
Must-Have Skills
- Datadog: Expertise in using Datadog for monitoring infrastructure and applications, setting up dashboards, alerts, and analyzing metrics to ensure system health and performance.
- Splunk: Proficient in leveraging Splunk for log aggregation, searching, and troubleshooting to quickly identify and resolve issues.
- PagerDuty & Opsgenie: Experience with incident management platforms like PagerDuty and Opsgenie to manage on-call rotations, alerting, and incident escalation effectively.
- Azure DevOps: Strong skills in Azure DevOps for CI/CD pipeline creation, automation, and managing source control repositories.
- Documentation: Ability to produce clear, detailed, and accessible documentation, runbooks, and knowledge base articles to support operational processes.
- Collaboration: Excellent interpersonal and communication skills to work effectively across teams and departments.
- Educational Background: Bachelor’s degree in Computer Science or a related field, or equivalent professional experience.
- Kubernetes at Scale (AKS, EKS, GKE): Deep understanding and hands-on experience managing Kubernetes clusters at scale, particularly using Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS), or Google Kubernetes Engine (GKE).
- Kubectl and Helm: Proficiency with the Kubernetes command-line tool (kubectl) and Helm for managing Kubernetes applications and deployments.
- CI/CD Expertise: Strong experience designing, implementing, and maintaining continuous integration and continuous deployment pipelines.
- Azure DevOps & GitHub Actions: Skilled in using Azure DevOps and GitHub Actions for automation, build, test, and deployment workflows.
- Source Control Management (SCM): Proficient with SCM tools such as Git, including branching strategies, pull requests, and code reviews.
Nice-to-Have Skills
- Infrastructure as Code Tools (Terraform, Pulumi): Experience with IaC tools like Terraform or Pulumi to automate cloud infrastructure provisioning and management.
- Security Practices: Knowledge of security best practices, including encryption at rest and in transit, using tools such as Azure Key Vault, HashiCorp Vault, or Google Cloud KMS to safeguard sensitive data.
- Containerization: Experience deploying Java (Spring Boot) microservices in Docker container environments, ensuring efficient and consistent application delivery.
- Authentication/Authorization: Familiarity with authentication and authorization protocols such as OpenID Connect, OAuth 2.0, and SAML to secure applications and services.
- Scripting and Programming: Proficiency in scripting or programming languages such as Python, PowerShell, Java, or Node.js to automate tasks and develop custom solutions.
- Event-Driven Architectures: Understanding of event-driven and event sourcing patterns on platforms like Kafka, Azure Event Hubs, and RabbitMQ, as well as related architectural patterns such as CQRS (Command Query Responsibility Segregation).
Job Type: Remote
Allowed Countries: Argentina, Brazil, Chile, Colombia, Costa Rica, Mexico, Paraguay, Peru, Uruguay