PW – Sr. SRE B. – Job3730

Summary

We are looking for a seasoned Site Reliability Engineer (SRE) to join our team and support our strategy of driving products and technology to accelerate business growth. As an SRE, you will work alongside a team of problem solvers, helping to solve complex business issues from strategy to execution.

Responsibilities

  • Defining reliability and resilience standards for infrastructure and application components.
  • Proactively optimizing redundancies, monitoring practices, and alerting patterns.
  • Developing resilient and highly available distributed systems.
  • Building infrastructure as code tools for cloud environments.
  • Monitoring systems and services, providing incident response to triage and resolve system or client issues.
  • Managing the application ecosystem and improving platform infrastructure and applications for high reliability, resiliency, performance, and quality.
  • Creating documentation, knowledge articles, and runbooks.
  • Designing and implementing SRE patterns that adhere to our client’s security guidelines and policies.

Requirements

  • Bachelor’s degree in Computer Science or related field (or equivalent work experience).
  • At least 4 years of relevant working experience as a Site Reliability Engineer or similar role.
  • Advanced Kubernetes expertise – Strong skills in Kubernetes at scale using AKS, EKS, or GKE. Experience with kubectl and Helm. Familiarity with tools like Lens or Rancher.
  • Observability – Experience setting up tools like Datadog and Splunk for actionable insights on microservice environments, including synthetics, application performance monitoring, logging, and alerting (PagerDuty/OpsGenie integrations).
  • CI/CD expertise – Experience using Azure DevOps and GitHub Actions for continuous integration and continuous deployment processes.
  • SCM proficiency – Working with tools like GitHub for source code management, along with experience in branching strategies like GitFlow or trunk-based development.
  • Strong troubleshooting skills – Ability to dive deep into code-level analysis to provide development teams with a head start on resolving application issues. Effective contribution to root cause analysis exercises.
  • Good communication skills – Active listening, verbal and non-verbal communication, clarity, concision, confidence, open-mindedness, and respect.
  • Good documentation skills – Ability to document automation and technical efforts clearly so that solutions are easy to adopt and adapt.
  • Collaboration skills – Ability to work effectively with Scrum/Dev teams using a push/pull philosophy, managing expectations and contributing to the stability and improvement of the platform.

Nice to Have

  • Infrastructure as Code tools (Terraform, Pulumi), preferably with experience developing modules rather than only using them.
  • Security practices, including encryption at rest and in transit with tools like Azure Key Vault, HashiCorp Vault, or Google Cloud KMS.
  • Containerization experience deploying Java (Spring Boot) microservices in Docker environments.
  • Automation – Ability to identify toil and opportunities to reduce it within the team.
  • Authentication/Authorization – Familiarity with Authn/Authz schemes like OpenID, OAuth 2.0, SAML.
  • Scripting and Programming – Experience with Python, PowerShell, Java, or Node.js.
  • Familiarity with event-driven and event-sourcing patterns such as CQRS, using platforms like Kafka, Azure Event Hubs, or RabbitMQ.
