Experience: 8+ years, SRE, AWS, Python, Kubernetes
Job Description & Details

The demand for resilient, cloud-native services has never been higher, and organizations are racing to embed reliability into every layer of their stack. As a Site Reliability Engineer you'll be at the forefront of that transformation, turning chaos into predictable, automated operations. This role offers a unique chance to work across AWS and Azure, lead on-call rotations, and shape automation strategies for a massive enterprise environment.

# Job Summary
You will join Fidelity's Production Services team to design, build, and operate highly available platforms on AWS (EKS) and Azure (AKS). The role blends software engineering with systems operations, emphasizing automation (Ansible, Python, Jenkins), observability (Datadog), and incident response. You'll mentor developers, drive resiliency engineering practices, and continuously improve the reliability of 3,000+ applications.

# Top 3 Critical Skills Table
| Skill | Why it's critical | Mastery Level |
|-------|-------------------|--------------|
| Kubernetes (EKS/AKS) | Core platform for containerized workloads; reliability hinges on proper orchestration and scaling. | Senior |
| AWS Cloud (incl. automation) | Majority of services run in AWS; deep knowledge enables resilient architecture and cost-effective scaling. | Senior |
| Automation (Ansible/Python) | Reduces manual toil, speeds up incident resolution, and enforces repeatable processes. | Senior |

# Interview Preparation
1. **Design a Datadog monitoring solution for a Kubernetes cluster. Which metrics would you collect and why?**
   *What the interviewer is looking for:* Understanding of observability pillars, ability to select key KPIs (CPU, memory, pod health, latency), and knowledge of alert thresholds.
2. **Walk me through troubleshooting a failing deployment in EKS, including logs, events, and rollback strategies.**
   *What the interviewer is looking for:* Systematic debugging approach, familiarity with `kubectl`, CloudWatch/Datadog logs, and safe rollback mechanisms.
3. **How would you automate infrastructure provisioning across AWS and Azure using IaC tools?**
   *What the interviewer is looking for:* Experience with multi-cloud IaC (Ansible, Terraform), handling provider differences, and ensuring idempotent deployments.
4. **Describe your on-call incident management process: prioritization, communication, and post-mortem creation.**
   *What the interviewer is looking for:* Ability to stay calm under pressure, clear stakeholder communication, and a data-driven post-mortem culture.
5. **Write a Python snippet that triggers a Jenkins job after a successful Ansible playbook run.**
   *What the interviewer is looking for:* Scripting proficiency, REST API usage, and integration of CI/CD pipelines.

# Resume Optimization
- Site Reliability Engineer
- Datadog
- Kubernetes
- AWS (EKS)
- Azure (AKS)
- Ansible
- Python
- Jenkins
- Incident Management
- Automation

# Application Strategy
When reaching out to the recruiter, send a concise email that begins with a friendly greeting, attaches your up-to-date resume, and clearly highlights your top relevant skills. Make sure to mention related skills you possess, such as Kubernetes, AWS automation, and Python scripting, and reference specific projects where you drove reliability improvements or led on-call rotations.

# Career Roadmap
| Current Role | Typical Experience | Core Focus | Next Position |
|--------------|-------------------|------------|---------------|
| Site Reliability Engineer | 5-8 years | Incident response, automation, cloud ops | Senior Site Reliability Engineer |
| Senior Site Reliability Engineer | 8-12 years | Architecture design, team mentorship, large-scale reliability | SRE Manager / Reliability Director |
| SRE Manager | 12+ years | Strategy, cross-team coordination, budgeting | Director of Reliability Engineering |
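
Returning to interview question 5 (triggering a Jenkins job after a successful Ansible playbook run), one possible answer is sketched below. It is a minimal illustration under stated assumptions, not a prescribed solution: the function names are hypothetical, the playbook is run through `subprocess`, and the job is started with a POST to Jenkins' `/job/<name>/build` endpoint using HTTP basic auth with a user API token. Depending on how a given Jenkins instance is configured, extra steps such as a CSRF crumb or build parameters may be needed.

```python
import base64
import subprocess
import urllib.parse
import urllib.request


def playbook_succeeded(command):
    """Run a command (normally ansible-playbook ...) and report whether it exited with 0."""
    result = subprocess.run(command, check=False)
    return result.returncode == 0


def build_jenkins_request(base_url, job_name, user, api_token):
    """Build, but do not send, a POST request to the Jenkins remote build endpoint."""
    url = f"{base_url.rstrip('/')}/job/{urllib.parse.quote(job_name)}/build"
    credentials = base64.b64encode(f"{user}:{api_token}".encode()).decode()
    return urllib.request.Request(
        url,
        data=b"",  # empty body; the request is sent as a POST
        method="POST",
        headers={"Authorization": f"Basic {credentials}"},
    )


def trigger_after_playbook(command, request, opener=urllib.request.urlopen):
    """Send the Jenkins request only if the playbook command succeeded.

    Returns the HTTP status of the Jenkins response, or None when the
    playbook failed and no request was sent. `opener` is injectable so the
    function can be exercised without a real Jenkins server.
    """
    if not playbook_succeeded(command):
        return None
    with opener(request, timeout=30) as response:
        return response.status
```

A call might look like `trigger_after_playbook(["ansible-playbook", "site.yml"], build_jenkins_request("https://jenkins.example.com", "deploy-app", "ci-user", token))`, where the URL, job name, user, and playbook file are placeholders. Separating request construction from sending keeps the credentials handling easy to test and makes the "only on success" gate explicit, which is the part interviewers tend to probe.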