Site Reliability Engineer Tech Lead

  • Full-Time
  • Salt Lake City, UT
  • Willis Towers Watson
  • Posted 3 years ago – Accepting applications
Job Description

Our engineering team has built the largest private Medicare marketplace in the country. We passionately focus on the continuous improvement of the systems we build and the culture we promote. We build a platform that provides the best possible support to our customers who are shopping for insurance, and where our insurance carriers can be confident that their products are accurately and impartially represented.

The Role

We want to grow our teams to include the Site Reliability Engineering discipline, and we need someone to lead that effort. We have spent many years fostering a DevOps culture by bridging the divide between our Software and Infrastructure Engineering departments. We want the cross-functional teams we are building to include Site Reliability Engineers, and we need a tech lead to guide and curate our internal definition of success for this discipline. The ideal candidate will coach managers on how to find, retain, and manage Site Reliability Engineers (while remaining detached from a direct reporting relationship), and will coach engineers who want to specialize on how to learn and master this discipline. We operate in a complex, multi-tenant, hybrid cloud and on-premises infrastructure that spans both Windows and Linux. We strive for security, reliability, and automation in line with DevOps and Site Reliability Engineering principles. If you are passionate about helping team members grow, and about promoting a culture of learning and improvement through metrics and automation while sharing the lessons learned, we want to hear from you!

Responsibilities:

  • Become familiar with the career aspirations of all current and aspiring Site Reliability Engineers, and assist in setting short- and long-term goals to support them in those pursuits
  • Lead our Site Reliability Engineering Community of Practice
  • Mentor Site Reliability Engineers and others in the organization on reliability, reducing toil, operating software at growing scale, reducing technical complexity and sprawl, and writing software and tooling to improve resilience and automate operations
  • Help interview, hire, and onboard high-quality job applicants
  • Conduct 1-on-1 meetings with all Site Reliability Engineers
  • Keep leadership well informed of Site Reliability Engineering direction and focus, and keep Site Reliability Engineers focused on goals
  • Ensure that Site Reliability Engineers across various teams are well informed of changes or status
  • Explore new ways of improving communication between teams
  • Promote inclusion and collaboration between various functional disciplines
  • Write and maintain architectural, stakeholder, and policy documentation
  • Encourage and inspire others to innovate
  • Look for new ways to improve our processes and the quality of our infrastructure, along with new ways to remediate production incidents more quickly and safely
  • Look for new ways to increase the velocity with which teams deliver, leveraging expertise from various functional disciplines
  • Define success and accountability for the Site Reliability Engineering discipline
  • Adhere to and advocate for best practices including Infrastructure as Code, monitoring, high availability, disaster recovery, security, and DevOps methodologies
  • Provide timely assistance and remediation solutions during critical situations and production incidents
  • Take ultimate responsibility for the success or failure of the Site Reliability Engineering discipline
  • Guide the culture and attitude of our Site Reliability Engineers toward an optimistic, proactive, and encouraging direction
  • Foster an environment where it is safe to fail and to learn from failure

The Requirements

  • 10+ years of hands-on technical experience with many of the following technologies; at least 50% of day-to-day work will focus on this area:
      • Windows and Linux servers
      • VMware
      • Cloud platforms, preferably Azure
      • Active Directory
      • Secrets management with Consul and Vault or similar systems
      • Configuration management tools like Salt and Terraform
      • Firewalls and load balancers such as F5
      • Web servers including IIS, NGINX, and Tomcat
      • Application Performance Monitoring with tools like New Relic
      • Infrastructure monitoring with tools like Sensu, SolarWinds, or Nagios
      • Continuous Integration and Continuous Delivery with tools like TeamCity, Octopus Deploy, Concourse, or Azure DevOps
      • Log aggregation tools like SumoLogic or Splunk
      • Network theory, protocols, and components such as DNS, DHCP, proxy servers, and firewalls
      • Security operations with tools for SAST, DAST, RAST, and WAF
  • Proficiency and a high level of comfort with:
      • Three or more programming languages, such as C#, JavaScript, Python, or Go
      • One or more scripting languages, such as PowerShell or Bash
      • Command line interfaces
      • Git
  • Bachelor’s Degree strongly preferred; HS Diploma required


EOE, including disability/vets
