Site Reliability Engineer Tech Lead

Full-Time
Salt Lake City, UT
Willis Towers Watson
Posted 4 years ago – Accepting applications

Job Description

Our engineering team has built the largest private Medicare marketplace in the country. We passionately focus on the continuous improvement of the systems we build and the culture we promote. We build a platform that provides the best possible support to our customers who are shopping for insurance, and where our insurance carriers can be confident that their products are accurately and impartially represented.

The Role

We are looking to grow our teams to include the Site Reliability Engineering discipline, and we need someone to lead that effort. We have spent many years growing and fostering a DevOps culture by bridging the divide between our Software and Infrastructure Engineering departments. We want the cross-functional teams we are building to include Site Reliability Engineers and need a tech lead to guide and curate our internal definition of success for this discipline. The ideal candidate will coach managers on how to find, retain, and manage Site Reliability Engineers (while remaining detached from a direct reporting relationship), and will coach engineers who want to specialize on how to learn and master this discipline. We operate in a complex multi-tenant hybrid cloud and on-premises infrastructure that spans both Windows and Linux. We strive for security, reliability, and automation in line with DevOps and Site Reliability Engineering principles. If you are passionate about helping team members grow and about promoting a culture of learning and improvement through metrics and automation while sharing the lessons learned, we want to hear from you!

Responsibilities:

Become familiar with the career aspirations of all current and aspiring Site Reliability Engineers, and assist in setting short- and long-term goals to support them in those pursuits
Lead our Site Reliability Engineering Community of Practice
Mentor Site Reliability Engineers and others in the organization on reliability, reducing toil, operating software at growing scale, reducing technical complexity and sprawl, and writing software and tooling to improve resilience and automate operations
Help interview, hire, and onboard high-quality job applicants
Conduct 1-on-1 meetings with all Site Reliability Engineers
Keep leadership well informed of Site Reliability Engineering direction and focus, and keep Site Reliability Engineers focused on goals
Ensure that Site Reliability Engineers across various teams are well informed of changes or status
Explore new ways of improving communication between teams
Promote inclusion and collaboration between various functional disciplines
Write and maintain architectural, stakeholder, and policy documentation
Encourage and inspire others to innovate
Look for new ways to improve our processes and the quality of our infrastructure, along with new ways to remediate production incidents more quickly and safely
Look for new ways to increase the velocity with which teams deliver, leveraging expertise from various functional disciplines
Define success and accountability for the Site Reliability Engineering discipline
Adhere to and advocate for best practices including Infrastructure as Code, monitoring, high availability, disaster recovery, security, and DevOps methodologies
Provide timely assistance and remediation solutions during critical situations and production incidents
Take ultimate responsibility for the success or failure of the Site Reliability Engineering discipline
Guide the culture and attitude of our Site Reliability Engineers toward an optimistic, proactive, and encouraging direction
Foster an environment where it is safe to fail and to learn from failure

The Requirements

10+ years of hands-on technical experience with many of the following technologies; at least 50% of the day-to-day work will be focused in this area:
Windows and Linux Servers
VMware
Cloud platforms, preferably with Azure
Active Directory
Secrets management with Consul and Vault or similar systems
Configuration management tools like Salt and Terraform
Firewalls and load balancers such as F5
Web servers including IIS, NGINX, and Tomcat
Application Performance Monitoring with tools like New Relic
Infrastructure monitoring with tools like Sensu, SolarWinds, or Nagios
Continuous Integration and Continuous Delivery with tools like TeamCity, Octopus Deploy, Concourse, or Azure DevOps
Log Aggregation tools like SumoLogic or Splunk
Network theory and protocols such as DNS, DHCP, Proxy Servers, and Firewalls
Security operations with tools for SAST, DAST, RASP, and WAF
Proficiency, high comfort, and familiarity with:
Three or more programming languages, such as C#, JavaScript, Python or Go
One or more scripting languages, such as PowerShell or Bash
Command line interfaces
Git
Bachelor’s Degree strongly preferred; HS Diploma required

EOE, including disability/vets
