At BPTN (Black Professionals in Tech Network), we’re pushing the future of tech forward by creating a space for Black professionals in tech to gather, grow and evolve, all while being a conduit for companies to engage this talent across North America.
We’re here to help Black professionals network, connect with one another, share resources and grow their careers. Our rapidly growing network counts over 50,000 Black professionals. We provide our members with access to mentorship, skill-building opportunities, and a strong peer network to support professional growth and advancement.
Our client is looking for an Intermediate System Reliability Engineer who will work cross-functionally with a variety of teams and contribute to every significant engineering service or solution delivered to the Global Systems Reliability Office and its stakeholders. You will also work directly with the Software Engineering teams to maintain and operate the company’s existing technology and to build the next generation of technologies.
Responsibilities
- Work in collaboration with the Global System Reliability Engineering team, as well as with software development, Quality, Product and Data Engineering teams, to champion SRE/DevOps culture and practices
- Lead and collaborate with a team of Reliability Engineers (directly and through local and global communities of practice)
- Work closely with global and regional architecture boards to champion the definition and implementation of resilience policies for new and existing solutions
- Lead management of Service Level Objectives with senior development and business leads
- Lead initiatives to continuously refine our build, plan and deploy practices for improved stability, reliability, efficiency, repeatability and security. You’ll create plans and collaborate with other SROs and DevOps team members, coordinating activity with development and business leads to increase service levels, lower costs, and support delivery velocity objectives
- Work closely with development and operations teams to lead troubleshooting of our most severe incidents, including leading senior stakeholder communication and driving problem-solving (e.g., log analysis, non-invasive tests) and debugging with best-practice techniques
- Lead continuous improvement and execution of timely, high-quality root cause analysis and blameless post-mortems for major incidents, so that we take action to avoid similar problems in the future
- Lead prioritization of reliability features and contribute to the design, development and delivery of effective tooling, alerts, and automated responses to identify and address reliability risks
Qualifications
- 5+ years' experience in IT
- Strong understanding of SRE
- Degree in Computer Science, Engineering, or equivalent
- Excellent verbal and written communication. The ability to communicate confidently and clearly at all levels of the organization (on conference calls, in meetings, via email, etc.) is essential
- Performance- and results-oriented leadership skills, with a developmental (coaching) bias
- Experience working with large-scale distributed systems
- Experience using Jenkins, Bamboo or other CI tools
- Experience with GCP/Azure/AWS
- Experience working in an Agile environment
- Deep understanding of containerization and orchestration
- Experience with monitoring/observability tooling such as Dynatrace, Datadog, Splunk, Elastic Stack, Prometheus, Jaeger or OpenTelemetry
- Experience in at least one high-level programming language such as Python or Go
- Knowledge of R is a plus
- Experience with configuration management using Ansible, Terraform, Puppet, Chef, or similar
Location
- Toronto, ON