Site Reliability Engineer at Stacklet #vacancy #remote

Stacklet helps organizations with large cloud estates optimize cost, improve security, and ensure compliance by simplifying and automating all aspects of governance via code. Our company was founded by the creators and maintainers of CNCF’s Cloud Custodian, an open-source project used today by thousands of well-known global brands.

Our Stacklet Platform is an award-winning governance-as-code solution that enables teams to identify and remediate cloud governance issues while establishing preventative guardrails against their recurrence. Trusted by some of the world’s largest cloud service consumers, Stacklet Platform helps mitigate cloud waste and risk at scale.

Overview

Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other Stacklet production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply sound engineering principles, operational discipline, and mature automation to our environments and the Stacklet codebases.

Role description

Be on a PagerDuty rotation to respond to availability incidents and provide support for service engineers with customer incidents.
Use your on-call shift to prevent incidents from ever happening.
Make monitoring and alerting fire on symptoms rather than on outages.
Document every action so your findings turn into repeatable actions, and then into automation.
Improve the deployment process to make it as boring as possible.
Design, build, and maintain core infrastructure pieces that allow successful scaling of the Stacklet platform.
Debug production issues across services and levels of the stack.
Partner with development teams to improve services through rigorous testing and release procedures.
Participate in system design consulting, platform management, and capacity planning.
Create sustainable systems and services through automation and uplifts.
Balance feature development speed and reliability with well-defined service level objectives.

What we’re looking for

Here’s what we’re looking for in an ideal candidate.

A mind for systems – edge cases, failure modes, behaviors, specific implementations.
You know your way around Linux and the Unix Shell.
You have strong programming skills in Python and/or Go.
You collaborate and communicate asynchronously.
You document all the things so you don’t need to learn the same thing twice.
You have an enthusiastic, go-for-it attitude. When you see something broken, you can’t help but fix it.
You love delivering quickly and iterating fast.

What you can expect

Here’s a little bit about us, so you know what to expect if you join us.

100% remote – Slack and Google Meet for communication
Company laptop for development
Home office budget
GitHub and Jira, lightweight agile process
AWS training and certification opportunities
Talented and experienced team, friendly culture
Travel 2 to 4 times a year for internal and external events
Career growth with opportunities for advancement
Equity compensation, excellent health benefits, generous time-off policy

Stacklet believes a diverse workforce enhances our ability to deliver world-class products and services. We are committed to ensuring equal employment opportunities to all qualified individuals. Qualified applicants will receive consideration for employment without regard to their race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status.
