Site Reliability Engineering Manager at Nordhealth #vacancy #remote

Who are we?

Nordhealth’s mission is to build software that improves the daily lives of healthcare professionals. We build software that empowers veterinary and therapy professionals to provide the best possible care experiences to their patients. Our products are used daily by over 50,000 professionals in clinics and hospitals across 30+ countries. We bring 20+ years of experience in healthcare and veterinary software.

We understand that talent comes from everywhere and anywhere. The greater our diversity, the better the products we deliver. That’s why we are a remote-first company, headquartered in Helsinki, Finland, with all 400+ employees working either remotely or from collaboration hubs. While our market presence is currently strongest in the Nordics, our customer base is rapidly growing in our other markets too, especially in Europe and North America (more on our website, nordhealth.com).

About the role

We are seeking a dedicated individual for a role that centers around Provet Cloud, our cloud-based veterinary practice management software. Provet Cloud is designed to help veterinary practitioners save time so they can devote more attention to caring for their patients, and to make managing a veterinary practice simpler and more efficient. It offers features for appointment scheduling, electronic medical records, inventory management, billing, and communication within the veterinary team.

The Site Reliability Engineering (SRE) Manager at our company plays a critical role in ensuring our systems’ reliability, performance, and scalability. The primary purpose of the SRE Manager is to bridge the gap between development and operations by applying software engineering principles to infrastructure and operational challenges.

This role also includes mentoring, planning the automation of our infrastructure to accommodate higher loads from increased usage, and monitoring cloud hosting costs to keep them at an appropriate level as our user base expands. The SRE team’s focus on automation, monitoring, and proactive maintenance helps us meet the demands of our expanding user base while ensuring that our services remain consistently available and performant.

This is a unique opportunity to join our team and contribute to enhancing the efficiency and simplicity of veterinary practice management through Provet Cloud!

Your key responsibilities include:

Lead, mentor, and support the SRE team members.
Oversee the monitoring, alerting, and troubleshooting of system issues.
Ensure high availability and reliability of production systems and services.
Coordinate response to system incidents and outages.
Perform post-incident reviews and ensure effective incident resolution and follow-up actions.
Manage and optimize the infrastructure, ensuring it meets current and future requirements.
Identify opportunities for automation to improve system reliability and operational efficiency.
Work closely with development, operations, and product teams to integrate reliability into the software development lifecycle.
Communicate effectively with stakeholders about system performance, incidents, and project status.
Define and track key performance indicators (KPIs) to measure system reliability and team performance.
Ensure systems adhere to security policies and compliance requirements.

What will help you to be successful in this role?

Ideally, you already have some experience working in a fast-growing, global SaaS company.

Success factors and key challenges of the role:

Maintaining high availability while simultaneously optimizing costs is crucial for the SRE Manager role. This involves balancing the need for reliability with cost-effectiveness to ensure efficient operations.
Keeping infrastructure maintained and updated with minimal downtime is essential, ideally with no noticeable interruptions for our clients and users. This requires careful planning and execution to minimize disruptions while making necessary changes.
Effective resource planning in a rapidly changing environment is critical to avoid overprovisioning while still meeting increasing demands. This involves staying proactive and adaptable to ensure resources are utilized optimally.
Continuous review and improvement of disaster recovery plans and procedures are necessary to mitigate potential risks effectively. Regular testing and updates are vital to ensure readiness for any unforeseen events.
Quick analysis and mitigation of any issues or incidents are essential, along with a clear plan for permanent resolution. This includes identifying root causes and implementing corrective measures to prevent recurrence.

Critical Knowledge and Experience:

Proficiency in AWS, Azure, or Google Cloud, and infrastructure as code (IaC) tools like Terraform.
Experience with monitoring tools like Prometheus or Grafana for real-time monitoring and alerting.
Experience in managing and responding to system incidents and outages.
Proven experience leading and managing an SRE or DevOps team.
Ability to prioritize tasks and manage multiple projects simultaneously.
Experience in planning and executing projects, including resource management and timeline adherence.
Experience working closely with cross-functional teams, including development, operations, and product teams.

Having one or more of these skills will help in succeeding in this role:

Focus on automating processes to improve efficiency and reduce manual intervention.
Ability to use data and metrics to drive decisions and improvements.
Understanding of security best practices and compliance requirements.
Experience in performance tuning and capacity planning.

What’s in it for you?

At Nordhealth, we do things a little bit differently. We value continuous improvement, diverse teams, and autonomy, which drive our collaboration. Our global healthcare domain is rapidly developing, and we are seeking colleagues who enjoy working in this type of environment.

In addition, we offer:

The chance to work in a meaningful industry and in a fast-growing, global company on a path to changing digital healthcare
Competitive compensation and benefits
Learning and professional growth opportunities
The tools you need and enjoy using
Frequent company events and talented colleagues from around the world

If you enjoy working in a fast-growing and international environment with the possibility to make an impact, this might be the perfect job for you. Apply now! We’ll fill the position as soon as we find the right person.

DevOps Prometheus Google Cloud Platform (GCP) Terraform Amazon Web Services (AWS) Azure Grafana Site Reliability Engineering (SRE)
