Senior AI Observability Engineer (SRE) at SAP #vacancy #remote

Bring out your best

SAP innovations help more than four hundred thousand customers worldwide work together more efficiently and use business insight more effectively. Originally known for leadership in enterprise resource planning (ERP) software, SAP has evolved to become a market leader in end-to-end business application software and related services for database, analytics, intelligent technologies, and experience management. As a cloud company with two hundred million users and more than one hundred thousand employees worldwide, we are purpose-driven and future-focused, with a highly collaborative team ethic and commitment to personal development. Whether connecting global industries, people, or platforms, we help ensure every challenge gets the solution it deserves. At SAP, you can bring out your best.   

 

 

 

We are looking for a Senior AI Observability Engineer (SRE) focusing on both the soft and physical layers of our global operations.

 

About the Role:
You will join a global, multidisciplinary SRE team of DevOps engineers, contributing to the development of AI solutions that power a stack of diverse observability services using Machine Learning and Large Language Models. This role involves reshaping how we manage alerts, metrics, and logs by introducing deep learning and NLP to enhance reliability services. You will also support troubleshooting during major incidents related to our global cloud infrastructure, ensuring excellence in triage and resolution. You will help the team improve critical KPIs such as MTTD/MTTR, signal-to-noise ratio, and other relevant metrics using these advanced methods.

Expectations and Tasks:

  • Collaborate with engineering and product management following Agile Methodologies such as SCRUM.
  • Prioritize and deliver high-quality developments under time constraints.
  • Ensure smooth operations and maximize uptime of the services we are responsible for.
  • Participate in on-call rotational coverage, including weekends and holidays, with compensation as per local policies. The team works in a global follow-the-sun model with local daytime coverage.
  • Share knowledge across the team.
  • Work on data analysis & generation.
  • Support AI research & development projects.
  • Train and fine-tune AI Models.

Required Skills:

  • Fast adoption of cutting-edge technologies.
  • Advanced analytical and problem-solving mindset.
  • Strong team player with excellent communication skills.
  • Self-starter who acts with a sense of urgency to quickly move issues forward efficiently and effectively.
  • Fluent in spoken & written English.

Required Experience:

  • Development:
    • 4+ years of experience in professional or enterprise development.
    • Strong knowledge of the Python and JavaScript programming languages.
    • Proven experience in REST API implementation using Flask or FastAPI.
    • Experience in microservice-based development.
  • DevOps:
    • Understanding of CI/CD pipelines using Azure, Jenkins, Travis, or similar.
    • Hands-on experience with Docker containers & Kubernetes.
    • Experience working with public cloud environments such as GCP/AWS/Azure.
    • Solid understanding of JSON, YAML, & GitHub.
    • Solid understanding of Enterprise/Service Provider Data Center Architecture.
    • Strong familiarity with Enterprise-class Fault Monitoring and Performance Management tools.
  • Artificial Intelligence:
    • Experience with ML frameworks like PyTorch, TensorFlow, or similar.
    • Knowledge of Prompt Engineering, Large Language Models, RAG, and Embeddings.
    • Good understanding of supervised and unsupervised Machine Learning models.
    • Good understanding of algorithms, data structures & data patterns.

 

Preferred:

  • Knowledge of knowledge graphs, graph databases, and graph theory.
  • Experience with Elasticsearch, Splunk, or similar.
  • Experience in web development frameworks.
  • Familiarity with Terraform, Helm charts, Ansible, or similar tools.
  • Knowledge of Kubeflow, MLflow, Dataflow, or similar technologies.

 

Education:

  • Bachelor’s or equivalent education in Software Engineering, Computer Science, or a related field.
  • Industry technical certifications (CKA, Elastic Certified Engineer, RHCE, CCNA, AZ-900, etc.) and ITIL-related courseware are a plus.

 

 

Meet the Team:

Global Cloud Infrastructure & Delivery (GCID) develops and delivers services for cloud infrastructure and cloud operations to SAP Lines of Business (LoB) and through them, our external customers. We support LoBs and their customers’ cloud adoption journey through four hyperscaler public clouds and SAP’s Infrastructure-as-a-Service.

Service Reliability Engineering (SRE) is a team within the GCID organization. It helps ensure the reliability and availability of SAP cloud services (internal or external) by developing and enhancing observability tools that help to either prevent or isolate an incident. SREs proactively help automate and optimize processes. The SRE team runs globally in a follow-the-sun model.

 

 

We win with inclusion

SAP’s culture of inclusion, focus on health and well-being, and flexible working models help ensure that everyone – regardless of background – feels included and can run at their best. At SAP, we believe we are made stronger by the unique capabilities and qualities that each person brings to our company, and we invest in our employees to inspire confidence and help everyone realize their full potential. We ultimately believe in unleashing all talent and creating a better and more equitable world.
SAP is proud to be an equal opportunity workplace and is an affirmative action employer. We are committed to the values of Equal Employment Opportunity and provide accessibility accommodations to applicants with physical and/or mental disabilities. If you are interested in applying for employment with SAP and are in need of accommodation or special assistance to navigate our website or to complete your application, please send an e-mail with your request to Recruiting Operations Team:
For SAP employees: Only permanent roles are eligible for the SAP Employee Referral Program, according to the eligibility rules set in the SAP Referral Policy. Specific conditions may apply for roles in Vocational Training.

 

EOE AA M/F/Vet/Disability:

Qualified applicants will receive consideration for employment without regard to their age, race, religion, national origin, ethnicity, gender (including pregnancy, childbirth, et al.), sexual orientation, gender identity or expression, protected veteran status, or disability.
Successful candidates might be required to undergo a background verification with an external vendor.

 

Requisition ID: 393759 | Work Area: Software-Development Operations | Expected Travel: 0 – 10% | Career Status: Professional | Employment Type: Regular Full Time | Additional Locations: #LI-Hybrid.
