Site Reliability Engineer
Company: ITR
Location: Oak Ridge
Posted on: February 14, 2026
Job Description:
Site Reliability Engineer, HPC Infrastructure and Platforms

This position may require candidates to be able to obtain a federal
security clearance, so United States citizenship is required.

Overview:
Seeking highly qualified individuals to play a key role in improving
the security, performance, and reliability of the computing
infrastructure that supports multiple highly ranked Top500
supercomputers, including the world’s first exaflop system, Frontier.

The Team:
As a Site Reliability Engineer, you will work within the HPC
Infrastructure and Platforms group to support all activities of our
supercomputer center. Our primary platform is the OLCF Slate Service,
built on Kubernetes and Red Hat OpenShift, which provides a container
orchestration service for running critical operational applications
and user-managed persistent applications that run alongside our OLCF
supercomputer systems and other OLCF-managed HPC clusters.

Major Duties/Responsibilities:
Lead ongoing improvements in reliability and scalability for our
Kubernetes- and Linux-based applications and services. Contribute as
a technical resource to define and implement best practices and
standards for the center. Provide primary operational support and
engineering for production applications. Define and implement KPIs
and processes, and drive continuous improvement. Influence the
architecture and implementation of solutions. Tune operating systems
and applications to increase performance and reliability of services.
Diagnose system operational problems quickly and effectively.
Participate in an on-call rotation providing 24-hour, 7-day support
and off-hours maintenance windows. Coordinate with vendors to resolve
hardware and software problems. Deliver ORNL’s mission by aligning
behaviors, priorities, and interactions with our core values of
Impact, Integrity, Teamwork, Safety, and Service. Promote diversity,
equity, inclusion, and accessibility by fostering a respectful
workplace – in how we treat one another, work together, and measure
success.

Basic Qualifications:
Bachelor’s degree in computer science or a closely related field and
a minimum of 3 years of experience as an SRE/Systems Engineer. An
equivalent combination of education and experience may be considered.

Preferred Qualifications:
Excellent interpersonal and communication skills, and the ability to
work as part of a team. Strong working knowledge of Unix system
fundamentals and common network protocols. Experience managing
Linux/UNIX operating systems in a heterogeneous environment. Solid
understanding of networked computing environment concepts. Excellent
understanding of networking, particularly Linux and Kubernetes
networking. Experience with instrumenting bare-metal and VMware
infrastructure. Ability to develop and maintain programs and scripts
that aid in operation and automation using various shells (primarily
bash) and high-level languages (Python or Go). Ability to proactively
identify performance issues, problems, and areas for improvement.
Ability to identify requirements and to define, plan, and implement
requisite solutions. Ability to plan, organize, and prioritize tasks,
and to complete assigned projects with minimal supervision.
Experience with continuous integration and continuous deployment
software methodologies and how they apply to SRE/systems engineering.
An understanding of code review and familiarity with tools like
GitHub and GitLab. Experience using tools such as Nagios, Grafana,
and Prometheus to monitor systems, collect metrics, and create
dashboards. Experience designing and implementing highly available
systems/services utilizing virtual machines and Kubernetes resources.
Experience participating in an open-source community with patches
accepted upstream. Experience deploying and maintaining automated
configuration management software such as Puppet or Ansible.
Experience implementing systems-level security technologies like
SELinux and following security best practices.

Special Requirement:
This position requires the ability to obtain and maintain a clearance
from the Department of Energy. As such, this position is a Workplace
Substance Abuse Program (WSAP) testing-designated position, which
requires passing a pre-placement drug test and participation in an
ongoing random drug testing program in which employees are subject to
being randomly selected for testing. The occupant of this position
will also be subject to an ongoing requirement to report to ORNL any
drug-related arrest or conviction or receipt of a positive drug test
result.
Keywords: ITR, Cleveland, Site Reliability Engineer, IT / Software / Systems, Oak Ridge, Tennessee