Senior Site Reliability Engineer - PSRE | Remote job opportunities

Company Overview

Arcesium is a global financial technology firm that solves complex data-driven challenges faced by sophisticated financial institutions. We continuously innovate our platform to meet future challenges, anticipate risks, and design advanced solutions that help our clients achieve transformational outcomes. Our commitment to intellectual curiosity, proactive ownership, and collaboration empowers you to contribute significantly from day one, enhancing your professional development.

About the Role and the Team

We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to join our Platform Site Reliability Engineering (PSRE) team. This team plays a vital role in ensuring the reliability and availability of mission-critical applications. As an SRE on this team, you will be responsible for observability, monitoring, incident management, and improving overall system stability. In this high-impact role, you will work under tight timelines and must be quick-thinking and proactive.

Key Responsibilities

Incident Management: Act as the primary contact for critical incidents impacting our platform during New York business hours, ensuring effective communication and swift resolution.
Proactive Monitoring: Continuously monitor application health and performance, identifying risks and implementing measures to enhance reliability.
Troubleshooting: Diagnose complex technical issues across the layers of the stack, using analytical skills to identify root causes and solutions.
Collaboration: Work closely with various teams to ensure seamless incident response and proactive reliability initiatives.
Automation: Identify opportunities to automate tasks, improve system resilience, and build tools that streamline processes.
Continuous Improvement: Contribute to the enhancement of SRE practices, tools, and processes, fostering a learning culture within the team.

What We're Looking For

Up to 5 years of experience in SRE, DevOps, or Production Engineering, with an understanding of the principles underlying these disciplines.
Expertise in incident management and resolution of high-severity outages.
Proficiency in at least one programming language (Python or Java) for automation.
Hands-on experience with Kubernetes and cloud services (AWS preferred).
Excellent communication skills and strong problem-solving abilities under pressure.
Fluency in English and legal eligibility to work in the country are required.

Nice-to-Have Skills

Experience with Terraform or CloudFormation.
Familiarity with monitoring tools (e.g., Datadog, Grafana).
Exposure to CI/CD pipelines and web application architectures.

Why Join Us?

This role directly impacts business-critical operations. If you thrive under pressure and enjoy a high-stakes environment, this is the place for you. Ready to make an impact? Apply now!