We are looking for a skilled engineer to design and implement alarm and alerting systems for our cloud-based microservices architecture. You will be responsible for integrating observability platforms including New Relic, configuring intelligent alerting and anomaly detection, and ensuring system reliability through proactive monitoring and alarm tuning.
Key Responsibilities
- Monitoring & Observability Design:
- Architect and implement monitoring and alerting strategies for distributed microservices.
- Integrate New Relic (or similar APM tools) with all relevant services and infrastructure components.
- Provide observability across all microservices, including cron jobs and stored procedures.
- Ensure coverage across logs, metrics, and traces (the full observability stack).
- Alarm Configuration & Management:
- Define service-level indicators (SLIs), service-level objectives (SLOs), and error budgets.
- Create meaningful alarms and dashboards to detect anomalies and performance degradations.
- Reduce alert fatigue by keeping alerts low-noise and actionable.
- Incident Response & Automation:
- Automate responses to certain types of alerts using runbooks or scripts.
- Participate in on-call rotations and improve incident response processes.
- Conduct post-incident reviews to fine-tune alarm rules and improve system resilience.
- Collaboration:
- Work closely with developers, QA, and operations teams to ensure observability is integrated throughout the SDLC.
- Provide training and guidance on using monitoring tools and interpreting alerts.
- Tooling & Integrations:
- Manage integrations between New Relic and other systems (e.g., PagerDuty, Slack, Jira).
- Set up infrastructure monitoring (e.g., Kubernetes, Docker, databases, cloud services).
Required Qualifications
- 3–5+ years of experience in DevOps, SRE, or infrastructure roles supporting cloud-native microservices.
- Strong experience with observability platforms such as New Relic, Datadog, Prometheus/Grafana, or similar.
- Proficiency in designing and implementing SLIs, SLOs, and alerting strategies.
- Hands-on experience with container orchestration (Kubernetes), CI/CD, and cloud services (AWS, GCP, or Azure).
- Experience with infrastructure as code (Terraform, Helm, etc.).
- Strong scripting skills (e.g., Python, Bash) and experience automating alert responses.
We offer competitive contract rates and a pleasant working environment.
“Work hard in silence, let success be your noise.” (Frank Ocean)