Company: isgSearch
Location: Toronto, ON
Expected salary:
Job date: Sun, 08 Dec 2024 02:25:48 GMT
Job description: Sr. Reliability Engineer – Fully Remote in Canada
Requirements:
- Expertise in incident management and resolving live production issues.
- Strong troubleshooting skills, with a focus on performance optimization for large-scale applications.
- Proven experience in developing and maintaining reliable monitoring and alerting systems in high-demand environments.
- 7+ years of experience with the .NET Framework (C#), ensuring production stability.
- Proficiency in Kubernetes, Docker, and cloud platforms (GCP preferred).
- Experience with monitoring tools like Prometheus, Grafana, and Kibana.
- Familiarity with incident management tools such as FreshDesk and Confluence.
- Strong critical thinking and problem-solving abilities.
- Solid project management skills with a focus on scalability and system reliability.
- VMWare
Our client is a leading fintech company with a strong presence across Canada, driving innovation in financial services.
Responsibilities:
- Operational Support: Provide live support for client applications, monitor services to detect critical failures, and ensure fast recovery with minimal downtime.
- Incident Resolution: Lead the response to production issues, ensuring resolution within SLA and SLO timelines. Conduct root cause analysis and implement permanent solutions.
- Monitoring & Reporting: Enhance monitoring systems and alerting mechanisms to proactively detect issues. Prepare data-driven reports to present findings clearly.
- System Stability & Scalability: Offer expert guidance on improving system stability and scalability across production environments.
- Process Automation: Drive initiatives to automate operational processes, improving efficiency across LiveOps.
- Postmortem & Continuous Improvement: Lead postmortem meetings, documenting findings and action items for future prevention.
- Cross-Functional Collaboration: Work with engineering teams to quickly resolve issues and implement long-term fixes.
- Team Leadership & Mentorship: Guide and mentor junior reliability engineers, ensuring high standards are maintained.
- On-Call Support: Participate in after-hours on-call rotation for production support.