This is a new role, created as part of the company's expansion strategy for its growing business. The role reports to the Lead SRE Manager and requires working closely with the application and infrastructure teams to ensure that all aspects of the production infrastructure are accurately monitored, reported and alerted on.
Job Description:
- Develop a highly scalable, mission-critical observability platform consisting of metrics, monitoring and logging systems
- Build quality dashboards that provide visibility and standards for key indicators to understand the health of the company's most critical systems
- Help engineers better understand their systems through distributed tracing
- Troubleshoot, diagnose and resolve performance and reliability issues affecting the Observability infrastructure
- Analyze logs and correlate them with metrics to identify the root cause of application issues and provide tuning recommendations
- Apply AI and ML capabilities to early anomaly detection (reducing mean time to issue identification), pattern analysis, self-healing, infrastructure resizing, noise reduction and outage prediction
- Develop visualizations in Azure Monitor and Looker, providing single-pane views of end-user experience, application, infrastructure and security
- Collaborate with the business teams to develop metrics measuring the performance against initiatives and report on those to stakeholders.
- Collaborate with the SRE and Application Integration teams to ensure there is a convergence of business, technical and security requirements
The successful applicant will have the following skills and qualifications:
- Diploma in Computer Science/Information Technology or equivalent
- Experience in building and developing highly scalable, critical observability platforms, with a good understanding of metrics, logs and traces, is preferred
- Experience in DevOps, preferably on GCP or Azure; candidates with strong AWS experience will also be considered
- Experience in coding or programming using Python/Java/Go
- Experience with distributed tracing and debugging tools such as Google Cloud Trace, Application Insights, Jaeger, Zipkin, etc.
- Familiarity with time series databases
- Experience with monitoring tools such as Datadog, Azure Monitor, Google Cloud's operations suite, Prometheus, etc., and visualisation tools such as Grafana
- Strong communication and presentation skills
- Ability to prioritise and work on multiple projects
- Ability to be a team player and work in environments where teamwork is critical
- Highly self-driven and motivated individual
If you are interested in the role and would like to discuss it further, please click Apply now or email email@example.com.
Only shortlisted candidates will receive a response. If you do not receive a response within 14 days, please accept this as notification that you have not been shortlisted.