Site Reliability Engineer
Our client is a top technology organization that provides various platforms and innovative solutions to its clients through large data centers. They are looking for an experienced Site Reliability Engineer to join their team.
RESPONSIBILITIES AND DUTIES:
- Dive deep into development lines, learn and understand the mechanisms of every application component, and promote product scalability, stability and performance.
- Manage and maintain product applications and services.
- Support and influence improvements to product applications to enhance the stability and availability of the systems
- Solve key problems that may arise in the production systems and create solutions to prevent incidents from recurring
- Analyze patterns of production incidents and set up appropriate alerting/monitoring mechanisms in the system to catch issues early.
- Create a release template to ensure that the architecture and testing efforts performed during service development are sufficient to support availability and performance SLAs.
- Test playbooks and scripts in lower environments to ensure their accuracy and completeness.
- Provide and test runbooks and healing/corrective automation scripts for restoring services based on runtime tools.
- Work with the development team to fully integrate services with the application monitoring system
- Improve application stability & operational efficiency by developing scripts to automate tasks.
REQUIREMENTS:
- Bachelor's degree in Computer Science, Engineering, MIS or a related field
- 4+ years of relevant technology experience
- Experience working with C++, Go, Java, or Python for scaling systems and services
- Good track record of working in GCP or AWS environments
- Prior SRE responsibilities on previous projects, e.g. designing, delivering and managing large-scale platforms
- Good knowledge of and working experience with application monitoring systems (including AppDynamics)
- Hands-on knowledge of Linux operating systems (Ubuntu, CentOS, etc.)
- Knowledge of computer networking (TCP/IP, DNS, etc.)
- Experience with the Incident Management process and the ability to resolve Level 1 issues within the agreed organization SLA.
If you feel you have the right skills and experience for the role, kindly submit your updated CV in Word format to firstname.lastname@example.org
Referrals are greatly appreciated.
EA Licence No: 11C5502
EA Registration Number: R1877789