Chennai
Posted 2 weeks ago
Site Reliability Engineer
Full-time | Senior level | Chennai, Tamil Nadu, India | Hybrid Work Culture
REQUIREMENTS
- Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent work experience).
- At least 2 years of proven experience as a DevOps Engineer, Site Reliability Engineer, or in a similar role.
- Strong hands-on experience with infrastructure-as-code tools like Terraform, configuration management tools like Ansible, and version control systems like Git.
- Proficiency in scripting languages such as Python, Bash, or Ruby for automation tasks.
- In-depth knowledge of CI/CD concepts and experience with CI/CD tools like Jenkins, GitLab CI/CD, CircleCI, or GitHub Actions.
- Extensive experience working with cloud platforms like AWS, Azure, or GCP.
- Solid understanding of containerization technologies such as Docker and container orchestration tools like Kubernetes.
- Familiarity with monitoring and logging solutions such as Prometheus, Grafana, and the ELK stack.
- Excellent problem-solving skills and the ability to troubleshoot complex issues across different technology stacks.
- Strong communication and interpersonal skills to effectively collaborate with cross-functional teams.
WHAT YOU WILL DO
1. AWS Cloud Maintenance:
- Maintain and optimize AWS Cloud infrastructure to ensure scalability, reliability, and performance.
- Monitor AWS resources and services to identify and rectify potential issues before they impact the system.
2. Kubernetes Management:
- Manage and maintain Kubernetes clusters, ensuring high availability and performance.
- Implement best practices for container orchestration and scaling.
3. Incident Response:
- Participate in an on-call rotation to provide 24/7 support and respond to critical incidents promptly.
- Collaborate with cross-functional teams to troubleshoot and resolve system issues efficiently.
4. Bug Tracking and Resolution:
- Identify and document software and infrastructure bugs, working closely with development teams to prioritize and resolve them.
- Continuously improve monitoring and alerting systems to proactively detect issues.
5. Performance Optimization:
- Analyze system performance and implement optimizations to enhance reliability and reduce downtime.
6. Automation:
- Develop and maintain automation scripts and tools for provisioning, deployment, and monitoring.
7. Documentation:
- Create and update documentation for systems, processes, and incident response procedures.
8. Security and Compliance:
- Ensure security best practices are followed and participate in security audits and compliance initiatives.