Chennai

Posted 2 weeks ago

Rheo is an intelligent industrial AI platform that utilizes sensors and machine learning to optimize operational processes. 

Rheo fosters harmony between people and technology through a data-led focus and transparency, turning manufacturing and operations teams into a cohesive, high-performing unit. At Rheo, we apply the same principles we advocate to our customers, building effective lean solutions.


Site Reliability Engineer

Full-time | Senior level | Chennai, Tamil Nadu, India | Hybrid work culture

REQUIREMENTS

  • Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent work experience).
  • At least 2 years of proven experience as a DevOps Engineer, Site Reliability Engineer, or in a similar role.
  • Strong hands-on experience with infrastructure-as-code tools like Terraform, configuration management tools like Ansible, and version control systems like Git.
  • Proficiency in scripting languages such as Python, Bash, or Ruby for automation tasks.
  • In-depth knowledge of CI/CD concepts and experience with CI/CD tools such as Jenkins, GitLab CI/CD, CircleCI, or GitHub Actions.
  • Extensive experience working with cloud platforms like AWS, Azure, or GCP.
  • Solid understanding of containerization technologies such as Docker and container orchestration tools like Kubernetes.
  • Familiarity with monitoring and logging solutions such as Prometheus, Grafana, and the ELK stack.
  • Excellent problem-solving skills and the ability to troubleshoot complex issues across different technology stacks.
  • Strong communication and interpersonal skills to effectively collaborate with cross-functional teams. 

WHAT YOU WILL DO

1. AWS Cloud Maintenance:

  • Maintain and optimize AWS Cloud infrastructure to ensure scalability, reliability, and performance.
  • Monitor AWS resources and services to identify and rectify potential issues before they impact the system.

2. Kubernetes Management:

  • Manage and maintain Kubernetes clusters, ensuring high availability and performance.
  • Implement best practices for container orchestration and scaling.

3. Incident Response:

  • Participate in an on-call rotation to provide 24/7 support and respond to critical incidents promptly.
  • Collaborate with cross-functional teams to troubleshoot and resolve system issues efficiently.

4. Bug Tracking and Resolution:

  • Identify and document software and infrastructure bugs, working closely with development teams to prioritize and resolve them.
  • Continuously improve monitoring and alerting systems to proactively detect issues.

5. Performance Optimization:

  • Analyze system performance and implement optimizations to enhance reliability and reduce downtime.

6. Automation:

  • Develop and maintain automation scripts and tools for provisioning, deployment, and monitoring.

7. Documentation:

  • Create and update documentation for systems, processes, and incident response procedures.

8. Security and Compliance:

  • Ensure security best practices are followed and participate in security audits and compliance initiatives.

Job Category: Engineering
