We are looking for candidates who are passionate about building and maintaining enterprise network infrastructure, take pride in what they build, learn quickly, and love working in a challenging, fast-paced environment. As part of a start-up competing in a world-scale market, you will be expected to work hard and will have the opportunity to manage services used by millions of people.
The ideal candidate has strong troubleshooting skills and can resolve complex technical issues by identifying the root cause and making the changes necessary to prevent recurrence.
- Install, configure, build and troubleshoot our production servers and services.
- Maintain 100% uptime of the production services.
- Optimize / tune our servers and services for performance, scalability, and maintainability.
- Ensure that our monitoring tools catch and generate alerts on all production issues.
- Resolve issues reported by our monitoring tools, including following through on long-term issues.
- Follow escalation process through issue completion, including providing documentation after resolution.
- Supervise the junior system administrators to ensure that they are following procedures and completing tasks successfully.
- Perform root cause analysis of production issues and provide a report with recommendations for identifying future issues more quickly and for preventing future failures entirely, whether through process or technology improvements.
- Send periodic NOC reports to managers with the system and service status.
- Manage backups and disaster recovery, including backup monitoring and verification, and leading restoration tests and disaster recovery drills.
- Act as a technical escalation point during your shift.
Required skills and experience:
- 6+ years of experience supporting a real-time, 24×7 production web environment.
- Strong written and verbal communication skills; ability to organize and prioritize tasks.
- Experience training and mentoring junior team members, and working with other departments to solve cross-departmental problems.
- Knowledge of Unix scripting: shell, Perl, Python, Ruby or equivalent languages (from an automation and monitoring standpoint).
- Ability to identify and configure add-on modules or plugins for open-source tools to automate tasks and monitor production services effectively.
- Strong knowledge of DNS, system build automation, and system configuration management tools.
- Familiarity with server virtualization technologies (Xen or equivalents).
- Experience configuring and tuning monitoring systems (Nagios, Graphite or equivalents).
- Ability to work in fast-paced environments with weekly release schedules.
Desired skills and experience:
- Knowledge of Redis or other NoSQL databases.
- RHCE certification or equivalent experience.
- Experience with package management (preferably on Debian systems).
- Working knowledge of the following technologies (or equivalents): LDAP, NTP, DNS, FAI, DHCP, Subversion, Git, Chef, MySQL.
- Experience executing multi-week, multi-person projects according to a project plan.
Start Date: Immediate
Interested candidates can send their profiles to email@example.com