Career
Infra Support Engineer
Kuala Lumpur, Malaysia
Infra Support Engineer – GMI Global Infrastructure Team

Preferred Location:
- Taiwan
- Malaysia

Responsibilities:
- Provide first and second-line technical support to customers for AI Infrastructure, including GPU/CPU nodes, networking, storage, orchestration, and platform services. Support is delivered via ticketing systems, emails, Slack, or other messaging platforms.
- Support GPU cluster delivery, including system provisioning, image deployment, network validation, BIOS/firmware updates, and GPU driver/runtime installation.
- Monitor system health and service-level indicators using alerts and dashboards; respond to alerts 24x7 as scheduled.
- Triage incidents by gathering context, verifying scope and impact, and following standard operating procedures and runbooks to perform immediate mitigations.
- Escalate incidents to global SRE engineers with clear, concise incident notes and relevant logs/traces.
- Maintain incident logs, update status pages, and communicate timely updates to stakeholders during incidents.
- Perform routine operational tasks such as log checks, health checks, capacity checks, and simple automated fixes.
- Participate in postmortems and contribute actionable follow-ups to reduce recurrence of incidents.
- Help maintain and improve standard operating procedures (SOP), run periodic runbook validation, and document new procedures.
- Work collaboratively with developers and SRE teams to improve system reliability.

Qualifications:
- Bachelor’s degree in Computer Science or a related field.
- Over 2 years of experience in IT operations, server administration, SRE, DevOps, or technical support.
- Hands-on Linux experience, including shell, kernel, and log management.
- Basic networking knowledge, including TCP/IP, DNS, HTTP, and VLANs.
- Familiarity with monitoring, alerting, and logging tools such as Prometheus, Grafana, and AlertManager.
- Experience with Nvidia GPU infrastructure and Kubernetes.
- Comfortable collecting diagnostics, reading logs, and interpreting traces.
- Strong troubleshooting mindset and ability to follow runbooks under pressure.
- Excellent written and verbal communication skills for customer-facing incident handling.
- Willingness to work shifts and participate in on-call rotations.
- Bilingual in English and Chinese.