Microsoft Corporation Incident Manager in Singapore (APAC-HQ), Singapore

Business Function Overview:

Microsoft’s Cloud Infrastructure and Operations (MCIO) is the engine that powers our cloud services. We deliver the core infrastructure and foundational technologies for more than 200 of Microsoft’s online businesses, including Bing, MSN, Office 365, Xbox Live, Skype, OneDrive and the Microsoft Azure platform.

Our infrastructure comprises a large global portfolio of more than 100 datacenters and 1 million servers. Our portfolio is built and managed by a team of subject matter experts working to support services for more than 1 billion customers and 20 million businesses in over 90 countries worldwide.

We are responsible for designing, building and operating our unified global datacenters; managing the demand planning and capacity utilization of our unified infrastructure; and handling all of the operations needed to run the physical infrastructure (including supply chain, hardware, power, security, and workflow teams). We focus on smart growth with an emphasis on automation, data-driven engineering, cost-effectiveness and environmental sustainability.

At the forefront of the action in MCIO is our Datacenter Operations Group!

Supporting the Datacenter Operations Group is the Field Operations Services organization, where an Incident Management Team engages to facilitate the investigation, diagnosis, and resolution of complex server, network and infrastructure disruptions. In this role as a Major Incident Manager, you will be responsible for developing, implementing, and managing all processes and procedures for effectively coordinating Major Incidents and Crises end-to-end, and for performing post-event Problem Management activities in close alignment with on-site operations teams, business management, internal partners, and even external customer stakeholders.


As a successful Major Incident Manager, your performance objectives are to:

  • Launch readiness planning and runbook creation for Major Incident Management, Service Desk, IT and Critical Environment (CE) teams’ processes for handling Major Incident and Crisis events (also known as the Critical Incident Playbook)

  • Act as Facilitator during Major Incidents, Crises, and other broadly impacting events

  • Engage key stakeholders and participants in Major Incident activity, and manage their workload and priorities, to quickly assess business impact with service or application owners and identify mitigation plans

  • Interact directly with Customers, Microsoft executive leaders, Managers and key stakeholders to proactively communicate current status on active Major Incidents or Crises

  • Facilitate industry-standard Root Cause Analysis (RCA) exercises with all Major Incident and Crisis stakeholders and participants to begin the Problem Management cycle

  • Record, coordinate, and report on progress of ‘Repair Item’ output from Post-Incident Reviews and RCA exercises

  • Provide feedback and drive improvements to current tools and processes, routing initiatives to the appropriate group for proactive design changes and implementation, or for business risk assessment of Major Incident (MI) causal factors

  • Develop and deliver Post-Mortem reports for distribution to Microsoft executive audiences


Skills & Qualifications

  • Demonstrated strategic and tactical thinking, along with quantitative and analytical skills, under pressure

  • Strong understanding of logical IT principles such as Active Directory, Windows Server, IIS, SQL, Web services, and their applications in robust high availability environments

  • Knowledge of and exposure to distributed systems across hyper-scale, cloud-based environments

  • Working knowledge of physical IT infrastructures such as Enterprise Server Platforms and related IT architectures and equipment

  • Solid understanding of large-scale networking, including OSI Model, DNS, WINS, TCP/IP, VLANs, DHCP, Routing, ACLs, switching protocols, etc.

  • Understanding and knowledge of physical datacenters and their related infrastructure and resources, such as power, rack space, and CE infrastructure (e.g. UPS, generators, AHU)

  • Flexibility and willingness to support a 24x7 global operation via off-hours support, on-call availability, or other arrangements as required by the rhythm and needs of the business

Required Skills & Experience

  • At least four (4) years of equivalent work experience in a datacenter or similar enterprise operations environment

  • Excellent problem resolution, judgment, negotiation and decision-making skills

  • Ability to balance competing demands for resources and adapt to changing priorities

  • Excellent written and oral communication skills, with a special focus on customer- and client-level interaction

  • Ability to write reports and present to Microsoft executive-level audiences

  • Operations experience in a 24x7x365 support model

  • Practical experience with incident/outage and crisis management

Preferred Skills & Experience

  • Working knowledge of ITIL/MOF incident, problem, and change management components

  • PMP and Six Sigma DMAIC process improvement skills a plus

  • BS/BA in Computer Science, Mathematics, Telecommunications, or equivalent education or experience