Microsoft Corporation Incident Manager in Singapore (APAC-HQ), Singapore
Business Function Overview:
Microsoft’s Cloud Infrastructure and Operations (MCIO) is the engine that powers our cloud services. We deliver the core infrastructure and foundational technologies for Microsoft's over 200 online businesses including Bing, MSN, Office 365, Xbox Live, Skype, OneDrive and the Microsoft Azure platform.
Our infrastructure comprises a large global portfolio of more than 100 datacenters and 1 million servers. Our portfolio is built and managed by a team of subject matter experts working to support services for more than 1 billion customers and 20 million businesses in over 90 countries worldwide.
We are responsible for designing, building and operating our unified global datacenters; managing the demand planning and capacity utilization of our unified infrastructure; and running all of the operations needed to support the physical infrastructure (including supply chain, hardware, power, security, and workflow teams). We focus on smart growth with an emphasis on automation, data-driven engineering, cost-effectiveness and environmental sustainability.
At the forefront of the action in MCIO is our Datacenter Operations Group!
Supporting the Datacenter Operations Group is the Field Operations Services organization, where an Incident Management Team engages to facilitate the investigation, diagnosis, and resolution of complex server, network and infrastructure disruptions. In this role as a Major Incident Manager, you will be responsible for developing, implementing, and managing all processes and procedures for effectively coordinating Major Incidents and Crises end-to-end, and for performing post-event Problem Management activities in close alignment with on-site operations teams, business management, internal partners, and even external customer stakeholders.
As a successful Major Incident Manager, your performance objectives are to:
Launch readiness planning and runbook creation for the Major Incident and Crisis event handling processes of the Major Incident Management, Service Desk, IT and Critical Environment (CE) teams (aka the Critical Incident Playbook)
Act as Facilitator during Major Incidents, Crises, and other broadly impacting events
Engage key stakeholders and participants in Major Incident activity and manage their workload and priorities, so that business impact is quickly assessed with service or application owners and mitigation plans are promptly identified
Interact directly with Customers, Microsoft executive leaders, Managers and key stakeholders to proactively communicate current status on active Major Incidents or Crises
Facilitate industry-standard Root Cause Analysis (RCA) exercises with all Major Incident and Crisis stakeholders and participants to begin the Problem Management cycle
Record, coordinate, and report on progress of ‘Repair Item’ output from Post Incident Reviews and RCA exercises
Provide feedback and drive improvements to current tools and processes, routing initiatives to the appropriate group for proactive design changes and implementation, or for business risk assessment of Major Incident (MI) causal factors
Develop and deliver Post-Mortem reports for distribution to MS executive audience(s)
Skills & Qualifications
Demonstrated strategic and tactical thinking, along with quantitative and analytical skills, while under pressure
Strong understanding of logical IT principles such as Active Directory, Windows Server, IIS, SQL, Web services, and their applications in robust high availability environments
Knowledge of and exposure to distributed systems across hyper-scale, cloud-based environments
Working knowledge of physical IT infrastructures such as Enterprise Server Platforms and related IT architectures and equipment
Solid understanding of large-scale networking, including the OSI Model, DNS, WINS, TCP/IP, VLANs, DHCP, Routing, ACLs, and switching protocols
Understanding and knowledge of physical datacenters and their related infrastructure or resources, such as power, rack space, and CE infrastructure (e.g. UPS, Generators, AHU)
Flexibility and willingness to support a 24x7 global operation via off-hours support, on-call availability, or other arrangements as required by the rhythm and needs of the business
Required Skills & Experience
At least four (4) years of equivalent work experience in a datacenter or similar enterprise operations environment
Excellent problem resolution, judgment, negotiation and decision-making skills
Ability to balance competing demands for resources and adapt to changing priorities
Excellent written and oral communication skills, with special focus on customer/client level interaction
Capability for writing reports and presenting to Microsoft executive level audiences
Operations experience in a 24x7x365 support model
Practical experience with incident/outage and crisis management
Preferred Skills & Experience
Working knowledge of ITIL/MOF incident, problem, and change management components
PMP and Six Sigma DMAIC process improvement skills a plus
BS/BA in Computer Science, math, telecommunications, or equivalent education or experience