Introducing Mu language model and how it enabled the agent in Windows Settings

We are excited to introduce our newest on-device small language model, Mu. This model addresses scenarios that require inferring complex input-output relationships and has been designed to operate efficiently, delivering high performance while running locally. Mu powers the agent in Settings, available to Windows Insiders in the Dev Channel with Copilot+ PCs, by mapping natural language input queries to Settings function calls. Mu is fully offloaded onto the Neural Processing Unit (NPU) and responds at over 100 tokens per second, meeting the demanding UX requirements of the agent in Settings scenario. This blog provides further details on Mu's design and training and how it was fine-tuned to build the agent in Settings.

Model training Mu

Enabling Phi Silica to run on NPUs provided us with valuable insights about tuning models for optimal performance and efficiency. These informed the development of Mu, a micro-sized, task-specific language model designed from the ground up to run efficiently on NPUs and edge devices.

Figure: Encoder–decoder architecture compared to decoder-only architecture

Mu is an efficient 330M-parameter encoder–decoder language model optimized for small-scale deployment, particularly on the NPUs of Copilot+ PCs. It follows a transformer encoder–decoder architecture, meaning an encoder first converts the input into a fixed-length latent representation, and a decoder then generates output tokens based on that representation. This design yields significant efficiency benefits. The figure above illustrates how an encoder–decoder reuses the input's latent representation, whereas a decoder-only model must consider the full input + output sequence. By separating the input tokens from output tokens, Mu's one-time encoding greatly reduces computation and memory overhead. In practice, this translates to lower latency and higher throughput on specialized hardware. For example, on a Qualcomm Hexagon NPU (a mobile AI accelerator), Mu's encoder–decoder approach achieved about 47% lower first-token latency and 4.7× higher decoding speed compared to a decoder-only model of similar size. These gains are crucial for on-device and real-time applications.

Mu's design was carefully tuned for the constraints and capabilities of NPUs. This involved adjusting the model architecture and parameter shapes to better fit the hardware's parallelism and memory limits. We chose layer dimensions (such as hidden sizes and feed-forward network widths) that align with the NPU's preferred tensor sizes and vectorization units, ensuring that matrix multiplications and other operations run at peak efficiency. We also optimized the parameter distribution between the encoder and decoder, empirically favoring a 2/3–1/3 split (e.g. 32 encoder layers vs. 12 decoder layers in one configuration) to maximize performance per parameter. Additionally, Mu employs weight sharing in certain components to reduce the total parameter count. For instance, it ties the input token embeddings and output embeddings, so that one set of weights is used both for representing input tokens and for generating output logits. This not only saves memory (important on memory-constrained NPUs) but can also improve consistency between the encoding and decoding vocabularies. Finally, Mu restricts its operations to the NPU-optimized operators supported by the deployment runtime. By avoiding unsupported or inefficient ops, Mu fully utilizes the NPU's acceleration capabilities. These hardware-aware optimizations collectively make Mu highly suited for fast, on-device inference.
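For illustration only (not Mu's actual implementation; all dimensions below are placeholders), here is a PyTorch sketch of an encoder–decoder model that ties its input embedding and output projection weights, so a single matrix serves both roles:

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Illustrative encoder-decoder with tied input/output embeddings.
    Positional encodings are omitted for brevity; sizes are placeholders."""

    def __init__(self, vocab_size=32000, d_model=512,
                 num_encoder_layers=4, num_decoder_layers=2, nhead=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: one matrix both embeds input tokens and produces output logits.
        self.lm_head.weight = self.embed.weight

    def forward(self, src_ids, tgt_ids):
        memory = self.transformer.encoder(self.embed(src_ids))    # encode the input once
        hidden = self.transformer.decoder(self.embed(tgt_ids), memory)
        return self.lm_head(hidden)                               # logits over the vocabulary
```

Because the encoder output ("memory") is computed once per request, every decoding step reuses it, which is the source of the latency and throughput advantage described above.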

Packing performance in a tenth the size

Figure: Inference time versus input sequence length

Mu adds three key transformer upgrades that squeeze more performance from a smaller model:
  • Dual LayerNorm (pre- and post-LN) – normalizing both before and after each sub-layer keeps activations well-scaled, stabilizing training with minimal overhead.
  • Rotary Positional Embeddings (RoPE) – complex-valued rotations embed relative positions directly in attention, improving long-context reasoning and allowing seamless extrapolation to sequences longer than those seen in training.
  • Grouped-Query Attention (GQA) – sharing keys/values across head groups slashes attention parameters and memory while preserving head diversity, cutting latency and power on NPUs (a minimal sketch follows this list).
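For illustration only (not Mu's code; head counts and sizes are placeholders), a minimal grouped-query attention module in which each group of query heads shares a single key/value head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Toy grouped-query attention: n_q_heads query heads share n_kv_heads K/V heads."""

    def __init__(self, d_model=512, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.q_proj = nn.Linear(d_model, n_q_heads * self.d_head)
        # K/V projections are smaller than in standard multi-head attention.
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head)
        self.o_proj = nn.Linear(n_q_heads * self.d_head, d_model)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        # Repeat each K/V head so every group of query heads can attend to it.
        group = self.n_q // self.n_kv
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```

The smaller key/value projections also shrink the KV cache, which is what reduces memory traffic and power during token generation.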
Training techniques such as warmup-stable-decay schedules and the Muon optimizer were used to further refine its performance. Together, these choices deliver stronger accuracy and faster inference within Mu's tight edge-device budget.

We trained Mu on A100 GPUs in Azure Machine Learning over several phases. Following techniques pioneered in the development of the Phi models, we began with pre-training on hundreds of billions of the highest-quality educational tokens to learn language syntax, grammar, semantics and some world knowledge. To further enhance accuracy, the next step was distillation from Microsoft's Phi models. By capturing some of Phi's knowledge, Mu achieves remarkable parameter efficiency. All of this yields a base model that is well suited to a variety of tasks, and pairing it with task-specific data and additional fine-tuning through low-rank adaptation (LoRA) can dramatically improve its performance (a minimal sketch of a LoRA adapter follows the table below).

We evaluated Mu's accuracy by fine-tuning it on several tasks, including SQuAD, CodeXGlue and the Windows Settings agent (covered later in this blog). For many tasks, the task-specific Mu achieves remarkable performance despite its micro-size of a few hundred million parameters. When comparing Mu to a similarly fine-tuned Phi-3.5-mini, we found that Mu is nearly comparable in performance despite being one-tenth the size, while handling input contexts of tens of thousands of tokens and generating over a hundred output tokens per second.
Task              Fine-tuned Mu    Fine-tuned Phi-3.5-mini
SQuAD             0.692            0.846
CodeXGlue         0.934            0.930
Settings Agent    0.738            0.815
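As a concrete illustration of the LoRA fine-tuning mentioned above (a generic sketch, not the pipeline used to fine-tune Mu; rank and scaling values are placeholders), a frozen linear layer can be augmented with a small trainable low-rank update:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: y = base(x) + scale * B(A(x))."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: wrap one projection with a LoRA adapter so only the adapter is trained.
layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 16, 512))
```

Only the low-rank matrices are updated during task-specific training, which keeps the number of trainable parameters and the fine-tuning cost small.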

Model quantization and model optimization

To enable the Mu model to run efficiently on-device, we applied advanced model quantization techniques tailored to the NPUs on Copilot+ PCs. We used Post-Training Quantization (PTQ) to convert the model weights and activations from floating point to integer representations, primarily 8-bit and 16-bit. PTQ allowed us to take a fully trained model and quantize it without retraining, significantly accelerating our deployment timeline and optimizing the model to run efficiently on Copilot+ devices. Ultimately, this approach preserved model accuracy while drastically reducing memory footprint and compute requirements without impacting the user experience.

Quantization was just one part of the optimization pipeline. We also collaborated closely with our silicon partners at AMD, Intel and Qualcomm to ensure that the quantized operations used when running Mu were fully optimized for the target NPUs. This included tuning mathematical operators, aligning with hardware-specific execution patterns and validating performance across different silicon. These optimization steps result in highly efficient inference on edge devices, producing outputs at more than 200 tokens/second on a Surface Laptop 7.
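For illustration only (this is not the production quantization tooling), a minimal sketch of symmetric post-training quantization of a trained weight tensor to 8-bit integers with a single per-tensor scale:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor post-training quantization to int8 (illustrative)."""
    scale = np.abs(weights).max() / 127.0           # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale             # approximate reconstruction at inference

w = np.random.randn(256, 256).astype(np.float32)    # stands in for a trained weight matrix
q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).mean()
print(f"int8 storage, mean abs reconstruction error: {error:.5f}")
```

Real PTQ pipelines additionally quantize activations using calibration data and choose per-channel or per-tensor scales per operator, but the core idea is the same rounding-to-integers step shown here.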

https://www.youtube.com/watch?si=P1nIObhNUVckI7yl&v=A2geTQes0Pw&feature=youtu.be

Mu running a question-answering task on an edge device, using context sourced from Wikipedia (https://en.wikipedia.org/wiki/Microsoft)

Notice the fast token throughput and ultra-fast time to first token despite the large amount of input context provided to the model. By pairing state-of-the-art quantization techniques with hardware-specific optimizations, we ensured that Mu is highly effective for real-world deployment in resource-constrained applications. In the next section, we go into detail on how Mu was fine-tuned and applied to build the new Windows agent in Settings on Copilot+ PCs.

Model tuning the agent in Settings

To improve Windows' ease of use, we focused on addressing the challenge of changing hundreds of system settings. Our goal was to create an AI-powered agent within Settings that understands natural language and changes relevant, undoable settings seamlessly. We aimed to integrate this agent into the existing search box for a smooth user experience, requiring ultra-low latency across numerous possible settings. After testing various models, Phi LoRA initially met precision goals but was too large to meet latency targets. Mu had the right characteristics but required task-specific tuning for optimal performance in Windows Settings.

While baseline Mu excelled in this scenario in terms of performance and power footprint, it incurred a 2x precision drop when used on the same data without any fine-tuning. To close the gap, we scaled training to 3.6M samples (1,300x more) and expanded coverage from roughly 50 settings to hundreds of settings. By employing synthetic approaches for automated labeling, prompt tuning with metadata, diverse phrasing, noise injection and smart sampling, the Mu fine-tune used for the agent in Settings successfully met our quality objectives. The fine-tuned Mu model achieved response times of under 500 milliseconds, aligning with our goals for a responsive and reliable agent in Settings that scales to hundreds of settings. The image below shows how the experience is integrated, with an example of a natural language query being mapped to a Settings action surfaced in the UI.

Figure: Screenshot demonstrating the agent in Settings

To further address the challenge of short and ambiguous user queries, we curated a diverse evaluation set combining real user inputs, synthetic queries and common settings, ensuring the model could handle a wide range of scenarios effectively. We observed that the model performed best on multi-word queries that conveyed clear intent, as opposed to short or partial-word inputs, which often lack sufficient context for accurate interpretation. To address this gap, the agent in Settings is integrated into the Settings search box: short queries that don't meet the multi-word threshold continue to surface lexical and semantic search results in the search box, while multi-word queries surface the agent to return high-precision, actionable responses.

Managing the extensive array of Windows settings posed its own challenges, particularly with overlapping functionalities. For instance, even a simple query like "Increase brightness" could refer to multiple settings changes – if a user has dual monitors, does that mean increasing brightness on the primary monitor or a secondary monitor? To address this, we refined our training data to prioritize the most used settings as we continue to refine the experience for more complex tasks.
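For illustration, the kind of natural-language-to-function-call mapping the agent model is trained to produce might look like the following; the function name and arguments are invented for this example and do not reflect the actual Windows Settings APIs:

```python
# Hypothetical illustration of a query -> Settings function-call mapping; names are invented.
example = {
    "query": "turn on night light",
    "predicted_call": {
        "function": "set_night_light",    # placeholder, not a real Windows Settings API name
        "arguments": {"enabled": True},
    },
}
print(example["predicted_call"])
```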

What’s ahead

We welcome feedback from users in the Windows Insiders program as we continue to refine the experience for the agent in Settings. As we've shared in our previous blogs, these breakthroughs wouldn't be possible without the support and efforts of the Applied Science Group and our partner teams in WAIIA and WinData that contributed to this work, including: Adrian Bazaga, Archana Ramesh, Carol Ke, Chad Voegele, Cong Li, Daniel Rings, David Kolb, Eric Carter, Eric Sommerlade, Ivan Razumenic, Jana Shen, John Jansen, Joshua Elsdon, Karthik Sudandraprakash, Karthik Vijayan, Kevin Zhang, Leon Xu, Madhvi Mishra, Mathew Salvaris, Milos Petkovic, Patrick Derks, Prateek Punj, Rui Liu, Sunando Sengupta, Tamara Turnadzic, Teo Sarkic, Tingyuan Cui, Xiaoyan Hu, Yuchao Dai.
Enabling multimodal functionality for Phi Silica
Introduction

Expanding on the breakthrough efficiencies of Phi Silica, Microsoft's state-of-the-art on-device small language model (SLM), this blog shows how we added vision-based multimodal capabilities to it. This additional dimension unlocks new possibilities for local SLMs on Windows, as illustrated by a couple of exciting new experiences for accessibility and productivity, which we also dive into in this blog. By introducing support for image understanding, Windows now comes with a built-in multimodal SLM offloaded to the NPU on your Copilot+ PC (available on Snapdragon-based Copilot+ devices and coming to Intel and AMD Copilot+ PCs), which powers several of our experiences and developer APIs. Effectively understanding images opens up numerous possibilities for reasoning over multimodal inputs, spanning images and text, to generate textual output.

After implementing Phi Silica with text capabilities, it was crucial that the addition of vision capability did not necessitate deploying a separate, dedicated vision SLM on-device. This consideration is especially important as we strive to be resource-conscious regarding disk space, memory utilization and compute resources with built-in OS language models. The solution we developed coexists with the existing Phi Silica and other models we deploy on-device, extensively reusing them while adding only a relatively small 80-million-parameter projector model. Integrating multiple models to introduce new capabilities by training smaller connectors is a method we anticipate will become increasingly prevalent in the client-AI space.

Instead of updating the base Phi Silica SLM weights, we feed adapted multimodal inputs to the existing Phi Silica embedding layer. The multimodal input adaptation is done using the vision encoder and the small projector model. For the vision encoder, we reuse the Florence image encoder, which is already deployed in Windows Recall (Preview) and the improved Windows search features. The small multimodal projector module, which translates vision embeddings into Phi-compatible embeddings, is trained afresh while the vision encoder remains frozen to avoid impacting existing use cases. Further, we ensure that the behavior of the newly introduced multimodal component is compatible with the existing quantized vision encoder and Phi Silica with acceptable quality. This ensures the best user experience at minimal extra memory footprint.

Phi Silica-powered image description needs to coexist and often run concurrently with other SLM-based experiences like Click to Do (Preview) and the text rewrite and summarize capabilities, in addition to the user's own AI workloads. Reusing all the existing components also helps save cost and time on training additional components, optimizes feature loading time and reduces overall memory footprint, providing the user an improved experience.

https://youtu.be/z8d_CsKOLVc

Real-time demo of Multimodal Phi Silica running on Copilot+ PCs (Snapdragon X Series)

How multimodal functionality for Phi Silica improves accessibility

Understanding the content on the screen, whether text or images, is an important step towards making computing more accessible. Today, many of Microsoft's products, like Word and PowerPoint, automatically generate Alt Text for images, making it easier for screen readers like Microsoft Narrator to describe the content on the screen. The current Alt Text generation methods leverage cloud-based models to provide a short visual summary.

The multimodal functionality for Phi Silica enhances the description of screen contents for people who are blind or have low vision and use screen readers. Phi Silica can generate descriptions with varying levels of detail, from shorter Alt Text to comprehensive descriptions, making the AI-generated Alt Text more useful to the person. Device-based Phi Silica makes these high-quality descriptions possible, delivering a more accessible and performant experience.

https://youtu.be/30RMRtSbjAw

Image description using multimodal functionality for Phi Silica in Windows Narrator live on Copilot+ PC devices (Snapdragon X Series)

We now describe in detail the components of the multimodal functionality for Phi Silica, starting with the vision encoder used to extract the image tokens that train the modality projector, followed by the evaluation of the system.

Extracting the Vision Embedding

To extract the vision embeddings for an image, we use the Florence image encoder (Florence: A New Foundation Model for Computer Vision - Microsoft Research). The visual features from the input image are fed into a modality projector model. The projector is trained to produce embeddings aligned with the Phi Silica embedding space, which can be fed directly into Phi Silica alongside the Phi Silica embeddings of the accompanying text. Our modality projector has a simple architecture comprising two linear layers stacked on top of each other. In addition to being efficient at inference time, this design minimizes the number of new parameters added to our ecosystem of AI models, enabling multimodal functionality with just a few million new parameters, compared to the additional multi-billion parameters required if we had to deploy a separate vision language model. To maintain high inference efficiency, we train the system to operate with only one crop over the input image, unlike rival models that require multiple crops. Thus, our training approach ensures that our system needs only 49 visual tokens from the vision encoder per image, making the end-to-end system efficient.

Figure: Florence image token extraction for projector training

During the training of the vision projector, both the vision encoder and the language model, Phi Silica, remain frozen. This aligns with our ecosystem design goal of allowing maximum reuse of foundation models across scenarios. Both the vision encoder and Phi Silica run in quantized form on the NPU. To avoid any degradation caused by the quantized vision encoder, we preprocess the training data to generate the image tokens from the quantized vision encoder and use them during training. To meet the memory constraints and throughput latency requirements, post-training quantization is performed on Phi Silica, enabling it to run at 4-bit weight precision using QuaRot. Since the projector is trained against the unrotated Phi Silica, we use the same random Hadamard transform to rotate the embeddings coming from the visual stream before feeding them into Phi Silica. To ensure that the scale of the embeddings from text and vision is captured in the quantization process, we included some calibration data from the output of the projector when quantizing the activations of Phi Silica. This ensures that the activations coming in from the visual stream can be accurately represented. This calibration is done once, and the projector is designed with appropriate normalization so that subsequent training keeps the projector's output range within the expected range for quantized Phi Silica.

Figure: Training the projector with frozen Florence vision encoder and Phi Silica base model

Although our vision encoder model can understand text in images, for scenarios requiring precise text extraction, such as chart or graph understanding, an OCR (Optical Character Recognition) model may be used alongside the vision encoder and projector. To facilitate this, we fuse the projected visual information with the textual information before feeding it to Phi Silica. This augments Phi Silica with visual understanding capability while retaining its original capabilities.
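Returning to the projector described above, here is a minimal sketch (illustrative widths only; the hidden width and non-linearity are assumptions, since the exact layer details are not specified) of a two-layer module that maps the 49 Florence visual tokens into the language model's embedding space:

```python
import torch
import torch.nn as nn

VISION_DIM = 1024   # placeholder width of the frozen vision encoder's output tokens
TEXT_DIM = 3072     # placeholder width of the language model's embedding space

class ModalityProjector(nn.Module):
    """Two stacked linear layers mapping visual tokens into text-embedding space."""

    def __init__(self, vision_dim=VISION_DIM, text_dim=TEXT_DIM, hidden_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),                       # assumed non-linearity between the two layers
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, visual_tokens):        # (batch, 49, vision_dim)
        return self.net(visual_tokens)       # (batch, 49, text_dim), fed to the frozen SLM

# Only the projector's parameters are trained; the encoder and SLM stay frozen.
projector = ModalityProjector()
image_tokens = torch.randn(1, 49, VISION_DIM)
projected = projector(image_tokens)
```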
[caption id="attachment_179696" align="alignnone" width="602"]A diagram of a projector Image description system using quantized Phi Silica and extracted OCR on Copilot+ devices[/caption] The image descriptions generated by multimodal functionality for Phi Silica can be used in various settings on Copilot+ PCs: from generating short, accurate Alt Text to detailed image readings, enabling rich interactions with images. On average, a short description takes around 4 seconds and is approximately 135 characters long, while a longer description takes around 7 seconds and ranges between 400-450 characters. The current version is optimized for English, with future updates to support descriptions in other languages over time. An example of image description using Phi Silica: Map of the Hawaiian island of OahuA common example of Alt Text today: A map of the island Multimodal functionality for Phi Silica Short Caption/Description: The image depicts a map of the Hawaiian island of Oahu, showing various locations such as Kahuku Point, Kawela Bay, Kahuku and other points and villages, including Pearl City and Ewa Beach. Multimodal functionality for Phi Silica Long Description for Accessibility: The image is a map of the Hawaiian island of Oahu, depicting various locations and geographical features. Key locations include Kahuku Point, Kawela Bay, Kahuku and Kauai. Other notable locations are Laie, Koolau, Hauula, Punaluu, Mokulēia, Anahulu, Waialua, Kaaawa and Kaukonahua. The Pacific Ocean is also visible. Other locations mentioned are Whitmore Village, Mt. Kala, Kualoa Point, Wahiawa and Kaneohe. The image also shows various villages and towns such as Mililani, Ahuimanu, Pearl City and Waipio. Other notable locations include Kahulu, Nānākuli, Koolau and Waimanalo.

Evaluation

We evaluate the image descriptions generated by the multimodal functionality for Phi Silica against existing foundation models like Florence, currently used for describing images. The evaluation is conducted using the LLM-as-a-judge technique: we prompt GPT-4o to score each response between 0 and 5, focusing on accuracy and completeness, given the image and the generated description. The validation dataset is divided into categories such as natural photographs, charts, maps, diagrams, tables and screenshots to represent a generic distribution of images. The radar chart below compares the quality of image descriptions for the existing Florence-driven short accessibility descriptions against the short and detailed descriptions generated by the multimodal functionality for Phi Silica.

Figure: Radar chart of image description quality scores by category
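For illustration, an LLM-as-a-judge scoring call might look like the sketch below, assuming an OpenAI-compatible client; the prompt wording and plumbing are simplified and are not the exact evaluation harness used:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key are configured

JUDGE_PROMPT = (
    "You are grading an image description. Given the image and the candidate "
    "description, return a single integer score from 0 to 5 for accuracy and "
    "completeness, and nothing else."
)

def judge_description(image_url: str, description: str, model: str = "gpt-4o") -> int:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": f"Candidate description:\n{description}"},
                ],
            },
        ],
    )
    return int(response.choices[0].message.content.strip())

# Example (placeholder URL): score = judge_description("https://example.com/chart.png", "A bar chart of ...")
```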

Conclusion

In conclusion, we introduce the NPU-enabled multimodal functionality for Phi Silica, capable of generating image descriptions within the Windows ecosystem. By integrating existing components like the Florence vision encoder with Phi Silica, this solution provides an efficient on-device feature that generates detailed, meaningful and contextually rich descriptions of on-screen content. By providing both short and detailed descriptions, the multimodal functionality for Phi Silica enhances Alt Text generation and improves accessibility for blind and low-vision individuals by leveraging local on-device models running on the NPU.

These breakthroughs wouldn't be possible without the support and efforts of the Applied Science Group that contributed to this work, including Daniel Rings, Dimitrios Mallios, Eric Sommerlade, Henry Jackson-Flux, Karthik Vijayan, Mohsen Fayyaz, Parth Pathak, Pashmina Cameron, Sunando Sengupta, Tamara Turnadzic and Vojin Dedovic, with extended thanks to our partner teams Azure GenAI, Windows Developer Platform, Office Growth Ecosystems and Windows Accessibility.

References
  1. Florence: A New Foundation Model for Computer Vision - Microsoft Research
  2. Phi Silica, small but mighty on-device SLM
  3. Everything you need to know to write effective Alt Text - Microsoft Support
  4. Add alternative text to a shape, picture, chart, SmartArt graphic, or other object - Microsoft Support
  5. Retrace your steps with Recall - Microsoft Support
Phi Silica, small but mighty on-device SLM
Introduction

This blog is the first installment in a new series of technical content designed to provide insights into AI innovation on Windows. Today we will share how the Applied Sciences team used a multidisciplinary approach to achieve a breakthrough in power efficiency, inference speed and memory efficiency for a state-of-the-art small language model (SLM), Phi Silica. Integrated into Windows 11 Copilot+ PCs (starting with Snapdragon X Series), this SLM powers several features of the latest generation of Copilot+ PC experiences: Click to Do (Preview), on-device rewrite and summarize capabilities in Word and Outlook, and a turnkey pre-optimized SLM for developers to utilize.

Background

In May, we introduced Copilot+ PCs. These devices include a Neural Processing Unit (NPU) capable of over 40 trillion operations per second (TOPS). During our May announcement, we also unveiled Phi Silica, the new on-device SLM available starting with Snapdragon X Series NPUs. Phi Silica is the sister series of the Phi models that leverages the NPU on Copilot+ PCs. At Ignite in November, we also announced that developers can access the Phi Silica API starting in January 2025. Developers can bring language intelligence capabilities into their apps without needing to worry about model optimization or customization, as Phi Silica is pre-tuned and ships inbox.

NPU devices with all-day battery life claims run sustained AI workloads over long periods of time, in the background, with minimal impact on system fundamentals and resources. Connected to, and enhanced by, the cloud, Copilot+ PCs can now achieve a level of performance never seen before: they are up to 20x more powerful1 and up to 100x as efficient2 for running AI workloads, and have smaller footprints than GPUs per TOPS/Watt/dollar. NPUs can sustain AI workloads that exhibit emergent behavior (3 to 7B parameter SLMs) in a semi-continuous loop, allowing users to make limitless low-latency queries to the model without incurring additional subscription fees. This is a paradigm shift in compute; we now have the ability to run powerful reasoning agents as part of background operating system (OS) services, unlocking the potential for innovation across the range of our applications and services.

https://www.youtube.com/embed/lo-uIlQbfUs

Copilot+ PC: A new AI era at work

Original floating-point model

Phi Silica is based on a Cyber-EO compliant derivative of Phi-3.5-mini, developed specifically for Windows 11. It has a 4k context length, supports multiple languages including Tier 1 languages and others [English, Chinese (Simplified), French, German, Italian, Japanese, Portuguese, Spanish] and includes key improvements necessary for in-product experiences.

Figure: Cyber-EO compliant Phi-3.5 model benchmark performance (measured on BabelBench)

A language model, such as Phi, consists of several components:

Figure: Language model generation process
  1. The tokenizer breaks down the input text into smaller units and maps them to an index based on a pre-specified vocabulary. Tokenization forms a mapping between the language of humans and the language of models.
  2. The detokenizer performs the reverse operation.
  3. The embedding model transforms every discrete input token ID into a continuous, higher dimensional vector that captures semantic information in the space of language as understood by the language model. The direction of the normalized embedding vector encodes the context and the meaning of the text.
  4. The transformer block transforms these incoming vectors to output vectors (or output hidden states) that point in the direction of the token that should follow the current one.
  5. The language model head computes the most likely token based on the output vectors.
Generating a response to a prompt consists of two distinct phases of operation of the transformer block (a minimal sketch of both phases follows the list below):
  • Context processing: The language model processes input tokens to compute the key-value (KV) cache and generate hidden states and the first token. This involves intense parallel computation, mainly matrix multiplications, requiring high computational power.
  • Token iteration: Tokens are generated one by one (i.e. autoregressively) and each new token becomes part of the extended context to predict the next one. Generation stops when an end token is produced, or a user-defined condition is met.
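To make these two phases concrete, here is an illustrative sketch using a small open model through the Hugging Face transformers API as a stand-in; Phi Silica itself runs through ONNX Runtime on the NPU, not through this interface:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: a tiny open model stands in for the on-device SLM.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("The NPU on a Copilot+ PC", return_tensors="pt").input_ids

with torch.no_grad():
    # Phase 1 - context processing: process all prompt tokens in parallel,
    # build the KV cache and pick the first generated token.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    generated = [next_id]

    # Phase 2 - token iteration: feed one token at a time, reusing and
    # extending the KV cache until a stop condition is met.
    for _ in range(20):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
        if next_id.item() == tok.eos_token_id:
            break

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```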
Running the aforementioned stages for even SLMs such as Phi, with their billions of parameters, can place considerable strain on your device. The context processing stage requires significant computational resources, which impacts the CPU and running applications, and involves high power usage when GPUs are employed due to their efficiency in TOPS per Watt. In contrast, the token iteration stage demands substantial memory for storing and accessing the KV cache for each token generation step. While it needs less computation, efficient memory access is crucial for maintaining performance. Memory constraints make efficient token generation challenging.

NPUs within Copilot+ PCs are built to be power-efficient, capable of executing several TOPS within a single-digit Watt range. On Copilot+ PC devices with Snapdragon X Elite processors, Phi Silica's context processing consumes only 4.8 mWh of energy on the NPU, while the token iterator exhibits a 56% improvement in power consumption compared to operation on the CPU. Consequently, we can execute Phi Silica on your device without burdening the CPU and GPU, ensuring efficient memory and power consumption, thereby allowing this highly capable and versatile model to run seamlessly, with minimal impact on your primary applications and experiences.

As NPUs are domain-specific processors, we have employed various techniques to achieve an optimal balance between efficiency and performance without compromising accuracy. We are eager to share these techniques and hope they can be applied to other small language models as well. Our discussion will primarily focus on optimizing and offloading the transformer block to the NPU. The tokenizer, embedding and language model head are not compute-intensive but involve lookups; therefore, we allocate these tasks to the CPU.

Creating Phi Silica

Considering the size of the original floating-point model, memory limitations of the target hardware, as well as the desired performance metrics in terms of speed, memory usage and power efficiency, it was clear that Phi Silica should have the following characteristics:
  • 4-bit weight quantization to ensure high speed and a low memory footprint during inferencing
  • Low idle memory consumption to support pinned memory and eliminate initialization costs
  • Rapid time to first token for shorter prompts to enhance interactivity
  • A context length of 2k or greater to ensure real-world usability
  • NPU-based operation to achieve power efficiency in sustained usage
  • High accuracy across multiple languages
  • Small model disk size to make distribution at Windows scale efficient
We designed Phi Silica with these goals in mind for the current generation NPUs. In doing so, we pushed the envelope on what’s possible today across several levels of the stack such as post-training quantization, efficient resource use in inference software and targeted silicon-specific optimizations across operator placement and model graph. The result is a model that delivers, with the bulk of compute offloaded:
  • Time to first token: 230ms for short prompts
  • Throughput rate: Up to 20 tokens/s
  • Context length: 2k (with support for 4k coming shortly)
  • Sustained NPU-based context processing and token iteration
https://youtu.be/TnPTvUhrEqk

Real-time demo of Phi Silica running on Copilot+ PCs (Snapdragon X Elite)

Post-training quantization

In a bid to achieve true low-precision inference by quantizing both weights and activations, Microsoft and academic researchers collaborated to create QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs (pronounced “carrot”). QuaRot acts as a pre-quantization modifier, enabling the end-to-end quantization of language models, including all weights, activations and KV cache, down to 4-bits. By rotating language models to remove outliers from the hidden state without affecting the output, QuaRot facilitates high-quality quantization at lower bit-widths.  The ingenuity of QuaRot is anchored in two fundamental concepts:
  • Incoherence processing: Previous weight-only quantization methods such as QuIP and QuIP# employ rotations to pre-multiply and post-multiply the weight matrices (and Hessians). A weight matrix has high incoherence when its largest element is an outlier relative to the average element's magnitude, making it difficult to quantize. Incoherence processing reduces the incoherence in weight matrices by rotating them to make quantization easier. However, this comes at an increased computational cost of rotations and de-rotations for each weight matrix.
  • Computational invariance: QuaRot extends the idea of computational invariance introduced in SliceGPT. Computational invariance means that a transformation to the weight matrices does not change the output. QuaRot uses random Hadamard transforms for the rotations and applies incoherence processing in a computationally invariant manner. This reduces the computational overhead because the rotations and de-rotations across layers can be fused and skipped, leaving only an ingoing and outgoing rotation outside the transformer block. Furthermore, QuaRot allows activations to be incoherence-processed, making activation quantization easier. A small numerical sketch of the invariance idea follows this list.
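The sketch below illustrates the computational-invariance idea numerically, using a random orthogonal matrix for clarity in place of the fused Hadamard transforms QuaRot actually employs: rotating activations and weights with the same orthogonal matrix leaves the layer output unchanged while spreading out activation outliers.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy activation with a large outlier channel, and a toy weight matrix.
x = rng.normal(size=(1, 8))
x[0, 3] = 25.0                       # outlier that would dominate the quantization range
w = rng.normal(size=(8, 8))

# Random orthogonal rotation Q (QuaRot uses random Hadamard transforms instead).
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

x_rot = x @ q                        # rotate the activation
w_rot = q.T @ w                      # rotate the weight the opposite way

# Computational invariance: (xQ)(Q^T W) == xW up to floating-point error.
print(np.allclose(x @ w, x_rot @ w_rot))                                   # True
# The rotation spreads the outlier's energy across channels, easing quantization.
print(np.abs(x).max() / np.abs(x).mean(), np.abs(x_rot).max() / np.abs(x_rot).mean())
```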
https://youtu.be/FpAXaklev14

QuaRot uses rotations to remove outliers to make quantization easier

This results in an equivalent network in a rotated space, allowing activations, weights and KV cache to be quantized to 4-bits, with minimal accuracy loss.

Realizing gains from a 4-bit model

The 4-bit weight quantization optimized our memory footprint. However, adapting QuaRot to enable 4-bit quantized weight inference on an NPU required several adjustments due to the specifics of quantization support in the NPU software stack. The final 4-bit Phi Silica model comprises:
  • Rotated network: In the base floating-point ONNX model, we convert the LayerNorm transformer network into an RMS-Norm transformer network and used fused Hadamard rotations to obtain an equivalent rotated network.
  • Embedding layer: A fused one-time ingoing QuaRot rotation.
  • Activations: Asymmetric per-tensor round-to-nearest quantization to unsigned 16-bit integers from ONNX.
  • Weights: Symmetric per-channel quantization to 4-bit integers from QuaRot with GPTQ, copied into the rotated network.
  • Linear layers: To get the best latency on the current NPU stack, we converted all linear layers into 1x1 convolutional layers (Conv2D). This improved efficiency for the specific matrix sizes involved in Phi Silica (a sketch of this equivalence follows the list).
  • Selective mixed precision: To further enhance accuracy, we identified several quantized weights that exhibited larger reconstruction errors and selectively quantized them using per-tensor 8-bit quantization. This is advisable for NPU-based inference to mitigate the effect of static quantization of all activations, but it is important to use this method sparingly to keep the overall model size small. In practice, we used 8-bit quantization for 4-8 out of 128 weight matrices.
  • Language model head: A fused one-time outgoing QuaRot de-rotation. We also quantize the language model head with 4-bit block-wise quantization to keep the memory usage low.
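As referenced in the linear-layers bullet above, a 1x1 convolution can reproduce a linear layer exactly; the generic PyTorch check below illustrates the equivalence (it is not the ONNX graph rewrite used in production):

```python
import torch
import torch.nn as nn

in_features, out_features, tokens = 512, 1536, 64

linear = nn.Linear(in_features, out_features, bias=True)
conv = nn.Conv2d(in_features, out_features, kernel_size=1, bias=True)

# Copy the linear weights into the 1x1 convolution: (out, in) -> (out, in, 1, 1).
with torch.no_grad():
    conv.weight.copy_(linear.weight.view(out_features, in_features, 1, 1))
    conv.bias.copy_(linear.bias)

x = torch.randn(1, tokens, in_features)                  # (batch, seq, features)
y_linear = linear(x)
# Treat the token dimension as spatial extent and features as channels, then undo it.
y_conv = conv(x.transpose(1, 2).unsqueeze(-1)).squeeze(-1).transpose(1, 2)

print(torch.allclose(y_linear, y_conv, atol=1e-5))       # True: same computation
```

Because the computation is identical, the choice between the two forms is purely about which operator the NPU stack executes faster for the matrix shapes involved.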
We observed that QuaRot significantly improves quantization accuracy, compared to the de-facto round-to-nearest quantization, particularly for low granularity settings such as per-channel quantization. The following table presents benchmark results before and after 4-bit quantization.
Zero-shot task (lm-eval harness)  Floating-point model (%) 4-bit QuaRot weights (float activations) (%)
piqa 80.47 79.76
winogrande 72.77 72.38
arc_challenge 63.48 60.49
arc_easy 85.69 82.74
hellaswag 77.14 75.13
mmlu_abstract_algebra 45.00 38.00
mmlu_business_ethics 76.00 73.00
mmlu_college_computer_science 57.00 48.00
mmlu_college_mathematics 40.00 38.00
mmlu_conceptual_physics 71.91 67.23
mmlu_formal_logic 53.97 50.00
mmlu_machine_learning 57.14 52.67

Improving memory efficiency

Keeping Phi Silica persistent in memory to handle sustained inference requires the memory usage of the model to be tightly bounded. We optimized the memory efficiency of the model through an iterative process of accurate memory measurements and addressing the most pressing memory issue. Some key techniques included:
  • Weight sharing: The context processor and token iterator share the same set of quantized weights and most activation quantization parameters, which halved memory usage and accelerated model initialization. This was achieved by having the two model graphs reference the shared weights in ONNX Runtime.
  • Memory-mapped embeddings: The embedding layer scales with the vocabulary size and the embedding dimension. Using a memory-mapped file for the embedding matrix and implementing the layer as a lookup table effectively reduced the dynamic memory footprint to zero because it eliminated the need for this large matrix to be held in memory.
  • Disabling arena allocator: By default, ONNX Runtime uses an arena allocator, which results in excessive pre-allocation of memory. Arena allocation helps to reduce frequent memory allocations and deallocations and can be beneficial in some cases, but it leads to higher initial memory usage. For Phi Silica, the pattern of memory usage is pre-determined, so disabling this behavior improved memory efficiency overall.
The combined effect of these changes, together with the 4-bit quantized model, led to a ~60% reduction in memory usage.
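For illustration, disabling the arena allocator via ONNX Runtime's Python API looks like the following sketch; the model path and CPU execution provider here are placeholders rather than the actual Phi Silica deployment configuration:

```python
import onnxruntime as ort

opts = ort.SessionOptions()
# Turn off the default arena allocator to avoid large up-front memory pre-allocation;
# useful when the model's memory-usage pattern is already well understood.
opts.enable_cpu_mem_arena = False

# Placeholder model path and CPU provider purely for illustration.
session = ort.InferenceSession("phi_silica_block.onnx", sess_options=opts,
                               providers=["CPUExecutionProvider"])
```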

Expanding the context length

Expanding the context length beyond the sequence length of the context processor, despite the static tensor shape requirements of the NPU-facing software stack, was crucial for enabling real-world applications. To expand the context length and enable streaming prompt processing, we came up with two key innovations that work in tandem:

Sliding window: Instead of processing the entire prompt at once, we process it in smaller chunks of size N (with padding applied to the last chunk if necessary). This reduces the effective sequence length of the chunked context model to N while keeping the total context length the same as before. We process each chunk sequentially and update the KV cache to maintain history. We use N=64. This approach unlocks faster processing of shorter prompts without sacrificing speed on longer prompts, i.e. prompt processing time scales with the prompt length.

https://youtu.be/CEvukvQl0hs

Context processing and token iteration process within Phi Silica
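Below is an illustrative sketch of the sliding-window chunking logic with a dummy stand-in for the NPU context processor; it shows only the chunking and padding, while the real KV cache, attention masks and runtime bindings are more involved:

```python
import numpy as np

CHUNK = 64  # N: the fixed sequence length the NPU-facing context model expects

def process_prompt_in_chunks(prompt_ids, run_chunk):
    """Feed the prompt through a fixed-shape context processor in chunks of CHUNK,
    carrying the KV cache forward so history is preserved across chunks."""
    kv_cache = None
    for start in range(0, len(prompt_ids), CHUNK):
        chunk = prompt_ids[start:start + CHUNK]
        pad = CHUNK - len(chunk)
        if pad:  # pad the final chunk up to the static shape
            chunk = np.concatenate([chunk, np.zeros(pad, dtype=chunk.dtype)])
        kv_cache = run_chunk(chunk, kv_cache, valid_len=CHUNK - pad)
    return kv_cache

# Dummy stand-in for the NPU context processor: just appends valid tokens to the "cache".
def fake_run_chunk(chunk, kv_cache, valid_len):
    prev = kv_cache if kv_cache is not None else np.empty(0, dtype=chunk.dtype)
    return np.concatenate([prev, chunk[:valid_len]])

prompt = np.arange(150)                  # a 150-token prompt -> 3 chunks, last one padded
cache = process_prompt_in_chunks(prompt, fake_run_chunk)
print(len(cache))                        # 150 valid positions carried forward in the cache
```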

Dynamic and shared KV cache: A context processor that runs wholly on the NPU but has a read-only key-value cache is highly efficient, but this limits the context length to the sequence length of the context processor. We experimented with different ways of splitting context processing across the NPU and CPU to find a good balance of speed and flexibility. The best configuration involved doing only the GroupQueryAttention operation on the CPU. This enabled a read-write, dynamic-sized KV cache for context processing, which can then be expanded during iteration for token generation. A dynamic-sized read-write KV cache can be shared across context-processing chunks, which maintains history, but also across context processing and token iteration, which improves memory efficiency. Input/output binding pre-allocates sufficient memory during context processing and enables context processing and token iteration to share a single KV cache efficiently; this improves runtime latency significantly. Memory efficiency of KV cache management is crucial because the context KV cache scales quadratically with the context length.

Figure: Attention subgraphs in Phi Silica

The resulting Phi Silica model features improved first-token latency for shorter prompts and improved memory efficiency, while retaining most of the power efficiency afforded by largely NPU-based operation.

Figure: Time to first token and throughput measurements for Phi Silica

Safety alignment, Responsible AI and content moderation 

The floating-point model from which Phi Silica is derived has undergone safety alignment using a five-stage 'break-fix' methodology similar to the one outlined in this technical report: Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle. The Phi Silica model, the system design and the API undergo a Responsible AI impact assessment and deployment safety board reviews. Local content moderation is available in the Phi Silica developer API. An overview of this process can be reviewed here: Get started with Phi Silica in the Windows App SDK.

Closing

We pushed the boundaries of what's possible with today's NPUs in a rapidly evolving, complex technical landscape. By advancing quantization research, we have achieved remarkable gains in three critical areas with Phi Silica: memory efficiency, power efficiency and inference latencies, without compromises in quality or functionality. These results underscore Microsoft's commitment to developing models that are not only powerful in capability but also highly efficient. By including Phi Silica in the operating system on Copilot+ PCs, Microsoft is ensuring that these powerful and efficient models are seamlessly integrated into Windows 11 experiences on Copilot+ PCs, empowering users to achieve more with their devices.

It has taken a village to innovate across multiple layers of the stack and create Phi Silica, and we are grateful to all our team members, including (listed in alphabetical order) Aditya Rastogi, Aleksandar Uzelac, Andrey Rybalchenko, Ayyoob Imani, Bogdan Radaković, Chad Voegele, Daniel Rings, David Wong, Đjorđje Marjanović, Dusan Erdeljan, Ebey Abraham, Eric Sommerlade, Goran Dubajić, Hari Govind, Henry Jackson-Flux, Ivan Razumenić, James Hensman, Jordan Van der Kroon, Joshua Elsdon, Julio Soldevilla Estrada, Karthik Vijayan, Kosta Rakonjac, Marat Saidov, Miloš Stojanović, Milomir Stefanović, Pashmina Cameron, Parth Pathak, Rahul Amlekar, Ruomei Yan, Saša Galić, Savo Ičagić, Sunando Sengupta, Tadija Šebez, Tamara Turnadzic, Tammany Grant, Teo Šarkić, Vijay Sundaram, Vojin Dedović, Xiaoyan Hu. The Phi Silica team also extends thanks to partner teams who supported the work, including GenAI, Microsoft Research, ONNX Runtime, Windows Developer Platform and Windows Silicon.

Editor's note – Dec. 6, 2024 – In an earlier version, the 'Time to first token' graph units were mistakenly labelled as being in milliseconds (ms) instead of seconds (s). This has now been corrected.

1 Tested April 2024 using debug application for Windows Studio Effects workload comparing pre-release Copilot+ PC builds with Snapdragon Elite X 12 Core to Windows 11 PC with Intel 12th gen i7 configuration.

2 Tested April 2024 using Phi SLM workload running 512-token prompt processing in a loop with default settings comparing pre-release Copilot+ PC builds with Snapdragon Elite X 12 Core and Snapdragon X Plus 10 core configurations (QNN build) to Windows 11 PC with NVIDIA 4080 GPU configuration (CUDA build).