Commit d8808a0

committed edits
Signed-off-by: Chris Abraham <[email protected]>
1 parent 0d73a5e commit d8808a0

File tree

1 file changed: +11 -3 lines changed

_posts/2025-01-14-genai-acceleration-intel-xeon.md

Lines changed: 11 additions & 3 deletions
@@ -39,7 +39,7 @@ Figure 1. Weight-only Quantization Pattern. Source: Mingfei Ma, Intel
- * Weight Prepacking & Micro Kernel Design.
+ * **Weight Prepacking & Micro Kernel Design.**

To maximize throughput, GPTFast allows model weights to be prepacked into hardware-specific layouts for int4 using internal PyTorch ATen APIs. Inspired by Llama.cpp, we prepack the model weights from [N, K] to [N/kNTileSize, K, kNTileSize/2], with kNTileSize set to 64 on AVX512. First, the model weights are blocked along the N dimension, then the two innermost dimensions are transposed. To minimize de-quantization overhead in kernel computation, we shuffle the 64 data elements on the same row in an interleaved pattern, packing Lane2 & Lane0 together and Lane3 & Lane1 together, as illustrated in Figure 2.
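As a rough illustration of that layout change (not the internal ATen int4 path, and leaving out the two-values-per-byte packing and lane interleave that produce the kNTileSize/2 inner dimension), a minimal PyTorch sketch of the blocking and transpose might look like:

```
import torch

# Hypothetical sketch of the [N, K] -> [N / kNTileSize, K, kNTileSize] blocking
# described above; the real int4 kernels additionally pack two 4-bit values per
# byte (hence kNTileSize/2) and interleave lanes inside internal ATen APIs.
kNTileSize = 64  # tile size along N on AVX512, per the text above

def prepack_weight(w: torch.Tensor) -> torch.Tensor:
    """Block an [N, K] weight along N, then transpose the two innermost dims."""
    N, K = w.shape
    assert N % kNTileSize == 0
    blocked = w.reshape(N // kNTileSize, kNTileSize, K)   # [N/kNTileSize, kNTileSize, K]
    return blocked.transpose(-2, -1).contiguous()         # [N/kNTileSize, K, kNTileSize]

w = torch.randn(128, 256)
print(prepack_weight(w).shape)  # torch.Size([2, 256, 64])
```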

@@ -97,34 +97,42 @@ Diffusion Fast offers a simple and efficient PyTorch native acceleration for tex
SDPA is a key mechanism used in transformer models; PyTorch provides a fused implementation that shows large performance benefits over a naive implementation.
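For context, here is a minimal sketch of calling the fused op directly; the tensor shapes are illustrative only:

```
import torch
import torch.nn.functional as F

# Illustrative shapes: [batch, heads, sequence length, head dimension].
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# The fused op replaces the naive softmax(q @ k.T / sqrt(d)) @ v chain of ops
# and dispatches to an optimized kernel when one is available for the device/dtype.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```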

- **Model Usage on Native PyTorch CPU**
+ ## Model Usage on Native PyTorch CPU

### [GPTFast](https://github.com/pytorch-labs/gpt-fast)

To launch WOQ in GPTFast, first quantize the model weights. For example, to quantize with int4 and a group size of 32:

```
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int4 --groupsize 32
```

Then run generation by passing the int4 checkpoint to generate.py:

```
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model_int4.g32.pth --compile --device $DEVICE
```

To use the CPU backend in GPTFast, simply switch the DEVICE variable from cuda to cpu.

### [Segment Anything Fast](https://github.com/pytorch-labs/segment-anything-fast)

```
cd experiments

export SEGMENT_ANYTHING_FAST_USE_FLASH_4=0

python run_experiments.py 16 vit_b <pytorch_github> <segment-anything_github> <path_to_experiments_data> --run-experiments --num-workers 32 --device cpu

python run_experiments.py 16 vit_h <pytorch_github> <segment-anything_github> <path_to_experiments_data> --run-experiments --num-workers 32 --device cpu
```

### Use [Diffusion Fast](https://github.com/huggingface/diffusion-fast)

```
python run_benchmark.py --compile_unet --compile_vae --device=cpu
```

## Performance Evaluation

@@ -134,7 +142,7 @@ We ran llama-2-7b-chat model based on [test branch](https://github.com/yanbing-j
- * Use torch.compile to automatically fuse elementwise operators.
+ * Use `torch.compile` to automatically fuse elementwise operators (see the sketch after this list).
* Reduce memory footprint with WOQ-int8.
* Further reduce memory footprint with WOQ-int4.
* Use AVX512 which enables faster de-quant in micro kernels.
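As a rough sketch of the first optimization in this list (the toy module below is illustrative, not the actual benchmark model), `torch.compile` is applied to the eager model so chains of elementwise operators can be fused:

```
import torch

# Hypothetical toy module: a linear layer followed by an elementwise activation
# that torch.compile's default inductor backend can fuse into fewer kernels.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
)
compiled = torch.compile(model)

with torch.no_grad():
    y = compiled(torch.randn(8, 4096))
print(y.shape)  # torch.Size([8, 4096])
```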
