Enabling EmbeddingQuantizer and SharedEmbeddingQuantizer #1525
Conversation
…aredEmbeddingQuantizer
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1525
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 12e3c65 with merge base 5f8f35d.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Looks like the imports aren't happy. I wonder if we need a torchao pin bump?
torchchat/utils/quantize.py
Outdated
weight_dtype = getattr(torch, f"int{bit_width}")

try:
    quantize_(
        model,
        int8_dynamic_activation_intx_weight(
            weight_dtype=weight_dtype,
            granularity=granularity,
granularity => weight_granularity
has_weight_zeros=True => weight_mapping_type=MappingType.ASYMMETRIC
has_weight_zeros=False => weight_mapping_type=MappingType.SYMMETRIC
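In other words, a minimal sketch of the suggested renaming (the torchao import paths here are assumptions and may differ across versions):

    from torchao.quantization.granularity import PerAxis, PerGroup
    from torchao.quantization.quant_primitives import MappingType

    # group_size and has_weight_zeros come from the existing q_kwargs;
    # example values only.
    group_size, has_weight_zeros = 32, False
    weight_granularity = PerAxis() if group_size == -1 else PerGroup(group_size)
    weight_mapping_type = (
        MappingType.ASYMMETRIC if has_weight_zeros else MappingType.SYMMETRIC
    )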
torchchat/utils/quantize.py
Outdated
@@ -154,45 +170,86 @@ def quantize_model(
    print(f"Encountered error during quantization: {e}")
    print("Trying with PlainLayout")
Use QDQLayout instead
Yeah, you will need to update the torchao pin to something more recent (just pick the latest commit in torchao): https://github.com/pytorch/torchchat/blob/main/install/.pins/torchao-pin.txt
…und of PR comments. Fixes to usage of EmbeddingQuantizer and SharedEmbeddingQuantizer
@Jack-Khuu I think it's ready to be merged (hopefully haha). Thanks so much for helping out - really appreciate the support from you and Scott :)
Looks legit, just some small style nits
@metascroy can you give it a glance?
import inspect


def get_named_parameters(func: Callable) -> List[str]:
    # Get the signature of the function

    signature = inspect.signature(func)

    # Extract the parameters from the signature

    parameters = signature.parameters

    # Filter and return named parameters
Mind undoing the whitespaces?
@@ -110,23 +125,73 @@ def quantize_model(

    if isinstance(quantize_options, str):
        quantize_options = json.loads(quantize_options)

    for quantizer, q_kwargs in quantize_options.items():
        if quantizer not in quantizer_class_dict:
            raise RuntimeError(f"unknown quantizer {quantizer} specified")
        else:
            # Use tensor subclass API for int4 weight only.
Whoops, comment got split off from: if (device in ["cuda", "xpu", "npu"]) and quantizer == "linear:int4":
# default setup for affine quantization of activations
Ditto on the white spaces :)
    granularity=weight_granularity,
    mapping_type=weight_mapping_type,
    use_fallback=False,
).quantize(model)
@metascroy These don't use the quantize_ APIs?
Not at the moment, no. They are numerically equivalent to quantize_ (and we have tests for this).
The main concern I have is that shared embedding quantization must be done first. Not sure how to ensure that in torchchat. cc @Jack-Khuu
@@ -110,23 +125,73 @@ def quantize_model(

    if isinstance(quantize_options, str):
        quantize_options = json.loads(quantize_options)

    for quantizer, q_kwargs in quantize_options.items():
@dillondesilva A small detail regarding "experimental:shared" that @metascroy brought up is that the order of operations matters for this particular quant.
We can just do this to enforce it:
# "experimental:shared" must be executed first
shared_experimental_quant = quantize_options.pop("experimental:shared", None)
if shared_experimental_quant is not None:
    quantize_options = {"experimental:shared": shared_experimental_quant} | quantize_options
for quantizer, q_kwargs in quantize_options.items():
Or you can use the following code to reorder all quantization options into the order they appear in quantizer_class_dict
(and adjust that dict to be in the order you want - which right now just means moving this experimental:shared quantizer to the top):
ordered_quantize_options = {}
for quantizer in quantizer_class_dict.keys():
    if quantizer in quantize_options:
        ordered_quantize_options |= {quantizer: quantize_options.pop(quantizer)}
if len(quantize_options) != 0:
    raise RuntimeError(f"unknown quantizer(s) {quantize_options.keys()} specified")
quantize_options = ordered_quantize_options
PS: This is written to subsume the check on lines 129-132 because it raises on any unknown quantizers.
cc: @metascroy
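For illustration, a self-contained version of that reordering with made-up dictionary contents (only the ordering logic mirrors the snippet above):

    # Hypothetical entries, for illustration only.
    quantizer_class_dict = {
        "experimental:shared": None,     # listed first so it always runs first
        "experimental:embedding": None,
        "linear:int4": None,
    }
    quantize_options = {
        "linear:int4": {"groupsize": 128},
        "experimental:shared": {"bitwidth": 4, "groupsize": 32},
    }

    ordered_quantize_options = {}
    for quantizer in quantizer_class_dict.keys():
        if quantizer in quantize_options:
            ordered_quantize_options |= {quantizer: quantize_options.pop(quantizer)}
    if len(quantize_options) != 0:
        raise RuntimeError(f"unknown quantizer(s) {quantize_options.keys()} specified")
    quantize_options = ordered_quantize_options

    print(list(quantize_options))  # ['experimental:shared', 'linear:int4']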
Last little bit: can you add the new quants into the CI https://github.com/pytorch/torchchat/blob/main/.github/workflows/pull.yml Essentially wherever you see
To really test shared embedding, you need to test a model that has embeddings shared with unembeddings. stories110M (currently used in CI) is not one of them. Some examples: llama1B, llama3B, phi4-mini, etc.
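A rough way to check whether a given checkpoint actually ties embeddings and unembeddings (the tok_embeddings/output attribute names are assumptions about the torchchat model definition and may need adjusting):

    import torch

    def has_tied_embeddings(model: torch.nn.Module) -> bool:
        # Assumed attribute names; adjust to the actual torchchat Transformer.
        emb = getattr(model, "tok_embeddings", None)
        out = getattr(model, "output", None)
        if emb is None or out is None:
            return False
        # Shared embedding/unembedding means the same (or identical) weight tensor.
        return emb.weight is out.weight or torch.equal(emb.weight, out.weight)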
Hmmm ok, let's do this: @dillondesilva you can ignore the comments about testing for now. We'll make a different PR for testing, since it's a tad more involved.
if quantizer == "experimental:embedding": | ||
group_size = q_kwargs["groupsize"] | ||
bit_width = q_kwargs["bitwidth"] | ||
has_weight_zeros = q_kwargs["has_weight_zeros"] | ||
weight_granularity = ( | ||
PerAxis() if group_size == -1 else PerGroup(group_size) | ||
) | ||
weight_dtype = getattr(torch, f"int{bit_width}") | ||
weight_mapping_type = ( | ||
MappingType.ASYMMETRIC | ||
if has_weight_zeros | ||
else MappingType.SYMMETRIC | ||
) | ||
|
||
try: | ||
model = EmbeddingQuantizer( | ||
weight_dtype=weight_dtype, | ||
granularity=weight_granularity, | ||
mapping_type=weight_mapping_type, | ||
use_fallback=False, | ||
).quantize(model) | ||
except Exception as e: | ||
print( | ||
"Encountered error during quantization with experimental EmbeddingQuantization: {e}" | ||
) |
Why would you not put this into the EmbeddingQuantizer class, or a subclass derived from EmbeddingQuantizer?
At a minimum, lines 134-146 seem to be copy-pasta that's replicated multiple times, e.g., right below at lines 159-171, and 195-204, 241-252, ...
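One way to address that repetition would be a small helper that every experimental branch calls before constructing its quantizer (a sketch only, not part of this PR; the helper name and the torchao import paths are assumptions):

    import torch
    from torchao.quantization.granularity import PerAxis, PerGroup
    from torchao.quantization.quant_primitives import MappingType

    def _experimental_quant_args(q_kwargs: dict) -> dict:
        # Translate torchchat-style q_kwargs into the keyword arguments the
        # experimental quantizers expect (hypothetical helper, for illustration).
        group_size = q_kwargs["groupsize"]
        bit_width = q_kwargs["bitwidth"]
        has_weight_zeros = q_kwargs["has_weight_zeros"]
        return dict(
            weight_dtype=getattr(torch, f"int{bit_width}"),
            granularity=PerAxis() if group_size == -1 else PerGroup(group_size),
            mapping_type=(
                MappingType.ASYMMETRIC if has_weight_zeros else MappingType.SYMMETRIC
            ),
        )

Each branch could then do something like EmbeddingQuantizer(**_experimental_quant_args(q_kwargs), use_fallback=False).quantize(model).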
Yep no worries - sounds like a plan! I'll hop onto these changes soon.
Overview
This PR enables the use of EmbeddingQuantizer and SharedEmbeddingQuantizer as quantization configuration options.
Running lintrunner appears to have changed several lines in this file. However, the edits made strictly to enable these new experimental quantizers can be found on the following lines:
Lines 46-49: Imports for EmbeddingQuantizer and SharedEmbeddingQuantizer
Lines 202-234: Logic for setting EmbeddingQuantizer and SharedEmbeddingQuantizer options
Lines 1033-1034: Including options to map quantization config types to the corresponding quantizers
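For reference, the keys the new branches read from q_kwargs suggest a config shaped roughly like the following (values are illustrative only, and the shared variant is assumed to take the same keys):

    quantize_options = {
        "experimental:embedding": {"bitwidth": 4, "groupsize": 32, "has_weight_zeros": False},
        # Assumed to mirror the embedding options; must run before the other quantizers.
        "experimental:shared": {"bitwidth": 4, "groupsize": 32, "has_weight_zeros": False},
    }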