Model Compression vs Fine-Tuning
Common Model Compression Techniques
1. Pruning
Removes unnecessary weights or neurons from the model:
- Weight pruning: set individual small-magnitude weights to zero.
- Structured pruning: remove entire filters, channels, or layers for better hardware efficiency (example below).
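A minimal sketch of both pruning styles using PyTorch's built-in pruning utilities. The model, layer sizes, and sparsity levels are illustrative assumptions, not values from a specific paper.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small illustrative model (stand-in for a real trained network).
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Unstructured weight pruning: zero out the 30% smallest-magnitude weights
# in each Linear layer (the amount is an arbitrary example value).
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)

# Structured pruning: additionally remove 20% of whole output neurons of the
# first layer by L2 norm, which maps better to real hardware speedups.
prune.ln_structured(model[0], name="weight", amount=0.2, n=2, dim=0)

# Bake the pruning masks into the weights permanently.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")

sparsity = (model[0].weight == 0).float().mean().item()
print(f"Sparsity of first layer: {sparsity:.1%}")
```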
2. Quantization
Reduces the precision of the weights and activations:
- Post-training quantization: convert a trained model (e.g., from float32 to int8) without retraining.
- Quantization-aware training (QAT): simulate quantization during training for higher accuracy (a post-training example follows).
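A minimal sketch of post-training dynamic quantization in PyTorch; the model is an illustrative stand-in for a real trained network.

```python
import torch
import torch.nn as nn

# An illustrative float32 model (stand-in for a real trained network).
model_fp32 = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model_fp32.eval()

# Post-training dynamic quantization: weights of Linear layers are stored
# as int8 and dequantized on the fly, shrinking the model with no retraining.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 784)
print(model_int8(x).shape)  # same interface, smaller and faster on CPU
```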
3. Knowledge Distillation
A smaller "student" model learns to mimic the behavior of a larger "teacher" model:
- Helps retain performance by transferring learned knowledge from the teacher to the student (loss sketch below).
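A minimal sketch of a distillation loss that blends soft teacher targets with the usual hard-label loss. The temperature and weighting values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Blend soft-target KL loss (teacher) with hard-label cross-entropy."""
    # Soften both distributions with the temperature, then match them.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_targets, reduction="batchmean")
    kd_loss = kd_loss * (temperature ** 2)  # standard scaling for soft targets

    # Ordinary supervised loss on the true labels.
    ce_loss = F.cross_entropy(student_logits, labels)

    return alpha * kd_loss + (1 - alpha) * ce_loss

# Illustrative usage with random tensors standing in for real model outputs.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```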
4. Low-Rank Factorization
Decomposes large weight matrices into products of smaller matrices:
- Reduces the number of parameters and the amount of computation (SVD example below).
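A minimal sketch of factorizing one fully connected layer with a truncated SVD; the layer size and rank are illustrative assumptions.

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate one Linear layer with two smaller ones via truncated SVD."""
    W = layer.weight.data                       # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)

    # Keep only the top-`rank` singular components: W ~= (U_r * S_r) @ Vh_r
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = Vh[:rank, :].clone()                # (rank, in_features)
    second.weight.data = (U[:, :rank] * S[:rank]).clone()   # (out_features, rank)
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()

    return nn.Sequential(first, second)

# Illustrative example: a 1024x1024 layer (~1.05M params) replaced by a
# rank-64 factorization (~131K params).
layer = nn.Linear(1024, 1024)
compressed = factorize_linear(layer, rank=64)
x = torch.randn(2, 1024)
print((layer(x) - compressed(x)).abs().mean())  # approximation error
```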
5. Weight Sharing
Groups similar weights and shares the same value:
- Reduces model size with minimal impact on accuracy (clustering example below).
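A minimal sketch of weight sharing via k-means clustering of a single weight matrix, assuming scikit-learn is available; the matrix and cluster count are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def share_weights(weights: np.ndarray, n_clusters: int = 16):
    """Cluster weights and replace each one with its cluster centroid.

    After clustering, the layer only needs to store `n_clusters` float
    centroids plus a small integer index per weight (e.g., 4 bits for 16
    clusters) instead of a full float per weight.
    """
    flat = weights.reshape(-1, 1)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(flat)
    centroids = kmeans.cluster_centers_.flatten()        # shared values
    indices = kmeans.labels_                              # per-weight index
    shared = centroids[indices].reshape(weights.shape)    # reconstructed layer
    return shared, centroids, indices

# Illustrative example on a random weight matrix.
W = np.random.randn(256, 256).astype(np.float32)
W_shared, centroids, idx = share_weights(W, n_clusters=16)
print("unique values after sharing:", len(np.unique(W_shared)))
print("mean absolute change:", np.abs(W - W_shared).mean())
```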
6. Compact Network Design
Use lightweight architectures from the start:
- Examples: MobileNet, SqueezeNet, EfficientNet (see below).
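For example, a lightweight architecture can be loaded directly from torchvision and pointed at a new task; the 10-class head is an illustrative assumption.

```python
import torch.nn as nn
from torchvision import models

# MobileNetV2 is designed to be small and fast from the start
# (depthwise-separable convolutions instead of full convolutions).
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)

# Replace the classifier head for an illustrative 10-class task.
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 10)

num_params = sum(p.numel() for p in model.parameters())
print(f"MobileNetV2 parameters: {num_params / 1e6:.1f}M")  # roughly 3.5M
```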
7. Neural Architecture Search (NAS)
Automatically finds efficient architectures under constraints:
- Tools include Google's AutoML and FBNet.
8. Parameter Quantization and Huffman Coding
Used in models like Deep Compression:
- Combines quantization, pruning, and entropy (Huffman) coding for aggressive compression (rough sketch below).
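A rough sketch of the idea behind this pipeline, estimating the compressed size of one weight matrix after pruning, quantization, and Huffman coding. The sparsity level, bit width, and matrix size are illustrative assumptions, and the estimate ignores sparse-index and codebook overhead.

```python
import heapq
import numpy as np
from collections import Counter

def huffman_code_lengths(symbols):
    """Compute Huffman code lengths (in bits) for each distinct symbol."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {s: 1 for s in freq}
    # Heap entries: (frequency, tie-breaker, {symbol: current code length}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# 1. Prune: zero out small-magnitude weights (illustrative 70% sparsity).
W = np.random.randn(512, 512).astype(np.float32)
threshold = np.quantile(np.abs(W), 0.7)
W[np.abs(W) < threshold] = 0.0

# 2. Quantize: bucket the surviving weights into 16 levels (4-bit indices).
nonzero = W[W != 0]
bins = np.quantile(nonzero, np.linspace(0, 1, 17)[1:-1])
codes = np.digitize(nonzero, bins).tolist()

# 3. Huffman-code the quantization indices and estimate the size.
lengths = huffman_code_lengths(codes)
compressed_bits = sum(lengths[c] for c in codes)
original_bits = W.size * 32
print(f"compression ratio (weights only): {original_bits / compressed_bits:.1f}x")
```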
Model Fine-Tuning
Common Fine-Tuning Techniques
1. Full Fine-Tuning
- Update all layers of the pretrained model using the new dataset.
- Most flexible, but can lead to overfitting when the dataset is small.
- Example: fine-tuning all of BERT’s layers on a sentiment analysis dataset (sketched below).
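A minimal sketch of full fine-tuning with the Hugging Face Transformers Trainer. The dataset (IMDB), subset size, and hyperparameters are illustrative assumptions.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Pretrained BERT with a fresh classification head; ALL parameters are trainable.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# An illustrative sentiment dataset; any labeled text dataset works.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-sentiment",
    learning_rate=2e-5,          # a small LR is typical for full fine-tuning
    num_train_epochs=2,
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"].shuffle(seed=0).select(range(2000)),
)
trainer.train()
```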
2. Feature Extraction (Partial Fine-Tuning)
- Freeze most of the pretrained layers and train only the final layers (e.g., the classifier).
- Faster, and helps prevent overfitting when data is limited.
- Common with CNNs (e.g., using ResNet as a feature extractor; see the example below).
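A minimal sketch of using a frozen ResNet as a feature extractor; the number of classes and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a pretrained ResNet and freeze the whole backbone.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace only the final classifier; this is the single trainable layer.
num_classes = 10  # illustrative assumption
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters go into the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```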
3. Layer-wise Freezing/Unfreezing
- Gradually unfreeze layers during training:
  - First, train only the top layers.
  - Then slowly unfreeze lower layers for fine-grained tuning.
- Useful for transfer learning in NLP and CV (a sketch follows this list).
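A minimal sketch of gradual unfreezing on a ResNet backbone, unfreezing one layer group per epoch from the top down. The backbone, grouping, and schedule are illustrative assumptions.

```python
import torch.nn as nn
from torchvision import models

# Illustrative backbone: group ResNet layers from top (closest to the head)
# to bottom, and unfreeze one extra group at the start of each epoch.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 10)  # new task head

layer_groups = [model.fc, model.layer4, model.layer3, model.layer2, model.layer1]

# Start fully frozen except the new head.
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True

for epoch in range(len(layer_groups)):
    # Unfreeze one more group per epoch, top layers first.
    for group in layer_groups[: epoch + 1]:
        for param in group.parameters():
            param.requires_grad = True
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"epoch {epoch}: {trainable / 1e6:.2f}M trainable params")
    # ... run one epoch of training here ...
```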
4. Discriminative Learning Rates
- Use different learning rates for different layers:
  - Lower learning rate for early (frozen or pretrained) layers.
  - Higher learning rate for newly added or task-specific layers.
- Helps stabilize training and retain learned representations (example below).
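A minimal sketch using optimizer parameter groups so the pretrained backbone and the new head get different learning rates; the model and rates are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 10)  # illustrative new head

# Split parameters: pretrained backbone vs. freshly initialized head.
backbone_params = [p for name, p in model.named_parameters()
                   if not name.startswith("fc.")]
head_params = list(model.fc.parameters())

# The backbone moves slowly to preserve learned representations,
# while the new head learns quickly.
optimizer = torch.optim.AdamW([
    {"params": backbone_params, "lr": 1e-5},   # pretrained layers
    {"params": head_params, "lr": 1e-3},       # task-specific layers
])
```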
5. Adapter Layers (Parameter-Efficient Fine-Tuning)
- Add small bottleneck layers (adapters) to a frozen backbone and train only those adapters (sketched below).
- Widely used in NLP (e.g., with BERT, T5) to minimize the number of trainable parameters.
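A minimal sketch of a bottleneck adapter module with a residual connection; the hidden and bottleneck sizes are illustrative assumptions, and in practice such modules are inserted after frozen transformer sub-layers.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A small bottleneck adapter with a residual connection.

    Only these few parameters are trained; the backbone stays frozen.
    """
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, hidden_size)    # project back up
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))      # residual add

# Illustrative usage: hidden states from a frozen backbone pass through it.
hidden_states = torch.randn(2, 16, 768)    # (batch, seq_len, hidden)
adapter = Adapter()
print(adapter(hidden_states).shape)        # torch.Size([2, 16, 768])

trainable = sum(p.numel() for p in adapter.parameters())
print(f"adapter parameters: {trainable}")  # ~0.1M vs ~110M for BERT-base
```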
6. Prompt Tuning / Prefix Tuning (NLP-specific)
- Add learnable prompts or prefixes to the input of a frozen language model (see the sketch below).
- Efficient for large language models when full fine-tuning is too expensive.
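A minimal sketch of soft prompt tuning: learnable prompt embeddings are prepended to the token embeddings of a frozen model. The prompt length and hidden size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PromptTuning(nn.Module):
    """Learnable soft-prompt embeddings prepended to the token embeddings.

    The language model itself stays frozen; only `prompt` is trained.
    """
    def __init__(self, num_prompt_tokens: int = 20, hidden_size: int = 768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, hidden_size) * 0.02)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, hidden)
        batch_size = token_embeddings.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompt, token_embeddings], dim=1)

# Illustrative usage with random embeddings standing in for the output of a
# frozen model's embedding layer.
embeddings = torch.randn(4, 32, 768)
soft_prompt = PromptTuning(num_prompt_tokens=20)
print(soft_prompt(embeddings).shape)  # torch.Size([4, 52, 768])
```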
7. Low-Rank Adaptation (LoRA)
- Introduce low-rank matrices into pretrained layers and train only those (sketch below).
- Reduces memory and computation while achieving competitive performance.
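A minimal sketch of a LoRA-style wrapper around a single Linear layer; the rank and alpha values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen Linear layer with a trainable low-rank update.

    Output = W x + (alpha / r) * B(A(x)), where A and B are small matrices.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False        # freeze the pretrained weights

        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # start as a zero (identity) update
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# Illustrative usage on a single layer.
layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # 2 * 768 * 8 = 12,288 vs ~590K frozen
```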
When to Use Fine-Tuning
- When you have a small or domain-specific dataset.
- When pretraining from scratch is too resource-intensive.
- When model performance on your specific task needs improvement.
Difference Between Model Compression and Fine-Tuning
Model compression and fine-tuning serve different purposes in the machine learning lifecycle, though they can sometimes be used together. Here's a clear comparison:
Simple Analogy:
- Model compression is like zipping a file to make it easier to store or send.
- Fine-tuning is like editing a document to make it more relevant for a specific audience.
| Feature | Model Compression | Fine-Tuning |
|---|---|---|
| Goal | Reduce model size, latency, or resource usage | Improve or adapt model performance on a new or specific dataset |
| Main focus | Efficiency (smaller, faster models) | Accuracy and adaptation (often domain-specific) |
| Techniques used | Pruning, quantization, distillation, low-rank factorization | Continued training on new data (often with a smaller learning rate) |
| Changes to architecture | Often modifies or simplifies the architecture | Typically keeps the same architecture |
| When used | After training a large model, to optimize it for deployment | When transferring a pretrained model to a new task or dataset |
| Impact on accuracy | May cause slight degradation (ideally minimal) | Often improves accuracy on the new task |
| Example use case | Deploying a model on mobile or embedded systems | Adapting BERT for a legal text classification task |