I wanted to kick off a discussion on a topic that constantly comes up in our edge deployment projects: the practical trade-offs between Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT).
We’re currently trying to deploy a distilled BERT-base model for text classification onto an ARM-based embedded system. Our goal is sub-20ms inference latency with minimal accuracy degradation.
My questions to the community are:
1. When do you consider PTQ “good enough”? What’s your typical threshold for an acceptable drop in accuracy or F1 score?
2. For those who have implemented QAT, what were the most significant architectural changes or considerations you had to make, especially framework-specific ones in PyTorch or TensorFlow?
3. Any war stories or real-world project examples where you had to make this choice? What ultimately swayed your decision (dev time, model size, specific hardware)?
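To make the PTQ side of the comparison concrete, here is a minimal sketch of what post-training quantization does to a weight tensor numerically. It assumes symmetric per-tensor int8 quantization, which is only one of several schemes; the helper names (`quantize_int8`, `dequantize`, `max_abs_error`) and the sample weights are illustrative and not taken from PyTorch, TensorFlow, or our actual model.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the int8 range [-127, 127].

    Hypothetical helper for illustration; real frameworks also handle
    zero points, per-channel scales, and activation calibration.
    """
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map int8 values back to floats using the stored scale."""
    return [v * scale for v in quantized]


def max_abs_error(original: List[float], restored: List[float]) -> float:
    """Largest per-element rounding error introduced by quantization."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Made-up sample weights, not values from a real model.
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
error = max_abs_error(weights, restored)
```

The per-element error is bounded by half the scale, so a tensor with a few large outliers gets a coarse grid for all of its small values. That is one reason PTQ can hurt accuracy, and the rounding effect QAT lets the model adapt to during training.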
Looking forward to your insights!