How we trained an AI to judge squat depth
In a sanctioned powerlifting meet, three referees decide whether your squat counts. The rule is simple to state and brutal to satisfy: the crease of your hip has to drop below the top of your knee. Miss it by a centimeter and the lift is three red lights.
We set out to build software that makes that same call from a single photo. This is the story of what broke along the way, and the lessons that survived contact with reality. We're not going to hand you our model weights or our recipe, but we'll be honest about every wall we walked into, because the walls turned out to be more instructive than the wins.
Why "just use a pose model" doesn't work
If you've touched computer vision recently, your first instinct is right: off-the-shelf pose estimators already find hips and knees. Feed in a photo, get back a skeleton with a dot on each joint. Problem solved?
Not even close. A generic pose model is trained to put a keypoint somewhere reasonable on the hip, good enough to animate a video-game character or count reps. But "the hip" to a pose model is a fuzzy region near the pelvis. "The hip crease" to a referee is a specific anatomical fold, and depth is decided by where that fold sits relative to the top of the kneecap. The difference between a generic hip keypoint and the actual crease is often larger than the entire pass/fail margin of the lift.
So the real problem was never "find the joints." It was: take a coarse skeleton and refine two specific landmarks to a precision where a few pixels decide the verdict. That framing shaped everything that followed.
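To make the "few pixels decide the verdict" point concrete, here is a minimal sketch of the final comparison once both landmarks are known. It is not our production code. The `Landmark` type, the function names, and every coordinate below are hypothetical, and it assumes image coordinates where y grows downward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Landmark:
    x: float  # pixels from the left edge
    y: float  # pixels from the top edge (image y grows downward)


def depth_margin_px(hip_crease: Landmark, knee_top: Landmark) -> float:
    """Positive when the hip crease sits below the top of the knee."""
    return hip_crease.y - knee_top.y


def is_below_parallel(hip_crease: Landmark, knee_top: Landmark) -> bool:
    """The pass/fail call: the crease must be strictly lower than the knee top."""
    return depth_margin_px(hip_crease, knee_top) > 0.0


# Hypothetical coordinates, chosen only to illustrate the margin.
knee = Landmark(x=420.0, y=610.0)
refined_crease = Landmark(x=300.0, y=613.0)  # 3 px below the knee top
coarse_hip = Landmark(x=305.0, y=598.0)      # generic hip keypoint, 15 px higher

print(is_below_parallel(refined_crease, knee))  # True
print(is_below_parallel(coarse_hip, knee))      # False
```

In this made-up example the generic keypoint sits 15 pixels away from the crease while the lift passes by only 3, so the same squat flips from good to no-lift depending on which point you trust.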
The hardest bug is the one inflating your numbers
Early on, our model looked great. Suspiciously great. Each new round of training reported better accuracy than the last, and we were patting ourselves on the back.
Then we found two things. First, a handful of images in our evaluation set had also been in our training set. The model wasn't generalizing on those; it had memorized them. A leak that small doesn't sound like much, but on a precision task with a small evaluation set, it quietly lifts every score and flatters every experiment.
Second, and worse, a bug in our data-loading code was silently dropping a large fraction of our training examples. We had spent weeks running dozens of experiments and ranking them against each other, and a chunk of that ranking was built on a foundation that was both leaking and starved of data at the same time.
Here's the uncomfortable part: once we fixed both bugs and re-ran everything honestly, our headline accuracy got worse. Not because the model degraded, but because we'd been grading on a curve. We made a rule for ourselves after that, and we still follow it: every result is tagged honest or not-trustworthy, and we never quote a number we can't reproduce on data the model has never seen. It's slower. It's also the only reason we trust anything we ship.
If there's one thing to take from this whole piece: a model that's working and a model that's lying to you look identical on a dashboard. The work is telling them apart.
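Both bugs are cheap to guard against once you know to look. The sketch below is an illustration, not our pipeline: `find_leaks` and `check_loader_count` are hypothetical helper names, and the byte strings stand in for real image files. It uses exact SHA-256 hashes, so it only catches byte-identical duplicates. Resized or re-encoded copies would need a perceptual hash or a similar near-duplicate check.

```python
import hashlib


def content_hash(image_bytes: bytes) -> str:
    """Fingerprint an image by its raw bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def find_leaks(train: dict[str, bytes], evaluation: dict[str, bytes]) -> list[tuple[str, str]]:
    """Return (eval_name, train_name) pairs whose bytes are identical."""
    train_by_hash = {content_hash(data): name for name, data in train.items()}
    leaks = []
    for name, data in evaluation.items():
        match = train_by_hash.get(content_hash(data))
        if match is not None:
            leaks.append((name, match))
    return sorted(leaks)


def check_loader_count(expected: int, loaded: int, tolerance: float = 0.0) -> None:
    """Raise if the loader yielded fewer examples than the manifest lists."""
    missing = expected - loaded
    if missing > expected * tolerance:
        raise ValueError(f"loader dropped {missing} of {expected} examples")


# Stand-in data: the bytes here are placeholders for real image files.
train = {"a.jpg": b"squat-1", "b.jpg": b"squat-2"}
evaluation = {"x.jpg": b"squat-2", "y.jpg": b"squat-3"}

print(find_leaks(train, evaluation))  # [('x.jpg', 'b.jpg')]
check_loader_count(expected=100, loaded=100)  # passes silently
```

Running both checks before every training run, and failing loudly, turns a silent inflation into an error you cannot ignore.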
More data made it worse
Conventional wisdom says throw more data at the problem. We had multiple sources of training images and the obvious move was to combine all of them.
When we did, accuracy dropped. It turned out a carefully curated, smaller set of clean examples beat the everything-in-the-pot version. The extra data wasn't free: some of it was noisier, some of it was subtly off-distribution from how people actually photograph their squats, and mixing it in dragged the model toward the average instead of the truth. The win came from being willing to exclude data, which feels wrong every time you do it.
"More data" is advice for when your data is clean and your problem is variety. When your problem is precision, more data is often just more noise wearing a helpful disguise.
Synthetic data is a scalpel, not a firehose
We experimented with generating synthetic squat images to expand our training set. The results were not "synthetic is good" or "synthetic is bad." They were maddeningly specific: the same batch of synthetic images helped the model on one of the two landmarks and hurt it on the other.
That forced a better conclusion: synthetic data shifts the distribution your model learns, and whether that shift helps depends entirely on the task. Used as a scalpel, aimed at the exact gap you're trying to close, it earns its place. Used as a firehose to bulk up sample counts, it just moves your problem somewhere you weren't looking.
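A split result like that only shows up if you report error per landmark instead of one blended number. Here is a minimal sketch of that breakdown. The landmark names, the `per_landmark_error` helper, and all coordinates are invented for illustration and are not our measurements.

```python
import math
from statistics import mean

Point = tuple[float, float]


def per_landmark_error(
    predictions: list[dict[str, Point]],
    labels: list[dict[str, Point]],
) -> dict[str, float]:
    """Mean Euclidean pixel error, reported separately for each landmark."""
    errors: dict[str, list[float]] = {}
    for pred, truth in zip(predictions, labels):
        for name, (tx, ty) in truth.items():
            px, py = pred[name]
            errors.setdefault(name, []).append(math.hypot(px - tx, py - ty))
    return {name: mean(values) for name, values in errors.items()}


# Invented labels and predictions for two photos.
labels = [
    {"hip_crease": (100.0, 200.0), "knee_top": (180.0, 190.0)},
    {"hip_crease": (110.0, 210.0), "knee_top": (190.0, 195.0)},
]
baseline = [
    {"hip_crease": (103.0, 204.0), "knee_top": (180.0, 192.0)},
    {"hip_crease": (110.0, 215.0), "knee_top": (190.0, 197.0)},
]
with_synthetic = [
    {"hip_crease": (100.0, 203.0), "knee_top": (184.0, 193.0)},
    {"hip_crease": (110.0, 213.0), "knee_top": (193.0, 199.0)},
]

print(per_landmark_error(baseline, labels))        # {'hip_crease': 5.0, 'knee_top': 2.0}
print(per_landmark_error(with_synthetic, labels))  # {'hip_crease': 3.0, 'knee_top': 5.0}
```

In this toy case the blended average moves from 3.5 to 4.0 pixels, which reads as "synthetic hurt a little" and hides the fact that one landmark improved while the other got much worse.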
The famous pretrained backbone didn't transfer
At one point we tried the textbook upgrade: swap our small from-scratch vision component for a well-known backbone pretrained on millions of internet images. This is supposed to be free money: transfer learning, standing on the shoulders of giants.
It failed across the board. Every variant was worse than the humble from-scratch version it replaced.
The reason, in hindsight, is obvious. Those famous backbones learned their features on large, content-rich photos: dogs, cars, faces. We were feeding them tiny, tightly cropped patches around a single joint. The high-level features that make a network great at "is this a golden retriever" are close to useless on a postage-stamp crop of a knee. And the bigger network had far more parameters to fit on a dataset that, by computer-vision standards, is small. More capacity plus the wrong priors is a recipe for overfitting, not magic.
Borrowed knowledge only transfers when the new problem rhymes with the old one. Ours didn't.
Knowing when to stop
Eventually our error on held-out photos got small enough that we hit a different kind of wall: the model started disagreeing with our human labels about as often as two careful humans disagree with each other. When you're squabbling over a couple of pixels, you're no longer measuring the model, you're measuring the ambiguity in the ground truth itself.
That's a strange and important place to reach. It means the next gains don't come from another clever training trick; they come from a harder, larger evaluation set and better labels. We ran several more rounds of experiments that each came back "within noise," no clear winner, and the discipline was to not ship them, to not fool ourselves into seeing improvement in what was really just randomness between training runs.
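One way to make "within noise" a rule instead of a feeling is to train each configuration with several random seeds and only accept a gain that clears the seed-to-seed spread. The sketch below is a crude heuristic under that assumption, not a formal statistical test and not our exact procedure. The function name, the threshold `k`, and the error values are all hypothetical.

```python
from statistics import mean, stdev


def is_real_gain(
    baseline_errors: list[float],
    candidate_errors: list[float],
    k: float = 2.0,
) -> bool:
    """True only if the mean improvement exceeds k times the seed-to-seed spread.

    Each list holds one held-out error per training seed (lower is better).
    """
    noise = max(stdev(baseline_errors), stdev(candidate_errors))
    improvement = mean(baseline_errors) - mean(candidate_errors)
    return improvement > k * noise


# Hypothetical held-out errors in pixels, four seeds per configuration.
baseline = [4.1, 3.8, 4.3, 4.0]
candidate_a = [3.9, 4.2, 3.7, 4.1]  # slightly better on average, but noisy
candidate_b = [3.1, 3.0, 3.2, 3.1]  # clearly better on every seed

print(is_real_gain(baseline, candidate_a))  # False
print(is_real_gain(baseline, candidate_b))  # True
```

Candidate A has a lower mean than the baseline, which is exactly the kind of result that tempts you to ship. The spread between seeds is larger than the gain, so the rule says no.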
A lot of machine learning work is knowing the difference between a real gain and a lucky roll of the dice. We've shipped fewer models because of that rule, and slept better for it.
What we're comfortable not telling you
You'll notice we haven't given you architecture diagrams, hyperparameters, or our exact accuracy figures. That's deliberate. Those are the parts that took years and money to find, and they're what make the product hard to copy. The lessons above, though, cost us just as much, and they generalize far beyond squats.
If you lift, all of this disappears behind a single screen: you record a rep, and a moment later you get a verdict you can trust, the same call three referees would make, minus the travel and the entry fee. That trust didn't come from a model. It came from being ruthlessly honest about when the model was wrong.
See it call your depth
Squat Eye uses AI pose estimation to measure your hip crease relative to your knee, frame by frame. Private, on-device, free.
Download Squat Eye