diff --git a/docs/solutions/pose.md b/docs/solutions/pose.md
index 9190484e71..064e2eb193 100644
--- a/docs/solutions/pose.md
+++ b/docs/solutions/pose.md
@@ -2,6 +2,8 @@
 layout: default
 title: Pose
 parent: Solutions
+has_children: true
+has_toc: false
 nav_order: 5
 ---

@@ -21,10 +23,9 @@

 ## Overview

 Human pose estimation from video plays a critical role in various applications
-such as
-[quantifying physical exercises](#pose-classification-and-repetition-counting),
-sign language recognition, and full-body gesture control. For example, it can
-form the basis for yoga, dance, and fitness applications. It can also enable the
+such as [quantifying physical exercises](./pose_classification.md), sign
+language recognition, and full-body gesture control. For example, it can form
+the basis for yoga, dance, and fitness applications. It can also enable the
 overlay of digital content and information on top of the physical world in
 augmented reality.
@@ -387,121 +388,6 @@ on how to build MediaPipe examples.
 *   Target:
     [`mediapipe/examples/desktop/upper_body_pose_tracking:upper_body_pose_tracking_gpu`](https://github.com/google/mediapipe/tree/master/mediapipe/examples/desktop/upper_body_pose_tracking/BUILD)

-## Pose Classification and Repetition Counting
-
-One of the applications
-[BlazePose](https://ai.googleblog.com/2020/08/on-device-real-time-body-pose-tracking.html)
-can enable is fitness. More specifically - pose classification and repetition
-counting. In this section we'll provide basic guidance on building a custom pose
-classifier with the help of a
-[Colab](https://drive.google.com/file/d/19txHpN8exWhstO6WVkfmYYVC6uug_oVR/view?usp=sharing)
-and wrap it in a simple
-[fitness app](https://mediapipe.page.link/mlkit-pose-classification-demo-app)
-powered by [ML Kit](https://developers.google.com/ml-kit). Push-ups and squats
-are used for demonstration purposes as the most common exercises.
-
-![pose_classification_pushups_and_squats.gif](../images/mobile/pose_classification_pushups_and_squats.gif) |
-:--------------------------------------------------------------------------------------------------------: |
-*Fig 4. Pose classification and repetition counting with MediaPipe Pose.* |
-
-We picked the
-[k-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
-(k-NN) as the classifier. It's simple and easy to start with. The algorithm
-determines the object's class based on the closest samples in the training set.
-To build it, one needs to:
-
-* Collect image samples of the target exercises and run pose prediction on
-  them,
-* Convert obtained pose landmarks to a representation suitable for the k-NN
-  classifier and form a training set,
-* Perform the classification itself followed by repetition counting.
-
-### Training Set
-
-To build a good classifier appropriate samples should be collected for the
-training set: about a few hundred samples for each terminal state of each
-exercise (e.g., "up" and "down" positions for push-ups). It's important that
-collected samples cover different camera angles, environment conditions, body
-shapes, and exercise variations.
-
-![pose_classification_pushups_un_and_down_samples.jpg](../images/mobile/pose_classification_pushups_un_and_down_samples.jpg) |
-:--------------------------------------------------------------------------------------------------------------------------: |
-*Fig 5. Two terminal states of push-ups.* |
-
-To transform samples into a k-NN classifier training set, either
-[basic](https://drive.google.com/file/d/1z4IM8kG6ipHN6keadjD-F6vMiIIgViKK/view?usp=sharing)
-or
-[extended](https://drive.google.com/file/d/19txHpN8exWhstO6WVkfmYYVC6uug_oVR/view?usp=sharing)
-Colab could be used. They both use the
-[Python Solution API](#python-solution-api) to run the BlazePose models on given
-images and dump predicted pose landmarks to a CSV file. Additionally, the
-extended Colab provides useful tools to find outliers (e.g., wrongly predicted
-poses) and underrepresented classes (e.g., not covering all camera angles) by
-classifying each sample against the entire training set. After that, you'll be
-able to test the classifier on an arbitrary video right in the Colab.
-
-### Classification
-
-Code of the classifier is available both in the
-[extended](https://drive.google.com/file/d/19txHpN8exWhstO6WVkfmYYVC6uug_oVR/view?usp=sharing)
-Colab and in the
-[ML Kit demo app](https://mediapipe.page.link/mlkit-pose-classification-demo-app).
-Please refer to them for details of the approach described below.
-
-The k-NN algorithm used for pose classification requires a feature vector
-representation of each sample and a metric to compute the distance between two
-such vectors to find the nearest pose samples to a target one.
-
-To convert pose landmarks to a feature vector, we use pairwise distances between
-predefined lists of pose joints, such as distances between wrist and shoulder,
-ankle and hip, and two wrists. Since the algorithm relies on distances, all
-poses are normalized to have the same torso size and vertical torso orientation
-before the conversion.
-
-![pose_classification_pairwise_distances.png](../images/mobile/pose_classification_pairwise_distances.png) |
-:--------------------------------------------------------------------------------------------------------: |
-*Fig 6. Main pairwise distances used for the pose feature vector.* |
-
-To get a better classification result, k-NN search is invoked twice with
-different distance metrics:
-
-* First, to filter out samples that are almost the same as the target one but
-  have only a few different values in the feature vector (which means
-  differently bent joints and thus other pose class), minimum per-coordinate
-  distance is used as distance metric,
-* Then average per-coordinate distance is used to find the nearest pose
-  cluster among those from the first search.
-
-Finally, we apply
-[exponential moving average](https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average)
-(EMA) smoothing to level any noise from pose prediction or classification. To do
-that, we search not only for the nearest pose cluster, but we calculate a
-probability for each of them and use it for smoothing over time.
-
-### Repetition Counter
-
-To count the repetitions, the algorithm monitors the probability of a target
-pose class. Let's take push-ups with its "up" and "down" terminal states:
-
-* When the probability of the "down" pose class passes a certain threshold for
-  the first time, the algorithm marks that the "down" pose class is entered.
-* Once the probability drops below the threshold, the algorithm marks that the
-  "down" pose class has been exited and increases the counter.
-
-To avoid cases when the probability fluctuates around the threshold (e.g., when
-the user pauses between "up" and "down" states) causing phantom counts, the
-threshold used to detect when the state is exited is actually slightly lower
-than the one used to detect when the state is entered. It creates an interval
-where the pose class and the counter can't be changed.
-
-### Future Work
-
-We are actively working on improving BlazePose GHUM 3D's Z prediction. It will
-allow us to use joint angles in the feature vectors, which are more natural and
-easier to configure (although distances can still be useful to detect touches
-between body parts) and to perform rotation normalization of poses and reduce
-the number of camera angles required for accurate k-NN classification.
-
 ## Resources

 * Google AI Blog:
@@ -512,5 +398,3 @@ the number of camera angles required for accurate k-NN classification.
 * [Models and model cards](./models.md#pose)
 * [Web demo](https://code.mediapipe.dev/codepen/pose)
 * [Python Colab](https://mediapipe.page.link/pose_py_colab)
-* [Pose Classification Colab (Basic)](https://mediapipe.page.link/pose_classification_basic)
-* [Pose Classification Colab (Extended)](https://mediapipe.page.link/pose_classification_extended)
diff --git a/docs/solutions/pose_classification.md b/docs/solutions/pose_classification.md
new file mode 100644
index 0000000000..9595dc7d1c
--- /dev/null
+++ b/docs/solutions/pose_classification.md
@@ -0,0 +1,142 @@
+---
+layout: default
+title: Pose Classification
+parent: Pose
+grand_parent: Solutions
+nav_order: 1
+---
+
+# Pose Classification
+{: .no_toc }
+
+<details open markdown="block">
+  <summary>
+    Table of contents
+  </summary>
+  {: .text-delta }
+1. TOC
+{:toc}
+</details>
+---
+
+## Overview
+
+One of the applications
+[BlazePose](https://ai.googleblog.com/2020/08/on-device-real-time-body-pose-tracking.html)
+can enable is fitness, more specifically pose classification and repetition
+counting. On this page we provide basic guidance on building a custom pose
+classifier with the help of [Colabs](#colabs) and wrap it in a simple
+[fitness app](https://mediapipe.page.link/mlkit-pose-classification-demo-app)
+powered by [ML Kit](https://developers.google.com/ml-kit). Push-ups and squats
+are used for demonstration purposes as the most common exercises.
+
+![pose_classification_pushups_and_squats.gif](../images/mobile/pose_classification_pushups_and_squats.gif) |
+:--------------------------------------------------------------------------------------------------------: |
+*Fig 1. Pose classification and repetition counting with MediaPipe Pose.* |
+
+We picked the
+[k-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
+(k-NN) as the classifier. It's simple and easy to start with. The algorithm
+determines the object's class based on the closest samples in the training set.
+
+**To build it, one needs to:**
+
+1. Collect image samples of the target exercises and run pose prediction on
+   them,
+2. Convert the obtained pose landmarks to a representation suitable for the
+   k-NN classifier and form a training set using these [Colabs](#colabs),
+3. Perform the classification itself followed by repetition counting (e.g., in
+   the
+   [ML Kit demo app](https://mediapipe.page.link/mlkit-pose-classification-demo-app)).
+
+## Training Set
+
+To build a good classifier, appropriate samples should be collected for the
+training set: a few hundred samples for each terminal state of each exercise
+(e.g., "up" and "down" positions for push-ups). It's important that the
+collected samples cover different camera angles, environment conditions, body
+shapes, and exercise variations.
+
+![pose_classification_pushups_un_and_down_samples.jpg](../images/mobile/pose_classification_pushups_un_and_down_samples.jpg) |
+:--------------------------------------------------------------------------------------------------------------------------: |
+*Fig 2. Two terminal states of push-ups.* |
+
+To transform samples into a k-NN classifier training set, either the
+[`Pose Classification Colab (Basic)`] or the
+[`Pose Classification Colab (Extended)`] can be used. Both use the
+[Python Solution API](./pose.md#python-solution-api) to run the BlazePose
+models on the given images and dump predicted pose landmarks to a CSV file.
+Additionally, the [`Pose Classification Colab (Extended)`] provides useful
+tools to find outliers (e.g., wrongly predicted poses) and underrepresented
+classes (e.g., not covering all camera angles) by classifying each sample
+against the entire training set. After that, you'll be able to test the
+classifier on an arbitrary video right in the Colab.
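+
+For orientation, the landmark-dumping step looks roughly like the sketch
+below, which uses the [Python Solution API](./pose.md#python-solution-api).
+The folder layout (one sub-folder of images per pose class) and the CSV
+format are assumptions made for illustration, not the exact Colab code:
+
+```python
+import csv
+import os
+
+import cv2
+import mediapipe as mp
+
+mp_pose = mp.solutions.pose
+
+# Assumed layout: images/pushups_up/*.jpg, images/pushups_down/*.jpg, ...
+with mp_pose.Pose(static_image_mode=True) as pose, \
+     open('landmarks.csv', 'w', newline='') as csv_out:
+  writer = csv.writer(csv_out)
+  for class_name in sorted(os.listdir('images')):
+    class_dir = os.path.join('images', class_name)
+    for file_name in sorted(os.listdir(class_dir)):
+      image = cv2.imread(os.path.join(class_dir, file_name))
+      result = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
+      if result.pose_landmarks is None:
+        continue  # Skip images where no pose was detected.
+      # 33 landmarks with normalized x, y and z per image.
+      row = [class_name, file_name]
+      for lmk in result.pose_landmarks.landmark:
+        row += [lmk.x, lmk.y, lmk.z]
+      writer.writerow(row)
+```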
+
+## Classification
+
+The code of the classifier is available both in the
+[`Pose Classification Colab (Extended)`] and in the
+[ML Kit demo app](https://mediapipe.page.link/mlkit-pose-classification-demo-app).
+Please refer to them for details of the approach described below.
+
+The k-NN algorithm used for pose classification requires a feature vector
+representation of each sample and a metric to compute the distance between
+two such vectors, in order to find the nearest pose samples to a target one.
+
+To convert pose landmarks to a feature vector, we use pairwise distances
+between predefined lists of pose joints, such as the distances between wrist
+and shoulder, between ankle and hip, and between the two wrists. Since the
+algorithm relies on distances, all poses are normalized to have the same
+torso size and a vertical torso orientation before the conversion.
+
+![pose_classification_pairwise_distances.png](../images/mobile/pose_classification_pairwise_distances.png) |
+:--------------------------------------------------------------------------------------------------------: |
+*Fig 3. Main pairwise distances used for the pose feature vector.* |
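+
+To make the conversion concrete, here is a minimal sketch of the
+normalization and embedding steps. The joint pairs below are an illustrative
+subset (the Colab uses a longer, tuned list), and the rotation to a vertical
+torso orientation is omitted for brevity:
+
+```python
+import numpy as np
+
+# MediaPipe Pose landmark indices (subset of the 33-landmark topology).
+LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12
+LEFT_WRIST, RIGHT_WRIST = 15, 16
+LEFT_HIP, RIGHT_HIP = 23, 24
+LEFT_ANKLE, RIGHT_ANKLE = 27, 28
+
+
+def normalize_pose(landmarks):
+  """Translates the pose to the hip center and scales it by torso size.
+
+  `landmarks` is a (33, 3) array of (x, y, z) pose landmark coordinates.
+  """
+  hip_center = (landmarks[LEFT_HIP] + landmarks[RIGHT_HIP]) / 2
+  shoulder_center = (landmarks[LEFT_SHOULDER] + landmarks[RIGHT_SHOULDER]) / 2
+  torso_size = np.linalg.norm(shoulder_center - hip_center)
+  return (landmarks - hip_center) / torso_size
+
+
+def pose_embedding(landmarks):
+  """Flattens pairwise joint distances of a normalized pose into a vector."""
+  lmks = normalize_pose(landmarks)
+  pairs = [(LEFT_WRIST, LEFT_SHOULDER), (RIGHT_WRIST, RIGHT_SHOULDER),
+           (LEFT_ANKLE, LEFT_HIP), (RIGHT_ANKLE, RIGHT_HIP),
+           (LEFT_WRIST, RIGHT_WRIST)]
+  # Per-axis distances are kept separate so that the per-coordinate metrics
+  # described below can compare individual components.
+  return np.array([lmks[a] - lmks[b] for a, b in pairs]).flatten()
+```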
+--- + +## Overview + +One of the applications +[BlazePose](https://ai.googleblog.com/2020/08/on-device-real-time-body-pose-tracking.html) +can enable is fitness. More specifically - pose classification and repetition +counting. In this section we'll provide basic guidance on building a custom pose +classifier with the help of [Colabs](#colabs) and wrap it in a simple +[fitness app](https://mediapipe.page.link/mlkit-pose-classification-demo-app) +powered by [ML Kit](https://developers.google.com/ml-kit). Push-ups and squats +are used for demonstration purposes as the most common exercises. + +![pose_classification_pushups_and_squats.gif](../images/mobile/pose_classification_pushups_and_squats.gif) | +:--------------------------------------------------------------------------------------------------------: | +*Fig 1. Pose classification and repetition counting with MediaPipe Pose.* | + +We picked the +[k-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) +(k-NN) as the classifier. It's simple and easy to start with. The algorithm +determines the object's class based on the closest samples in the training set. + +**To build it, one needs to:** + +1. Collect image samples of the target exercises and run pose prediction on + them, +2. Convert obtained pose landmarks to a representation suitable for the k-NN + classifier and form a training set using these [Colabs](#colabs), +3. Perform the classification itself followed by repetition counting (e.g., in + the + [ML Kit demo app](https://mediapipe.page.link/mlkit-pose-classification-demo-app)). + +## Training Set + +To build a good classifier appropriate samples should be collected for the +training set: about a few hundred samples for each terminal state of each +exercise (e.g., "up" and "down" positions for push-ups). It's important that +collected samples cover different camera angles, environment conditions, body +shapes, and exercise variations. + +![pose_classification_pushups_un_and_down_samples.jpg](../images/mobile/pose_classification_pushups_un_and_down_samples.jpg) | +:--------------------------------------------------------------------------------------------------------------------------: | +*Fig 2. Two terminal states of push-ups.* | + +To transform samples into a k-NN classifier training set, both +[`Pose Classification Colab (Basic)`] and +[`Pose Classification Colab (Extended)`] could be used. They use the +[Python Solution API](./pose.md#python-solution-api) to run the BlazePose models +on given images and dump predicted pose landmarks to a CSV file. Additionally, +the [`Pose Classification Colab (Extended)`] provides useful tools to find +outliers (e.g., wrongly predicted poses) and underrepresented classes (e.g., not +covering all camera angles) by classifying each sample against the entire +training set. After that, you'll be able to test the classifier on an arbitrary +video right in the Colab. + +## Classification + +Code of the classifier is available both in the +[`Pose Classification Colab (Extended)`] and in the +[ML Kit demo app](https://mediapipe.page.link/mlkit-pose-classification-demo-app). +Please refer to them for details of the approach described below. + +The k-NN algorithm used for pose classification requires a feature vector +representation of each sample and a metric to compute the distance between two +such vectors to find the nearest pose samples to a target one. 
+
+## Repetition Counting
+
+To count the repetitions, the algorithm monitors the probability of a target
+pose class. Let's take push-ups, with their "up" and "down" terminal states:
+
+* When the probability of the "down" pose class passes a certain threshold
+  for the first time, the algorithm marks that the "down" pose class has been
+  entered.
+* Once the probability drops below the threshold, the algorithm marks that
+  the "down" pose class has been exited and increases the counter.
+
+To avoid cases where the probability fluctuates around the threshold (e.g.,
+when the user pauses between "up" and "down" states), causing phantom counts,
+the threshold used to detect when the state is exited is actually slightly
+lower than the one used to detect when the state is entered. This creates an
+interval in which neither the pose class nor the counter can change. A
+minimal sketch of this dual-threshold logic is given at the bottom of this
+page.
+
+## Future Work
+
+We are actively working on improving BlazePose GHUM 3D's Z prediction. Better
+depth will allow us to use joint angles in the feature vectors, which are
+more natural and easier to configure (although distances can still be useful
+to detect touches between body parts), and to perform rotation normalization
+of poses, reducing the number of camera angles required for accurate k-NN
+classification.
+
+## Colabs
+
+* [`Pose Classification Colab (Basic)`]
+* [`Pose Classification Colab (Extended)`]
+
+[`Pose Classification Colab (Basic)`]: https://mediapipe.page.link/pose_classification_basic
+[`Pose Classification Colab (Extended)`]: https://mediapipe.page.link/pose_classification_extended
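+
+For completeness, below is a minimal sketch of the dual-threshold counter
+described in the Repetition Counting section. The threshold values are
+illustrative, and the input is assumed to be the smoothed per-class
+probabilities produced by a classifier such as the one sketched above:
+
+```python
+class RepetitionCounter:
+  """Counts exits from a target pose class using threshold hysteresis."""
+
+  def __init__(self, class_name, enter_threshold=0.8, exit_threshold=0.6):
+    # The exit threshold is deliberately lower than the enter threshold, so
+    # a probability hovering between the two cannot toggle the state and
+    # produce phantom counts.
+    self._class_name = class_name
+    self._enter_threshold = enter_threshold
+    self._exit_threshold = exit_threshold
+    self._pose_entered = False
+    self.count = 0
+
+  def __call__(self, probs):
+    """Updates the count from one frame of smoothed class probabilities."""
+    prob = probs.get(self._class_name, 0.0)
+    if not self._pose_entered:
+      self._pose_entered = prob > self._enter_threshold
+    elif prob < self._exit_threshold:
+      self._pose_entered = False
+      self.count += 1  # One full repetition: entered, then exited.
+    return self.count
+```
+
+One counter is kept per exercise, e.g. `RepetitionCounter('pushups_down')`,
+and is fed the smoothed probabilities frame by frame.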