Code repository for the paper "Tracking People by Predicting 3D Appearance, Location & Pose".
Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, Jitendra Malik.
This repository provides an implementation of our paper PHALP, covering installation, a demo that runs on any video, dataset preparation, and evaluation on datasets.
This branch contains code supporting our latest work: 4D-Humans.
For the original PHALP code, please see the initial release branch.
After installing the PyTorch dependency, you may install our phalp package directly as:
pip install phalp[all]@git+https://github.com/brjathu/PHALP.git
Step-by-step instructions
git clone https://github.com/brjathu/PHALP.git
cd PHALP
conda create -n phalp python=3.10
conda activate phalp
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -e .[all]
To run our code on a video, please specify the input video with video.source and an output directory with video.output_dir:
python scripts/demo.py video.source=assets/videos/gymnasts.mp4 video.output_dir='outputs'
The output directory will contain a video rendering of the tracklets and a .pkl file containing the tracklets with 3D pose and shape (see structure below).
You can specify various kinds of input sources, for example a video file, a YouTube video, or a directory of images:
# for a video file
python scripts/demo.py video.source=assets/videos/vid.mp4
# for a YouTube video
python scripts/demo.py video.source=\'"https://www.youtube.com/watch?v=xEH_5T9jMVU"\'
# for a directory of images
python scripts/demo.py video.source=<directory_path>
Custom bounding boxes
In addition to these options, you can also give images and bounding boxes as inputs, so that the model only does tracking on the given bounding boxes. To do this, set video.source to a .pkl file in which each key is a frame name; the absolute path to the image is computed as os.path.join(video.base_path, frame_name). The value of each key is a dictionary holding gt_bbox, gt_class, and gt_track_id, laid out as in the following example. gt_boxes is a np.ndarray of shape (N, 4), where each row is a bounding box in the format [x1, y1, x2, y2]. You can also provide gt_class and gt_track_id so that they are stored in the final output.
gt_data[frame_id] = {
"gt_bbox": gt_boxes,
"extra_data": {
"gt_class": [],
"gt_track_id": [],
}
}
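As a minimal sketch of how such a file could be assembled, the snippet below builds the per-frame dictionary from a small set of made-up annotations. The helper names (`build_gt_data`, `xywh_to_xyxy`, `save_gt_data`) and the annotation tuples are hypothetical and not part of PHALP, and writing the file with joblib is an assumption; please check how the loader in your PHALP version reads the .pkl file.

```python
import numpy as np


def xywh_to_xyxy(boxes_xywh):
    """Convert (N, 4) boxes from [x0, y0, w, h] to [x1, y1, x2, y2]."""
    boxes = np.asarray(boxes_xywh, dtype=np.float32).reshape(-1, 4)
    out = boxes.copy()
    out[:, 2] = boxes[:, 0] + boxes[:, 2]  # x2 = x0 + w
    out[:, 3] = boxes[:, 1] + boxes[:, 3]  # y2 = y0 + h
    return out


def build_gt_data(annotations):
    """Build the per-frame dictionary described above.

    `annotations` (hypothetical input layout) maps a frame name to a list of
    (box_xyxy, class_id, track_id) tuples.
    """
    gt_data = {}
    for frame_name, items in annotations.items():
        boxes = np.array([box for box, _, _ in items], dtype=np.float32)
        gt_data[frame_name] = {
            "gt_bbox": boxes.reshape(-1, 4),  # shape (N, 4), [x1, y1, x2, y2]
            "extra_data": {
                "gt_class": [class_id for _, class_id, _ in items],
                "gt_track_id": [track_id for _, _, track_id in items],
            },
        }
    return gt_data


def save_gt_data(gt_data, path):
    """Write the dictionary to disk (assumption: joblib is the expected format)."""
    import joblib  # imported lazily so building the dict does not require joblib

    joblib.dump(gt_data, path)


# Made-up annotations; frame names follow the ffmpeg pattern %06d.jpg.
annotations = {
    "000001.jpg": [([10, 20, 110, 220], 0, 1), ([300, 40, 380, 260], 0, 2)],
    "000002.jpg": [([12, 22, 112, 222], 0, 1)],
}
gt_data = build_gt_data(annotations)
```

Calling `save_gt_data(gt_data, "assets/videos/gt_tracks.pkl")` would then write the file. `xywh_to_xyxy` is included because the output .pkl stores boxes as top-left corner plus dimensions, whereas the input expects corner coordinates.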
Here is an example of how to give bounding boxes and track-ids to the model and get the renderings.
mkdir assets/videos/gymnasts
ffmpeg -i assets/videos/gymnasts.mp4 -q:v 2 assets/videos/gymnasts/%06d.jpg
python scripts/demo.py \
render.enable=True \
video.output_dir=test_gt_bbox \
use_gt=True \
video.base_path=assets/videos/gymnasts \
video.source=assets/videos/gt_tracks.pkl
You can specify the start and end of the video to be tracked, e.g. track from frame 50 to 100:
python scripts/demo.py video.source=assets/videos/vid.mp4 video.start_frame=50 video.end_frame=100
Tracking without extracting frames
If the video is too long and extracting the frames is too time-consuming, you can set video.extract_video=False. This uses the torchvision backend and keeps only the timestamps of the video in memory. When this is enabled, you can give the start time and end time of the video in seconds.
python scripts/demo.py video.source=assets/videos/vid.mp4 video.extract_video=False video.start_time=1s video.end_time=2s
We support multiple types of visualization in render.type: HUMAN_MESH (default) renders the full human mesh, HUMAN_MASK visualizes the segmentation masks, HUMAN_BBOX visualizes the bounding boxes with track-ids, and TRACKID_<id>_MESH renders the full human mesh but for track <id> only:
# render full human mesh
python scripts/demo.py video.source=assets/videos/vid.mp4 render.type=HUMAN_MESH
# render segmentation mask
python scripts/demo.py video.source=assets/videos/vid.mp4 render.type=HUMAN_MASK
# render bounding boxes with track-ids
python scripts/demo.py video.source=assets/videos/vid.mp4 render.type=HUMAN_BBOX
# render a single track id, say 0
python scripts/demo.py video.source=assets/videos/vid.mp4 render.type=TRACKID_0_MESH
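If you want to compare several visualizations of the same video, the commands above can be scripted. The following is only a sketch: the helpers `build_demo_command` and `render_all` are hypothetical, writing each render type to its own subdirectory is our own choice, and we assume that video.output_dir and render.type can be combined in a single call.

```python
import subprocess
import sys

# Render types listed above (TRACKID_<id>_MESH is left out because it needs an id).
RENDER_TYPES = ("HUMAN_MESH", "HUMAN_MASK", "HUMAN_BBOX")


def build_demo_command(source, output_dir, render_type):
    """Return the argument list for one scripts/demo.py invocation."""
    return [
        sys.executable,
        "scripts/demo.py",
        f"video.source={source}",
        f"video.output_dir={output_dir}",
        f"render.type={render_type}",
    ]


def render_all(source, output_root="outputs", dry_run=True):
    """Build one command per render type and run them unless dry_run is set."""
    commands = [
        build_demo_command(source, f"{output_root}/{render_type.lower()}", render_type)
        for render_type in RENDER_TYPES
    ]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if the demo exits with an error.
            subprocess.run(command, check=True)
    return commands


# Dry run: only builds the commands, nothing is executed.
commands = render_all("assets/videos/vid.mp4")
```

Pass `dry_run=False` from the repository root to actually launch the runs one after another.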
More rendering types
In addition to these settings, for rendering meshes, PHALP supports head-mask visualization, which only renders the upper body of the person so that users can see the actual person and the track in the same video. To enable this, please set `render.head_mask=True`.
# for rendering detected and occluded people
python scripts/demo.py video.source=assets/videos/vid.mp4 render.head_mask=True
You can also visualize the 2D projected keypoints by setting render.show_keypoints=True [TODO].
By default, PHALP does not track through shot boundaries. To enable this, please set detect_shots=True.
# for tracking through shot boundaries
python scripts/demo.py video.source=assets/videos/vid.mp4 detect_shots=True
Additional Notes
- For debugging purposes, you can set debug=True to disable the rich progress bar.
The .pkl file containing tracks, 3D poses, etc. is stored under <video.output_dir>/results, and is a 2-level dictionary:
Detailed structure
import joblib
results = joblib.load("<video.output_dir>/results/<video_name>.pkl")
results = {
# A dictionary for each frame.
'vid_frame0.jpg': {
'2d_joints': List[np.array(90,)], # 45x 2D joints for each detection
'3d_joints': List[np.array(45,3)], # 45x 3D joints for each detection
'annotations': List[Any], # custom annotations for each detection
'appe': List[np.array(4096,)], # appearance features for each detection
'bbox': List[[x0 y0 w h]], # 2D bounding box (top-left corner and dimensions) for each track (detections + ghosts)
'camera': List[[tx ty tz]], # camera translation (wrt image) for each detection
'camera_bbox': List[[tx ty tz]], # camera translation (wrt bbox) for each detection
'center': List[[cx cy]], # 2D center of bbox for each detection
'class_name': List[int], # class ID for each detection (0 for humans)
'conf': List[float], # confidence score for each detection
'frame_path': 'vid_frame0.jpg', # Frame identifier
'loca': List[np.array(99,)], # location features for each detection
'mask': List[mask], # RLE-compressed mask for each detection
'pose': List[np.array(229,)], # pose feature (concatenated SMPL params) for each detection
'scale': List[float], # max(width, height) for each detection
'shot': int, # Shot number
'size': List[[imgw imgh]], # Image dimensions for each detection
'smpl': List[Dict_SMPL], # SMPL parameters for each detection: betas (10), body_pose (23x3x3), global_orient (3x3)
'tid': List[int], # Track ID for each detection
'time': int, # Frame number
'tracked_bbox': List[[x0 y0 w h]], # 2D bounding box (top-left corner and dimensions) for each detection
'tracked_ids': List[int], # Track ID for each detection
'tracked_time': List[int], # for each detection, time since it was last seen
},
'vid_frame1.jpg': {
...
},
...
}
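Since the results are keyed by frame, a common first step is to regroup them by track. The sketch below does this for a small hand-made dictionary that mimics the structure above. The helper `group_by_track` is hypothetical, and we assume that `tracked_ids` and `tracked_bbox` are index-aligned within each frame.

```python
from collections import defaultdict


def group_by_track(results):
    """Regroup frame-level results into {track_id: [(time, bbox), ...]}."""
    tracks = defaultdict(list)
    for frame in results.values():
        # Assumption: the i-th id in tracked_ids belongs to the i-th tracked_bbox.
        for track_id, bbox in zip(frame["tracked_ids"], frame["tracked_bbox"]):
            tracks[track_id].append((frame["time"], list(bbox)))
    # Sort each track by frame number so it reads as a trajectory.
    for observations in tracks.values():
        observations.sort(key=lambda item: item[0])
    return dict(tracks)


# Hand-made stand-in for a loaded results dictionary (only the keys used here).
results = {
    "vid_frame1.jpg": {
        "time": 1,
        "tracked_ids": [1],
        "tracked_bbox": [[12, 22, 100, 200]],
    },
    "vid_frame0.jpg": {
        "time": 0,
        "tracked_ids": [1, 2],
        "tracked_bbox": [[10, 20, 100, 200], [300, 40, 80, 220]],
    },
}
tracks = group_by_track(results)
```

With a real file, `results` would come from `joblib.load` as shown above; each bounding box is given as top-left corner and dimensions, [x0 y0 w h].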
Coming soon.
Parts of the code are taken or adapted from the following repos:
If you find this code useful for your research or use the data generated by our method, please consider citing the following paper:
@inproceedings{rajasegaran2022tracking,
title={Tracking People by Predicting 3{D} Appearance, Location \& Pose},
author={Rajasegaran, Jathushan and Pavlakos, Georgios and Kanazawa, Angjoo and Malik, Jitendra},
booktitle={CVPR},
year={2022}
}