CV4Animals 2021 summary

It’s been an exciting past few years within animal tracking, and the 2021 CV4Animals workshop is the culmination of all of this. It was fun to see so many animal tracking people gathered in one place, even if virtually. While watching all the presentations from my increasingly hot apartment, something clicked for me and I got a glimpse of the whole field and its future. Or perhaps it was just the heat. Still, here are some of the themes that jumped out to me.

1 Animal datasets


Figure 1: A selection of the latest animal datasets, from the SuperAnimal poster

Datasets featuring animals are becoming increasingly prevalent, and rightly so. Modern human pose estimation networks are only possible because of datasets like COCO Keypoints, MPII Human Pose, and Human3.6M. A good collection of common ground truth datasets will be crucial for evaluating the many different algorithms for animal pose estimation.
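To make "evaluating" concrete, here is a minimal sketch of one common keypoint metric, the percentage of correct keypoints (PCK). The choice of metric is my own assumption (none of the talks are being quoted here), and the function name, the `scale` argument, and the `alpha` threshold are hypothetical names for illustration.

```python
import math


def pck(predicted, ground_truth, scale, alpha=0.2):
    """Fraction of labeled keypoints predicted within alpha * scale pixels.

    predicted: list of (x, y) tuples, one per keypoint.
    ground_truth: list of (x, y) tuples, or None where a keypoint is unlabeled.
    scale: reference length in pixels, e.g. the animal's bounding-box size
           (an assumption; benchmarks define this differently).
    """
    correct = 0
    counted = 0
    for pred, truth in zip(predicted, ground_truth):
        if truth is None:  # skip keypoints that have no annotation
            continue
        counted += 1
        if math.dist(pred, truth) <= alpha * scale:
            correct += 1
    return correct / counted if counted else float("nan")


# Toy example: four keypoints, one of them unlabeled in the ground truth.
truth = [(10.0, 10.0), (50.0, 20.0), None, (30.0, 80.0)]
pred = [(12.0, 11.0), (70.0, 20.0), (0.0, 0.0), (30.0, 85.0)]
score = pck(pred, truth, scale=100.0, alpha=0.1)
```

With a shared dataset, the same function can be run on every network's predictions, which is what makes the comparison fair.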

Mackenzie Mathis highlighted many of the currently available datasets and, in the SuperAnimal work from her lab, showed how pretraining on them can transfer to other datasets. It seems that if you start with a network pretrained to predict animal poses (rather than to classify images from ImageNet), it generalizes much better to new animals. Based on this idea, they’ve built a new interface to crowdsource much more data for even better pretrained networks.

There is starting to be a decent number of datasets, certainly enough to test novel networks. I could see room for more, though. There need to be more cat datasets to match the dog datasets, and there are still very few insect or aquatic animal datasets.

Some keypoint datasets that were highlighted at the conference were:

  • Horse-10: 30 horses annotated for benchmarking out-of-domain robustness
  • Animal-pose: annotations on dogs, cats, cows, horses, and sheep
  • StanfordExtra: 12k images of dogs
  • BADJA: 9 videos of different animals
  • ATRW: Bounding boxes and keypoints for Amur tigers in the wild
  • 3D Cowbirds: 6300 cowbird segmentations, 1000 cowbird keypoints, and a 3D cowbird mesh model
  • AcinoSet (poster): multi-view dataset of a free-running cheetah
  • MacaquePose (poster): 13k images of macaque monkeys in the wild
  • Rat 7M: 7 million multi-view video frames of rats

Some datasets for tracking whole animals from videos highlighted at the conference:

There was also one dataset of images with classes (no boxes or keypoints):

2 Synthetic data


Figure 2: An impressive workflow for generating synthetic data, from the synthetic animated mouse poster

Along with more datasets, there was a lot of interest in synthetic data: you can get a lot of valid ground truth by spending hours making an animation instead of spending hours manually annotating images. Although there is more setup, synthetic data can make it viable to generate millions of annotated frames. I feel like this is a really promising technique and I was somewhat disappointed that none of the main talks mentioned synthetic data.
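The reason the annotations come for free is that the animation already knows where every joint is in 3D, so the 2D labels are just a camera projection away. Here is a minimal sketch of that idea, assuming a simple pinhole camera with no lens distortion. The function names, the camera parameters, and the toy mouse pose are all hypothetical and not taken from any of the posters.

```python
def project_point(point_3d, focal_length, principal_point):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to pixel (u, v)."""
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    u = focal_length * x / z + principal_point[0]
    v = focal_length * y / z + principal_point[1]
    return (u, v)


def annotate_frame(keypoints_3d, focal_length=500.0, principal_point=(320.0, 240.0)):
    """Turn the known 3D keypoints of one rendered frame into 2D labels."""
    return {
        name: project_point(point, focal_length, principal_point)
        for name, point in keypoints_3d.items()
    }


# Toy pose in camera coordinates (meters); a real pipeline would read these
# from the animation rig for every rendered frame.
mouse_pose = {"nose": (0.0, 0.0, 2.0), "tail_base": (0.2, 0.04, 2.0)}
labels = annotate_frame(mouse_pose)
```

Looping this over every frame of an animation is what turns a few hours of rigging into millions of labeled images.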

I was really impressed with the synthetic animated mouse work (image above). Their mouse videos look so good, both the original animation and the modified ones with the style altered to mimic the experiment!! If I were trying to track mice I would definitely look into their setup, perhaps along with a network pretrained on Rat 7M.

There were more synthetic posters as well:

3 3D models for animals

There was an odd focus in the main talks on estimating full 3D animal shape models, whereas this was much less emphasized in the poster sessions. Perhaps this reflects the interests of the organizers or their vision of where the field should be? Personally, I think these models are interesting and useful, but perhaps no more interesting than new datasets, synthetic data, and behavior.

In any case, I did enjoy hearing about all the different ways to skin a cat! It was interesting to contrast Silvia Zuffi’s models of quadrupeds with Ben Biggs’ quadruped models (working with Andrew Fitzgibbon) and with Marc Badger’s bird models. Each brought something different.


Figure 3: How to model a zebra according to Silvia Zuffi, from the Three-D Safari paper

Silvia pioneered the skinned multi-animal linear (SMAL) model pipeline. She built a general 3D model from scans of animal toys, which she could fit to images with annotated keypoints and silhouettes. In follow-up work, she showed how to refine the shape of the 3D models for specific animals, and then how to estimate the model directly from images of animals in the wild. She’s been pushing on adding a texture term to the model, so that the 3D models actually look colorful, which I find super cool.
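To give a feel for the "linear" part of the name, here is a toy sketch of a linear shape space: a template mesh plus a weighted sum of per-vertex offset directions. This is only my simplified reading of the idea, not the SMAL implementation. It leaves out the "skinned" part entirely (the articulated pose), and the function name, the two-vertex "mesh", and the basis directions are made up for illustration.

```python
def blend_shape(template, shape_basis, betas):
    """Return vertices = template + sum_k betas[k] * shape_basis[k].

    template: list of (x, y, z) vertices for the average animal.
    shape_basis: list of offset sets, each with one (dx, dy, dz) per vertex.
    betas: one shape coefficient per basis direction.
    """
    if len(shape_basis) != len(betas):
        raise ValueError("need exactly one coefficient per basis direction")
    vertices = []
    for i, (x, y, z) in enumerate(template):
        for beta, basis in zip(betas, shape_basis):
            dx, dy, dz = basis[i]
            x += beta * dx
            y += beta * dy
            z += beta * dz
        vertices.append((x, y, z))
    return vertices


# Toy "mesh" with two vertices and two hypothetical shape directions.
template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shape_basis = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # stretches the body along x
    [(0.0, 1.0, 0.0), (0.0, 1.0, 0.0)],  # raises the whole body along y
]
animal = blend_shape(template, shape_basis, betas=[0.5, 2.0])
```

Fitting to an image then amounts to searching for the coefficients (plus pose and camera) that best explain the annotated keypoints and silhouette.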

Ben Biggs showed how to take the SMAL model and make it work well in videos by throwing a full kitchen sink of optimization criteria at it. For good measure, he also showed a different way to refine the 3D model shape in order to estimate dogs in the wild. From what I understood, his shape refinement may be more precise than Silvia’s, as it could handle floppy dog ears.


Figure 4: Reconstructed shapes of birds match their evolutionary tree, from the Birds of a Feather paper

I particularly liked Marc Badger’s talk, because I could tell he wanted to tackle the biological questions as much as how to build a robust vision pipeline. As I’ve found, there’s a tradeoff in tackling both, but he seems to be managing it quite nicely. He showed how to extend Silvia’s and Ben’s work (and Angjoo Kanazawa’s, among others) to cowbirds in the lab and then also to birds in the wild. Connecting it to biology, he showed how the reconstructed shape of the birds matched their evolutionary lineage, which I found super cool.

4 Understanding behavior


Figure 5: Fly trajectories simulated using an artificial neural network, from this paper

Once we have all the animals tracked, what do we do with all the tracking data? Certainly, there are applications to animation and augmented reality [1]. But the biologists are particularly interested in understanding how animals behave. There were some interesting perspectives on both unsupervised and supervised decompositions of animal behavior.

In the main talk series, Kristin Branson described her latest models to predict how flies move [2], with the aim of deconstructing these models and connecting them to the emerging fly connectome. The questions were really interesting, especially on whether predicting a group of flies in a bowl is harder than a single fly in a bowl, due to the interactions amongst flies. I’m personally still curious how she plans to dive into the fitted models to get insights about behavior.

There were a few interesting posters that showed new ways to classify behavior from videos:

Overall, it feels like understanding behavior from automatically tracked kinematics is still relatively new. There aren’t as many datasets with behavior annotations for animal videos as there are for animal keypoints. The models of behavior are also not as clear as the 3D models of animals. I’m excited to see where this will all go.



[1] Silvia Zuffi showed a slide of a cute little fox in someone’s hand. This is what we need from augmented reality.


[2] I’m not sure that Kristin Branson’s work is published yet, but here is the closest paper I found among her publications.