The ice skating rink collision detection project started out as a potential client project. An advertising agency had the idea of highlighting the collision detection system of their car manufacturer client by performing live collision detection between ice skaters at the well-visited Viennese town hall ice skating rink. Although the project could ultimately not be realized for budgetary reasons, it served as an interesting opportunity to practice computer vision techniques on a real-world problem.
Initially, I was only responsible for the tracking part, while a motion designer would handle the visual realization of the collision detection. The first challenge was therefore finding a reliable architecture for detecting and tracking dozens of people on the skating rink, where the lighting was dim and cast many hard shadows. The SORT framework by Bewley et al., combined with a TensorFlow-based SSD-Inception object detector, turned out to be quite accurate for this task while still running in real time.
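For illustration, here is a minimal sketch of the per-frame tracking loop, assuming the reference SORT implementation by Bewley et al. (github.com/abewley/sort); the detect_persons wrapper and the tracker parameters are my own placeholders, not the exact configuration used in the project:

```python
import numpy as np
from sort import Sort  # reference implementation by Bewley et al.

def track_video(frames, detect_persons):
    """frames: iterable of images; detect_persons: hypothetical wrapper
    around the SSD-Inception detector, returning an (N, 5) array of
    [x1, y1, x2, y2, score] boxes per frame."""
    tracker = Sort(max_age=5, min_hits=3)  # parameter values are assumptions
    for frame in frames:
        detections = detect_persons(frame)
        if len(detections) == 0:
            detections = np.empty((0, 5))  # SORT expects an (N, 5) array
        # SORT matches detections to existing tracks via IoU and per-track
        # Kalman filters, returning (M, 5) rows of [x1, y1, x2, y2, track_id].
        yield tracker.update(detections)
```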
The opening of the ice skating rink would have been our deadline, and there was no way of testing the project beforehand on real footage. I therefore created a 3D scene in my go-to 3D package Houdini, containing a large crowd of 3D characters on an ice skating rink, from which I could render quite realistic test footage for the tracker. This scene also served as a way to generate training data, as I had written Python scripts that automatically annotated the rendered images with bounding boxes (a sketch of the idea follows after the dataset examples below). The scene contained various animated characters downloaded from Adobe Mixamo, representing a wide variety of body types, poses, and clothing. I subsequently refined this scene over several iterations to create more and more variation and prevent overfitting to the synthetically generated data. After introducing random objects and shapes on the ice area, as well as large camera and lighting variations, I was able to achieve good tracking performance on real footage.
1st generation of rendered training data: variance mainly in the persons, although some lighting variation is noticeable as well.
2nd generation: Randomly introduced non-human objects and more camera variation greatly improved the generalization ability of the neural-network-based person detector.
3rd generation of the synthetic training dataset: ImageNet background images and random line shapes on the ground resulted in good tracking performance on hockey videos found on YouTube.
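The automatic annotation mentioned above boils down to projecting each character's 3D bound into the render camera and taking the 2D extent. A minimal sketch of the idea, assuming a standard 3x4 projection matrix exported from the render camera (the function names are mine, not Houdini API):

```python
import numpy as np

def project_points(P, points_3d):
    """Project (N, 3) world-space points through a 3x4 camera matrix P,
    including the divide back from homogeneous coordinates."""
    homogeneous = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    projected = homogeneous @ P.T
    return projected[:, :2] / projected[:, 2:3]

def character_bbox(P, corners_3d, width, height):
    """2D bounding box of a character's eight 3D bound corners,
    clipped to the image resolution."""
    pts = project_points(P, corners_3d)
    x1, y1 = np.clip(pts.min(axis=0), 0, [width, height])
    x2, y2 = np.clip(pts.max(axis=0), 0, [width, height])
    return x1, y1, x2, y2
```

One such box per character and rendered frame, written alongside the image, is all the detector's training pipeline needs.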
As development progressed and the deadline drew closer, the chance of the client giving the green light to the full budget in time progressively vanished. I had found a university lecture into which I could incorporate this project as an exercise, so at some point it became clear that in order to finish the project I'd have to do the visualization work as well.

I decided to go with a concept that mirrored the probabilistic nature of predicting possible collisions. Sampling trajectories for each person and then intersecting them seemed like the most natural way to find collisions probabilistically. A sampling operation needs a distribution to sample from, so I spent a few days studying the Kalman filter, which can predict and update parameters based on a noisy signal. After some experimentation I found a way to get the Kalman filter to give me Gaussian distributions from which I could sample position, velocity, and acceleration to generate trajectories with a second-order Newtonian motion model (code sketches for these building blocks follow at the end of this post).

It was important for me to reflect the increased uncertainty in areas further away from the camera, so I transformed the estimated measurement covariance matrix of the tracked position in image space with the same homography I used to project the image-space positions onto the ground plane. This transformation is not as straightforward as it might appear, since the homography operates in homogeneous coordinates and therefore requires a perspective division to get back to the inhomogeneous coordinates used by the Kalman filter. After some experimentation I found that transforming the eigenvectors of the covariance matrix separately and then re-estimating the covariance from their outer products was a good linear approximation to this non-linear transformation, providing distributions that reflected the higher variance at distance quite well.

I then sampled five 2-second (50-frame) trajectories for each person at every frame to predict collisions. I implemented the intersection calculation in a brute-force but effective way, by additively rendering the trajectories into an off-screen OpenGL buffer and finding positions where pixel values were greater than one. To prevent false detections of self-collisions, I additively rendered three values per trajectory into three separate channels: the quadratic person ID (ID²), the linear person ID, and a constant one. For a pure self-collision, dividing the ID channels by the constant channel (which serves as a trajectory count per pixel) and taking the square root of the quadratic ID yields identical values in both ID channels; if trajectories of different persons overlap, the two values diverge. The final prototype can be seen in this video:
Note that the test sequence is rendered: we were not able to record real-world footage of the scene due to access restrictions on the balcony of the Viennese town hall. The right side shows the ground-plane projection of the scene.
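For readers interested in the details, the sketches below illustrate the three main building blocks described above. First, sampling trajectories from the Kalman posterior; the state layout [x, y, vx, vy, ax, ay] and the 25 fps frame rate are assumptions on my part:

```python
import numpy as np

def sample_trajectories(mean, cov, n_samples=5, n_steps=50, dt=1 / 25):
    """Draw full states from the Kalman posterior N(mean, cov) with the
    assumed layout [x, y, vx, vy, ax, ay], then integrate each sample
    with a constant-acceleration (second-order Newtonian) model."""
    states = np.random.multivariate_normal(mean, cov, size=n_samples)
    pos, vel, acc = states[:, 0:2], states[:, 2:4], states[:, 4:6]
    trajectories = np.empty((n_samples, n_steps, 2))
    for t in range(n_steps):
        trajectories[:, t] = pos
        pos = pos + vel * dt + 0.5 * acc * dt ** 2  # second-order update
        vel = vel + acc * dt
    return trajectories
```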
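Second, the linear approximation for pushing an image-space covariance through the ground-plane homography: warp the mean and the scaled eigenvectors individually (including the perspective divide) and rebuild the covariance from the outer products of the warped offsets. A sketch under those assumptions:

```python
import numpy as np

def warp_point(H, p):
    """Apply a 3x3 homography to a 2D point, with the perspective divide."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

def warp_covariance(H, mean, cov):
    """Approximate the ground-plane covariance by warping each scaled
    eigenvector of the 2x2 image-space covariance separately and
    summing the outer products of the resulting offsets."""
    center = warp_point(H, mean)
    values, vectors = np.linalg.eigh(cov)
    warped = np.zeros((2, 2))
    for value, vector in zip(values, vectors.T):
        offset = warp_point(H, mean + np.sqrt(value) * vector) - center
        warped += np.outer(offset, offset)
    return warped  # reproduces cov exactly when H is the identity
```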
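Finally, the per-pixel self-collision test on the additively rendered buffers. The check works because sqrt(mean(ID²)) ≥ mean(ID), with equality exactly when all contributing IDs are the same (the quadratic-mean/arithmetic-mean inequality). Here it is in NumPy instead of a shader, assuming float buffers read back from the off-screen render:

```python
import numpy as np

def collision_mask(quad_id, lin_id, count, eps=1e-3):
    """quad_id, lin_id, count: float images holding the additive per-pixel
    sums of id**2, id and 1. Flags pixels where trajectories of at least
    two *different* persons overlap."""
    safe = count > 0
    mean_id = np.divide(lin_id, count, out=np.zeros_like(lin_id), where=safe)
    rms_id = np.sqrt(np.divide(quad_id, count,
                               out=np.zeros_like(quad_id), where=safe))
    # rms_id equals mean_id only for pure self-collisions; any mismatch
    # means distinct person IDs met at this pixel.
    return (count > 1) & (np.abs(rms_id - mean_id) > eps)
```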