Generative adversarial networks (GANs) have attained photo-realistic quality. However, how to best control the image content remains an open challenge. We introduce LatentKeypointGAN, a two-stage GAN that is trained end-to-end on the classical GAN objective yet internally conditioned on a set of sparse keypoints with associated appearance embeddings that respectively control the position and style of the generated objects and their parts. A major difficulty that we address with suitable network architectures and training schemes is disentangling the image into spatial and appearance factors without domain knowledge or supervision signals. We demonstrate that LatentKeypointGAN provides an interpretable latent space that can be used to re-arrange the generated images by re-positioning keypoints and exchanging keypoint embeddings, such as combining the eyes, nose, and mouth from different images to generate portraits. In addition, the explicit generation of keypoints and matching images enables a new, GAN-based methodology for unsupervised keypoint detection.
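This conditioning implies a simple interface that all of the edits below manipulate. The following minimal sketch is an assumption for illustration, not our released code: `generator`, the keypoint count `K`, and the embedding size `D` are hypothetical names and shapes.

```python
import torch

K, D = 10, 128                            # assumed number of keypoints / embedding size
keypoints = torch.rand(1, K, 2) * 2 - 1   # (x, y) locations in [-1, 1] image coordinates
part_embeds = torch.randn(1, K, D)        # one appearance embedding per keypoint
bg_embed = torch.randn(1, D)              # global background embedding

# image = generator(keypoints, part_embeds, bg_embed)  # hypothetical call -> (1, 3, H, W)
```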
In this video, we fix all the embeddings and only change the keypoint locations. In the first example, we change all keypoint locations; in subsequent examples, we move individual parts to edit the face locally. The keypoints can be moved faithfully while the GAN maintains the overall integrity of a natural face.
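Conceptually, this edit only rewrites keypoint coordinates before the generator call. A minimal sketch under the hypothetical interface above (the mouth indices are illustrative, not from our code):

```python
import torch

def move_keypoints(keypoints, indices, offset):
    """Shift a subset of keypoints by a 2D offset; all embeddings stay untouched."""
    edited = keypoints.clone()
    edited[:, indices] += offset          # broadcasts the (2,) offset over the subset
    return edited

keypoints = torch.rand(1, 10, 2) * 2 - 1
mouth_idx = [7, 8]                        # hypothetical indices of the mouth keypoints
moved = move_keypoints(keypoints, mouth_idx, torch.tensor([0.0, 0.1]))
# frame = generator(moved, part_embeds, bg_embed)  # same embeddings, new locations
```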
In this example, we interpolate a subset of the keypoint embeddings from a source to a target person while fixing all other embeddings and the keypoint locations. The changed keypoints are marked with crosses. In the first example, we replace all keypoint embeddings; in subsequent ones, we exchange only subsets. This makes it possible to exchange features between faces.
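In code, this amounts to a linear blend on a subset of embedding rows. A sketch under the same hypothetical interface (`eye_idx` is an assumed index set):

```python
import torch

def blend_part_embeddings(src, tgt, indices, alpha):
    """Linearly interpolate the selected keypoint embeddings from src to tgt;
    all other embeddings and all keypoint locations remain fixed."""
    out = src.clone()
    out[:, indices] = torch.lerp(src[:, indices], tgt[:, indices], alpha)
    return out

src_embeds = torch.randn(1, 10, 128)      # embeddings of the source face
tgt_embeds = torch.randn(1, 10, 128)      # embeddings of the target face
eye_idx = [2, 3]                          # hypothetical indices of the eye keypoints
for alpha in (0.0, 0.5, 1.0):             # alpha = 1 fully swaps in the target eyes
    blended = blend_part_embeddings(src_embeds, tgt_embeds, eye_idx, alpha)
    # frame = generator(keypoints, blended, bg_embed)
```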
Besides keypoint locations and embeddings, the background embedding can be changed too. Here, we analyze the effect of changing the background while fixing the keypoint locations and keypoint embeddings. For each face, we show an interpolation between three different backgrounds. The background embedding controls global features such as background texture and hair.
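The same linear blending applies to the single background vector. A sketch of walking through three backgrounds (again with hypothetical names and shapes):

```python
import torch

src_bg, mid_bg, tgt_bg = (torch.randn(1, 128) for _ in range(3))

# Interpolate through three backgrounds while keypoints and part embeddings stay fixed.
for start, end in ((src_bg, mid_bg), (mid_bg, tgt_bg)):
    for alpha in torch.linspace(0.0, 1.0, steps=5):
        bg = torch.lerp(start, end, alpha.item())
        # frame = generator(keypoints, part_embeds, bg)
```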
For bedrooms, we first fix all the keypoint embeddings and only change the keypoint locations. This lets us move key objects in the scene. Because LatentKeypointGAN is an image-based approach, not designed for video animation, the interpolation is at times wobbly. Still, object locations can be changed faithfully to create novel image compositions.
Object style and size can also be edited by interpolating a subset of the keypoint embeddings from a source to a target image while fixing all other embeddings and the keypoint locations. The changed keypoints are marked with crosses.
LatentKeypointGAN is not specific to a particular domain. When trained on human images, we can interpolate the keypoint embeddings to transfer the appearance from a source to a target person. As before, all other embeddings and the keypoint locations are fixed. While some appearance interpolations change quickly, they remain smooth; for example, the shape of the face changes smoothly and the hair grows continuously.
Limitations: We believe the reason for the fast changes (not observed for facial and bedroom images) is the lack of variation in the original dataset: it contains only five different people (although each of them has two or three sets of clothes). We also observe that the human identity is strongly correlated with the right half of the background, which is a feature of the original dataset and can also be observed in related work.
In this experiment, we fix all embeddings and move only the keypoints to change the human pose.
Limitations: When linearly moving the keypoints, we observe occasional discontinuous interpolations. We believe the reason is that linear interpolation creates unnatural intermediate poses. For example, moving only the keypoint associated with the right hand should also shift the right elbow, and such articulated motion cannot be described by linearly interpolating individual keypoints. Since these interpolated poses are not meaningful, they are not contained in the dataset and therefore cannot be reconstructed by our approach.