The dataset was created to support the training of advanced generative models for facial analysis tasks, including facial recognition, expression analysis, and biometric authentication. The specific gap it aimed to fill was the lack of high-quality, annotated synthetic datasets that closely mimic real human facial features with minimal domain gaps. This was essential for developing models capable of performing accurately on real-world data using only synthetic training data.
This dataset was created through a collaboration of researchers from several institutions: Mercedes-Benz Research & Development India, IIIT Hyderabad, and the Max Planck Institute for Intelligent Systems.
The creation of this dataset was supported by Mercedes-Benz Research & Development India through access to resources such as GPU-enabled compute servers to facilitate training and experiments.
The dataset comprises face images of synthetically generated human heads, along with synthetically generated annotations, in the form of PNG, TXT, and JSON files.
The generated dataset used for the experiments comprises 100k samples in the train set and 10k samples in the validation set. This amounts to 100k RGB images of heads at 512x512 resolution, 100k color-coded semantic masks at 512x512 resolution, 100k depth maps at 128x128 resolution, and 100k JSON files containing the 68 facial keypoint annotations.
Since the dataset is generated by the pipeline proposed in the paper, a significantly larger set can be generated with the provided tools, and there is no theoretical limit on the number of samples that could be produced.
Each frame instance is a set comprising the RGB image, semantic mask, depth map, and keypoint annotations. There is no separate format of the data.
Multiple labels are associated with each image instance: (i) the depth map, generated from the volumetric rendering stage of the Next3D generator; (ii) the semantic labels, pixel-level descriptors of face parts such as skin, eyes, head, upper lip, lower lip, and nose; and (iii) the 68 facial landmarks along the face contour, eyes, nose, and mouth. These can be used to train multi-task networks on downstream tasks.
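For illustration only, the following is a minimal sketch of how a single frame instance might be loaded. The directory layout (rgb/, masks/, depth/, keypoints/), the zero-padded file names, and the "keypoints" JSON key are hypothetical placeholders, not the dataset's confirmed structure; please consult the Kaggle dataset page for the actual layout.

```python
import json

import numpy as np
from PIL import Image


def load_sample(root, idx):
    # Hypothetical file layout with zero-padded indices; the actual
    # directory structure and file names are documented on the Kaggle
    # dataset page.
    stem = f"{idx:06d}"
    rgb = np.array(Image.open(f"{root}/rgb/{stem}.png"))      # (512, 512, 3) RGB head image
    mask = np.array(Image.open(f"{root}/masks/{stem}.png"))   # (512, 512, 3) color-coded semantic mask
    depth = np.array(Image.open(f"{root}/depth/{stem}.png"))  # (128, 128) depth map
    with open(f"{root}/keypoints/{stem}.json") as f:
        # "keypoints" is an assumed JSON key holding the 68 (x, y) landmarks.
        kps = np.array(json.load(f)["keypoints"], dtype=np.float32)
    return rgb, mask, depth, kps
```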
No. All information extracted from the proposed pipeline is present in the dataset and annotations.
No such information is necessary in the proposed dataset.
Yes, we sample a 100k-sample training set and a 10k-sample validation set from the data generation pipeline to facilitate the experiments reported in the paper. Furthermore, more data can be sampled using the proposed approach.
Yes, there are some minor misalignments between the annotations and the generated RGB images. These are due to the nature of the StyleGAN generator network used in the process. The issue and the steps taken to rectify these errors are addressed in the paper.
The dataset is self-contained and can be used for training models on facial analysis tasks. However, the dataset generation framework relies on FLAME and Next3D to enable training the models. These resources are archived and widely available, as per the details outlined in the paper and in the code repositories of our work.
No, since the dataset has been synthetically generated.
No, since the dataset has been synthetically generated and contains outputs from generative models that produce head images based solely on the provided geometry, expression parameters, and latent variables.
No, the dataset has been generated by a generator model trained on the publicly available FFHQ dataset, which contains images from various groups.
No, since the dataset has been generated synthetically from a generative model.
No biometric-level information is synthesized, so there are no hallucinations of identity-specific information.
The dataset was directly generated using a pre-trained generative model, which we modify to extract annotations and semantic-level information. The code and resources to generate more data are available through the project webpage and the paper.
Please see the response to the previous question.
The data was generated by uniformly sampling the latent variable distribution. The exact parameters are reported in the main paper, with a thorough discussion in the supplementary material.
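As a sketch only, such a sampling loop could look like the following. The generator interface, latent dimensionality, parameter counts, and [-1, 1] ranges are illustrative placeholders, not the exact settings reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def sample_annotated_head(generator, z_dim=512, n_shape=100, n_expr=50):
    # Draw a latent code and FLAME conditioning parameters by uniform
    # sampling; the dimensionalities and ranges here are placeholders.
    z = rng.uniform(-1.0, 1.0, size=z_dim)
    shape = rng.uniform(-1.0, 1.0, size=n_shape)  # FLAME shape parameters
    expr = rng.uniform(-1.0, 1.0, size=n_expr)    # FLAME expression parameters
    # The modified generator is assumed to return the rendered RGB image
    # together with its semantic mask, depth map, and 68 keypoints.
    rgb, mask, depth, keypoints = generator(z, shape, expr)
    return rgb, mask, depth, keypoints
```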
Not applicable since the dataset was synthetically generated.
The dataset was generated on a single A100 GPU with 6 parallel processes over a span of 2.5 hours. The provided code can be further optimized to enable better parallelization and multi-GPU support.
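For illustration, a minimal sketch of the kind of worker pool that could produce such a run, where generate_chunk is a hypothetical stand-in for the per-process generation entry point of the released code rather than its actual interface:

```python
from multiprocessing import Process


def generate_chunk(start, count):
    # Placeholder for the per-process generation entry point of the
    # released code: render `count` samples starting at index `start`
    # and write out the RGB image, mask, depth map, and keypoints.
    for idx in range(start, start + count):
        pass  # render and save sample `idx`


def run(total=100_000, workers=6):
    # Split the target sample count across 6 worker processes sharing
    # one GPU, matching the single-A100 setup described above.
    chunk = total // workers
    procs = [Process(target=generate_chunk, args=(i * chunk, chunk))
             for i in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()


if __name__ == "__main__":
    run()
```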
Not applicable since the dataset was synthetically generated.
Not applicable since the dataset was synthetically generated.
Not applicable since the dataset was synthetically generated.
Not applicable since the dataset was synthetically generated.
Not applicable since the dataset was synthetically generated.
The potential impact and use cases of the generated dataset are discussed thoroughly in the paper, with experiments supporting such use cases.
The dataset was directly generated using a pre-trained generative model, which we modify to extract annotations and semantic-level information. The code and resources to generate data are available through the project webpage and the paper, along with a thorough discussion of the pre- and post-processing of the dataset.
Please refer to the previous response.
Please refer to the previous response.
Yes, the dataset is used to benchmark performance on downstream facial analysis tasks such as semantic segmentation, depth estimation, and facial landmark estimation, in both single-task and multi-task frameworks. The code to reproduce the experiments is available on the project page, along with a detailed analysis in the paper.
Instructions for accessing and using the dataset via Kaggle are available on the dataset page and the project page.
The dataset can be used for human head re-enactment, face parsing, and 3D reconstruction tasks on head avatar models.
Not applicable since the dataset was generated synthetically. However, the generative model used in the pipeline can be replaced with different approaches that adhere to the guidelines mentioned in the method section of the paper.
Since the dataset was generated fairly and synthetically, we allow full use of the dataset and generative methods for academic non-commercial research as per the license guidelines.
No, the dataset is available for academic non-commercial usage and hosted on Kaggle as a public dataset.
Yes, the dataset is hosted on Kaggle at https://www.kaggle.com/datasets/shubhamdokania/synthforgedata/ and has an associated DOI: 10.34740/kaggle/dsv/8660031.
Yes, the dataset is available to access and distribute on Kaggle.
The dataset is licensed under CC BY-NC-SA 4.0, which allows non-commercial use of the dataset; we encourage further research using the proposed pipeline.
No such restrictions are imposed.
No such restrictions are imposed.
The dataset will continue to be hosted and maintained by the authors on the Kaggle platform. In the case of any changes to the terms of the hosting platform, the dataset will remain publicly available, with changes announced on all project pages, repositories, etc.
Yes, the contact information of the authors can be found on the project webpage, in the paper, and in the code repositories.
Not available and not required; the dataset is documented in the paper.
The dataset that was generated to facilitate the experiments reported in the paper will not be updated or changed. However, in the case of changes to the dataset generation code, information will be duly updated on the GitHub repository.
Only synthetically generated information is available in the dataset, so no such constraints are applicable.
Please refer to the previous responses for clarification.
Yes, the dataset is released publicly for academic, non-commercial research.