Roblox avatars still don’t have the most expressive faces, even after recent updates, but the company is working to fix that. In a recent blog post titled “Inside the Tech: Enabling Facial Expressions for Avatars,” the company shed light on the technical challenges and innovative solutions behind this ambitious endeavor.
According to the blog, real-time facial tracking is one of the key hurdles the team faces. The goal is to have avatars mirror users’ real-life expressions through their device’s webcam, creating a more immersive and interactive experience. Achieving this across a vast range of devices with varying processing power, however, is a major challenge. Roblox is building one of its first deep learning models, designed to capture facial expressions on multiple devices in real time.
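To make that pipeline concrete, here is a minimal sketch of what such a tracking loop could look like. Everything in it is hypothetical: the `model` and `avatar` objects and methods like `predict_facs_weights` are stand-ins I made up for illustration, not Roblox’s actual API.

```python
import cv2  # OpenCV, used here only for webcam capture

def run_tracking_loop(model, avatar):
    """Hypothetical loop: webcam frame in, FACS weights out, avatar updated."""
    capture = cv2.VideoCapture(0)  # open the default webcam
    try:
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            # The deep learning model maps a face image to ~50 FACS control
            # weights (blink, jaw drop, brow raise, ...), each in [0, 1].
            weights = model.predict_facs_weights(frame)
            avatar.apply_facs_weights(weights)  # drive the avatar's face rig
    finally:
        capture.release()
```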
Another challenge the post lists is simplifying the process of creating dynamic avatars with facial animation capabilities. Traditionally, this process has been complex, requiring creators to know how to rig models and apply linear blend skinning techniques. As a solution, the team is developing technology that automatically rigs and cages models based on static designs. This is smart and would lower the technical barrier for creators, but it is difficult to build.
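For context on what that rigging has to produce: linear blend skinning deforms each mesh vertex by a weighted blend of bone transforms. Here is a rough numpy version of the core formula, my own sketch rather than anything from Roblox’s post:

```python
import numpy as np

def linear_blend_skinning(vertices, weights, bone_matrices):
    """Deform vertices by blending bone transforms.

    vertices:      (V, 3) rest-pose positions
    weights:       (V, B) per-vertex bone weights, rows sum to 1
    bone_matrices: (B, 4, 4) current bone transforms
    """
    # Homogeneous coordinates so the 4x4 matrices can translate too.
    homo = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)  # (V, 4)
    # Transform every vertex by every bone: (B, V, 4)
    per_bone = np.einsum('bij,vj->bvi', bone_matrices, homo)
    # Blend the per-bone results with the skinning weights: (V, 4)
    blended = np.einsum('vb,bvi->vi', weights, per_bone)
    return blended[:, :3]
```

Auto-rigging means computing those per-vertex `weights` (and the bones themselves) from a static mesh automatically, which is exactly the hard part.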
Roblox utilizes the industry-standard FACS (Facial Action Coding System) to represent expressions. FACS describes facial movements like blinking, lip stretching, and eyebrow raising through roughly 50 different controls. The team uses a combination of real and synthetic data to train its deep learning model effectively.
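In code, a FACS-based expression boils down to a small set of named weights. The control names below are illustrative placeholders, not Roblox’s exact list:

```python
# A hypothetical FACS frame: each control is a weight in [0, 1] describing
# how strongly that facial action is engaged at this instant.
facs_frame = {
    "LeftEyeClosed": 1.0,   # full blink, left eye
    "RightEyeClosed": 1.0,  # full blink, right eye
    "JawDrop": 0.3,         # mouth slightly open
    "LeftBrowRaiser": 0.8,  # left eyebrow raised
}

def blend_frames(a, b, t):
    """Linearly interpolate between two FACS frames (0 <= t <= 1)."""
    keys = set(a) | set(b)
    return {k: (1 - t) * a.get(k, 0.0) + t * b.get(k, 0.0) for k in keys}
```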
Typically, human-labeled images are used for this purpose. Roblox supplements them with synthetic data generated from 3D models rendered in a variety of FACS poses and lighting conditions, which lets the model learn expressions that are hard to capture in real life.
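Here is a sketch of what that generation step might look like, assuming a hypothetical `renderer` object: sample random FACS weights and lighting, render a head, and keep the sampled weights as a free ground-truth label.

```python
import random

# Abbreviated, made-up control list; the real set has ~50 entries.
FACS_CONTROLS = ["LeftEyeClosed", "RightEyeClosed", "JawDrop", "LeftBrowRaiser"]

def make_synthetic_sample(renderer):
    """Hypothetical: render one labeled training image from random FACS weights."""
    # Random expression; the label comes for free, with no human annotation.
    weights = {c: random.random() for c in FACS_CONTROLS}
    # Randomized lighting so the model doesn't overfit to one studio setup.
    lighting = {"azimuth": random.uniform(0.0, 360.0),
                "intensity": random.uniform(0.2, 1.5)}
    image = renderer.render_head(facs_weights=weights, lighting=lighting)
    return image, weights  # (input image, ground-truth FACS label)
```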
In addition, the model is designed to dynamically adapt its processing requirements to the user’s hardware to ensure accessibility. It is split into two phases: BaseNet, which produces a fast approximation of the FACS weights, and HiFiNet, which refines them into a more accurate result. Depending on the device’s processing power, the system chooses which phases to run, so even low-end devices can display expressive avatars.
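Here is a toy version of how that selection could work. The function names, the 16 ms frame budget, and the halving heuristic are all my inventions; the blog doesn’t spell out the actual decision logic.

```python
import time

def pick_pipeline(base_net, hifi_net, warmup_frame, frame_budget_ms=16.0):
    """Decide once, at startup, whether this device can afford HiFiNet."""
    start = time.perf_counter()
    base_net(warmup_frame)  # time the fast FACS approximation
    elapsed_ms = (time.perf_counter() - start) * 1000.0

    if elapsed_ms > frame_budget_ms / 2:
        # Low-end device: BaseNet alone still gives an expressive avatar.
        return lambda frame: base_net(frame)
    # Faster device: let HiFiNet refine BaseNet's estimate each frame.
    return lambda frame: hifi_net(frame, base_net(frame))
```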
Roblox aims to ship these solutions soon, and I have to admit the adaptive two-phase model is the most impressive of the bunch. I recommend reading the blog post for the full details, but I’ve covered the important parts here. Keep in mind that the original is fairly technical, so don’t expect it to be as approachable as this summary if you’re not familiar with animation or modeling.