Microsoft Shows Progress Toward Real-Time AI-Generated Game Worlds
Published on February 20, 2025 at 02:30AM
An anonymous reader quotes a report from Ars Technica: For a while now, many AI researchers have been working to integrate a so-called "world model" into their systems. Ideally, these models could infer a simulated understanding of how in-game objects and characters should behave based on video footage alone, then create fully interactive video that instantly simulates new playable worlds based on that understanding. Microsoft Research's new World and Human Action Model (WHAM), revealed today in a paper published in the journal Nature, shows how far those models have advanced in a short time. But it also shows how much further we have to go before the dream of AI crafting complete, playable gameplay footage from just some basic prompts and sample video footage becomes a reality.

Much like Google's Genie model before it, WHAM starts by training on "ground truth" gameplay video and input data provided by actual players. In this case, that data comes from Bleeding Edge, a four-on-four online brawler released in 2020 by Microsoft subsidiary Ninja Theory. By collecting actual player footage since launch (as allowed under the game's user agreement), Microsoft gathered the equivalent of seven player-years' worth of gameplay video paired with real player inputs. Early in that training process, Microsoft Research's Katja Hofmann said the model would get easily confused, generating inconsistent clips that would "deteriorate [into] these blocks of color." After 1 million training updates, though, the WHAM model started showing a basic understanding of complex gameplay interactions, such as a power cell item exploding after three hits from the player or the movements of a specific character's flight abilities. The results continued to improve as the researchers threw more computing resources and larger models at the problem, according to the Nature paper.

To see just how well the WHAM model generated new gameplay sequences, Microsoft tested the model by giving it up to one second's worth of real gameplay footage and asking it to generate what subsequent frames would look like based on new simulated inputs. To test the model's consistency, Microsoft used actual human input strings to generate up to two minutes of new AI-generated footage, which was then compared to actual gameplay results using the Fréchet Video Distance metric. Microsoft boasts that WHAM's outputs can stay broadly consistent for up to two minutes without falling apart, with simulated footage lining up well with actual footage even as items and environments come in and out of view. That's an improvement over even the "long horizon memory" of Google's Genie 2 model, which topped out at a minute of consistent footage. Microsoft also tested WHAM's ability to respond to a diverse set of randomized inputs not found in its training data. These tests showed broadly appropriate responses to many different input sequences, based on human annotations of the resulting footage, even as the best models fell a bit short of the "human-to-human baseline."

The most interesting result of Microsoft's WHAM tests, though, might be the persistence of in-game objects. Microsoft provided examples of developers inserting images of new in-game objects or characters into pre-existing gameplay footage. The WHAM model could then incorporate that new image into its subsequent generated frames, with appropriate responses to player input or camera movements.
With just five edited frames, the new object "persisted" appropriately in subsequent frames anywhere from 85 to 98 percent of the time, according to the Nature paper.
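To make the generation setup described above more concrete, here is a minimal, purely hypothetical sketch of an action-conditioned rollout: prime a world model with a short stretch of real frames and player inputs, then generate new frames for a fresh input sequence, feeding each prediction back in autoregressively. The `WorldModel` class, its methods, and the data shapes are illustrative assumptions, not Microsoft's WHAM API.

```python
# Hypothetical action-conditioned rollout, loosely in the spirit of the setup
# the article describes; names and interfaces are assumptions for illustration.
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

Frame = np.ndarray          # e.g. an (H, W, 3) image array
Action = Tuple[float, ...]  # e.g. stick positions and button states

@dataclass
class WorldModel:
    """Stand-in for a learned model mapping a (frame, action) history to the next frame."""
    history: List[Tuple[Frame, Action]] = field(default_factory=list)

    def observe(self, frame: Frame, action: Action) -> None:
        self.history.append((frame, action))

    def predict_next_frame(self, action: Action) -> Frame:
        # A real model would run a learned network over its context window;
        # this placeholder just echoes the most recent frame.
        last_frame, _ = self.history[-1]
        return last_frame.copy()

def rollout(model: WorldModel,
            priming_frames: List[Frame],
            priming_actions: List[Action],
            new_actions: List[Action]) -> List[Frame]:
    """Condition on real footage, then generate frames for simulated inputs."""
    for frame, action in zip(priming_frames, priming_actions):
        model.observe(frame, action)

    generated: List[Frame] = []
    for action in new_actions:
        frame = model.predict_next_frame(action)
        model.observe(frame, action)  # feed the prediction back autoregressively
        generated.append(frame)
    return generated
```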
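The consistency comparison mentioned above relies on the Fréchet Video Distance, which measures how far apart the feature statistics of generated and real footage are. The sketch below shows the underlying Fréchet-distance calculation between Gaussians fit to two sets of feature vectors; real FVD first embeds each clip with a pretrained video network, and the stand-in feature arrays here are hypothetical.

```python
# Illustrative Frechet-distance calculation of the kind behind FVD-style metrics.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_features: np.ndarray, gen_features: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two sets of feature vectors."""
    mu_r, mu_g = real_features.mean(axis=0), gen_features.mean(axis=0)
    cov_r = np.cov(real_features, rowvar=False)
    cov_g = np.cov(gen_features, rowvar=False)

    # Matrix square root of the covariance product; drop tiny imaginary
    # components introduced by numerical error.
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy usage with random stand-in embeddings (lower scores mean a closer match).
rng = np.random.default_rng(0)
real = rng.normal(size=(256, 64))
generated = rng.normal(loc=0.1, size=(256, 64))
print(frechet_distance(real, generated))
```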
Read more of this story at Slashdot.