CPoser: An Optimization-after-Parsing Approach for Text-to-Pose Generation Using Large Language Models.
Description
Text-to-pose generation is challenging due to the complexity of natural language and human posture semantics. Utilizing large language models (LLMs) for text-to-pose generation is appealing due to their strong capabilities in text understanding and reasoning. However, as LLMs are designed for general-purpose language processing and not specifically trained for pose generation, it remains nontrivial to generate precise articulation targets for the full body using LLMs directly. To this end, we propose CPoser, a novel approach to harness the power of LLMs for text-to-pose generation, featuring a prompt parsing stage and a pose optimization stage. The parsing stage utilizes LLMs to turn text prompts into pose intermediate representations (Pose-IRs) through a set of predefined structured queries. These Pose-IRs explicitly describe specific pose conditions, such as squatting depth and knee bending angle, naturally forming an objective function that a target pose should satisfy. The optimization stage solves for expressive poses and hand gestures based on the Pose-IR objective function via robust optimization in a quantized pose prior space. The results are further refined to enhance naturalness and incorporate facial expressions. Experiments show that our approach effectively understands diverse text prompts for pose generation, surpassing existing text-to-pose methods.
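The idea that parsed pose conditions "naturally form an objective function" can be illustrated with a toy sketch. This is not the CPoser implementation: the `PoseCondition` structure, the joint names, the Huber loss, and plain gradient descent over raw joint angles are all assumptions made for illustration, and the quantized pose prior space described in the abstract is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class PoseCondition:
    """Hypothetical Pose-IR entry: a target angle (degrees) for a named joint."""
    joint: str
    target_deg: float
    weight: float = 1.0


def huber(r, delta=10.0):
    """Huber loss: quadratic near zero, linear for large residuals (a robust loss)."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)


def huber_grad(r, delta=10.0):
    """Derivative of the Huber loss with respect to the residual r."""
    if abs(r) <= delta:
        return r
    return delta if r > 0 else -delta


def objective(pose, conditions):
    """Weighted sum of robust penalties, one per parsed condition."""
    return sum(c.weight * huber(pose[c.joint] - c.target_deg) for c in conditions)


def optimize(pose, conditions, lr=0.1, steps=500):
    """Plain gradient descent on the joint angles (assumed solver, for illustration)."""
    pose = dict(pose)  # work on a copy so the input pose is left unchanged
    for _ in range(steps):
        grad = {joint: 0.0 for joint in pose}
        for c in conditions:
            grad[c.joint] += c.weight * huber_grad(pose[c.joint] - c.target_deg)
        for joint in pose:
            pose[joint] -= lr * grad[joint]
    return pose


# Hypothetical conditions an LLM parsing stage might emit for "a deep squat".
squat_ir = [
    PoseCondition("left_knee", 90.0),
    PoseCondition("right_knee", 90.0),
    PoseCondition("hip", 45.0, weight=2.0),
]
rest_pose = {"left_knee": 0.0, "right_knee": 0.0, "hip": 0.0, "elbow": 0.0}
squat_pose = optimize(rest_pose, squat_ir)
```

In this sketch, joints that no condition mentions (here `elbow`) receive zero gradient and stay at their initial value; in the approach described above, a pose prior would presumably constrain such joints instead.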
Event Type
Technical Papers
Time
Wednesday, 4 December 2024, 11:08am - 11:19am JST
Location
Hall B7 (1), B Block, Level 7