Camera Settings as Tokens: Modeling Photography on Latent Diffusion Models
Description
Text-to-image models have revolutionized content creation, enabling users to generate images from natural language prompts. While recent advances in conditioning these models offer more control over the generated results, photography, a significant artistic domain, remains inadequately integrated into these systems. Our research identifies critical gaps in modeling camera settings and photographic terms within text-to-image synthesis. Vision-language models (VLMs) such as CLIP and OpenCLIP, which typically provide the text conditioning through the cross-attention mechanisms of conditional diffusion models, struggle to represent numerical data such as camera settings effectively in their textual space. To address these challenges, we present CameraSettings20k, a new dataset aggregated from RAISE, DDPD, and PPR10K. Our curated dataset offers normalized camera settings for over 20,000 raw-format images, providing equivalent values standardized to a full-frame sensor. Furthermore, we introduce Camera Settings as Tokens, an embedding approach that leverages a LoRA adapter on Latent Diffusion Models (LDMs) to numerically control image generation according to photographic parameters such as focal length, aperture, film speed, and exposure time. Our experimental results demonstrate that the proposed approach generates promising synthesized images that obey photographic principles under the specified numerical camera settings. Our work not only bridges the gap between camera settings and user-friendly photographic control in image synthesis but also sets the stage for future explorations into more physics-aware generative models.
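To make the two key ideas concrete, the Python sketch below shows (a) one common way to convert settings recorded on a cropped sensor to full-frame-equivalent values (equivalent focal length and f-number scale with the crop factor, equivalent ISO with its square, while exposure time is unchanged), and (b) a hypothetical embedder that maps the four numeric settings to pseudo-tokens appended to the prompt embeddings used by cross-attention. The function and class names, the MLP architecture, and the token-appending scheme are illustrative assumptions, not the paper's exact normalization or model; the paper conditions an LDM through a LoRA adapter, which is not reproduced here.

import torch
import torch.nn as nn

def full_frame_equivalent(focal_mm: float, f_number: float,
                          iso: float, crop_factor: float) -> dict:
    """Convert settings recorded on a cropped sensor to their
    full-frame-equivalent values (same field of view, depth of field,
    and total light). Exposure time does not depend on sensor size."""
    return {
        "focal_length_mm": focal_mm * crop_factor,
        "f_number": f_number * crop_factor,
        "iso": iso * crop_factor ** 2,
    }

class CameraSettingEmbedder(nn.Module):
    """Hypothetical embedder: maps the four numeric settings
    (focal length, aperture, film speed, exposure time) to `n_tokens`
    pseudo-tokens in the text-encoder embedding space, so they can be
    appended to the prompt embeddings consumed by cross-attention."""
    def __init__(self, embed_dim: int = 768, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        self.embed_dim = embed_dim
        self.mlp = nn.Sequential(
            nn.Linear(4, 256),
            nn.SiLU(),
            nn.Linear(256, n_tokens * embed_dim),
        )

    def forward(self, settings: torch.Tensor) -> torch.Tensor:
        # settings: (batch, 4) tensor of normalized numeric values
        out = self.mlp(settings)
        return out.view(-1, self.n_tokens, self.embed_dim)

# Illustrative usage: append camera-setting tokens to the text condition.
# text_emb: (batch, 77, 768) from a frozen CLIP text encoder.
#   cond = torch.cat([text_emb, CameraSettingEmbedder()(settings)], dim=1)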
Event Type
Technical Papers
Time
Tuesday, 3 December 2024, 9:00am - 12:00pm JST
Location
Hall C, C Block, Level 4