Inference-Time Prompt Projection for Safe Text-to-Image Generation with Total Variation Guarantees
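| Category | Doctoral dissertation presentation |
|---|---|
| Date and time | 2026-05-20 (Wed) 16:00-18:00 |
| Seminar room | Building 27, Room 220 |
| Speaker | 이민혁 (Seoul National University) |
| Professor in charge | 강명주 |
| Other | |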
This thesis investigates how to improve safety in text-to-image generation without unnecessarily degrading benign prompt-image alignment. In text-to-image diffusion models, safety alignment aims to suppress unsafe outputs, while prompt-image alignment requires faithful generation of benign user intent. In practice, stronger safety interventions can suppress more unsafe outputs but can also distort the underlying conditional generation behavior.
In this thesis, we first formalize this tension through a total variation (TV) perspective. We show that, under a fixed reference conditional generator, any nontrivial reduction in unsafe generations necessarily incurs deviation from the reference distribution, yielding a Safety-Prompt Alignment Trade-off (SPAT). Intuitively, the TV distance between two distributions is at least the gap between the probabilities they assign to any single event, including the event that the output is unsafe. We then propose an inference-only prompt projection framework that selectively rewrites high-risk prompts into a tolerance-controlled safe set while leaving already-safe prompts effectively unchanged. To realize this idea, we use a two-stage inference-time cascade in which a large language model proposes candidate rewrites and a vision-language model verifies image-level safety.
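The two-stage cascade described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the name `project_prompt`, the `tolerance` parameter, the use of the image-level verifier as the gate for "already safe" prompts, and the refuse-on-failure fallback are all hypothetical, and the language model, diffusion model, and vision-language model are replaced by toy keyword-based stand-ins so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Projection:
    prompt: Optional[str]  # prompt to send to the generator (None = refuse)
    rewritten: bool        # True if the original prompt was changed or refused
    risk: float            # verifier risk for the returned prompt's image


def project_prompt(
    prompt: str,
    propose_rewrites: Callable[[str], Iterable[str]],  # stage 1: LLM stand-in
    generate_image: Callable[[str], str],              # diffusion model stand-in
    image_risk: Callable[[str], float],                # stage 2: VLM stand-in, in [0, 1]
    tolerance: float = 0.1,                            # hypothetical safe-set threshold
) -> Projection:
    # Assumed gate: a prompt whose image already passes the verifier is kept as-is.
    risk = image_risk(generate_image(prompt))
    if risk <= tolerance:
        return Projection(prompt, False, risk)

    # Stage 1 proposes candidates; stage 2 accepts the first one within tolerance.
    for candidate in propose_rewrites(prompt):
        cand_risk = image_risk(generate_image(candidate))
        if cand_risk <= tolerance:
            return Projection(candidate, True, cand_risk)

    # Assumed fallback: refuse when no candidate lands in the safe set.
    return Projection(None, True, risk)


# Toy stand-ins (hypothetical) so the sketch is runnable end to end.
UNSAFE_WORDS = {"gory", "bloody"}


def toy_generate(prompt: str) -> str:
    return f"<image of: {prompt}>"


def toy_image_risk(image: str) -> float:
    words = image.lower().replace(">", " ").split()
    return 1.0 if any(w in UNSAFE_WORDS for w in words) else 0.0


def toy_rewrites(prompt: str):
    # Yield candidates from the least to the most aggressive edit.
    yield " ".join(w for w in prompt.split() if w.lower() != "gory")
    yield " ".join(w for w in prompt.split() if w.lower() not in UNSAFE_WORDS)


benign = project_prompt("a cat on a sofa", toy_rewrites, toy_generate, toy_image_risk)
risky = project_prompt("a gory bloody battlefield", toy_rewrites, toy_generate, toy_image_risk)
print(benign)
print(risky)
```

In this toy run the benign prompt passes the gate and is returned unchanged, while the risky prompt is replaced by the first candidate that the verifier accepts. Ordering candidates from least to most aggressive is one way to keep the rewrite close to the original intent; how the thesis ranks or selects candidates is not stated in the abstract.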
Finally, we evaluate the proposed framework on four datasets and three diffusion backbones. Experimental results show that the proposed method consistently improves safety while keeping benign utility close to that of the unaligned reference model. In particular, it achieves relative reductions of 16.7% to 60.0% in inappropriate percentage compared with strong model-level alignment baselines, while maintaining near-reference performance on benign COCO prompts. These results suggest that selective prompt-space intervention provides a practical and theoretically grounded approach to safe text-to-image generation.
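For readers unfamiliar with the metric, a relative reduction is measured against the baseline's own value rather than in absolute percentage points. The helper below is a small sketch of that arithmetic; the function name `relative_reduction` and the input values are hypothetical and are not results from the thesis.

```python
def relative_reduction(baseline_ip: float, method_ip: float) -> float:
    """Fraction by which method_ip is lower than baseline_ip."""
    if baseline_ip <= 0:
        raise ValueError("baseline_ip must be positive")
    return (baseline_ip - method_ip) / baseline_ip


# Hypothetical values for illustration only; NOT numbers from the thesis.
example = relative_reduction(baseline_ip=0.20, method_ip=0.15)
print(f"{example:.1%}")
```

With these illustrative inputs, a drop in inappropriate percentage from 0.20 to 0.15 is 5 percentage points in absolute terms but a 25% relative reduction.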