Many thanks for your excellent lectures, particularly those on diffusion models. I have a few questions about conditional diffusion models. In cross-attention, can we treat the text vectors as the query (Q) and the image vectors as the key (K) and value (V), rather than using the image vectors as the query (Q)?
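
To make the question concrete, here is a minimal sketch of the two arrangements as I understand them. It is plain scaled dot-product attention with no learned projections, and the token counts (16 image positions, 5 text tokens) and the dimension 8 are made-up values for illustration only, not taken from the lectures.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    q, k, v are lists of row vectors. Learned W_Q, W_K, W_V projections
    are omitted here (an assumption made to keep the sketch short).
    """
    d = len(q[0])
    out = []
    for q_row in q:
        # One score per key row, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d) for k_row in k]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(len(v[0]))])
    return out

def random_tokens(n, d, rng):
    """Hypothetical helper: n random d-dimensional token vectors."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

rng = random.Random(0)
d = 8
image_tokens = random_tokens(16, d, rng)  # assumed: 16 image (latent) positions
text_tokens = random_tokens(5, d, rng)    # assumed: 5 text tokens

# Arrangement 1: image vectors are Q, text vectors are K and V.
out_image_q = cross_attention(image_tokens, text_tokens, text_tokens)

# Arrangement 2 (the one I am asking about): text vectors are Q, image vectors are K and V.
out_text_q = cross_attention(text_tokens, image_tokens, image_tokens)

print(len(out_image_q), len(out_image_q[0]))  # 16 8
print(len(out_text_q), len(out_text_q[0]))    # 5 8
```

The output always has one row per query, so with the image as Q the result can be added back to the image features position by position, whereas with the text as Q the result is one vector per text token. My question is whether that second form can still be used to condition the image.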