The concept has great potential, but at this stage, generation from either text or image to video has a long way to go before you can produce anything close to cinematic or realistic results. As of now, anything over 4 seconds loses detail, hallucinates into disfigured, unrealistic shapes, and maintains little cohesion over time. I love the idea, but for every close-to-good result, I have to churn through 15-20 disappointing ones. I will hang in there, but I certainly hope they figure out a way to maintain some consistency with the original seed throughout the generation and find a way to limit these grossly unrealistic hallucinations.
I'm really enjoying the "grossly unrealistic hallucinations", although I'm still generating a fair amount of garbage in order to get shots that I like.