About v2: better face accuracy, but 2 seconds are sacrificed for the transition.
Transforms your input image into a deepfake, but the accuracy is still fairly low in this version.
Write your prompts the way you would for t2v, but put the trigger line first, as in this example:
DEEPFAKE, scene transition.
the girl of the picture eating burger at restaurant, the background changes to a fancy restaurant.
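
If you generate prompts from a script, the format above can be sketched roughly like this. It is only an illustration: `build_prompt` and `TRIGGER` are hypothetical names, not part of the model or any library, and joining the trigger and the description with a newline is an assumption based on how the example is laid out.

```python
# Hypothetical helper, not part of the model or any library.
# Trigger text copied from the example prompt above.
TRIGGER = "DEEPFAKE, scene transition."


def build_prompt(description: str, trigger: str = TRIGGER) -> str:
    """Put the trigger line first, then the t2v-style description."""
    description = description.strip()
    if not description:
        raise ValueError("description must not be empty")
    # Assumption: the trigger and the description sit on separate lines.
    return f"{trigger}\n{description}"


prompt = build_prompt(
    "the girl of the picture eating burger at restaurant, "
    "the background changes to a fancy restaurant."
)
print(prompt)
```

The description stays free-form, the same as a normal t2v prompt; only the first line is fixed.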




