TLDR:
My recommendation is to go with Hunyuan for NSFW. WAN 14b might eventually get a finetune that beats it, but I doubt we'll see that anytime soon.
TLDR With Reasoning:
WAN 1.3b: Don't use this unless you're working with limited VRAM. You get a pretty noticeable jump in quality and prompt adherence by moving to Hunyuan, and Hunyuan doesn't feel subjectively much slower. NSFW generation is poor out of the box; it needs training or LoRAs to do it well, and there aren't many LoRAs around (I suspect there never will be). Most people using WAN will be working with the 14b model, and that will only become more true as VRAM becomes more available on newer cards (well... maybe...)
Hunyuan: Use this for NSFW. While you're waiting for a video to generate it doesn't feel any slower than WAN 1.3b, but the quality is noticeably better for human subjects and it handles NSFW out of the box; Hunyuan was definitely trained on more NSFW material than WAN. It also feels MUCH faster than WAN 14b, which subjectively is unbearably slow for rapidly iterating on videos. LoRAs are available in abundance, with more NSFW options than WAN has.
WAN 14b: I'd say this is the better all-round model for SFW content, but it can't do much NSFW out of the box; even basic nudity can fail sometimes. Movement looks nicer and is easier to control with prompting, so it's better for directing the scene, and quality is better for non-realistic videos. The two models have different aesthetics: to me, Hunyuan feels more realistic while WAN feels more cinematic. There aren't as many LoRAs as for Hunyuan, but I suspect the gap will close quickly, and in the long run there will probably be more SFW options as WAN becomes the go-to for SFW content.
In my experiments, character LoRAs seem to train equally well on all three models given the same settings. See below for more detail.
Speed
I'm not sharing hardware or other details; these numbers are just to give you an idea of relative speed. All runs used the same resolution and frame count, roughly the same parameters, and SageAttention and TeaCache with equivalent settings. All used a "good" resolution for both Hunyuan and WAN, meaning one I've observed works well for both models independently. This is native Comfy with weights dropped to fp8 for Hunyuan and WAN 14b.
Model: prompt run times (seconds)
WAN 14b: 127.55, 132.62, 121.31
Hunyuan: 51.78, 54.69, 52.96
WAN 1.3b: 35.18, 32.64, 35.34
You can see WAN 1.3b and Hunyuan are pretty close, but WAN 14b is FAR slower. Subjectively, I don't really notice the difference between WAN 1.3b and Hunyuan when staring at the screen waiting for a video to generate.
Quality
WAN 1.3b is generally worse, less so at lower resolutions but still worse than the other two, so I'll ignore it and just talk about Hunyuan and WAN 14b.
Hunyuan wins for NSFW because it handles it easily out of the box and the quality is as good as WAN's. Hunyuan was obviously trained on NSFW material.
The aesthetics of the two differ. Color seems more saturated with WAN, and to me it looks more cinematic, like a Transformers or Marvel movie. Hunyuan looks more realistic in comparison.
WAN does shine with motion control, following prompts for movement, and non-realistic content. Hunyuan sometimes gives you odd things like a limb disappearing out of nowhere, an extra limb where there shouldn't be one, or other weird anatomy; WAN has this happen less often. WAN also handles animation and CGI-type content better. Hunyuan can sometimes feel like bad green screen even with realistic content, depending on how you prompt the background.
Hunyuan also seems to maintain the consistency of a trained character LoRA at lower and more diverse resolutions than WAN, which is nice if you're trying to speed up generation even more.
Note that I'm not saying WAN is worthless for NSFW. With training, LoRAs, and NSFW images as a starting point for the I2V models, it can definitely do NSFW. It's just nowhere near as easy; I always felt like I had to really force it, and even then the results were often not as good as Hunyuan's. Hunyuan, on the other hand, was always more than happy to oblige my NSFW requests. I just don't see the point in putting that much effort into getting NSFW out of a model that seems to actively make it difficult. Unless someone drops some serious cash to fine-tune the issues out of WAN, it's just not worth it imo.
LoRA Training
I only train character LoRAs, so keep that in mind.
Onetrainer can train Hunyuan, but the LoRA it produces only works in Onetrainer. A note in the GitHub discussions seems to indicate this won't be fixed, or is at least a really low priority. There is a hack for Comfy posted in the discussions where you alter the lora.py file in your Comfy install; this works, but depending on how you trained the LoRA it still won't be fully supported. At this time, I'd say Onetrainer is a no-go for Hunyuan. I'm not sure about WAN support; I didn't test it.
I'm using Musubi Tuner to train both models. It's by the same hero who built sd-scripts, which is what Kohya_ss is built on top of. It works with both Hunyuan and WAN and includes a conversion script you run on the output LoRA to get it running in Comfy, something like the sketch below.
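To give you the shape of it, the conversion step looks roughly like this. The script name and the --target flag come from the musubi-tuner README as I last read it, and the file names are placeholders, so check the repo docs before relying on this:

```
# Convert a musubi-tuner LoRA into the layout ComfyUI expects.
# Input/output paths are placeholders; --target other selects the
# "other" (ComfyUI/Diffusers-style) key layout per the repo README.
python convert_lora.py \
  --input output/my_character.safetensors \
  --output output/my_character_comfy.safetensors \
  --target other
```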
Both models seem to train equally well for characters, and the settings are pretty similar, so once you've learned to train one you'll pretty much have the other figured out as well.
I use images only, no video, and train on two datasets, one at 512 and one at 1024. I inpaint out any other people in the images and try to keep the backgrounds and clothing as diverse as possible, sometimes even changing clothing colors/styles using inpainting. The 512 dataset is mostly portraits, while the 1024 dataset is mostly full-body pics. I don't know if this is the best way to go; I haven't experimented much with different data.
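For reference, that two-dataset split maps onto a musubi-tuner dataset config along these lines. The [general]/[[datasets]] layout follows the repo's dataset documentation, but every path here is made up and the other values are just a guess at a sane starting point:

```
# Sketch of a dataset.toml for one 512 portrait set and one 1024 full-body set.
# Paths are placeholders; see musubi-tuner's dataset docs for the full option list.
cat > dataset.toml <<'EOF'
[general]
caption_extension = ".txt"
batch_size = 1
enable_bucket = true

[[datasets]]
resolution = [512, 512]
image_directory = "/data/character/512_portraits"
cache_directory = "/data/character/512_portraits_cache"
num_repeats = 1

[[datasets]]
resolution = [1024, 1024]
image_directory = "/data/character/1024_fullbody"
cache_directory = "/data/character/1024_fullbody_cache"
num_repeats = 1
EOF
```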
The dataset can be prepped just like you would for an image model; use one of the many tutorials on Civitai, there really isn't anything different to do for these video models when it comes to prepping the data. Use individual caption files, but it's ok if every caption is just "sksk woman" or whatever, like a DreamBooth prompt. I trained 30+ models using only that caption and they all work fine. Maybe they would have turned out better with more detailed captions, but I'm not sure the quality increase is really worth the effort.
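If you go the single-trigger-caption route, writing the caption files is a one-liner (assuming .png images in a hypothetical folder; adjust the glob to your formats):

```
# Drop an identical "sksk woman" caption file next to every image.
for f in /data/character/512_portraits/*.png; do
  echo "sksk woman" > "${f%.*}.txt"
done
```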
I haven't had time to test this, but given Hunyuan's similarities to Flux when it comes to training, I suspect that if you find specific resolutions you like to use when generating video, then resize your images into datasets at that same resolution and others at proportionate sizes, it would improve quality. For example, if you like to gen videos at 400x600, create image datasets at 400x600, 200x300, 500x750, etc.
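If you want to try that untested idea, ImageMagick will do the resize-and-center-crop; this is just a sketch with placeholder paths:

```
# Build a 400x600 dataset from source images: -resize 400x600^ scales the
# image to cover the box, then -extent center-crops it to exactly 400x600.
mkdir -p /data/character/400x600
for f in /data/character/src/*.jpg; do
  convert "$f" -resize 400x600^ -gravity center -extent 400x600 \
    "/data/character/400x600/$(basename "$f")"
done
```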
The basic AdamW/AdamW8bit optimizer works just fine. Adafactor with a constant learning rate also works on both. Adafactor with the Adafactor scheduler works on Hunyuan, although for some reason it produces weird artifacts on WAN in later epochs, which made it unusable for WAN for me.
I did try increasing the weight_decay for AdamW and found I preferred training with it at its default value. Higher values did keep the model from overfitting as easily, but I found that for character LoRAs I actually had to overfit a bit to get the body shape right. Maybe that's because I use fewer full-body pics than portraits in my training, I'm not sure. If I kept weight_decay higher it would probably eventually get the body/face as good as the default does, and might even be better trained, but in balancing speed against quality it just doesn't seem worth it to me.
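In flag form, those optimizer choices look something like the fragments below, which you'd slot into the training command sketched at the end. Flag names follow the sd-scripts conventions musubi-tuner inherits; the learning rates are placeholders, not recommendations:

```
# Plain AdamW8bit at the default weight decay (my go-to):
--optimizer_type adamw8bit --learning_rate 1e-4

# Adafactor with a constant learning rate (the extra args are the usual
# sd-scripts recipe for driving Adafactor with an external LR):
--optimizer_type adafactor --lr_scheduler constant --learning_rate 1e-4 \
  --optimizer_args relative_step=False scale_parameter=False warmup_init=False

# Higher weight decay resists overfitting, but cost me body-shape likeness:
--optimizer_type adamw --optimizer_args weight_decay=0.1
```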
'networks.lora' works ok and has no trouble producing pretty good likeness and body shape.
If you install LyCORIS from its repo separately, you can use 'lycoris.kohya' with DoRA/decompose and the results will be even better. However, the conversion script doesn't fully support DoRA weights yet. The converted DoRA will still work in Comfy, though, and produce better likeness than a normal LoRA. I'm hoping Musubi gets updated to support proper conversion of DoRA weights, or Comfy builds in support. I doubt Comfy will be updated though, there were some pretty salty comments in the GitHub discussions on this topic 😆.
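The swap itself is just a different network module plus network args. As I understand it, dora_wd is LyCORIS's switch for DoRA-style weight decomposition, and the dim/alpha values here are placeholders, so treat this as a sketch:

```
# Install LyCORIS from its repo, then point the trainer at lycoris.kohya:
pip install git+https://github.com/KohakuBlueleaf/LyCORIS.git

# Swap these in for --network_module networks.lora:
--network_module lycoris.kohya \
  --network_args algo=lora dora_wd=True \
  --network_dim 32 --network_alpha 16
```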
Use a sample prompts file and have it write out samples and save every X epochs, then review the samples and select the version that captures the subject's likeness to an acceptable level with the lowest number of steps. Overtraining isn't super obvious with these models like it is with some image models. As steps increase, flexibility will decrease; you won't see it so much in the samples, but you'll feel it when prompting later, so go with the least amount of training you can do while still capturing your subject.
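The sample prompts file is one prompt per line with inline flags. The --w/--h/--f/--d/--s syntax (width, height, frames, seed, steps) is from the musubi-tuner docs as I remember them; the values are placeholders:

```
# prompts.txt: one test prompt per line, with size/frames/seed/steps inline.
cat > prompts.txt <<'EOF'
sksk woman walking on a beach --w 512 --h 512 --f 25 --d 42 --s 20
sksk woman, close-up portrait, smiling --w 512 --h 512 --f 25 --d 42 --s 20
EOF

# Then hand it to the trainer and sample/save on a schedule:
--sample_prompts prompts.txt --sample_every_n_epochs 2 --save_every_n_epochs 2
```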
If you have less than 24GB of VRAM, adjust the blocks-to-swap setting upward (you can see the flag in the launch sketch below). I think it has to be an even number, at least 4; I couldn't find any solid documentation on this, but some numbers generated an error during training for me. If you run into an error, just adjust up or down by two, keep it even, and try again.
I've attached sample files including the commands I use for training. Just install https://github.com/kohya-ss/musubi-tuner following the instructions on the GitHub page (I highly recommend a Python venv or conda environment), optionally install LyCORIS if you want to use DoRA, prep your dataset, and then customize the example files attached and run.
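To tie it together, here's a minimal sketch of the install plus a Hunyuan launch. Every path, model file, and hyperparameter here is a placeholder; the attached files and the musubi-tuner README are the source of truth, and note the README also has you cache latents and text-encoder outputs before training:

```
# Install per the README: torch first (match your CUDA version), then requirements.
git clone https://github.com/kohya-ss/musubi-tuner
cd musubi-tuner
python -m venv venv && source venv/bin/activate
pip install torch torchvision  # pick the build for your CUDA version
pip install -r requirements.txt

# Cache latents and text-encoder outputs first (see README), then launch.
# Sampling during training also needs the VAE/text-encoder flags from the README.
accelerate launch hv_train_network.py \
  --dit /models/hunyuan/mp_rank_00_model_states.pt \
  --dataset_config dataset.toml \
  --network_module networks.lora --network_dim 32 \
  --optimizer_type adamw8bit --learning_rate 1e-4 \
  --mixed_precision bf16 --fp8_base --sdpa \
  --blocks_to_swap 20 \
  --max_train_epochs 16 --save_every_n_epochs 2 \
  --sample_prompts prompts.txt --sample_every_n_epochs 2 \
  --output_dir output --output_name my_character
```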
Good luck and godspeed.