
Style Capture: Use 3D/polygonal images in your LoRA with this one simple trick!


I’ve seen a lot of people exhibit a fear of using 3D (as in polygons) images in their training data, either attempting to work with an extremely limited 2D data set instead or deeming it impossible. Now that I’ve done 4 models (Ella, Tara Grimface, Carol, and Cindy Vortex) using primarily 3D data, at least 4 more (the Code Fairy girls and Dagoth Ur) with heavy doses of it, and a few more with splashes of it, I can safely say: 3D data is NOT a problem! The article's picture was made with a LoRA trained on 25 3D pics and 5 2D pics (one chibi, the rest badly compressed art from the game's loading screens).

There’s just one simple trick to using 3D data: you need to identify it as “3D” when tagging your training data (and “low poly” if applicable). Yes, it really is that simple! The resulting LoRA will handle “normal” styles just fine. Whenever I see a model fail from using 3D data, it has always turned out that the data wasn’t tagged “3D”. Even the ones that didn’t totally fail (generally on a high-quality modern model) exhibit a telling, odd 3Dness that makes their origin unmistakable.
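If you want to apply the trick in bulk, here’s a minimal sketch. It assumes the common kohya-style layout of one comma-separated `.txt` caption file per image; the `dataset` folder name and the `tag_captions` helper are my own placeholders, so adjust to your setup.

```python
from pathlib import Path

# Assumed layout: one kohya-style .txt caption file per training image,
# containing comma-separated tags. Adjust DATASET_DIR and EXTRA_TAGS to taste.
DATASET_DIR = Path("dataset")
EXTRA_TAGS = ["3D", "low poly"]  # drop "low poly" if it doesn't apply

def tag_captions(dataset_dir: Path, extra_tags: list[str]) -> int:
    """Prepend extra_tags to every caption file, skipping tags already present."""
    changed = 0
    for caption_file in dataset_dir.glob("*.txt"):
        tags = [t.strip() for t in caption_file.read_text().split(",") if t.strip()]
        existing = {t.lower() for t in tags}
        missing = [t for t in extra_tags if t.lower() not in existing]
        if missing:
            caption_file.write_text(", ".join(missing + tags))
            changed += 1
    return changed

if __name__ == "__main__":
    if DATASET_DIR.is_dir():
        print(f"updated {tag_captions(DATASET_DIR, EXTRA_TAGS)} caption files")
```

The same script works for any of the other medium tags discussed below ("pixel art", "photo (medium)", etc.) by changing `EXTRA_TAGS`.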

What’s the Catch?

There’s only one small catch when using 3D training data tagged “3D”: the resulting LoRA will not play nice with LoRAs that imitate other 3D styles (at least, not without playing with weights), since “3D” will invoke your model's understanding of 3D.

I have to caveat that I’d yet to do a completely pure 3D data set, always including at least a few 2D images, though I’ve gone as low as 5 out of 82. Thus I wasn’t entirely sure it would work as well with zero non-3D images. From my experience with details and secondary items that only appear in 3D, though, they don’t appear to carry any artifacts from their medium. Update: I’ve now done two pure 3D sets. All issues appear to stem from the training data being fairly limited. Other LoRA makers have also confirmed there is no issue with pure 3D data.

Making the Most of a 3D Model

Assuming you’re capturing a video game model (rather than CGI animation) and have reasonably good control over the angle (a model rip or freecam is best, but you can make do with just good camera control, especially if you can disable the HUD/crosshair in the options, which more games allow than you’d think), you can get a lot of training data from camera control alone. I doubt just throwing near-identical images at the trainer is a good idea, but by combining a few camera-position tags you can easily get into the double digits while maintaining differences that can be distinguished with tags alone. By combining facing viewer, profile, from side, from behind, from above, from below, and straight-on, you can get 13+ images of a model without posing it. Most of these combos are pretty intuitive, especially once you know they’re valid (“from side, from behind” is a valid tagging for a view from 4 o’clock), though straight-on and profile don’t really combine with anything.
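To make the counting concrete, here’s a small sketch that enumerates one possible set of valid angle-tag combinations. The grouping and exclusion rules are my reading of the text (horizontal tags pair with vertical tags, “from side, from behind” is itself valid, and straight-on/profile stay solo), not an official tag taxonomy.

```python
# Enumerate camera-angle tag combos from the article's list.
# Assumption: horizontal and vertical tags combine freely; "straight-on"
# and "profile" don't combine with anything.
HORIZONTAL = ["facing viewer", "from side", "from behind"]
VERTICAL = ["from above", "from below"]
SOLO_ONLY = ["straight-on", "profile"]

def camera_combos() -> list[tuple[str, ...]]:
    # each tag alone is a valid shot
    combos = [(t,) for t in HORIZONTAL + VERTICAL + SOLO_ONLY]
    # one horizontal + one vertical, e.g. "facing viewer, from above"
    combos += [(h, v) for h in HORIZONTAL for v in VERTICAL]
    # "from side, from behind" (a 4 o'clock view), optionally with a vertical tag
    combos.append(("from side", "from behind"))
    combos += [("from side", "from behind", v) for v in VERTICAL]
    return combos

if __name__ == "__main__":
    for combo in camera_combos():
        print(", ".join(combo))
    print(len(camera_combos()))  # 16 distinct shots, comfortably past 13
```

Even this conservative rule set yields 16 distinguishable shots of a single unposed model.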

Some models can’t quite support all of these, since a lot of models do not hold up to close inspection from below, having holes, legs that end in nothing, and even worse artifacts you probably don’t want to train on. Further, many more “anime” models look unnatural in profile. Even without those angles, you’ll still get plenty of images. Basic poses such as “running” or “crawling” will get you many more when combined with these principles, but whether you can even capture them is game specific. Do try to avoid having all the images you capture of a model be the same pose; that tends to make it the resulting LoRA's “default” pose.

For CGI animation, just treat it like any other cartoon (but tagged 3D).

Other Potential Applications

3D is among the most extreme and polluting styles that can be contained to a tag, but from what I’ve tried, it’s not the only one. The most obvious second case is “realistic, photo (medium)” for using photos in models that can also be used with art styles, as seen in my Angkor Wat model (since it’s all public domain and CC pictures, you can see the training data and tags yourself till the end of the month).

Beyond those, I’ve also had success in capturing “blurry, lowres, jpeg artifacts” plus “1980s (style)” in my Atsuko LoRA. Unfortunately, since ALL the screenshots were a blurry mess, the blur is in large part treated as inherent to the style by some checkpoints. It would likely work better if you have some data where the two don’t overlap. Still, one can presume that the various “19XX (style)” tags (and “2000s (style)”) can also be used the same way.

Official art and anime screenshot technically do work for this, but there’s a big caution here: official art is often used as a quality-improving (and push-towards-“anime”-style) tag in generation by a very large chunk of users, and anime screenshot is often used as a style tag the same way, so they’ll often bleed their style into generations. This isn’t a bad thing if the quality of the pictures is good, but if you’re training off horrifically compressed low-res cutscenes from a Saturn disc (surprisingly, that LoRA came out pretty decent) or unpolished sketches done by the official artist, you will want to leave these off your tags, even if they fit the definition. Traditional media is likewise a commonly used style tag; it’s fine if the image used is good quality, but may be something to exclude if the image in question is low quality. “Sketch” is the opposite case: it is often seen as inherently low quality (or, at least, crude) and appears in many standard negatives, so it’s safe to apply where it fits.

Using “monochrome, greyscale” for applicable images likely falls under this concept. I’ve covered it before, but even with a pure B&W data set, you can put color in at generation time, at the cost of somewhat flat coloring when not paired with a style. I’ve tagged a lot of training images that had visible panel dividers and the like as “comic” and never noticed an issue with the format, though I don’t know how much of an issue they’d be without doing so.

I’ve included tagged “chibi” images a few times and they seem to integrate without issue. I lack thorough tests on “figure” (as in a hunk of plastic), but it’s a well-used tag, so it might be useful for isolating plastic texture and real surroundings. It overlaps with “photo (medium)”.

Edit: Another big one I forgot to mention: "pixel art". Even as someone who makes truly awful training data work, I’d only use the highest quality/largest sprites for training unless I was trying to make a style or truly out of options, but fighting game sprites, cutscene stills, and the occasional portrait can all be useful options for training data. If you can’t see the pixels, I wouldn’t use this tag.
Edit 3: I take it back. I just made a LoRA out of majority 30x60 sprites with ~20 colors as a test. Even this works! (Disclaimer: at least in Illustrious, with 3 non-pixel images + 6 bigger images from cutscenes added to the 19 gameplay screenshots.)

Edit 2: Another potential use of this is lighting. Using applicable tags like "dark", "night", "sunrise", or "sunset" can keep an image's odd palette from skewing the final model.

