Nekofantasia 0.1
The first Rectified Flow diffusion model for anime art generation
Overview
Nekofantasia is the first-ever diffusion model for anime art generation built on Rectified Flow, based on the cutting-edge Stable Diffusion 3.5 Medium architecture. Our training dataset currently consists of 4 million high-quality anime artworks, and in another first, every single one of them was personally reviewed and hand-picked by the Nekofantasia team over the course of two years. We took this painstaking approach because the automated image-scoring methods most anime AI developers rely on are wildly unreliable: they bulk-delete valuable training images while keeping images that clearly should have been tossed, dragging dataset and final model quality down to an unacceptable degree.
The goal of Project Nekofantasia is to break through the stagnation of community-driven, uncensored anime models, which have largely been stuck on outdated tech and methods, and to build the best free anime art generation model out there. That means tackling a whole range of systemic technical issues that plague existing models: Adafactor instead of full AdamW, fp16 instead of bf16, automated aesthetic filtering (or no filtering at all) instead of manual curation, small datasets, legacy architectures, and outright training mistakes.
The ultimate goal: to eventually arrive not just at the best existing anime model, but at the definitive, ultimate anime model — one whose output is virtually indistinguishable from real high-quality artwork.
⚠️ ATTENTION: Nekofantasia 0.1 is an early preview release that has NOT completed full training due to funding constraints. It has hit the quality bar we expected at this stage, but it's not yet capable of a lot of things it will undoubtedly be capable of with further training. This is exactly the outcome we've spent years working toward — painstakingly assembling a dataset by hand, tracking down and fixing countless issues that had been consistently degrading one model after another, and running experiments, including expensive ones.
However, making serious progress from here is simply not possible without your help. If you're willing and able to financially support us and make a real contribution to the advancement of free anime models, we'd be grateful for any donation via the addresses at the bottom of this page or through Patreon. We don't have wealthy backers or corporate funding — our only hope is voluntary community support. That's why we're releasing this version now: as proof that we're dead serious, so we can earn your support.⚠️
Why Rectified Flow + MMDiT Matters
Using a Flow-based model with the MMDiT architecture has already fixed — or will fix in future releases — virtually all the shortcomings of other models, including:
1. No more "plasticky" look
The telltale "plasticky," cookie-cutter look inherent to EPS-prediction models is already gone in the current version. EPS-prediction models are fundamentally unable to reliably recover the DC component of an image (overall tone and brightness) from a noisy signal; this is an inherent limitation of the method, not a tuning issue. In practice, it means there is no "sweet spot" for CFG: low values give you washed-out, faded colors; high values give you oversaturated, eye-searing neon. A specific case is EPS models' inability to render scenes with extreme lighting: night scenes collapse into dark blue, bright scenes lose saturation. V-prediction partially addresses this; Rectified Flow resolves it completely by predicting velocity instead of noise. On top of that, using full AdamW instead of Adafactor preserves per-element adaptivity of the second moment (which Adafactor loses due to factorization), allowing the model to pick up on finer stylistic nuances, and bf16 mixed precision instead of fp16 provides greater dynamic range and training stability.
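The difference between the two prediction targets can be sketched in a few lines. This is an illustrative toy example in numpy, not Nekofantasia's training code; the vectors and the noise-schedule value are made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)   # stand-in for a noise (prior) sample
x1 = rng.standard_normal(16)   # stand-in for a clean image latent
t = 0.3                        # timestep in [0, 1]

# Rectified Flow: the model sees a straight-line interpolation between
# noise and data, and its regression target is the constant velocity
# along that line. Overall tone (the DC component) is carried directly
# by the target rather than inferred from a noise estimate.
x_t = (1.0 - t) * x0 + t * x1
v_target = x1 - x0

# EPS-prediction (classic DDPM-style): the target is the noise itself,
# and the clean image must be recovered by rearranging the schedule.
alpha_bar = 0.7                # an arbitrary point on a noise schedule
eps = rng.standard_normal(16)
x_t_eps = np.sqrt(alpha_bar) * x1 + np.sqrt(1.0 - alpha_bar) * eps
x1_recovered = (x_t_eps - np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alpha_bar)
assert np.allclose(x1_recovered, x1)

# With RF, one Euler step from x_t using the exact velocity lands on x1:
assert np.allclose(x_t + (1.0 - t) * v_target, x1)
```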
2. Better stability and more efficient GPU compute usage than EPS and V-prediction
V-prediction, used as a partial fix for EPS issues, introduces its own instability. Meanwhile, EPS training costs are higher than they need to be — something that can be avoided thanks to Rectified Flow's smoother loss landscape.
3. 16-channel VAE for better fine detail
The 16-channel VAE provides significantly more accurate reconstruction of fine spatial details compared to the 4-channel VAEs of previous generations, which should substantially improve rendering of complex elements down the line (fingers, eyes, clothing details).
Installing and running
Nekofantasia-01.safetensors goes in ComfyUI/models/checkpoints.
Text encoders can be downloaded from https://huggingface.co/Nekofantasia/Nekofantasia-alpha/tree/main/text_encoders. All three files (t5xxl_fp16.safetensors, clip_l.safetensors, clip_g.safetensors) go in ComfyUI/models/text_encoders.
Install RK-samplers node: https://github.com/memmaptensor/ComfyUI-RK-Sampler#installation
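Summarizing the layout above, assuming a local ComfyUI checkout in the current directory (file names as published; `touch` stands in for the actual downloads):

```shell
# Expected ComfyUI directory layout for the Nekofantasia files.
mkdir -p ComfyUI/models/checkpoints ComfyUI/models/text_encoders

# Model checkpoint (placeholder for the downloaded file):
touch ComfyUI/models/checkpoints/Nekofantasia-01.safetensors

# All three text encoders (placeholders for the downloaded files):
touch ComfyUI/models/text_encoders/t5xxl_fp16.safetensors \
      ComfyUI/models/text_encoders/clip_l.safetensors \
      ComfyUI/models/text_encoders/clip_g.safetensors
```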
Recommended Prompt Structure
Nekofantasia does not use artificial quality tags (masterpiece, best quality, etc.) — low-quality images were NEVER part of the training data.
Tag order doesn't matter much when building your prompt, since tag shuffling was used during training, and the model architecture has significantly less dependence on tag order than UNet-based models.
Since the current 0.1 version hasn't completed full training, long, detailed prompts will give you the best results.
Don't use underscores (_) in tags. Separate tags with commas.
We recommend using highres and absurdres, and it helps to include booru safety tags (general, sensitive, questionable, explicit).
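The tag-formatting rules above (spaces instead of underscores, comma separators, order-insensitive) can be applied mechanically when converting booru-style tags into a prompt. A hypothetical helper, not part of any official tooling:

```python
def booru_tags_to_prompt(tags):
    """Join booru-style tags into a prompt string:
    underscores become spaces, tags are comma-separated."""
    return ", ".join(tag.replace("_", " ").strip() for tag in tags)

prompt = booru_tags_to_prompt(["1girl", "cat_ears", "long_hair", "highres"])
# -> "1girl, cat ears, long hair, highres"
```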
Example Prompt
1girl, absurdres, animal ears, bow, braid, cat ears, dress, green dress, hair bow, highres, kaenbyou rin, long hair, long sleeves, looking at viewer, nekomata, oil painting (medium), painting (medium), portrait, red eyes, red hair, red ribbon, neck ribbon, smile, solo, touhou, traditional media, twin braids
Recommended Negative Prompt
lowres, pixelated, downscaled, upscaled, jpeg artifacts, compression artifacts, scan artifacts, blurry, censored, bar censor, mosaic censoring, heart censor, bad anatomy, bad hands, bad feet, extra digits, fewer digits, watermark, text, dated, watermark grid, sample watermark, artist name
Sampling Settings
Due to instability in the model's vector field at this stage, higher-order samplers are recommended for producing clean images — ideally Runge-Kutta methods.
CFG Scale — 3–7 (feel free to go higher if needed)
Sampler — Dopri5 (recommended) or Bosh3
Steps — N/A — Dopri5 is an adaptive sampler that automatically determines the number and size of its steps
Note: There's no benefit to using even higher-order samplers. Bosh3 is also a solid choice, but despite being 3rd-order vs. 5th, it won't actually be faster. In theory, you can generate with simpler samplers like Euler or Heun, but they'll require a lot of steps, and even then you'll likely only save time at the cost of reduced stability and the occasional third arm.
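To illustrate why an adaptive embedded Runge-Kutta sampler needs no fixed step count, here is a minimal adaptive Bogacki-Shampine 3(2) integrator (the method behind Bosh3) on a toy ODE. It is a sketch of the step-size control idea only, not the sampler node's actual implementation.

```python
import math

def bosh3_adaptive(f, t, y, t_end, tol=1e-6):
    """Integrate y' = f(t, y) with the Bogacki-Shampine 3(2) pair,
    adapting the step size from the embedded error estimate."""
    h = (t_end - t) / 10.0  # initial guess; the controller adjusts it
    while t < t_end:
        h = min(h, t_end - t)
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + 3 * h / 4, y + 3 * h / 4 * k2)
        y3 = y + h * (2 * k1 / 9 + k2 / 3 + 4 * k3 / 9)        # 3rd order
        k4 = f(t + h, y3)
        y2 = y + h * (7 * k1 / 24 + k2 / 4 + k3 / 3 + k4 / 8)  # 2nd order
        err = abs(y3 - y2)      # embedded local error estimate
        if err <= tol:          # accept the step
            t, y = t + h, y3
        # Grow or shrink h based on the error estimate (order-3 control).
        h *= min(5.0, max(0.1, 0.9 * (tol / max(err, 1e-16)) ** (1 / 3)))
    return y

# y' = -y, y(0) = 1  =>  y(1) = e^-1
approx = bosh3_adaptive(lambda t, y: -y, 0.0, 1.0, 1.0)
assert abs(approx - math.exp(-1)) < 1e-4
```

The same accept/reject-and-rescale loop is what lets Dopri5 pick its own step count during sampling, which is why the Steps setting does not apply.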
Recommended Resolutions
Target ~1MP, all sides divisible by 64. Portrait (vertical) images currently tend to produce better quality than landscape.
Square | 1024 × 1024 | 1:1
Portrait | 896 × 1152 | 7:9
Portrait | 832 × 1216 | 2:3
Portrait | 768 × 1344 | 4:7
Portrait | 640 × 1536 | 5:12
Landscape | 1152 × 896 | 9:7
Landscape | 1216 × 832 | 3:2
Landscape | 1344 × 768 | 7:4
Landscape | 1536 × 640 | 12:5
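The table follows a simple rule (both sides divisible by 64, total area at most ~1 MP), so valid resolutions can be enumerated programmatically. A small sketch, not an official bucket-list generator; the side limits are taken from the table above:

```python
def buckets(max_area=1024 * 1024, step=64, min_side=640, max_side=1536):
    """Enumerate (w, h) pairs with both sides divisible by `step`
    and total area no greater than max_area (~1 MP by default)."""
    sides = range(min_side, max_side + 1, step)
    return [(w, h) for w in sides for h in sides if w * h <= max_area]

res = buckets()
assert (1024, 1024) in res      # 1:1 square
assert (896, 1152) in res       # 7:9 portrait
assert (1536, 640) in res       # 12:5 landscape
```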
Sources
Various anime imageboards, personal blogs, Patreon, Pixiv, and game CGs extracted from RPA files of various visual novels. Using a custom-trained neural network combined with manual review, we removed a large number of AI-generated and AI-assisted artworks uploaded to booru sites without proper tagging — which posed a real risk of degrading model quality.
Data Handling
Unlike the approach taken by virtually every previous anime model, images were subjected to minimal lossy compression (only INTER_AREA for bucket resizing to training-compatible dimensions — we plan to eliminate even this down the road), with no WebP or JPG recompression beyond whatever the original artists already applied when uploading. Dataset collection for the current version was completed in February 2026 and likely covers virtually every character with even moderate popularity.
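Area (INTER_AREA) resizing is essentially pixel-area averaging; for an integer downscale factor it reduces to a box filter over each block. A simplified numpy equivalent for the integer-factor case only (OpenCV's cv2.resize with interpolation=cv2.INTER_AREA handles arbitrary factors):

```python
import numpy as np

def area_downscale(img, factor):
    """Downscale an HxW (or HxWxC) image by an integer factor by
    averaging each factor x factor block -- the integer-factor case
    of area interpolation."""
    h, w = img.shape[:2]
    assert h % factor == 0 and w % factor == 0
    blocks = img.reshape(h // factor, factor, w // factor, factor, *img.shape[2:])
    return blocks.mean(axis=(1, 3))

img = np.arange(16.0).reshape(4, 4)
small = area_downscale(img, 2)
# Each output pixel is the mean of a 2x2 block of the input.
assert small.shape == (2, 2)
assert small[0, 0] == (0 + 1 + 4 + 5) / 4  # == 2.5
```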
Training Details
Optimizer: AdamW
Scheduler: Constant with warmup
Effective batch size: 176
Precision: bf16 mixed precision (not full bf16)
Hardware: 8x H100 SXM (~24 hours, 194 GPU-hours)
Text encoders were not trained. In the MMDiT architecture, caption-to-image interaction happens via JointAttention within the model itself, making text encoder fine-tuning a waste of GPU compute. Training text encoders through diffusion loss is not an effective approach.
Known Limitations
Has significant issues with fingers and fine details.
Hasn't yet learned characters or many uncommon tags.
Only a small fraction of the model's potential has been realized at this stage. However, it has already oriented toward the anime style and doesn't exhibit the "smeared" anime rendering look that's typical of many models and the original SD 3.5.
Character tags currently have almost no effect on generation.
In rare cases with uncommon or missing booru tags, photorealistic style bleed-through may occur.
Due to the extremely aggressive safety filters baked into base SD 3.5, NSFW content generation is currently almost impossible. However, unlike the base model, Nekofantasia 0.1 can already properly render bare breasts — which means that with further training, StabilityAI's censorship (which turned a lot of people off from their most modern model) can likely be fully overcome.
Recommended Settings for LoRA
Optimizer: adamw8bit/adamw (schedule-free optimizers tend to significantly underestimate step size; Prodigy with d=3 and safe_warmup can be a decent option)
Network dim: 32-64
Network Alpha: =dim
network_train_unet_only: true
Effective batch size: 4-12
LR warmup steps: 100-200
LR: 4e-4/2e-4
training_shift: 1.0
Weighting_scheme: logit_normal
network_module: networks.lora_sd3
Note on methods: LoKr, LoCon, IA3, and other LyCORIS-family methods are either fundamentally incompatible with the transformer architecture or require significant fixes; using standard LoRA is recommended.
Numerical stability: To ensure numerical stability, it's recommended to avoid fp16 and use bf16 instead.
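For convenience, the values above can be collected into a kohya sd-scripts-style TOML fragment. This is an illustration assembled from the recommendations above, not a shipped config; option names follow kohya's conventions but should be verified against your sd-scripts version.

```toml
# Hypothetical kohya-style LoRA config assembled from the values above.
# Double-check option names against your sd-scripts version.
network_module = "networks.lora_sd3"
network_dim = 32
network_alpha = 32            # alpha = dim
network_train_unet_only = true

optimizer_type = "AdamW8bit"
learning_rate = 2e-4
lr_warmup_steps = 100
train_batch_size = 4          # scale toward an effective batch of 4-12

mixed_precision = "bf16"      # avoid fp16 for numerical stability
training_shift = 1.0
weighting_scheme = "logit_normal"
```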
Changelog
v0.1 — Trained on 1/3 of the full dataset. Initial release — 2026.03.13.
Roadmap
1.0 Medium Release
2–3 epochs of training. Knowledge of virtually all moderately popular characters (2k+ artworks on Danbooru). Proper limb generation. Various artist styles. Quality competitive with commercial anime generators. Potentially the best free general-purpose anime art model, with knowledge of nearly all characters and styles without needing LoRAs.
Estimated cost: $1,200–$2,600
VAE Decoder Fine-tuning
To fully eliminate detail noise and potential VAE artifacts. Decoder training is much simpler and will be stopped as soon as PSNR reaches near-lossless levels — likely less than one epoch, possibly less than half.
Estimated cost: $50–$600
Reference-like Feature (similar to NAI)
The ability to feed the model a single image as a style, subject, or character reference. (This may not be implemented before the next milestone, since the Medium model has quality ceilings below what the community deserves.)
Estimated cost: TBD — further research needed
1.0 on Stable Diffusion 3.5 Large (8B)
New large model with a greatly expanded dataset. Maximum anime art generation quality achievable on the 8B MMDiT architecture: correct multi-figure composition, style blending, narrative scene generation, and minimization of typical AI artifacts.
Estimated cost: $4,000–$10,000 (potentially several times higher if we prioritize quality and increase resolution to 2.1MP)
IP-Adapter for the Large Model
Adding a reference feature for transferring subject, style, or character from a donor image into the generation.
Estimated cost: Slightly less than the previous item
⚠️ All estimates are based on extrapolation from current results and may be adjusted, since a model of this type has never been built before.
Acknowledgments
StabilityAI — For creating such a fantastic model architecture that has, unfortunately, gotten far less attention from the community than it deserves. SD 3.5 represents a major leap forward in diffusion model architecture, and we hope to showcase its potential for high-quality anime art generation.
Kohya_ss — For building the training scripts that made this possible.
You, personally — For taking the time to check out this model, reading through all of this, and hopefully generating some images and showing your support.
Join our Discord server
Please Support Us
Your donations will go toward improving quality, so that everyone can freely create beautiful, diverse anime art in any subject and any style, without censorship, per-generation fees, or dependence on corporations.
You can support us via cryptocurrency or through Patreon (https://www.patreon.com/nekofantasia). Crypto is preferred, since we receive 100% of your donation that way, with no platform cuts or excessive fees.
Donation Addresses
BTC: bc1q8g902k9gcstrtc543q849tzmeezta9t5j6jc43
XMR: 42aMKZ1ZPNJDMxjEMMYTs3PPbAxcZqfJnNfMS361gX4mdjMefc4rUBSHxAFCLmryi5WH2TVUPMiL2Ho7ZGn6iEjwBxXhKDu
ETH / BNB / EVM-compatible chains / any tokens (USDT etc): 0xeb8390f51431EBDc4332D43568EeCe4888dDAe53
TRX / any TRC20 tokens: TEZJetBdbEbL239Z91QJSh9zN5ggcFTuEu
ZEC: t1ZChGuaPDJJAVUXjWywRpuzHU3FRe6iis1
DASH: XdHYPfECKVs3qu65r35h5vA2pa9XcQNAap
LTC: ltc1qjfsgnmueylc7j2uhpp7u2rey08me5nylvgfwzf
SOL: 3vEKkYNxZYcEcxRrEMJdbXijBjpNcJJhBXtJtp6ojWuE
BCH: qpal09f5cky3g0yjs48tv5xl9k6zhz0ldcpa673peu
If you'd prefer a donation method that isn't listed here, reach out to us on Discord.
As a thank-you for donating, you'll receive:
Access to a private donors-only channel on Discord.
Early access to preview and release builds of future models.
Your feedback and suggestions will get significantly more attention.
Possibly additional exclusive perks we haven't thought of yet.

