ANIMA Retrain: Full Finetuned JSON train; Subject Buckets

Hello again.

The mad scientist is at it again... Some of you remember me. Some of you have no idea who I am. I've been around for a while, and since then I've dived into the depths of AI in order to learn. I've expanded my knowledge into many fields of AI, advanced mathematics, advanced symbolic theory, and more.

What is a subject bucket?

Say you have 50 images all with a potato, and 25 of them are on a table.

You now have 2 buckets: ["potato", "table"]. In practice it's much more complex than this, as the buckets are grouped into trees of complexity and then collapsed or expanded based on what is what, what is where, and so on.
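As a rough illustration of the flat potato/table case only, here is a minimal Python sketch. The function name `build_subject_buckets`, the image ids, and the input layout are all hypothetical, made up for this example. It does not attempt the tree grouping or the collapse/expand logic, and it is not the actual bucketing code.

```python
from collections import defaultdict


def build_subject_buckets(image_subjects):
    """Map each subject to the sorted list of image ids that contain it.

    image_subjects: dict of image id -> list of subject names (hypothetical layout).
    """
    buckets = defaultdict(list)
    for image_id, subjects in image_subjects.items():
        # set() so a subject listed twice on one image is only counted once
        for subject in set(subjects):
            buckets[subject].append(image_id)
    return {subject: sorted(ids) for subject, ids in buckets.items()}


# 50 images that all contain a potato; the first 25 are also on a table.
image_subjects = {
    f"img_{i:03d}": ["potato", "table"] if i < 25 else ["potato"]
    for i in range(50)
}

buckets = build_subject_buckets(image_subjects)
print({subject: len(buckets[subject]) for subject in sorted(buckets)})
# -> {'potato': 50, 'table': 25}
```

An image can land in more than one bucket, which is the point: the same 25 images sit in both "potato" and "table".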

By preparing the VLM system using my particular finetune of VLM, you get access to a simple converter or a simple prompter. It's not the best, mind you. 0.8b isn't very smart and the training is limited to mostly what the model already knew, but it definitely works.

Why?

It PROVIDES a DIRECT anchor to a topic, something MOST diffusion models lack. You don't have to take my word for it; play with the prototypes.

The current prototypes

https://civitai.red/models/2639332/bigliminal-json

Even without subject bucketing, the model manages to retain Anima control using the bigliminal dataset for 4-5 epochs, JSON trained on roughly 1000 images.

These aren't using subject bucketing, as the structure for a much larger system WILL REQUIRE multi-concept.

JSON data

https://huggingface.co/AbstractPhil/Qwen3.5-0.8B-json-captioner

Using my trained version of QWEN I have produced a great deal of data at 1024x1024 extracted from qwen.

Data Type

QWEN-image lightning extractions, 90,000 of them give or take. The model ought to learn some real data and differentiation capacity.

Catalyst; LORA

Standard lora will be enough for this one. Nothing special required.

With this model I made this dataset

We'll be training 1000 images from the qwen 90k set as a preliminary.
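For anyone who wants to pull a similar preliminary slice, a reproducible subset can be picked with a seeded sample. This is only a sketch under assumptions: the helper `pick_subset`, the seed, and the exact total of 90,000 are placeholders (the set is "give or take" that size), and it says nothing about how the 1000 images here were actually chosen.

```python
import random


def pick_subset(total, k, seed=0):
    """Return k distinct row indices in [0, total), reproducible for a given seed."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return sorted(rng.sample(range(total), k))


# Assumed sizes: roughly 90,000 rows in the full set, 1,000 for the preliminary train.
subset = pick_subset(90_000, 1_000)
print(len(subset), subset[0] >= 0, subset[-1] < 90_000)
```

Fixing the seed means a later, larger run can reuse or exclude exactly the same rows.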

https://huggingface.co/datasets/AbstractPhil/diffusion-pretrain-set-ft1

The process

Two passes: first we run the VLM-prepared JSON, and then the AnimeTIMM-prepared JSON.

VLM Data

The VLM itself is not very smart; qwen 3.5 0.8b isn't the most intelligent model. I think its score is roughly 4 on the benchmark, while many 4b models score 12, which doesn't inspire confidence in terms of standard benchmarking.

We aren't here to run benchmarks though. This is a practical solution to a problem: subject association has no foundationally strong controllers.

AnimeTIMM Data

Using one of the newer animetimm ViTs, I prepared the tags, then passed them clean through the VLM for JSON conversion. The hallucination is not too bad and the results are objectively good, even with a bit of drift.

Completed Pass

Both prompt sets have different buckets, and those buckets are entirely different formats of differentiation targets.

Why multi-concept for >1000 images?

The model itself is being targeted because it's already pretrained and based on cosmo 2b. Running a big chunk of images through the system simultaneously will make the model create a cascade of differentiation changes towards the JSON, more than likely leading to corruption and faults over time. Smaller trains of fewer than 1000 images are fine, but larger ones require stronger pieces of information.

Faulted trains in the past with other models

SDXL and I have a long track record. I've fed models a million or more images for multiple epochs only to have the model fall apart down the chain.

SDXL is a bag-of-tokens model, meaning the grab-bag system isn't necessarily the most potent utility for plain English prompting. MANY of my images were prompted in plain English, only for me to see a massive number of faults along the chain.

Remedy for large multi-subject balancing isn't simple

It ought to work though. The structure ought to form a series of differentiations that provide useful pathways, rather than collapsing or destroying them. This is reinforced by the JSON itself, and the model will be promptable in plain English or Booru style when the train is done.