Brainstorm Time.
Let's see... say we use, ohh... 200 million images from LAION-2B.
Alright, that's 200 million images; they can be of all kinds of sizes, shapes, and quality levels.
Let's prune that down; assume, like almost all filter systems do, that we prune everything falling outside our 512x1216 bucketing range. No rescaling, just pruning.
Alright, let's just make an assumption about how much data we have left, since I don't have LAION's statistics handy right this minute.
Let's assume we're left with 40 million usable images after caption filtering, grotesque checks, absurdity checks, complexity checks, and so on.
So we're left with 40 million usable images out of our 200 million candidate images from LAION-2B.
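As a rough sketch of that pruning pass (the exact rule, that both sides must land inside 512-1216, is my assumption):

from PIL import Image

# Hypothetical hard filter: keep only images whose sides both fall inside
# the 512-1216 bucketing range. No rescaling; anything outside gets pruned.
MIN_SIDE, MAX_SIDE = 512, 1216

def passes_size_filter(path: str) -> bool:
    try:
        with Image.open(path) as img:
            width, height = img.size
    except OSError:
        return False  # unreadable or corrupt files get pruned too
    return MIN_SIDE <= width <= MAX_SIDE and MIN_SIDE <= height <= MAX_SIDE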
Well, that's not very much, is it? LAION is supposed to be the end-all of datasets, right? Shouldn't it have all the good data?
Well, that's the problem with casting a wide net. You catch the fish, and the dirt, and the rocks, and the plankton, and the boat, and so on. You tear the net and only end up with a little bit of what you were actually after.
ALRIGHT, so now you understand why I use BOORUS.
BOORUS are often at least SEMI-CURATED. They carry substantially more information than you'd expect, tied to tags, images, and concepts.
They also have the added benefit of just.. y'know, making fun stuff, and producing really fun information. Most people can at least ENJOY some tasteful booru images, and many of us enjoy the less... tasteful images as well, the more depraved y'know. Heathen hours.
Alright, let's say we get... 10 million from the next 200 million, and so on. Let's say we end up with a total of about 120-150 million useful images from LAION-2B.
Well... that's... comparable to the entire scraped pool of booru, cosplay, and AI-generated images.
Okay, we're cooking now. We have some numbers.
Let's now do some math on the captioning, shall we?
LAION captioning is dogshit.
Yeah, we know, they know, everyone knows. It was necessary though, because it was built on CLIP ViT, hand-captioned datasets like ImageNet, and the contributions of many colleges, universities, and research firms.
The system was built on early-stage sourcing, and they had no idea what they were doing yet. They were building something fantastic, and it turned into a massive, society-altering achievement. They didn't know how good or bad these things would be at understanding images and captions, at translating between them, or at any of the specifics yet. It was all heavily theoretical and highly experimental. Diffusion in general is sorcery in many ways.
They are truly pioneers and they have my utmost respect.
THAT BEING SAID, we're onward to the future.
There are a multitude of other recaptions of much, if not all, of LAION. There's far more information and data out there than you'd expect, especially when sourcing good datasets.
Let's recaption LAION then.
Alright, let's say we get a full tagging cluster going. We somehow get a group to let us use their 1200-A100 cluster, each GPU with 20 gigs of local disk.
Alright, we've now approached a new logistics problem that I don't plan to address here; let's go straight through and treat these as 8-GPU docker nodes today.
Let's call our hypercluster 150 8-GPU docker nodes; each node allocated 160 gigs of disk space (8 x 20) and 256 gigs of RAM (8 x 32).
For convenience, let's assume it's all clumped into one interface:
workspace contains a cloned, read-only, prepared network drive for rapid access
model contains the active model data to be distributed to and grabbed by the processes
output contains the saved model checkpoints, blocks, and whatever else you need to retain; not to mention the most important elements, like optimizer states and the per-epoch record of trained images, to ensure you aren't actually retraining the same images over and over.
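A minimal sketch of that per-node layout, with the cluster assumptions baked in as constants (paths and names are made up for illustration):

from pathlib import Path

# Hypothetical per-node layout matching the assumptions above.
GPUS_PER_NODE = 8
DISK_GB_PER_NODE = 160        # 8 x 20 GB local disk
RAM_GB_PER_NODE = 256         # 8 x 32 GB
NODES = 150                   # 1200 A100s total

ROOT = Path("/mnt/node")
WORKSPACE = ROOT / "workspace"   # read-only clone of the prepared network drive
MODEL = ROOT / "model"           # active model weights pulled by the worker processes
OUTPUT = ROOT / "output"         # checkpoints, states, per-epoch trained-image logs

for d in (WORKSPACE, MODEL, OUTPUT):
    d.mkdir(parents=True, exist_ok=True)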
Alright, badass, let's do the math now.
Let's conservatively assume we can caption an average of 12,000 images per computation hour per 8-A100 node, accounting for prep, transfer, resizing, scaling, recoloring, filtering, and so on.
That's not bad... Well... then you do the large-scale math.
Okay, that's roughly 1,800,000 images per real hour across the mega cluster (12,000 x 150 nodes).
KEEP IN MIND THIS IS ALL PURELY HYPOTHETICAL.
That's... NOT VERY MANY!
Well, compared to what we actually need, it isn't: we can only caption 1.8 million images per hour... and we need to caption, say, 400 million.
That's 222 real hours, even with this mega cluster.
Alright, now that we have a real-hours count, let's multiply it by the hourly cost of our 1200 GPUs; and let's say each A100 costs $1 an hour to rent, which is very conservative.
1200 x 222 x $1 = $266,400
That's roughly $266,000 just to caption 400 million images, barring errors, failures, and filtering.
Let's say that after filtering, we're left with less than 320 million images.
That's about 1,500 images per $1. You're paying $20 a minute to caption these, and you need to run the cluster for 222 hours to complete the pass.
1200 total A100s -> $1 per hour each = $1,200 an hour
1,800,000 images per hour -> 1/1500th of a dollar per image
1,800,000 / 1200 = 1,500 images captioned per hour per A100.
400,000,000 / 1,500 = ~266,667 total A100 hours at $1 each.
$280,000 or more due to additional costs, time lost, errors, hardware failures, etc.
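If you want to poke at those numbers, here's the arithmetic as a throwaway calculator (the function name and structure are mine; the rates are just the assumptions above):

def caption_cost(images, imgs_per_gpu_hour=1500, gpus=1200, dollars_per_gpu_hour=1.0):
    # Back-of-the-envelope captioning cost for the hypothetical cluster above.
    cluster_rate = imgs_per_gpu_hour * gpus           # images per real hour
    real_hours = images / cluster_rate                # wall-clock hours
    gpu_hours = real_hours * gpus                     # billable A100 hours
    cost = gpu_hours * dollars_per_gpu_hour
    return real_hours, gpu_hours, cost

hours, gpu_hours, cost = caption_cost(400_000_000)
print(f"{hours:.0f} real hours, {gpu_hours:,.0f} A100 hours, ${cost:,.0f}")
# -> roughly 222 real hours, ~266,667 A100 hours, ~$266,667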
Well that's just... fucking horrible.
Yeah, I know. As it stands, we have an hour count, which is
1200 * 222 A100 computation hours:
266,400 A100 computation hours to transfer, caption, prune, and otherwise process 400 million images.
Alright, let's say we get some optimizations done to our software and cut that down to less than half. We've doubled our value and we're still paying on the order of $130,000, roughly a thirtieth of a cent per image.
Alright... Well... this is absurd.
(After-optimizations formula:)
400,000,000 / (24,000 images per hour per node * 150 nodes) = ~111 hours of real time renting 1200 A100s.
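Run through the same hypothetical calculator from above with the doubled per-GPU rate:

# Doubled throughput after the hypothetical optimizations: 3,000 images per A100-hour
hours, gpu_hours, cost = caption_cost(400_000_000, imgs_per_gpu_hour=3000)
print(f"{hours:.0f} real hours, ${cost:,.0f}")
# -> roughly 111 real hours, ~$133,333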
You're essentially burning money on a pyre at that point. Someone like me could never hope to recaption LAION without substantial assistance; I most definitely could do it, I would just need the resources.
Let's train our Flux now!
Alright, we have our 320 million captioned images, sourced from every dataset throughout the internet, clumped into one fat fucking heap.
Let's say we jam all the boorus in there too, and we're up to 400 million.
How are we going to train this thing?
Well, first off, we'd need to use something like Diffusers, as it's substantially more stable and reliable than the majority of the alternatives. Kohya is out of the question, and so is hand-sourcing or hand-building everything yourself, unless you have multiple experts lined up to assist.
I'm good, but I'm no expert, which means I'd need experts on my side if I were to build software like that; which is why even experts pick a robust, widely used package like Diffusers or something of that nature.
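A bare-bones sketch of what the distributed side could look like with accelerate (which Diffusers' training scripts lean on) doing the plumbing; the model and data here are stand-ins, nowhere near a real Flux training loop:

import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator

# Stand-ins: swap in the real Flux transformer, VAE latents, and T5/CLIP embeds.
model = torch.nn.Linear(64, 64)      # placeholder for the diffusion backbone
dataset = [(torch.randn(64), torch.randn(64)) for _ in range(1024)]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = DataLoader(dataset, batch_size=16)

accelerator = Accelerator()          # handles the multi-node / multi-GPU plumbing
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for latents, target in dataloader:
    pred = model(latents)
    loss = torch.nn.functional.mse_loss(pred, target)
    accelerator.backward(loss)       # replaces loss.backward() under DDP/FSDP
    optimizer.step()
    optimizer.zero_grad()

Launched with accelerate's launcher across the nodes, the same script scales from one GPU toward the full 1200 without the training loop itself changing.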
Let's now do the math for the 1200 80-gig A100s on this.
Well, let's say every single training option now adds overhead or time consumption to one degree or another. Let's say every single goddamn thing you tick in that distributed software pipeline trickles down to all 1200 GPUs for a simultaneous train.
Let's just assume bottlenecks don't matter here, for the sake of ease. Data can only transfer at a certain rate, so let's just handwave that particular bottleneck away.
Let's say we can run batch size 16 on each 80-gig A100. We'll use a very optimal setup, built entirely around a large structured system meant to distribute bucketed images across the cluster, with gradient syncs grouped by bucket size, and so on; so we can assume we have a large-scale bucketing system with a highly optimized subsystem.
Bucketing alone adds its own hardware requirements, software requirements, potential faults, potential failure modes, and so on; not to mention the possibility of cross-contamination between buckets.
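For reference, the core of an aspect-ratio bucketing pass is small; it's everything around it (sharding, caching, keeping batches bucket-pure across 1200 GPUs) that hurts. The bucket list and rounding rule below are my own placeholders:

# Hypothetical aspect-ratio buckets inside the 512-1216 range, as (width, height).
BUCKETS = [(512, 1216), (640, 1024), (768, 832), (832, 768), (1024, 640), (1216, 512)]

def assign_bucket(width: int, height: int) -> tuple:
    # Pick the bucket whose aspect ratio is closest to the image's.
    ar = width / height
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ar))

# Batches are then formed per bucket so every sample in a batch shares a shape,
# which keeps the distributed gradient sync uniform at each step.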
Ouch. This shit is complicated, dude.
Yeah, no shit. This is usually managed by teams of high-end engineers, programmers, mathematicians, researchers, and professors; not to mention the guys at the top tugging at everyone's heartstrings and cracking the whip because it's all taking too long for their pocketbooks.
Okay, let's say, for the sake of ease, you're just managing this, but you have guys above you. Hypothetically, where do we start?
Well, we start by making a plan.
If the goal is money, we should choose a valid research direction that can be monetarily beneficial; so you'd naturally want to hide the source locations, produce carefully curated, high-quality outputs, and ship a useful, interesting product for people to consume that can turn a profit.
Well, that's... not very interesting to me. It sounds like they'll just be jerking each other off and acting like they're making innovations, when in reality they'll just be recreating something that already exists, with bells and whistles, trying to capture a percentage of the market. Sound like game development to you? Sure sounds like it to me.
My goal is interest, intrigue, and developing new methods of rapid training.
Simply put, I'm broke, so I have to figure out creative ways to train without shelling out $30,000 a day in A100s just to watch a potentially faulty experiment go awry.
Back to FLUX; sorry, distraction. ADD.
Say you get the best, highest-quality setup: you choose the correct settings, prep and bucket correctly so everything is perfectly loaded into databases to draw from, and all the bottlenecks are solved ahead of time. After spending a month testing and ensuring everything works in the dummy environment, we can now move to the big-boy environment.
Historically, nothing ever runs smoothly on the first try. The larger the machine, the more faults can be present. All the testing in the world may not prepare you for the faults and failures that show up once you're in the big-boy billionaire's datacenter. Things that weren't accounted for become problems, and things that weren't problems in testing now present as inconsistent problems due to hardware limitations or faults.
Anything that can go wrong will go wrong.
Let's assume we spend a big chunk of time setting everything up in the testbed and everything works; then we dump it onto the big machine and it makes a grinding, clunking, smashing noise instead of working.
Well, that's when you need the professionals nearby and the experts who are capable of handling problems.
Back to training.
Let's say our setup can train 1800 samples per hour per A100, so 1800 * 1200:
2,160,000 samples per hour across the cluster of 1200 A100s.
HISTORICALLY, through RunPod, I've benchmarked 1k samples per hour per 4090, and 1800 per 80-gig A100 with BF16.
That doesn't mean it's accurate.
Full-precision training on A100s is better than you'd expect thanks to TF32 tensor cores, even if BF16 is still nominally faster; so let's be generous and assume our full-precision train runs a choice, perfect, absolutely beautiful 5000 samples per hour per A100, thanks to a higher-than-expected rate, high-end modern hardware, and perfect conditions.
Say 6,000,000 samples per hour total (5000 x 1200).
*raises finger* That's NOT VERY MANY, DUDE.
Uhhhh... okay.
Well, we have to assume we need 1 image per step per epoch, and we need to learn at a fairly high learning rate since we're starting from a ZERO Flux model.
Not to mention, the T5 is kind of a slow bitch, but it's required, so it would be best to cache and bucket those embeddings at the start of an epoch as well.
NOW YOU KNOW WHY PEOPLE DON'T TRAIN THE T5!
There's no fucking telling what will happen when you feed it so many images. It could very well fall apart in the first 5000 if you aren't careful. On top of that, you're now synchronizing more weights, not just the UNET, which scales up to more hardware and more time spent synchronizing changes across more sets of distributed RAM and hardware.
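For the frozen-T5 route, caching the text embeds ahead of time is roughly this (the checkpoint name, sequence length, and batch handling are placeholders, and I'm skipping the CLIP text encoder Flux pairs with T5):

import torch
from transformers import T5EncoderModel, T5TokenizerFast

# Placeholder checkpoint; a real Flux-style setup uses a much larger T5 encoder.
name = "google/t5-v1_1-small"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5TokenizerFast.from_pretrained(name)
encoder = T5EncoderModel.from_pretrained(name).eval().to(device)

@torch.no_grad()
def cache_t5(captions, out_path):
    tokens = tokenizer(captions, padding="max_length", max_length=256,
                       truncation=True, return_tensors="pt").to(device)
    embeds = encoder(**tokens).last_hidden_state    # (batch, seq_len, dim)
    torch.save(embeds.cpu(), out_path)              # reload at train time; T5 never runs again

cache_t5(["a hypothetical caption", "another one"], "t5_cache_000.pt")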
The finale.
In any case, lets say it's expensive as hell.
Let's assume we need... oh... a conservative 50 epochs over our 400 million images to complete a successful Flux knockoff.
We're looking at 20 billion samples.
Alright; 6 million samples per hour...
That's... ~3,333 real-time hours, and about 4 million dollars (3,333 hours x $1,200 an hour).
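And the closing arithmetic, same hypothetical-calculator style as before, for anyone who wants to poke at the assumptions:

def train_cost(images=400_000_000, epochs=50, samples_per_gpu_hour=5000,
               gpus=1200, dollars_per_gpu_hour=1.0):
    total_samples = images * epochs                    # 20 billion samples
    cluster_rate = samples_per_gpu_hour * gpus         # 6,000,000 samples per hour
    real_hours = total_samples / cluster_rate          # ~3,333 wall-clock hours
    cost = real_hours * gpus * dollars_per_gpu_hour    # ~$4,000,000
    return real_hours, cost

hours, cost = train_cost()
print(f"{hours:,.0f} real hours (~{hours / 24:.0f} days), ${cost:,.0f}")
# -> roughly 3,333 real hours (~139 days) and ~$4,000,000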