Hello all,
Wolfgang Black here - a member of the ML/AI team here at Civitai. Welcome to the third installment of Machine Learning Projects: Classifying Content into Ratings. In this series, I'll be discussing how we've tackled the problem of trying to classify images and other media into movie-like ratings for our users to browse.
Despite this article detailing the work we've done for media classification, I wouldn't call this a solved problem! I've talked with other machine learning (ML) scientists at various social media companies who have expressed frustrations at existing ML solutions. The truth of the matter is, human moderators are needed across all levels and forms of media. Even at Civitai, we're still demoing this work in the backend as we try to tune our models to achieve better performance for labeling media.
This series of articles shares insight into our efforts at Civitai and their results, especially since deploying these solutions may very well affect the site! The last article sparked some great conversations and insights from our readers, and I invite that again below. Have ideas on how to tackle this problem? Want to try it on your own? Check out the dataset and engage in the comments - let's discuss!
This problem really lends itself to a multimodal solution. The majority of images shared on Civitai are either generated with our onsite generator or uploaded to a user's profile after being generated locally. In most cases, users share the generation information - specifically the prompt used in generation. We also use ML models, like WDTagger, to add tags to our media. Because of this, we have two types of text data in addition to the image data itself. To tackle this problem, we tried two main solutions: a mixture of 'experts' and a multimodal model. Both are composite solutions which take in the outputs of the hidden layers of other ML models. To create those solutions, we needed to do some traditional deep learning (DL) using our single modalities. In this article, I'll cover our efforts in Natural Language Processing (NLP); the mixture and multimodal models follow in the next installment.
Like the second article, this article will cover the models we explored, their architectures, and how these models performed. I'll try to include some images and tables, and I'll also link references when important. This isn't meant to be a comprehensive literature review or to help people unfamiliar with the models master the concepts. It's more a project roadmap of how we tackled this problem. I'm always happy to discuss in the comments if there are questions. Before we get into the models though, let's remind ourselves what we're talking about at a high level.
Seen this intro before? Check out the new stuff below in The Natural Language Models.
What are we trying to solve?
At Civitai we receive hundreds of thousands of images through either our generation pipeline, where users experiment with different checkpoints and LoRAs, or our ingestion pipeline, where users can share their locally generated images. We also receive a ton of data from users in training datasets, which users can use to train their own LoRAs onsite. When users want to make this data public and share it with the community, we at Civitai have to make sure we serve those images up to the appropriate audience. One way we do this is with image ratings that mimic movie ratings. Users can control the nature of the content they'll see by toggling specific ratings on or off.
To keep our users' browsing experience pleasant, we want to make sure our ratings are accurate and representative of the content. Currently we use a system of tags: each tag carries a numeric value, and each rating has a maximum allowable threshold. If the tags push the accumulated value over a rating's threshold, the rating graduates to a more restrictive level. However, we're interested in seeing whether we can run a smaller or simpler architecture than a tagger to determine the rating. As such, I've tried a few different architectures, modalities, and strategies to build an end-to-end pipeline to correctly classify content with movie ratings.
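To make the current approach concrete, here is a minimal sketch of that tag-score thresholding. The tag weights, rating names, and thresholds are made-up placeholders for illustration, not our production values.

```python
# A minimal sketch of the tag-score approach described above.
# Tag weights and rating thresholds are hypothetical, for illustration only.

RATING_THRESHOLDS = [("PG", 0), ("PG13", 3), ("R", 6), ("X", 9), ("XXX", 12)]
TAG_WEIGHTS = {"weapon": 2, "blood": 4, "swimsuit": 3, "nudity": 10}

def rate_from_tags(tags: list[str]) -> str:
    """Sum the tag weights and return the most restrictive rating whose
    threshold the accumulated score has crossed."""
    score = sum(TAG_WEIGHTS.get(tag, 0) for tag in tags)
    rating = RATING_THRESHOLDS[0][0]
    for name, threshold in RATING_THRESHOLDS:
        if score >= threshold:
            rating = name
    return rating

print(rate_from_tags(["weapon", "blood"]))  # -> "R" under these toy weights
```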
To approach this problem, we built mixture models, which contain an odd number of single-modality models and implement a voter layer. The single-modality models each classify either the image or the text; the text could be just the prompt, or the prompt with the ML tags. Each model classifies its assigned modality and passes its prediction to the voter, which selects the most-voted-upon index. If there is a tie, the more conservative option - that is, the higher NSFW level - is selected. We also experimented with multimodal models, which take the pre-logit layers from different ML models and concatenate their outputs. These concatenated outputs become the inputs of a simple Multilayer Perceptron (MLP), which then outputs logits; we select the highest logit to determine the media's classification. Readers should know that the multimodal model requires more training, as each individual model was trained for our task and then the MLP was further trained on top.
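For readers who prefer code, here is a minimal sketch of the voter logic, assuming each single-modality model returns a class index and that a higher index means a more restrictive rating. The rating names and example predictions are placeholders.

```python
import numpy as np

RATINGS = ["PG", "PG13", "R", "X", "XXX"]  # ordered least -> most restrictive

def vote(predictions: list[int]) -> int:
    """Majority vote over an odd number of single-modality predictions.
    Ties are broken toward the more conservative (higher NSFW) rating."""
    counts = np.bincount(predictions, minlength=len(RATINGS))
    best = counts.max()
    tied = np.flatnonzero(counts == best)
    return int(tied.max())  # highest index among tied classes

# e.g. image model says R, prompt model says X, prompt+tags model says X
print(RATINGS[vote([2, 3, 3])])  # -> "X"
```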
This article focuses on the single modality NLP models, their training, and performance.
The Natural Language Models
Language is a huge part of diffusion modeling (DM) - especially when we consider why it was so easily and widely adopted by creatives. The idea that you can tell a machine what you want to look at, hear, or watch is appealing to pretty much everyone. Since natural language is the interface - the prompting language - for DMs, it makes sense that we should be able to use this modality to classify the content that comes out of the DM. However, anyone experienced with generating images knows that the models can produce things outside the context of the prompt, making it difficult to fully rely on the prompt modality as a single point of data.
The community has also largely rallied behind the WD tagger model - so much so that other models have included the WD tags as keywords or unique tokens to drive specific generations. This means the tags contain specific semantic information that users recognize and can use to prompt, as well as to search and categorize on their own. As such, we include the tags in our text modality, giving us additional data. Now that we have our text modality data, we can start to consider different types of NLP models.
Similar to the CV efforts, we started simply and quickly ramped up to modern DL architectures. We looked at Long Short-Term Memory (LSTM) recurrent networks, which are small and very simple NLP models. These models are easy to train and are typically used to check whether there is some semantic pattern to the data - something we already know is true in this case, since our modalities are either generated by ML models or used directly in the diffusion process.
After the LSTM, we focused our experiments on BERT models. BERT, or Bidirectional Encoder Representations from Transformers, has been used widely for text classification, summarization, and generation, and is foundational to the current state of NLP and large language models. Within the BERT class of models, we explored DistilBert and RoBERTa by adjusting the text modalities, allowing us to compare the datasets and the different-sized architectures.
Long Short-Term Memory
The LSTM is a type of recurrent neural network (RNN) architecture. The RNN architecture allows for sequential data processing, giving the LSTM the ability to process data over multiple time steps. RNNs also contain a memory in the form of a hidden state, allowing the architecture to consider previous inputs alongside the current input.
A simple example to consider is listening to a sentence. In English, a sentence may start with its subject, include an action word, and then place the object of that action much later - for instance: "Yesterday I, along with a ton of my friends from school and summer camp, played soccer at the docks downtown in the new park". The RNN can be thought of as having the ability to connect the subject of the sentence to the action it performs, despite the two appearing in different parts of the sentence.
However, traditional RNNs struggle with long-term dependencies - the ability to store enough information to make semantic connections across long spans of text. As such, LSTMs were developed for long-sequence or time-series data, like that found in natural language and speech recognition.
The key component of the LSTM is the memory cell, which is superior to the vanilla RNN at maintaining information over longer contexts. Each LSTM unit contains three gates: input, forget, and output. These gates determine how information flows through the LSTM and which data is saved as context for future LSTM units. This makes them especially useful for tasks like named entity recognition and part-of-speech tagging.
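As a rough illustration of the kind of model this is, the sketch below wires a PyTorch LSTM into a 5-class rating classifier. The sizes, hyperparameters, and class count are illustrative assumptions, not the configuration we actually trained.

```python
import torch
import torch.nn as nn

class LSTMRatingClassifier(nn.Module):
    """Embed token ids, run them through an LSTM, and classify the final
    hidden state into one of five ratings. Sizes are placeholders."""

    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 hidden_dim: int = 256, num_classes: int = 5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)      # (batch, seq, embed)
        _, (hidden, _) = self.lstm(embedded)      # hidden: (1, batch, hidden)
        return self.head(hidden[-1])              # logits over ratings

logits = LSTMRatingClassifier(vocab_size=20_000)(torch.randint(1, 20_000, (4, 64)))
print(logits.shape)  # torch.Size([4, 5])
```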
As a crucial step in NLP tasks, we implemented tokenization. Initially, we used a simple word-level tokenization approach, assigning integer values to each unique word in our training data. We included special tokens for sentence start, sentence end, and unknown words. However, despite having a decent-sized dataset for fine-tuning, our vocabulary didn't cover all words used in the prompts on our site. While we could have expanded our vocabulary to include all possible words, we found that the LSTM struggled to learn semantic relationships, particularly for words not well-represented in our labeled dataset.
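For reference, a stripped-down version of that word-level approach might look like the sketch below, with hypothetical <pad>, <unk>, <sos>, and <eos> special tokens; it is not our exact implementation.

```python
from collections import Counter

SPECIALS = ["<pad>", "<unk>", "<sos>", "<eos>"]

def build_vocab(texts: list[str], min_count: int = 2) -> dict[str, int]:
    """Assign an integer id to every word seen often enough in training."""
    counts = Counter(word for text in texts for word in text.lower().split())
    words = [w for w, c in counts.items() if c >= min_count]
    return {tok: i for i, tok in enumerate(SPECIALS + words)}

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Wrap the text in sentence-start/end tokens; unknown words map to <unk>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(w, unk) for w in text.lower().split()]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]

vocab = build_vocab(["a castle on a hill", "a castle at night, matte painting"])
print(encode("a castle under the sea", vocab))  # unseen words collapse to <unk>
```

Note how every word not represented in the training vocabulary collapses to a single <unk> id, no matter how semantically rich it is - exactly the weakness described above.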
This limitation led us to seek a more sophisticated tokenization method and to fine-tune a more advanced model with recent State-Of-The-Art (SOTA) performance, aiming to better capture the semantic nuances in our data - which brought us to the aforementioned BERT models. Below, we'll cover the transformer-based models we experimented with and ended up utilizing in our multimodal and mixture models.
Readers should note that the LSTM's performance will not be shared: it performed very well in training and validation, just not in testing. As such, it was not used in the construction of the multimodal systems and will be set aside moving forward.
BERT
The BERT model was introduced in late 2018 by Google and was notable for its dramatic improvement over previous SOTA methods. BERT utilizes a transformer-based architecture with bidirectional self-attention, allowing it to consider the full context of a word and its position within a sequence. This bidirectional approach enables BERT models to capture semantic structures more effectively than standard unidirectional models.
We discussed transformers as an architecture in the previous article; however, it's important to note that the main benefit of the transformer architecture comes from the attention mechanism. The introduction of attention addressed the issue of variable context lengths by enabling researchers to indicate that padding tokens should be ignored. By applying zeros in the attention layer for padding tokens, the model effectively disregards them, simplifying the mathematical operations in the embedded space as the embeddings pass through the model. This innovation significantly improved the handling of varying input lengths in NLP models. The expansion of the attention mechanism to be bidirectional in BERT allowed the model to learn positional representations and semantic structures that are more complicated and complete than with the original unidirectional attention.
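To make the attention-mask idea tangible, here is a tiny example using the stock distilbert-base-uncased tokenizer from HuggingFace. The prompts are made up; the point is simply the zeros the tokenizer emits over the padded positions of the shorter input.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
batch = tokenizer(
    ["a red fox in the snow",
     "portrait of an astronaut, studio lighting, 85mm, detailed"],
    padding=True,            # pad the shorter prompt up to the longer one
    return_tensors="pt",
)
print(batch["input_ids"].shape)
print(batch["attention_mask"])   # 1 for real tokens, 0 for padding
```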
The pre-training process, which included masked language modeling and next sentence prediction tasks, allows the architecture to develop a deep understanding of language context. This results in highly effective contextualized embeddings. After pre-training, BERT models can be fine-tuned by adding an additional layer, which can be adapted to a wide range of NLP tasks - including question answering, sentiment analysis, and text classification - often achieving SOTA results with minimal task-specific modifications.
This advance in NLP led to a variety of BERT-based architectures including research in how to make the model more computational efficient while trying to maintain its performance.
DistilBert
The DistilBert model is based on the BERT architecture but has 40% fewer parameters while maintaining 97% of the model's performance. It was created through knowledge distillation, also known as teacher-student training. In this method, the teacher model outputs probabilities, rather than single class predictions, which are compared to the output probabilities of the student network in a distillation loss. A standard loss is also used, comparing the outputs of the student model to the hard labels of the data. These losses are added into a combined loss, which is then used to update the student network's weights. This process is shown below in the next figure. Despite the student model's smaller size, it learns to generalize with fewer parameters while maintaining most of the capabilities of the larger model.
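As a sketch of that combined loss - not DistilBert's exact published recipe; the temperature and weighting here are illustrative assumptions - the distillation objective can be written roughly as:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft-target loss: match the teacher's softened probability distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard-label loss: standard cross-entropy against the true labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    # Combined loss used to update the student's weights.
    return alpha * soft_loss + (1 - alpha) * hard_loss

loss = distillation_loss(torch.randn(8, 5), torch.randn(8, 5),
                         torch.randint(0, 5, (8,)))
```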
Computational efficiency is an often understated metric, especially in deployment. Since we plan on using this model as one of many within a mixture, or as a feature generator for a multimodal model, the reduced cost of running DistilBert made it an attractive option for our NLP methods - especially since the model maintains great performance on the NLP leaderboards. For our experiments we fine-tuned two DistilBerts for text classification against our movie ratings. Both models saw the prompts, but one was also provided the tags.
It's important to note that some prompts are very long. Traditional text embedding models used in generative AI may only consider the first 75 tokens of the prompt, or they may divide longer prompts into sequences of 75 tokens and concatenate the sequences together; for the DistilBert models, however, we set the max token length to 512. This means that for very long prompts, where the appended tags get truncated away, the DistilBert with Tags effectively converges to the same solution as the DistilBert without Tags. The tokenizers used in the generative models are different from the tokenizers used in the BERT models, so the tokens may not line up one to one. For this work we used the base tokenizers for each model as they are found on HuggingFace.
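To illustrate the truncation point, here is a sketch that assumes the tags are simply appended to the prompt as comma-separated text (an assumption for illustration). With truncation at 512 tokens, a sufficiently long prompt leaves no room for the tags and both inputs end up identical.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# A deliberately over-long, made-up prompt (~600 tokens) and a few tags.
prompt = "masterpiece, best quality, a knight standing in a ruined cathedral, " * 40
tags = ["armor", "cathedral", "solo", "standing"]

prompt_only = tokenizer(prompt, truncation=True, max_length=512)
prompt_plus_tags = tokenizer(prompt + ", " + ", ".join(tags),
                             truncation=True, max_length=512)

# For long prompts the two inputs become identical after truncation.
print(prompt_only["input_ids"] == prompt_plus_tags["input_ids"])  # True
```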
Similarly to how researchers worked to make BERT more computationally efficient, research was done to improve the performance of the model by varying the training methodology. This created the RoBERTa model, which we also explored due to its superior text classification performance.
RoBERTa
The Robustly Optimized BERT Approach, or RoBERTa, was developed by Facebook AI (now Meta AI) in 2019. It shares the same architecture as BERT; however, its pre-training phase is fairly different. The training process removes the next-sentence prediction objective used in BERT and uses a different masked language modeling scheme.
During pre-training in BERT, masked language modeling is applied in a static way. Of the tokens fed into the model during training, some 15% are marked for masking: 80% of these are replaced with the [MASK] token, 10% are left unchanged, and 10% are replaced with random words. This lets the model predict the original word under the mask based on the context provided by the unmasked tokens.
This method is used in RoBERTa as well, with a single important distinction: the pre-training masking is dynamic. That means that at every epoch, the 15% of tokens to be masked are randomly re-selected. This is different from BERT, where the same 15% of tokens stay masked throughout training. Dynamic masking increases the diversity of the masked tokens and prevents the model from memorizing masked positions; it also lets the model see the same sequence masked differently each time. RoBERTa also relies on a larger dataset with larger batch sizes.
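A rough sketch of what dynamic masking looks like in code is below: positions are re-drawn on every call (i.e., every epoch) with the 80/10/10 split described above. Special-token handling and the real RoBERTa data pipeline are omitted; this is purely illustrative.

```python
import torch

def dynamic_mask(token_ids: torch.Tensor, mask_id: int, vocab_size: int,
                 mask_prob: float = 0.15):
    labels = token_ids.clone()
    # Re-sample which positions are masked every time the sequence is seen.
    selected = torch.rand(token_ids.shape) < mask_prob
    labels[~selected] = -100                      # only score masked positions

    roll = torch.rand(token_ids.shape)
    masked = token_ids.clone()
    masked[selected & (roll < 0.8)] = mask_id     # 80%: replace with [MASK]
    random_pos = selected & (roll >= 0.8) & (roll < 0.9)
    masked[random_pos] = torch.randint(0, vocab_size, token_ids.shape)[random_pos]
    # remaining 10%: leave the original token in place
    return masked, labels
```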
RoBERTa's dynamic masking introduces more variability into the training process compared to BERT's static masking. However, this potential increase in variance is effectively leveraged as a strength through the use of larger datasets and batch sizes. These allow the model to encounter a wider variety of masked patterns across many examples, enabling it to learn more robust and generalizable representations of the underlying semantic structures in the data. The combination of dynamic masking with increased training data and larger batch sizes contributes to RoBERTa's improved performance over BERT.
Aside from the dataset and pre-training step, the RoBERTa model has the same architecture and number of parameters as the BERT model. For our work here, we explored RoBERTa-large, which has around 350 million parameters and is ~3x larger than the BERT/RoBERTa base.
Unlike the DistilBert models, we trained a single RoBERTa model on the prompts only. This was partly due to training costs, but also because we wanted a mixture - or features for the multimodal models - that could be used in both our generation pipeline and our ingestion pipeline. The main difference between these is that in the generation pipeline the images are not yet tagged. We also wanted to explore the different architectures and the effects of the different modalities/datasets on overall performance.
Results
Above we covered the various architectures we utilized for NLP in this work. We also experimented with combining prompts and tags, text cleaning, fine-tuning from various pre-trained weights, and different training techniques like early stopping and prioritizing specific metrics over others. Just like in the previous article, we'll present the results of these efforts by reporting metrics like class accuracy and f1 score. We'll also look at Down Class Misclassification (DCM), an important metric for this project as it highlights misclassifications with large differences in content rating, like an X rated image being scored as PG13.
To get these performance metrics we applied a test set. Unlike the training dataset, this test set is some 12k images, the bulk of which have not been publicly shared. These images have all been manually reviewed by either the head of our moderation team, our steadfast CEO, or myself, or are images that did not make it into any of the training sets due to downsampling.
The metrics we'll report out are derived metrics based on four parameters: True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN). These concepts make up the base confusion matrix. To explain these concepts - let's consider a binary classifier which predicts class 0 or class 1. If the model predicts class 0 correctly, then we can say the model has made a TP prediction for class 0 - and a TN for class 1. If it predicts class 1 instead, it's a FN for class 0 and a FP for class 1. Thus every prediction falls into one of these categories across all classes. However, rather than considering each parameter for all classes we use summarized metrics. Below is a table with these summarized metrics and their definitions.
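For the code-inclined, here is how those per-class quantities fall out of a confusion matrix for our 5-class setup. The matrix values are made up purely to demonstrate the computation and are not our results.

```python
import numpy as np

RATINGS = ["PG", "PG13", "R", "X", "XXX"]
cm = np.array([[80,  5,  2,  1,  0],    # rows: true rating
               [10, 60, 15,  3,  1],    # cols: predicted rating
               [ 3, 20, 40, 12,  2],
               [ 1,  4, 10, 70,  8],
               [ 0,  1,  3,  9, 85]])

for i, name in enumerate(RATINGS):
    tp = cm[i, i]                      # correct predictions for this class
    fp = cm[:, i].sum() - tp           # other classes predicted as this class
    fn = cm[i, :].sum() - tp           # this class predicted as something else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # per-class "accuracy" in this sense
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name}: recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```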
The first metric we'll look at is class accuracy. In this table we can see that the larger RoBERTa architecture performs better than the DistilBerts - by around 10%. This performance gap is not surprising, but it highlights DistilBert's strength as a lightweight yet effective model. Between the two DistilBert models, the model with tags performs better across all classes, with the greatest increases in accuracy for PG13 and X. Similarly to the CV models, the R class underperforms, with accuracies less than 50%. If this were a binary classifier we'd call that worse than random, but since we have a 5-class system it's still better than random - though it does not come close to the performance we'd like. R is an especially difficult class, given that PG13 and R share a difficult boundary (as will be seen in the misclassification counts).
While accuracy is a great metric to consider, it's sensitive to example counts: the eval set has different counts for each class, so the relative TP counts can differ and greatly affect accuracy. As such we present the f1 score, which balances precision and recall in identifying positive examples of each class. Here we see that the DistilBert with tags performs better than RoBERTa - across every class. This model discriminates well in what it assigns to each class, making sure not to over-select. This means the DistilBert+Tags model is the superior model both for positive identification and for computational performance.
Since we're interested in understanding how these models perform so we can design our mixture and multimodal models, we want to understand the counts of misclassifications and how severely the models mislabel down class. It's a somewhat smaller concern to see, say, an apple labeled as R or above. However, it's not acceptable to see something graphic on the front page or when a user wants to search PG images. Since we're interested in using these models as features for our other model architectures, I'm only going to report on the DistilBert+Tags and the RoBERTa.
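As a sketch, DCM can be read straight off a confusion matrix whose classes are ordered from least to most restrictive: every entry below the diagonal is a prediction that was less restrictive than the truth. The matrix here is the same made-up example as above, not our results.

```python
import numpy as np

RATINGS = ["PG", "PG13", "R", "X", "XXX"]
cm = np.array([[80,  5,  2,  1,  0],
               [10, 60, 15,  3,  1],
               [ 3, 20, 40, 12,  2],
               [ 1,  4, 10, 70,  8],
               [ 0,  1,  3,  9, 85]])   # same illustrative matrix as above

def down_class_misclassifications(cm: np.ndarray) -> int:
    """With rows as true ratings (PG -> XXX) and columns as predictions,
    down-class errors are every entry below the diagonal: the prediction
    is less restrictive than the true rating."""
    return int(np.tril(cm, k=-1).sum())

print(down_class_misclassifications(cm))              # total DCM count
print(cm[RATINGS.index("X"), RATINGS.index("PG13")])  # e.g. true X scored as PG13
```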
What we can see in the table below is that the DistilBert+Tags has less DCM than the RoBERTa, something we might have gathered from its performance in the f1 tables. However, considering its slightly worse accuracy, it's likely this model has more up-class misclassification. The DistilBert+Tags model has significantly better numbers on misclassifications from X to PG and from X to PG13; it does, however, have slightly more misclassifications between R and PG13. Again, this boundary is blurry and can be very personal - even on the moderation team it can be hard to categorize a humanoid image between PG13 and R.
Unlike the CV models, there is a clear breakout among the NLP models: DistilBert+Tags. This model is around 40% smaller than the RoBERTa and, after being fine-tuned, seems to fit our use case much better. However, the challenge with using this model everywhere is that there are workflows where the tags either do not exist at the time of scoring or will never exist for the specific image. While we could always add the tagger to the pipeline, that would mean calling another large transformer-based model and losing the efficiency gain. As such, for our mixtures - and even in some use cases for the multimodal model - we'll utilize DistilBert AND RoBERTa instead of pairing the RoBERTa with DistilBert+Tags.
The astute reader may pause and think: 'wait, RoBERTa as well?'. The reason is that, when browsing the actual mislabeled media, RoBERTa picked up on some examples that the DistilBert+Tags did not, and vice versa. And while RoBERTa is a larger transformer, it's still computationally cheaper than the SWIN transformer used for tagging.
Conclusion
In this article we explored the various NLP architectures we used to classify the text modalities in our generative text2image problem. We explored LSTMs and two different BERT architectures, emphasizing computational efficiency and the ability to easily fine-tune and deploy the models. The main takeaways can be summarized as:
LSTMs, or any architecture where we need to develop our own tokenizer, are much less capable of handling nuance or new language than BERT-based architectures, mostly due to the masked language pre-training techniques
The BERT architectures perform fairly well across all classes but have a challenge with the PG13-R line
The DistilBert+Tags model is superior for class specific f1 scores and Down Class Misclassifications, even if the RoBERTa is better for class accuracy.
No single NLP model can truly solve this problem, which makes sense when we consider the randomness in generation inherent to the diffusion process
In the next article we’ll discuss mixture models and multimodal models, and their performance in media classification. I’ll do a quick refresher on the single modality model performances as a group and then show how combining these architectures either as voters or features for an MLP can improve performance to higher levels.
As I prepare for the next article, I'd love to hear about your experiences with NLP modeling. Are there any efficient small-language models you've used for similar tasks? Is RoBERTa still relevant, or should we be looking at newer transformer architectures?
I've kept this article high-level, but I'm happy to dive deeper into the mathematics or share code if there's interest. Your feedback will help shape the depth and focus of future articles.
Your insights are invaluable as we tackle these challenges in ML and AI. Let's shape the future of safe and enjoyable content across Civitai and the Gen-AI community!