Foreword
While training different models such as Embeddings, Hypernetworks and LoRAs, I am constantly confronted with abbreviations, names and concepts of various kinds. This glossary is a collection of the explanations I need while working on new models.
I will successively add explanations of these concepts for clarification purposes.
Glossary
ANN
An artificial neural network (ANN) is a computational model used in the field of machine learning that implements the principles of neural networks, inspired by the biological neural networks found in human and animal brains. The ANN consists of a large number of connected nodes, each of which performs a simple mathematical operation. The output of each node is determined by this operation and by a set of parameters specific to that node. By linking these nodes and carefully setting their parameters, very complex functions can be learned and computed.
Bias
Bias is a parameter in neural networks that allows a model to fit the training data better by shifting the activation function. It helps the model to learn the data’s underlying pattern.
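As a rough illustration, here is a minimal sketch of a single artificial neuron in Python (all names and values are made up): the bias simply shifts the weighted sum before the activation function is applied.

```python
import numpy as np

def neuron(x, w, b):
    # weighted sum of the inputs plus the bias, passed through a sigmoid activation;
    # the bias b shifts the activation independently of the inputs
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2])   # inputs
w = np.array([0.8, 0.3])    # learned weights
print(neuron(x, w, b=0.0))  # without bias
print(neuron(x, w, b=1.5))  # the bias shifts the output towards 1
```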
BLOB
In computer vision, a BLOB is an area of an image that differs from the surrounding areas in properties such as colour or intensity. The term BLOB refers to areas that are connected to each other and form a distinct pattern within an image, typically used to represent objects, features or regions of interest in object recognition and segmentation tasks. Blob detection methods aim to identify these regions by analysing properties such as shape, size and texture.
Bucket
A bucket is a virtual container. The training images used in a LoRA training do not have to be of the same size, but images of different sizes cannot be trained in the same batch. Therefore, it is necessary to sort the images into so-called buckets according to their size before the LoRA training starts. Similarly sized images are put in the same bucket and differently sized images are put in different buckets.
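A simplified sketch of this sorting step, assuming a folder of PNG training images named training_images (real trainers additionally resize images to the nearest bucket resolution):

```python
from collections import defaultdict
from pathlib import Path
from PIL import Image

def bucket_images(image_dir):
    # group training images by their (width, height) so that each bucket
    # only contains images of the same size and can be batched together
    buckets = defaultdict(list)
    for path in Path(image_dir).glob("*.png"):
        with Image.open(path) as img:
            buckets[img.size].append(path)
    return buckets

for size, paths in bucket_images("training_images").items():
    print(size, len(paths), "images")
```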
Checkpoint
A Checkpoint is a large pre-trained base model that is responsible for the core image generation. It has been trained on large datasets of images so that it is able to generate a wide variety of images on its own. The Checkpoint can be thought of as the main engine of image generation, capable of understanding and interpreting a range of text inputs to produce different image outputs.
Diffusion Model
In machine learning, diffusion models are a class of generative latent variable models. A diffusion model consists of three main components: the forward process, the reverse process and the sampling process. The goal of diffusion models is to learn a diffusion process for a given dataset so that the process can generate new images that are similarly distributed to the original dataset. A trained diffusion model can be sampled in many ways and with varying efficiency and quality.
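A minimal sketch of the forward (noising) process in the common DDPM formulation; the function name and tensor shapes are only illustrative, and the exact noise schedule varies between implementations:

```python
import torch

def forward_diffuse(x0, alpha_bar_t):
    # forward process: mix the clean (latent) image x0 with Gaussian noise,
    # where alpha_bar_t controls how much of the original signal remains
    noise = torch.randn_like(x0)
    x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1.0 - alpha_bar_t) * noise
    return x_t, noise

x0 = torch.randn(1, 3, 64, 64)   # stand-in for a clean image
x_t, noise = forward_diffuse(x0, torch.tensor(0.5))
```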
Dreambooth
DreamBooth is a deep learning technique that is used to personalise existing text-to-image models through fine-tuning. DreamBooth implementations can be applied to text-to-image models in general, allowing the model to produce more fine-tuned and personalised results after training with a small set of images of a subject. The disadvantage is the size of the generated model, which is in the range of gigabytes.
Embedding
Embedding is also known as Textual Inversion. Textual Inversion is a training technique for personalizing image generation base models with just a few sample images that you want the base model to learn. In this technique, the Text Embeddings are learned and updated to match the sample images you provide. The new Embeddings are tied to a specific word that you must use in the prompt.
→ see also Textual Inversion
Epoch
In the context of machine learning, and particularly in neural network training, the term epoch represents a basic concept. An epoch refers to one complete pass of the entire training dataset through the learning algorithm. In other words, when all the dataset samples have been exposed to the neural network for learning patterns once, one epoch is said to be completed.
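A minimal sketch of a training loop with dummy data, assuming PyTorch; each iteration of the outer loop is one epoch, i.e. one full pass over the dataset:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 4), torch.randn(100, 1))  # dummy data
loader = DataLoader(dataset, batch_size=10)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):                      # train for 5 epochs
    for inputs, targets in loader:          # one full pass over the dataset = one epoch
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch + 1} finished, last batch loss = {loss.item():.4f}")
```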
Euler
Euler is a so-called sampler. In the context of samplers, Euler is one of the simplest methods. This numerical method is based on ordinary differential equations (ODE) and eliminates the noise linearly in each step. Due to its simplicity, it may not be as accurate as we would like, but it is one of the fastest.
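The following is a generic sketch of the Euler method for an ordinary differential equation, not the exact sampler implementation used in Stable Diffusion; the decay example is only an analogy for the stepwise removal of noise:

```python
def euler_step(x, t, dt, f):
    # one Euler step for the ODE dx/dt = f(x, t):
    # follow the slope at the current point linearly for a small step dt
    return x + dt * f(x, t)

# toy example: dx/dt = -x (exponential decay, analogous to removing noise)
x, t, dt = 1.0, 0.0, 0.1
for _ in range(10):
    x = euler_step(x, t, dt, lambda x, t: -x)
    t += dt
print(x)
```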
FNN
A feedforward neural network (FNN) is, alongside the RNN, one of the two basic types of artificial neural networks; both types are characterized by the direction of the information flow between their layers. In an FNN the information flow is unidirectional, i.e. the information in the model only flows in one direction, from the input nodes via the hidden nodes (if present) to the output nodes, without cycles or loops. Modern feedforward networks are trained using the backpropagation method and are colloquially referred to as vanilla neural networks.
Hallucinations
In AI, especially with large language models, this refers to the generation of false or nonsensical information that is presented as fact. This is the case when an AI produces content that is not based on its training data or real-world facts.
Heun
Heun's Method is a sampling method. It is a numerical technique used to solve ordinary differential equations by predicting the next value based on the slope at the current point and a predicted slope at the next point. This method is often referred to as an improvement to Euler's method, as it enhances the accuracy of the basic Euler method by averaging the slopes, providing a better estimate of the function's behavior. However, it needs to predict the noise twice in each step, so it is roughly twice as slow as Euler.
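Again a generic sketch of the numerical method, not the exact sampler code; note the two evaluations of f per step, which is why it costs about twice as much as Euler:

```python
def heun_step(x, t, dt, f):
    # Heun's method: take a trial Euler step, then average the slope at the
    # current point and the slope at the predicted next point
    k1 = f(x, t)                 # slope at the current point
    x_pred = x + dt * k1         # Euler prediction
    k2 = f(x_pred, t + dt)       # slope at the predicted point (second evaluation)
    return x + dt * 0.5 * (k1 + k2)

x, t, dt = 1.0, 0.0, 0.1
for _ in range(10):
    x = heun_step(x, t, dt, lambda x, t: -x)
    t += dt
print(x)
```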
Hypernetwork
A Hypernetwork is a small neural network that modifies the cross-attention module of the U-Net noise predictor. It is similar to a LoRA or an Embedding in that it is a small model file used to modify the output of a checkpoint model through additional weights.
Image Space
In contrast to the latent space, the image space holds the images in uncompressed form.
Inpainting
Inpainting means that you can change selected areas within a picture. This allows you to add motifs to a picture, remove motifs and repair damaged areas.
Latent Space
The latent space is a space in which images exist in compressed form. Holding images in the so-called image space would require a large amount of memory. The solution is to use the so-called latent space: instead of working on images in a space as large as the image space, the images are first compressed into the latent space, which is smaller by a significant factor.
LLM
A large language model (LLM) is a model with an extremely large number of parameters that are adjusted during the training of the model. The number of parameters is typically in the range of at least a billion, often much more. Due to this extreme size, an LLM requires a lot of data and highly sophisticated computing capability for training. LLMs are usually based on the so-called Transformer architecture.
LoRA
Low-Rank Adaptation (LoRA) is a Parameter-Efficient Fine-Tuning (PEFT) method. This method was introduced by a team of Microsoft researchers in 2021. Since then, LoRA has become a popular approach for the fast fine-tuning of large language models, (stable) diffusion models and other types of AI models. A LoRA model is an extension to an existing large model.
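The core idea can be sketched in a few lines: the frozen weight matrix W of the base model is extended by a trainable low-rank product B·A. The dimensions and the scale parameter below are arbitrary illustrations; real implementations apply this inside the attention layers of the model.

```python
import torch

d, k, r = 768, 768, 8           # original dimensions and a small rank r
W = torch.randn(d, k)           # frozen weight matrix of the base model
A = torch.randn(r, k) * 0.01    # trainable low-rank factor
B = torch.zeros(d, r)           # initialised to zero so training starts from W

def lora_forward(x, scale=1.0):
    # only A and B are trained (d*r + r*k parameters instead of d*k);
    # their product is added on top of the frozen base weight
    return x @ (W + scale * (B @ A)).T

x = torch.randn(1, k)
print(lora_forward(x).shape)
```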
Loss
In short, diffusion models iteratively denoise an image to add detail to it. The loss measures the difference between the predicted noise and the actual noise added to the image. The lower the value of the loss, the better the result.
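A minimal sketch of this measurement as a mean squared error, with stand-in tensors instead of a real model prediction:

```python
import torch
import torch.nn.functional as F

# the actual noise that was added to the (latent) image in the forward process
actual_noise = torch.randn(1, 4, 64, 64)
# the noise the model predicted for the same noisy image (stand-in values here)
predicted_noise = actual_noise + 0.1 * torch.randn_like(actual_noise)

loss = F.mse_loss(predicted_noise, actual_noise)  # mean squared error
print(loss.item())  # the smaller this value, the better the prediction
```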
LSTM
LSTM is the abbreviation for Long Short-Term Memory. An LSTM is a special form of a Recurrent Neural Network (RNN) that is used in deep learning to overcome the limitations of conventional RNNs, especially when learning long-term dependencies. Conventional RNNs struggle with the problems of vanishing and exploding gradients, which make it difficult for them to remember information many steps back in a sequence. LSTM networks overcome this problem by incorporating memory cells that can hold information in memory for long periods of time.
Markov Chain
A Markov Chain is named after the mathematician Andrei Markov. The so-called Markov Chain is a sequence of possible state changes in which the probability of each event depends only on the state that was attained in the previous event. Markov Chains are mathematical models used to represent systems that change between different states over time. The principle of the Markov Chain provides a powerful framework for analysing and predicting the behaviour of systems that evolve over time, especially when the future state depends only on the current state.
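A toy simulation of a Markov Chain (the states and probabilities are made up): the next state is drawn only from the transition probabilities of the current state.

```python
import random

# transition probabilities: the next state depends only on the current state
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

state = "sunny"
history = [state]
for _ in range(10):
    states, probs = zip(*transitions[state].items())
    state = random.choices(states, weights=probs)[0]
    history.append(state)
print(history)
```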
Negative Prompt
The Negative Prompt refers to certain attributes or characteristics that should not appear in an image generated by the AI. If one does not want the base model to generate AI images with such a specific feature, one can specify that in the Negative Prompt. This can be done in text form by use of punctuation and modifiers [1] as well as in the form of Embeddings.
NLP
Natural language processing (NLP) is a branch of computer science, information technology and artificial intelligence that deals with the interaction between computers and natural human languages, in particular with the question of how to program computers so that they can process and analyze large amounts of natural language data.
Outpainting
Outpainting is the extension of an image to the outside. This involves adding new content outside the original image area. This can be completely new content and motifs or simply the expansion of a picture to include a larger field of view. The format of an image can also be changed in this way.
Overfitting
Overfitting occurs when a LoRA model learns the training data too well. Underfitting is the opposite, where the model fails to capture the underlying concept. A sign of overfitting is when an image becomes oversaturated, is full of artifacts or is plainly distorted.
PEFT
PEFT is the abbreviation for Parameter-Efficient Fine-Tuning. In fine-tuning, a model that has already been trained for a specific task is modified so that it performs a similar task. As the name states, this is done in a parameter-efficient way. Parameter-efficient fine-tuning is thus a set of techniques and methods for fine-tuning a large model in the most computationally and time-efficient way possible, without sacrificing the performance that would be achieved with full fine-tuning.
Prompt
In the context of LLMs, a Prompt is the input text or the instruction set given to the model to elicit a desired response. Effective Prompt engineering is crucial for guiding LLMs to generate useful, relevant, and high-quality outputs.
Prompt Weight
The Prompt Weight allows you to emphasize or de-emphasize certain parts of a Prompt, giving one more control over the AI-generated image. The default weight has a value of 1.0. In order to reduce emphasis on a word or phrase, decrease the weight in steps of 0.1. In order to increase emphasis on a word or phrase, increase the weight in steps of 0.1. Weights are also used in the Prompt when dealing with Embeddings, Hypernetworks and LoRAs. The weight is in this case part of the related expression.
RNN
The recurrent neural network (RNN) is an artificial neural network (ANN) in which the connections between the nodes form a directed graph along a temporal sequence. The output of the previous step is fed into the current step as input, whereas in conventional neural networks all inputs and outputs are independent of each other. This allows the RNN to exhibit dynamic behavior over time. Unlike feedforward neural networks (FNNs), RNNs can use their memory to process sequences of inputs.
Sampler
In Stable Diffusion, a special technique known as sampling is used to create an image. This is done using the so-called sampler. First, a random image is generated in latent space. The noise predictor then estimates the noise of the image, which is then subtracted from the image. This iterative process is repeated several times to produce a refined, noise-free image. This denoising process is referred to as sampling, as Stable Diffusion generates a new sample image at each step. Samplers are also known as schedulers.
Seed
In AI technology, a seed is a fixed or a random large number. Think of it as the initial value for the random number generator used in AI. Together with the other settings, the seed makes each output unique; reusing the same seed and settings reproduces the same image.
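A minimal sketch of this idea in PyTorch: seeding the random number generator with the same value twice produces the same starting noise, which in Stable Diffusion corresponds to the same initial latent image.

```python
import torch

generator = torch.Generator().manual_seed(1234)   # fixed seed
noise_a = torch.randn(1, 4, 64, 64, generator=generator)

generator = torch.Generator().manual_seed(1234)   # same seed again
noise_b = torch.randn(1, 4, 64, 64, generator=generator)

print(torch.equal(noise_a, noise_b))   # True: same seed, same starting noise
```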
Textual Inversion
A Textual Inversion is also known as Embedding. Textual Inversion is a training technique for personalizing image generation base models with just a few sample images that you want the base model to learn. In this technique, the Text Embeddings are learned and updated to match the sample images you provide. The new Embeddings are tied to a specific word that you must use in the prompt.
→ see also Embedding
Token
A token is the basic unit a Prompt is broken into. A token is usually a word or part of a word (a subword), but it can also be a punctuation mark. A tokenizer converts a given Prompt into a sequence of tokens. A text encoder takes these tokens as input and outputs a list of numbers representing each token in the text as a vector, one vector per token.
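A small example using the Hugging Face transformers library and the CLIP tokenizer that Stable Diffusion 1.x uses; the prompt text is made up and the model weights are downloaded on first use:

```python
from transformers import CLIPTokenizer

# the CLIP tokenizer used by Stable Diffusion 1.x
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a watercolor painting of a lighthouse"
tokens = tokenizer.tokenize(prompt)   # subword tokens
ids = tokenizer.encode(prompt)        # token ids, including start/end tokens
print(tokens)
print(ids)
```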
Transformer
The so-called Transformer is a deep learning architecture developed by researchers at Google that is based on a multi-head attention mechanism. The Transformer addresses some of the limitations of recurrent neural architectures (RNNs) such as long short-term memory (LSTM) and as a result became widely used in natural language processing (NLP). Later variations have been widely adopted for training large language models (LLMs) on large language datasets. The Transformer converts text into a numerical representation called tokens, and each token is converted into a vector by looking it up in a word embedding table.
Upscaler
In principle, AI image generation requires a GPU with a very high amount of VRAM. If the image resolution to be generated is too high, image generation fails due to a lack of VRAM. An upscaler scales an image to the desired resolution in a highly sophisticated way. There are a lot of upscalers available which can be used in the image generation process.
Variational Autoencoder
The common abbreviation for the Variational Autoencoder is VAE. The Variational Autoencoder is something like a sophisticated resizing tool for images. It consists of two parts. The first part is the encoder, which compresses a standard image such as a 512x512 pixel image into a smaller and more compact form, the so-called latent image. The second part is the decoder, which reverses the previous process and enlarges the latent image back to its original size. The whole process is extremely efficient, as working with smaller image versions requires less memory and processing power, making it faster and more practical, especially for computers with limited resources.
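A minimal round-trip sketch using the diffusers library, assuming a 512x512 RGB image named portrait.png and the publicly available stabilityai/sd-vae-ft-mse weights:

```python
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image
from torchvision import transforms

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

image = load_image("portrait.png")                          # hypothetical 512x512 input
x = transforms.ToTensor()(image).unsqueeze(0) * 2.0 - 1.0   # scale pixels to [-1, 1]

with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()   # encoder: 512x512x3 -> 64x64x4 latent
    reconstruction = vae.decode(latents).sample    # decoder: back to 512x512x3
print(latents.shape, reconstruction.shape)
```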
VRAM
VRAM is the abbreviation for Video Random-Access Memory. This term is used when referring to a Graphics Processing Unit (GPU). GPUs are used on modern graphics cards and can be used for various types of mathematical calculations. Applications include AI image generation and crypto mining. The latter is better known in this context.
Weight
A weight refers to a parameter that a neural network learns during the training process. Within the neural network, each connection between nodes is assigned a weight that determines how strongly the information flowing through it is taken into account.
and more coming soon ...
Finally
Have a nice day. Have fun. Be inspired!
References
[1] https://aienthusiastic.com/stable-diffusion-prompt-grammar-syntax-weights/
[2] https://moxielearn.ai/blog/17-ai-terms-and-acronyms
[3] https://www.kindo.ai/blog/45-ai-terms-phrases-and-acronyms-to-know