I've received a few requests to document how I generate my videos, so I wanted to whip up a quick guide for folks.
This guide provides a script intended to run on a UNIX-like system such as Linux, macOS, or WSL2. It presumes that you have python3, pip, imagemagick, and ffmpeg installed.
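If you'd like to verify those prerequisites up front, a quick check along these lines works in any bash shell. This is just a sketch; the check_deps helper is mine, not part of the script below:

```shell
#! /usr/bin/env bash
# Hypothetical helper: report any executables missing from PATH.
check_deps() {
local _cmd _rc=0;
for _cmd in "$@"; do
if ! command -v "$_cmd" > /dev/null 2>&1; then
echo "missing: $_cmd" >&2;
_rc=1;
fi
done
return "$_rc";
}

# `convert' is the CLI that the imagemagick package provides.
check_deps python3 pip convert ffmpeg git || echo "install the tools above first" >&2;
```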
I have provided a wrapper script for CogVideoX here which improves the UX of the upstream CLI tool. It will take care of cloning CogVideoX and installing its python dependencies, so all you need to do is run the provided script. It also takes care of selecting the appropriate model for an image based on its resolution.
Generating Input Images
We will start by creating input images, but as we do this we'll want to pay close attention to their resolution.
If you'd like fast video generation, stick to either 720x480 or 480x720 images. These use CogVideoX 1.0 and, in my experience, take ~3 minutes to generate.
If you'd like higher resolution video you can use 768x1360, 1360x768, or 768x768 images, which use CogVideoX 1.5. Other resolutions also work as long as one edge is exactly 768 and the other is between 768 and 1360 and divisible by 16 ( e.g. 768x1344 or 1344x768 ). In my experience these take ~15 minutes to generate.
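To sanity-check a resolution against these rules before burning GPU time on it, the constraints can be encoded in a few lines of bash. A sketch; the cog_size_ok helper is hypothetical and not part of the script below:

```shell
#! /usr/bin/env bash
# Hypothetical helper: does WIDTHxHEIGHT match a supported CogVideoX size?
# Succeeds for 720x480/480x720 (1.0) and for valid 1.5 sizes.
cog_size_ok() {
local _w="$1" _h="$2" _min _max;
if [[ "$_w" -lt "$_h" ]]; then _min="$_w"; _max="$_h"; else _min="$_h"; _max="$_w"; fi
# CogVideoX 1.0: exactly 720x480 in either orientation.
if [[ "$_min" -eq 480 ]] && [[ "$_max" -eq 720 ]]; then
return 0;
fi
# CogVideoX 1.5: short edge exactly 768, long edge 768-1360 and divisible by 16.
[[ "$_min" -eq 768 ]] && [[ "$_max" -le 1360 ]] && [[ "$(( _max % 16 ))" -eq 0 ]];
}

cog_size_ok 768 1344 && echo "768x1344 ok";
```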
When selecting images try to avoid any that contain hands or feet since CogVideo struggles to animate these. Pay very close attention to details and err on the side of rejecting images - any deformed aspects of the image will only be exaggerated when they become animated.
img2video Prompts
In my experience, the simpler these prompts are, the better. For example, a great starting point for an image of 1girl is simply "A beautiful woman".
You can extend this with simple camera movements, like "A beautiful woman. The camera moves towards her.", but keep them short.
Mentioning details about the image in the prompt seems to cause those parts of the image to be animated; for example, "A beautiful woman with black hair." might cause her hair to flow.
Getting Good Results
Frankly, the real trick to good results is trial and error. Each video that I post generally took 2-8 iterations to get right. I usually invoke the script in a loop and leave it running overnight or while I'm away from my machine.
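That overnight loop can be sketched like this. It assumes img2video (the script below) is on your PATH; the generate_batch wrapper is mine, not part of the script:

```shell
#! /usr/bin/env bash
# Sketch: run img2video several times to collect candidate videos.
# Each run picks a fresh random seed, so output filenames won't collide.
generate_batch() {
local _runs="$1" _img="$2" _i;
shift 2;
for _i in $(seq 1 "$_runs"); do
# `|| true' keeps the batch going even if one generation fails.
img2video "$@" "$_img" || true;
done
}

if command -v img2video > /dev/null 2>&1; then
generate_batch 8 foo.png -p "A beautiful woman. The camera moves towards her.";
fi
```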
The Script
Save this script to a file named img2video, and make it executable with chmod +x img2video. When using the script you should almost always provide a prompt, e.g. img2video -p "Some prompt..." foo.png, since the default is just "A beautiful woman".
#! /usr/bin/env bash
# ============================================================================ #
# Generate a video from an image.
#
# USAGE: img2video [OPTIONS...] IMG-PATH
#
#
# ---------------------------------------------------------------------------- #
set -eu;
set -o pipefail;
# ---------------------------------------------------------------------------- #
_as_me="img2video";
_version="0.1.0";
_usage_msg="USAGE: $_as_me [OPTIONS...] IMG-PATH
Generate a video from an image.
";
_help_msg="$_usage_msg
OPTIONS
-o,--output FILE Output video file. ( default: based on input file )
-S,--steps N Number of inference steps. ( default: 20 )
-s,--seed N Random seed (positive integer). ( default: random )
-p,--prompt TEXT Prompt text. ( default: 'A beautiful woman' )
-h,--help Print help message to STDOUT.
-u,--usage Print usage message to STDOUT.
-v,--version Print version information to STDOUT.
ENVIRONMENT
GREP Command used as \`grep' executable.
REALPATH Command used as \`realpath' executable.
FFMPEG Command used as \`ffmpeg' executable.
CONVERT Command used as \`convert' executable.
PYTHON3 Command used as \`python3' executable.
PIP Command used as \`pip' executable.
MKTEMP Command used as \`mktemp' executable.
GIT Command used as \`git' executable.
";
# ---------------------------------------------------------------------------- #
usage() {
if [[ "${1:-}" = "-f" ]]; then
echo "$_help_msg";
else
echo "$_usage_msg";
fi
}
# ---------------------------------------------------------------------------- #
#@BEGIN_INJECT_UTILS@
: "${GREP:=grep}";
: "${REALPATH:=realpath}";
: "${FFMPEG:=ffmpeg}";
: "${CONVERT:=convert}";
: "${PYTHON3:=python3}";
: "${PIP:=pip}";
: "${MKTEMP:=mktemp}";
: "${GIT:=git}";
# ---------------------------------------------------------------------------- #
# Repo information
: "${XDG_CACHE_HOME:=$HOME/.cache}";
: "${REPO_DIR:=$XDG_CACHE_HOME/img2video/cogvideo}";
: "${COG_URL:=https://github.com/THUDM/CogVideo.git}";
: "${COG_REV:=2fdc59c3ce48aee1ba7572a1c241e5b3090abffa}";
: "${DIFFUSERS_URL:=https://github.com/huggingface/diffusers.git}";
: "${DIFFUSERS_REV:=b5fd6f13f5434d69d919cc8cedf0b11db664cf06}";
# ---------------------------------------------------------------------------- #
declare -a TMPFILES;
TMPFILES=();
cleanup() {
# Guard the expansion: under `set -u' an empty array errors on older bash.
if [[ "${#TMPFILES[@]}" -gt 0 ]]; then
rm -f "${TMPFILES[@]}";
fi
}
trap cleanup EXIT;
# ---------------------------------------------------------------------------- #
while [[ "$#" -gt 0 ]]; do
case "$1" in
# Split short options such as `-abc' -> `-a -b -c'
-[^-]?*)
_arg="$1";
declare -a _args;
_args=();
shift;
_i=1;
while [[ "$_i" -lt "${#_arg}" ]]; do
_args+=( "-${_arg:$_i:1}" );
_i="$(( _i + 1 ))";
done
set -- "${_args[@]}" "$@";
unset _arg _args _i;
continue;
;;
--*=*)
_arg="$1";
shift;
set -- "${_arg%%=*}" "${_arg#*=}" "$@";
unset _arg;
continue;
;;
-o|--output)
if [[ "$#" -lt 2 ]]; then
echo "$_as_me: option '$1' requires an argument" >&2;
exit 1;
fi
OUTFILE="$2";
shift;
;;
-S|--steps)
if [[ "$#" -lt 2 ]]; then
echo "$_as_me: option '$1' requires an argument" >&2;
exit 1;
fi
STEPS="$2";
shift;
;;
-s|--seed)
if [[ "$#" -lt 2 ]]; then
echo "$_as_me: option '$1' requires an argument" >&2;
exit 1;
fi
SEED="$2";
shift;
;;
-p|--prompt)
if [[ "$#" -lt 2 ]]; then
echo "$_as_me: option '$1' requires an argument" >&2;
exit 1;
fi
PROMPT="$2";
shift;
;;
-u|--usage) usage; exit 0; ;;
-h|--help) usage -f; exit 0; ;;
-v|--version) echo "$_version"; exit 0; ;;
--) shift; break; ;;
-?|--*)
echo "$_as_me: Unrecognized option: '$1'" >&2;
usage -f >&2;
exit 1;
;;
*)
if [[ -z "${IMG:-}" ]]; then
IMG="$1";
else
echo "$_as_me: Unexpected argument '$1'" >&2;
usage -f >&2;
exit 1;
fi
;;
esac
shift;
done
# ---------------------------------------------------------------------------- #
if [[ -z "${IMG:-}" ]]; then
echo "$_as_me: missing IMG-PATH argument" >&2;
usage >&2;
exit 1;
fi
# ---------------------------------------------------------------------------- #
# Set fallbacks
: "${SEED:=$RANDOM}";
: "${STEPS:=20}";
: "${PROMPT:=A beautiful woman}";
if [[ -z "${OUTFILE:-}" ]]; then
OUTFILE="${IMG%.png}_$SEED.mp4";
fi
# ---------------------------------------------------------------------------- #
if [[ -e "$OUTFILE" ]]; then
echo "$_as_me: output file '$OUTFILE' already exists" >&2;
exit 1;
fi
# ---------------------------------------------------------------------------- #
rotate_image() {
local _img _angle _tmpfile;
case "$1" in
-c|--cclock) _angle="-90"; shift; ;;
*) _angle="90"; ;;
esac
_img="$1";
_tmpfile="$( $MKTEMP; )";
TMPFILES+=( "$_tmpfile" );
$CONVERT "$_img" -rotate "$_angle" "$_tmpfile";
mv "$_tmpfile" "$_img";
}
# ---------------------------------------------------------------------------- #
rotate_video() {
local _vid _angle _tmpfile;
case "$1" in
-c|--cclock) _angle="cclock"; shift; ;;
*) _angle="clock"; ;;
esac
_vid="$1";
# `ffmpeg' infers the output container from the file extension, so give the
# temporary file a `.mp4' suffix; track both files for cleanup.
_tmpfile="$( $MKTEMP; )";
TMPFILES+=( "$_tmpfile" "$_tmpfile.mp4" );
$FFMPEG -i "$_vid" -vf "transpose=$_angle" "$_tmpfile.mp4";
mv "$_tmpfile.mp4" "$_vid";
}
# ---------------------------------------------------------------------------- #
get_image_size() {
local _img;
_img="$1";
# `null:' is ImageMagick's no-op output, so nothing is written to disk.
$CONVERT "$_img" -print "%w %h\n" null:;
}
get_image_width() {
local _img;
_img="$1";
$CONVERT "$_img" -print "%w\n" null:;
}
get_image_height() {
local _img;
_img="$1";
$CONVERT "$_img" -print "%h\n" null:;
}
# ---------------------------------------------------------------------------- #
max() {
local _a _b;
_a="$1";
_b="$2";
if [[ "$_a" -gt "$_b" ]]; then
echo "$_a";
else
echo "$_b";
fi
}
min() {
local _a _b;
_a="$1";
_b="$2";
if [[ "$_a" -lt "$_b" ]]; then
echo "$_a";
else
echo "$_b";
fi
}
# ---------------------------------------------------------------------------- #
pick_model() {
local _img _width _height _max _min;
_img="$1";
_width="$( get_image_width "$_img"; )";
_height="$( get_image_height "$_img"; )";
_max="$( max "$_width" "$_height"; )";
_min="$( min "$_width" "$_height"; )";
if [[ "$_max" -eq 720 ]] && [[ "$_min" -eq 480 ]]; then
echo "THUDM/CogVideoX-5b-I2V";
elif [[ "$_min" -eq 768 ]] && [[ "$_max" -le 1360 ]]; then
echo "THUDM/CogVideoX1.5-5b-I2V";
else
echo "$_as_me: unsupported image size $_width x $_height" >&2;
exit 1;
fi
}
# ---------------------------------------------------------------------------- #
needs_rotate() {
local _img _width _height;
_img="$1";
_width="$( get_image_width "$_img"; )";
_height="$( get_image_height "$_img"; )";
if [[ "$_width" -eq 480 ]] && [[ "$_height" -eq 720 ]]; then
return 0;
else
return 1;
fi
}
# ---------------------------------------------------------------------------- #
DID_ROTATE=0;
IMG_MROT="$IMG";
if needs_rotate "$IMG"; then
DID_ROTATE=1;
IMG_MROT="$( $MKTEMP; ).png";
# Track the bare `mktemp' file too so it gets cleaned up.
TMPFILES+=( "${IMG_MROT%.png}" "$IMG_MROT" );
cp "$IMG" "$IMG_MROT";
rotate_image "$IMG_MROT";
fi
# ---------------------------------------------------------------------------- #
MODEL="$( pick_model "$IMG"; )";
# ---------------------------------------------------------------------------- #
declare -a common_flags v1_flags v1_5_flags;
common_flags=(
'--model_path' "$MODEL"
'--image_or_video_path' "$IMG_MROT"
'--output_path' "$OUTFILE"
'--generate_type' 'i2v'
'--num_inference_steps' "$STEPS"
'--seed' "$SEED"
'--prompt' "$PROMPT"
);
v1_flags=(
'--num_frames' '49'
'--fps' '8'
'--width' '720'
'--height' '480'
);
v1_5_flags=(
'--num_frames' '81'
'--fps' '16'
'--width' "$( get_image_width "$IMG"; )"
'--height' "$( get_image_height "$IMG"; )"
);
# ---------------------------------------------------------------------------- #
declare -a flags;
flags=( "${common_flags[@]}" );
case "$MODEL" in
THUDM/CogVideoX-5b-I2V)
flags+=( "${v1_flags[@]}" );
;;
THUDM/CogVideoX1.5-5b-I2V)
flags+=( "${v1_5_flags[@]}" );
;;
*)
echo "$_as_me: unsupported model '$MODEL'" >&2;
exit 1;
;;
esac
# ---------------------------------------------------------------------------- #
# Setup repo if it doesn't exist
if [[ ! -d "$REPO_DIR" ]]; then
mkdir -p "${REPO_DIR%/*}";
$GIT clone "$COG_URL" "$REPO_DIR";
( cd "$REPO_DIR"; $GIT checkout "$COG_REV"; );
fi
# Setup diffusers if it doesn't exist
if [[ ! -d "$REPO_DIR/diffusers" ]]; then
$GIT clone "$DIFFUSERS_URL" "$REPO_DIR/diffusers";
( cd "$REPO_DIR/diffusers"; $GIT checkout "$DIFFUSERS_REV"; );
fi
# Setup Virtual Environment
if [[ ! -d "$REPO_DIR/.venv" ]]; then
$PYTHON3 -m venv "$REPO_DIR/.venv";
source "$REPO_DIR/.venv/bin/activate";
$PIP install -r "$REPO_DIR/requirements.txt";
$PIP uninstall -y diffusers;
$PIP install -e "$REPO_DIR/diffusers";
else
source "$REPO_DIR/.venv/bin/activate";
fi
# ---------------------------------------------------------------------------- #
echo "$_as_me: Generating '$OUTFILE' with $MODEL" >&2;
$PYTHON3 "$REPO_DIR/inference/cli_demo.py" "${flags[@]}";
# ---------------------------------------------------------------------------- #
if [[ "$DID_ROTATE" -eq 1 ]]; then
rotate_video --cclock "$OUTFILE";
fi
# ---------------------------------------------------------------------------- #
#
#
#
# ============================================================================ #
More to come, but this'll do for now.