# Open-Sora-Plan

**Repository Path**: hqsrawmelon/Open-Sora-Plan

## Basic Information

- **Project Name**: Open-Sora-Plan
- **Description**: fork of https://github.com/PKU-YuanGroup/Open-Sora-Plan
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-03-06
- **Last Updated**: 2024-10-14

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Open-Sora Plan

[[Project Page]](https://pku-yuangroup.github.io/Open-Sora-Plan/) [[Chinese Homepage]](https://pku-yuangroup.github.io/Open-Sora-Plan/blog_cn.html)

## Goal

This project aims to create a simple and scalable repository for reproducing [Sora](https://openai.com/sora) (from OpenAI, which we prefer to call "CloseAI") and to build knowledge about Video-VQVAE (VideoGPT) + DiT at scale. Our resources are limited, and we deeply hope the whole open-source community will contribute to this project. Pull requests are welcome!!!

This project hopes to reproduce Sora through the power of the open-source community and is jointly initiated by the Peking University-Rabbitpre AIGC Joint Lab. With our current limited resources we have only built the basic architecture and cannot yet run full training; we hope to gradually add modules and gather resources for training through the open-source community. The current version is still far from the goal and needs continuous improvement and rapid iteration. Pull requests are welcome!!!

Project stages:

- Primary
  1. Set up the codebase and train an unconditional model on a landscape dataset.
  2. Train models that boost resolution and duration.
- Extensions
  3. Conduct text2video experiments on the landscape dataset.
  4. Train the 1080p model on a video2text dataset.
  5. Control the model with more conditions.
## News

**[2024.03.05]** See our latest [todo](https://github.com/PKU-YuanGroup/Open-Sora-Plan?tab=readme-ov-file#todo); pull requests are welcome.

**[2024.03.04]** We have reorganized and modularized our code to make it easy to [contribute](https://github.com/PKU-YuanGroup/Open-Sora-Plan?tab=readme-ov-file#how-to-contribute-to-the-open-sora-plan-community) to the project; please see the [Repo structure](https://github.com/PKU-YuanGroup/Open-Sora-Plan?tab=readme-ov-file#repo-structure).

**[2024.03.03]** We opened some [discussions](https://github.com/PKU-YuanGroup/Open-Sora-Plan/discussions) and clarified several issues.

**[2024.03.01]** Training code is available now! Learn more on our [project page](https://pku-yuangroup.github.io/Open-Sora-Plan/). Please feel free to watch 👀 this repository for the latest updates.

## Todo

#### Setup the codebase and train an unconditional model on a landscape dataset

- [x] Set up the repo structure.
- [x] Add the Video-VQGAN model, borrowed from [VideoGPT](https://github.com/wilson1yan/VideoGPT).
- [x] Support training with variable aspect ratios, resolutions, and durations on [DiT](https://github.com/facebookresearch/DiT).
- [x] Support dynamic mask input, inspired by [FiT](https://github.com/whlzy/FiT).
- [x] Add class-conditioning on embeddings.
- [ ] Incorporate [Latte](https://github.com/Vchitect/Latte) as the main codebase.
- [ ] Add the VAE model, borrowed from [Stable Diffusion](https://github.com/CompVis/latent-diffusion).
- [ ] Support joint dynamic mask input with the VAE.
- [ ] Make the codebase ready for cluster training; add SLURM scripts.
- [ ] Add a sampling script.
- [ ] Incorporate [SiT](https://github.com/willisma/SiT).

#### Train models that boost resolution and duration

- [ ] Add [PI](https://arxiv.org/abs/2306.15595) to support out-of-domain sizes.
- [x] Add a frame interpolation model.

#### Conduct text2video experiments on the landscape dataset

- [ ] Finish data loading and pre-processing utils.
- [ ] Add CLIP and T5 support.
- [ ] Add a text2image training script.
- [ ] Add a prompt captioner.

#### Train the 1080p model on a video2text dataset

- [ ] Look for a suitable dataset; discussion and recommendations are welcome.
- [ ] Finish data loading and pre-processing utils.
- [ ] Support memory-friendly training.
- [ ] Add FlashAttention-2 from PyTorch.
- [ ] Add xformers.
- [ ] Add accelerate to automatically manage training, e.g. mixed-precision training.
- [ ] Add gradient checkpointing.
- [ ] Train using the DeepSpeed engine.

#### Control the model with more conditions

- [ ] Load pretrained weights from [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha).
- [ ] Incorporate [ControlNet](https://github.com/lllyasviel/ControlNet).

## Repo structure

```
├── README.md
├── docs
│   ├── Data.md                    -> Datasets description.
│   ├── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts                        -> All training scripts.
│   └── train.sh
├── sora
│   ├── dataset                    -> Dataset code to read videos
│   ├── models
│   │   ├── captioner
│   │   ├── super_resolution
│   ├── modules
│   │   ├── ae                     -> compress videos to latents
│   │   │   ├── vqvae
│   │   │   ├── vae
│   │   ├── diffusion              -> denoise latents
│   │   │   ├── dit
│   │   │   ├── unet
│   ├── utils.py
│   ├── train.py                   -> Training code
```

## Requirements and Installation

The recommended requirements are as follows.
* Python >= 3.8
* PyTorch >= 1.13.1
* CUDA Version >= 11.7
* Install required packages:

```
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
conda create -n opensora python=3.8 -y
conda activate opensora
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
cd src/sora/modules/ae/vqvae/videogpt/
pip install -e .
cd ..
```

## Usage

### Datasets

Refer to [Data.md](docs/Data.md).

### Video-VQVAE (VideoGPT)

#### Training

```
cd src/sora/modules/ae/vqvae/videogpt
```

Refer to the original [repo](https://github.com/wilson1yan/VideoGPT?tab=readme-ov-file#training-vq-vae). Use the `scripts/train_vqvae.py` script to train a Video-VQVAE. Execute `python scripts/train_vqvae.py -h` for information on all available training settings. A subset of the more relevant settings is listed below, along with default values.

##### VQ-VAE Specific Settings

* `--embedding_dim`: number of dimensions for codebook embeddings
* `--n_codes 2048`: number of codes in the codebook
* `--n_hiddens 240`: number of hidden features in the residual blocks
* `--n_res_layers 4`: number of residual blocks
* `--downsample 4 4 4`: T H W downsampling stride of the encoder

##### Training Settings

* `--gpus 2`: number of GPUs for distributed training
* `--sync_batchnorm`: use `SyncBatchNorm` instead of `BatchNorm3d` when using > 1 GPU
* `--gradient_clip_val 1`: gradient clipping threshold for training
* `--batch_size 16`: batch size per GPU
* `--num_workers 8`: number of workers for each DataLoader

##### Dataset Settings

* `--data_path`: path to an `hdf5` file or a folder containing `train` and `test` folders with subdirectories of videos
* `--resolution 128`: spatial resolution to train on
* `--sequence_length 16`: temporal resolution, or video clip length

A command composing these flags is sketched further below, after the VideoDiT training command.

#### Reconstructing

```
python rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1
```

```
python rec_video.py --video-path "assets/origin_video_1.mp4" --rec-path "rec_video_1.mp4" --resolution 196 --num-frames 600 --sample-rate 1
```

We present four reconstructed videos in this demonstration, arranged from left to right as follows:

| **3s 596x336** | **10s 256x256** | **18s 196x196** | **24s 168x96** |
| --- | --- | --- | --- |
| | | | |

### VideoDiT (DiT)

#### Training

```
sh scripts/train.sh
```
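The contents of `scripts/train.sh` are not reproduced in this README. As a rough orientation only, a distributed launcher of this kind often wraps the training entry point (`sora/train.py` in the repo structure above) with `torchrun`; the sketch below is an assumption, and every flag passed to `sora/train.py` is hypothetical rather than taken from the actual script.

```
# Hypothetical sketch of what a launcher like scripts/train.sh might contain.
# Only the path sora/train.py comes from the repo structure above; the torchrun
# invocation and all training flags are illustrative assumptions.
torchrun --nproc_per_node=2 \
  sora/train.py \
  --data-path ./datasets/landscape \
  --results-dir ./results
```

Check the actual `scripts/train.sh` in the repository for the supported arguments.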

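Returning to the Video-VQVAE stage, the flags documented in the training settings above can be composed into a single command. This is a minimal sketch using the stated defaults; the dataset path and the `--embedding_dim` value are illustrative placeholders, and `python scripts/train_vqvae.py -h` remains the authoritative reference.

```
# Minimal composed Video-VQVAE training command built only from the flags
# documented above. The data path and the --embedding_dim value are
# illustrative placeholders; the remaining values are the stated defaults.
python scripts/train_vqvae.py \
  --data_path ./datasets/landscape \
  --resolution 128 \
  --sequence_length 16 \
  --embedding_dim 256 \
  --n_codes 2048 \
  --n_hiddens 240 \
  --n_res_layers 4 \
  --downsample 4 4 4 \
  --gpus 2 \
  --sync_batchnorm \
  --gradient_clip_val 1 \
  --batch_size 16 \
  --num_workers 8
```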
#### Sampling

Coming soon.

## How to Contribute to the Open-Sora Plan Community

We greatly appreciate your contributions to the Open-Sora Plan open-source community and your help in making it even better than it is now! For more details, please refer to the [Contribution Guidelines](docs/Contribution_Guidelines.md).

## Acknowledgement

* [DiT](https://github.com/facebookresearch/DiT/tree/main): Scalable Diffusion Models with Transformers.
* [VideoGPT](https://github.com/wilson1yan/VideoGPT): Video Generation using VQ-VAE and Transformers.
* [FiT](https://github.com/whlzy/FiT): Flexible Vision Transformer for Diffusion Model.
* [Positional Interpolation](https://arxiv.org/abs/2306.15595): Extending Context Window of Large Language Models via Positional Interpolation.

## License

* The service is a research preview intended for non-commercial use only. See [LICENSE.txt](LICENSE.txt) for details.

## Contributors