BiGR is built upon the Llama backbone, incorporating masked-token prediction and a binary transcoder. BiGR is trained with a weighted binary cross-entropy (wBCE) loss for reconstructing masked tokens. For image generation, we design entropy-ordered sampling. For visual representation, we simply apply average pooling over features from the intermediate layers.
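For reference, below is a minimal sketch of what a weighted BCE over masked binary codes could look like. This is not the repository's implementation: the function name, tensor shapes, mask convention, and the positive-class weighting are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def weighted_bce_masked(logits, target_bits, mask, pos_weight=1.0):
    """Hypothetical sketch: weighted BCE over masked binary codes.

    logits:      (B, N, D) predicted logits for each binary code dimension
    target_bits: (B, N, D) ground-truth binary codes in {0, 1}
    mask:        (B, N)    True where a token was masked (loss applied there)
    pos_weight:  scalar weight on the positive (bit = 1) class
    """
    # Per-bit BCE with a weight on positive bits (the "weighted" part).
    loss = F.binary_cross_entropy_with_logits(
        logits, target_bits.float(),
        pos_weight=torch.tensor(pos_weight, device=logits.device),
        reduction="none",
    )                                    # (B, N, D)
    loss = loss.mean(dim=-1)             # average over code dimensions -> (B, N)
    # Only masked token positions contribute to the objective.
    return (loss * mask.float()).sum() / mask.float().sum().clamp(min=1)

# Toy usage with random tensors.
B, N, D = 2, 16, 24
logits = torch.randn(B, N, D)
bits = torch.randint(0, 2, (B, N, D))
mask = torch.rand(B, N) < 0.5
print(weighted_bce_masked(logits, bits, mask, pos_weight=2.0))
```

The sketch only averages the per-bit loss at masked positions; see the paper for the actual weighting scheme and training details.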
BiGR supports diverse zero-shot applications without requiring task-specific structural changes or parameter fine-tuning.
Try out BiGR yourself on Colab!
If you find this project useful for your research, please cite the following:
```bibtex
@misc{hao2024bigr,
  title={Bi{GR}: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities},
  author={Shaozhe Hao and Xuantong Liu and Xianbiao Qi and Shihao Zhao and Bojia Zi and Rong Xiao and Kai Han and Kwan-Yee~K. Wong},
  year={2024},
}
```