VSD2M: A Large-scale Vision-language Sticker Dataset for Multi-frame Animated Sticker Generation

Abstract

As a common form of communication in social media, stickers are widely loved by users in internet scenarios for their ability to convey emotions in a vivid, cute, and interesting way. People prefer to obtain an appropriate sticker through retrieval rather than creation, because creating a sticker is time-consuming and relies on rule-based creative tools with limited capabilities. Nowadays, advanced text-to-video algorithms have spawned numerous general video generation systems that allow users to customize high-quality, photo-realistic videos by providing only simple text prompts. However, creating customized animated stickers, which have lower frame rates and more abstract semantics than videos, is greatly hindered by difficulties in data acquisition and incomplete benchmarks. To facilitate the exploration of researchers in the animated sticker generation (ASG) field, we first construct the largest vision-language sticker dataset to date, named "VSD2M", at a two-million scale containing both static and animated stickers. Secondly, to enhance the performance of traditional video generation methods on the ASG task with its discrete characteristics, we propose a Spatio-Temporal Interaction (STI) layer that uses semantic interaction and detail preservation to alleviate the insufficient utilization of information. Moreover, we train baselines with several video generation methods (e.g., transformer-based and diffusion-based methods) on VSD2M and conduct a detailed analysis to establish systematic guidance for the ASG task. To the best of our knowledge, this is the first large-scale benchmark for multi-frame animated sticker generation, and we hope that this work can provide valuable inspiration for other scholars in intelligent creation.


VSD2M Dataset

Overview of data collection and processing, which can be divided into four stages: web crawling, data filtering, annotation, and dataset splitting. During the data annotation process, we use manually labeled data to fine-tune different models to obtain high-quality semi-automatic annotation results.
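To make the four-stage pipeline concrete, below is a minimal Python sketch of how such a crawl-filter-annotate-split flow might be organized. All function names, thresholds, and the toy data are illustrative assumptions, not the actual VSD2M implementation.

    # Illustrative four-stage pipeline: crawl -> filter -> annotate -> split.
    # Names, thresholds, and data are assumptions for demonstration only.
    import random
    from dataclasses import dataclass, field

    @dataclass
    class Sticker:
        url: str
        frames: int                 # 1 = static, >1 = animated
        caption: str = ""
        tags: list = field(default_factory=list)

    def crawl(sources):
        """Stage 1: collect raw sticker records from source sites (mocked here)."""
        return [Sticker(url=f"{s}/sticker_{i}.gif", frames=random.randint(1, 30))
                for s in sources for i in range(100)]

    def filter_stickers(items, min_frames=1, max_frames=60):
        """Stage 2: drop corrupted or out-of-range samples."""
        return [x for x in items if min_frames <= x.frames <= max_frames]

    def annotate(items, model=None):
        """Stage 3: semi-automatic captioning. A model fine-tuned on manually
        labeled data would be called here; we attach a placeholder caption."""
        for x in items:
            x.caption = model(x) if model else "a cartoon sticker"
        return items

    def split(items, val_ratio=0.05):
        """Stage 4: shuffle and split into train/val sets."""
        random.shuffle(items)
        n_val = int(len(items) * val_ratio)
        return items[n_val:], items[:n_val]

    if __name__ == "__main__":
        raw = crawl(["https://example.com"])
        train, val = split(annotate(filter_stickers(raw)))
        print(len(train), "train /", len(val), "val samples")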


Visual analysis of VSD2M. (a) Frequency count of the top 25 trigger words. (b) Statistics of frame numbers; note that only multi-frame animated stickers are counted. (c) Frequency count of the top 35 words in descriptions. (d) Statistics of caption length.


Static information comparison of different vision-language sticker datasets.

STI Layer for Animated Sticker Generation
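The exact formulation of the STI layer is not reproduced on this page, so the following is only a rough PyTorch-style sketch of what a spatio-temporal interaction block with semantic interaction and a detail-preserving residual path could look like. The module structure, gating, and shapes are assumptions for illustration, not the paper's implementation.

    # Rough sketch of a spatio-temporal interaction block with a detail-preserving
    # residual path. NOT the paper's STI layer; structure is assumed for illustration.
    import torch
    import torch.nn as nn

    class SpatioTemporalInteraction(nn.Module):
        def __init__(self, dim, heads=8):
            super().__init__()
            self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            # "detail preservation": a learnable gate on the skip connection
            self.gate = nn.Parameter(torch.zeros(1))

        def forward(self, x):
            # x: (batch, frames, tokens, dim) -- tokens are flattened spatial positions
            b, t, n, d = x.shape
            skip = x

            # semantic interaction across space, applied per frame
            xs = x.reshape(b * t, n, d)
            xs = xs + self.spatial_attn(self.norm1(xs), self.norm1(xs), self.norm1(xs))[0]

            # interaction across the (few, discrete) frames, applied per token
            xt = xs.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
            xt = xt + self.temporal_attn(self.norm2(xt), self.norm2(xt), self.norm2(xt))[0]

            out = xt.reshape(b, n, t, d).permute(0, 2, 1, 3)
            # gated residual keeps low-level details from the input
            return out + torch.tanh(self.gate) * skip

    x = torch.randn(2, 8, 64, 128)                 # 2 clips, 8 frames, 8x8 tokens, dim 128
    print(SpatioTemporalInteraction(128)(x).shape)  # torch.Size([2, 8, 64, 128])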


Comparison with other methods. Qualitative results of VideoLDM [1], VideoFactory [2], I2VGen-XL [3], and our method on the following prompts:
  • A cute rabbit setting off firecrackers
  • A little bear waving his hands up and down
  • A little man with a pair of rabbit ears sleeping in a quilt with stars in the sky
  • A cartoon little fox waving with a heart above his head



More results generated based on VSD2M

Challenges and Future Works

As a novel task that urgently needs to be explored, ASG plays an important role in user interaction and chat communities. However, compared with video generation in general domains, the ASG task presents some unique challenges:

  • Because stickers require manual creation and are subject to copyright restrictions, they can only be collected from a small number of websites. Such scarce samples make it difficult to train a robust generative model, so parameter-efficient fine-tuning (PEFT) methods need to be introduced to reduce the demand for samples (a minimal sketch is given after this list).
  • Stickers cover a wide range of scenes (e.g., cartoon and real scenes), and within a specific scene they contain actions, characters, etc. that are interesting yet rare in generic scenes. Moreover, stickers often contain a large amount of overlaid text to highlight the theme, and the distribution of this text is messy and difficult to model. In this scenario, the generative model may need to follow the scaling law and employ more parameters to fit such a distribution.
  • For animated stickers, the diversity and abstraction of the content make it difficult to divide actions at a fine-grained level, so the model lacks perception of motion during learning. In addition, stickers are hard to describe: a dog in one sticker may look very different from a dog in another, and text alone cannot fully distinguish them. How to control fine-grained subjects and motion in stickers is one of the urgent issues to be studied.
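As referenced in the first challenge, the sketch below shows one way a LoRA-style adapter could be attached to a frozen generator layer for parameter-efficient fine-tuning on scarce sticker data. The wrapped layer, rank, and scaling are illustrative assumptions rather than a prescribed recipe.

    # Minimal LoRA-style adapter for parameter-efficient fine-tuning on scarce
    # sticker data. The wrapped layer, rank, and scaling are illustrative only.
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank=4, alpha=8):
            super().__init__()
            self.base = base
            for p in self.base.parameters():      # freeze pretrained weights
                p.requires_grad = False
            self.down = nn.Linear(base.in_features, rank, bias=False)
            self.up = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.up.weight)        # start as an identity update
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scale * self.up(self.down(x))

    # Example: wrap one projection of a frozen pretrained generator
    pretrained = nn.Linear(512, 512)
    adapted = LoRALinear(pretrained, rank=4)
    trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
    print(f"trainable params: {trainable}")       # only the low-rank adapters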
Alongside the above challenges, there are also corresponding opportunities in the ASG field:

  • Collecting larger-scale data and using self-supervised or supervised methods to obtain pre-trained models for the ASG task is an important milestone. By learning common features of stickers, a pre-trained model can be quickly adapted to downstream tasks, greatly promoting the development of this field.
  • Cartoon stickers usually consist of simple lines and color blocks, with much less texture than natural scenes. Lines and color blocks have very different distributions in the frequency domain, so it may be necessary to decompose and reconstruct them separately (a toy decomposition sketch is shown after this list).
  • The creation of hand-made stickers is often based on a sketching-then-coloring process, and modeling the ASG task around this process is also a promising direction. Whether sticker generation can be decomposed into sketching and coloring stages, thereby reducing modeling difficulty and improving sample quality, is an interesting approach that needs to be explored.
  • As in natural scenes, fine-grained control over the generation process is an indispensable part of intelligent creation. Characterizing subjects and modeling actions will inevitably become one of the bottlenecks in generating high-quality stickers.
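For the frequency-domain point above, here is a toy example of splitting a sticker frame into a low-frequency "color block" component and a high-frequency "line" component. The Gaussian cutoff and the synthetic frame are arbitrary assumptions used only to illustrate the idea.

    # Toy frequency-domain decomposition of a sticker frame into a smooth
    # "color block" component and a high-frequency "line" component.
    # The Gaussian cutoff sigma is an arbitrary assumption.
    import numpy as np

    def decompose(frame: np.ndarray, sigma: float = 5.0):
        """frame: (H, W) grayscale array in [0, 1]."""
        h, w = frame.shape
        fy = np.fft.fftfreq(h)[:, None]
        fx = np.fft.fftfreq(w)[None, :]
        lowpass = np.exp(-(fx**2 + fy**2) * (2 * np.pi * sigma) ** 2 / 2)
        spectrum = np.fft.fft2(frame)
        blocks = np.real(np.fft.ifft2(spectrum * lowpass))   # smooth color blocks
        lines = frame - blocks                                # residual lines/edges
        return blocks, lines

    frame = np.zeros((64, 64))
    frame[16:48, 16:48] = 1.0            # a flat color block
    frame[31:33, :] = 0.0                # a thin line crossing it
    blocks, lines = decompose(frame)
    print(blocks.shape, lines.shape, float(lines.max()))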
BibTeX

    If you use our work in your research, please cite:

    
    @misc{anonymous2024vsd2m,
      title={VSD2M: A Large-scale Vision-language Sticker Dataset for Multi-frame Animated Sticker Generation},
      author={Anonymous},
      year={2024},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }