Abstract: Transferring visual language models (VLMs) from the image domain to the video domain has recently yielded great success on human action recognition tasks. However, standard recognition ...
Abstract: Human-centric instructional videos provide opportunities for users to learn real-world multistep tasks, such as cooking, applying makeup, and using professional tools. However, these lengthy videos ...