Chair of the CCF TF Algorithm and AI SIG; Chair of the Baidu AI Technical Committee
Topic: "Context Autoencoder for Scalable Self-Supervised Representation Pretraining"
Abstract: Self-supervised representation pretraining aims to learn an encoder from unlabeled images, such that the encoded representations take on semantics and benefit downstream tasks. In this talk, I present a novel masked image modeling approach, context autoencoder (CAE), for scalable self-supervised representation pretraining. The core ideas are that predictions are made in the latent representation space, from visible patches to masked patches, and that the encoder is used only for representation learning while representation learning is done only by the encoder. I also discuss why masked image modeling potentially outperforms contrastive pretraining (e.g., SimCLR, MoCo) and why contrastive learning performs on par with supervised pretraining on ImageNet. In addition, I show that linear probing and its extended version, attentive probing, are more suitable than fine-tuning on ImageNet for pretraining evaluation.
Bio: Jingdong Wang is a Chief Scientist for computer vision with Baidu. His team focuses on conducting product-driven and cutting-edge computer vision/deep learning/AI research and on developing practical computer vision applications. Before joining Baidu, he was a Senior Principal Researcher at Microsoft Research Asia. His areas of interest are computer vision, deep learning, and multimedia search. His representative works include deep high-resolution network (HRNet), discriminative regional feature integration (DRFI) for supervised saliency detection, and neighborhood graph search (NGS, SPTAG) for large-scale similarity search. He serves or has served as an Associate Editor of IEEE TPAMI, IJCV, IEEE TMM, and IEEE TCSVT, and as an area chair of leading conferences in vision, multimedia, and AI, such as CVPR, ICCV, ECCV, ACM MM, IJCAI, and AAAI. He was elected an ACM Distinguished Member, a Fellow of IAPR, and a Fellow of IEEE for his contributions to visual content understanding and retrieval.
Senior Researcher, Computer Vision Research Group, Microsoft Cloud & AI
Topic: "Florence: A New Foundation Model for Computer Vision"
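Abstract: Computer vision foundation models, which are trained on large-scale multimodal datasets and can be adapted to a wide range of downstream tasks by fine-tuning on a small amount of data, are critical for real-world computer vision applications. At the end of 2021, Microsoft released the Florence foundation model. Trained on large-scale image-text data from the Web, it can be easily adapted to various computer vision tasks, including classification, retrieval, object detection, visual question answering (VQA), image captioning, video retrieval, and action recognition. At release, the model achieved new state-of-the-art results on the majority of 44 representative benchmarks, for example a top-1 accuracy of 85.7 on zero-shot ImageNet-1K classification, a top-1 accuracy of 90.45 on ImageNet-1K after fine-tuning, 62.4 mAP on COCO fine-tuning, and 80.36 on VQA.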
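Bio: He is currently a Senior Researcher in the Computer Vision Research Group at Microsoft Cloud & AI. His main research interests are computer vision, large-scale data/language multimodal model training, object detection and segmentation, and human pose recognition. He has published more than 20 papers at top academic conferences such as CVPR, ECCV, ICCV, ICLR, and AAAI. Many of his research results have been open-sourced and applied in products such as Microsoft Azure.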
Senior Research Scientist at Google
Topic: "Label-Efficient Visual Perception via Multimodal Supervision and Distillation"
Abstract: In this talk, I will focus on two of our recent works (VATT and ViLD) towards building label-efficient computer vision models. In VATT, we learn multimodal representations from unlabeled raw video, audio, and text using a unified Transformer encoder. In ViLD, we distill from pre-trained vision-language models such as CLIP to enable strong open-vocabulary detection using an off-the-shelf Mask R-CNN.
Bio: Yin Cui is a Senior Research Scientist at Google. Yin's research focuses on multimodal and label-efficient visual perception. Before joining Google, he received a Ph.D. in Computer Science from Cornell University in 2019, advised by Professor Serge Belongie. Yin has also co-organized COCO Visual Recognition Workshops and Fine-Grained Visual Categorization Workshops at major computer vision conferences.
Chair Scientist of Computer Vision and Robotics at IDEA
Bio: Lei Zhang is currently a Chair Scientist of Computer Vision and Robotics at the International Digital Economy Academy (IDEA) and an Adjunct Professor at the Hong Kong University of Science and Technology (Guangzhou). Prior to this, he was a Principal Researcher and Research Manager at Microsoft, where he had worked since 2001 in Microsoft Research Asia (MSRA), Microsoft Research (MSR, Redmond), and other computer vision-related product teams. He has led research teams for years, conducting research on computer vision with applications in large-scale image analysis, object detection, and vision-language understanding. His research has had practical impact in Bing Multimedia Search and Microsoft Cognitive Services. He has published more than 150 papers in top conferences and journals and holds more than 60 US-granted patents. He was named an IEEE Fellow for his contributions to large-scale visual recognition and multimedia information retrieval.
Professor of Computer Science and Engineering, University of California San Diego
Bio: Zhuowen Tu is a full professor of Cognitive Science, also affiliated with the Department of Computer Science and Engineering, at the University of California San Diego. Before joining UCSD in 2013 as an assistant professor, he was a faculty member at UCLA. Between 2011 and 2013, he took a leave to work at Microsoft Research Asia. He received his Ph.D. from the Ohio State University and his M.E. from Tsinghua University. He is a recipient of the David Marr Prize in 2003 and of the David Marr Prize Honorable Mention in 2015. He is a Fellow of the IEEE.