Publications
2024
-
OmniVec: Learning Robust Representations with Cross Modal Sharing Siddharth Srivastava, and Gaurav Sharma In WACV 2024 [Abstract] [BibTex] [PDF]
The majority of research in learning based methods has been towards designing and training networks for specific tasks. However, many of the learning based tasks, across modalities, share commonalities and could potentially be tackled in a joint framework. We present an approach in this direction, to learn multiple tasks, in multiple modalities, with a unified architecture. The proposed network is composed of task specific encoders, a common trunk in the middle, followed by task specific prediction heads. We first pre-train it by self-supervised masked training, followed by sequential training for the different tasks. We train the network on all major modalities, e.g. visual, audio, text and 3D, and report results on 22 diverse and challenging public benchmarks. We demonstrate empirically that using a joint network to train across modalities leads to meaningful information sharing, and this allows us to achieve state-of-the-art results on most of the benchmarks. We also show generalization of the trained network on cross-modal tasks as well as unseen datasets and tasks.
@inproceedings{omnivec_wacv24, title = {OmniVec: Learning Robust Representations with Cross Modal Sharing}, author = {Srivastava, Siddharth and Sharma, Gaurav}, booktitle = {WACV}, year = {2024}, url = {https://arxiv.org/pdf/2311.05709.pdf} }
2023
-
ViSt3D: Video Stylization with 3D CNN Ayush Pande, and Gaurav Sharma In NeurIPS 2023 [Abstract] [BibTex] [Project page] [PDF]
Visual stylization has been a very popular research area in recent times. While image stylization has seen a rapid advancement in the recent past, video stylization, while being more challenging, is relatively less explored. The immediate method of stylizing videos by stylizing each frame independently has been tried with some success. To the best of our knowledge, we present the first approach to video stylization using 3D CNN directly, building upon insights from 2D image stylization. Stylizing video is highly challenging, as the appearance and video motion, which includes both camera and subject motions, are inherently entangled in the representations learnt by a 3D CNN. Hence, a naive extension of 2D CNN stylization methods to 3D CNN does not work. To perform stylization with 3D CNN, we propose to explicitly disentangle motion and appearance, stylize the appearance part, and then add back the motion component and decode the final stylized video. In addition, we propose a dataset, curated from existing datasets, to train video stylization networks. We also provide an independently collected test set to study the generalization of video stylization methods. We provide results on this test dataset comparing the proposed method with 2D stylization methods applied frame by frame. We show successful stylization with 3D CNN for the first time, and obtain better stylization in terms of texture cf. the existing 2D methods.
@inproceedings{vist3d_neurips23, title = {ViSt3D: Video Stylization with 3D CNN}, author = {Pande, Ayush and Sharma, Gaurav}, booktitle = {NeurIPS}, year = {2023}, project = {https://ayush202.github.io/projects/ViSt3D.html}, url = {https://openreview.net/pdf?id=2EiqizElGO} }
2022
-
Discriminative Semantic Transitive Consistency for Cross-Modal Learning Kranti Kumar Parida, and Gaurav Sharma CVIU 2022 [Abstract] [BibTex] [Project page] [PDF]
Cross-modal retrieval is generally performed by projecting and aligning the data from two different modalities onto a shared representation space. This shared space often also acts as a bridge for translating the modalities. We address the problem of learning such a representation space by proposing and exploiting the property of Discriminative Semantic Transitive Consistency – ensuring that the data points are correctly classified even after being transferred to the other modality. Along with semantic transitive consistency, we also enforce the traditional distance minimizing constraint, which makes the projections of corresponding data points from both modalities come closer in the representation space. We analyze and compare the contribution of both the loss terms and their interaction for the task. In addition, we incorporate semantic cycle-consistency for each of the modalities. We empirically demonstrate better performance owing to the different components with clear ablation studies. We also provide qualitative results to support the proposals.
@article{dstc_cviu22, title = {Discriminative Semantic Transitive Consistency for Cross-Modal Learning}, author = {Parida, Kranti Kumar and Sharma, Gaurav}, journal = {CVIU}, year = {2022}, url = {https://arxiv.org/abs/2103.14103}, project = {https://krantiparida.github.io/projects/dstc.html} }
-
Beyond Mono to Binaural: Generating Binaural Audio from Mono Audio with Depth and Cross Modal Attention Kranti Kumar Parida, Siddharth Srivastava, and Gaurav Sharma In WACV 2022 [Abstract] [BibTex] [Project page] [PDF]
Binaural audio gives the listener an immersive experience and can enhance augmented and virtual reality. However, recording binaural audio requires a specialized setup with a dummy human head having microphones in the left and right ears. Such a recording setup is difficult to build and set up; therefore, mono audio has become the preferred choice in common devices. To obtain the same impact as binaural audio, recent efforts have been directed towards lifting mono audio to binaural audio conditioned on the visual input from the scene. Such approaches have not used an important cue for the task: the distance of different sound producing objects from the microphones. In this work, we argue that the depth map of the scene can act as a proxy for inducing distance information of different objects in the scene, for the task of audio binauralization. We propose a novel encoder-decoder architecture with a hierarchical attention mechanism to encode image, depth and audio features jointly. We design the network on top of state-of-the-art transformer networks for image and depth representation. We show empirically that the proposed method comfortably outperforms state-of-the-art methods on two challenging public datasets, FAIR-Play and MUSIC-Stereo. We also demonstrate with qualitative results that the method is able to focus on the right information required for the task. The qualitative results are available at our project page: https://krantiparida.github.io/projects/bmonobinaural.html
@inproceedings{beyondmono_wacv22, title = { Beyond Mono to Binaural: Generating Binaural Audio from Mono Audio with Depth and Cross Modal Attention}, author = {Parida, Kranti Kumar and Srivastava, Siddharth and Sharma, Gaurav}, booktitle = {WACV}, year = {2022}, url = {https://openaccess.thecvf.com/content/WACV2022/papers/Parida_Beyond_Mono_to_Binaural_Generating_Binaural_Audio_From_Mono_Audio_WACV_2022_paper.pdf}, project = {https://krantiparida.github.io/projects/bmonobinaural.html} }
-
Self-Supervised Cross-Video Temporal Learning for Unsupervised Video Domain Adaptation Jinwoo Choi, Jia-Bin Huang, and Gaurav Sharma In ICPR 2022 [Abstract] [BibTex]
We address the task of unsupervised domain adaptation (UDA) for videos with self-supervised learning. While UDA for images is a widely studied problem, UDA for videos is relatively unexplored. In this paper, we propose a novel self-supervised loss for the task of video UDA. The method is motivated by inverted reasoning. Many works on video classification have shown success with representations based on events in videos, e.g., ‘reaching’, ‘picking’, and ‘drinking’ events for ‘drinking coffee’. We argue that if we have event-based representations, we should be able to predict the relative distances between clips in videos. Inverting that, we propose a self-supervised task to predict the difference of the distance between two clips from the source video and the distance between two clips from the target video. We hope that such a task would encourage learning event-based representations of the videos, which is known to be beneficial for classification. Since we predict the difference of clip distances between clips from source videos and target videos, we ‘tie’ the two domains and expect to achieve well-adapted representations. We combine this purely self-supervised loss and the source classification loss to learn the model parameters. We give extensive empirical results on challenging video UDA benchmarks, i.e., UCF-HMDB and EPIC-Kitchens. The presented qualitative and quantitative results support our motivations and method.
@inproceedings{cvtd_icpr22, title = {Self-Supervised Cross-Video Temporal Learning for Unsupervised Video Domain Adaptation}, author = {Choi, Jinwoo and Huang, Jia-Bin and Sharma, Gaurav}, booktitle = {ICPR}, year = {2022} }
2021
-
Self Attention Guided Depth Completion using RGB and Sparse LiDAR Point Clouds Siddharth Srivastava, and Gaurav Sharma In IROS 2021 [Abstract] [BibTex]
We address the problem of completing a per-pixel dense depth map using a single RGB image and the sparse point cloud of the scene. Depth prediction from an RGB image alone is a hard problem, and while dense point clouds obtained from LiDAR sensors can be used in addition to the RGB image, the cost of such sensors is a significant barrier. Having LiDAR sensors which capture sparse point clouds is a reasonable middle ground. We propose a novel architecture which incorporates geometric primitives and self attention mechanisms to improve the prediction. The motivation for self attention is to capture the correlations between scene and object elements, e.g. between the right and left windows of a car, early on in the network, while that for using geometric primitives is to have a high level clustering cue that enables the network to exploit similar correlations. In addition, we enforce complementarity in the predictions made with RGB and sparse LiDAR respectively; this forces the two corresponding branches to focus on hard areas which are not already well predicted by the other branch. With exhaustive experiments on the KITTI depth completion benchmark, NYU v2 and Matterport3D, we show that the proposed method provides state-of-the-art results.
@inproceedings{echodepth_cvpr21, title = {Self Attention Guided Depth Completion using {RGB} and Sparse {LiDAR} Point Clouds}, author = {Srivastava, Siddharth and Sharma, Gaurav}, booktitle = {IROS}, year = {2021} }
-
Beyond Image to Depth: Improving Depth Prediction using Echoes Kranti Kumar Parida, Siddharth Srivastava, and Gaurav Sharma In CVPR 2021 [Abstract] [BibTex] [arXiv] [Project page]
We address the problem of estimating depth with multi modal audio visual data. Inspired by the ability of animals, such as bats and dolphins, to infer the distance of objects with echolocation, some recent methods have utilized echoes for depth estimation. We propose an end-to-end deep learning based pipeline utilizing RGB images, binaural echoes and estimated material properties of various objects within a scene. We argue that the relation between image, echoes and depth, for different scene elements, is greatly influenced by the properties of those elements, and a method designed to leverage this information can lead to significantly improved depth estimation from audio visual inputs. We propose a novel multi modal fusion technique, which incorporates the material properties explicitly while combining audio (echoes) and visual modalities to predict the scene depth. We show empirically, with experiments on the Replica dataset, that the proposed method obtains a 28% improvement in RMSE compared to the state-of-the-art audio-visual depth prediction method. To demonstrate the effectiveness of our method on a larger dataset, we report competitive performance on Matterport3D, proposing to use it as a multimodal depth prediction benchmark with echoes for the first time. We also analyse the proposed method with exhaustive ablation experiments and qualitative results.
@inproceedings{echodepth_cvpr22, title = {Beyond Image to Depth: Improving Depth Prediction using Echoes}, author = {Parida, Kranti Kumar and Srivastava, Siddharth and Sharma, Gaurav}, booktitle = {CVPR}, project = {https://krantiparida.github.io/projects/bimgdepth.html}, arxiv = {2103.08468}, year = {2021} }
-
Exploiting Local Geometry for Feature and Graph Construction for Better 3D Point Cloud Processing with Graph Neural Networks Siddharth Srivastava, and Gaurav Sharma In ICRA 2021 [Abstract] [BibTex] [Project page]
We propose simple yet effective improvements in point representations and local neighborhood graph construction within the general framework of graph neural networks (GNNs) for 3D point cloud processing. As a first contribution, we propose to augment the vertex representations with important local geometric information of the points, followed by nonlinear projection using an MLP. As a second contribution, we propose to improve the graph construction for GNNs for 3D point clouds. The existing methods work with a kNN based approach for constructing the local neighborhood graph. We argue that this might lead to a reduction in coverage in case of dense sampling by sensors in some regions of the scene. The proposed method aims to counter such problems and improve coverage in such cases. As the traditional GNNs were designed to work with general graphs, where vertices may have no geometric interpretations, we see both our proposals as augmenting the general graphs to incorporate the geometric nature of 3D point clouds. While being simple, we demonstrate, on multiple challenging benchmarks based on relatively clean CAD models as well as on noisy real world scans, that the proposed method achieves state-of-the-art results on benchmarks for 3D classification (ModelNet40), part segmentation (ShapeNet) and semantic segmentation (Stanford 3D Indoor Scenes Dataset). We also show that the proposed network achieves faster training convergence, e.g. 40% fewer epochs for classification.
@inproceedings{gnn_icra21, title = {Exploiting Local Geometry for Feature and Graph Construction for Better 3D Point Cloud Processing with Graph Neural Networks}, author = {Srivastava, Siddharth and Sharma, Gaurav}, booktitle = {ICRA}, project = {https://siddharthsrivastava.github.io/publication/geomgcnn/}, year = {2021} }
2020
-
Shuffle and Attend: Video Domain Adaptation Jinwoo Choi, Gaurav Sharma, Samuel Schulter, and Jia-Bin Huang In ECCV 2020 [Abstract] [BibTex] [PDF]
We address the problem of domain adaptation in videos for the task of human action recognition. Inspired by image-based domain adaptation, we can perform video adaptation by aligning the features of frames or clips of source and target videos. However, equally aligning all clips is suboptimal as not all clips are informative for the task. As the first novelty, we propose an attention mechanism which focuses on more discriminative clips and directly optimizes for video-level (cf. clip-level) alignment. As the backgrounds are often very different between source and target, the source background-corrupted model adapts poorly to target domain videos. To alleviate this, as a second novelty, we propose to use the clip order prediction as an auxiliary task. The clip order prediction loss, when combined with domain adversarial loss, encourages learning of representations which focus on the humans and objects involved in the actions, rather than the uninformative and widely differing (between source and target) backgrounds. We empirically show that both components contribute positively towards adaptation performance. We report state-of-the-art performances on two out of three challenging public benchmarks, two based on the UCF and HMDB datasets, and one on Kinetics to NEC-Drone datasets. We also support the intuitions and the results with qualitative results.
@inproceedings{sava_eccv20, title = {Shuffle and Attend: Video Domain Adaptation}, author = {Choi, Jinwoo and Sharma, Gaurav and Schulter, Samuel and Huang, Jia-Bin}, booktitle = {ECCV}, url = {http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123570664.pdf}, year = {2020} }
-
Object Detection with a Unified Label Space from Multiple Datasets Xiangyun Zhao, Samuel Schulter, Gaurav Sharma, Yi-Hsuan Tsai, Manmohan Chandraker, and Ying Wu In ECCV 2020 [Abstract] [BibTex] [Project page] [PDF]
Given multiple datasets with different label spaces, the goal of this work is to train a single object detector predicting over the union of all the label spaces. The practical benefits of such an object detector are obvious and significant: application-relevant categories can be picked and merged from arbitrary existing datasets. However, naïve merging of datasets is not possible in this case, due to inconsistent object annotations. Consider an object category like faces that is annotated in one dataset, but is not annotated in another dataset, although the object itself appears in the latter’s images. Some categories, like face here, would thus be considered foreground in one dataset, but background in another. To address this challenge, we design a framework which works with such partial annotations, and we exploit a pseudo labeling approach that we adapt for our specific case. We propose loss functions that carefully integrate partial but correct annotations with complementary but noisy pseudo labels. Evaluation in the proposed novel setting requires full annotation on the test set. We collect the required annotations and define a new challenging experimental setup for this task based on existing public datasets. We show improved performances compared to competitive baselines and appropriate adaptations of existing work.
@inproceedings{uod_eccv20, title = {Object Detection with a Unified Label Space from Multiple Datasets}, author = {Zhao, Xiangyun and Schulter, Samuel and Sharma, Gaurav and Tsai, Yi-Hsuan and Chandraker, Manmohan and Wu, Ying}, booktitle = {ECCV}, url = {http://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123590171.pdf}, project = {http://www.nec-labs.com/~mas/UniDet/}, year = {2020} }
-
Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zeroshot Classification and Retrieval of Videos Kranti Kumar Parida, Neeraj Matiyali, Tanaya Guha, and Gaurav Sharma In WACV 2020 [Abstract] [BibTex] [arXiv]
We present an audio-visual multimodal approach for the task of zeroshot learning (ZSL) for classification and retrieval of videos. ZSL has been studied extensively in the recent past but has primarily been limited to visual modality and to images. We demonstrate that both audio and visual modalities are important for ZSL for videos. Since a dataset to study the task is currently not available, we also construct an appropriate multimodal dataset with 33 classes containing 156,416 videos, from an existing large scale audio event dataset. We empirically show that the performance improves by adding audio modality for both tasks of zeroshot classification and retrieval, when using multimodal extensions of embedding learning methods. We also propose a novel method to predict the ‘dominant’ modality using a jointly learned modality attention network. We learn the attention in a semi-supervised setting and thus do not require any additional explicit labelling for the modalities. We provide qualitative validation of the modality specific attention, which also successfully generalizes to unseen test classes.
@inproceedings{cjme_wacv20, title = {Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zeroshot Classification and Retrieval of Videos}, author = {Parida, Kranti Kumar and Matiyali, Neeraj and Guha, Tanaya and Sharma, Gaurav}, booktitle = {WACV}, arxiv = {1910.08732}, year = {2020} }
-
Video Person Re-Identification using Learned Clip Similarity Aggregation (Best Paper Award Finalist) Neeraj Matiyali, and Gaurav Sharma In WACV 2020 [Abstract] [BibTex] [arXiv]
We address the challenging task of video-based person re-identification. Recent works have shown that splitting the video sequences into clips and then aggregating clip based similarity is appropriate for the task. We show that using a learned clip similarity aggregation function allows filtering out hard clip pairs, e.g. where the person is not clearly visible, is in a challenging pose, or where the poses in the two clips are too different to be informative. This allows the method to focus on clip-pairs which are more informative for the task. We also introduce the use of 3D CNNs for video-based re-identification and show their effectiveness by performing equivalent to previous works, which use optical flow in addition to RGB, while using RGB inputs only. We give quantitative results on three challenging public benchmarks and show better or competitive performance. We also validate our method qualitatively.
@inproceedings{reid_wacv20, title = {Video Person Re-Identification using Learned Clip Similarity Aggregation}, author = {Matiyali, Neeraj and Sharma, Gaurav}, honor = {Best Paper Award Finalist}, booktitle = {WACV}, arxiv = {1910.08055}, year = {2020} }
-
Unsupervised and Semi-Supervised Domain Adaptation for Action Recognition from Drones Jinwoo Choi, Gaurav Sharma, Manmohan Chandraker, and Jia-Bin Huang In WACV 2020 [Abstract] [BibTex] [PDF]
We address the problem of human action classification in drone videos. Due to the high cost of capturing and labeling large-scale drone videos with diverse actions, we present unsupervised and semi-supervised domain adaptation approaches that leverage both the existing fully annotated action recognition datasets and unannotated (or only a few annotated) videos from drones. To study the emerging problem of drone-based action recognition, we create a new dataset containing 5,250 videos to evaluate the task. We tackle both problem settings with 1) same and 2) different action label sets for the source (e.g., Kinetics dataset) and target domains (drone videos). We present a combination of video and instance-based adaptation, paired with either a classifier or an embedding-based framework to transfer the knowledge from source to target. Our results show that the proposed adaptation approach substantially improves the performance on this challenging and practical task. We further demonstrate the applicability of our method for learning cross-view action recognition on the Charades-Ego dataset. We provide qualitative analysis to understand the behaviors of our approaches.
@inproceedings{drone_wacv20, title = {Unsupervised and Semi-Supervised Domain Adaptation for Action Recognition from Drones}, author = {Choi, Jinwoo and Sharma, Gaurav and Chandraker, Manmohan and Huang, Jia-Bin}, booktitle = {WACV}, url = {http://openaccess.thecvf.com/content_WACV_2020/papers/Choi_Unsupervised_and_Semi-Supervised_Domain_Adaptation_for_Action_Recognition_from_Drones_WACV_2020_paper.pdf}, year = {2020} }
2019
-
Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles Siddharth Srivastava, Frederic Jurie, and Gaurav Sharma In IROS 2019 [Abstract] [BibTex] [arXiv]
We address the problem of 3D object detection from 2D monocular images in autonomous driving scenarios. We propose to lift the 2D images to 3D representations using learned neural networks and leverage existing networks working directly on 3D to perform 3D object detection and localization. We show that, with a carefully designed training mechanism and automatically selected minimally noisy data, such a method is not only feasible, but gives better results than many methods working on actual 3D inputs acquired from physical sensors. On the challenging KITTI benchmark, we show that our 2D to 3D lifted method outperforms many recent competitive 3D networks while significantly outperforming previous state of the art for 3D detection from monocular images. We also show that a late fusion of the output of the network trained on generated 3D images, with that trained on real 3D images, improves performance. We find the results very interesting and argue that such a method could serve as a highly reliable backup in case of malfunction of expensive 3D sensors, if not potentially making them redundant, at least in the case of low human injury risk autonomous navigation scenarios like warehouse automation.
@inproceedings{2d3dobjdet2019, title = {Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles}, author = {Srivastava, Siddharth and Jurie, Frederic and Sharma, Gaurav}, booktitle = {IROS}, arxiv = {1904.08494}, year = {2019} }
2018
-
Zero-Shot Object Detection Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran In ECCV 2018 [Abstract] [BibTex] [arXiv]
We introduce and tackle the problem of zero-shot object detection (ZSD), which aims to detect object classes which are not observed during training. We work with a challenging set of object classes, not restricting ourselves to similar and/or fine-grained categories cf. prior works on zero-shot classification. We follow a principled approach by first adapting visual-semantic embeddings for ZSD. We then discuss the problems associated with selecting a background class and motivate two background-aware approaches for learning robust detectors. One of these models uses a fixed background class and the other is based on iterative latent assignments. We also outline the challenge associated with using a limited number of training classes and propose a solution based on dense sampling of the semantic label space using auxiliary data with a large number of categories. We propose novel splits of two standard detection datasets - MSCOCO and VisualGenome and discuss extensive empirical results to highlight the benefits of the proposed methods. We provide useful insights into the algorithm and conclude by posing some open questions to encourage further research.
@inproceedings{zsdetn2018, title = {Zero-Shot Object Detection}, author = {Bansal, Ankan and Sikka, Karan and Sharma, Gaurav and Chellappa, Rama and Divakaran, Ajay}, booktitle = {ECCV}, arxiv = {1804.04340}, year = {2018} }
-
Discriminatively Trained Latent Ordinal Model for Video Classification Karan Sikka, and Gaurav Sharma TPAMI 2018 [Abstract] [BibTex] [arXiv] [Project page]
We study the problem of video classification for facial analysis and human action recognition. We propose a novel weakly supervised learning method that models the video as a sequence of automatically mined, discriminative sub-events (e.g. onset and offset phase for smile, running and jumping for high-jump). The proposed model is inspired by the recent works on Multiple Instance Learning and latent SVM/HCRF – it extends such frameworks to model the ordinal aspect in the videos, approximately. We obtain consistent improvements over relevant competitive baselines on four challenging and publicly available video based facial analysis datasets for prediction of expression, clinical pain and intent in dyadic conversations and on three challenging human action datasets. We also validate the method with qualitative results and show that they largely support the intuitions behind the method.
@article{sikka2016lomo, title = {Discriminatively Trained Latent Ordinal Model for Video Classification}, author = {Sikka, Karan and Sharma, Gaurav}, journal = {TPAMI}, volume = {40}, number = {8}, pages = {1829--1844}, project = {http://ksikka.com/lomo_page.html}, arxiv = {1608.02318}, year = {2018} }
-
Unsupervised Learning of Face Representations (Best Paper Award) Samyak Datta, Gaurav Sharma, and C.V. Jawahar In IEEE Conference on Automatic Face and Gesture Recognition (FG) 2018 [Abstract] [BibTex] [arXiv]
We present an approach for unsupervised training of CNNs in order to learn discriminative face representations. We mine supervised training data by noting that multiple faces in the same video frame must belong to different persons and the same face tracked across multiple frames must belong to the same person. We obtain millions of face pairs from hundreds of videos without using any manual supervision. Although faces extracted from videos have a lower spatial resolution than those which are available as part of standard supervised face datasets such as LFW and CASIA-WebFace, the former represent a much more realistic setting, e.g. in surveillance scenarios where most of the faces detected are very small. We train our CNNs with the relatively low resolution faces extracted from the collected video frames, and achieve a higher verification accuracy on the benchmark LFW dataset cf. hand-crafted features such as LBPs, and even surpass the performance of state-of-the-art deep networks such as VGG-Face when they are made to work with low resolution input images.
@inproceedings{unsupfaces2018, title = {Unsupervised Learning of Face Representations}, honor = {Best Paper Award}, author = {Datta, Samyak and Sharma, Gaurav and Jawahar, C.V.}, booktitle = {IEEE Conference on Automatic Face and Gesture Recognition (FG)}, arxiv = {1803.01260}, year = {2018} }
-
Large Scale Novel Object Discovery in 3D Siddharth Srivastava, Gaurav Sharma, and Brejesh Lall In IEEE Winter Conference on Applications of Computer Vision (WACV) 2018 [Abstract] [BibTex] [arXiv]
We present a method for discovering objects in 3D point clouds from sensors like Microsoft Kinect. We utilize supervoxels generated directly from the point cloud data and design a Siamese network building on a recently proposed 3D convolutional neural network architecture. At training, we assume the availability of some known objects; these are used to train a non-linear embedding of supervoxels using the Siamese network, by optimizing the criterion that supervoxels which fall on the same object should be closer than those which fall on different objects, in the embedding space. We do not assume the objects during test to be known, and perform clustering, in the learned embedding space, of supervoxels to effectively perform novel object discovery. We validate the method with quantitative results showing that it can discover numerous unseen objects while being trained on only a few dense 3D models. We also show convincing qualitative results of object discovery in point cloud data when the test objects, either specific instances or even their categories, were never seen during training.
@inproceedings{objDisc2017, title = {Large Scale Novel Object Discovery in 3D}, author = {Srivastava, Siddharth and Sharma, Gaurav and Lall, Brejesh}, booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)}, arxiv = {1701.07046}, year = {2018} }
-
Multichannel Attention Network for Analyzing Visual Behavior in Public Speaking Rahul Sharma, Tanaya Guha, and Gaurav Sharma In IEEE Winter Conference on Applications of Computer Vision (WACV) 2018 [Abstract] [BibTex] [arXiv]
Public speaking is an important aspect of human communication and interaction. The majority of computational work on public speaking concentrates on analyzing the spoken content, and the verbal behavior of the speakers. While the success of public speaking largely depends on the content of the talk, and the verbal behavior, non-verbal (visual) cues, such as gestures and physical appearance also play a significant role. This paper investigates the importance of visual cues by estimating their contribution towards predicting the popularity of a public lecture. For this purpose, we constructed a large database of more than 1800 TED talk videos. As a measure of popularity of the TED talks, we leverage the corresponding (online) viewers’ ratings from YouTube. Visual cues related to facial and physical appearance, facial expressions, and pose variations are extracted from the video frames using convolutional neural network (CNN) models. Thereafter, an attention-based long short-term memory (LSTM) network is proposed to predict the video popularity from the sequence of visual features. The proposed network achieves state-of-the-art prediction accuracy indicating that visual cues alone contain highly predictive information about the popularity of a talk. Furthermore, our network learns a human-like attention mechanism, which is particularly useful for interpretability, i.e. how attention varies with time, and across different visual cues by indicating their relative importance.
@inproceedings{speakerPerf2018, title = {Multichannel Attention Network for Analyzing Visual Behavior in Public Speaking }, author = {Sharma, Rahul and Guha, Tanaya and Sharma, Gaurav}, booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)}, arxiv = {1707.06830}, year = {2018} }
2017
-
AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos Amlan Kar, Nishant Rai, Karan Sikka, and Gaurav Sharma In CVPR 2017 [Abstract] [BibTex] [arXiv] [Project page]
We propose a novel method for temporally pooling frames in a video for the task of human action recognition. The method is motivated by the observation that there are only a small number of frames which, together, contain sufficient information to discriminate an action class present in a video, from the rest. The proposed method learns to pool such discriminative and informative frames, while discarding a majority of the non-informative frames in a single temporal scan of the video. Our algorithm does so by continuously predicting the discriminative importance of each video frame and subsequently pooling them in a deep learning framework. We show the effectiveness of our proposed pooling method on standard benchmarks where it consistently improves on baseline pooling methods, with both RGB and optical flow based Convolutional networks. Further, in combination with complementary video representations, we show results that are competitive with respect to the state-of-the-art results on two challenging and publicly available benchmark datasets.
@inproceedings{adascan2017, title = {AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos}, author = {Kar, Amlan and Rai, Nishant and Sikka, Karan and Sharma, Gaurav}, booktitle = {CVPR}, arxiv = {1611.08240}, project = {https://amlankar.github.io/publication/adascan/}, year = {2017} }
-
An Empirical Evaluation of Visual Question Answering for Novel Objects Santhosh K. Ramakrishnan, Ambar Pal, Gaurav Sharma, and Anurag Mittal In CVPR 2017 [Abstract] [BibTex] [arXiv]
We study the problem of answering questions about images in the harder setting where the test questions and corresponding images contain novel objects, which were not queried about in the training data. Such a setting is inevitable in the real world: owing to the heavy tailed distribution of the visual categories, there would be some objects which would not be annotated in the train set. We show that the performance of two popular existing methods drops significantly (up to 28%) when evaluated on novel objects cf. known objects. We propose methods which use large existing external corpora of (i) unlabeled text, i.e. books, and (ii) images tagged with classes, to achieve novel object based visual question answering. We do systematic empirical studies, for both an oracle case where the novel objects are known textually, as well as a fully automatic case without any explicit knowledge of the novel objects, but with the minimal assumption that the novel objects are semantically related to the existing objects in training. The proposed methods for novel object based visual question answering are modular and can potentially be used with many visual question answering architectures. We show consistent improvements with the two popular architectures and give qualitative analysis of the cases where the model does well and of those where it fails to bring improvements.
@inproceedings{novel2017, title = {An Empirical Evaluation of Visual Question Answering for Novel Objects}, author = {Ramakrishnan, Santhosh K. and Pal, Ambar and Sharma, Gaurav and Mittal, Anurag}, booktitle = {CVPR}, arxiv = {1704.02516}, year = {2017} }
-
Expanded Parts Model for Semantic Description of Humans in Still Images Gaurav Sharma, Frederic Jurie, and Cordelia Schmid TPAMI 2017 [Abstract] [BibTex] [arXiv]
We introduce an Expanded Parts Model (EPM) for recognizing human attributes (e.g. young, short hair, wearing suit) and actions (e.g. running, jumping) in still images. An EPM is a collection of part templates which are learnt discriminatively to explain specific scale-space regions in the images (in human centric coordinates). This is in contrast to current models which consist of relatively few (i.e. a mixture of) ‘average’ templates. EPM uses only a subset of the parts to score an image and scores the image sparsely in space, i.e. it ignores redundant and random background in an image. To learn our model, we propose an algorithm which automatically mines parts and learns corresponding discriminative templates together with their respective locations from a large number of candidate parts. We validate our method on three recent challenging datasets of human attributes and actions. We obtain convincing qualitative and state-of-the-art quantitative results on the three datasets.
@article{sharma2017epm, title = {Expanded Parts Model for Semantic Description of Humans in Still Images}, author = {Sharma, Gaurav and Jurie, Frederic and Schmid, Cordelia}, journal = {TPAMI}, volume = {39}, number = {1}, pages = {87--101}, arxiv = {1509.04186}, year = {2017} }
-
Fast Localization of Autonomous Vehicles using Discriminative Metric Learning Ankit Pensia, Gaurav Sharma, James McBride, and Gaurav Pandey In Conference on Computer and Robot Vision (CRV) 2017 [Abstract] [BibTex]
In this paper, we report a novel algorithm for localization of autonomous vehicles in an urban environment using an orthographic ground reflectivity map created with a three-dimensional (3D) laser scanner. It should be noted that the road paint (lane markings, zebra crossings, traffic signs, etc.) constitutes the distinctive features in the surface reflectivity map, which are generally sparse as compared to the non-interesting asphalt and the off-road portion of the map. Therefore, we propose to project the reflectivity map to a lower dimensional space that captures the useful features of the map, and then use these projected feature maps for localization. We use a discriminative metric learning technique to obtain this lower dimensional space of feature maps. Experimental evaluation of the proposed method on real data shows that it is better than standard image matching techniques in terms of accuracy. Moreover, the proposed method is computationally fast and can be executed in real time (10 Hz) on a standard CPU.
@inproceedings{loc-dml-CRV2017, title = {Fast Localization of Autonomous Vehicles using Discriminative Metric Learning}, author = {Pensia, Ankit and Sharma, Gaurav and McBride, James and Pandey, Gaurav}, booktitle = {Conference on Computer and Robot Vision (CRV)}, year = {2017} }
2016
-
Deep Fusion of Visual Signatures for Client-Server Facial Analysis (Best Paper Award Runners Up) Binod Bhattarai, Gaurav Sharma, and Frederic Jurie In Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP) 2016 [Abstract] [BibTex] [arXiv]
Facial analysis is a key technology for enabling human-machine interaction. In this context, we present a client-server framework, where a client transmits the signature of a face to be analyzed to the server, and, in return, the server sends back various information describing the face, e.g. is the person male or female, is she/he bald, does he have a mustache, etc. We assume that a client can compute one (or a combination) of visual features, from very simple and efficient ones, like Local Binary Patterns, to more complex and computationally heavy ones, like Fisher Vectors and CNN based features, depending on the computing resources available. The challenge addressed in this paper is to design a common universal representation such that a single merged signature is transmitted to the server, whatever the type and number of features computed by the client, while nonetheless ensuring optimal performance. Our solution is based on learning a common optimal subspace for aligning the different face features and merging them into a universal signature. We have validated the proposed method on the challenging CelebA dataset, on which our method outperforms existing state-of-the-art methods when a rich representation is available at test time, while giving competitive performance when only simple signatures (like LBP) are available at test time due to resource constraints on the client.
@inproceedings{deepfuse2016, title = {Deep Fusion of Visual Signatures for Client-Server Facial Analysis}, honor = {Best Paper Award Runners Up}, author = {Bhattarai, Binod and Sharma, Gaurav and Jurie, Frederic}, booktitle = {Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP)}, arxiv = {1611.00142}, year = {2016} }
-
CP-mtML: Coupled Projection multi-task Metric Learning for Large Scale Face Retrieval Binod Bhattarai, Gaurav Sharma, and Frederic Jurie In CVPR 2016 [Abstract] [BibTex] [arXiv]
We propose a novel Coupled Projection multi-task Metric Learning (CP-mtML) method for large scale face retrieval. In contrast to previous works which were limited to low dimensional features and small datasets, the proposed method scales to large datasets with high dimensional face descriptors. It utilises pairwise (dis-)similarity constraints as supervision and hence does not require exhaustive class annotation for every training image. While, traditionally, multi-task learning methods have been validated on the same dataset but different tasks, we work in the more challenging setting with heterogeneous datasets and different tasks. We show empirical validation on multiple face image datasets of different facial traits, e.g. identity, age and expression. We use classic Local Binary Pattern (LBP) descriptors along with the recent Deep Convolutional Neural Network (CNN) features. The experiments clearly demonstrate the scalability and improved performance of the proposed method on the tasks of identity and age based face image retrieval compared to competitive existing methods, on the standard datasets and with the presence of a million distractor face images.
@inproceedings{mtml_cvpr_2016, title = {{CP-mtML}: {C}oupled Projection multi-task Metric Learning for Large Scale Face Retrieval }, author = {Bhattarai, Binod and Sharma, Gaurav and Jurie, Frederic}, booktitle = {CVPR}, arxiv = {1604.02975}, year = {2016} }
-
LOMo: Latent Ordinal Model for Facial Analysis in Videos (Spotlight presentation) Karan Sikka, Gaurav Sharma, and Marian Bartlett In CVPR 2016 [Abstract] [BibTex] [arXiv]
We study the problem of facial analysis in videos. We propose a novel weakly supervised learning method that models the video event (expression, pain, etc.) as a sequence of automatically mined, discriminative sub-events (e.g. onset and offset phase for smile, brow lower and cheek raise for pain). The proposed model is inspired by the recent works on Multiple Instance Learning and latent SVM/HCRF; it extends such frameworks to model the ordinal or temporal aspect in the videos, approximately. We obtain consistent improvements over relevant competitive baselines on four challenging and publicly available video based facial analysis datasets for prediction of expression, clinical pain and intent in dyadic conversations. In combination with complementary features, we report state-of-the-art results on these datasets.
@inproceedings{lomo_cvpr_2016, title = {{LOMo}: Latent Ordinal Model for Facial Analysis in Videos}, honor = {Spotlight presentation}, author = {Sikka, Karan and Sharma, Gaurav and Bartlett, Marian}, booktitle = {CVPR}, arxiv = {1604.01500}, year = {2016} }
-
Latent Embeddings for Zero-shot Classification (Spotlight presentation) Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele In CVPR 2016 [Abstract] [BibTex] [arXiv]
We present a novel latent embedding model for learning a compatibility function between image and class embeddings, in the context of zero-shot classification. The proposed method augments the state-of-the-art bilinear compatibility model by incorporating latent variables. Instead of learning a single bilinear map, it learns a collection of maps, with the selection of which map to use being a latent variable for the current image-class pair. We train the model with a ranking based objective function which penalizes incorrect rankings of the true class for a given image. We empirically demonstrate that our model improves the state-of-the-art for various class embeddings consistently on three challenging publicly available datasets for the zero-shot setting. Moreover, our method leads to visually highly interpretable results with clear clusters of different fine-grained object properties that correspond to different latent variable maps.
@inproceedings{latem_cvpr_2016, title = {Latent Embeddings for Zero-shot Classification}, honor = {Spotlight presentation}, author = {Xian, Yongqin and Akata, Zeynep and Sharma, Gaurav and Nguyen, Quynh and Hein, Matthias and Schiele, Bernt}, booktitle = {CVPR}, arxiv = {1603.08895}, year = {2016} }
-
Local Higher-Order Statistics (LHS) describing images with statistics of local non-binarized pixel patterns Gaurav Sharma, and Frederic Jurie Computer Vision and Image Understanding (CVIU) 2016 [Abstract] [BibTex] [arXiv]
We propose a new image representation for texture categorization and facial analysis, relying on the use of higher-order local differential statistics as features. It has been recently shown that small local pixel pattern distributions can be highly discriminative while being extremely efficient to compute, which is in contrast to the models based on the global structure of images. Motivated by such works, we propose to use higher-order statistics of local non-binarized pixel patterns for the image description. The proposed model does not require either (i) user specified quantization of the space (of pixel patterns) or (ii) any heuristics for discarding low occupancy volumes of the space. We propose to use a data driven soft quantization of the space, with parametric mixture models, combined with higher-order statistics, based on Fisher scores. We demonstrate that this leads to a more expressive representation which, when combined with discriminatively learned classifiers and metrics, achieves state-of-the-art performance on challenging texture and facial analysis datasets, in a low complexity setup. Further, it is complementary to higher complexity features and when combined with them improves performance.
@article{lhs_cviu_2016, title = {Local Higher-Order Statistics ({LHS}) describing images with statistics of local non-binarized pixel patterns}, author = {Sharma, Gaurav and Jurie, Frederic}, journal = {Computer Vision and Image Understanding (CVIU)}, volume = {142}, pages = {13--22}, arxiv = {1510.00542}, year = {2016} }
-
A Joint Learning Approach for Cross Domain Age Estimation (Best Student Paper Award of Image, Video and Multidimensional Signal Processing) Binod Bhattarai, Gaurav Sharma, Alexis Lechervy, and Frederic Jurie In ICASSP 2016 [Abstract] [BibTex]
We propose a novel joint learning method for cross domain age estimation, a domain adaptation problem. The proposed method learns a low dimensional projection along with a regressor, in the projection space, in a joint framework. The projection aligns the features from two different domains, i.e. source and target, to the same space, while the regressor predicts the age from the domain aligned features. After this alignment, a regressor trained with only a few examples from the target domain, along with more examples from the source domain, can predict very well the ages of the target domain face images. We provide empirical validation on the largest publicly available dataset for age estimation, i.e. MORPH-II. The proposed method improves performance over several strong baselines and the current state-of-the-art methods.
@inproceedings{jl_icassp_2016, title = {A Joint Learning Approach for Cross Domain Age Estimation}, honor = {Best Student Paper Award of Image, Video and Multidimensional Signal Processing}, author = {Bhattarai, Binod and Sharma, Gaurav and Lechervy, Alexis and Jurie, Frederic}, year = {2016}, booktitle = {ICASSP} }
2015
-
Scalable Nonlinear Embeddings for Semantic Category-based Image Retrieval Gaurav Sharma, and Bernt Schiele In ICCV 2015 [Abstract] [BibTex] [PDF]
We propose a novel algorithm for the task of supervised discriminative distance learning by nonlinearly embedding vectors into a low dimensional Euclidean space. We work in the challenging setting where supervision is with constraints on similar and dissimilar pairs while training. The proposed method is derived by an approximate kernelization of a linear Mahalanobis-like distance metric learning algorithm and can also be seen as a kernel neural network. The number of model parameters and test time evaluation complexity of the proposed method are O(dD) where D is the dimensionality of the input features and d is the dimension of the projection space - this is in contrast to the usual kernelization methods as, unlike them, the complexity does not scale linearly with the number of training examples. We propose a stochastic gradient based learning algorithm which makes the method scalable (w.r.t. the number of training examples), while being nonlinear. We train the method with up to half a million training pairs of 4096 dimensional CNN features. We give empirical comparisons with relevant baselines on seven challenging datasets for the task of low dimensional semantic category based image retrieval.
@inproceedings{nml_iccv_2016, title = {Scalable Nonlinear Embeddings for Semantic Category-based Image Retrieval}, author = {Sharma, Gaurav and Schiele, Bernt}, year = {2015}, url = {https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Sharma_Scalable_Nonlinear_Embeddings_ICCV_2015_paper.pdf}, booktitle = {ICCV} }
-
Latent Max-margin Metric Learning for Comparing Video Face Tubes (Best Paper Award) Gaurav Sharma, and Patrick Perez In CVPRW 2015 [Abstract] [BibTex] [PDF]
Comparing "face tubes" is a key component of modern systems for face biometrics based video analysis and annotation. We present a novel algorithm to learn a distance metric between such spatio-temporal face tubes in videos. The main novelty in the algorithm is based on incorporation of latent variables in a max-margin metric learning framework. The latent formulation allows us to model, and learn metrics to compare faces under different challenging variations in pose, expressions and lighting. We propose a novel dataset named TV Series Face Tubes (TSFT) for evaluating the task. The dataset is collected from 12 different episodes of 8 popular TV series and has 94 subjects with 569 manually annotated face tracks in total. We show quantitatively how incorporating latent variables in max-margin metric learning leads to improvement of current state-of-the-art metric learning methods for the two cases when the testing is done with subjects that were seen during training and when the test subjects were not seen at all during training. We also give results on a challenging benchmark dataset: YouTube faces, and place our algorithm in context w.r.t. existing methods.
@inproceedings{lftc_cvprw2015, title = {Latent Max-margin Metric Learning for Comparing Video Face Tubes}, honor = {Best Paper Award}, author = {Sharma, Gaurav and Perez, Patrick}, year = {2015}, url = {http://openaccess.thecvf.com/content_cvpr_workshops_2015/W02/papers/Sharma_Latent_Max-Margin_Metric_2015_CVPR_paper.pdf}, booktitle = {CVPRW} }
2014
-
EPML: Expanded Parts based Metric Learning for Occlusion Robust Face Verification Gaurav Sharma, Frederic Jurie, and Patrick Perez In ACCV 2014 [BibTex]
@inproceedings{epml_acccv14, title = { {EPML}: {E}xpanded Parts based Metric Learning for Occlusion Robust Face Verification}, author = {Sharma, Gaurav and Jurie, Frederic and Perez, Patrick}, year = {2014}, booktitle = {ACCV} }
-
Learning Non-linear SVM in Input Space for Image Classification Gaurav Sharma, Frederic Jurie, and Patrick Perez HAL Technical Report hal-00977304 2014 [BibTex]
@article{nsvm_hal14, title = {Learning Non-linear SVM in Input Space for Image Classification}, author = {Sharma, Gaurav and Jurie, Frederic and Perez, Patrick}, year = {2014}, journal = {HAL Technical Report hal-00977304} }
-
Some faces are more equal than others: Hierarchical organization for accurate and efficient large-scale identity-based face retrieval Binod Bhattarai, Gaurav Sharma, Frederic Jurie, and Patrick Perez In ECCVW 2014 [BibTex]
@inproceedings{hfaces_eccvw14, title = {Some faces are more equal than others: {H}ierarchical organization for accurate and efficient large-scale identity-based face retrieval}, author = {Bhattarai, Binod and Sharma, Gaurav and Jurie, Frederic and Perez, Patrick}, year = {2014}, booktitle = {ECCVW} }
-
Transfer Learning via Attributes for Improved On-the-fly Classification Praveen Kulkarni, Gaurav Sharma, Joaquin Zepeda, and Louis Chevallier In WACV 2014 [BibTex]
@inproceedings{tl_wacv14, title = {Transfer Learning via Attributes for Improved On-the-fly Classification}, author = {Kulkarni, Praveen and Sharma, Gaurav and Zepeda, Joaquin and Chevallier, Louis}, year = {2014}, booktitle = {WACV} }
2013
-
A Novel Approach for Efficient SVM Classification with Histogram Intersection Kernel (Oral presentation; 7% acceptance rate) Gaurav Sharma, and Frederic Jurie In BMVC 2013 [BibTex]
@inproceedings{nsvm_bmvc13, title = {A Novel Approach for Efficient SVM Classification with Histogram Intersection Kernel}, honor = {Oral presentation; 7% acceptance rate}, author = {Sharma, Gaurav and Jurie, Frederic}, year = {2013}, booktitle = {BMVC} }
2012
-
Expanded Parts Model for Human Attribute and Action Recognition in Still Images Gaurav Sharma, Frederic Jurie, and Cordelia Schmid In CVPR 2012 [Abstract] [BibTex]
We propose a new model for recognizing human attributes (e.g. wearing a suit, sitting, short hair) and actions (e.g. running, riding a horse) in still images. The proposed model relies on a collection of part templates which are learnt discriminatively to explain specific scale-space locations in the images (in human centric coordinates). It avoids the limitations of highly structured models, which consist of a few (i.e. a mixture of) ‘average’ templates. To learn our model, we propose an algorithm which automatically mines out parts and learns corresponding discriminative templates with their respective locations from a large number of candidate parts. We validate the method on recent challenging datasets: (i) Willow 7 actions [7], (ii) 27 Human Attributes (HAT) [25], and (iii) Stanford 40 actions [37]. We obtain convincing qualitative and state-of-the-art quantitative results on the three datasets.
@inproceedings{epm_cvpr12, title = {Expanded Parts Model for Human Attribute and Action Recognition in Still Images}, author = {Sharma, Gaurav and Jurie, Frederic and Schmid, Cordelia}, year = {2012}, booktitle = {CVPR} }
-
Semantic Description of Humans in Images Gaurav Sharma 2012 [BibTex]
@phdthesis{sharma_thesis2012, title = {Semantic Description of Humans in Images}, author = {Sharma, Gaurav}, year = {2012}, school = {LEAR -- INRIA, GREYC -- CNRS} }
-
Local Higher-Order Statistics (LHS) for Texture Categorization and Facial Analysis Gaurav Sharma, Sibt ul Hussain, and Frederic Jurie In ECCV 2012 [BibTex]
@inproceedings{sharma_eccv2012, title = {Local Higher-Order Statistics ({LHS}) for Texture Categorization and Facial Analysis}, author = {Sharma, Gaurav and ul Hussain, Sibt and Jurie, Frederic}, year = {2012}, booktitle = {ECCV} }
-
Discriminative Spatial Saliency for Image Classification Gaurav Sharma, Sibt ul Hussain, and Frederic Jurie In CVPR 2012 [BibTex]
@inproceedings{sharma_cvpr2012, title = {Discriminative Spatial Saliency for Image Classification}, author = {Sharma, Gaurav and ul Hussain, Sibt and Jurie, Frederic}, year = {2012}, booktitle = {CVPR} }
2011
-
Learning Discriminative Spatial Representation for Image Classification (Oral presentation; 8% acceptance rate) Gaurav Sharma, and Frederic Jurie In BMVC 2011 [BibTex]
@inproceedings{dsr_bmvc2011, title = {Learning Discriminative Spatial Representation for Image Classification}, honor = {Oral presentation; 8% acceptance rate}, author = {Sharma, Gaurav and Jurie, Frederic}, year = {2011}, booktitle = {BMVC} }
2010
-
Distributed Calibration of Pan-Tilt Camera Network using Multi-Layered Belief Propagation Ayesha Choudhary, Gaurav Sharma, Santanu Chaudhury, and Subhashis Banerjee In Workshop on Camera Networks, Computer Vision and Pattern Recognition (CVPR) 2010 [BibTex]
@inproceedings{distcalib2010, title = {Distributed Calibration of Pan-Tilt Camera Network using Multi-Layered Belief Propagation}, author = {Choudhary, Ayesha and Sharma, Gaurav and Chaudhury, Santanu and Banerjee, Subhashis}, booktitle = {Workshop on Camera Networks, Computer Vision and Pattern Recognition (CVPR)}, year = {2010} }
2009
-
Adaptive Digital Makeup Abhinav Dhall, Gaurav Sharma, Rajen Bhatt, and Ghulam M. Khan In International Symposium on Visual Computing (ISVC) 2009 [BibTex]
@inproceedings{adaptmakeup2009, title = {Adaptive Digital Makeup}, author = {Dhall, Abhinav and Sharma, Gaurav and Bhatt, Rajen and Khan, Ghulam M.}, booktitle = {International Symposium on Visual Computing (ISVC)}, year = {2009} }
-
Hierarchical System for Categorization and Orientation Detection of Consumer Images Gaurav Sharma, Abhinav Dhall, Santanu Chaudhury, and Rajen Bhatt In International Conference on Pattern Recognition and Machine Intelligence (PReMI) 2009 [BibTex]
@inproceedings{catorient2009, title = {Hierarchical System for Categorization and Orientation Detection of Consumer Images}, author = {Sharma, Gaurav and Dhall, Abhinav and Chaudhury, Santanu and Bhatt, Rajen}, booktitle = {International Conference on Pattern Recognition and Machine Intelligence (PReMI)}, year = {2009} }
-
Curvature Feature Distribution based Classification of Indian Scripts from Document Images Gaurav Sharma, Ritu Garg, and Santanu Chaudhury In Workshop on Multilingual OCR, International Conference on Document Analysis and Recognition (ICDAR) 2009 [BibTex]
@inproceedings{indscript2009, title = {Curvature Feature Distribution based Classification of Indian Scripts from Document Images}, author = {Sharma, Gaurav and Garg, Ritu and Chaudhury, Santanu}, booktitle = {Workshop on Multilingual OCR, International Conference on Document Analysis and Recognition (ICDAR)}, year = {2009} }
-
Object Detection as Statistical Test of Hypothesis Gaurav Sharma, Santanu Chaudhury, and J. B. Srivastava In Indian Conference on Vision Graphics and Image Processing (ICVGIP) 2009 [BibTex]
@inproceedings{objdet2009, title = { Object Detection as Statistical Test of Hypothesis}, author = {Sharma, Gaurav and Chaudhury, Santanu and Srivastava, J. B.}, booktitle = {Indian Conference on Vision Graphics and Image Processing (ICVGIP)}, year = {2009} }
2008
-
Kernel Eigen Space Merging Gaurav Sharma, Santanu Chaudhury, and J. B. Srivastava In Technical report for Master's Thesis, Indian Institute of Technology Delhi (IITD) 2008 [BibTex]
@inproceedings{kereigtr2008, title = {Kernel Eigen Space Merging}, author = {Sharma, Gaurav and Chaudhury, Santanu and Srivastava, J. B.}, booktitle = {Technical report for Master's Thesis, Indian Institute of Technology Delhi (IITD)}, year = {2008} }
-
Bag-of-Features Kernel Eigen Spaces for Classification Gaurav Sharma, Santanu Chaudhury, and J. B. Srivastava In International Conference on Pattern Recognition (ICPR) 2008 [BibTex]
@inproceedings{kerneleig2008, title = {Bag-of-Features Kernel Eigen Spaces for Classification}, author = {Sharma, Gaurav and Chaudhury, Santanu and Srivastava, J. B.}, booktitle = {International Conference on Pattern Recognition (ICPR)}, year = {2008} }
2007
-
Text Mining through Entity-Relationship Based Information Extraction Lipika Dey, M. Abulaish, Mr. Jahiruddin, and Gaurav Sharma In Workshop on Bio-Medical Application of Web Technology, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology 2007 [BibTex]
@inproceedings{textmine2007, title = {Text Mining through Entity-Relationship Based Information Extraction}, author = {Dey, Lipika and Abulaish, M. and Jahiruddin, Mr. and Sharma, Gaurav}, booktitle = {Workshop on Bio-Medical Application of Web Technology, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology}, year = {2007} }