Assistant Professor,
Department of Computer Science and Engineering (joint appointment with the Department of Human-centered Design)
Indraprastha Institute of Information Technology, Delhi (IIIT-Delhi).
Rajiv Ratn Shah currently works as an Assistant Professor in the Department of Computer Science and Engineering (joint appointment with the Department of Human-centered Design) at IIIT-Delhi. He received his Ph.D. in computer science from the National University of Singapore, Singapore. Before joining IIIT-Delhi, he worked as a Research Fellow at the Living Analytics Research Center (LARC) at Singapore Management University, Singapore. Prior to completing his Ph.D., he received his M.Tech. and M.C.A. degrees in Computer Applications from Delhi Technological University, Delhi, and Jawaharlal Nehru University, Delhi, respectively. He also received a B.Sc. in Mathematics (Honors) from Banaras Hindu University, Varanasi. Dr. Shah is the recipient of several awards, including the prestigious Heidelberg Laureate Forum (HLF) and European Research Consortium for Informatics and Mathematics (ERCIM) fellowships. He received the best paper award in the IWGS workshop at the ACM SIGSPATIAL conference 2016, San Francisco, USA, and was runner-up in the Grand Challenge competition of the ACM International Conference on Multimedia 2015, Brisbane, Australia. He is involved in organizing, and reviewing for, many top-tier international conferences and journals. Recently, he organized a workshop on Multimodal Representation, Retrieval, and Analysis of Multimedia Content (MR2AMC) in conjunction with the first IEEE MIPR 2018 conference. His research interests include multimedia content processing, natural language processing, image processing, multimodal computing, data science, social media computing, and the Internet of Things.
Specifically, I am looking for motivated students and interns in the following areas at IIIT-Delhi:
Debanjan Mahata, John Kuriakose, Rajiv Ratn Shah, Roger Zimmermann, "Key2Vec: Automatic Ranked Keyphrase Extraction from Scientific Articles using Phrase Embeddings," in Proceedings of NAACL (accepted), New Orleans, Louisiana, USA, 2018.
Hitkul Jangid, Shivangi Singhal, Rajiv Ratn Shah, Roger Zimmermann, "Aspect-Based Financial Sentiment Analysis using Deep Learning," in Proceedings of the WWW Conference (pages 1961-1966), Perth, Australia, 2018.
Yifang Yin, Rajiv Ratn Shah, Guanfeng Wang, Roger Zimmermann, "Feature-based Map Matching for Low-Sampling-Rate GPS Trajectories," ACM Transactions on Spatial Algorithms and Systems (accepted), 2018.
Debanjan Mahata, Jasper Friedrichs, Rajiv Ratn Shah, Jing Jiang, "Did you take the #pill? - Detecting Personal Intake of Medicine from Twitter," IEEE Intelligent Systems, special issue on Affective Computing and Sentiment Analysis (accepted), 2018.
Mayank Meghawat, Satyendra Yadav, Debanjan Mahata, Yifang Yin, Rajiv Ratn Shah, Roger Zimmermann, "A Multimodal Approach to Predict Social Media Popularity," in Proceedings of the MR2AMC workshop at IEEE MIPR (in press), Miami, Florida, USA, 2018.
See the List of All Publications.

Mining social media messages such as tweets, blogs, and Facebook posts for health and drug-related information has received significant interest in pharmacovigilance research. Social media sites (e.g., Twitter) have been used for monitoring drug abuse and adverse reactions to drug usage, and for analyzing the expression of sentiments related to drugs. Most of these studies are based on aggregated results from a large population rather than specific sets of individuals. In order to conduct studies at an individual level or on specific cohorts, it is necessary to identify posts mentioning intake of medicine by the user. Towards this objective, we develop a classifier for identifying mentions of personal intake of medicine in tweets. We train a stacked ensemble of shallow convolutional neural network (CNN) models on an annotated dataset. We use random search to tune the hyper-parameters of the CNN models and present an ensemble of the best models for the prediction task. Our system produces state-of-the-art results, with a micro-averaged F-score of 0.693. We believe that the developed classifier has direct uses in the areas of psychology, health informatics, pharmacovigilance, and affective computing for tracking the moods, emotions, and sentiments of patients expressing intake of medicine in social media.
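The following is a minimal sketch (not the paper's code) of the two pieces just described: random search over the hyper-parameters of shallow text CNNs, followed by an ensemble of the best models. The vocabulary size, sequence length, class count, and search ranges are illustrative assumptions, and simple soft voting stands in for the paper's stacking step.

```python
# Illustrative sketch: random hyper-parameter search over shallow text CNNs,
# then a soft-voting ensemble of the best models. All constants are assumptions.
import random
import numpy as np
from tensorflow.keras import layers, models

VOCAB, MAXLEN, CLASSES = 20000, 50, 3  # assumed dataset constants

def build_cnn(filters, kernel_size, dropout, embed_dim):
    """One shallow text CNN: embedding -> single conv -> global max pool -> softmax."""
    m = models.Sequential([
        layers.Input(shape=(MAXLEN,)),
        layers.Embedding(VOCAB, embed_dim),
        layers.Conv1D(filters, kernel_size, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dropout(dropout),
        layers.Dense(CLASSES, activation="softmax"),
    ])
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return m

def random_search(x_tr, y_tr, x_val, y_val, trials=20, keep=5):
    """Sample random configurations; keep the `keep` best models by validation accuracy."""
    scored = []
    for _ in range(trials):
        cfg = dict(filters=random.choice([64, 128, 256]),
                   kernel_size=random.choice([2, 3, 4, 5]),
                   dropout=random.uniform(0.2, 0.6),
                   embed_dim=random.choice([100, 200, 300]))
        model = build_cnn(**cfg)
        model.fit(x_tr, y_tr, epochs=3, batch_size=64, verbose=0)
        acc = float((model.predict(x_val, verbose=0).argmax(1) == y_val).mean())
        scored.append((acc, model))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [m for _, m in scored[:keep]]

def ensemble_predict(members, x):
    """Soft voting: average class probabilities across the ensemble members."""
    probs = np.mean([m.predict(x, verbose=0) for m in members], axis=0)
    return probs.argmax(axis=1)
```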
Aspect-based sentiment analysis aims to detect an aspect (i.e., a feature) in a given text and then perform sentiment analysis of the text with respect to that aspect. This paper presents a solution for subtask 1 of the FiQA 2018 challenge: aspect-based sentiment analysis on microblogs and headlines from the financial domain. We use a multi-channel convolutional neural network for sentiment analysis and a recurrent neural network with bidirectional long short-term memory units to extract the aspect from a given headline or microblog. Our proposed model produces a weighted average F1 score of 0.69 for the aspect extraction task and predicts sentiment intensity scores with a mean squared error of 0.112 on 10-fold cross-validation. We believe that the developed system has direct applications in the financial domain.
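As a rough illustration of the two model shapes described above, here is a hedged Keras sketch: a multi-channel CNN (parallel convolutions with different kernel widths over a shared embedding) regressing sentiment intensity, and a BiLSTM classifier for aspect extraction. The vocabulary size, sequence length, aspect count, and layer sizes are assumptions, not the paper's settings.

```python
# Illustrative model shapes only; hyper-parameters are assumed, not the paper's.
from tensorflow.keras import layers, models

VOCAB, MAXLEN, N_ASPECTS = 20000, 40, 10  # assumed constants

def multichannel_cnn():
    """Parallel conv 'channels' with different kernel sizes over one embedding."""
    inp = layers.Input(shape=(MAXLEN,))
    emb = layers.Embedding(VOCAB, 300)(inp)
    channels = [layers.GlobalMaxPooling1D()(layers.Conv1D(128, k, activation="relu")(emb))
                for k in (2, 3, 4)]
    x = layers.Concatenate()(channels)
    out = layers.Dense(1, activation="tanh")(x)  # sentiment intensity in [-1, 1]
    m = models.Model(inp, out)
    m.compile(optimizer="adam", loss="mse")  # the paper reports MSE on this task
    return m

def bilstm_aspect_classifier():
    """Bidirectional LSTM that assigns an aspect class to a headline/microblog."""
    inp = layers.Input(shape=(MAXLEN,))
    emb = layers.Embedding(VOCAB, 300)(inp)
    x = layers.Bidirectional(layers.LSTM(128))(emb)
    out = layers.Dense(N_ASPECTS, activation="softmax")(x)
    m = models.Model(inp, out)
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return m
```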
Capturing videos anytime and anywhere, and then instantly sharing them online, has become a very popular activity. However, many outdoor user-generated videos (UGVs) lack a certain appeal because their soundtracks consist mostly of ambient background noise. Aimed at making UGVs more attractive, we introduce ADVISOR, a personalized video soundtrack recommendation system. We propose a fast and effective heuristic ranking approach based on heterogeneous late fusion by jointly considering three aspects: venue categories, visual scene, and user listening history. Specifically, we combine confidence scores, produced by SVMhmm models constructed from geographic, visual, and audio features, to obtain different types of video characteristics. Our contributions are threefold. First, we predict scene moods from a real-world video dataset that was collected from users' daily outdoor activities. Second, we perform heuristic rankings to fuse the predicted confidence scores of multiple models, and third, we customize the video soundtrack recommendation functionality to make it compatible with mobile devices. A series of extensive experiments confirm that our approach performs well and recommends appealing soundtracks for UGVs to enhance the viewing experience. This work results in the following publications.
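A minimal sketch of the late-fusion ranking idea follows: per-modality confidence scores are min-max normalized and combined with modality weights to rank candidate scene moods. The modalities, labels, scores, and weights below are illustrative; in ADVISOR the scores come from SVMhmm models and the fusion is a heuristic ranking rather than this simple weighted sum.

```python
# Toy late fusion of per-modality confidence scores; all numbers are made up.
import numpy as np

def late_fusion_rank(scores_by_modality, weights):
    """scores_by_modality: {modality: {label: confidence}} -> labels, best first."""
    labels = sorted(next(iter(scores_by_modality.values())))
    fused = np.zeros(len(labels))
    for modality, scores in scores_by_modality.items():
        s = np.array([scores[l] for l in labels], dtype=float)
        s = (s - s.min()) / (s.max() - s.min() + 1e-9)  # min-max normalize per modality
        fused += weights[modality] * s
    return [labels[i] for i in np.argsort(-fused)]

# Hypothetical example: fuse scene-mood confidences from three modalities.
ranked = late_fusion_rank(
    {"geo":    {"calm": 0.2, "happy": 0.7, "sad": 0.1},
     "visual": {"calm": 0.5, "happy": 0.4, "sad": 0.1},
     "audio":  {"calm": 0.3, "happy": 0.6, "sad": 0.1}},
    weights={"geo": 0.3, "visual": 0.4, "audio": 0.3})
print(ranked)  # ['happy', 'calm', 'sad']
```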
Social media platforms such as Flickr allow users to annotate photos with descriptive keywords, called tags, with the goal of making multimedia content easily understandable, searchable, and discoverable. However, due to the manual, ambiguous, and personalized nature of user tagging, many tags of a photo are in a random order and some are even irrelevant to the visual content. Aiming to automatically compute tag relevance for a given photo, we propose a tag ranking scheme based on voting from photo neighbors derived from multimodal information. Specifically, we determine photo neighbors leveraging geo, visual, and semantic concepts derived from spatial information, visual content, and textual metadata, respectively. We leverage high-level features instead of traditional low-level features to compute tag relevance. Moreover, we explore the fusion of multimodal information to refine tag ranking, leveraging recall-based weighting. Subsequently, we build a tag recommendation system, because manual annotation is very time-consuming and cumbersome for most users, which makes it difficult to search for relevant photos. Moreover, predicted tags for a photo are not necessarily relevant to users' interests. We aim to automatically annotate photos such that tags describe objective aspects of the photos while considering user tagging behavior. Our tag recommendation system, called PROMPT, recommends personalized tags for a given photo leveraging personal and social contexts. Specifically, we first determine a group of users who have tagging behavior similar to the user of the photo, which is very useful in recommending personalized tags. Next, we find candidate tags from visual content, textual metadata, and tags of neighboring photos, and recommend the five most suitable tags. We initialize the scores of the candidate tags using asymmetric tag co-occurrence probabilities and normalized scores of tags after neighbor voting, and later perform a random walk to promote tags that have many close neighbors and weaken isolated tags. Finally, we recommend the top five tags for the given photo. This work results in the following publications.
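The following toy sketch (with made-up numbers, not the PROMPT implementation) shows the re-ranking core just described: candidate tag scores are initialized from neighbor voting and then refined by a random walk with restart over a tag co-occurrence graph, which promotes well-connected tags and weakens isolated ones.

```python
# Toy random-walk re-ranking of candidate tags; graph values are illustrative.
import numpy as np

def random_walk_rerank(init_scores, cooccurrence, alpha=0.85, iters=50):
    """init_scores: (n,) neighbor-voting scores; cooccurrence: (n, n) tag
    co-occurrence strengths. Returns refined relevance scores."""
    p = init_scores / init_scores.sum()
    T = cooccurrence / (cooccurrence.sum(axis=1, keepdims=True) + 1e-9)  # row-stochastic
    r = p.copy()
    for _ in range(iters):
        r = alpha * T.T @ r + (1 - alpha) * p  # walk step plus restart at the priors
    return r

# Hypothetical example with three candidate tags.
tags = ["beach", "sunset", "meeting"]
votes = np.array([5.0, 4.0, 1.0])       # neighbor-voting scores (toy)
C = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.1],
              [0.1, 0.1, 0.0]])         # tag co-occurrence graph (toy)
scores = random_walk_rerank(votes, C)
print(sorted(zip(tags, scores), key=lambda t: -t[1]))
```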
The rapid growth in the amount of user-generated content (UGC) online makes it necessary for social media companies to automatically extract knowledge structures (concepts) from photos and videos to provide diverse multimedia-related services. However, real-world photos and videos are complex and noisy, and extracting semantics and sentics from the multimedia content alone is a very difficult task because suitable concepts may be exhibited in different representations. Hence, it is desirable to analyze UGC from multiple modalities for a better understanding. To this end, we first present the EventBuilder system, which deals with semantics understanding and automatically generates a multimedia summary for a given event in real time by leveraging online sources such as Wikipedia and Flickr. Subsequently, we present the EventSensor system, which aims to address sentics understanding and produces a multimedia summary for a given mood. It extracts concepts and mood tags from the visual content and textual metadata of UGC, and exploits them to support several significant multimedia-related services such as a musical multimedia summary. Moreover, EventSensor supports sentics-based event summarization by leveraging EventBuilder as its semantics engine. Experimental results confirm that both EventBuilder and EventSensor outperform their baselines and efficiently summarize knowledge structures on the YFCC100M dataset. This work results in the following publications.
Accurate map matching is a fundamental but challenging problem that has drawn great research attention in recent years. It aims to reduce the uncertainty in a trajectory by matching the GPS points to the road network on a digital map. Most existing work has focused on estimating the likelihood of a candidate path based on the GPS observations, while neglecting to model the probability of a route choice from the perspective of drivers. Here we propose a novel feature-based map matching algorithm that estimates the cost of a candidate path based on both GPS observations and human factors. Taking human factors into consideration is especially important when dealing with low-sampling-rate data, where most of the movement details are lost. Additionally, we simultaneously analyze a subsequence of coherent GPS points by utilizing a new segment-based probabilistic map matching strategy, which is less susceptible to noise in the positioning data. We have evaluated the proposed approach on a public large-scale GPS dataset that consists of 100 trajectories distributed all over the world. The experimental results show that our method is robust to sparse data with large sampling intervals (e.g., 60 s to 300 s) and challenging track features (e.g., u-turns and loops). Compared with two state-of-the-art map matching algorithms, our method substantially reduces the route mismatch error by 6.4% to 32.3% and obtains the best map matching results in all the different combinations of sampling rates and challenging features. This work results in the following publication.
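To make the path-cost idea concrete, here is a toy sketch (illustrative weights and numbers, not the paper's learned model): a candidate path is scored by combining a Gaussian GPS-observation term with a human-factor route-choice penalty computed from path features such as u-turns and turn counts.

```python
# Toy candidate-path scoring: observation term plus human-factor term.
# Feature names and weights are illustrative stand-ins for learned ones.
def observation_cost(gps_to_road_dists, sigma=20.0):
    """Negative log-likelihood of GPS fixes under Gaussian noise (meters)."""
    return sum(d * d / (2 * sigma * sigma) for d in gps_to_road_dists)

def route_choice_cost(path_features, weights):
    """Human-factor penalty from path features such as u-turns or detour length."""
    return sum(weights[k] * v for k, v in path_features.items())

def candidate_cost(gps_to_road_dists, path_features, weights, lam=1.0):
    return observation_cost(gps_to_road_dists) + lam * route_choice_cost(path_features, weights)

# Hypothetical comparison of two candidate paths for one GPS subsequence.
w = {"u_turns": 5.0, "length_km": 0.5, "turns": 1.0}
direct = candidate_cost([12, 8, 15], {"u_turns": 0, "length_km": 2.1, "turns": 2}, w)
detour = candidate_cost([10, 9, 11], {"u_turns": 1, "length_km": 2.6, "turns": 4}, w)
print("pick:", "direct" if direct < detour else "detour")
```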
In multimedia-based e-learning systems, the accessibility and searchability of most lecture video content is still insufficient due to the unscripted and spontaneous speech of the speakers. Moreover, this problem becomes even more challenging when the quality of such lecture videos is not sufficiently high. To extract the structural knowledge of a multi-topic lecture video and thus make it easily accessible, it is very desirable to divide each video into shorter clips by performing automatic topic-wise video segmentation. To this end, we first present the ATLAS system, which leverages the visual content and transcription of a lecture video to determine segment boundaries. Subsequently, we present the TRACE system, which leverages existing knowledge bases such as Wikipedia in addition to visual content and transcription to determine segment boundaries. TRACE makes two main contributions: (i) the extraction of a novel linguistic-based Wikipedia feature to segment lecture videos efficiently, and (ii) the investigation of the late fusion of video segmentation results derived from state-of-the-art algorithms. Specifically, for the late fusion, we combine confidence scores produced by the models constructed from visual, transcription, and Wikipedia features. This work results in the following publications.
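A minimal sketch of that late-fusion step, under assumed inputs: boundary-confidence curves from the visual, transcription, and Wikipedia models are combined with weights, and time points whose fused score passes a threshold become segment boundaries. All numbers are illustrative.

```python
# Toy late fusion of boundary-confidence curves from three feature models.
import numpy as np

def fuse_boundaries(curves, weights, threshold=0.5):
    """curves: {feature: (T,) per-time boundary confidences} -> boundary indices."""
    fused = sum(weights[k] * np.asarray(v, dtype=float) for k, v in curves.items())
    return np.flatnonzero(fused >= threshold)

boundaries = fuse_boundaries(
    {"visual":        [0.1, 0.7, 0.2, 0.1, 0.6],
     "transcription": [0.2, 0.8, 0.1, 0.2, 0.4],
     "wikipedia":     [0.1, 0.9, 0.3, 0.1, 0.5]},
    weights={"visual": 0.3, "transcription": 0.3, "wikipedia": 0.4})
print(boundaries)  # [1 4] -> candidate topic boundaries
```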
An interesting recent trend, enabled by the ubiquitous availability of mobile devices, is that regular citizens report events which news providers then disseminate, e.g., CNN iReport. Often such news is captured in places with very weak network infrastructure, and it is imperative that a citizen journalist can quickly and reliably upload videos in the face of slow, unstable, and intermittent Internet access. We envision that middleboxes are deployed to collect these videos over energy-efficient short-range wireless networks. Multiple videos may need to be prioritized, and then optimally transcoded and scheduled. In this study we introduce an adaptive middlebox design, called NEWSMAN, to support citizen journalists. NEWSMAN jointly considers two aspects under varying network conditions: (i) choosing the optimal transcoding parameters, and (ii) determining the uploading schedule for news videos. We design, implement, and evaluate an efficient scheduling algorithm to maximize a user-specified objective function. We conduct a series of experiments using trace-driven simulations, which confirm that our approach is practical and performs well. For instance, NEWSMAN outperforms existing algorithms (i) by 12 times in terms of system utility (i.e., the sum of the utilities of all uploaded videos), and (ii) by 4 times in terms of the number of videos uploaded before their deadline. This work results in the following publication.
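For intuition, here is a toy greedy scheduler in the spirit of NEWSMAN's objective; the utilities, sizes, deadlines, and the greedy rule are illustrative, and the paper's algorithm additionally selects transcoding parameters. Videos are uploaded in order of utility per second of transmission time, skipping any video that would miss its deadline.

```python
# Toy deadline-aware upload scheduler; all inputs are hypothetical.
def schedule(videos, bandwidth_mbps):
    """videos: list of dicts with name, utility, size_mb, deadline_s. Returns order."""
    # Greedy rule: highest utility per second of upload time first.
    ordered = sorted(videos,
                     key=lambda v: v["utility"] / (v["size_mb"] * 8 / bandwidth_mbps),
                     reverse=True)
    t, plan = 0.0, []
    for v in ordered:
        upload_time = v["size_mb"] * 8 / bandwidth_mbps
        if t + upload_time <= v["deadline_s"]:  # schedule only if it meets its deadline
            plan.append(v["name"])
            t += upload_time
    return plan

clips = [{"name": "protest", "utility": 9, "size_mb": 80, "deadline_s": 300},
         {"name": "flood",   "utility": 7, "size_mb": 20, "deadline_s": 120},
         {"name": "parade",  "utility": 3, "size_mb": 60, "deadline_s": 200}]
print(schedule(clips, bandwidth_mbps=4.0))  # ['flood', 'protest']
```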
We provide a solution for matching SMS queries with FAQs in Malayalam, Hindi, and English. In order to perform matching between SMS queries and an FAQ database, we introduce an enhanced similarity score, a proximity score, an enhanced length score, and an answer matching system. We introduce stemming of terms and consider the effects of joining adjacent terms in the SMS query and FAQ to improve the similarity score. We propose a novel method to normalize FAQ and SMS tokens to improve accuracy for the Hindi language. Moreover, we suggest a few character substitutions to handle errors in the SMS query. We demonstrate the effectiveness of our approach on several real-life FAQ datasets provided by FIRE from a number of different domains such as health, telecom, insurance, and railway booking. Experimental results confirm that our solution for the SMS-based FAQ retrieval monolingual task is very encouraging and was among the top submissions for English, Hindi, and Malayalam. The Mean Reciprocal Rank (MRR) scores for our approach are 0.971, 0.973, and 0.761 for English, Hindi, and Malayalam, respectively, in the FIRE 2012 SMS-based FAQ retrieval monolingual task. Furthermore, our solution topped the task for Hindi with an MRR score of 0.971 in FIRE 2013. Our approach also performed very well for English in FIRE 2013, even though transcripts of speech queries were included in the test dataset along with the normal SMS queries. This work results in the following publications.
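The sketch below illustrates, in simplified form and with a placeholder stemmer, two of the ingredients above: stemming of terms and joining adjacent SMS tokens before computing a token-overlap similarity between an SMS query and an FAQ. The actual enhanced similarity, proximity, and length scores are more involved.

```python
# Simplified token-overlap matching with stemming and adjacent-token joining.
def stem(token):
    # Placeholder stemmer: strip a few common English suffixes.
    for suf in ("ing", "es", "s"):
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

def query_terms(sms_tokens):
    """Stemmed tokens plus joins of adjacent tokens, to undo SMS-style splits."""
    terms = {stem(t) for t in sms_tokens}
    terms |= {stem(a + b) for a, b in zip(sms_tokens, sms_tokens[1:])}
    return terms

def similarity(sms_tokens, faq_tokens):
    q, f = query_terms(sms_tokens), {stem(t) for t in faq_tokens}
    return len(q & f) / max(len(f), 1)

# Hypothetical example: "rail way" joins to match "railway" in the FAQ.
sms = "rail way ticket booking".split()
faq = "how do i book a railway ticket".split()
print(similarity(sms, faq))  # ~0.43
```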