Accepted Papers at iWML 2016

Below is given a list of papers that will be presented at iWML 2016.

Click on [Abstract] to toggle the visibility of the abstract of the paper. Extended abstracts of the papers will appear in [PDF] format shortly

Oral Presentation

Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking & Other Missing Label Applications
Himanshu Jain, Yashoteja Prabhu and Manik Varma
[PDF] | [Abstract]
The choice of the loss function is critical in extreme multi-label learning where the objective is to annotate each data point with the most relevant subset of labels from an extremely large label set. Unfortunately, existing loss functions, such as the Hamming loss, are unsuitable for learning, model selection, hyperparameter tuning and performance evaluation. This paper addresses the issue by developing propensity scored losses which: (a) prioritize predicting the few relevant labels over the millions of irrelevant ones; (b) do not erroneously treat missing labels as irrelevant; and (c) promote the accurate prediction of hard to predict, but rewarding tail labels. Another contribution is the development of algorithms which efficiently scale to extremely large datasets with up to 9 million labels, 70 million points and 2 million dimensions and which give significant improvements over the state-of-the-art. We also demonstrate that the proposed contributions achieve superior clickthrough rates on sponsored search ranking problems in Bing.
Deep Convolutional Networks for Modeling Image Virality
Abhimanyu Dubey and Sumeet Agarwal
[PDF] | [Abstract]
Study of virality and information diffusion is a topic gaining traction rapidly in the computational social sciences. Computer vision and social network analysis research has also focused on understanding the impact of content and information diffusion in making content viral. We present a novel algorithm to model image virality on online networks using the increasingly popular deep convolutional neural network architectures. Our proposed model provides significant insights into the features that are responsible for promoting virality and surpass the existing state-of-the-art by a 10% relative improvement in prediction.
Sparse Local Embeddings for Extreme Multi-label Classification
Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma and Prateek Jain
[PDF] | [Abstract]
The objective in extreme multi-label learning is to train a classifier that can automatically tag a novel data point with the most relevant subest of labels from an extremely large label set. Embedding based approaches make training and prediction tractable by assuming that the training label matrix is low-rank and hence the effective number of labels can be reduced by projecting the high dimensional label vectors onto a low dimensional linear subspace. Still, leading embedding approaches have been unable to deliver high prediction accuracies or scale to large problems as the low rank assumption is violated in most real world applications.

This paper develops the SLEEC classifier to address both limitations. The main technical contribution in SLEEC is a formulation for learning a small ensemble of local distance preserving embeddings which can accurately predict infrequently occurring (tail) labels. This allows SLEEC to break free of the traditional low-rank assumption and boost classification accuracy by learning embeddings which preserve pairwise distances between only the nearest label vectors.

Spotlight Presentation

Graphical Modeling Driven Policy Insights for Household Air Pollution: Insights from 204,912 Patients's POSEIDON Study (Invited Paper)
Tavpritesh Sethi, Anurag Agrawal, Aditya Nagori, Suveen Angraal and Sundeep Salvi
[PDF] | [Abstract]
While healthcare researchers intuitively recognize the connectivity of physiological systems, data-driven decision models for such interactions are lacking. Graphical models are amongst the best techniques of capturing interactions in complex interconnected data. In this work we leveraged the strengths of probabilistic graphical models applied to a massive, one-of-its-kind clinical data of 204,912 patients collected all across India on a single day to derive policy insights into Household Air Pollution associated respiratory disease. This unique data resource, acronymed POSEIDON (Prevalence Of Symptoms on a single Indian healthcare Day On a Nationwide scale) documented 60 clinical symptoms and diseases in 204,912 patients visiting the OPDs of 7400 randomly sampled general practitioners on a single day, all across India. We constructed "Symptome maps" across the decadal age groups of patients using Association Networks, discovered dynamically changing community structures and visualized these as Alluvial maps. These revealed novel qualitative insights into comorbidity patterns of various diseases and confirmed the high comorbid occurrence of airway diseases across all age groups. To quantify the effect of unclean cooking fuel upon airway disease, we fused these data with district wise Liquefied Petroleum Gas (LPG) consumption and used Bayesian structure learning and inference algorithms to quantify the causal impact of clean cooking fuels (LPG) on lowering Obstructive Airway Disease (OAD). We used Tabu search algorithm to learn structure and approximate inference upon the nodes of interest showed that the rate of OAD could be lowered by as much as 50% if the LPG usage is moved from the lowest to the highest quartile. To our knowledge, this is the first of its kind application of graphical models to a practice based clinical data combined with an indicator of clean energy usage to yield policy insights.
Asynchronous Non-Convex Optimization for Separable Problems
Sandeep Kumar and Ketan Rajawat
[PDF] | [Abstract]
This paper considers the distributed optimization of a sum of locally observable, non-convex functions. The optimization is performed over a multi-agent networked system, and each local function depends only on a subset of the variables. An asynchronous and distributed alternating directions method of multipliers (ADMM) method that allows the nodes to defer or skip the computation and transmission of updates is proposed. The algorithm can tolerate any bounded level of asynchrony and converges to local minima under certain regularity conditions.
Automated Correction for Syntax Errors in Programming Assignments using Recurrent Neural Networks
Sahil Bhatia and Rishabh Singh
[PDF] | [Abstract]
We present a technique for providing feedback on syntax errors that uses Recurrent neural networks (RNNs) to model syntactically valid token sequences. Our approach is inspired from the recent work on learning language models from Big Code (large code corpus). For a given programming assignment, we first learn an RNN to model all valid token sequences using the set of syntactically correct student submissions. Then, for a student submission with syntax errors, we query the learnt RNN model with the prefix token sequence to predict token sequences that can fix the error by either replacing or inserting the predicted token sequence at the error location. We evaluate our technique on over 14,000 student submissions with syntax errors.

Our technique can completely repair 31.69% (4501/14203) of submissions with syntax errors and in addition partially correct 6.39% (908/14203) of the submissions.
Structured and Unstructured Machine Learning for Crowdsourced Spatial Data
Musfira Jilani, Padraig Corcoran and Michela Bertolotto
[PDF] | [Abstract]
Recent years have seen a significant increase in the number of applications requiring accurate and up-to-date spatial data. In this context crowdsourced maps such as OpenStreetMap (OSM) have the potential to provide a free and timely representation of our world. However, one factor that negatively influences the proliferation of these maps is the uncertainty about their data quality. This paper presents structured and unstructured machine learning methods to automatically assess and improve the semantic quality of streets in the OSM database.
Distance Metric Learning by Optimization on the Stiefel Manifold
Ankita Shukla and Saket Anand
[PDF] | [Abstract]
Distance metric learning has proven to be very successful in various problem domains. Most techniques learn a global metric in the form of a n x n symmetric positive semidefinite (PSD) Mahalanobis distance matrix, which has O(n^2) unknowns. The PSD constraint makes solving the metric learning problem even harder making it computationally intractable for high dimensions. In this work, we propose a flexible formulation that can employ different regularization functions, while implicitly maintaining the positive semidefiniteness constraint. We achieve this by eigendecomposition of the rank-p Mahalanobis distance matrix followed by a joint optimization on the Stiefel manifold S_{n,p} and the positive orthant R^p_+. The resulting nonconvex optimization problem is solved by employing an alternating strategy. We use a recently proposed projection free approach for efficient optimization over the Stiefel manifold. Even though the problem is nonconvex, we empirically show competitive classification accuracy on UCI and USPS digits datasets.

Poster Presentation

Supervised Heterogeneous Domain Adaptation via Random Forests
Sanatan Sukhija and Narayanan C Krishnan
[PDF] | [Abstract]
Heterogeneity of features and lack of correspondence between data points of different domains are the two primary challenges while performing feature transfer. In this paper, we present a novel supervised domain adaptation algorithm (SHDA-RF) that learns the mapping between heterogeneous features of different dimensions using random forests. Our algorithm uses the shared label distributions present across the domains as pivots for learning a sparse feature transformation. We conduct extensive experiments on three diverse datasets of varying dimensions and sparsity to verify the superiority of the proposed approach over other baseline and state of the art transfer approaches.
Modelling Right to Information Queries via Item Response Theory
Nayantara Kotoky, Praveen Kumar Kolla and Saradhi V. Vijaya
[PDF] | [Abstract]
The present work is an attempt to use models present in Item Response Theory, which are psychometric models, for modelling Right to Information (RTI) queries. A synthetic matrix resembling our RTI dataset (construction on-going) has been created and we have applied Graded Response Model (GRM) to the data. Outcomes of the experiment have helped us identify latent patterns regarding the RTI query-reply process. The model assigns values to the 'latent ability' of each institution, which we shall call 'transparency' of an institution, and showed us how each institution behaves differently to different topics of RTI queries. It also shows which institutions are good at replying to RTI queries and which institutions perform poorly. Such differences across different departments and institutions imply a possible scope for amendments to the ordinances of the institutions.
Brain Tractography Classication using Curvature Points
Vedang Patel, Anand Parmar, Mani Srivastava, Aditya Nigam and Arnav Bhavsar
[PDF] | [Abstract]
Classification of fiber tracts is an important problem in brain tractography. We propose a supervised algorithm which learns features for anatomically meaningful fiber clusters, from labeled DTI white matter data. The classification is performed at two levels: a) Grey vs White matter (macro level) and b) White matter clusters (micro level). Our approach focuses on high curvature points in the fiber tracts, which embodies the unique characteristics of each class. A test fiber is classified into one of these learned classes by comparing proximity using the learned curvature-point model (for micro level) and with a Neural Network classifier (at macro level). The proposed algorithm is validated over brain DTI data for three subjects containing about 2,50,000 fibers per subject, and has been shown to yield high classification accuracy at both macro and micro levels.
Quantitative Dermatopathology of 11 Skin Conditions through Transfer Learning of Tissue Photon Interaction Statistical Physics
Sumit Agrawal, Sri Phani Krishna Karri, Kausik Basak, Tamoghna Ojha and Debdoot Sheet
[PDF] | [Abstract]
Quantitative dermatopathology is being widely employed in clinical practices for precise detection of various skin conditions. Dermatoscopes are placed in contact with the skin for visual observation and this (a) risks cross-contamination between investigated subjects and (b) limits instrument use when investigating infectious lesions and wounds. This paper presents a framework for quantitative dermatopathology using consumer grade camera for contact free imaging of skin conditions. Inductive transfer learning of tissue-photon-interaction (TPI) statistical physics modeled for 11 skin conditions estimates TPI as a spatially localized poisson process of photons sensed by the RGB sensors, and utilizes this knowledge to detect skin condition. Inter-camera and illumination variations are compensated by including a maximum likelihood estimation (MLE) based photon density normalization across all images in the dataset. 8-folded crossvalidation experiments over a dataset of 440 images of 11 skin conditions provided the condition identification accuracy of 94.55 +- 5.05% within the top-3 predictions.
Landscaping of Random Forests through Controlled Deforestation
Kausik Das, Abhijit Guha Roy, Jyotirmoy Chatterjee and Debdoot Sheet
[PDF] | [Abstract]
Random forest (RF) is an ensemble learner constructed using a set of decision trees, where each tree is trained using randomly bootstrapped samples and aggregated to provide a decision. While the generalization error is reduced by increasing the number of trees in a RF, it substantially increases the testing time complexity, inhibiting its fast deployment in practical applications. In this paper, we propose a post-training optimization technique termed landscaping of RF for reducing computational complexity by compensating for trees associated with similar decision boundary. This allows faster deployment of the RF without compromising its performance. Landscaping is achieved through a two stage mechanism: (i) computation of decision similarity between all pairs of trees in the RF, and (ii) deletion of the computationally expensive tree in the RF with decision bias compensation for the removed tree. Performance of the proposed methodology was evaluated using three publicly available datasets. The RF performance before and after landscaping over the datasets was observed to have an error of 0:1084 0:03 and 0:1087 0:03, respectively, while testing times of the RF before landscaping was 2:5508 0:08 sec. and 0:9066 0:19 sec. after landscaping with 32 -- 76% reduction in execution time. These results strongly substantiates our claim of achieving deployment speedup without compromising the decision quality with landscaping of RF through controlled deforestation.
DASA: Domain Adaptation in Stacked Autoencoders using Systematic Dropout
Abhijit Guha Roy and Debdoot Sheet
[PDF] | [Abstract]
Domain adaptation deals with adapting behaviour of machine learning based systems trained using samples in source domain to their deployment in target domain where the statistics of samples in both domains are dissimilar. The task of directly training or adapting a learner in the target domain is challenged by lack of abundant labeled samples. In this paper we propose a technique for domain adaptation in stacked autoencoder (SAE) based deep neural networks (DNN) performed in two stages: (i) unsupervised weight adaptation using systematic dropouts in mini-batch training, (ii) supervised fine-tuning with limited number of labeled samples in target domain. We experimentally evaluate performance in the problem of retinal vessel segmentation where the SAE-DNN is trained using large number of labeled samples in the source domain (DRIVE dataset) and adapted using less number of labeled samples in target domain (STARE dataset).
Faster K-Means Cluster Estimation
Siddhesh Khandelwal and Amit Awekar
[PDF] | [Abstract]
K-means is a widely used iterative clustering algorithm. There has been considerable work on improving k-means in terms of mean squared error (MSE) and speed, both. However, most of the k-means variants tend to compute distance of each data point to each cluster centroid for every iteration. We propose two heuristics to overcome this bottleneck and speed up k-means. Our first heuristic predicts the candidate clusters for each data point by looking at nearby clusters after first iteration of k-means. Our second heuristic further reduces this candidate cluster list aggressively. We augment well known variants of k-means with our heuristics to demonstrate effectiveness of our heuristics. For various synthetic and real-world datasets, our heuristics achieve speed-up of up-to 10 times without significant increase in MSE.
Incremental Shared Nearest Neighbor Density-Based Clustering Algorithms for Dynamic Datasets
Panthadeep Bhattacharjee and Amit Awekar
[PDF] | [Abstract]
Dynamic datasets undergo frequent changes where small number of data points are added and deleted. Such dynamic datasets are frequently encountered in many real world applications such as search engines and recommender systems. Incremental data mining algorithms process these updates to datasets efficiently to avoid redundant computation. Shared nearest neighbor density based clustering (SNN-DBSCAN) is a widely used clustering algorithm, mainly for its robustness. Existing incremental extension to SNN-DBSCAN cannot handle deletions to dataset and handles insertions only point by point. We overcome both these bottlenecks by efficiently identifying affected parts of clusters while processing updates to dataset in batch mode. We present three different incremental algorithms with varying efficiency at elimination of redundant computation. We show effectiveness of our algorithms by performing experiments on large synthetic as well as real world datasets. Our algorithms are up to 2 Orders of Magnitude faster than non-incremental algorithm and up-to 50 times faster than existing incremental algorithm while guaranteeing exact same output.
Variance Change Point Detection
Santosh Srivastava, Ritwik Chaudhuri, Ankur Narang, Sahil Mathur and Gyana Parija
[PDF] | [Abstract]
We present a novel oine variance change point detection algorithm based on dynamic mode decomposition (DMD). The developed algorithm dynamic mode decomposition based variance change point detection (DVCPD) is completely data driven, doesn't require any knowledge of underlying governing equation or any probabilistic model assumption for time series. It uses a local adaptable window and sequential hypothesis testing to iteratively detect variance change points. A local adaptable window is defined whose location and size is automatically governed by acceptance and rejection of hypothesis. The DVCPD algorithm has been demonstrated to work robustly on different time series data sets and detects multiple variance change points accurately.
Role of expectation and memory constraints in Hindi comprehension: An eyetracking corpus analysis
Arpit Agarwal, Sumeet Agarwal and Samar Husain
[PDF] | [Abstract]
We have used the Potsdam-Allahabad Hindi Eye-tracking Corpus that contains eye-movement data from 30 participants on 153 sentences to investigate the effect of surprisal and retrieval on reading time, while controlling for word-level predictors (word complexity, syllable length, unigram and bigram frequency) and integration and storage costs. We find that surprisal has a significant coefficient in only in First Pass Reading Time while storage cost shows up only in Total Fixation Time, thus indicating that the two measures of predictability capture different cognitive variables.
Dissimilarity Based Contrastive Divergence for Anomaly Detection
Sahil Manocha, Adepu Ravisankar and Vineeth N Balasubramanian
[PDF] | [Abstract]
This paper describes training of a Restricted Boltzmann Machine using dissimilarity-based contrastive divergence to obtain an anomaly detector. We go over the merits of the method over other approaches and describe the methods usefulness to obtain a generative model.
Iterative Semi Supervised Data Denoising with Procrustes Analysis
Abhishek, Priyanshu Goyal and Shekhar Verma
[PDF] | [Abstract]
A wireless sensor network localization with only few location aware nodes is difficult due to noisy medium and other environmental effects. This situation is similar to semi supervised learning where in the given data set, a small portion is labeled while majority remains unlabeled and the aim is to find unknown labels based on available information. This is achieved by exploiting the underlying intrinsic geometrical manifold structure between data including unlabeled data. In this paper we propose iterative graph Laplacian least square regression with Procrustes analysis which ensures error minimization in localization through iterative manifold structure learning. The results indicate that high localization accuracy is achieved as compared to other similar technique.