2nd Indian Workshop on Machine Learning

July 1-3, 2016, IIT Kanpur

Please browse down for the details of all talks.

**Speakers:**

- Rahul Agrawal, Microsoft R&D India
- Indrajit Bhattacharya, IBM Research India
- Soma Biswas, Indian Institute of Science
- Amit Chandak, Amazon
- Anirban Dasgupta, IIT Gandhinagar
- Ambedkar Dukkipati, Indian Institute of Science
- Aditya Gopalan, Indian Institute of Science
- Prateek Jain, Microsoft Research India
- Shivaram Kalyanakrishnan, IIT Bombay
- Animesh Mukherjee, IIT Kharagpur
- Vaibhav Rajan, Xerox Research Centre India
- Ganesh Ramakrishnan, IIT Bombay
- Vikas Raykar, IBM Research India
- Shirish Shevade, Indian Institute of Science
- Parag Singla, IIT Delhi
- Partha Talukdar, Indian Institute of Science

Rahul Agrawal

Microsoft R&D India

**Title: Deep Learning + NLP: Introduction and insights from Advertising data**

**Abstract:** In recent times, Deep Learning has made significant advances in modeling natural
language understanding. In this talk, we will introduce natural language modeling
using deep learning. We will start with a discussion of various word embeddings,
language models, and applications of deep learning to conventional NLP tasks such as
NER. We will also look at experimental insights from applying these models to real-world
data such as search queries, web pages, ad descriptions, and product catalogs.
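As a minimal illustration of the word-embedding idea (toy, hand-set vectors purely for illustration, not the talk's learned models): related words get nearby vectors, and closeness is typically measured with cosine similarity.

```python
import numpy as np

# Toy 4-dimensional word embeddings (hypothetical values for illustration;
# real embeddings such as word2vec or GloVe are learned from large corpora).
embeddings = {
    "query":   np.array([0.9, 0.1, 0.0, 0.2]),
    "search":  np.array([0.8, 0.2, 0.1, 0.3]),
    "catalog": np.array([0.1, 0.9, 0.3, 0.0]),
}

def cosine(u, v):
    """Cosine similarity: the standard measure of embedding closeness."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically related words should score higher than unrelated ones.
print(cosine(embeddings["query"], embeddings["search"]))
print(cosine(embeddings["query"], embeddings["catalog"]))
```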


Indrajit Bhattacharya

IBM Research India

**Title: Deep Nonparametric Admixture Models**

**Abstract:** Deep generative models provide a hierarchical probabilistic
representation of data points, for example documents as mixtures of layered
entities, with topics, which are distributions over words at the first
layer, and then authors, which are distributions over topics, at the next
layer. Such a representation is called an admixture model (or a mixed
membership model) since multiple topics and multiple authors are used to
represent a single document. We consider deep variants of such models which
are allowed to grow to arbitrary (but finite) number of levels. Since
over-fitting is always a concern for deep models, we investigate
nonparametric models, where the number of parameters at each layer is not
fixed a priori, but is allowed to grow with the data size. Dirichlet Processes
(DPs) are the basic building blocks of nonparametric modeling. While
Dirichlet Process mixture models enable infinite mixture modeling, infinite
admixture models result from Hierarchical Dirichlet Processes (HDPs), where
a draw from one DP becomes the base distribution for a second DP. In this
work, we investigate how HDPs can be coupled together to create deep
nonparametric admixture models. We show that this can be done by nesting
Hierarchical Dirichlet Processes, where each layer has its own HDP, and
each HDP has as its base distribution the HDP of the previous layer. We
show that such nested HDPs arise naturally as the infinite limit of
existing multi-layer parametric models such as the Author-Topic Model.
Inference is naturally a concern for such deep nonparametric models. We
study extensions of two different Gibbs sampling algorithms that exist for
the HDP - the direct sampling scheme, which directly samples the entities
at each layer for each word, and the Chinese Restaurant Franchise (CRF)
scheme which samples entities via pointers (called tables). While both
schemes are known to mix well for the HDP, we show that the complexity of
the CRF scheme grows exponentially with layers, and is not practical to use
beyond a single layer. We show using experiments on real text corpora that
the direct sampling scheme, whose complexity grows linearly with layers, is
a more practical alternative.

(Joint work with Lavanya Tekumalla (IISc) and Priyanka Agrawal (IBM Research))

**Recommended Prior Reading:**

- Dirichlet Process, Yee Whye Teh, Encyclopedia of Machine Learning, 2010, [link to pdf]
- Hierarchical Dirichlet Processes, Teh, Jordan, Beal, Blei, JASA, 2006, [link to pdf]
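For readers new to the nonparametric building block: a small, self-contained sketch of the Chinese Restaurant Process view of a Dirichlet Process (not the nested HDP of the talk), in which the number of clusters ("tables") is not fixed a priori but grows with the data.

```python
import random

def crp(n, alpha, seed=0):
    """Draw table assignments for n customers from a Chinese Restaurant
    Process with concentration alpha -- the basic construction behind
    Dirichlet Process mixture models."""
    rng = random.Random(seed)
    tables = []       # tables[k] = number of customers at table k
    assignments = []
    for i in range(n):
        # Customer i joins table k with probability proportional to its
        # occupancy, or opens a new table with probability prop. to alpha.
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(0)
        tables[k] += 1
        assignments.append(k)
    return assignments, tables

assignments, tables = crp(100, alpha=1.0)
print(len(tables))  # the number of occupied tables grows with n
```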

Soma Biswas

Indian Institute of Science

**Title: Machine Learning for Face Recognition**

**Abstract:** Face recognition is a very important area of research in
computer vision, with many applications in surveillance. Unconstrained
pose, illumination, expression, low resolution, etc. make this task
extremely challenging. In this talk, we will discuss two successful
approaches for this task, namely distance metric learning and dictionary
learning. We will discuss some of the classical approaches in these two
areas and also the work that we are currently doing in our lab.

**Bio:** Soma Biswas is an assistant professor in the Electrical Engineering
Department, Indian Institute of Science, Bangalore, India. She received
the MTech degree from the Indian Institute of Technology, Kanpur in 2004 and
the PhD degree in Electrical and Computer Engineering from the University of
Maryland, College Park in 2009. Her research interests include image
processing, computer vision, and pattern recognition.
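The essence of distance metric learning can be sketched in a few lines: compare face descriptors with a learned Mahalanobis distance d_M(x, y) = sqrt((x-y)^T M (x-y)) instead of the Euclidean distance. Here M is a hypothetical hand-set matrix for illustration; in practice it is learned from same-person / different-person pairs.

```python
import numpy as np

# A learned positive semi-definite matrix M weights discriminative
# feature dimensions more heavily (values made up for illustration).
M = np.diag([2.0, 0.5, 1.0])

def mahalanobis(x, y, M):
    d = x - y
    return float(np.sqrt(d @ M @ d))

x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])
print(mahalanobis(x, y, M))          # distance under the learned metric
print(float(np.linalg.norm(x - y)))  # plain Euclidean, for comparison
```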


Amit Chandak

Amazon

**Title: Size Recommendations for Apparels & Shoes**

**Abstract:** Online shopping is becoming the preferred mode of purchasing goods, since it offers
wide selection, competitive prices and easy returns. Apparel and shoes are important
product categories with significant revenue impact, yet many customers avoid purchasing
them online due to fit issues. Sizing variations across brands and locales confuse
customers when picking the right size for a desired product. When not sure about
their correct size, customers tend to buy multiple sizes of a product and return the
rest. This is a recurring pain point for Amazon customers that causes high return
rates and a bad customer experience.

We address this pain point by predicting the right size of a product for a given
user. We estimate the true sizes of customers and products by minimizing an ordinal loss
function defined on their purchase history, while also leveraging a wide range of
signals such as the collective user purchase/return history, sizing information in
the seller catalog, and specified user and product attributes (e.g., feet size, waist
length, product material). Our proposed algorithm is scalable and can be easily
parallelized. We compare our results with an existing in-house algorithm
that generates recommendations by clustering products and customers based on
co-purchases.
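The abstract does not give the exact model, but the core idea of fitting latent sizes under an ordinal loss can be sketched as follows. All records, the margin, and the hinge-style loss are hypothetical choices for illustration, not the talk's algorithm.

```python
import numpy as np

# Each customer c and product p gets a latent "true size"; a purchase
# record says whether the product fit, ran large, or ran small, and we
# fit the latent sizes by gradient steps on a simple ordinal hinge loss.
# outcome: -1 = product too small, 0 = fit, +1 = too large
records = [(0, 0, 0), (0, 1, 1), (1, 0, -1), (1, 2, 0), (0, 2, 1)]

cust = np.zeros(2)   # latent customer sizes
prod = np.zeros(3)   # latent product sizes
lr, margin = 0.1, 0.5

for _ in range(500):
    for c, p, outcome in records:
        gap = prod[p] - cust[c]      # positive => product bigger than customer
        if outcome == 0:
            grad = gap                       # fit: pull the sizes together
        elif outcome == 1:
            grad = min(0.0, gap - margin)    # too large: want gap >= margin
        else:
            grad = max(0.0, gap + margin)    # too small: want gap <= -margin
        prod[p] -= lr * grad
        cust[c] += lr * grad

# After fitting, products reported "too large" sit above the customer's size.
print(prod - cust[0])
```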


Anirban Dasgupta

Indian Institute of Technology Gandhinagar

**Title: Estimating Network Parameters using Sampling**

**Abstract:** Large networks are ubiquitous, and yet estimating the key parameters of
large networks is not an easy problem, due to various constraints: either because of
their size, or because they are not available in their entirety and can only be
accessed through some restricted interface (e.g., a rate-limited API). In this talk,
we outline sampling-based algorithms to estimate a couple of key network properties,
e.g., the average degree and the size of a sub-population. We will also discuss
strategies to derive bounds on the number of queries made to the API, and how
different sampling strategies fare on this metric.

Joint work(s) with Ravi Kumar, Tamas Sarlos, Flavio Chierichetti and Silvio Lattanzi.

**Recommended Prior Reading:**

- The talk should be accessible to anyone with a basic graduate level knowledge in probability and graph theory.
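A minimal illustration of sampling-based estimation (not the talk's specific estimators): estimate the average degree of a graph by querying the degrees of a small uniform sample of nodes. The degree distribution below is made up; uniform sampling is unbiased, but heavy-tailed degrees make its variance large, which motivates the cleverer strategies the talk covers.

```python
import random

random.seed(1)
n = 10_000
# Hypothetical graph, represented only by its degree list (heavy-tailed).
degrees = [random.choice([1, 2, 3, 5, 50]) for _ in range(n)]
true_avg = sum(degrees) / n

# 500 degree queries to a uniform sample of nodes.
sample = random.sample(range(n), 500)
estimate = sum(degrees[v] for v in sample) / len(sample)
print(true_avg, estimate)
```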

Ambedkar Dukkipati

Indian Institute of Science

**Title: Spectral methods for clustering and graph partitioning problems**

**Abstract:** Spectral methods for clustering and graph partitioning have been
very successful for two reasons: (i) good practical solutions and easy
implementations, and (ii) theoretical guarantees. In this talk, I will introduce
these methods by giving details of the underlying spectral graph theory
results. I will also talk about the theoretical guarantees that one can achieve
by using stochastic block models. Towards the end of the talk, I will
provide a brief overview of our recent results on hypergraph partitioning
problems and their applications to computer vision and machine learning.

**Recommended Prior Reading:**

- Only basic linear algebra is assumed.
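Indeed, basic linear algebra is all that vanilla spectral bi-partitioning needs. A sketch on a toy graph with two planted blocks: the sign pattern of the Fiedler vector (eigenvector of the second-smallest eigenvalue of the unnormalized Laplacian L = D - A) recovers the two blocks.

```python
import numpy as np

# Two triangles (nodes 0-2 and 3-5) joined by a single cross edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2),       # block 1
             (3, 4), (3, 5), (4, 5),       # block 2
             (2, 3)]:                       # bridge
    A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A             # graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
fiedler = eigvecs[:, 1]                    # second-smallest eigenvector
labels = (fiedler > 0).astype(int)
print(labels)  # nodes 0-2 get one label, nodes 3-5 the other
```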

Aditya Gopalan

Indian Institute of Science

**Title: Online Learning with Structure**

**Abstract:** Online learning or sequential decision-making studies how an agent can
learn to perform a task (optimize a metric of interest) through repeated
interaction and observation (trial and error). An increasing number of
modern-day automated systems are tasked with learning to make sequential
decisions, e.g., Internet recommendation engines, dynamic pricing
algorithms, automated trading systems, etc., by observing a variety of
feedback signals, e.g., mouse clicks, demand responses, ratings, etc. A
widely employed model for decision-making under uncertainty is the
Multi-Armed Bandit, nearly eight decades old. Here, a decision maker
repeatedly faces a choice of playing one of several “arms” or actions,
each with an a priori unknown payoff, and must learn to play arms for
maximum net reward. Bandit problems with large action spaces are
challenging to solve in general, but the presence of additional structure
often makes them significantly easier to tackle. Starting from the basic
multi-armed bandit, we will explore several variations of online learning
problems with rich, complex structures, and highlight efficient
algorithmic paradigms to solve them. We will also attempt to cover recent
results from our investigations into the robustness structure of linearly
parameterized bandits and an online low-rank matrix completion bandit
problem.
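The basic multi-armed bandit mentioned above can be sketched with the classic UCB1 strategy: play the arm with the highest empirical mean plus an exploration bonus. The arm means are made up for illustration; this is the textbook algorithm, not the talk's structured variants.

```python
import math
import random

random.seed(0)
means = [0.3, 0.5, 0.7]          # true Bernoulli means, unknown to the learner
counts = [0] * 3                  # plays per arm
sums = [0.0] * 3                  # total reward per arm

for t in range(1, 5001):
    if t <= 3:
        arm = t - 1               # play each arm once to initialise
    else:
        # UCB1 index: empirical mean + sqrt(2 log t / n_a).
        arm = max(range(3), key=lambda a: sums[a] / counts[a]
                  + math.sqrt(2 * math.log(t) / counts[a]))
    reward = 1.0 if random.random() < means[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward

print(counts)  # the best arm (index 2) is played most often
```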


Prateek Jain

Microsoft Research India

**Title: TBD**

**Abstract:** TBD


Shivaram Kalyanakrishnan

Indian Institute of Technology Bombay

**Title: Deep Reinforcement Learning**

**Abstract:** If an agent can sense, think, and act in an unknown environment, how
MUST it act? Reinforcement Learning (RL) is a general paradigm under
which an agent, by trial and error, can discover actions that result
in long-term gain. Integrating ideas from various disciplines, RL has
matured as a field of study in the last few decades. Its successes
range from training programs to decrease the waiting time for
elevators in a building; to maximising profits from stock-trading; and
to numerous tasks in robotic control and decision-making.

In spite of these successes, RL is yet to be counted upon as a
practically-viable technology. To a large extent, the gap between the
promise of RL and its practical effectiveness has arisen from the lack
of suitable representations (and representation-discovery mechanisms)
in domains of interest. Interestingly, it is precisely this gap that
the emerging paradigm of deep learning is beginning to fill. Deep
learning is a data-driven technique that is capable of learning
complex non-linear input-output relationships, which are represented
using neural networks with a large number of layers (hence
"deep"). The combination of RL with deep learning has registered
remarkable successes in recent months. Notably, it has resulted in
AI-based agents that exceed human-level performance on a suite of
ATARI console games, and also on the more challenging game of Go.

I will begin this talk with an overview of RL and its key algorithmic
elements. Specifically, I will outline the role of representations in
determining the success of RL. I will follow with a brief survey of
deep learning, and proceed to describe the architecture of the agents
that have recently succeeded at ATARI and Go. I will end the talk with
thoughts about this new beginning for "deep reinforcement learning".
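The core algorithmic element of RL can be sketched as tabular Q-learning on a tiny chain world; deep RL agents such as those for ATARI replace the table with a neural network. The environment below (states 0-4, reward 1 on reaching state 4) is a made-up toy.

```python
import random

random.seed(0)
Q = [[0.0, 0.0] for _ in range(5)]   # Q[s][a]; action 0 = left, 1 = right
alpha, gamma, eps = 0.5, 0.9, 0.1

for _ in range(500):                  # episodes, each starting at state 0
    s = 0
    while s != 4:
        # Epsilon-greedy action selection.
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s2 = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s2 == 4 else 0.0
        # The Q-learning update: bootstrap from the best next-state value.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([max(q) for q in Q])  # state values increase toward the goal
```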


Animesh Mukherjee

Indian Institute of Technology Kharagpur

**Title: In-depth analysis of large-scale citation networks**

**Abstract:** In this talk I shall present an overview of our four-year-long research
initiative on citation networks. Following a very brief introduction to
temporal networks, I shall talk about our investigations in the context of
the computer science domain, further sub-divided into its constituent fields
(e.g., AI, ALGO, NETW, ML, IR, etc.). Among other things, I shall
mainly focus on four issues: (a) the impact of ancient papers; (b) the
evolutionary landscape of the CS fields; (c) long-time characteristics of
scientific citation profiles and accurate prediction of future citation
counts; and (d) the formation of co-authorship circles (very briefly).


Vaibhav Rajan

Xerox Research Centre India

**Title: Dependency Modeling with Copulas**

**Abstract:** I'll begin with a brief introduction to copulas and their use in modeling
dependencies in multivariate distributions. I'll then describe parameter estimation
methods for Gaussian and Gaussian mixture copulas for continuous- and
ordinal-valued data. I'll conclude with some of our recent results in
dependency-seeking clustering using Gaussian mixture copulas.

**Recommended Prior Reading:**

- Probability Distributions, Clustering, Gaussian Mixture Models
- [Optional]: Expectation Maximization, MCMC
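The key property of a copula — separating the dependence structure from the marginal distributions — can be sketched by sampling from a bivariate Gaussian copula: correlate two uniforms via a latent Gaussian, then push each margin through any desired inverse CDF (exponential margins here, chosen arbitrarily for illustration).

```python
import math
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
cov = [[1.0, rho], [rho, 1.0]]
z = rng.multivariate_normal([0.0, 0.0], cov, size=20_000)

# Standard normal CDF, applied elementwise, gives uniforms that inherit
# the Gaussian dependence.
phi = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
u = phi(z)
x = -np.log1p(-u)   # Exponential(1) margins via the inverse CDF

# The dependence survives the change of margins.
print(np.corrcoef(x[:, 0], x[:, 1])[0, 1])
```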

Ganesh Ramakrishnan

Indian Institute of Technology Bombay

**Title: Optimal Subset Selection over DAGs with applications in Machine Learning**

**Abstract:** In this talk, we present some subset selection problems in data
mining and machine learning over Directed Acyclic Graphs (DAGs), and the
corresponding optimization formulations and algorithms. We begin with the
problem of supervised subset selection over lattices of features and
present the framework of Hierarchical Kernel Learning (HKL) and illustrate
its utility in the domain of feature/rule learning. HKL involves Multiple
Kernel Learning over a set of given base kernels assumed to be embedded on
a directed acyclic graph. We present a two-fold generalization of HKL: the
first is employing a generic l1/lp block-norm regularizer (p in [1, 2]) that
alleviates a key limitation of the HKL formulation. The second is a
generalization to the case of multi-class, multi-label and more generally,
multi-task applications. We present a highly specialized partial dual of
the proposed generalized HKL formulation and an efficient mirror descent
based active set algorithm for solving it. We present further
generalizations of this approach in settings involving (a) feature spaces
ranging from conjunctions of propositions to features in first order logic
and (b) sequentially structured output spaces.

Next, we look at some subset selection problems over DAGs in data mining settings. More specifically, we study the problem of summarizing DAG-structured hierarchies of nodes over a given set of instances. Example applications include:

- (a) automatically generating Wikipedia disambiguation pages for a set of articles; in this setting, the nodes are Wikipedia categories and the instances are specific Wikipedia entities;
- (b) generating candidate multi-labels for preparing machine learning datasets (e.g., for text classification, functional genomics, and image classification); in this setting, the nodes are labels and the instances are documents;
- (c) building compact lexicons of patterns for cross-domain machine translation; in this setting, the nodes are patterns and the instances are sentences in the corpus.

Existing approaches to this problem either focus on clustering the set of instances using the node hierarchy as features, or approximate the node subset selection problem as a problem of selecting good individual nodes. We directly pose the problem as submodular optimization on a hierarchy of nodes by designing feature functions on the instances and node subsets. Desirable properties of the chosen subset of nodes include instance coverage, specificity, node diversity, and node homogeneity, each of which, we show, is naturally modeled by a submodular function. Other information, provided for example by unsupervised approaches such as LDA and its variants, can also be utilized by defining a submodular function that expresses coherence between the chosen nodes and this information. We use a large-margin framework to learn convex mixtures over the set of submodular components. We present empirical evaluations of our method on (a) the problem of automatically generating Wikipedia disambiguation pages, using human-generated clusterings as ground truth, and (b) the problem of determining the right subset of patterns for curating (bilingual) dictionaries that would improve performance in tasks such as statistical machine translation.
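The workhorse behind this kind of node-subset selection is greedy maximisation of a monotone submodular function, which carries a (1 - 1/e) approximation guarantee. A sketch using plain instance coverage, one of the components listed above, on a made-up node-to-instances map:

```python
# Hypothetical map from hierarchy nodes to the instances they cover.
instances_covered = {
    "node_a": {1, 2, 3},
    "node_b": {3, 4},
    "node_c": {4, 5, 6},
    "node_d": {1, 6},
}

def coverage(nodes):
    """Number of distinct instances covered -- a monotone submodular function."""
    covered = set()
    for n in nodes:
        covered |= instances_covered[n]
    return len(covered)

chosen = []
for _ in range(2):  # budget of 2 nodes
    # Pick the node with the largest marginal coverage gain.
    gains = {n: coverage(chosen + [n]) - coverage(chosen)
             for n in instances_covered if n not in chosen}
    chosen.append(max(gains, key=gains.get))

print(chosen, coverage(chosen))
```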


Vikas Raykar

IBM Research India

**Title: Evolving predictive models**

**Abstract:** A conventional textbook prescription for building good predictive models is
to split the data into three parts: a training set (for model fitting), a
validation set (for model selection), and a test set (for final model
assessment). However, in practice, searching for the best predictive model is
often an iterative and continuous process. Predictive models can
potentially evolve over time as developers improve their performance, either
by acquiring new data or by improving the existing model. In this talk we will
discuss some of the challenges in such scenarios and propose a few solutions
to safeguard against the bias due to the repeated use of the existing validation or
test set.

(Joint work with Amrita Saha, IBM Research)

**References**:


- Data Split Strategies for Evolving Predictive Models. Vikas C. Raykar and Amrita Saha, ECML-PKDD 2015.
- The Reusable Holdout: Preserving Validity in Adaptive Data Analysis. Cynthia Dwork et al., Science 2015.
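A toy demonstration of the bias the talk addresses (not the talk's proposed safeguards): repeatedly picking the "best" of many random predictors on one fixed validation set makes the winner's validation score optimistic relative to a fresh test set. All sizes and the coin-flip setup are made up for illustration.

```python
import random

random.seed(0)
n_val, n_test, n_models = 200, 200, 50

# Labels are coin flips, so every model's true accuracy is exactly 0.5.
val_labels = [random.randint(0, 1) for _ in range(n_val)]
test_labels = [random.randint(0, 1) for _ in range(n_test)]

def accuracy(model_seed, labels):
    """A 'model' is just a seeded stream of random predictions."""
    rng = random.Random(model_seed)
    return sum(rng.randint(0, 1) == y for y in labels) / len(labels)

# Model selection by repeated use of the same validation set.
best = max(range(n_models), key=lambda m: accuracy(m, val_labels))
print(accuracy(best, val_labels))   # optimistic, inflated by selection
print(accuracy(best, test_labels))  # honest estimate, close to 0.5
```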

Shirish Shevade

Indian Institute of Science

**Title: Optimization Algorithms for Large-Scale Machine Learning**

**Abstract:** Optimization techniques have played an important role in
machine learning, especially after the inception of support vector machines
(SVMs). The classical convex quadratic programming problem of the SVM and its
variants continues to attract both the optimization and machine learning
communities. In this talk, we review some efficient optimization algorithms
designed to solve such quadratic programming problems. These algorithms
are simple, easy to implement, and do not require any sophisticated
mathematical libraries. We will also discuss the issues involved
when the problems are large and present some solution approaches
for such problems.
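The style of simple, library-free solver the abstract alludes to can be sketched as dual coordinate descent for a linear SVM: sweep over the dual variables, each with a closed-form box-constrained update. The four training points are made up; this is a standard textbook scheme, not necessarily the talk's algorithms.

```python
import numpy as np

# Toy linearly separable data: two positives, two negatives.
X = np.array([[2.0, 2.0], [1.0, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0

alpha = np.zeros(len(y))
w = np.zeros(X.shape[1])
Qii = (X * X).sum(axis=1)          # diagonal of the Gram matrix

for _ in range(100):               # sweeps over the data
    for i in range(len(y)):
        G = y[i] * (w @ X[i]) - 1.0            # gradient of the dual objective
        new = min(max(alpha[i] - G / Qii[i], 0.0), C)   # box projection
        w += (new - alpha[i]) * y[i] * X[i]    # keep w = sum_i alpha_i y_i x_i
        alpha[i] = new

print(np.sign(X @ w))  # training points classified on the correct side
```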


Parag Singla

Indian Institute of Technology Delhi

**Title: Exploiting Contextual Symmetries in Statistical (Relational) Models**

**Abstract:** Several domains of interest, including those in social network analysis, biology, vision, NLP and IE, need to represent the underlying relational structure as well as model uncertainty. Statistical relational models such as Markov logic achieve this by combining the power of relational representations (e.g., first-order logic) with statistical models (e.g., Markov networks). To perform reasoning efficiently in these models, lifted inference exploits the underlying symmetry by grouping together objects (or states) that behave similarly to each other and hence have the same associated probabilities.

In this talk, starting with background on Markov logic, we will look at the novel idea of lifting using contextual symmetries, i.e., symmetries available only under specific assignments to a subset of variables. We will formally define contextual symmetries and then describe their use in lifting a general-purpose MCMC sampler for graphical models. We will present our evaluation on two representative domains, followed by directions for future work. The work presented here has been accepted for publication at IJCAI 2016.


Partha Talukdar

Indian Institute of Science

**Title: Never-Ending Learning**

**Abstract:** Whereas people learn many different types of knowledge from diverse
experiences over many years, most current machine learning systems acquire
just a single function or data model from just a single data set. In this
talk, I shall propose a never-ending learning paradigm for machine
learning, to better reflect the more ambitious and encompassing type of
learning performed by humans. As a case study, we describe the Never-Ending
Language Learner (NELL), which achieves some of the desired properties of a
never-ending learner, and we discuss lessons learned. NELL has been
learning to read the web 24 hours/day since January 2010, and so far has
acquired a knowledge base with over 80 million confidence-weighted beliefs
(e.g., servedWith(tea, biscuits)). NELL has also learned millions of
features and parameters that enable it to read these beliefs from the web.
Additionally, it has learned to reason over these beliefs to infer new
beliefs, and is able to extend its ontology by synthesizing new relational
predicates. NELL can be tracked online at http://rtw.ml.cmu.edu, and
followed on Twitter at @CMUNELL.

**Recommended Prior Reading:**

- Never Ending Learning [link to pdf]