# Project Examples for Deep Structured Learning (Fall 2019)

We suggest below some project ideas. Feel free to use this as inspiration for your project. Talk to us for more details.

# Emergent Communication

**Problem:** Agents need to communicate to solve problems that require collaboration. The goal of this project is to apply techniques (for example, sparsemax or reinforcement learning) to induce communication among agents.

**Data:** See references below.

**Evaluation:** See references below.

**References:**
- Serhii Havrylov and Ivan Titov. Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols. NeurIPS 2017.
- Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z. Leibo, Karl Tuyls, and Stephen Clark. Emergent Communication through Negotiation. ICLR 2018.
- Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. Natural Language Does Not Emerge Naturally in Multi-Agent Dialog. EMNLP 2017.
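To make messages discrete yet still trainable end-to-end, Havrylov & Titov use the straight-through Gumbel-softmax trick. A minimal numpy sketch of sampling one message symbol this way (the vocabulary size and logits are illustrative, not from any of the papers):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(logits, tau=1.0):
    """Sample a relaxed one-hot message symbol from categorical logits.

    Adds Gumbel(0, 1) noise and applies a temperature-scaled softmax.
    As tau -> 0 the sample approaches a discrete one-hot vector.
    """
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    scores = (logits + gumbel) / tau
    scores -= scores.max()            # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# A sender with a 5-symbol vocabulary emits one message symbol.
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.3])
soft_symbol = gumbel_softmax_sample(logits, tau=0.5)
# Straight-through: the forward pass uses the discrete symbol, while
# gradients would flow through the relaxed soft_symbol.
hard_symbol = np.eye(5)[soft_symbol.argmax()]
print(soft_symbol.round(3), hard_symbol)
```

The receiver consumes `hard_symbol`; during backpropagation the gradient is taken with respect to the soft sample.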

# Explainability of Neural Networks

**Problem:** Neural networks are black boxes and not amenable to interpretation. The goal of this project is to develop and study methods that make a neural network model's predictions explainable (for example, using sparsemax attention).

**Method:** For example, sparse attention, gradient-based measures of feature importance, or LIME (see below).

**Data:** Stanford Sentiment Treebank, IMDB Large Movie Reviews Corpus, etc. See references below.

**References:**
- Marco T. Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?" Explaining the Predictions of Any Classifier. KDD 2016.
- Sarthak Jain and Byron C. Wallace. Attention is not Explanation. NAACL 2019.
- Zachary C. Lipton. The Mythos of Model Interpretability. ICML 2016 Workshop on Human Interpretability in Machine Learning.
- Sofia Serrano and Noah A. Smith. Is Attention Interpretable? ACL 2019.
- Sarah Wiegreffe and Yuval Pinter. Attention is not not Explanation. EMNLP 2019.
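The core of LIME (Ribeiro et al.) can be sketched in a few lines: perturb the instance, weight perturbations by proximity, and fit a weighted linear surrogate whose coefficients serve as the explanation. In this sketch the "black box" is a hypothetical fixed logistic model over four binary word features; in practice it would be any opaque classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black-box classifier (stands in for any opaque model).
w_true = np.array([3.0, -2.0, 0.1, 0.0])
def black_box(x):
    return 1.0 / (1.0 + np.exp(-(x @ w_true)))

x0 = np.ones(4)                     # instance to explain (all words present)

# 1) Perturb: randomly mask out words around x0.
Z = rng.integers(0, 2, size=(500, 4)).astype(float)
# 2) Proximity kernel: perturbations close to x0 get higher weight.
dist = (Z != x0).sum(axis=1)
weights = np.exp(-dist / 2.0)
# 3) Fit a weighted least-squares linear surrogate to the black box.
y = black_box(Z)
A = np.hstack([Z, np.ones((len(Z), 1))])          # add an intercept column
W = np.diag(weights)
coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
explanation = coef[:-1]                           # per-word importance
print(explanation.round(2))
```

The surrogate recovers that the first word pushes the prediction up and the second pushes it down, mirroring the hidden weights.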

# Generative Adversarial Networks for Discrete Data

**Problem:** Compare different deep generative models' ability to generate discrete data (such as text).

**Methods:** Generative Adversarial Networks.

**Data:** SNLI (just the text), Yelp/Yahoo datasets for unaligned sentiment/topic transfer, other text data.

**Evaluation:** Some of the metrics in [3].

**References:**

# Object detection with weak supervision

**Problem:** Given images with object tags, can we accurately predict object locations by training only with coarse, weak supervision about which objects are present in the data?

**Method:** Latent Potts model; belief propagation.

**Data:** COCO-Stuff, Pascal VOC.

**References:** See https://github.com/kazuto1011/deeplab-pytorch

# Sparse transformers

**Problem:** Transformers and BERT models are extremely large and expensive to train and keep in memory. The goal of this project is to distill or induce sparser and smaller Transformer models without losing accuracy, applying them to machine translation or language modeling.

**Method:** See references below.

**Data:** WMT datasets, WikiText, etc. See references below.

**References:**
- Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. Adaptively Sparse Transformers. EMNLP 2019.
- Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. Adaptive Attention Span in Transformers. ACL 2019.
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating Long Sequences with Sparse Transformers. arXiv 2019.
- Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for Natural Language Understanding. arXiv 2019.

# Contextual Probabilistic Embeddings / Language Modeling

**Problem:** Embedding words as vectors (i.e., point masses) cannot distinguish vaguer from more specific concepts. One solution is to embed each word as a mean vector μ and a covariance Σ. Muzellec & Cuturi propose a nice framework for this, tested on learning non-contextualized embeddings. Can we extend it to contextualized embeddings via language modeling, e.g. a model that reads an entire sentence and predicts a context-dependent pair (μ, Σ) for each word (perhaps left-to-right or masked)? What likelihood should we use? How can we evaluate the learned embeddings downstream?

**Method:** See reference below.
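Muzellec & Cuturi compare Gaussian embeddings with the 2-Wasserstein (Bures) distance, which has a closed form. For diagonal covariances the Bures term reduces to the Euclidean distance between standard deviations, giving this minimal numpy sketch (the example words and variances are illustrative):

```python
import numpy as np

def w2_diag_gaussian(mu1, var1, mu2, var2):
    """2-Wasserstein distance between Gaussians with diagonal covariances.

    For commuting (here diagonal) covariances:
        W2^2 = ||mu1 - mu2||^2 + ||sqrt(var1) - sqrt(var2)||^2
    """
    mean_term = np.sum((mu1 - mu2) ** 2)
    bures_term = np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2)
    return np.sqrt(mean_term + bures_term)

# A vague concept (large variance) vs. a specific one (small variance):
mu_animal, var_animal = np.zeros(2), np.full(2, 4.0)
mu_dog, var_dog = np.zeros(2), np.full(2, 0.25)
d = w2_diag_gaussian(mu_animal, var_animal, mu_dog, var_dog)
print(d)
```

Note the distance is nonzero even when the means coincide: the variance gap alone separates a vague word from a specific one, which is exactly what point embeddings cannot express.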
**References:**

# Constrained Structured Classification with AD3

**Problem:** Use AD3 and dual decomposition techniques to impose logic, budget, or structured constraints in structured problems. Possible tasks include generating diverse outputs, forbidding certain configurations, etc.

**Evaluation:** Task-dependent.
**References:**

# Structured multi-label classification

**Problem:** Multi-label classification is a learning setting where every sample can be assigned zero, one, or more labels.

**Method:** Correlations between labels can be exploited by learning an affinity matrix of label correlations. Inference in a fully connected correlation graph is hard; approximating the graph by a tree makes inference fast (Viterbi can be used).

**Data:** Multi-label datasets.

**Evaluation:** See here.

**References:**
- Sorower. A Literature Survey on Algorithms for Multi-label Learning.
- Thomas Finley and Thorsten Joachims. Training Structural SVMs when Exact Inference is Intractable. ICML 2008.
- PyStruct
- Scikit.ml (very strong methods not based on structured prediction)
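One classical way to approximate the label-correlation graph by a tree is the Chow–Liu construction: estimate pairwise mutual information between labels and take a maximum spanning tree. A small numpy sketch with a hand-rolled Prim's algorithm on a hypothetical 4-label dataset:

```python
import numpy as np

def mutual_info(a, b):
    """Empirical mutual information between two binary label columns."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(Y):
    """Maximum spanning tree (Prim's algorithm) over pairwise label MI."""
    L = Y.shape[1]
    mi = np.array([[mutual_info(Y[:, i], Y[:, j]) for j in range(L)]
                   for i in range(L)])
    in_tree, edges = {0}, []
    while len(in_tree) < L:
        best = max(((i, j) for i in in_tree
                    for j in range(L) if j not in in_tree),
                   key=lambda e: mi[e])
        edges.append(best)
        in_tree.add(best[1])
    return edges

# Toy dataset: label 1 copies label 0; labels 2 and 3 are independent noise.
rng = np.random.default_rng(0)
y0 = rng.integers(0, 2, size=200)
Y = np.stack([y0, y0,
              rng.integers(0, 2, 200), rng.integers(0, 2, 200)], axis=1)
edges = chow_liu_tree(Y)
print(edges)
```

On this toy data the strongly coupled pair (0, 1) is picked first; Viterbi-style inference then runs exactly on the resulting tree.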

# Hierarchical sparsemax attention

**Problem:** Performing neural attention over very long sequences (e.g. for document-level classification, translation, ...).

**Method:** Sparse hierarchical attention with a product of sparsemaxes.

**Data:** Text classification datasets.

**Evaluation:** Accuracy; empirical analysis of where the models attend.

**Notes:** If the top-level sparsemax gives zero probability to some paragraphs, those can be pruned from the computation graph. Can this lead to speedups?
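The pruning idea works because sparsemax (Martins & Astudillo, 2016), unlike softmax, can assign exactly zero probability to low-scoring entries. A minimal numpy sketch of sparsemax applied to illustrative paragraph-level scores:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of the score vector z onto the
    probability simplex (Martins & Astudillo, 2016). Unlike softmax it
    can return exact zeros."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = k * z_sorted > cumsum - 1       # i.e. 1 + k * z_(k) > cumsum_k
    k_max = k[support][-1]                    # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max     # threshold
    return np.maximum(z - tau, 0.0)

scores = np.array([2.0, 1.2, 0.1, -1.0])      # hypothetical paragraph scores
p = sparsemax(scores)
print(p)  # -> [0.9 0.1 0.  0. ]
```

Here the last two paragraphs receive exactly zero weight, so in a hierarchical model their lower-level attention need never be computed.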
**References:**

# Sparse group lasso attention mechanism

**Problem:** For structured data segmented into given "groups" (e.g. fields in a form, regions in an image, sentences in a paragraph), design a "group-sparse" attention mechanism that tends to give zero weight to entire groups deemed not relevant enough.

**Method:** A sparse group-lasso penalty in a generalized structured attention framework [2].

**Notes:** The L1 term is redundant when optimizing over the simplex; regular group lasso will already be sparse!
**References:**

# Sparse link prediction

**Problem:** Predicting links in a large structured graph, for instance: co-authorship prediction, movie recommendation, coreference resolution, or discourse relations between sentences in a document.

**Method:** The simplest approach is independent binary classification: for every node pair (i, j), predict whether there is a link or not. This has two issues: the data is highly imbalanced (most node pairs are not linked), and structure and higher-order correlations are ignored. Develop a method that addresses both: incorporate structural correlations (e.g. with combinatorial inference, constraints, or latent variables) and account for imbalance (ideally via pairwise ranking losses: learn a scorer such that S(i, j) > S(k, l) if there is an edge (i, j) but no edge (k, l)).

**Data:** arXiv macro usage, coreference in quizbowl.

**Notes:** Can graph CNNs (previous idea) be useful here?
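The pairwise ranking idea can be sketched with a logistic loss on score margins. Everything below is a hypothetical toy setup: random node embeddings, a bilinear scorer S(i, j) = e_i^T W e_j, and plain gradient descent on W:

```python
import numpy as np

rng = np.random.default_rng(0)

E = rng.normal(size=(6, 4))    # hypothetical 4-d node embeddings
W = np.zeros((4, 4))           # bilinear scorer parameters

pos = [(0, 1), (2, 3)]         # observed edges
neg = [(0, 4), (2, 5)]         # sampled non-edges

def score(i, j):
    return E[i] @ W @ E[j]

def ranking_loss():
    # logistic loss on margins: log(1 + exp(-(S_pos - S_neg)))
    return sum(np.log1p(np.exp(-(score(i, j) - score(k, l))))
               for (i, j), (k, l) in zip(pos, neg))

lr = 0.1
for _ in range(200):           # gradient descent on the pairwise loss
    grad = np.zeros_like(W)
    for (i, j), (k, l) in zip(pos, neg):
        margin = score(i, j) - score(k, l)
        coeff = -1.0 / (1.0 + np.exp(margin))   # d(loss)/d(margin)
        grad += coeff * (np.outer(E[i], E[j]) - np.outer(E[k], E[l]))
    W -= lr * grad
print(score(0, 1) - score(0, 4))   # margin between edge and non-edge
```

Unlike independent binary classification, this objective only compares linked pairs against unlinked ones, which sidesteps the class-imbalance problem.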
**References:**

# Structured Prediction Energy Networks

**Problem:** Structured output prediction with energy networks: replace discrete structured inference with continuous optimization in a neural net. Applications: multi-label classification; simple structured problems such as sequence tagging or arc-factored parsing.

**Method:** Learn a neural network E(x, y; w) to model the energy of an output configuration y (relaxed to be a continuous variable). Inference becomes min_y E(x, y; w). How far can this relaxation take us? Can it be better/faster than global combinatorial optimization approaches?

**Data:** Sequence tagging, parsing, optimal matching?

**Notes:** When E is a neural network, min_y E(x, y; w) is a non-convex optimization problem (possibly with mild constraints such as y ∈ [0, 1]). Amos et al. have an approach that allows E to be a complicated neural net while remaining convex in y. Is this beneficial? Are some kinds of structured data better suited for SPENs than others? E.g. sequence labelling seems "less structured" than dependency parsing.
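The relaxed inference step min_{y ∈ [0,1]^n} E(x, y; w) can be sketched as projected gradient descent. Here the "learned" energy is a hypothetical convex quadratic standing in for a real SPEN network, just to illustrate the optimization:

```python
import numpy as np

# Hypothetical energy E(y) = 0.5 y^T A y - b^T y. In a real SPEN this
# would be a neural network in (x, y); only the inference loop matters.
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 2.0, 0.5],
              [0.0, 0.5, 2.0]])
b = np.array([3.0, -1.0, 1.0])

def energy(y):
    return 0.5 * y @ A @ y - b @ y

y = np.full(3, 0.5)                          # start at the box center
for _ in range(100):
    grad = A @ y - b                         # d E / d y
    y = np.clip(y - 0.1 * grad, 0.0, 1.0)    # projected gradient step
print(y.round(3), energy(y))
```

For this instance the box constraints are active: the iterate converges to y ≈ (1, 0, 0.5), with the first two coordinates clipped at the boundary, which is exactly the kind of solution a discrete decoder would round from.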
**References:**