Project Examples for Deep Structured Learning (Fall 2020)

We suggest below some project ideas. Feel free to use this as inspiration for your project. Talk to us for more details.

Emergent Communication

Problem: Agents need to communicate to solve problems that require collaboration. The goal of this project is to apply techniques (for example using sparsemax or reinforcement learning) to induce communication among agents.
Data: See references below.
Evaluation: See references below.
References:

Explainability of Neural Networks

Problem: Neural networks are black boxes and not amenable to interpretation. The goal of this project is to develop and study methods that make a neural network model's predictions explainable (for example using sparsemax attention). This project can be either a survey of recent work in this area or an exploration of some practical applications.
Method: For example, sparse attention, rationalizers, gradient-based measures of feature importance, LIME, influence functions, etc.
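As a minimal sketch of one gradient-based measure of feature importance, the snippet below computes gradient-times-input attributions for a toy logistic regression over bag-of-words features. The model, vocabulary, weights, and function names are illustrative assumptions, not part of the project material; a real project would apply the same idea to a neural network through automatic differentiation.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict(weights, bias, x):
    """Probability of the positive class under a toy logistic model (assumed)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def gradient_times_input(weights, bias, x):
    """Attribution of each feature: (d p / d x_i) * x_i."""
    p = predict(weights, bias, x)
    # For a logistic model, d p / d x_i = p * (1 - p) * w_i.
    return [p * (1.0 - p) * w * xi for w, xi in zip(weights, x)]

# Hypothetical vocabulary and weights, chosen only for illustration.
vocab = ["great", "boring", "the"]
weights = [2.0, -1.0, 0.0]
x = [1.0, 1.0, 1.0]  # each word occurs once in the input
scores = gradient_times_input(weights, 0.0, x)
for word, s in zip(vocab, scores):
    print(f"{word}: {s:+.3f}")
```

In this toy case "great" receives a positive attribution, "boring" a negative one, and "the" exactly zero, which is the kind of sanity check worth running before trusting attributions on a larger model.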
Data: BEER dataset, Stanford Sentiment Treebank, IMDB Large Movie Reviews Corpus, etc. See references below.
References:

Generative Adversarial Networks for Discrete Data

Problem: Compare different deep generative models' ability to generate discrete data (such as text).
Methods: Generative Adversarial Networks.
Data: SNLI (just the text), Yelp/Yahoo datasets for unaligned sentiment/topic transfer, other text data.
Evaluation: Some of the metrics in [3].
References:

Sub-quadratic Transformers

Problem: Transformers and BERT models are extremely large and expensive to train and keep in memory. The goal of this project is to make Transformers more efficient in terms of time and memory complexity by reducing the quadratic cost of self-attention or by inducing a sparser and smaller model.
Method: See references below.
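One way to see where the quadratic cost can go is kernelized (linear) attention, one of the families covered by the survey listed under References. The sketch below is a non-causal, single-head toy version, assuming the feature map elu(x) + 1; it replaces softmax attention rather than reproducing it, and all function names are illustrative. It checks that summing over keys once (cost linear in sequence length) gives the same result as building the full n-by-n similarity matrix.

```python
import math

def phi(x):
    # Positive feature map elu(x) + 1 (an assumed choice).
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(Q, K, V):
    """Kernelized attention computed in time linear in the number of keys."""
    n, d, dv = len(K), len(K[0]), len(V[0])
    fk = [[phi(a) for a in row] for row in K]
    # S = sum_j phi(k_j) v_j^T (d x dv) and z = sum_j phi(k_j) (d),
    # both computed once and shared by every query.
    S = [[sum(fk[j][a] * V[j][b] for j in range(n)) for b in range(dv)]
         for a in range(d)]
    z = [sum(fk[j][a] for j in range(n)) for a in range(d)]
    out = []
    for q in Q:
        fq = [phi(a) for a in q]
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out

def quadratic_reference(Q, K, V):
    """Same attention, computed with explicit query-key similarities."""
    out = []
    for q in Q:
        fq = [phi(a) for a in q]
        sims = [sum(x * phi(y) for x, y in zip(fq, k)) for k in K]
        total = sum(sims)
        out.append([sum(s * v[b] for s, v in zip(sims, V)) / total
                    for b in range(len(V[0]))])
    return out

Q = [[0.1, -0.2], [1.0, 0.5]]
K = [[0.3, 0.4], [-1.0, 0.2], [0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
```

The two functions agree up to floating-point error; the difference is that `linear_attention` never materializes one similarity per query-key pair.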
Data: WMT datasets, WikiText, etc. See the original Transformer paper and the references below.
References:
1. Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler. Efficient Transformers: A Survey. arXiv 2020. (Covers e.g. Transformer-XL, Reformer, Linformer, Linear Transformer, Compressive Transformer, etc.)
2. Gonçalo M. Correia, Vlad Niculae, André F.T. Martins. Adaptively Sparse Transformers. EMNLP 2019.
3. Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, Armand Joulin. Adaptive Attention Span in Transformers. ACL 2019.
4. Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever. Generating Long Sequences with Sparse Transformers. arXiv 2019.

Contextual Probabilistic Embeddings / Language Modeling

Problem: Embedding words as vectors (aka point masses) cannot distinguish between more vague or more specific concepts. One solution is to embed words as a mean vector μ and a covariance Σ. Muzellec & Cuturi have a nice framework for this, tested for learning non-contextualized embeddings. Can we extend it to contextualized embeddings via language modelling? E.g. a model that reads an entire sentence and predicts a context-dependent pair (μ, Σ) for each word (perhaps left-to-right or masked). What likelihood to use? How can we evaluate the learned embeddings downstream?
Method: See reference below.
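To make the (μ, Σ) idea concrete, here is a small sketch of the squared 2-Wasserstein distance between two such embeddings, restricted to the Gaussian case with diagonal covariances. That restriction is an assumption made for brevity (the cited framework is more general), and the word vectors are invented for illustration. In this special case the distance is the squared difference of the means plus the squared difference of the standard deviations.

```python
import math

def w2_squared_diag(mu1, var1, mu2, var2):
    """Squared 2-Wasserstein distance between two Gaussians with
    diagonal covariances, given as mean vectors and variance vectors."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # With diagonal covariances, the covariance term reduces to the
    # squared distance between the standard deviations.
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                   for a, b in zip(var1, var2))
    return mean_term + cov_term

# Hypothetical embeddings: a vague concept has a larger variance.
animal = ([0.0, 0.0], [4.0, 4.0])
dog = ([0.0, 0.0], [1.0, 1.0])
print(w2_squared_diag(*animal, *dog))
```

With zero variances the distance falls back to the squared Euclidean distance between point embeddings, which makes it easy to compare against a standard embedding baseline.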
References:
1. Boris Muzellec, Marco Cuturi. Generalizing Point Embeddings using the Wasserstein Space of Elliptical Distributions. arXiv 2018.

Constrained Structured Classification with AD3

Problem: Use AD3 and dual decomposition techniques to impose logic/budget/structured constraints in structured problems. Possible tasks could involve generating diverse output, forbidding certain configurations, etc.
Data:
- "weasel words": detecting hedges/uncertainty in writing in order to improve clarity
- Coreference in quizbowl
Evaluation: task-dependent
References:

Structured multi-label classification

Problem: Multi-label classification is a learning setting where every sample can be assigned zero, one or more labels.
Method: Correlations between labels can be exploited by learning an affinity matrix of label correlations. Inference in a fully-connected correlation graph is hard; approximating the graph by a tree makes inference fast (Viterbi can be used).
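The sketch below illustrates the fast-inference step in the simplest case, assuming the tree is a chain over binary labels. The score layout (`unary`, `pairwise`) and the numbers are hypothetical; in a project they would come from a learned model. The dynamic program returns the jointly highest-scoring label assignment.

```python
def viterbi_chain(unary, pairwise):
    """MAP assignment of binary labels on a chain-structured label graph.

    unary[i][b]: score of label i taking value b in {0, 1}.
    pairwise[i][a][b]: score of (label i = a, label i + 1 = b).
    """
    best = [list(unary[0])]  # best[i][b]: best prefix score ending in b
    back = []                # back-pointers for recovering the argmax
    for i in range(1, len(unary)):
        row, ptr = [], []
        for b in (0, 1):
            cands = [best[-1][a] + pairwise[i - 1][a][b] for a in (0, 1)]
            a_star = 0 if cands[0] >= cands[1] else 1
            row.append(cands[a_star] + unary[i][b])
            ptr.append(a_star)
        best.append(row)
        back.append(ptr)
    # Follow back-pointers from the best final value.
    y = [0 if best[-1][0] >= best[-1][1] else 1]
    for ptr in reversed(back):
        y.append(ptr[y[-1]])
    y.reverse()
    return y, max(best[-1])

# Label 1 slightly prefers "off" on its own, but co-occurs with label 0.
unary = [[0.0, 2.0], [0.5, 0.0], [0.0, 0.0]]
pairwise = [
    [[0.0, 0.0], [0.0, 1.0]],  # reward labels 0 and 1 being on together
    [[0.0, 0.0], [0.0, 0.0]],  # no interaction between labels 1 and 2
]
labels, score = viterbi_chain(unary, pairwise)
print(labels, score)
```

Independent classification would switch label 1 off here; the pairwise term switches it on, which is the effect the correlation structure is meant to capture.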
Data: Multi-label datasets
Evaluation: see here
References:
1. Sorower. A Literature Survey on Algorithms for Multi-label Learning.
2. Thomas Finley and Thorsten Joachims. 2008. Training structural SVMs when exact inference is intractable.
3. Pystruct
4. Scikit.ml (very strong methods not based on structured prediction)

Hierarchical sparsemax attention

Problem: Performing neural attention over very long sequences (e.g. for document-level classification, translation, ...)
Method: sparse hierarchical attention with product of sparsemaxes.
Data: text classification datasets
Evaluation: Accuracy; empirical analysis of where the models attend.
Notes: If the top-level sparsemax gives zero probability to some paragraphs, those can be pruned from the computation graph. Can this lead to speedups?
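A minimal sketch of the pruning idea, assuming a two-level hierarchy (paragraphs, then words) and raw attention scores given as plain lists. The `sparsemax` function follows the usual sorting-based projection onto the simplex; `hierarchical_attention` and the example scores are illustrative assumptions rather than a prescribed design.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the simplex."""
    zs = sorted(z, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, zk in enumerate(zs, start=1):
        cumsum += zk
        # The support is the longest prefix satisfying this condition.
        if 1 + k * zk > cumsum:
            tau = (cumsum - 1) / k
    return [max(zi - tau, 0.0) for zi in z]

def hierarchical_attention(paragraph_scores, word_scores):
    """Product of a paragraph-level and a word-level sparsemax.

    Paragraphs with zero top-level probability are skipped entirely,
    so no word-level attention is computed for them.
    """
    top = sparsemax(paragraph_scores)
    joint = []
    for p, scores in zip(top, word_scores):
        if p == 0.0:
            joint.append(None)  # pruned paragraph
            continue
        joint.append([p * q for q in sparsemax(scores)])
    return top, joint

paragraph_scores = [1.0, 0.5, -1.0]
word_scores = [[0.0, 0.0], [3.0, 0.0], [1.0, 2.0]]
top, joint = hierarchical_attention(paragraph_scores, word_scores)
print(top)    # [0.75, 0.25, 0.0]
print(joint)
```

The third paragraph gets exactly zero probability, so its word scores are never touched. Whether this translates into wall-clock speedups depends on how well such data-dependent skipping batches on the target hardware, which is part of what the project would measure.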
References:
1. Yang, Yang, Dyer, He, Smola, Hovy. Hierarchical Attention Networks for Document Classification. NAACL 2016.
2. Martins and Astudillo. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. ICML 2016.

Sparse link prediction

Problem: Predicting links in a large structured graph. For instance: predict co-authorship, movie recommendation, coreference resolution, discourse relations between sentences in a document.
Method: The simplest approach is independent binary classification: for every node pair (i, j), predict whether there is a link or not. This has two issues: the classes are highly imbalanced, since most node pairs are not linked, and structure and higher-order correlations are ignored. Develop a method that addresses both: incorporate structural correlations (e.g. with combinatorial inference, constraints, latent variables) and account for imbalance, ideally via pairwise ranking losses: learn a scorer such that S(i, j) > S(k, l) if there is an edge (i, j) but no edge (k, l).
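As a rough sketch of the pairwise ranking idea, the snippet below computes a hinge-style ranking loss and the corresponding AUC from scores of linked and unlinked node pairs. The hinge form, the margin, and the example scores are assumptions for illustration; other surrogate losses would fit the same template.

```python
def pairwise_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Average hinge penalty over all (edge, non-edge) pairs.

    A pair is penalized unless the edge outscores the non-edge
    by at least `margin`.
    """
    total = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            total += max(0.0, margin - (sp - sn))
    return total / (len(pos_scores) * len(neg_scores))

def auc(pos_scores, neg_scores):
    """Fraction of (edge, non-edge) pairs ranked correctly; ties count half."""
    correct = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                correct += 1.0
            elif sp == sn:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))

# Hypothetical scores S(i, j) for true edges and for non-edges.
pos = [2.0, 0.5]
neg = [0.0, 1.0, -1.0]
print(pairwise_hinge_loss(pos, neg), auc(pos, neg))
```

Because both quantities depend only on the relative order of edge and non-edge scores, they are insensitive to the fact that non-edges vastly outnumber edges. With many nodes the double loop is too expensive, so in practice non-edges would be subsampled.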
Data: arXiv macro usage, Coreference in quizbowl
Notes: Can graph-CNNs (previous idea) be useful here?
References:
1. Rosenfeld, Meshi, Tarlow, Globerson. Learning Structured Models with the AUC Loss and Its Generalizations. AISTATS 2014.

Energy Networks

Problem: Energy networks can be used for density estimation (estimating p(x)) and structured prediction (estimating p(y|x)) when y is structured. Both cases pose challenges due to intractability of computing the partition function and sampling. In structured output prediction with energy networks, the idea is to replace discrete structured inference with continuous optimization in a neural net. This project can be either a survey about recent work in this area or it can explore some practical applications. Applications: multi-label classification and sequence tagging.
Method: Learn a neural network E(x; w) to model the energy of x or E(x, y; w) to model the energy of an output configuration y (relaxed to be a continuous variable). Inference becomes min_y E(x, y; w). How far can this relaxation take us? Can it be better/faster than global combinatorial optimization approaches?
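The relaxed inference step min_y E(x, y; w) can be sketched with projected gradient descent over y in [0, 1]^L. Everything concrete below is an assumption for illustration: the energy is a hand-written toy function standing in for a trained network, `scores` plays the role of input-dependent local scores, `coupling` penalizes two labels being on together, and gradients are taken by finite differences rather than backpropagation.

```python
def energy(scores, coupling, y):
    """Toy energy standing in for a trained network E(x, y; w)."""
    local = sum(0.5 * yi * yi - s * yi for s, yi in zip(scores, y))
    # Hypothetical structured term: labels 0 and 2 should not co-occur.
    return local + coupling * y[0] * y[2]

def numerical_gradient(f, y, eps=1e-5):
    """Central finite-difference gradient of f at y."""
    grad = []
    for i in range(len(y)):
        up = y[:i] + [y[i] + eps] + y[i + 1:]
        down = y[:i] + [y[i] - eps] + y[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def relaxed_inference(f, num_labels, steps=100, lr=0.5):
    """Minimize f over the box [0, 1]^num_labels by projected gradient descent."""
    y = [0.5] * num_labels
    for _ in range(steps):
        g = numerical_gradient(f, y)
        # Gradient step followed by projection back onto the box.
        y = [min(1.0, max(0.0, yi - lr * gi)) for yi, gi in zip(y, g)]
    return y

scores = [0.9, -0.3, 0.7]
y_hat = relaxed_inference(lambda y: energy(scores, 1.0, y), len(scores))
labels = [1 if yi > 0.5 else 0 for yi in y_hat]
print(y_hat, labels)
```

In this toy setup label 2 has a positive local score but is switched off because of its coupling with label 0, which is the kind of interaction a discrete structured model would otherwise have to handle combinatorially.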
Data: MNIST, multi-label classification, sequence tagging.
References:

Memory-augmented Neural Networks

Problem: Improve the generalization of neural nets by searching similar examples in the training set.
Method: kNN + NN, fast search + NN, prototype attention (efficient attention over the dataset)
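A minimal sketch of the "kNN + NN" combination, assuming the network exposes a class distribution and a feature vector for each input. The memory contents, the distance-based weighting, and the interpolation weight `lam` are illustrative assumptions; a real system would replace the exhaustive sort with a fast approximate search.

```python
import math

def knn_label_distribution(query, memory, k, num_classes, temperature=1.0):
    """Label distribution from the k nearest stored training examples.

    memory: list of (feature_vector, label) pairs.
    """
    nearest = sorted(memory, key=lambda item: math.dist(query, item[0]))[:k]
    # Closer neighbors get exponentially larger weight.
    weights = [math.exp(-math.dist(query, vec) / temperature)
               for vec, _ in nearest]
    total = sum(weights)
    dist = [0.0] * num_classes
    for w, (_, label) in zip(weights, nearest):
        dist[label] += w / total
    return dist

def interpolate(model_probs, knn_probs, lam=0.5):
    """Mix the network's prediction with the retrieved-neighbor distribution."""
    return [(1 - lam) * p + lam * q for p, q in zip(model_probs, knn_probs)]

# Hypothetical memory of training features and labels.
memory = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([5.0, 5.0], 1)]
knn_probs = knn_label_distribution([0.0, 0.5], memory, k=2, num_classes=2)
mixed = interpolate([0.4, 0.6], knn_probs, lam=0.5)
print(knn_probs, mixed)
```

Here the network alone would predict class 1, while the retrieved neighbors pull the mixed prediction towards class 0.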
Data: See the references below.
References:

Quality Estimation and Uncertainty Estimation

Problem: Estimate the quality of a translation hypothesis without access to reference translations.
Method: See OpenKiwi: Neural Nets, Transfer Learning, BERT, XLM, etc.
Data: See the WMT 2020 page.
References:

Causality and Disentanglement

Problem: Causal inference and discovery is an area of growing interest in machine learning and statistics, with numerous applications and connections to confounding removal, reinforcement learning, and disentanglement of factors of variation. This project can be either a survey about the area or it can explore some practical applications.
Method: Plenty to choose from!
Data: See the references below.
References: