# Project Examples for Deep Structured Learning (Fall 2019)

We suggest below some project ideas. Feel free to use this as inspiration for your project. Talk to us for more details.

# Emergent Communication

**Problem:** Agents need to communicate to solve problems that require collaboration. The goal of this project is to apply techniques (for example, sparsemax or reinforcement learning) to induce communication among agents.

**Data:** See references below.

**Evaluation:** See references below.

**References:**
- Serhii Havrylov and Ivan Titov. Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols. NeurIPS 2017.
- Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z. Leibo, Karl Tuyls, and Stephen Clark. Emergent Communication through Negotiation. ICLR 2018.
- Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. Natural Language Does Not Emerge Naturally in Multi-Agent Dialog. EMNLP 2017.
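To make messages discrete yet still trainable end-to-end, Havrylov & Titov use the straight-through Gumbel-softmax trick. A minimal numpy sketch of sampling one message symbol this way (the vocabulary size and logits are illustrative, not from any of the papers):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(logits, tau=1.0):
    """Sample a relaxed one-hot message symbol from categorical logits.

    Adds Gumbel(0, 1) noise and applies a temperature-scaled softmax.
    As tau -> 0 the sample approaches a discrete one-hot vector.
    """
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    scores = (logits + gumbel) / tau
    scores -= scores.max()            # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# A sender with a 5-symbol vocabulary emits one message symbol.
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.3])
soft_symbol = gumbel_softmax_sample(logits, tau=0.5)
# Straight-through: the forward pass uses the discrete symbol, while
# gradients would flow through the relaxed soft_symbol.
hard_symbol = np.eye(5)[soft_symbol.argmax()]
print(soft_symbol.round(3), hard_symbol)
```

The receiver consumes `hard_symbol`; during backpropagation the gradient is taken with respect to the soft sample.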

# Explainability of Neural Networks

**Problem:** Neural networks are black boxes and not amenable to interpretation. The goal of this project is to develop and study methods that make a neural network model's predictions explainable (for example, using sparsemax attention).

**Method:** For example, sparse attention, gradient-based measures of feature importance, or LIME (see below).

**Data:** Stanford Sentiment Treebank, IMDB Large Movie Reviews Corpus, etc. See references below.

**References:**
- Marco T. Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?" Explaining the Predictions of Any Classifier. KDD 2016.
- Sarthak Jain and Byron C. Wallace. Attention is not Explanation. NAACL 2019.
- Zachary C. Lipton. The Mythos of Model Interpretability. ICML 2016 Workshop on Human Interpretability in Machine Learning.
- Sofia Serrano and Noah A. Smith. Is Attention Interpretable? ACL 2019.
- Sarah Wiegreffe and Yuval Pinter. Attention is not not Explanation. EMNLP 2019.
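The core of LIME (Ribeiro et al.) can be sketched in a few lines: perturb the instance, weight perturbations by proximity, and fit a weighted linear surrogate whose coefficients serve as the explanation. In this sketch the "black box" is a hypothetical fixed logistic model over four binary word features; in practice it would be any opaque classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black-box classifier (stands in for any opaque model).
w_true = np.array([3.0, -2.0, 0.1, 0.0])
def black_box(x):
    return 1.0 / (1.0 + np.exp(-(x @ w_true)))

x0 = np.ones(4)                     # instance to explain (all words present)

# 1) Perturb: randomly mask out words around x0.
Z = rng.integers(0, 2, size=(500, 4)).astype(float)
# 2) Proximity kernel: perturbations close to x0 get higher weight.
dist = (Z != x0).sum(axis=1)
weights = np.exp(-dist / 2.0)
# 3) Fit a weighted least-squares linear surrogate to the black box.
y = black_box(Z)
A = np.hstack([Z, np.ones((len(Z), 1))])          # add an intercept column
W = np.diag(weights)
coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
explanation = coef[:-1]                           # per-word importance
print(explanation.round(2))
```

The surrogate recovers that the first word pushes the prediction up and the second pushes it down, mirroring the hidden weights.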

# Generative Adversarial Networks for Discrete Data

**Problem:** Compare different deep generative models' ability to generate discrete data (such as text).

**Methods:** Generative Adversarial Networks.

**Data:** SNLI (just the text), Yelp/Yahoo datasets for unaligned sentiment/topic transfer, other text data.

**Evaluation:** Some of the metrics in [3].

**References:**

# Object detection with weak supervision

**Problem:** Given images with object tags, can we accurately predict object locations by training only with coarse, weak supervision about which objects are present in the data?

**Method:** Latent Potts model; belief propagation.

**Data:** COCO-Stuff, Pascal VOC.

**References:** See https://github.com/kazuto1011/deeplab-pytorch

# Sparse transformers

**Problem:** Transformers and BERT models are extremely large and expensive to train and keep in memory. The goal of this project is to distill or induce sparser and smaller Transformer models without losing accuracy, applying them to machine translation or language modeling.

**Method:** See references below.

**Data:** WMT datasets, WikiText, etc. See references below.

**References:**
- Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. Adaptively Sparse Transformers. EMNLP 2019.
- Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. Adaptive Attention Span in Transformers. ACL 2019.
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating Long Sequences with Sparse Transformers. arXiv 2019.
- Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for Natural Language Understanding. arXiv 2019.

# Contextual Probabilistic Embeddings / Language Modeling

**Problem:** Embedding words as vectors (i.e., point masses) cannot distinguish vaguer from more specific concepts. One solution is to embed each word as a mean vector μ and a covariance Σ. Muzellec & Cuturi propose a nice framework for this, tested on learning non-contextualized embeddings. Can we extend it to contextualized embeddings via language modeling, e.g. a model that reads an entire sentence and predicts a context-dependent pair (μ, Σ) for each word (perhaps left-to-right or masked)? What likelihood should we use? How can we evaluate the learned embeddings downstream?

**Method:** See reference below.
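Muzellec & Cuturi compare Gaussian embeddings with the 2-Wasserstein (Bures) distance, which has a closed form. For diagonal covariances the Bures term reduces to the Euclidean distance between standard deviations, giving this minimal numpy sketch (the example words and variances are illustrative):

```python
import numpy as np

def w2_diag_gaussian(mu1, var1, mu2, var2):
    """2-Wasserstein distance between Gaussians with diagonal covariances.

    For commuting (here diagonal) covariances:
        W2^2 = ||mu1 - mu2||^2 + ||sqrt(var1) - sqrt(var2)||^2
    """
    mean_term = np.sum((mu1 - mu2) ** 2)
    bures_term = np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2)
    return np.sqrt(mean_term + bures_term)

# A vague concept (large variance) vs. a specific one (small variance):
mu_animal, var_animal = np.zeros(2), np.full(2, 4.0)
mu_dog, var_dog = np.zeros(2), np.full(2, 0.25)
d = w2_diag_gaussian(mu_animal, var_animal, mu_dog, var_dog)
print(d)
```

Note the distance is nonzero even when the means coincide: the variance gap alone separates a vague word from a specific one, which is exactly what point embeddings cannot express.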
**References:**

# Constrained Structured Classification with AD3

**Problem:** Use AD3 and dual decomposition techniques to impose logic, budget, or structured constraints in structured problems. Possible tasks include generating diverse outputs, forbidding certain configurations, etc.

**Evaluation:** Task-dependent.
**References:**

# Structured multi-label classification

**Problem:** Multi-label classification is a learning setting where every sample can be assigned zero, one, or more labels.

**Method:** Correlations between labels can be exploited by learning an affinity matrix of label correlations. Inference in a fully connected correlation graph is hard; approximating the graph by a tree makes inference fast (Viterbi can be used).

**Data:** Multi-label datasets.

**Evaluation:** See here.

**References:**
- Sorower. A Literature Survey on Algorithms for Multi-label Learning.
- Thomas Finley and Thorsten Joachims. Training Structural SVMs when Exact Inference is Intractable. ICML 2008.
- PyStruct
- Scikit.ml (very strong methods not based on structured prediction)
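One classical way to approximate the label-correlation graph by a tree is the Chow–Liu construction: estimate pairwise mutual information between labels and take a maximum spanning tree. A small numpy sketch with a hand-rolled Prim's algorithm on a hypothetical 4-label dataset:

```python
import numpy as np

def mutual_info(a, b):
    """Empirical mutual information between two binary label columns."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(Y):
    """Maximum spanning tree (Prim's algorithm) over pairwise label MI."""
    L = Y.shape[1]
    mi = np.array([[mutual_info(Y[:, i], Y[:, j]) for j in range(L)]
                   for i in range(L)])
    in_tree, edges = {0}, []
    while len(in_tree) < L:
        best = max(((i, j) for i in in_tree
                    for j in range(L) if j not in in_tree),
                   key=lambda e: mi[e])
        edges.append(best)
        in_tree.add(best[1])
    return edges

# Toy dataset: label 1 copies label 0; labels 2 and 3 are independent noise.
rng = np.random.default_rng(0)
y0 = rng.integers(0, 2, size=200)
Y = np.stack([y0, y0,
              rng.integers(0, 2, 200), rng.integers(0, 2, 200)], axis=1)
edges = chow_liu_tree(Y)
print(edges)
```

On this toy data the strongly coupled pair (0, 1) is picked first; Viterbi-style inference then runs exactly on the resulting tree.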

# Hierarchical sparsemax attention

**Problem:** Performing neural attention over very long sequences (e.g. for document-level classification, translation, ...).

**Method:** Sparse hierarchical attention with a product of sparsemaxes.

**Data:** Text classification datasets.

**Evaluation:** Accuracy; empirical analysis of where the models attend.

**Notes:** If the top-level sparsemax gives zero probability to some paragraphs, those can be pruned from the computation graph. Can this lead to speedups?
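The pruning idea works because sparsemax (Martins & Astudillo, 2016), unlike softmax, can assign exactly zero probability to low-scoring entries. A minimal numpy sketch of sparsemax applied to illustrative paragraph-level scores:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of the score vector z onto the
    probability simplex (Martins & Astudillo, 2016). Unlike softmax it
    can return exact zeros."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = k * z_sorted > cumsum - 1       # i.e. 1 + k * z_(k) > cumsum_k
    k_max = k[support][-1]                    # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max     # threshold
    return np.maximum(z - tau, 0.0)

scores = np.array([2.0, 1.2, 0.1, -1.0])      # hypothetical paragraph scores
p = sparsemax(scores)
print(p)  # -> [0.9 0.1 0.  0. ]
```

Here the last two paragraphs receive exactly zero weight, so in a hierarchical model their lower-level attention need never be computed.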
**References:**

# Sparse group lasso attention mechanism

**Problem:** For structured data segmented into given "groups" (e.g. fields in a form, regions in an image, sentences in a paragraph), design a "group-sparse" attention mechanism that tends to give zero weight to entire groups deemed not relevant enough.

**Method:** A sparse group-lasso penalty in a generalized structured attention framework [2].

**Notes:** The L1 term is redundant when optimizing over the simplex; regular group lasso will already be sparse!
**References:**

# Sparse link prediction

**Problem:** Predicting links in a large structured graph, for instance: co-authorship prediction, movie recommendation, coreference resolution, or discourse relations between sentences in a document.

**Method:** The simplest approach is independent binary classification: for every node pair (i, j), predict whether there is a link or not. This has two issues: the data is highly imbalanced (most node pairs are not linked), and structure and higher-order correlations are ignored. Develop a method that addresses both: incorporate structural correlations (e.g. with combinatorial inference, constraints, or latent variables) and account for imbalance (ideally via pairwise ranking losses: learn a scorer such that S(i, j) > S(k, l) if there is an edge (i, j) but no edge (k, l)).

**Data:** arXiv macro usage, coreference in quizbowl.

**Notes:** Can graph CNNs (previous idea) be useful here?
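The pairwise ranking idea can be sketched with a logistic loss on score margins. Everything below is a hypothetical toy setup: random node embeddings, a bilinear scorer S(i, j) = e_i^T W e_j, and plain gradient descent on W:

```python
import numpy as np

rng = np.random.default_rng(0)

E = rng.normal(size=(6, 4))    # hypothetical 4-d node embeddings
W = np.zeros((4, 4))           # bilinear scorer parameters

pos = [(0, 1), (2, 3)]         # observed edges
neg = [(0, 4), (2, 5)]         # sampled non-edges

def score(i, j):
    return E[i] @ W @ E[j]

def ranking_loss():
    # logistic loss on margins: log(1 + exp(-(S_pos - S_neg)))
    return sum(np.log1p(np.exp(-(score(i, j) - score(k, l))))
               for (i, j), (k, l) in zip(pos, neg))

lr = 0.1
for _ in range(200):           # gradient descent on the pairwise loss
    grad = np.zeros_like(W)
    for (i, j), (k, l) in zip(pos, neg):
        margin = score(i, j) - score(k, l)
        coeff = -1.0 / (1.0 + np.exp(margin))   # d(loss)/d(margin)
        grad += coeff * (np.outer(E[i], E[j]) - np.outer(E[k], E[l]))
    W -= lr * grad
print(score(0, 1) - score(0, 4))   # margin between edge and non-edge
```

Unlike independent binary classification, this objective only compares linked pairs against unlinked ones, which sidesteps the class-imbalance problem.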
**References:**

# Structured Prediction Energy Networks

**Problem:** Structured output prediction with energy networks: replace discrete structured inference with continuous optimization in a neural net. Applications: multi-label classification; simple structured problems such as sequence tagging or arc-factored parsing.

**Method:** Learn a neural network E(x, y; w) to model the energy of an output configuration y (relaxed to be a continuous variable). Inference becomes min_y E(x, y; w). How far can this relaxation take us? Can it be better/faster than global combinatorial optimization approaches?

**Data:** Sequence tagging, parsing, optimal matching?

**Notes:** When E is a neural network, min_y E(x, y; w) is a non-convex optimization problem (possibly with mild constraints such as y ∈ [0, 1]). Amos et al. have an approach that allows E to be a complicated neural net while remaining convex in y. Is this beneficial? Are some kinds of structured data better suited for SPENs than others? E.g. sequence labelling seems "less structured" than dependency parsing.
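The relaxed inference step min_{y ∈ [0,1]^n} E(x, y; w) can be sketched as projected gradient descent. Here the "learned" energy is a hypothetical convex quadratic standing in for a real SPEN network, just to illustrate the optimization:

```python
import numpy as np

# Hypothetical energy E(y) = 0.5 y^T A y - b^T y. In a real SPEN this
# would be a neural network in (x, y); only the inference loop matters.
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 2.0, 0.5],
              [0.0, 0.5, 2.0]])
b = np.array([3.0, -1.0, 1.0])

def energy(y):
    return 0.5 * y @ A @ y - b @ y

y = np.full(3, 0.5)                          # start at the box center
for _ in range(100):
    grad = A @ y - b                         # d E / d y
    y = np.clip(y - 0.1 * grad, 0.0, 1.0)    # projected gradient step
print(y.round(3), energy(y))
```

For this instance the box constraints are active: the iterate converges to y ≈ (1, 0, 0.5), with the first two coordinates clipped at the boundary, which is exactly the kind of solution a discrete decoder would round from.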
**References:**