Project Examples for Deep Structured Learning (Fall 2018)

Below we suggest some project ideas. Feel free to use them as inspiration for your project. Talk to us for more details.

Sparse Classification with Sparsemax

Problem: Apply sparsemax and/or sparsemax loss to a problem that requires outputting sparse label probabilities or sparse latent variables (attention).
Data: Multi-label datasets, SNLI, WMT, any data containing many labels for which only a few are plausible for each example.
Evaluation: F1, accuracy, inspection of where the model learns to attend.
References:
1. Martins and Astudillo. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. ICML 2016.
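
To make the sparsemax idea concrete, here is a minimal sketch of the forward pass as the Euclidean projection of a score vector onto the probability simplex, using the sort-and-threshold procedure from the reference above. It is plain Python for illustration only; the function name `sparsemax` and the list-based interface are choices made here, not part of any particular library.

```python
def sparsemax(z):
    """Project the score vector z onto the probability simplex.

    Unlike softmax, the result can contain exact zeros.
    """
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumsum += z_k
        # The k-th largest score is in the support while this holds;
        # the condition is true for a prefix of the sorted scores.
        if 1.0 + k * z_k > cumsum:
            tau = (cumsum - 1.0) / k
    # Shift by the threshold tau and clip at zero.
    return [max(z_i - tau, 0.0) for z_i in z]


if __name__ == "__main__":
    print(sparsemax([1.0, 0.8, 0.1]))
    print(sparsemax([2.0, 1.0, 0.1]))
```

When one score dominates the others by a large enough gap, all the probability mass goes to that label; this is the behaviour that makes the output useful for sparse label probabilities or sparse attention.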

Problem: Compare different deep generative models' ability to generate discrete data (such as text).
Methods: Generative Adversarial Networks, Variational Auto-Encoders.
Data: SNLI (just the text), Yelp/Yahoo datasets for unaligned sentiment/topic transfer, other text data.
Evaluation: Some of the metrics in [4].
References:

Problem: Use AD3 and dual decomposition techniques to impose logic/budget/structured constraints in structured problems. Possible tasks could involve generating diverse output, forbidding certain configurations, etc.
Data:
- "weasel words": detecting hedges/uncertainty in writing in order to improve clarity
- Coreference in quizbowl
Evaluation: task-dependent
References:

Problem: Multi-label classification is a learning setting where every sample can be assigned zero, one or more labels.
Method: Correlations between labels can be exploited by learning an affinity matrix of label correlations. Inference in a fully connected correlation graph is hard; approximating the graph by a tree makes inference fast (Viterbi can be used).
Data: Multi-label datasets
Evaluation: see here
References:
1. Sorower. A Literature Survey on Algorithms for Multi-label Learning.
2. Thomas Finley and Thorsten Joachims. 2008. Training structural SVMs when exact inference is intractable.
3. Pystruct
4. Scikit.ml (very strong methods not based on structured prediction)
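
As a rough illustration of why a tree makes inference fast, the sketch below runs exact max-product (the tree generalization of Viterbi) over binary labels. The scoring model is an assumption made for this example: `unary[i]` is the score for switching label i on, and `pairwise[i]` is an affinity added when both label i and its parent are on. The names `unary`, `parent`, `pairwise` and the requirement that `parent[i] < i` (with node 0 as root) are hypothetical conventions, not taken from Pystruct or any other library.

```python
def score(y, unary, parent, pairwise):
    """Total score of a binary label assignment y under the tree model."""
    total = sum(unary[i] for i in range(len(y)) if y[i])
    total += sum(pairwise[i] for i in range(1, len(y)) if y[i] and y[parent[i]])
    return total


def tree_map(unary, parent, pairwise):
    """Exact MAP labelling on a tree; assumes parent[i] < i and node 0 is the root."""
    n = len(unary)
    # best[i][s]: best score of the subtree rooted at i, given y_i = s.
    best = [[0.0, unary[i]] for i in range(n)]
    # choice[i][sp]: best state of node i, given that its parent is in state sp.
    choice = [[0, 0] for _ in range(n)]
    for i in range(n - 1, 0, -1):  # children are visited before their parents
        p = parent[i]
        for sp in (0, 1):
            off = best[i][0]
            on = best[i][1] + (pairwise[i] if sp == 1 else 0.0)
            choice[i][sp] = 1 if on > off else 0
            best[p][sp] += max(on, off)
    # Decode from the root down to the leaves.
    y = [0] * n
    y[0] = 1 if best[0][1] > best[0][0] else 0
    for i in range(1, n):
        y[i] = choice[i][y[parent[i]]]
    return y, best[0][y[0]]


if __name__ == "__main__":
    unary = [1.0, -0.5, -0.2, -2.0]
    parent = [-1, 0, 0, 1]
    pairwise = [0.0, 1.0, -1.0, 0.5]
    print(tree_map(unary, parent, pairwise))
```

The cost is linear in the number of labels, whereas brute-force search over a fully connected graph enumerates all 2^n assignments.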

Problem: Performing neural attention over very long sequences (e.g. for document-level classification, translation, ...)
Method: sparse hierarchical attention with product of sparsemaxes.
Data: text classification datasets
Evaluation: Accuracy; empirical analysis of where the models attend.
Notes: If the top-level sparsemax gives zero probability to some paragraphs, those can be pruned from the computation graph. Can this lead to speedups?
References:
1. Yang, Yang, Dyer, He, Smola, Hovy. Hierarchical Attention Networks for Document Classification. NAACL 2016.
2. Martins and Astudillo. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. ICML 2016.
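
The pruning question in the notes can be explored with a sketch like the following, which multiplies a paragraph-level sparsemax by word-level sparsemaxes and skips paragraphs that receive zero probability. It is an illustration under stated assumptions: `word_scorer` is a hypothetical callable that computes word scores for one paragraph on demand, standing in for whatever encoder the real model would use.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumsum += z_k
        if 1.0 + k * z_k > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(z_i - tau, 0.0) for z_i in z]


def hierarchical_sparse_attention(paragraph_scores, word_scorer):
    """Return {paragraph index: joint word weights} for unpruned paragraphs.

    word_scorer(i) is only called for paragraphs with nonzero top-level
    probability, so the word-level computation for the others is skipped.
    """
    top = sparsemax(paragraph_scores)
    attention = {}
    for idx, p in enumerate(top):
        if p == 0.0:
            continue  # pruned: no word-level computation for this paragraph
        words = sparsemax(word_scorer(idx))
        attention[idx] = [p * w for w in words]
    return attention


if __name__ == "__main__":
    toy_word_scores = [[1.0, 0.8, 0.1], [0.3, 0.3], [5.0, 0.0]]
    print(hierarchical_sparse_attention([2.0, 1.5, -1.0], lambda i: toy_word_scores[i]))
```

The joint weights over all words still sum to one, and counting how many times `word_scorer` is called gives a first measure of the potential speedup.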

Problem: For structured data segmented into given "groups" (e.g. fields in a form, regions in an image, sentences in a paragraph), design a "group-sparse" attention mechanism that tends to give zero weight to entire groups when deemed not relevant enough.
Method: a Sparse Group-Lasso penalty in a generalized structured attention framework [2]
Notes: on the simplex the attention weights are nonnegative and sum to 1, so their L1 norm is constant and the L1 term is redundant; the regular group lasso penalty alone will already be sparse!
References:

Problem: Go beyond vector (point) embeddings: embed objects as ellipses instead of points; capture notions of inclusion/overlap.
References:
1. Generalizing Point Embeddings using the Wasserstein Space of Elliptical Distributions. Muzellec & Cuturi. 2018.
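
As a small starting point, the sketch below computes the squared 2-Wasserstein distance between two embeddings in a deliberately simplified setting: each object is a Gaussian with a mean vector and a diagonal covariance, which is an assumption made here for illustration (the reference above treats general elliptical distributions with full scale matrices). The function name and argument layout are hypothetical.

```python
import math


def w2_squared_diag(mean_a, var_a, mean_b, var_b):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    With diagonal covariances the distance splits into a location term
    (squared distance between means) and a scale term (squared distance
    between per-axis standard deviations).
    """
    location = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    scale = sum((math.sqrt(va) - math.sqrt(vb)) ** 2 for va, vb in zip(var_a, var_b))
    return location + scale


if __name__ == "__main__":
    print(w2_squared_diag([0.0, 0.0], [1.0, 4.0], [3.0, 4.0], [4.0, 1.0]))
```

Two embeddings with the same mean but different variances are at nonzero distance, which is what a point embedding cannot express.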

Problem: learn good fixed-size hidden representations for data that comes in graph format with different shapes and sizes.
Method: Graph convolutional networks
Data: arXiv macro usage, annotated semantic relationships datasets, paralex
References:
1. Kipf and Welling. Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017.
2. Defferrard et al. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. NIPS 2016.
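
For orientation, here is a minimal sketch of one graph convolutional layer using the propagation rule from Kipf and Welling (symmetrically normalized adjacency with self-loops, followed by a linear map and ReLU). The mean-pooling readout that turns node features into a fixed-size graph vector is an assumption added here, one simple option among many. It uses plain Python lists so it runs without dependencies; a real project would use a tensor library.

```python
import math


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj, feats, weight):
    """One layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(v, 0.0) for v in row] for row in out]


def mean_pool(node_feats):
    """Average node features into one fixed-size vector (assumed readout)."""
    n = len(node_feats)
    return [sum(row[j] for row in node_feats) / n for j in range(len(node_feats[0]))]


if __name__ == "__main__":
    adj = [[0.0, 1.0], [1.0, 0.0]]
    print(mean_pool(gcn_layer(adj, [[1.0], [3.0]], [[1.0]])))
```

Because the readout averages over nodes, graphs with different numbers of nodes map to vectors of the same size, determined only by the weight matrix.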

Problem: Predicting links in a large structured graph. For instance: predict co-authorship, movie recommendation, coreference resolution, discourse relations between sentences in a document.
Method: The simplest approach is independent binary classification: for every node pair (i, j), predict whether there is a link or not. This has two issues: the classes are highly imbalanced (most node pairs are not linked), and structure and higher-order correlations are ignored. Develop a method that addresses both: incorporate structural correlations (e.g. with combinatorial inference, constraints, latent variables) and account for imbalance, ideally via pairwise ranking losses: learn a scorer such that S(i, j) > S(k, l) if there is an edge (i, j) but no edge (k, l).
Data: arXiv macro usage, Coreference in quizbowl
Notes: Can graph-CNNs (previous idea) be useful here?
References:
1. Rosenfeld, Meshi, Tarlow, Globerson. Learning Structured Models with the AUC Loss and Its Generalizations. AISTATS 2014.
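
The ranking condition S(i, j) > S(k, l) can be turned into a loss in several ways; the sketch below uses a pairwise hinge loss with a margin, which is one common choice and an assumption here rather than the method of the reference above. It also computes AUC as the fraction of (edge, non-edge) pairs ranked correctly. The dictionary-based interface is hypothetical.

```python
def split_scores(scores, edges):
    """Separate candidate-pair scores into linked and non-linked pairs."""
    pos = [s for pair, s in scores.items() if pair in edges]
    neg = [s for pair, s in scores.items() if pair not in edges]
    if not pos or not neg:
        raise ValueError("need at least one edge and one non-edge")
    return pos, neg


def pairwise_hinge_loss(scores, edges, margin=1.0):
    """Mean of max(0, margin - S(edge) + S(non-edge)) over all such pairs."""
    pos, neg = split_scores(scores, edges)
    total = sum(max(0.0, margin - sp + sn) for sp in pos for sn in neg)
    return total / (len(pos) * len(neg))


def auc(scores, edges):
    """Fraction of (edge, non-edge) pairs ranked correctly; ties count half."""
    pos, neg = split_scores(scores, edges)
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0 for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    scores = {(0, 1): 2.0, (0, 2): 0.5, (1, 2): -1.0, (0, 3): 0.8}
    edges = {(0, 1), (0, 3)}
    print(pairwise_hinge_loss(scores, edges), auc(scores, edges))
```

Both quantities depend only on the relative order of scores between edges and non-edges, so they are not dominated by the large number of unlinked pairs the way per-pair accuracy is. Enumerating all pairs is quadratic; on large graphs one would sample non-edges instead.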

Problem: Structured output prediction with energy networks: replace discrete structured inference with continuous optimization in a neural net. Applications: multi-label classification; simple structured problems: sequence tagging, arc-factored parsing?
Method: Learn a neural network E(x, y; w) to model the energy of an output configuration y (relaxed to be a continuous variable). Inference becomes min_y E(x, y; w). How far can this relaxation take us? Can it be better/faster than global combinatorial optimization approaches?
Data: Sequence tagging, parsing, optimal matching?
Notes: When E is a neural network, min_y E(x, y; w) is a non-convex optimization problem (possibly with mild constraints such as y in [0, 1]). Amos et al. have an approach that allows E to be a complicated neural net but remain convex in y. Is this beneficial? Are some kinds of structured data better suited for SPENs than others? E.g. sequence labelling seems "less structured" than dependency parsing.
References:
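
To illustrate the inference step min_y E(x, y; w) with y relaxed to [0, 1], here is a minimal sketch of projected gradient descent. The energy is a hand-written quadratic with a repulsive interaction between two labels, used as a stand-in for a neural network; this toy energy, the names `b` and `W`, and the step size are all assumptions made for the example. In practice the gradient would come from automatic differentiation.

```python
def toy_energy_grad(y, b, W):
    """Gradient of E(y) = -b.y + 0.5 * y^T W y for symmetric W (toy energy)."""
    n = len(y)
    return [-b[i] + sum(W[i][j] * y[j] for j in range(n)) for i in range(n)]


def projected_gradient_inference(grad_fn, n, step=0.1, iters=200):
    """Approximately minimize the energy over y in [0, 1]^n."""
    y = [0.5] * n  # start from the centre of the box
    for _ in range(iters):
        g = grad_fn(y)
        # Gradient step followed by projection (clipping) onto [0, 1].
        y = [min(1.0, max(0.0, yi - step * gi)) for yi, gi in zip(y, g)]
    return y


if __name__ == "__main__":
    b = [1.0, 0.2]                 # both labels are individually attractive
    W = [[0.0, 1.5], [1.5, 0.0]]   # but switching both on is penalized
    print(projected_gradient_inference(lambda y: toy_energy_grad(y, b, W), n=2))
```

With this indefinite W the energy is non-convex in y, so the result depends on the starting point, which is the issue raised in the notes above.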