ETH-FDS seminar series

More information about ETH Foundations of Data Science can be found here


Please subscribe here if you would like to be notified about these events via e-mail. You can also subscribe to the iCal/ics calendar.

Spring Semester 2019

24 January 2019
16:15-17:00
Amin Karbasi
Yale University, USA

ETH-FDS seminar

Title It Is Good to Relax (Maybe Best)
Speaker, Affiliation Amin Karbasi, Yale University, USA
Date, Time 24 January 2019, 16:15-17:00
Location HG E 22
Abstract The difficulty of searching through a massive amount of data in order to quickly make an informed decision is one of today’s most ubiquitous challenges. Many scientific and engineering models feature inherently discrete decision variables — from phrases in a corpus to objects in an image. The study of how to make near-optimal decisions from a massive pool of possibilities is at the heart of combinatorial optimization. Many of these problems are notoriously hard, and even those that are theoretically tractable may not scale to large instances. However, the problems of practical interest are often much more well-behaved and possess extra structures that allow them to be amenable to exact or approximate optimization techniques. Just as convexity has been a celebrated and well-studied condition under which continuous optimization is tractable, submodularity is a condition for which discrete objectives may be optimized. In order to provide the tightest approximation guarantees for submodular optimization problems, we usually need to leave the space of discrete domains and consider their continuous relaxations. To this end, we will explore the notion of submodularity in the continuous domains and introduce a broad class of non-convex objective functions. Despite the apparent lack of convexity, we will see that first-order optimization methods can provide strong approximation guarantees. We then show that such continuous relaxations can be used as an interface to provide tight approximation guarantees for maximizing stochastic submodular set functions. I will not assume any particular background on submodularity or optimization and will try to motivate and define all the necessary concepts during the talk.
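To make the discrete baseline concrete, here is a minimal sketch (my illustration, not material from the talk) of greedy maximization of a monotone submodular coverage function under a cardinality constraint; the classical greedy algorithm attains a (1 - 1/e) approximation, while the talk goes further and discusses continuous relaxations such as the multilinear extension.

```python
# Minimal sketch (not from the talk): greedy maximization of a monotone
# submodular set-coverage function under a cardinality constraint k.
from typing import List, Set

def coverage(sets: List[Set[int]], chosen: List[int]) -> int:
    """f(S) = number of elements covered by the chosen sets (monotone submodular)."""
    covered = set()
    for i in chosen:
        covered |= sets[i]
    return len(covered)

def greedy(sets: List[Set[int]], k: int) -> List[int]:
    chosen: List[int] = []
    for _ in range(k):
        # Pick the set with the largest marginal gain.
        best = max((i for i in range(len(sets)) if i not in chosen),
                   key=lambda i: coverage(sets, chosen + [i]) - coverage(sets, chosen))
        chosen.append(best)
    return chosen

if __name__ == "__main__":
    universe_sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
    print(greedy(universe_sets, 2))  # [0, 2]: covers {1, ..., 6}
```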
Bio: Amin Karbasi is currently an assistant professor of Electrical Engineering, Computer Science, and Statistics at Yale University. He has been the recipient of ONR 2019 Young Investigator Award, AFOSR 2018 Young Investigator Award, Grainger Award 2017 from National Academy of Engineering, Microsoft Azure research award 2017, DARPA 2016 Young Faculty Award, Simons-Berkeley fellowship 2016, Google Faculty Award 2015, and ETH fellowship 2013. His work has been recognized with a variety of paper awards, including Medical Image Computing and Computer Assisted Interventions Conference (MICCAI) 2017, International Conference on Artificial Intelligence and Statistics (AISTAT) 2015, IEEE ComSoc Data Storage 2013, International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2011, ACM SIGMETRICS 2010, and IEEE International Symposium on Information Theory (ISIT) 2010 (runner-up). His Ph.D. work received the Patrick Denantes Memorial Prize 2013 from the School of Computer and Communication Sciences at EPFL, Switzerland.
27 March 2019
12:30-13:30
Manfred K. Warmuth
UC Santa Cruz and Google Zürich

ETH-FDS seminar

Title Reverse Iterative Volume Sampling for Linear Regression
Speaker, Affiliation Manfred K. Warmuth, UC Santa Cruz and Google Zürich
Date, Time 27 March 2019, 12:30-13:30
Location HG D 5.2
Abstract Joint work with Michal Derezinski.
Consider the following basic one-dimensional linear regression problem. You are given a set of n points and are to predict the hidden response value for each point. The goal is to achieve total square loss close to the optimal linear predictor without seeing most of the hidden responses. We show that if you sample JUST ONE point from the set and request its response value, then the linear predictor fitting that point-response pair has exactly twice the optimum total square loss. In d dimensions, sampling a subset of just d points and fitting their responses achieves total loss d+1 times the optimum. The key trick is to use a joint sampling technique called volume sampling for sampling a diverse subset of points. We show that the least squares solution obtained for the volume-sampled subproblem is an unbiased estimator of the optimal solution based on all n responses. This unbiasedness is a desirable property that is not shared by other common subset selection techniques.
Motivated by these basic properties, we develop a theoretical framework for studying volume sampling, resulting in a number of new matrix expectation equalities and statistical guarantees which are of importance not only to least squares regression but also to numerical linear algebra in general. Our methods also lead to a regularized variant of volume sampling, and we propose the first efficient algorithm for volume sampling which makes this technique a practical tool in the machine learning toolbox. Finally, we provide experimental evidence which confirms our theoretical findings.
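A brute-force numerical sketch (my illustration, not the authors' code; feasible only for tiny n and d) of the result described above: sample a size-d subset with probability proportional to the squared volume det(X_S)^2, fit only those d responses, and compare the total square loss to the full least-squares optimum.

```python
# Brute-force size-d volume sampling for linear regression (enumeration of
# all d-subsets; only practical for small n and d).
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

# Full least-squares solution and its total square loss (the benchmark).
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
loss_star = np.sum((X @ w_star - y) ** 2)

# Volume sampling: P(S) proportional to det(X_S)^2 over all size-d subsets S.
subsets = list(itertools.combinations(range(n), d))
weights = np.array([np.linalg.det(X[list(S)]) ** 2 for S in subsets])
probs = weights / weights.sum()

losses = []
for _ in range(2000):
    S = list(subsets[rng.choice(len(subsets), p=probs)])
    w_S = np.linalg.solve(X[S], y[S])        # fit only the d sampled responses
    losses.append(np.sum((X @ w_S - y) ** 2))

# Per the talk, the expected ratio equals d + 1 (here 4).
print(np.mean(losses) / loss_star)
```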
Documents Video Manfred K. Warmuth - ETH-FDS talk on 27 March 2019
5 April 2019
12:30-13:30
Bart De Moor
KU Leuven

ETH-FDS seminar

Title More Mathematics Mandatory in Data Science
Speaker, Affiliation Bart De Moor, KU Leuven
Date, Time 5 April 2019, 12:30-13:30
Location HG G 19.1
Abstract The tsunami of data and the increasing availability of massive computing power create many opportunities for AI in general, and for data science in particular. However, in many instances, the results delivered by machine learning algorithms are difficult to ‘understand’ or ‘explain’. Black boxes as they are, we basically do not know when and how they are successful, and when and why not. The whole field is pervaded, if not swamped, by hundreds of ‘heuristic tricks’, such as tampering with the neuronal activation function (sigmoid, ReLU (rectified linear unit), …), experimenting with the number of neurons and hidden layers, selecting optimization methods (numerical ones, such as back-propagation or Gauss-Newton, or randomized ones, such as genetic algorithms), specifying user choices (initial guesses, step sizes, convergence tests, early stopping and/or regularization measures, seeds and cross-over/mutation rates, ...), etc. Therefore, machine learning is still more of an art than a science, and many results published in the literature do not even satisfy the minimal scientific requirement of ‘reproducibility’. Admittedly, this situation is not new. Even in the (‘classical’) field of system identification or statistical modeling of linear time-invariant finite-dimensional dynamical systems, the origins of which predate those of deep learning by many years, heuristics prevail. Of course, there is a certain systematic methodology to tackle data-based identification problems:
1. Collect data;
2. Select a pre-specified model class that is parametrized by unknown parameters;
3. Choose an appropriate approximation criterion;
4. ‘Solve’ the resulting nonlinear optimization problem by numerical algorithms that output ‘optimal’ parameters;
5. Validate the resulting model on validation/test sets or by assessing the quality of predictions;
6. Re-iterate the whole loop when necessary.
It is in Step 4 that plenty of heuristics are still required. Typically, the optimization problem is nonlinear and provably non-convex. The nonlinearities originate in the assumptions that are made on the data model (e.g. additive misfit on the observations, or unobserved additional inputs), and the fact that products between unknowns occur (e.g. between unknown model parameters and unobserved inputs as in ARMAX models, or between unknown model matrices and states in subspace identification methods). The nonlinearities can be dealt with depending on the identification framework one deploys: by invoking asymptotic orthogonality and numerical linear algebra tools in subspace identification, by using the machinery of instrumental variables in errors-in-variables approaches, or by implementing iterative nonlinear least squares algorithms as in Prediction-Error-Methods. Despite hundreds of person-years of experience, and thousands of published papers, books and software suites, all of these ‘solutions’ contain plenty of heuristics.

The main message of this talk is that we have to be braver: there is a lot of old and new mathematics out there that we collectively ignore, but that could provide us with a deeper understanding of our identification approaches. Specifically, for the problem of identifying LTI models from observed data, the central observation is that all nonlinearities involved are multivariate polynomial. With least squares as an approximation criterion, we end up with a multivariate polynomial optimization problem. From algebraic geometry, we know that such a problem is fundamentally equivalent to a large-scale eigenvalue problem. We will show how the set of first-order optimality conditions comprises a multi-parameter eigenvalue problem (MEVP), the solution of which is surprisingly little studied in the literature, requiring ‘new’ mathematics: such a MEVP can be solved by recursively building up a quasi-block-Toeplitz block Macaulay matrix, the null space of which is block multi-shift invariant (a property studied in operator theory). By then applying multi-dimensional (exact) realization theory in that null space (new multi-dimensional system theory), we can find the optimal parameters from the eigenvectors and eigenvalues of a specific large-scale matrix.

This ‘solution’, as described, satisfies the very scientific definition of the word, because a set of a priori minimal assumptions on data and model leads to a sequence of mathematically verifiable and reconstructable steps that uniquely characterize the optimal solution as an eigenvalue problem. Even if heuristics were still needed to compute the optimal solution because of the mere size of the problem, we now know that we are looking for a minimizing eigenvalue-eigenvector pair of a matrix constructed from the data and chosen model class. Can we learn something about the challenges posed by deep learning from this very special case of linear system identification? The answer is a resounding ‘yes’, and we will provide some first indications why our mathematical framework as outlined above might also be applicable to neural networks. The take-home message is that more - old and new - mathematics is mandatory to understand and explain deep learning. The endeavor is difficult, challenging, time consuming, requires patience and endurance, and the research involved is definitely high-risk high-gain.
The journey will take us into the realms of mathematical disciplines such as one- and multidimensional system theory, algebraic geometry, operator theory, and numerical linear algebra and optimization. For sure, diving deeper into the mathematical depths of neural networks is largely atypical in the current massive wave of heuristic deep learning activities. It will require a creative understanding of mathematics that is more profound than the acquired know-how that most practitioners of AI and deep learning currently possess. Yet, there is no alternative if we want to turn the toolbox of deep learning heuristics into real science.
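As a toy illustration of the eigenvalue connection described above (one variable only, far from the Macaulay-matrix machinery of the talk): the stationary points of a univariate polynomial are the eigenvalues of the companion matrix of its derivative.

```python
# Toy example: minimize a univariate polynomial via an eigenvalue problem.
import numpy as np

# p(x) = x^4 - 3x^2 + x + 5, coefficients from highest to lowest degree.
p = np.array([1.0, 0.0, -3.0, 1.0, 5.0])
dp = np.polyder(p)                         # derivative: 4x^3 - 6x + 1

# Companion matrix of the (monic) derivative; its eigenvalues are the roots,
# i.e. the stationary points of p.
c = dp / dp[0]
k = len(c) - 1
companion = np.zeros((k, k))
companion[0, :] = -c[1:]
companion[1:, :-1] = np.eye(k - 1)
stationary = np.linalg.eigvals(companion)

real = stationary[np.abs(stationary.imag) < 1e-9].real
x_min = real[np.argmin(np.polyval(p, real))]
print(x_min, np.polyval(p, x_min))         # minimizer and minimum value
```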
Documents Video Bart De Moor - ETH-FDS talk on 5 April 2019
26 June 2019
16:00-17:00
Francis Bach
INRIA and École normale supérieure, PSL Research University, Paris, France

ETH-FDS seminar

Title Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes
Speaker, Affiliation Francis Bach, INRIA and École normale supérieure, PSL Research University, Paris, France
Date, Time 26 June 2019, 16:00-17:00
Location HG F 3
Abstract We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while a single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performance is optimal, we consider potentially infinite-dimensional models and notions typically associated with kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model. (Joint work with Loucas Pillaud-Vivien and Alessandro Rudi)
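A minimal sketch (my illustration: plain constant-step-size SGD on a synthetic linear model, not the paper's kernel setting) showing the mechanics of running one versus several passes over the same data:

```python
# One pass vs. several passes of SGD on a synthetic least-squares problem.
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def sgd_passes(num_passes: int, step: float = 0.01) -> float:
    """Run SGD over the data `num_passes` times; return squared parameter error."""
    w = np.zeros(d)
    for _ in range(num_passes):
        for i in rng.permutation(n):          # one pass = one shuffled epoch
            grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x_i^T w - y_i)^2
            w -= step * grad
    return float(np.linalg.norm(w - w_true) ** 2)

for passes in (1, 5, 20):
    print(passes, sgd_passes(passes))
```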
Documents Video Francis Bach - ETH-FDS talk on 26 June 2019
9 July 2019
11:00-12:00
Suvrit Sra
MIT

ETH-FDS seminar

Title Non-convex optimization through a geometric lens
Speaker, Affiliation Suvrit Sra, MIT
Date, Time 9 July 2019, 11:00-12:00
Location ML F 34
Abstract Nonconvex optimization is currently enjoying intense interest in machine learning. This interest is largely fueled by deep learning, and also by more complex tasks in related areas. Although an understanding of why neural networks work so well remains elusive, there has been impressive progress in algorithms, software, and systems for nonconvex optimization. But in today's talk, I want to take a step back from algorithmic advances (fast nonconvex SGD, escaping saddle-points, adaptive methods, etc.). Instead, I want to draw your attention to a new set of tools that expand the repertoire of tractable nonconvex models. In particular, I will present a rich subclass of nonconvex problems that can be solved to global optimality by leveraging geometry. The specific concept that I'll discuss is geodesic convexity, which generalizes the usual vector-space notion of convexity to nonlinear spaces (more generally, I will also cover geodesically non-convex problems). My aim is to outline how geometric thinking leads to improved models or insights for fundamental tasks such as large-scale PCA, metric learning, Gaussian mixture models, among others. I will outline both theoretical and practical aspects, including iteration complexity theory, and conclude with some open problems.
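A small sketch (my example, not the speaker's code) of optimization through a geometric lens: Riemannian gradient ascent on the unit sphere for the leading eigenvector, a problem that is non-convex in the Euclidean sense but benign when viewed on the manifold.

```python
# Riemannian gradient ascent of the Rayleigh quotient x^T A x on the unit sphere.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 50))
A = A @ A.T                                   # symmetric PSD matrix

x = rng.normal(size=50)
x /= np.linalg.norm(x)
step = 1e-3
for _ in range(2000):
    g = 2 * A @ x                             # Euclidean gradient of x^T A x
    g_tan = g - (x @ g) * x                   # project onto the tangent space at x
    x = x + step * g_tan
    x /= np.linalg.norm(x)                    # retract back to the sphere

print(x @ A @ x, np.max(np.linalg.eigvalsh(A)))  # should nearly coincide
```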
11 July 2019
11:00-12:00
Stefanie Jegelka
MIT

ETH-FDS seminar

Title Representational Power of ResNet and Graph Neural Networks
Speaker, Affiliation Stefanie Jegelka, MIT
Date, Time 11 July 2019, 11:00-12:00
Location CAB G 51
Abstract In recent years, neural networks have achieved impressive empirical results on a wide range of tasks. One fundamental step in understanding their theoretical properties is to understand their expressive power. Classical results study this question for shallow and very wide networks, whereas recent results address deeper networks that are more common in practice. We focus on two types of networks that are popular in practice: (1) the ResNet architecture, and (2) neural networks for graphs. For ResNet, we show how narrow the network can be - if it can be deep - to still be a universal approximator. Our result reveals a distinction from other architectures. For graph neural networks, we study what graphs they can distinguish, and what properties of the network affect this discriminative power.
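A minimal forward-pass sketch (hypothetical parameters; the exact construction in the talk's results may differ) of a very narrow residual network, in which each block has a single hidden unit and expressiveness comes from depth plus the identity skip connection:

```python
# Forward pass of a deep residual network whose blocks each have one hidden unit.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def resnet_forward(x, blocks):
    """x: (d,) input; blocks: list of (V (1,d), b (1,), U (d,1)) per residual block."""
    h = x
    for V, b, U in blocks:
        h = h + (U @ relu(V @ h + b)).ravel()   # identity skip + 1-unit bottleneck
    return h

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    d, depth = 4, 10
    blocks = [(rng.normal(size=(1, d)), rng.normal(size=1), rng.normal(size=(d, 1)))
              for _ in range(depth)]
    print(resnet_forward(rng.normal(size=d), blocks))
```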
19 July 2019
11:00-12:00
Stefanie Jegelka
MIT
Suvrit Sra
MIT

ETH-FDS seminar

Title Tutorial by Stefanie Jegelka and Suvrit Sra on Negative Dependence, Stable Polynomials, and All That
Speaker, Affiliation Stefanie Jegelka, MIT
Suvrit Sra, MIT
Date, Time 19 July 2019, 11:00-12:00
Location CAB G 51
Abstract This tutorial provides an introduction to a rapidly evolving topic: the theory of negative dependence and its numerous ramifications in machine learning. Indeed, negatively dependent probability measures provide a powerful tool for modeling non-i.i.d. data, and thus can impact all aspects of learning, including supervised, unsupervised, interpretable, interactive, and large-scale setups. The most well-known examples of negatively dependent distributions are perhaps the Determinantal Point Processes (DPPs), which have already found numerous ML applications. But DPPs are just the tip of the iceberg; the class of negatively dependent measures is much broader, and given the vast web of mathematical connections it enjoys, it holds great promise as a tool for machine learning. This tutorial exposes the ML audience to this rich mathematical toolbox, while outlining key theoretical ideas and motivating fundamental applications. Tasks that profit from negative dependence include anomaly detection, information maximization, experimental design, validation of black-box systems, architecture learning, fast MCMC sampling, dataset summarization, and interpretable learning. Time permitting, we will also touch upon recent related breakthroughs on the broader class of so-called "completely log-concave" polynomials.
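A small numerical sketch (my example, not from the tutorial) of negative dependence in a determinantal point process: with marginal kernel K, the probability that two items both appear, det of the 2x2 submatrix K_{ij}, never exceeds the product of their individual marginals.

```python
# Negative association in a DPP: P({i, j} in Y) <= P(i in Y) * P(j in Y).
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(5, 3))
L = B @ B.T                                   # L-ensemble kernel
K = L @ np.linalg.inv(L + np.eye(5))          # marginal kernel K = L (L + I)^{-1}

i, j = 0, 1
p_i, p_j = K[i, i], K[j, j]
p_ij = np.linalg.det(K[np.ix_([i, j], [i, j])])   # P(i and j both sampled)
print(p_ij, "<=", p_i * p_j)
```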
22 July 2019
11:00-12:00
Stefanie Jegelka
MIT
Suvrit Sra
MIT

ETH-FDS seminar

Title Tutorial by Stefanie Jegelka and Suvrit Sra on Negative Dependence, Stable Polynomials, and All That
Speaker, Affiliation Stefanie Jegelka, MIT
Suvrit Sra, MIT
Date, Time 22 July 2019, 11:00-12:00
Location CAB G 51
Abstract This tutorial provides an introduction to a rapidly evolving topic: the theory of negative dependence and its numerous ramifications in machine learning. Indeed, negatively dependent probability measures provide a powerful tool for modeling non-i.i.d. data, and thus can impact all aspects of learning, including supervised, unsupervised, interpretable, interactive, and large-scale setups. The most well-known examples of negatively dependent distributions are perhaps the Determinantal Point Processes (DPPs), which have already found numerous ML applications. But DPPs are just the tip of the iceberg; the class of negatively dependent measures is much broader, and given the vast web of mathematical connections it enjoys, it holds great promise as a tool for machine learning. This tutorial exposes the ML audience to this rich mathematical toolbox, while outlining key theoretical ideas and motivating fundamental applications. Tasks that profit from negative dependence include anomaly detection, information maximization, experimental design, validation of black-box systems, architecture learning, fast MCMC sampling, dataset summarization, and interpretable learning. Time permitting, we will also touch upon recent related breakthroughs on the broader class of so-called "completely log-concave" polynomials.