| 09:30 - 10:00 | Elad Hazan |
Provably Efficient Learning in Nonlinear Dynamical Systems
Abstract. Learning in dynamical systems is a fundamental challenge underlying modern sequence modeling. Despite extensive study, efficient algorithms with formal guarantees for general nonlinear systems have remained elusive. This talk presents a provably efficient framework for learning in any bounded and Lipschitz nonlinear dynamical system, establishing the first sublinear regret guarantees in a dimension-free setting. Our approach combines Koopman lifting, Luenberger observers, and, crucially, spectral filtering to show that nonlinear dynamics are learnable. These insights motivate a new neural architecture, the Spectral Transform Unit (STU), which achieves state-of-the-art performance on language modeling, dynamical system, and differential equation benchmarks.
|
| 10:00 - 10:30 | Alexandre d'Aspremont |
Acceleration and training LLMs
Abstract. We survey various techniques imported from convex optimization and aimed at accelerating the training of large language models.
|
| 10:30 - 11:00 | Bo Han |
Exploring Trustworthy Foundation Models: Benchmarking, Finetuning, and Reasoning
Abstract. In the current landscape of machine learning, where foundation models must navigate imperfect real-world conditions such as noisy data and unexpected inputs, ensuring their trustworthiness through rigorous benchmarking, safety-focused finetuning, and robust reasoning is more critical than ever. In this talk, I will focus on three recent research advancements that collectively advance these dimensions, offering a comprehensive approach to building trustworthy foundation models. For benchmarking, I will introduce CounterAnimal, a dataset designed to systematically evaluate CLIP’s vulnerability to realistic spurious correlations, revealing that scaling models or data quality can mitigate these biases, yet scaling data alone does not effectively address them. Transitioning to finetuning, we delve deep into the process of unlearning undesirable model behaviours. We propose a general framework to examine and understand the limitations of current unlearning methods and suggest enhanced revisions for more effective unlearning. Furthermore, addressing reasoning, we investigate the reasoning robustness under noisy rationales by constructing the NoRa dataset and propose contrastive denoising with noisy chain-of-thought, a method that markedly improves denoising-reasoning capabilities by contrasting noisy inputs with minimal clean supervision.
|
| 11:00 - 11:30 | Rolf Krause |
Parallel Training of Neural Networks
Abstract. Training deep neural networks gives rise to large-scale non-convex optimization problems that are often highly sensitive to step-size and hyperparameter choices. We present the Additively Preconditioned Trust-Region Strategy (APTS), a parallel nonlinear optimization method inspired by nonlinear domain decomposition techniques for PDEs and trust-region methods for globalization. The method is based on a decomposition of the parameter space into subspaces, on which local nonlinear subproblems are defined. These subproblems yield local nonlinear corrections that are combined into a global trial step, which can be interpreted as an additive nonlinear preconditioner for the underlying minimization problem. Global convergence is achieved by embedding this step into a trust-region framework, where the full objective is evaluated to determine step acceptance and update the trust-region radius. This stabilization mechanism significantly reduces sensitivity to hyperparameter tuning. We further introduce a non-monotone variant of APTS using a windowed acceptance criterion, allowing controlled non-monotonicity to improve step acceptance in highly non-convex landscapes. Numerical results for purely data based neural network training on standard benchmarks and for PINNs demonstrate the robustness and effectiveness of the proposed approach.
|
| 11:30 - 12:00 | Akiko Takeda |
Modern BFGS Methods: From Global Superlinear Guarantees to Robust Noisy Optimization
Abstract. Recent progress in quasi-Newton analysis has renewed interest in BFGS methods with explicit global performance guarantees. In this talk, we highlight two complementary advances in modern BFGS research, spanning theory-driven acceleration and noise-robust practice. First, we introduce a generalized eigenpair-based BFGS scheme that explicitly minimizes a strict Hessian-approximation accuracy measure at each iteration, yielding non-asymptotic global superlinear convergence with a doubly-exponential rate. Second, we consider optimization with noisy objective values and propose a regularized quasi-Newton method with a noise-aware relaxation of the Armijo sufficient-decrease test; the method retains the standard $O(\varepsilon^{-2})$ global complexity for reaching first-order stationary points.
|
| 12:00 - 13:30 | Lunch Break |
| 13:30 - 14:00 | Jon Lee |
On the relationship between MESP and 0/1 D-Opt and their upper bounds
Abstract. We establish strong connections between two fundamental nonlinear 0/1 optimization problems coming from the area of experimental design, namely maximum entropy sampling and 0/1 D-Optimality. The connections are based on maps between instances, and we analyze the behavior of these maps. Using these maps, we transport basic upper-bounding methods between these two problems, and we are able to establish new domination results and other inequalities relating various basic upper bounds. Further, we establish results relating how different branch-and-bound schemes based on these maps compare. Additionally, we observe some surprising numerical results, where bounding methods that did not seem promising in their direct application to real-data MESP instances, are now useful for MESP instances that come from 0/1 D-Optimality.
|
| 14:00 - 14:30 | Yuan Zhou |
All Cyclic Group Facets Inject
Abstract. We study cut-generating functions in the setting of the Gomory–Johnson group relaxations for integer programming. We address an open question: whether every facet (extreme function) for a finite cyclic group relaxation injects into the space of extreme functions for the infinite group problem. Specifically, we show that any piecewise linear minimal function with rational breakpoints in 1/qZ and rational values at these breakpoints can be approximated by piecewise linear two-slope extreme functions while preserving all function values on 1/qZ. As a corollary, every extreme function for the finite group problem on 1/qZ is the restriction of a continuous piecewise linear two-slope extreme function for the infinite group problem, with breakpoints on a refinement 1/(Mq)Z. Combined with Gomory’s master theorem, this establishes that the infinite group problem indeed serves as the correct master problem for facets of one-row group relaxations. This is a joint work with Matthias Koeppe.
|
| 14:30 - 15:00 | Hiroshi Kera |
Chain of Thought in Order: Discovering Learning-Friendly Orders for Arithmetic
Abstract. The chain of thought is fundamental in Transformers, which is to perform step-by-step reasoning. Besides what intermediate steps work, the order of these steps critically affects the difficulty of the reasoning. This talk presents a novel task of unraveling chain of thought - reordering decoder input tokens to a learning-friendly sequence for Transformers to learn arithmetic tasks. The proposed pipeline first trains a Transformer on a mixture of target sequences arranged in different orders and then identifies benign orders as those with fast loss drops in the early stage. As the search space grows factorially with sequence length, we propose a two-stage hierarchical approach for inter- and intra-block reordering. Experiments on four order-sensitive arithmetic tasks show that our method identifies a learning-friendly order out of a few billion candidates. Notably, on the multiplication task, it recovered the reverse-digit order reported in prior studies.
|
| 15:00 - 15:30 | I-Lin Wang |
Influence Maximization Scheduling with Majority Illusion in Social Networks
Abstract. Social networks have become central platforms for word-of-mouth diffusion in marketing, e-commerce, and public health, elevating the importance of designing effective influence maximization strategies under limited resources. However, conventional influence maximization studies largely neglect the majority illusion, a cognitive bias where local neighborhood observations create a false impression of global consensus, potentially misleading both decision-makers and individuals about the true level of adoption. This research investigates how to schedule seed activations to not only maximize overall influence but also to control majority-illusion nodes, thereby improving the perceptual fidelity of diffusion outcomes in social networks. The problem is formulated as the Majority Illusion Influence Maximization Scheduling (MIB) problem, modeled through a two-stage mixed-integer programming framework under the Linear Threshold diffusion model. In Phase I, the model selects and schedules seed nodes over a finite planning horizon to maximize the number of activated nodes subject to seeding budget and timing constraints. Phase II then preserves the influence level achieved in Phase I while explicitly minimizing the number of majority-illusion nodes, thus balancing diffusion reach and perceptual accuracy. To explore structural interventions, an extended network design variant, MID, further incorporates edge modification decisions to study how network structure shapes both activation patterns and illusion phenomena. Due to the NP-hard nature and large-scale complexity of MIB, exact solutions via mixed-integer programming are tractable only for relatively small networks. To enhance scalability, three algorithmic approaches are developed: a fast greedy heuristic (GR), a genetic algorithm (GA), and a rolling horizon (RH) algorithm that decomposes the planning horizon into overlapping time windows. Hybrid warm-start strategies are also designed to feed heuristic solutions into the exact solver, accelerating convergence on medium-sized instances. Computational experiments are conducted on SNAP ego-Twitter networks with sizes ranging from 50 to 2000 nodes under different seeding budgets. Results show that RH consistently provides the best trade-off between solution quality and computation time, matching optimal mixed-integer solutions on 300-node instances with about 12% of the runtime and yielding up to 3.7 and 32 times better solutions than the exact model’s incumbent for 800- and 2000-node cases within the same time limit.
|
| 15:30 - 16:00 | Coffee Break |
| 16:00 - 16:30 | Akshay Gupte |
Adaptive MIP discretizations enhanced with GNN and Bayesian optimization for QCQPs
Abstract. We present a general framework for finding good feasible solutions, or improving incumbent solutions, to quadratically constrained quadratic programs (QCQPs) by using mixed-integer linear programs (MIPs) arising from variable discretizations. An adaptive algorithm iteratively updates the parameters of the discretized model by creating finer discretizations around the incumbent point. Our method complements existing algorithms for adaptive updates of MIP relaxations of QCQPs. We use a graph neural network (GNN) approach to leverage the intrinsic graph structure of QCQPs, and Bayesian optimization to jointly tune the parameters of GNN and adaptive algorithm. Computational experiments performed on specific types of QCQPs exhibit improved solution quality by 25% over the standard adaptive method and by 60% over a state-of-the-art global solver.
|
| 16:30 - 17:00 | Bruno Lourenço |
Convex facial reduction and nice convex sets
Abstract. In this talk, given two convex sets, we present a facial reduction algorithm that either decides that the sets do not intersect or finds minimal faces of each set containing the intersection. This can be leveraged to construct duals of convex programs that always afford strong duality even when constraint qualifications fail. Along the way, we introduce our notions of niceness of convex sets and vertical niceness of convex functions which allow us to simplify the theory. This is a joint work with Ying Lin (University of Hong Kong) and Tianxiang Liu (University of Tsukuba).
|
| 17:00 - 17:30 | Kaibo Liu |
A Neural Network-based Adaptive Sampling in Monitoring High-dimensional Processes
Abstract. Multivariate statistical process control techniques have been widely used for online monitoring in various applications. However, when monitoring high-dimensional processes, practical resource constraints such as limited data transmission bandwidth, number of sensors, and sensor battery life often restrict our ability to collect observations from all data streams. In this paper, we propose a neural network-based adaptive sampling strategy for monitoring high-dimensional processes. To adaptively select data streams for observation, it is essential to dynamically assess the importance of each stream. Accordingly, we propose learning an importance function through a neural network. By incorporating domain knowledge of process monitoring, a monotonic neural network is constructed. A key challenge in training this network is the lack of ground truth for importance assessment. To address this challenge, we simulate training data and design a novel loss function. Compared to existing adaptive sampling strategies, our method does not rely on heuristic techniques and offers enhanced scalability. With the proposed adaptive sampling strategy, a generic monitoring scheme is then developed for the monitoring of high-dimensional processes. We have conducted thorough numerical simulations and a case study, which demonstrate that our method significantly reduces detection delays and increases the likelihood of observing shifted data streams.
|
| 17:30 - 18:00 | Pierre-Louis Poirion |
A random subspace Frank-Wolfe Algorithm
Abstract. In this talk we present a random-subspace Frank–Wolfe algorithm for smooth convex optimization over uniformly convex compact domains. The method replaces the full linear minimization oracle by a low-dimensional oracle restricted to a randomly sampled subspace, substantially reducing the cost per iteration. We show that several known convergence guarantees for the classical Frank–Wolfe method extend to this randomized low-dimensional method.
|