Algorithms for efficient deep learning – Google AI Blog

Posted by Sanjiv Kumar, VP and Google Fellow, Google Research

(This is Part 4 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

The explosion in deep learning a decade ago was catapulted in part by the convergence of new algorithms and architectures, a marked increase in data, and access to greater compute. In the last 10 years, AI and ML models have become bigger and more sophisticated: they’re deeper, more complex, with more parameters, and trained on much more data, resulting in some of the most transformative outcomes in the history of machine learning.

As these models increasingly find themselves deployed in production and business applications, the efficiency and costs of these models have gone from a minor consideration to a primary constraint. In response, Google has continued to invest heavily in ML efficiency, taking on the biggest challenges in (a) efficient architectures, (b) training efficiency, (c) data efficiency, and (d) inference efficiency. Beyond efficiency, there are a number of other challenges around factuality, security, privacy and freshness in these models. Below, we highlight an array of works that demonstrate Google Research’s efforts in developing new algorithms to address the above challenges.

Efficient architectures

A fundamental question is “Are there better ways of parameterizing a model to allow for greater efficiency?” In 2022, we focused on new techniques for infusing external knowledge by augmenting models via retrieved context, on mixture of experts, and on making transformers (which lie at the heart of most large ML models) more efficient.

Context-augmented models

In the quest for higher quality and efficiency, neural models can be augmented with external context from large databases or trainable memory. By leveraging retrieved context, a neural network may not have to memorize the huge amount of world knowledge within its internal parameters, leading to better parameter efficiency, interpretability and factuality.

In “Decoupled Context Processing for Context Augmented Language Modeling”, we explored a simple architecture for incorporating external context into language models based on a decoupled encoder-decoder architecture. This led to significant computational savings while giving competitive results on auto-regressive language modeling and open domain question answering tasks. However, pre-trained large language models (LLMs) absorb a significant amount of information through self-supervision on large training sets, and it is unclear precisely how the “world knowledge” of such models interacts with the presented context. With knowledge aware fine-tuning (KAFT), we strengthen both controllability and robustness of LLMs by incorporating counterfactual and irrelevant contexts into standard supervised datasets.

An encoder-decoder cross-attention mechanism for context incorporation that allows decoupling of context encoding from language model inference, leading to efficient context-augmented models.

One of the questions in the quest for a modular deep network is how a database of concepts with corresponding computational modules could be designed. We proposed a theoretical architecture that would “remember events” in the form of sketches stored in an external LSH table with pointers to modules that process such sketches.

Another challenge in context-augmented models is fast retrieval of information from a large database on accelerators. We have developed a TPU-based similarity search algorithm that aligns with the performance model of TPUs and gives analytical guarantees on expected recall, achieving peak performance. Search algorithms typically involve a large number of hyperparameters and design choices that make it hard to tune them on new tasks. We have proposed a new constrained optimization algorithm for automating hyperparameter tuning. Taking the desired cost or recall as a fixed input, the proposed algorithm generates tunings that empirically are very close to the speed-recall Pareto frontier and give leading performance on standard benchmarks.
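To make the notion of a speed-recall Pareto frontier concrete, here is a minimal Python sketch that filters a set of candidate tunings down to the non-dominated ones. It is not the constrained optimization algorithm described above; the `(cost, recall)` pairs and the function name `pareto_frontier` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def pareto_frontier(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the (cost, recall) points that no other point dominates.

    A point dominates another if its cost is no higher and its recall is no
    lower, with at least one of the two strictly better.
    """
    frontier = []
    best_recall = float("-inf")
    # Visit points from cheapest to most expensive; on equal cost, try the
    # higher-recall point first.
    for cost, recall in sorted(points, key=lambda p: (p[0], -p[1])):
        # A more expensive tuning is only worth keeping if it improves recall.
        if recall > best_recall:
            frontier.append((cost, recall))
            best_recall = recall
    return frontier


# Hypothetical tunings measured as (relative cost, recall).
tunings = [(1.0, 0.80), (2.0, 0.90), (2.5, 0.85), (4.0, 0.97), (3.0, 0.90)]
frontier = pareto_frontier(tunings)
```

A tuning procedure is "close to the frontier" when, for a requested cost or recall, the tuning it returns is not far from one of these non-dominated points.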

Mixture-of-experts models

Mixture-of-experts (MoE) models have proven to be an effective means of increasing neural network model capacity without overly increasing their computational cost. The basic idea of MoEs is to construct a network from a number of expert sub-networks, where each input is processed by a suitable subset of experts. Thus, compared to a standard neural network, MoEs invoke only a small portion of the overall model, resulting in high efficiency as shown in language model applications such as GLaM.

The architecture of GLaM, where each input token is dynamically routed to two selected expert networks out of 64 for prediction.
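The routing idea in the caption can be sketched in a few lines of Python. This is a toy illustration, not GLaM's implementation: the experts are hypothetical scalar functions, the router logits are given by hand, and renormalizing the two selected gate values is an assumption made here for simplicity.

```python
import math
from typing import Callable, List, Tuple


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits: List[float]) -> List[Tuple[int, float]]:
    """Pick the two highest-scoring experts and renormalize their gates."""
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


def moe_layer(x: float, router_logits: List[float],
              experts: List[Callable[[float], float]]) -> float:
    """Evaluate only the selected experts and mix their outputs."""
    return sum(gate * experts[i](x) for i, gate in top2_route(router_logits))


def make_expert(scale: float) -> Callable[[float], float]:
    # Hypothetical toy expert: scales its input by a constant.
    return lambda x: scale * x


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]
routes = top2_route(router_logits)
output = moe_layer(1.0, router_logits, experts)
```

Only two of the four experts are evaluated for this input, which is where the compute savings come from: the parameter count grows with the number of experts while the per-token work stays roughly constant.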

The decision of which experts should be active for a given input is determined by a routing function, the design of which is challenging, since one would like to prevent both under- and over-utilization of each expert. In a recent work, we proposed Expert Choice Routing, a new routing mechanism that, instead of assigning each input token to the top-k experts, assigns each expert to the top-k tokens. This automatically ensures load-balancing of experts while also naturally allowing for an input token to be handled by multiple experts.

Expert Choice Routing. Experts with predetermined buffer capacity are assigned top-k tokens, thus guaranteeing even load balancing. Each token can be processed by a variable number of experts.
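The inversion at the heart of Expert Choice Routing, experts choosing tokens rather than tokens choosing experts, can be illustrated with a small Python sketch. The affinity scores and the function name `expert_choice_route` are hypothetical, and the sketch leaves out how the scores are learned and how the expert outputs are combined.

```python
from typing import Dict, List


def expert_choice_route(scores: List[List[float]], capacity: int) -> Dict[int, List[int]]:
    """Let every expert pick its `capacity` highest-scoring tokens.

    scores[t][e] is the token-to-expert affinity of token t for expert e.
    Returns a mapping from expert index to the token indices it processes.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = ranked[:capacity]  # every expert gets exactly `capacity` tokens
    return assignment


# Hypothetical affinities for 4 tokens and 2 experts.
scores = [
    [0.9, 0.8],  # token 0 is attractive to both experts
    [0.7, 0.1],
    [0.2, 0.6],
    [0.1, 0.2],  # token 3 is attractive to neither
]
assignment = expert_choice_route(scores, capacity=2)
experts_per_token = [
    sum(t in tokens for tokens in assignment.values()) for t in range(len(scores))
]
```

Every expert ends up with exactly `capacity` tokens, so load is balanced by construction, while the number of experts per token varies: in this toy input one token is picked twice and one is not picked at all.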

Efficient transformers

Transformers are popular sequence-to-sequence models that have shown remarkable success in a range of challenging problems from vision to natural language understanding. A central component of such models is the attention layer, which identifies the similarity between “queries” and “keys”, and uses these to construct a suitable weighted combination of “values”. While effective, attention mechanisms have poor (i.e., quadratic) scaling with sequence length.
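For readers who want to see where the quadratic cost comes from, the following is a minimal pure-Python sketch of standard scaled dot-product attention. The scaling by the square root of the key dimension is the usual convention rather than something stated above, and the function names are chosen for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Return, for each query, a similarity-weighted combination of the values."""
    dim = len(keys[0])
    outputs = []
    for q in queries:  # one pass per query ...
        # ... and one similarity per key: n queries x n keys = quadratic work.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 2.0]])
```

The nested loop over queries and keys is the part that efficient-transformer methods try to avoid or approximate.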

As the scale of transformers continues to grow, it is interesting to study whether there are any naturally occurring structures or patterns in the learned models that may help us decipher how they work. Towards that, we studied the learned embeddings in intermediate MLP layers, revealing that they are very sparse; e.g., T5-Large models have <1% nonzero entries. Sparsity further suggests that we can potentially reduce FLOPs without affecting model performance.

We recently proposed Treeformer, an alternative to standard attention computation that relies on decision trees. Intuitively, this quickly identifies a small subset of keys that are relevant for a query and only performs the attention operation on this set. Empirically, the Treeformer can lead to a 30x reduction in FLOPs for the attention layer. We also introduced Sequential Attention, a differentiable feature selection method that combines attention with a greedy algorithm. This technique has strong provable guarantees for linear models and scales seamlessly to large embedding models.

In Treeformer, attention computation is modeled as a nearest neighbor retrieval problem. Hierarchical decision trees are used to find which keys to pay attention to for each query, reducing the quadratic cost of classical attention substantially.

Another way to make transformers efficient is by making the softmax computations faster in the attention layer. Building on our previous work on low-rank approximation of the softmax kernel, we proposed a new class of random features that provides the first “positive and bounded” random feature approximation of the softmax kernel and is computationally linear in the sequence length. We also proposed the first approach for incorporating various attention masking mechanisms, such as causal and relative position encoding, in a scalable manner (i.e., sub-quadratic with relation to the input sequence length).
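As background for the random-feature idea, the sketch below shows the earlier style of positive random features for the softmax kernel, which rests on the identity exp(q·k) = E[exp(w·q − |q|²/2) · exp(w·k − |k|²/2)] for w drawn from a standard Gaussian. It does not implement the newer "positive and bounded" features mentioned above; the dimensions, seed and function names are assumptions made for illustration.

```python
import math
import random
from typing import List

Vector = List[float]


def positive_features(x: Vector, projections: List[Vector]) -> Vector:
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m); every entry is positive."""
    m = len(projections)
    sq_norm = sum(v * v for v in x)
    return [
        math.exp(sum(wi * xi for wi, xi in zip(w, x)) - sq_norm / 2.0) / math.sqrt(m)
        for w in projections
    ]


def approx_softmax_kernel(q: Vector, k: Vector, projections: List[Vector]) -> float:
    """Estimate exp(q . k) as an inner product of random feature maps."""
    phi_q = positive_features(q, projections)
    phi_k = positive_features(k, projections)
    return sum(a * b for a, b in zip(phi_q, phi_k))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
dim, num_features = 2, 4000
projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]

q, k = [0.3, 0.1], [0.2, -0.4]
exact = math.exp(sum(a * b for a, b in zip(q, k)))
approx = approx_softmax_kernel(q, k, projections)
```

Because the kernel factorizes into a product of per-token feature vectors, the key and value summaries can be accumulated once and reused for every query, which is what makes the resulting attention linear rather than quadratic in sequence length.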

Training efficiency

Efficient optimization methods are the cornerstone of modern ML applications and are particularly crucial in large scale settings. In such settings, even first order adaptive methods like Adam are often expensive, and training stability becomes challenging. In addition, these approaches are often agnostic to the architecture of the neural network, thereby ignoring its rich structure, which leads to inefficient training. This motivates new techniques to more efficiently and effectively optimize modern neural network models. We are developing new architecture-aware training techniques, e.g., for training transformer networks, including new scale-invariant transformer networks and novel clipping methods that, when combined with vanilla stochastic gradient descent (SGD), result in faster training. Using this approach, for the first time, we were able to effectively train BERT using simple SGD without the need for adaptivity.

Moreover, with LocoProp we proposed a new method that achieves performance similar to that of a second-order optimizer while using the same computational and memory resources as a first-order optimizer. LocoProp takes a modular view of neural networks by decomposing them into a composition of layers. Each layer is then allowed to have its own loss function as well as output target and weight regularizer. With this setup, after a suitable forward-backward pass, LocoProp proceeds to perform parallel updates to each layer’s “local loss”. In fact, these updates can be shown to resemble those of higher-order optimizers, both theoretically and empirically. On a deep autoencoder benchmark, LocoProp achieves performance comparable to that of higher-order optimizers while being significantly faster.

Similar to backpropagation, LocoProp applies a forward pass to compute the activations. In the backward pass, LocoProp sets per neuron “targets” for each layer. Finally, LocoProp splits model training into independent problems across layers where several local updates can be applied to each layer’s weights in parallel.

One key assumption in optimizers like SGD is that each data point is sampled independently and identically from a distribution. This is unfortunately hard to satisfy in practical settings such as reinforcement learning, where the model (or agent) has to learn from data generated based on its own predictions. We proposed a new algorithmic approach named SGD with reverse experience replay, which finds optimal solutions in several settings like linear dynamical systems, non-linear dynamical systems, and in Q-learning for reinforcement learning. Furthermore, an enhanced version of this method — IER — turns out to be the state of the art and is the most stable experience replay technique on a variety of popular RL benchmarks.
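The intuition for replaying experience in reverse order can be seen in a toy tabular setting. The sketch below is not the published algorithm: it assumes a single-action chain of states, a hypothetical buffer of `(state, reward, next_state)` transitions, and a learning rate of 1, and simply compares one pass over the buffer in forward versus reverse order.

```python
from typing import List, Tuple

Transition = Tuple[int, float, int]  # (state, reward, next_state)


def replay(buffer: List[Transition], num_states: int, order: str,
           gamma: float = 0.9, lr: float = 1.0) -> List[float]:
    """Run one pass of temporal-difference updates over the buffer."""
    q = [0.0] * num_states
    sequence = reversed(buffer) if order == "reverse" else buffer
    for state, reward, next_state in sequence:
        target = reward + gamma * q[next_state]  # bootstrapped target
        q[state] += lr * (target - q[state])
    return q


# One trajectory 0 -> 1 -> 2 -> 3 -> 4; only the last step is rewarded and
# state 4 is terminal (its value stays at zero).
buffer = [(0, 0.0, 1), (1, 0.0, 2), (2, 0.0, 3), (3, 1.0, 4)]
forward_q = replay(buffer, num_states=5, order="forward")
reverse_q = replay(buffer, num_states=5, order="reverse")
```

In forward order only the state next to the reward learns anything in a single pass, whereas in reverse order the reward information propagates along the whole trajectory, since each update bootstraps from a value that has just been refreshed.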

Data efficiency

For many tasks, deep neural networks heavily rely on large datasets. In addition to the storage costs and potential security/privacy concerns that come along with large datasets, training modern deep neural networks on such datasets incurs high computational costs. One promising way to solve this problem is with data subset selection, where the learner aims to find the most informative subset from a large number of training samples to approximate (or even improve upon) training with the entire training set.

We analyzed a subset selection framework designed to work with arbitrary model families in a practical batch setting. In such a setting, a learner can sample examples one at a time, accessing both the context and true label, but in order to limit overhead costs, is only able to update its state (i.e., further train model weights) once a large enough batch of examples is selected. We developed an algorithm, called IWeS, that selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. We provide a theoretical analysis, proving generalization and sampling rate bounds.
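A rough Python sketch of entropy-driven importance sampling follows. It is only in the spirit of the description above: the exact sampling probability used by IWeS is not given here, so the normalized-entropy rule, the `floor` parameter and the function names are all assumptions made for illustration.

```python
import math
import random
from typing import List, Tuple


def entropy(probs: List[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0)


def sampling_probability(probs: List[float], floor: float = 0.05) -> float:
    """Assumed rule: normalized predictive entropy, kept above a small floor."""
    max_entropy = math.log(len(probs))
    return max(floor, entropy(probs) / max_entropy)


def select_batch(predictions: List[List[float]], rng: random.Random) -> List[Tuple[int, float]]:
    """Return (example_index, importance_weight) pairs for the selected examples."""
    selected = []
    for i, probs in enumerate(predictions):
        p = sampling_probability(probs)
        if rng.random() < p:
            # Weight by 1/p so that the weighted loss stays unbiased.
            selected.append((i, 1.0 / p))
    return selected


# Hypothetical class probabilities from a model trained on earlier batches.
predictions = [[0.5, 0.5], [0.99, 0.01], [0.6, 0.4], [0.9, 0.1]]
selected = select_batch(predictions, random.Random(0))
```

Examples on which the current model is uncertain are kept with high probability, while confidently predicted examples are mostly skipped and, when kept, receive a larger weight to compensate.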

Another concern with training large networks is that they can be highly sensitive to distribution shifts between training data and data seen at deployment time, especially when working with limited amounts of training data that might not cover all deployment-time scenarios. A recent line of work has hypothesized “extreme simplicity bias” as the key issue behind this brittleness of neural networks. Our latest work makes this hypothesis actionable, leading to two new complementary approaches, DAFT and FRR, that when combined provide significantly more robust neural networks. In particular, these two approaches use adversarial fine-tuning along with inverse feature predictions to make the learned network robust.

Inference efficiency

Increasing the size of neural networks has proven surprisingly effective in improving their predictive accuracy. However, it is challenging to realize these gains in the real world, as the inference costs of large models may be prohibitively high for deployment. This motivates strategies to improve serving efficiency without sacrificing accuracy. In 2022, we studied different strategies to achieve this, notably those based on knowledge distillation and adaptive computation.

Distillation

Distillation is a simple yet effective method for model compression, which greatly expands the potential applicability of large neural models. Distillation has proved widely effective in a range of practical applications, such as ads recommendation. Most use-cases of distillation involve a direct application of the basic recipe to the given domain, with limited understanding of when and why this ought to work. Our research this year has looked at tailoring distillation to specific settings and formally studying the factors that govern the success of distillation.
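As a reference point for the "basic recipe", a common form of the distillation objective is the KL divergence between temperature-softened teacher and student distributions. The Python sketch below shows that standard loss only; the temperature value and function names are illustrative, and it omits the usual mixing with a ground-truth label loss.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float], teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) between temperature-softened distributions."""
    teacher = softmax([z / temperature for z in teacher_logits])
    student = softmax([z / temperature for z in student_logits])
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)


teacher_logits = [4.0, 1.0, 0.5]
loss_matched = distillation_loss(teacher_logits, teacher_logits)
loss_mismatched = distillation_loss([0.5, 1.0, 4.0], teacher_logits)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge; a higher temperature exposes more of the teacher's relative preferences among the non-top classes.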

On the algorithmic side, by carefully modeling the noise in the teacher labels, we developed a principled approach to reweight the training examples, and a robust method to sample a subset of data to have the teacher label. In “Teacher Guided Training”, we presented a new distillation framework: rather than passively using the teacher to annotate a fixed dataset, we actively use the teacher to guide the selection of informative samples to annotate. This makes the distillation process shine in limited data or long-tail settings.

We also researched new recipes for distillation from a cross-encoder (e.g., BERT) to a factorized dual-encoder, an important setting for the task of scoring the relevance of a [query, document] pair. We studied the reasons for the performance gap between cross- and dual-encoders, noting that this can be the result of generalization rather than capacity limitation in dual-encoders. The careful construction of the loss function for distillation can mitigate this and reduce the gap between cross- and dual-encoder performance. Subsequently, in EmbedDistill, we looked at further improving dual-encoder distillation by matching embeddings from the teacher model. This strategy can also be used to distill from a large to small dual-encoder model, wherein inheriting and freezing the teacher’s document embeddings can prove highly effective.

In EmbedDistill, teacher to student distillation is done by designing new loss functions that match the geometry of student embeddings with that of the teacher in addition to matching the final predictions.
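To illustrate why matching embeddings adds information beyond matching predictions, here is a small Python sketch of a combined loss for a dual-encoder. It is a simplified stand-in rather than the EmbedDistill objective: it assumes teacher and student embeddings share the same dimension, uses squared error for both terms, and the names and the `embed_weight` parameter are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def embedding_matching_loss(student_q: Vector, student_d: Vector,
                            teacher_q: Vector, teacher_d: Vector,
                            embed_weight: float = 1.0) -> float:
    """Score-matching term plus an embedding-matching term."""
    # Match the final relevance score of the [query, document] pair.
    score_loss = (dot(student_q, student_d) - dot(teacher_q, teacher_d)) ** 2
    # Additionally pull student embeddings towards the teacher's.
    embed_loss = squared_distance(student_q, teacher_q) + squared_distance(student_d, teacher_d)
    return score_loss + embed_weight * embed_loss


teacher_q, teacher_d = [1.0, 0.0], [0.5, 0.5]
# A student whose embeddings are the teacher's rotated by 90 degrees gives the
# same score, yet sits in a different place in embedding space.
rotated_q, rotated_d = [0.0, 1.0], [-0.5, 0.5]
loss_same = embedding_matching_loss(teacher_q, teacher_d, teacher_q, teacher_d)
loss_rotated = embedding_matching_loss(rotated_q, rotated_d, teacher_q, teacher_d)
```

The rotated student has zero score-matching error but a non-zero embedding term, which is the extra signal that makes it possible, for example, to reuse the teacher's document embeddings directly.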

On the theoretical side, we provided a new perspective on distillation through the lens of supervision complexity, a measure of how well the student can predict the teacher labels. Drawing on neural tangent kernel (NTK) theory, this offers conceptual insights, such as the fact that a capacity gap may affect distillation because such teachers’ labels may appear akin to purely random labels to the student. We further demonstrated that distillation can cause the student to underfit points the teacher model finds “hard” to model. Intuitively, this may help the student focus its limited capacity on those samples that it can reasonably model.

Adaptive computation

While distillation is an effective means of reducing inference cost, it does so uniformly across all samples. Intuitively however, some “easy” samples may inherently require less compute than the “hard” samples. The goal of adaptive compute is to design mechanisms that enable such sample-dependent computation.

Confident Adaptive Language Modeling (CALM) introduced a controlled early-exit functionality to Transformer-based text generators such as T5. In this form of adaptive computation, the model dynamically modifies the number of transformer layers that it uses per decoding step. The early-exit gates use a confidence measure with a decision threshold that is calibrated to satisfy statistical performance guarantees. In this way, the model needs to compute the full stack of decoder layers for only the most challenging predictions. Easier predictions only require computing a few decoder layers. In practice, the model uses about a third of the layers for prediction on average, yielding 2–3x speed-ups while preserving the same level of generation quality.

Text generation with a regular language model (top) and with CALM (bottom). CALM attempts to make early predictions. Once confident enough (darker blue tones), it skips ahead and saves time.
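The early-exit loop for a single decoding step can be sketched as follows. This toy Python version is not CALM itself: it uses the top softmax probability as a stand-in confidence measure, treats the hidden state as if it were the output logits, uses hypothetical "layers" that merely sharpen those logits, and does not show how the threshold is calibrated.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_step(hidden: Vector, layers: List[Callable[[Vector], Vector]],
                    threshold: float) -> Tuple[int, int]:
    """Apply layers one at a time and stop once the prediction is confident.

    Returns (predicted_token, number_of_layers_used).
    """
    probs = softmax(hidden)
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(hidden)  # toy setup: the hidden state doubles as logits
        if max(probs) >= threshold:
            return probs.index(max(probs)), depth  # exit early, skip remaining layers
    return probs.index(max(probs)), len(layers)


def sharpen(h: Vector) -> Vector:
    # Hypothetical layer: each application makes the prediction more peaked.
    return [2.0 * v for v in h]


layers = [sharpen] * 4
easy = early_exit_step([1.0, 0.5, 0.0], layers, threshold=0.8)
hard = early_exit_step([1.0, 0.5, 0.0], layers, threshold=1.1)  # unreachable threshold
```

With a reachable threshold the step stops after a couple of layers; when the threshold is never met, the full stack is used, which mirrors the behavior described for the most challenging predictions.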

One popular adaptive compute mechanism is a cascade of two or more base models. A key issue in using cascades is deciding whether to simply use the current model’s predictions, or whether to defer prediction to a downstream model. Learning when to defer requires designing a suitable loss function, which can leverage appropriate signals to act as supervision for the deferral decision. We formally studied existing loss functions for this goal, demonstrating that they may underfit the training sample owing to an implicit application of label smoothing. We showed that one can mitigate this with post-hoc training of a deferral rule, which does not require modifying the model internals in any way.
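A two-model cascade with the simplest possible deferral rule, a confidence threshold on the small model, looks like the Python sketch below. The learned post-hoc deferral rule discussed above would replace the threshold test; the two toy models and all names here are hypothetical.

```python
from typing import Callable, List, Tuple

Model = Callable[[float], List[float]]  # maps an input to class probabilities


def cascade_predict(x: float, small_model: Model, large_model: Model,
                    threshold: float) -> Tuple[int, str]:
    """Use the small model when it is confident; otherwise defer to the large one."""
    probs = small_model(x)
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), "small"
    probs = large_model(x)  # the expensive call happens only on deferral
    return probs.index(max(probs)), "large"


def small_model(x: float) -> List[float]:
    # Hypothetical cheap model: unsure near the decision boundary at 0.
    if abs(x) < 1.0:
        return [0.55, 0.45]
    return [0.95, 0.05] if x < 0 else [0.05, 0.95]


def large_model(x: float) -> List[float]:
    # Hypothetical expensive model: confident everywhere.
    return [0.9, 0.1] if x < 0 else [0.1, 0.9]


easy = cascade_predict(2.0, small_model, large_model, threshold=0.8)
hard = cascade_predict(0.5, small_model, large_model, threshold=0.8)
```

Inputs far from the boundary are answered by the small model alone, while the ambiguous input is passed on, so the average cost depends on how often deferral is triggered.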

For retrieval applications, standard semantic search techniques use a fixed representation for each embedding generated by a large model. That is, irrespective of the downstream task and its associated compute environment or constraints, the representation size and capability are mostly fixed. Matryoshka representation learning (MRL) introduces flexibility to adapt representations according to the deployment environment. That is, it forces representations to have a natural ordering within their coordinates such that for resource constrained environments, we can use only the top few coordinates of the representation, while for richer and precision-critical settings, we can use more coordinates of the representation. When combined with standard approximate nearest neighbor search techniques like ScaNN, MRL is able to provide up to 16x lower compute with the same recall and accuracy metrics.
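One way such ordered representations can be used at serving time is to shortlist candidates with only the leading coordinates and then re-rank the shortlist with the full vectors. The Python sketch below illustrates that pattern with hand-written toy vectors; it assumes the embeddings were trained so that prefixes are meaningful, and the brute-force sort is a stand-in for an approximate nearest neighbor library.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    num = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / norm if norm > 0 else 0.0


def search(query: Vector, database: List[Vector],
           shortlist_dims: int, shortlist_size: int) -> Tuple[int, List[int]]:
    """Shortlist with the first few coordinates, then re-rank with all of them."""
    coarse = sorted(
        range(len(database)),
        key=lambda i: cosine(query[:shortlist_dims], database[i][:shortlist_dims]),
        reverse=True,
    )
    shortlist = coarse[:shortlist_size]  # cheap pass over the whole database
    best = max(shortlist, key=lambda i: cosine(query, database[i]))  # full-precision pass
    return best, shortlist


# Hypothetical 4-dimensional embeddings.
query = [1.0, 0.0, 0.5, 0.0]
database = [
    [1.0, 0.1, -0.5, 0.0],  # similar in the first two coordinates only
    [0.9, 0.0, 0.5, 0.1],   # similar in all coordinates
    [0.0, 1.0, 0.5, 0.0],
    [-1.0, 0.0, 0.0, 0.0],
]
best, shortlist = search(query, database, shortlist_dims=2, shortlist_size=2)
```

The expensive full-dimensional comparison is applied to only `shortlist_size` candidates, so most of the work is done at the lower dimensionality.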

Concluding thoughts

Large ML models are showing transformational outcomes in several domains, but efficiency in both training and inference is emerging as a critical need to make these models practical in the real world. Google Research has been investing significantly in making large ML models efficient by developing new foundational techniques. This is an ongoing effort, and over the next several months we will continue to explore core challenges to make ML models even more robust and efficient.

Acknowledgements

The work in efficient deep learning is a collaboration among many researchers from Google Research, including Amr Ahmed, Ehsan Amid, Rohan Anil, Mohammad Hossein Bateni, Gantavya Bhatt, Srinadh Bhojanapalli, Zhifeng Chen, Felix Chern, Gui Citovsky, Andrew Dai, Andy Davis, Zihao Deng, Giulia DeSalvo, Nan Du, Avi Dubey, Matthew Fahrbach, Ruiqi Guo, Blake Hechtman, Yanping Huang, Prateek Jain, Wittawat Jitkrittum, Seungyeon Kim, Ravi Kumar, Aditya Kusupati, James Laudon, Quoc Le, Daliang Li, Zonglin Li, Lovish Madaan, David Majnemer, Aditya Menon, Don Metzler, Vahab Mirrokni, Vaishnavh Nagarajan, Harikrishna Narasimhan, Rina Panigrahy, Srikumar Ramalingam, Ankit Singh Rawat, Sashank Reddi, Aniket Rege, Afshin Rostamizadeh, Tal Schuster, Si Si, Apurv Suman, Phil Sun, Erik Vee, Ke Ye, Chong You, Felix Yu, Manzil Zaheer, and Yanqi Zhou.