[ad_1]
To control the distribution shift practical experience by learning-centered controllers, we request a mechanism for constraining the agent to regions of large knowledge density in the course of its trajectory (left). Here, we current an method which achieves this goal by combining characteristics of density products (center) and Lyapunov functions (suitable).
In purchase to make use of device understanding and reinforcement finding out in managing real world units, we must style algorithms which not only reach fantastic performance, but also interact with the technique in a safe and sound and dependable manner. Most prior get the job done on safety-crucial management focuses on sustaining the protection of the actual physical process, e.g. preventing slipping above for legged robots, or colliding into hurdles for autonomous motor vehicles. Nevertheless, for discovering-primarily based controllers, there is a further resource of basic safety concern: because machine studying products are only optimized to output right predictions on the teaching knowledge, they are susceptible to outputting erroneous predictions when evaluated on out-of-distribution inputs. Therefore, if an agent visits a condition or normally takes an action that is quite different from individuals in the teaching knowledge, a finding out-enabled controller might “exploit” the inaccuracies in its acquired component and output steps that are suboptimal or even dangerous.
To reduce these possible “exploitations” of product inaccuracies, we propose a new framework to purpose about the protection of a finding out-dependent controller with respect to its schooling distribution. The central plan driving our function is to watch the education data distribution as a security constraint, and to draw on tools from control principle to command the distributional change skilled by the agent during shut-loop control. More precisely, we’ll go over how Lyapunov stability can be unified with density estimation to deliver Lyapunov density versions, a new form of protection “barrier” purpose which can be used to synthesize controllers with guarantees of maintaining the agent in locations of substantial knowledge density. In advance of we introduce our new framework, we will to start with give an overview of existing tactics for guaranteeing physical security by means of barrier operate.
In regulate theory, a central subject of research is: provided acknowledged process dynamics, $s_t+1=f(s_t, a_t)$, and recognised method constraints, $s in C$, how can we style a controller that is assured to maintain the procedure in the specified constraints? Below, $C$ denotes the set of states that are deemed protected for the agent to take a look at. This dilemma is hard mainly because the specified constraints need to have to be happy around the agent’s full trajectory horizon ($s_t in C$ $forall 0leq t leq T$). If the controller employs a easy “greedy” technique of steering clear of constraint violations in the upcoming time action (not taking $a_t$ for which $f(s_t, a_t) notin C$), the process may nevertheless finish up in an “irrecoverable” state, which itself is considered risk-free, but will inevitably lead to an unsafe condition in the long run no matter of the agent’s long term steps. In buy to steer clear of browsing these “irrecoverable” states, the controller must use a a lot more “long-horizon” approach which involves predicting the agent’s total upcoming trajectory to stay clear of protection violations at any issue in the foreseeable future (stay away from $a_t$ for which all feasible $ a_hatt _hatt=t+1^H$ lead to some $bart$ where by $s_bart notin C$ and $t
Control theorists tackle this challenge by designing “barrier” functions, $v(s)$, to constrain the controller at each step (only allow $a_t$ which satisfy $v(f(s_t, a_t)) leq 0$). In order to ensure the agent remains safe throughout its entire trajectory, the constraint induced by barrier functions ($v(f(s_t, a_t))leq 0$) prevents the agent from visiting both unsafe states and irrecoverable states which inevitably lead to unsafe states in the future. This strategy essentially amortizes the computation of looking into the future for inevitable failures when designing the safety barrier function, which only needs to be done once and can be computed offline. This way, at runtime, the policy only needs to employ the greedy constraint satisfaction strategy on the barrier function $v(s)$ in order to ensure safety for all future timesteps.
Here, we used the notion of a “barrier” function as an umbrella term to describe a number of different kinds of functions whose functionalities are to constrain the controller in order to make long-horizon guarantees. Some specific examples include control Lyapunov functions for guaranteeing stability, control barrier functions for guaranteeing general safety constraints, and the value function in Hamilton-Jacobi reachability for guaranteeing general safety constraints under external disturbances. More recently, there has also been some work on learning barrier functions, for settings where the system is unknown or where barrier functions are difficult to design. However, prior works in both traditional and learning-based barrier functions are mainly focused on making guarantees of physical safety. In the next section, we will discuss how we can extend these ideas to regulate the distribution shift experienced by the agent when using a learning-based controller. To prevent model exploitation due to distribution shift, many learning-based control algorithms constrain or regularize the controller to prevent the agent from taking low-likelihood actions or visiting low likelihood states, for instance in offline RL, model-based RL, and imitation learning. However, most of these methods only constrain the controller with a single-step estimate of the data distribution, akin to the “greedy” strategy of keeping an autonomous drone safe by preventing actions which causes it to crash in the next timestep. As we saw in the illustrative figures above, this strategy is not enough to guarantee that the drone will not crash (or go out-of-distribution) in another future timestep. How can we design a controller for which the agent is guaranteed to stay in-distribution for its entire trajectory? Recall that barrier functions can be used to guarantee constraint satisfaction for all future timesteps, which is exactly the kind of guarantee we hope to make with regards to the data distribution. Based on this observation, we propose a new kind of barrier function: the Lyapunov density model (LDM), which merges the dynamics-aware aspect of a Lyapunov function with the data-aware aspect of a density model (it is in fact a generalization of both types of function). Analogous to how Lyapunov functions keeps the system from becoming physically unsafe, our Lyapunov density model keeps the system from going out-of-distribution. An LDM ($G(s, a)$) maps state and action pairs to negative log densities, where the values of $G(s, a)$ represent the best data density the agent is able to stay above throughout its trajectory. It can be intuitively thought of as a “dynamics-aware, long-horizon” transformation on a single-step density model ($E(s, a)$), where $E(s, a)$ approximates the negative log likelihood of the data distribution. Since a single-step density model constraint ($E(s, a) leq -log(c)$ where $c$ is a cutoff density) might still allow the agent to visit “irrecoverable” states which inevitably causes the agent to go out-of-distribution, the LDM transformation increases the value of those “irrecoverable” states until they become “recoverable” with respect to their updated value. As a result, the LDM constraint ($G(s, a) leq -log(c)$) restricts the agent to a smaller set of states and actions which excludes the “irrecoverable” states, thereby ensuring the agent is able to remain in high data-density regions throughout its entire trajectory.
How exactly does this “dynamics-aware, long-horizon” transformation work? Given a data distribution $P(s, a)$ and dynamical system $s_t+1 = f(s_t, a_t)$, we define the following as the LDM operator: $mathcalTG(s, a) = max-log P(s, a), min_a’ G(f(s, a), a’)$. Suppose we initialize $G(s, a)$ to be $-log P(s, a)$. Under one iteration of the LDM operator, the value of a state action pair, $G(s, a)$, can either remain at $-log P(s, a)$ or increase in value, depending on whether the value at the best state action pair in the next timestep, $min_a’ G(f(s, a), a’)$, is larger than $-log P(s, a)$. Intuitively, if the value at the best next state action pair is larger than the current $G(s, a)$ value, this means that the agent is unable to remain at the current density level regardless of its future actions, making the current state “irrecoverable” with respect to the current density level. By increasing the current the value of $G(s, a)$, we are “correcting” the LDM such that its constraints would not include “irrecoverable” states. Here, one LDM operator update captures the effect of looking into the future for one timestep. If we repeatedly apply the LDM operator on $G(s, a)$ until convergence, the final LDM will be free of “irrecoverable” states in the agent’s entire future trajectory. To use an LDM in control, we can train an LDM and learning-based controller on the same training dataset and constrain the controller’s action outputs with an LDM constraint ($G(s, a)) leq -log(c)$). Because the LDM constraint prevents both states with low density and “irrecoverable” states, the learning-based controller will be able to avoid out-of-distribution inputs throughout the agent’s entire trajectory. Furthermore, by choosing the cutoff density of the LDM constraint, $c$, the user is able to control the tradeoff between protecting against model error vs. flexibility for performing the desired task.
So far, we have only discussed the properties of a “perfect” LDM, which can be found if we had oracle access to the data distribution and dynamical system. In practice, though, we approximate the LDM using only data samples from the system. This causes a problem to arise: even though the role of the LDM is to prevent distribution shift, the LDM itself can also suffer from the negative effects of distribution shift, which degrades its effectiveness for preventing distribution shift. To understand the degree to which the degradation happens, we analyze this problem from both a theoretical and empirical perspective. Theoretically, we show even if there are errors in the LDM learning procedure, an LDM constrained controller is still able to maintain guarantees of keeping the agent in-distribution. Albeit, this guarantee is a bit weaker than the original guarantee provided by a perfect LDM, where the amount of degradation depends on the scale of the errors in the learning procedure. Empirically, we approximate the LDM using deep neural networks, and show that using a learned LDM to constrain the learning-based controller still provides performance improvements compared to using single-step density models on several domains.
Currently, one of the biggest challenges in deploying learning-based controllers on real world systems is their potential brittleness to out-of-distribution inputs, and lack of guarantees on performance. Conveniently, there exists a large body of work in control theory focused on making guarantees about how systems evolve. However, these works usually focus on making guarantees with respect to physical safety requirements, and assume access to an accurate dynamics model of the system as well as physical safety constraints. The central idea behind our work is to instead view the training data distribution as a safety constraint. This allows us to make use of these ideas in controls in our design of learning-based control algorithms, thereby inheriting both the scalability of machine learning and the rigorous guarantees of control theory. This post is based on the paper “Lyapunov Density Models: Constraining Distribution Shift in Learning-Based Control”, presented at ICML 2022. You
Illustrative example of a drone whose goal is to fly as straight as possible while avoiding obstacles. Using the “greedy” strategy of avoiding safety violations (left), the drone flies straight because there’s no obstacle in the next timestep, but inevitably crashes in the future because it can’t turn in time. In contrast, using the “long-horizon” strategy (right), the drone turns early and successfully avoids the tree, by considering the entire future horizon future of its trajectory.
The blue region denotes the of states allowed by the barrier function constraint, $s $. Using a “long-horizon” barrier function, the drone only needs to greedily ensure that the barrier function constraint $v(s) leq 0$ is satisfied for the next state, in order to avoid safety violations for all future timesteps.
Example of data distributions (middle) and their associated LDMs (right) for a 2D linear system (left). LDMs can be viewed as “dynamics-aware, long-horizon” transformations on density models.
Example evaluation of ours and baseline methods on a hopper control task for different values of constraint thresholds (x- axis). On the right, we show example trajectories from when the threshold is too low (hopper falling over due to excessive model exploitation), just right (hopper successfully hopping towards target location), or too high (hopper standing still due to over conservatism).
Evaluation of our method (LDM) compared to constraining a learning-based controller with a density model, the variance over an ensemble of models, and no constraint at all on several domains including hopper, lunar lander, and glucose control.
find more details in our paper and on our website. We thank Sergey Levine, Claire Tomlin, Dibya Ghosh, Jason Choi, Colin Li, and Homer Walke for their valuable feedback on this blog post.
[ad_2]
Source link