Abstract
This work addresses the challenge of optimal energy management in microgrids through a collaborative and privacy-preserving framework. We propose the FedTRPO methodology, which integrates Federated Learning (FL) and Trust Region Policy Optimization (TRPO) to manage distributed energy resources (DERs) efficiently. Using a customized version of the CityLearn environment and synthetically generated data, we simulate scenarios designed to be net-zero for microgrids composed of multiple buildings. Our approach emphasizes reducing energy costs and carbon emissions while preserving privacy. Experimental results demonstrate that FedTRPO is comparable with state-of-the-art federated RL methodologies without hyperparameter tuning. The proposed framework highlights the feasibility of collaborative learning for achieving optimal control policies in energy systems, advancing the goals of sustainable and efficient smart grids.
In a nutshell
We designed a microgrid scenario that is net-zero by construction. The optimal policy is simple, but we wanted to evaluate how easily RL agents can learn it and what benefits a collaborative training setup provides.
The policy is straightforward: the agent must save a portion of the solar-generated energy until the battery is full. Later, when the sun goes down, the load will be supplied entirely by the stored energy.
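As a concrete illustration, here is a minimal rule-based sketch of that policy. The variable names and the normalized action convention are our own assumptions for the example, not the exact CityLearn interface.

```python
def rule_based_action(solar_generation_kw, load_kw, battery_soc, battery_capacity_kwh):
    """Toy version of the optimal policy for the net-zero scenario.

    Positive action = charge the battery, negative = discharge,
    expressed as a fraction of battery capacity per time step.
    """
    surplus_kw = solar_generation_kw - load_kw
    if surplus_kw > 0 and battery_soc < 1.0:
        # Store the solar surplus until the battery is full.
        return min(surplus_kw / battery_capacity_kwh, 1.0)
    if surplus_kw < 0 and battery_soc > 0.0:
        # After sunset, cover the load entirely from storage.
        return max(surplus_kw / battery_capacity_kwh, -1.0)
    return 0.0
```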
FedPPO in a 5-building setup
We noticed that a collaborative setup of PPO agents learns policies that are close to optimal but fails to close the last optimality gap, which in the long term translates into extra emissions and cost.
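For context, the collaborative step is essentially a federated-averaging round over the agents' policy weights. The sketch below assumes PyTorch policies and a hypothetical `local_ppo_update` routine; it illustrates the idea rather than the exact implementation used in the experiments.

```python
import copy
import torch

def fedavg_round(global_policy, building_envs, local_ppo_update):
    """One federated round: local PPO updates per building, then weight averaging."""
    local_states = []
    for env in building_envs:
        local_policy = copy.deepcopy(global_policy)
        local_ppo_update(local_policy, env)  # a few PPO epochs on local data only
        local_states.append(local_policy.state_dict())

    # Average parameters across buildings (uniform weights for simplicity).
    avg_state = {
        name: torch.stack([s[name].float() for s in local_states]).mean(dim=0)
        for name in local_states[0]
    }
    global_policy.load_state_dict(avg_state)
    return global_policy
```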
FedTRPO in a 5-building setup
The TRPO version, in contrast, struggled to learn the optimal policies, although it did capture some of the patterns; on the other hand, it required no hyperparameter tuning.
As expected, second-order methods with an adaptive step size are well suited to learning in a challenging setup without any tuning, but they are sensitive to the starting point. The initial exploration yields low-quality trajectories, so the first estimates of the second-order information are poor, and from there it becomes harder to reach optimality.
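To make the "adaptive step size" point concrete, the sketch below shows the core of a generic TRPO update: the natural-gradient direction is rescaled so the step stays inside a fixed KL trust region, with a backtracking line search as a safeguard. The callables `fvp` (Fisher-vector product), `surrogate`, and `kl` are assumed inputs; this is not the exact code used in the experiments.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10):
    """Approximately solve F x = g, where fvp(v) returns F @ v."""
    x = np.zeros_like(g)
    r, p = g.copy(), g.copy()
    rs_old = r.dot(r)
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p.dot(Fp) + 1e-8)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r.dot(r)
        if rs_new < 1e-10:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trpo_step(theta, grad, fvp, surrogate, kl, max_kl=0.01):
    """One TRPO update: natural-gradient direction plus a KL-constrained step size."""
    step_dir = conjugate_gradient(fvp, grad)
    # Rescale so the quadratic estimate of the KL divergence equals max_kl.
    sHs = step_dir.dot(fvp(step_dir))
    full_step = np.sqrt(2.0 * max_kl / (sHs + 1e-8)) * step_dir

    # Backtracking line search: accept the largest fraction of the step that
    # improves the surrogate objective while respecting the KL constraint.
    # Poor early trajectories give a poor Fisher estimate, which is exactly
    # where this scheme can stall.
    for frac in 0.5 ** np.arange(10):
        theta_new = theta + frac * full_step
        if surrogate(theta_new) > surrogate(theta) and kl(theta, theta_new) <= max_kl:
            return theta_new
    return theta  # no acceptable step: keep the old parameters
```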
Next steps
Our experimental results show that FedTRPO performs competitively compared with state-of-the-art PPO operating in a federated setup. While PPO demonstrated superior performance in some instances, TRPO provided improved convergence without hyperparameter tuning. We believe the best solution would combine the two RL algorithms to address their respective weaknesses. PPO struggles when it is close to the optimal solution because of its fixed learning rate; a learning-rate scheduler can help in some instances, but it introduces a new hyperparameter. TRPO relies entirely on sampling to converge to the optimum in a few steps, but the quality of the first trajectories and the parameter initialization limit its best performance, a common problem of second-order optimization methods. A simple remedy could be to perform a few iterations of PPO to create a better initialization point and then continue with TRPO, profiting from its adaptive step size, which plays an essential role in the vicinity of the optimal solution.
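A minimal sketch of that hybrid schedule, assuming hypothetical `ppo_update` and `trpo_update` routines and a list of per-building environments; the federated-averaging step is left as a comment for brevity.

```python
def train_hybrid(policy, building_envs, ppo_update, trpo_update,
                 warmup_rounds=10, total_rounds=100):
    """Warm-start with PPO, then switch to TRPO near the optimum.

    Early PPO rounds produce reasonable trajectories and a better
    initialization; TRPO then exploits its adaptive, KL-constrained
    step size to close the remaining optimality gap.
    """
    for round_idx in range(total_rounds):
        update = ppo_update if round_idx < warmup_rounds else trpo_update
        for env in building_envs:
            update(policy, env)  # local update on each building's data
        # (A federated-averaging step over the buildings' weights would go here.)
    return policy
```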