Scheduling Methods for Astronomical Satellite Target of Opportunity Tasks with High-frequency Dynamic Arrivals

• Against the background of sky-survey equipment detecting tens of thousands of variable-source objects daily and growing demand for observing these sources, a long-sequence task planning problem is formed, comprising high-frequency, dynamically arriving Targets of Opportunity (ToO) and their follow-up observation tasks. The problem features random observation events, strong timeliness requirements on data acquisition, many candidate choices, and complex constraints, and is commonly regarded as NP (non-deterministic polynomial) hard, so labeled data for supervised learning are difficult to obtain. When unsupervised Deep Reinforcement Learning (DRL) methods are applied to this long-sequence task planning problem, the satellite, acting as the agent, struggles to converge quickly to the globally optimal policy. This paper therefore improves the Pointer Network (PN) by drawing on the idea of the Local Attention (LA) mechanism, proposing the Local Attention Pointer Network (LA-PN) algorithm. By introducing a time window, the algorithm lets the model focus on the parts of the sequence that most strongly influence the current decision, reducing ineffective exploration. Comparative analysis of simulation results verifies the reward performance, real-time performance, and generalization ability of the algorithm.

       

Abstract: Against the background of sky-survey equipment detecting tens of thousands of variable sources daily and growing demand for observing these sources, a long-sequence task planning problem is formed. The problem comprises high-frequency, dynamically arriving Targets of Opportunity (ToO) and their subsequent follow-up observation tasks. It is NP-hard owing to the randomness of observation events, the strong timeliness of data acquisition, the large number of candidate decision paths, and the coupling of complex constraints such as Earth occlusion and observation time windows. Traditional supervised learning methods are difficult to apply because high-quality labeled data are hard to obtain, whereas Deep Reinforcement Learning (DRL) combines the perception ability of deep learning with the sequential decision optimization mechanism of reinforcement learning to construct end-to-end autonomous decision-making agents, and thus offers a general advantage in handling complex planning problems. Within the DRL framework, this paper constructs a Markov Decision Process (MDP) model of the long-sequence task planning problem, defining the state-space encoding, the action-space encoding, and the reward function so as to guide the reinforcement learning algorithm toward the optimization objective. However, classical DRL methods in this unsupervised setting have a significant limitation: faced with long-sequence dynamic task flows, the agent struggles to converge quickly to the globally optimal policy because of the large exploration space, and it lacks sensitivity to locally critical tasks. This paper therefore improves the Pointer Network (PN) by drawing on the concept of the Local Attention (LA) mechanism, proposing the Local Attention Pointer Network (LA-PN) algorithm. The algorithm introduces a sliding time window that constrains the attention range, allowing the model to focus dynamically on the task subsequence relevant to the current decision and to filter high-value targets through local context awareness, thereby reducing redundant computation. Comparative analysis of simulation results verifies the reward performance, real-time performance, and generalization ability of the algorithm.
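The core idea of constraining a pointer network's attention to a sliding time window can be illustrated with a minimal sketch. The function below computes one decoding step: dot-product scores between the decoder state and candidate task embeddings are masked so that only tasks whose observation times fall inside the current window remain selectable. All names, shapes, and the specific masking rule here are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def local_attention_pointer(query, task_embeds, task_times, t_now, window):
    """One decoding step of a local-attention pointer mechanism (sketch).

    query      : (d,)  current decoder state
    task_embeds: (n, d) encoded candidate observation tasks
    task_times : (n,)  each task's (assumed) earliest observation time
    t_now      : current decision time
    window     : width of the sliding time window limiting attention
    Returns a pointer (softmax) distribution over the n candidates.
    """
    d = task_embeds.shape[1]
    scores = task_embeds @ query / np.sqrt(d)  # scaled dot-product scores
    # Local attention: mask tasks outside [t_now, t_now + window] so the
    # pointer only considers near-term tasks, shrinking the search space.
    in_window = (task_times >= t_now) & (task_times <= t_now + window)
    scores = np.where(in_window, scores, -np.inf)
    exp = np.exp(scores - scores.max())  # masked entries contribute zero
    return exp / exp.sum()

# Illustrative usage with random embeddings and six candidate tasks.
rng = np.random.default_rng(0)
embeds = rng.normal(size=(6, 4))
q = rng.normal(size=4)
times = np.array([0.5, 1.0, 3.0, 7.0, 9.0, 2.0])
p = local_attention_pointer(q, embeds, times, t_now=0.0, window=4.0)
```

Tasks at times 7.0 and 9.0 fall outside the window and receive zero probability, so the agent's exploration is confined to decision-relevant candidates; in the full algorithm the window slides forward as the schedule advances.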

       
