Abstract:
Modern sky-survey instruments detect tens of thousands of variable sources daily, and the demand for observing these sources continues to grow. This gives rise to a long-sequence task planning problem comprising dynamically arriving, high-frequency Targets of Opportunity (ToO) and their subsequent follow-up observation tasks. The problem is NP-hard owing to the randomness of observation events, the strong timeliness of data acquisition, the high selectivity of decision paths, and the coupling of complex constraints such as Earth occlusion and observation time windows. Traditional supervised learning is difficult to apply because high-quality labeled data are hard to obtain, whereas Deep Reinforcement Learning (DRL) combines the perceptual capability of deep learning with the sequential decision optimization of reinforcement learning to build end-to-end autonomous decision-making agents, making it well suited to complex planning problems. Within a DRL framework, this paper constructs a Markov Decision Process (MDP) model of the long-sequence task planning problem, defining the state-space encoding, action-space encoding, and reward function so as to guide the learning algorithm toward the optimization objectives. However, classical DRL methods have significant limitations in this unsupervised setting: faced with long dynamic task flows, agents struggle to converge to a globally optimal policy because of the large exploration space, and they lack sensitivity to locally critical tasks. This paper therefore improves the Pointer Network (PN) by drawing on the Local Attention (LA) mechanism, proposing the Local Attention Pointer Network (LA-PN) algorithm.
The algorithm introduces a sliding time window that constrains the attention range, allowing the model to dynamically focus on the task subsequence relevant to the current decision and to filter high-value targets through local context awareness, thereby reducing redundant computation. Comparative analysis of simulation results verifies the algorithm's profitability, real-time performance, and generalization.
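The core idea of the sliding-time-window attention constraint can be illustrated with a minimal sketch. The function below is a simplified, hypothetical illustration (not the paper's actual LA-PN implementation): candidate tasks whose start times fall outside a window around the current decision time have their attention scores masked to negative infinity before the softmax, so the pointer can only select from locally relevant tasks.

```python
import math

def local_attention_weights(scores, task_times, current_time, window):
    """Sketch of sliding-time-window local attention masking.

    scores:      raw attention scores for each candidate task
    task_times:  start time of each candidate task
    current_time: current decision time
    window:      half-width of the sliding time window

    Tasks outside [current_time - window, current_time + window]
    are masked out; a softmax is then taken over the rest.
    (Illustrative only; LA-PN's actual scoring may differ.)
    """
    masked = [s if abs(t - current_time) <= window else float("-inf")
              for s, t in zip(scores, task_times)]
    m = max(masked)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Example: five candidate tasks; only those within +/-2.0 time units
# of the current decision time receive nonzero attention weight.
scores = [1.0, 2.0, 0.5, 3.0, 1.5]
times = [0.0, 1.5, 4.0, 9.0, 2.0]
weights = local_attention_weights(scores, times, current_time=1.0, window=2.0)
```

Because out-of-window tasks receive zero weight, the downstream pointer decoder never attends to them, which is what reduces the effective exploration space for long task sequences.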