This study investigates how large language models (LLMs) handle the universality problem for regular expressions, i.e., deciding whether a given regular expression matches every string over its alphabet. We first establish the theoretical background of universality and approximate universality. Next, we construct a benchmark dataset of regular expressions that includes both universal and non-universal cases. We then evaluate four state-of-the-art LLMs under different prompting strategies, measuring both correctness and reasoning quality, and finally conduct an error analysis to identify common failure modes and propose fine-tuning directions. We find that all four models perform remarkably well on the universality problem, achieving near-perfect accuracy on both universal and non-universal cases, with only minor variations observed for Gemini at certain depths. Together, these results provide insight into the capabilities and limitations of LLMs in tackling algorithmically hard problems.
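To make the decision task concrete, the sketch below illustrates one simple notion of approximate universality: enumerate all strings over a small alphabet up to a length bound and check that the regular expression fully matches each of them. The function name `is_universal_up_to`, the two-letter alphabet, and the length bound are illustrative assumptions and not part of the benchmark; the precise definition of approximate universality used in this study may differ.

```python
import itertools
import re


def is_universal_up_to(pattern: str, alphabet: str = "ab", max_len: int = 6) -> bool:
    """Length-bounded (approximate) universality check.

    Returns True if the regex fully matches every string over `alphabet`
    of length <= max_len. A False result yields a definite counterexample
    to universality; True only certifies strings up to the length bound.
    """
    compiled = re.compile(pattern)
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            word = "".join(chars)
            if not compiled.fullmatch(word):
                return False  # found a string the regex does not match
    return True


if __name__ == "__main__":
    # (a|b)* matches every string over {a, b}: universal.
    print(is_universal_up_to(r"(a|b)*"))  # True
    # a* fails on any string containing 'b': non-universal.
    print(is_universal_up_to(r"a*"))      # False
```

A bounded check of this kind can refute universality with a concrete counterexample but can only approximate a positive answer; deciding exact universality for regular expressions requires, in effect, determinizing the corresponding automaton and is PSPACE-complete in general.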