“A new benchmark called HALU-Hard shows that all current LLM systems still hallucinate.”