I’ve spent about 8 hours comparing insurance PDSs (product disclosure statements), and I’ve tried to have Grok and co read them for a comparison. The LLMs have consistently come back with random, vague, invented figures that in no way reflect the real documents. Some LLMs produce a reasonable summary and limit their creativity, but anything like Grok that goes a step beyond summarising consistently returns numbers in particular that simply don’t exist – especially when comparing things.
This matches my experience with Copilot Studio in a professional environment when adding large but patchy knowledge sources. Simply put, these things still have an enormous propensity to sound authoritative while spouting absolute unchecked garbage.
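One crude way to catch the fabricated-figures problem described above is a post-hoc grounding check: every number the model cites should appear verbatim somewhere in the source document. This is just a sketch (the regex and the exact-match rule are my own simplifying assumptions; real PDSs reformat numbers, so this would miss plenty), but it illustrates the idea:

```python
import re

def unverified_figures(answer: str, source_text: str) -> list[str]:
    """Return figures quoted in an LLM answer that never appear in the source.

    Assumption: a number the model cites (dollar amount, plain number,
    percentage) should exist verbatim in the document it summarised.
    Anything that doesn't is a candidate hallucination.
    """
    # Match things like 750, $750, $50,000, 2.5%
    figures = re.findall(r"\$?\d[\d,]*(?:\.\d+)?%?", answer)
    return [f for f in figures if f not in source_text]

# Hypothetical example: the model invents a contents limit of $65,000.
source = "Excess: $750. Contents cover up to $50,000."
answer = "The excess is $750 and contents are covered to $65,000."
print(unverified_figures(answer, source))  # → ['$65,000']
```

An exact-substring check like this can't confirm a figure is used *correctly*, only flag ones that can't possibly be grounded, which is exactly the failure mode complained about here.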
For code, the training data set is vastly larger and there’s more room for a “working” answer – but for anything legalistic, I just can’t see these models giving a seriously authoritative response.
tl;dr: Am I alone here, or are LLMs still just so far off being reliable for actual single-shot data processing outside of loose summarisation?
submitted by /u/Mullazman