- Have an AI chat model come up with an answer to a problem.
- Have it write a report discussing the details of the problem and why it's answer is correct, directed at a person or AI model who has no knowledge of the initial problem or technical field.
- Have a second AI model with no knowledge of the problem grade the report, and write it's own report either (a) asking for clarification / more information about the problem that the original model didn't provide or (b) pointing out an inconsistency in the argument posed by the original model. Give this report back to the original model and ask it to write it's own report back with either the necessary information or changes.
- Repeat until either the second AI model is convinced by the first AI model's explanation or the first AI model has implemented all the changes requested by the second AI model.
It's super clunky but has given pretty good results in the cases where I tried it lol
It seemed like a pretty good idea, though I'd guess that it would greatly increase token usage. I'd also be concerned that the LLM as a judge might struggle to grade things accurately if it wasn't also able to generate good enough answers to begin with.
itissid 6 minutes ago [-]
Isn't this kind of another way of how Inference Time Scaling works? It will basically produce several chain of thoughts and then pursue one that has maximum reward based on an internal function?
JumpCrisscross 2 hours ago [-]
Kagi’s Assistant feature makes this super easy. Just switch assistants and ask them to check the other’s work.
StopDisinfo910 43 minutes ago [-]
For anything semi-adversarial, I have had good results asking the AI to come up with a plan, then take the side of the opponent coming with counter play/way to defeat the plan, finally asking for a revision of the initial plan given the potential reaction from the opponent.
The final plan you obtain is generally a lot more well rounded and thought out.
I find that amusing because the technique also works when I apply it to me. Picking flaws in your plan before revisiting it actually works.
hsuduebc2 2 hours ago [-]
We're there any situation that first conclusion from AI was completely changed? Can you give generally examples of situations where it changed or significantly improved overall result? It sounds cool.
nomel 51 minutes ago [-]
I would be interested to know how ofter "oscillations" occur, where they flip flop from being too "agreeable" to challenges (which probably is just a sparse latent space). This happens to me pretty frequently, where you can repeatedly say "no that's wrong" and the LLM will do a 360, explaining why it was "in fact" wrong and you are "right", repeat.
Lerc 3 hours ago [-]
I kind of want to try something like this at a larger scale in an always-on mode where I have a 'senate' of debate. Rather than responding to prompts on a case by case basis, provide a list of tasks (potentially with deadlines) and let the senate work on them, break off into groups to manage subtasks, challenge results , make suggestions. Even potentially a tree of analysts where suggestions only gets passed up the tree when the parent node thinks a lower analysis is particularly insightful.
I definitely think that directing models to approach a problem from a specific perspective can generate better or worse results. Creating a diverse set of perspectives along with critical analysis of their results should be able to produce some impressive results.
Things like this would generate a massive number of tokens, but the cost per token is definitely heading in the right direction to allow for this. There is also the possibility of setting up an AI only IRC server where anybody can connect their own models for a shared debating chamber.
mikepurvis 3 hours ago [-]
In doing some DevOps-y type tasks recently (ansible, packer, docker, baking images with guestfish), I've found it very frustrating how much ChatGPT will confidently tell me to use flags on tools that don't exist, or hallicinate completely non-existent functions or behaviours. And then when I spend time trying what it suggests only to hit a wall and come back like wtf mate it breezily goes "oh yes so you're right, good job figuring that out! You're so close now! Your next step is to do X and Y," and then serves up the same detailed tutorial as before but with the flag or whatever it was that it had wrong subtly changed.
It definitely makes me feel like I'm dealing with an overenthusiastic intern who is throwing stuff over the wall without checking their work, and like maybe having a second bot sitting in front of the first one being like ARE YOUR SURE ABOUT THAT could really improve things.
MoonGhost 1 hours ago [-]
You can't get more info from LLMs than it actually holds. Like Anthropic pointed if LLMs knows the name but has no other info it starts hallucinating. The same probably happens here. LLM knows there must be a flag but can't remember all of them. Likely short reminder in prompt will help. (or search web for GPT) Just my $0.02.
mikepurvis 40 minutes ago [-]
It certainly feels like you can just by challenging it; then it happily finds other paths to what you want. So maybe internally it needs a second voice questioning it, to see how sure it is.
0x20cowboy 2 hours ago [-]
I did a stint in Devops and I found every models to be like this for all of the infra-as-code languages. Anything yaml based was especially bad.
Even Amazon’s own offering completely made things up about Amazon’s own formats.
I’d be curious as to why that is. It seems like there would be enough training data, and for Amazon in particular it seems like they could make a validation tool the model could use.
mikepurvis 42 minutes ago [-]
Maybe I'm excessively anthropomorphizing, but it does feel a bit analogous to my own thought process, like "I need feature XYZ, and based on other tools I'm more familiar with it should be an --xyz flag, so let me google for that and see if I'm right or if I instead find a four-year-old wontfix on Github where someone asked for what I need and got denied."
Except... the model is missing that final step; instead it just belches out its hypothesis, all dressed up in chirpy, confident-sounding language, certain that I'm moments away from having everything working just perfectly.
vunderba 2 hours ago [-]
100%. This has happened enough to me that I wished I could just inject the man page docs into it to at least act as a sanity check.
organsnyder 2 hours ago [-]
I've enjoyed watching Claude try running commands with incorrect flags, trying them, and then adapting.
nonelog 2 hours ago [-]
Spot on.
vunderba 2 hours ago [-]
A year or so ago I experimented with splitting a user prompt down to a set of "different AI personas" that would each try to approach the user's problem in a different way and then bubble back up with a master arbiter for consensus.
I modeled it after the concept of advisors from Civilization II. It worked reasonably well though I think it was at least somewhat limited by being constrained to a single LLM (Mistral). It also lit my computer on fire.
bee_rider 28 minutes ago [-]
What sort of personalities did you try? A group where some members have grudges against each other and will irrationally poke holes in each other’s plans could be a fun experiment.
nonethewiser 3 hours ago [-]
In theory couldnt this just be baked into a single adversarial model?
tonmoy 3 hours ago [-]
Yes, but I guess the model is optimized for relatively quick response, whereas these techniques are allowing the model to spend more time to generate a higher quality response
Lerc 2 hours ago [-]
To an extent, but different models are better at different things.
That is something I'm also curious about. Given models (that use the same tokenisation) that are better at different things, would their be interesting things to find by analysing the logprobs for tokens generated from identical inputs (including cross feeding the generated token from one to another)
Surely there must be something notable at particular points when a model goes off on the wrong path.
crowcroft 2 hours ago [-]
Like, just endlessly grinding tokens, then processing the output and pulling out good ideas when the endless debate generates them?
Would be interesting what it comes up with with enough time and tokens.
danielmarkbruce 2 hours ago [-]
This is being done, and you could apply it to a lot of domains. Go for it for whatever use case you have.
2 hours ago [-]
cube2222 4 hours ago [-]
This is really cool!
One strategy I often use (which is much simpler and more limited than this), is to finish my message with: “Please do a round of thinking in <thinking></thinking> tags, then a round of self-critique in <critique></critique> tags, and then a final round of <thinking>, before responding.”
It works very well. Similarly just asking it to “find the 5 biggest issues with its proposal” works pretty good (the 5 forcing it to find something, even if it’s mostly irrelevant).
bentt 45 minutes ago [-]
Oh I really like that. It makes me want to have it score its ideas with metrics and then keep iterating until it meets some score.
danielbln 4 hours ago [-]
I always do "now again but put on your critical hat"
CSSer 3 hours ago [-]
Makes me wonder how it would do if you tell it "put on your robe and wizard hat"
tomrod 3 hours ago [-]
ChatGPT calls you a superstar and it drops into bruhspeak. Emojis proliferate.
sumtechguy 3 hours ago [-]
it proceeds to spit out the entirety of bash.org
electroly 4 hours ago [-]
This seems to be different than I expected from the title. I thought it would be explicitly adversarial.
1. You are the assistant. Please answer the question directly.
2. You are the cross-examiner. The assistant is wrong. Explain why.
3. You are the assistant. The cross-examiner is wrong. Defend your claim.
4. You are a judge. Did either party make their case, or is another round of argumentation required?
I haven't tried this. No idea if it works. But I find it's helpful to ask ChatGPT, in separate prompts, "XYZ is true, explain why" and "XYZ is false, explain why" and see which one seems more convincing.
ChadMoran 1 hours ago [-]
Check out Fast Agent! (I have no affiliation with it, just use it).
Techniques like this have been around since GPT-3.5. There are boatloads of papers on the topic.
I have no idea why anyone thinks this is novel. I guess that speaks to the state of HN
moribunda 19 minutes ago [-]
Exactly... I thought that implementing STORM was just a basic step in this topic... Looks like we're running in circles.
nonethewiser 3 hours ago [-]
Chatgpt shares context between chats. I wonder how that impacts it?
It seems like a good approach though. What you dont want to do is ever suggest that its wrong yourself. Usually it will just assume it is wrong.
Actually what I find impressive is when I do this and it actually pushes back to defend itself.
the_af 42 minutes ago [-]
Does it share context even if no "memory updated" message appears indicating it has stored a fact about you?
I asked ChatGPT and it says no, but then again it's not reliable at introspection or at revealing data about how it works.
3np 2 hours ago [-]
Also a little clickbaity with "my AI" and then it's all Mistral...
hnuser123456 4 hours ago [-]
I'm having a lot of fun experimenting with stuff like this. I'm trying to put together an unrealengine blueprints style graph editor to allow people to design workflows like this where you start with the user prompt input, which goes to one agent, which makes an initial attempt, and then that conversation history gets passed to another "agent" with a different system prompt telling it to be a harsh critic, but to also give a pass/fail signal, and loop back until the critic judges pass, then send that back to the user as output. Ideally as a little website that can call your own LLM endpoints and save/load/share workflow graphs.
Mistral small 3.1 and gemma 3 feel like the first semi-competent models that can be run locally, but that competence is just a seed, and they still need to be guided with a framework that keeps them on track.
Try giving it python execution in a loop and tell it to explore the world. It'll start trying to download and read news and stuff.
irthomasthomas 3 hours ago [-]
I think you can do most of this already with llm-consortium (maybe needs the llm-openrouter plugin with my pr merging)
A consortium sends the same prompt to multiple models in parallel and the responses are all sent to one arbiter model which judges the model responses. The arbiter decides if more iterations are required.
It can also be forced to iterate more until confidence-threshold or min-iterations.
Now, using the pr i made to llm-openrouter, you can save an alias to a model that includes lots of model options. For examples, you can do llm openrouter save -m qwen3 -o online -o temperature 0, system "research prompt" --name qwen-researcher
And now, you can build a consortium where one member is an online research specialist. You could make another uses JSON mode for entity extraction, and a third which writes a blind draft. The arbiter would then make use of all that and synthesize a good answer.
kridsdale1 2 hours ago [-]
Any links or names of example implementations of this?
also, you aren't limited to cli. When you save a consortium it creates a model. You can then interact with a consortium as if it where a normal model (albeit slower and higher quality).
You can then serve your custom models on an openai endpoint and use them with any chat client that supports custom openai endpoints.
The default behaviour is to output just the final synthesis, and this should conform to your user prompt. I recently added the ability to continue conversations with a consortium. In this case it only includes your user prompt and final synthesis in the conversation, so it mimics a normal chat, unlike running multiple iterations in the consortium, where full iteration history and arbiter responses are included.
In this example I used -n 2 on the qwen model since it's so cheap we can include multiple instances of it in a consortium
Gemini flash works well as the arbiter for most prompts. However if your prompt has complex formatting requirements, then embedding that within an already complex consortium prompt often confuses it. In that case use gemini-2.5-pro for the arbiter.
.
andai 4 hours ago [-]
I am thinking the same thing! Multiple "personalities", in parallel, or in series. For example, I have approximated, in GPT, some of Gemini's ability to call out nonsense, sloppy thinking, by telling GPT to be mean! (The politeness seems to filter out much that is of great value!)
However, the result is not pleasant to read. Gemini solved this in their training, by doing it in two phases... and making the first phase private! ("Thinking.")
So I thought, what I need is a two-phase approach, where that "mean" output gets humanized a little bit. (It gets harsh to work in that way for more than short intervals.)
As a side note, I think there would be great value in a UI that allows a "group chat" of different LLM personalities. I don't know if such a thing exists, but I haven't seen it yet, although the message object format seems to have been designed with it in mind (e.g. every message has a name, to allow for multiple users and multiple AIs).
Even better if it supports multiple providers, since they have different strengths. (It's like getting a second opinion.)
jbm 3 hours ago [-]
I disagree.
If anything, telling GPT to be blunt seems to downgrade its IQ; it hallucinates more and makes statements without considering priors or context. I jokingly call it Reddit mode.
dingnuts 3 hours ago [-]
why would that be a joke? there's a ton of Reddit comments in the training data, and the output is of similar quality. LLMs are literally outputting average Reddit comments.
MoonGhost 1 hours ago [-]
Reddit works hard to make comments accessible to only Google. However MS + OIA might have grabbed something before Reddit-Google contract.
inanutshellus 1 hours ago [-]
See, he's not joking, he's "joking" ...
NitpickLawyer 4 hours ago [-]
> As a side note, I think there would be great value in a UI that allows a "group chat" of different LLM personalities.
This is the basic idea behind autogen. They also have a web UI now in autogen studio, it's gotten a bit better. You can create "teams" of agents (with different prompts, themes, tools, etc.) and have them discuss / cooperate. I think they even added memory recently. Have a look at it, might be what you need.
theturtletalks 2 hours ago [-]
MoE, but an abstraction deeper?
globalise83 4 hours ago [-]
Have you tried n8n? It allows you to build flows like that - you can run the community version in a Docker container within a few minutes and share the configurations for the flows you have built very easily.
mecsred 4 hours ago [-]
_#_ has to be one of the worst word shortening schemes I've ever seen get widespread. It only works with a very small number of long-lived technologies, in which case they basically just get a nickname, "k8s" "i18n". It does not at all work for larger contexts. You're basically making someone solve a crossword (2 across, 10 letters with two filled in) just to parse your sentence.
jjj123 4 hours ago [-]
I just googled it and it looks like “n8n” is the name of the service. The op wasn’t abbreviating anything so I don’t think it’s the same phenomenon as what you’re describing.
lgas 4 hours ago [-]
Well, the service is doing the same thing though. The part I don't understand is that I assume n8n is short for "Nation" but literally every single person I've seen talk about it on YouTube (which is quite a lot) say "En Eight En" every time.
nemomarx 3 hours ago [-]
nation is too short for 8 - maybe navigation?
pkaye 3 hours ago [-]
Looks like n8n is short for nodemation
firesteelrain 2 hours ago [-]
Why do we do this to ourselves?
Y_Y 15 minutes ago [-]
Techno-flagellation is the only way to atone
eddieroger 3 hours ago [-]
It's just another form of any other jargon - unknown until you know it, and usually specific to the use case. I see k8s and i18n or a11y and I know exactly what they mean because at some point I learned it and it's part of the world I live in. Searching for stuff is how we learn, not solving crosswords.
wongarsu 2 hours ago [-]
I kind of get k8s and can live with i18n (at least it's a long word). But a11y just shouldn't exist. "Oh look, it looks like ally, what a cute play on words". Yeah, but for a dumb joke and 9 saved keystrokes you literally made the word accessibility less accessible. That's exactly the opposite of what accessibility is about
mecsred 2 hours ago [-]
Right, my complaint is that it only works like jargon, where you are just giving something a context-specific nickname. As a word shortening scheme, it's terrible. A world where many projects have names like s11g is a nightmare.
hnuser123456 4 hours ago [-]
I had not, but that looks awesome. Microsoft put out something called "agent flows" that also fits this category.[1] I'm working on more of an "at home" version - no "talk to sales" button.
How far is this going to go? Are we going to have a team of AI agents that runs a scrum team and meets for stand ups every couple of hours?
Are we going to replicate government bureaucracy with agents all debating topics all day long to find the best opinion?
bilekas 1 hours ago [-]
This is an interesting approach, it reminds me of YT creator actually. I'll find the YT creator, but basically he would make some script that would play the game like a race-course, with the goal being the finish line and iterate it N number of times, the script would keep iterating until it found the fastest solution.
I believe they called that machine learning.. Or re-enforced training.
I'm being slightly facetious, but my ignorant understanding of AI these days is basically the same no ?
I feel like itd be cool to try prompts based on an adversarial justice system… attorney agents arguing both sides, a judge ruling on “the law”—adherence to instructions etc
jedberg 3 hours ago [-]
We're really going to need to figure out how to power all these GPUs with green power real quick, or we're going to melt the planet having AIs debate with themselves on the optimal solution to tik-tac-toe...
nonethewiser 3 hours ago [-]
Ive felt this way when using chatgpt for a simple search. Stuff that google could handle but would just be slower, mostly from me having to manually filter.
Sometimes its the easiest way to complete a very small task but the cost difference on the backend has to be pretty damn large. The user inevitably ends up not caring whatsoever. Its just not real to them.
Xcelerate 4 hours ago [-]
I think this is how we get ML models to come up with novel ideas. Diagonalize against all the ideas they’ve already tried and dismissed via self-argument but keep certain consistency constraints. (Obviously much easier said than done.)
jwally 4 hours ago [-]
Scaled up and spread out - this probably gets you pretty close to consciousness(?)
Conway's game of life, but instead of colored squares with rules, they're LLM's with some kind of weighting - all chattering back and forth with one another - bubbling up somehow to cause speach/action
lubujackson 3 hours ago [-]
Decades ago I read The Society of Mind by Marvin Minsky. He pushed this sort of idea, that consciousness is composed of individual, competing processes. Worth a revisit!
andai 4 hours ago [-]
What you just said is what I tried and failed to say ten minutes ago!
These models have limitations obviously, but many critiques apply equally or more to people.
If people were tasked with one shot, 10 second answers, to be written out in near errorless grammar, the LLM’s viewing our responses to prompts would be spending a lot of time discussing our limitations and how to game us into better responses. Humor, not at all humor.
alexmolas 3 hours ago [-]
There are two examples in the repo, one with CoRT and another one without. And the one without it it's much better than the one that uses it. Weird choice of examples...
2cheeze4u 3 hours ago [-]
I think the names were switched up.
1 hours ago [-]
hu3 59 minutes ago [-]
Here's some related challenge I'm facing. Maybe someone can help me:
I also managed to make AI critique itself and that improved code generation a ton.
For a TypeScript backend project that runs with Bun, I tell AI to also generate and run unit tests after every code change suggested by AI.
How do you solve the risk of AI writting and executing unit tests with something like `rm -rf /` and wiping your files?
Docker works but I like to keep things simple.
Deno supports revoking file access but I'd like to keep using Bun.
zactato 55 minutes ago [-]
Either you trust AI or you don't?
If you don't trust it then you need to review what it's writing.
Docker seems like a pretty low complexity way to create an isolated environment to run automation.
derwiki 54 minutes ago [-]
Manually approve every terminal command it wants to run instead of vibe mode. Tbh I think an rm -rf scenario is exceedingly unlikely.
joshstrange 4 hours ago [-]
I've thought about trying this cross-model as well. Have Claude generate something, have OpenAI check it, have Gemini check that check. Firing multiple of these in parallel.
There was a post here a week or so ago doing the "model checking model"-type thing with GH PRs IIRC that was interesting. I haven't had a chance to play with this idea yet.
ChadMoran 1 hours ago [-]
Fast Agent has this as a first-class citizen called "Evaluator Optimizer" pattern. Where it in a loop with a defined number of max refinements judge itself and give the output a rating, demanding it improve it's output.
Highly encourage others to check out Fast Agent. It has been delightful to use. It has interactive chat mode which I love and it's really tight and easy to implement.
I tried something similar when Llama2 came out, pitting two assistants, who each believed the other is the user, against each other. Ultimately, it was the same model talking with itself. The system prompts for both had various instructions to disagree and criticise the opinion of the user. I provided the first message to get things started. Usually, it’s be along the lines of “nuclear proliferation is harmful to humanity”.
After 15 or so iterations, both assistants would keep repeating the same things and find agreement anyway. Sometimes, the chat became unhinged and useless, but 95/100 times, it was agreement.
Happy someone else made it work.
generalizations 2 hours ago [-]
I always assumed you'd have to use different models. Even if only one of them is large, the others would inject enough difference of opinion to keep it useful.
badmonster 1 hours ago [-]
Have you experimented with weighting the self-evaluations based on specific criteria (e.g., correctness, clarity, creativity), or using external validators to guide the AI’s final choice? Curious how much tuning the evaluation step impacts overall performance.
cwillu 1 hours ago [-]
Any api that lets you constrain output to a formal syntax should let you do away with the “first output a number, and only then explain yourself” boilerplate.
K0balt 4 hours ago [-]
I’ll second this. I often use a “research assistant “ and skeptical“department head” personas working together/against each other as a research team. It works well and is occasionally hilarious, replete with the occasional HR complaint when things go off the rails. ( I typically use local uncensored models)
jbellis 31 minutes ago [-]
does it actually make a difference to do M rounds of N vs one round of M*N?
mritchie712 3 hours ago [-]
Did something similar (OverkiLLM) to this waayyyy back in August with open LLMs. I'm sure it'd work much better now:
So glad to see a write up on this finally. I'm no machine learning phd but I always wondered why this wasn't more of a thing. Like an extension of a GAN conceptually, sort of, not really at all Im sure.
Also I think I kind of assumed OpenAI might be doing this behind the curtain?
throwawayForMe2 2 hours ago [-]
I wonder if the Scholastic method of the Schoolmen would be useful with its argument and counter argument style.
daxfohl 3 hours ago [-]
Maybe have a "reconcile" option, for it to see if it can mix and match the best parts of each alternative rather than just choosing one.
grzracz 3 hours ago [-]
Your readme demo images are wrong: the terminal one is the non-CoRT one and the GUI one is the one with CoRT. Confused me for a while
thunderbong 2 hours ago [-]
A lot of the comments here are reminiscent of the early Google days when everyone was finding ways to search better!
noworriesnate 3 hours ago [-]
I’ve had success telling the model it really needs to poop and if it gets to the point quickly it’ll be able to leave the meeting and go do that. It actually works amazingly well.
It’s also a lot more ethical than verbal abuse, which some people say improves the results as well.
Programming isn’t what it used to be.
tinix 2 hours ago [-]
this works for getting out of traffic tickets too lol
yieldcrv 1 hours ago [-]
Reminds me of baby agi from 2 years ago
but I guess that was before chain of thought models
53 minutes ago [-]
Garlef 4 hours ago [-]
Similarly, letting the LLM generate a socratic dialogue can work pretty well to get deeper into a topic.
celltalk 1 hours ago [-]
One of my doctoral propositions is,
dialog leads to true artificial intelligence.
irthomasthomas 3 hours ago [-]
my favourite pattern rn:
llm "write a savage, yet grounded roast of: $content"
llm -c "Write an equally savage rebuttal"
llm -c "first arbitrate and then synthesize a final review."
csours 3 hours ago [-]
Yes, give the computers anxiety too!
j45 1 hours ago [-]
There appear to be no shortage of token saving attempts that can end up using more tokens, whether it's a monthly paid plan or API.
Having an approach to recognize what is needed from the AI software, and anticipate how it may default to respond based on it's programming is critical.
mparnisari 3 hours ago [-]
So like rubber ducking for AI?
z2 2 hours ago [-]
I would really like to see a fusion guidebook of mental tricks that work for humans and just as well for AI. Or humorously, perhaps prompt-engineering tricks that are also great mental hacks for better or clearer human thinking.
1970-01-01 2 hours ago [-]
"While hallucinating a duck, check my script for errors."
Der_Einzige 4 hours ago [-]
Debate as a reasoning tactic is massively undervalued. There's tons of papers on this at places like NeurIPS, ICML, ICLR, etc.
I got to meet and talk to the authors of this paper at NeurIPS. They're class acts!
lenerdenator 3 hours ago [-]
I, too, like to give Terminator lite anxiety.
getcrunk 3 hours ago [-]
Hello cnn’s
k2xl 3 hours ago [-]
I've done something similar for learning about a controversial topic. I ask it to act as if it is called Bob is a well informed supporter of one side (like Ukraine) and then act as if it is something named Alice who is a well informed supporter of another side (Russia) and they have to debate each other over a few prompts with a moderator named 'Sue'
Then after a few rounds of the debate where Sue asks a bunch of questions, I ask it to go to the judges - Mark, Phil, Sarah (and I add a few personalities to each of them... Sometimes I pretend they are famous moral philosophers) and then I have them each come up with a rubric and decide who is the winner.
Really fun, and helps me understand different sides of issues.
rat87 48 minutes ago [-]
That seems like a terrible idea. At best it seems likely to help you make a false but convincing sounding case.
I really hope no one is using that to help them understand controversial topics much less using that to determine their stances.
Id recommend looking into actual human experts who are trustworthy and reading them. Trying to get LLM to argue the case will just get you a lot of false information presented in a more convincing fashion
firgrove 4 hours ago [-]
this is amazing - I love seeing novel approaches to optimizing
2 hours ago [-]
antisthenes 4 hours ago [-]
Cool. Now I can justify talking to myself.
casenmgreen 4 hours ago [-]
[flagged]
hackinthebochs 4 hours ago [-]
Don't people get tired of having this same "debate" on every post about LLMs? And I scare quote debate because the naysayers never support their strong claims beyond the most superficial of responses. It's all just so tiring at this point.
casenmgreen 1 hours ago [-]
The most superficial response is adequate, where the claim is so improper.
LLM/AI are extremely useful. I am in no way disputing this.
stevenAthompson 4 hours ago [-]
Can you define "thinking" in a way that excludes what the AI is doing, but includes what humans do?
I haven't' really seen anyone else manage it without talking about ghosts or some other kind of metaphysical voodoo.
dttze 53 minutes ago [-]
Using a conceptual understanding of something to deduce or infer something else.
An LLM doesn't know what anything is. Just what goes around the token representation of that thing.
stevenAthompson 3 minutes ago [-]
"conceptual understanding of something" is just another way of saying "the relationship between concepts", which is exactly what transformer models use.
*EDIT* To elaborate, how can you define anything in isolation of every other concept/thing? You can't. Things are only defined by their relationships to each other, which is exactly the same thing transformer models do.
Cantinflas 4 hours ago [-]
Why is it so hard to believe that a complex neural network can think? You literally have one over your shoulders that does exactly that.
consumer451 4 hours ago [-]
I am not sure that you can make that absolute statement. Reasoning is subdivided into types, and one of those types is inductive reasoning.
> Inductive reasoning refers to a variety of methods of reasoning in which the conclusion of an argument is supported not with deductive certainty, but with some degree of probability. Unlike deductive reasoning (such as mathematical induction), where the conclusion is certain, given the premises are correct, inductive reasoning produces conclusions that are at best probable, given the evidence provided.
Doesn't predicting the next token qualify as doing just that?
Markov chains have done that for ages. They aren't AI. This is just that scaled up.
Just because it can infer a token doesn't mean it can infer a conclusion to an argument.
casenmgreen 58 minutes ago [-]
To add a bit to this : expert systems have two properties. They give an answer, and they explain their reasoning.
LLM cannot explain their reasoning, and that is because there is no reasoning.
verytrivial 3 hours ago [-]
Sorry to join the pile-on, but can I just ask: In what way does a brain think that an Ai does not? And does the distinction apply from human brains down to fruit flies? Is it a property of embodiment? (I have suspected for years that consciousness isn't just emergent but specifically that it is NOTHING besides that. It's all about scale and large models are just starting climb the ladder. The ladder does not necessarily go up the same way as embodied thought though.)
throwaway150 4 hours ago [-]
How can you be so sure? How do you know that our brains don't work like transformers too, except for having the advantage of having more types of sensory data? How can you settle this debate without defining what "thinking" and "reasoning" is and how what LLMs do is not similar to what a kindergarten level kid may be capable of? I think we all agree kindergarten kids can think and reason, don't we?
senko 3 hours ago [-]
Agree 100%.
We should also not be calling a pointing device "a mouse" because it's not a small rodent, there aren't any actual windows inside a computer, and I haven't seen anyone balancing their laptop trying to surf the web.
Also smartphones are not actually smart and are only barely phones.
casenmgreen 1 hours ago [-]
I'm finding laymen are thinking AI is reasoning, because the term makes it look like this is what it is.
The potential confusion of terms such as mouse/windows/surfing is not the same as calling LLM AI, and then going on to say it is "thinking" and "reasoning".
rapfaria 4 hours ago [-]
Or "thinking" just got a new meaning and it's to convey information in the field - perhaps the Oxford dictionary will add it soon?
jasonthorsness 4 hours ago [-]
What words would you use instead?
Philpax 4 hours ago [-]
You say these things with such certainty. How can you be so sure?
pfdietz 3 hours ago [-]
You are the critic. Construct three rebuttals to your claim.
dqewijodjqweido 50 minutes ago [-]
[flagged]
m3kw9 3 hours ago [-]
Isn’t this best of n?
3 hours ago [-]
DyslexicAtheist 1 hours ago [-]
> "I made my AI think" ...
utterly moronic.
They don't “think” ... not even in the most autistic sense of the word.
They can generate solutions by combining existing knowledge in unique ways. But they don't “think”.
- Have an AI chat model come up with an answer to a problem.
- Have it write a report discussing the details of the problem and why it's answer is correct, directed at a person or AI model who has no knowledge of the initial problem or technical field.
- Have a second AI model with no knowledge of the problem grade the report, and write it's own report either (a) asking for clarification / more information about the problem that the original model didn't provide or (b) pointing out an inconsistency in the argument posed by the original model. Give this report back to the original model and ask it to write it's own report back with either the necessary information or changes.
- Repeat until either the second AI model is convinced by the first AI model's explanation or the first AI model has implemented all the changes requested by the second AI model.
It's super clunky but has given pretty good results in the cases where I tried it lol
It seemed like a pretty good idea, though I'd guess that it would greatly increase token usage. I'd also be concerned that the LLM as a judge might struggle to grade things accurately if it wasn't also able to generate good enough answers to begin with.
The final plan you obtain is generally a lot more well rounded and thought out.
I find that amusing because the technique also works when I apply it to me. Picking flaws in your plan before revisiting it actually works.
I definitely think that directing models to approach a problem from a specific perspective can generate better or worse results. Creating a diverse set of perspectives along with critical analysis of their results should be able to produce some impressive results.
Things like this would generate a massive number of tokens, but the cost per token is definitely heading in the right direction to allow for this. There is also the possibility of setting up an AI only IRC server where anybody can connect their own models for a shared debating chamber.
It definitely makes me feel like I'm dealing with an overenthusiastic intern who is throwing stuff over the wall without checking their work, and like maybe having a second bot sitting in front of the first one being like ARE YOUR SURE ABOUT THAT could really improve things.
Even Amazon’s own offering completely made things up about Amazon’s own formats.
I’d be curious as to why that is. It seems like there would be enough training data, and for Amazon in particular it seems like they could make a validation tool the model could use.
Except... the model is missing that final step; instead it just belches out its hypothesis, all dressed up in chirpy, confident-sounding language, certain that I'm moments away from having everything working just perfectly.
I modeled it after the concept of advisors from Civilization II. It worked reasonably well though I think it was at least somewhat limited by being constrained to a single LLM (Mistral). It also lit my computer on fire.
That is something I'm also curious about. Given models (that use the same tokenisation) that are better at different things, would their be interesting things to find by analysing the logprobs for tokens generated from identical inputs (including cross feeding the generated token from one to another)
Surely there must be something notable at particular points when a model goes off on the wrong path.
Would be interesting what it comes up with with enough time and tokens.
One strategy I often use (which is much simpler and more limited than this), is to finish my message with: “Please do a round of thinking in <thinking></thinking> tags, then a round of self-critique in <critique></critique> tags, and then a final round of <thinking>, before responding.”
It works very well. Similarly just asking it to “find the 5 biggest issues with its proposal” works pretty good (the 5 forcing it to find something, even if it’s mostly irrelevant).
1. You are the assistant. Please answer the question directly.
2. You are the cross-examiner. The assistant is wrong. Explain why.
3. You are the assistant. The cross-examiner is wrong. Defend your claim.
4. You are a judge. Did either party make their case, or is another round of argumentation required?
I haven't tried this. No idea if it works. But I find it's helpful to ask ChatGPT, in separate prompts, "XYZ is true, explain why" and "XYZ is false, explain why" and see which one seems more convincing.
https://github.com/evalstate/fast-agent
I have no idea why anyone thinks this is novel. I guess that speaks to the state of HN
It seems like a good approach though. What you dont want to do is ever suggest that its wrong yourself. Usually it will just assume it is wrong.
Actually what I find impressive is when I do this and it actually pushes back to defend itself.
I asked ChatGPT and it says no, but then again it's not reliable at introspection or at revealing data about how it works.
Mistral small 3.1 and gemma 3 feel like the first semi-competent models that can be run locally, but that competence is just a seed, and they still need to be guided with a framework that keeps them on track.
Try giving it python execution in a loop and tell it to explore the world. It'll start trying to download and read news and stuff.
A consortium sends the same prompt to multiple models in parallel and the responses are all sent to one arbiter model which judges the model responses. The arbiter decides if more iterations are required. It can also be forced to iterate more until confidence-threshold or min-iterations.
Now, using the pr i made to llm-openrouter, you can save an alias to a model that includes lots of model options. For examples, you can do llm openrouter save -m qwen3 -o online -o temperature 0, system "research prompt" --name qwen-researcher
And now, you can build a consortium where one member is an online research specialist. You could make another uses JSON mode for entity extraction, and a third which writes a blind draft. The arbiter would then make use of all that and synthesize a good answer.
also, you aren't limited to cli. When you save a consortium it creates a model. You can then interact with a consortium as if it where a normal model (albeit slower and higher quality). You can then serve your custom models on an openai endpoint and use them with any chat client that supports custom openai endpoints.
The default behaviour is to output just the final synthesis, and this should conform to your user prompt. I recently added the ability to continue conversations with a consortium. In this case it only includes your user prompt and final synthesis in the conversation, so it mimics a normal chat, unlike running multiple iterations in the consortium, where full iteration history and arbiter responses are included.
UV tool install llm
llm install llm-consortium
llm install llm-model-gateway
llm consortium save qwen-gem-sonnet -m qwen3-32b -n 2 -m sonnet-3.7 -m gemini-2.5-pro --arbiter gemini-2.5-flash --confidence-threshold 95 --max-iterations 3
llm serve qwen-gem-sonnet
In this example I used -n 2 on the qwen model since it's so cheap we can include multiple instances of it in a consortium
Gemini flash works well as the arbiter for most prompts. However if your prompt has complex formatting requirements, then embedding that within an already complex consortium prompt often confuses it. In that case use gemini-2.5-pro for the arbiter. .
However, the result is not pleasant to read. Gemini solved this in their training, by doing it in two phases... and making the first phase private! ("Thinking.")
So I thought, what I need is a two-phase approach, where that "mean" output gets humanized a little bit. (It gets harsh to work in that way for more than short intervals.)
As a side note, I think there would be great value in a UI that allows a "group chat" of different LLM personalities. I don't know if such a thing exists, but I haven't seen it yet, although the message object format seems to have been designed with it in mind (e.g. every message has a name, to allow for multiple users and multiple AIs).
Even better if it supports multiple providers, since they have different strengths. (It's like getting a second opinion.)
If anything, telling GPT to be blunt seems to downgrade its IQ; it hallucinates more and makes statements without considering priors or context. I jokingly call it Reddit mode.
This is the basic idea behind autogen. They also have a web UI now in autogen studio, it's gotten a bit better. You can create "teams" of agents (with different prompts, themes, tools, etc.) and have them discuss / cooperate. I think they even added memory recently. Have a look at it, might be what you need.
https://www.microsoft.com/en-us/microsoft-copilot/blog/copil...
Are we going to replicate government bureaucracy with agents all debating topics all day long to find the best opinion?
I believe they called that machine learning.. Or re-enforced training.
I'm being slightly facetious, but my ignorant understanding of AI these days is basically the same no ?
https://www.youtube.com/watch?v=SX08NT55YhA
Sometimes its the easiest way to complete a very small task but the cost difference on the backend has to be pretty damn large. The user inevitably ends up not caring whatsoever. Its just not real to them.
Conway's game of life, but instead of colored squares with rules, they're LLM's with some kind of weighting - all chattering back and forth with one another - bubbling up somehow to cause speach/action
https://news.ycombinator.com/item?id=43835798
These models have limitations obviously, but many critiques apply equally or more to people.
If people were tasked with one shot, 10 second answers, to be written out in near errorless grammar, the LLM’s viewing our responses to prompts would be spending a lot of time discussing our limitations and how to game us into better responses. Humor, not at all humor.
I also managed to make AI critique itself and that improved code generation a ton.
For a TypeScript backend project that runs with Bun, I tell AI to also generate and run unit tests after every code change suggested by AI.
How do you solve the risk of AI writting and executing unit tests with something like `rm -rf /` and wiping your files?
Docker works but I like to keep things simple.
Deno supports revoking file access but I'd like to keep using Bun.
Docker seems like a pretty low complexity way to create an isolated environment to run automation.
There was a post here a week or so ago doing the "model checking model"-type thing with GH PRs IIRC that was interesting. I haven't had a chance to play with this idea yet.
Highly encourage others to check out Fast Agent. It has been delightful to use. It has interactive chat mode which I love and it's really tight and easy to implement.
https://github.com/evalstate/fast-agent
The whole point of reasoning models is to automatically use COT and related techniques to bring out more capabilities.
It would be interesting to see if this is doing anything that’s not already being exploited.
https://lepisma.xyz/2024/10/19/interventional-debates-for-st...
I believe there are researches on this too.
After 15 or so iterations, both assistants would keep repeating the same things and find agreement anyway. Sometimes, the chat became unhinged and useless, but 95/100 times, it was agreement.
Happy someone else made it work.
https://www.definite.app/blog/overkillm
Also I think I kind of assumed OpenAI might be doing this behind the curtain?
It’s also a lot more ethical than verbal abuse, which some people say improves the results as well.
Programming isn’t what it used to be.
but I guess that was before chain of thought models
Having an approach to recognize what is needed from the AI software, and anticipate how it may default to respond based on it's programming is critical.
Hell, even a whole quanta article. https://www.quantamagazine.org/debate-may-help-ai-models-con...
I got to meet and talk to the authors of this paper at NeurIPS. They're class acts!
Then after a few rounds of the debate where Sue asks a bunch of questions, I ask it to go to the judges - Mark, Phil, Sarah (and I add a few personalities to each of them... Sometimes I pretend they are famous moral philosophers) and then I have them each come up with a rubric and decide who is the winner.
Really fun, and helps me understand different sides of issues.
Id recommend looking into actual human experts who are trustworthy and reading them. Trying to get LLM to argue the case will just get you a lot of false information presented in a more convincing fashion
LLM/AI are extremely useful. I am in no way disputing this.
I haven't' really seen anyone else manage it without talking about ghosts or some other kind of metaphysical voodoo.
An LLM doesn't know what anything is. Just what goes around the token representation of that thing.
*EDIT* To elaborate, how can you define anything in isolation of every other concept/thing? You can't. Things are only defined by their relationships to each other, which is exactly the same thing transformer models do.
> Inductive reasoning refers to a variety of methods of reasoning in which the conclusion of an argument is supported not with deductive certainty, but with some degree of probability. Unlike deductive reasoning (such as mathematical induction), where the conclusion is certain, given the premises are correct, inductive reasoning produces conclusions that are at best probable, given the evidence provided.
Doesn't predicting the next token qualify as doing just that?
https://en.wikipedia.org/wiki/Inductive_reasoning
Just because it can infer a token doesn't mean it can infer a conclusion to an argument.
LLM cannot explain their reasoning, and that is because there is no reasoning.
We should also not be calling a pointing device "a mouse" because it's not a small rodent, there aren't any actual windows inside a computer, and I haven't seen anyone balancing their laptop trying to surf the web.
Also smartphones are not actually smart and are only barely phones.
The potential confusion of terms such as mouse/windows/surfing is not the same as calling LLM AI, and then going on to say it is "thinking" and "reasoning".
utterly moronic.
They don't “think” ... not even in the most autistic sense of the word.
They can generate solutions by combining existing knowledge in unique ways. But they don't “think”.