Language Models Pack Billions of Concepts into 12k Dimensions (nickyoder.com)
181 points by lawrenceyan 6 hours ago | 58 comments
yorwba 4 hours ago [-]
I think the author is too focused on the case where all vectors are orthogonal and as a consequence overestimates the amount of error that would be acceptable in practice. The challenge isn't keeping orthogonal vectors almost orthogonal, but keeping the distance ordering between vectors that are far from orthogonal. Even much smaller values of epsilon can give you trouble there.

So the claim that "This research suggests that current embedding dimensions (1,000-20,000) provide more than adequate capacity for representing human knowledge and reasoning." is way too optimistic in my opinion.
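
A quick numpy sketch of what I mean (the 1% distance gap, the 128 target dimensions and the Gaussian projection are made-up illustration values, not anything from the article):

    import numpy as np

    rng = np.random.default_rng(0)
    d, k, trials = 12_000, 128, 200

    flips = 0
    for _ in range(trials):
        q = rng.normal(size=d)
        a = q + rng.normal(size=d)                 # roughly distance sqrt(d) from q
        b = q + 1.01 * rng.normal(size=d)          # on average about 1% farther from q
        true_order = np.linalg.norm(q - a) < np.linalg.norm(q - b)
        P = rng.normal(size=(k, d)) / np.sqrt(k)   # random JL-style projection
        proj_order = np.linalg.norm(P @ (q - a)) < np.linalg.norm(P @ (q - b))
        flips += int(true_order != proj_order)
    print(f"neighbor ordering flipped in {flips}/{trials} projections")

With a per-distance distortion on the order of 1/sqrt(k) ≈ 9%, a 1% gap between two distances gets scrambled a large fraction of the time.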

sigmoid10 3 hours ago [-]
Since vectors are usually normalized to the surface of an n-sphere and the relevant distance for outputs (via loss functions) is cosine similarity, "near orthogonality" is what matters in practice. This means during training, you want to move unrelated representations on the sphere such that they become "more orthogonal" in the outputs. This works especially well since you are stuck with limited precision floating point numbers on any realistic hardware anyways.
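
A quick sketch of how "near orthogonality" comes almost for free for independent random directions (sizes picked arbitrarily):

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (64, 1_024, 12_288):
        v = rng.normal(size=(1_000, d))
        v /= np.linalg.norm(v, axis=1, keepdims=True)   # put each vector on the unit sphere
        cos = v @ v.T                                   # pairwise cosine similarities
        off_diag = np.abs(cos[~np.eye(1_000, dtype=bool)])
        print(d, round(float(off_diag.mean()), 4))      # shrinks roughly like 1/sqrt(d)

So training mostly has to pull genuinely related representations together; unrelated ones start out close to orthogonal anyway.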

Btw. this is not an original idea from the linked blog or the youtube video it references. The relevance of this lemma for AI (or at least neural machine learning) was brought up more than a decade ago by C. Eliasmith, as far as I know. So it has been around since long before architectures like GPT that could realistically be trained on such insanely high-dimensional world knowledge.

motorest 4 minutes ago [-]
> Since vectors are usually normalized to the surface of an n-sphere (...)

In classification tasks, each feature is normalized independently. Otherwise an entry with features Foo and Bar could end up looking less Foo purely because of the value of Bar once normalized.

So these vectors are not normalized onto an n-sphere; their codomain ends up being a hypercube.
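
A toy sketch of the difference, with three made-up features on very different scales:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3)) * np.array([1.0, 50.0, 0.01])   # features Foo, Bar, Baz

    # Per-feature (column-wise) scaling: each feature lands in [0, 1] independently,
    # so the rows live in the unit hypercube.
    per_feature = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # Per-row L2 normalization: each row lands on the unit sphere, and a large Bar
    # value shrinks the Foo component of that row.
    per_row = X / np.linalg.norm(X, axis=1, keepdims=True)

    print(per_feature.min(), per_feature.max())   # everything bounded in [0, 1]
    print(np.linalg.norm(per_row, axis=1))        # all row norms are exactly 1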

bjornsing 2 hours ago [-]
I agree the OP's argument is a bad one. But I'm still optimistic about the representational capacity of those 20k dimensions.
wer232essf 54 minutes ago [-]
[dead]
gpjanik 2 hours ago [-]
Language models don't "pack concepts" into the C dimension of one layer (I guess that's where the 12k number came from), nor do they have to be orthogonal to be viewed as distinct or separate. LLMs generally aren't trained to make distinct concepts far apart in the vector space either. The whole point of dense representations is that there's no clear separation between which concept lives where. People train sparse autoencoders to work out which neurons fire based on the topics involved. Neuronpedia demonstrates it very nicely: https://www.neuronpedia.org/.
prmph 22 minutes ago [-]
Agreed, if you relax the requirement for perfect orthogonality, then, yes, you can pack in much more info. You basically introduce additional (fractional) dimensions clustered with the main dimensions. Put another way, many concepts are not orthogonal, but have some commonality or correlation.

So nothing earth-shattering here. The article is also filled with words like "remarkable", "fascinating", "profound", etc. that make me feel like some level of subliminal manipulation is going on. Maybe some use of an LLM?

rob_c 1 minute ago [-]
Ok.

Now try to separate "learning the language" from "learning the data".

If we have a model pre-trained on language, does it then learn concepts more quickly, at the same rate, or differently?

Can we compress just the data, lossily, into an LLM-like kernel that regenerates the input to a given level of fidelity?

rossant 3 hours ago [-]
Tangential, but the ChatGPT vibe of most of the article is very distracting and annoying. And I say this as someone who consistently uses AI to refine my English. However, I try to avoid letting it reformulate too dramatically, asking it specifically to only fix grammar and non-idiomatic parts while keeping the tone and formulation as much as possible.

Beyond that, this mathematical observation is genuinely fascinating. It points to a crucial insight into how large language models and other AI systems function. By delving into the way high-dimensional data can be projected into lower-dimensional spaces while preserving its structure, we see a crucial mechanism that allows these models to operate efficiently and scale effectively.

airstrike 3 hours ago [-]
Ironically, the use of "fascinating", "crucial" and "delving" in your second paragraph, as well as its overall structure, makes it read very much like it was filtered through ChatGPT.
OtherShrezzing 2 hours ago [-]
I think that was satire
rossant 17 minutes ago [-]
Correct
danlitt 2 hours ago [-]
You have to hope so.
GolDDranks 3 hours ago [-]
Which parts felt GPT'y to you? The list-happy style?
rebolek 2 hours ago [-]
For me, the GPT feeling started with "tangential" and ended with "effectively".
dwohnitmok 4 hours ago [-]
This set of intuitions, and the Johnson-Lindenstrauss lemma in particular, is what powers a lot of the research effort behind SAEs (Sparse Autoencoders) in the field of mechanistic interpretability in AI safety.

A lot of the ideas are explored in more detail in Anthropic's 2022 paper that's one of the foundational papers in SAE research: https://transformer-circuits.pub/2022/toy_model/index.html

emil-lp 4 hours ago [-]
Where can I read the actual paper? Where is it published?
yorwba 4 hours ago [-]
That is the actual paper, it's published on transformer-circuits.pub.
emil-lp 4 hours ago [-]
It's not peer-reviewed?
mutkach 2 hours ago [-]
How would you peer-review something like that? Or rather, how would you even _reproduce_ any of that? Some inception-based model with some synthetic data? Where is the data? I'm sure the paper was written in good faith, but all it leads to is just more crackpottery. If I had a penny for each time I've heard that LLMs are quantum in nature "because softmax is essentially a wave function collapse" and "because superposition," then I would have more than one penny!
yorwba 3 hours ago [-]
Google Scholar claims 380 citations, which is, I think, a respectable number of peers to have reviewed it.
emil-lp 3 hours ago [-]
That's not at all how peer review works.
yorwba 1 hour ago [-]
It's not how pre-publication peer review works. There, the problem is that many papers aren't worth reading, but to determine whether it's worth reading or not, someone has to read it and find out. So the work of reading papers of unknown quality is farmed out over a large number of people each reading a small number of randomly-assigned papers.

If somebody's paper does not get assigned as mandatory reading for random reviewers, but people read it anyway and cite it in their own work, they're doing a form of post-publication peer review. What additional information do you think pre-publication peer review would give you?

adroniser 21 minutes ago [-]
Peer review would encourage less hand-wavy language and more precise claims. Reviewers would penalize the authors for bringing up bizarre analogies to physics concepts for seemingly no reason. They would criticize the fact that the authors spend the whole post talking about features without a concrete definition of a feature.

The sloppiness of the circuits thread blog posts has been very damaging to the health of the field, in my opinion. People first learn about mech interp from these blog posts, and then they adopt a similarly sloppy style in discussion.

Frankly, the whole field currently is just a big circle jerk, and it's hard not to think these blog posts are responsible for that.

I mean do you actually think this kind of slop would be publishable in NeurIPS if they submitted the blog post as it is?

golem14 3 hours ago [-]
Unless it's part of a link review farm. I haven't looked, and you are probably correct; but I would do a bit of research before making any assumptions.
aabhay 4 hours ago [-]
My intuition of this problem is much simpler — assuming there's some rough hierarchy of concepts, you can guesstimate how many concepts can exist in a 12,000-d space by counting combinations of the dimensions. In that world, each concept is mutually orthogonal to every other concept in at least some dimension. While that doesn't mean their cosine distance is large, it does mean you're guaranteed a function that can linearly separate the two concepts.

It means you get 12,000! (factorial) concepts in the limit case, more than enough room to fit a taxonomy.

OgsyedIE 4 hours ago [-]
You can only get 12,000! concepts if you pair each concept with an ordering of the dimensions, which models do not do. A vector in a model that has [weight_1, weight_2, ... weight_12000] is identical to the vector [weight_2, weight_1, ..., weight_12000] within the larger model.

Instead, a naive mental model of a language model is to have a positive, negative or zero trit in each axis: 3^12,000 concepts, which is a much lower number than 12000!. Then in practice, almost every vector in the model has all but a few dozen identified axes zeroed because of the limitations of training time.

aabhay 2 hours ago [-]
You're right. I gave the wrong number. My model implies 2^12000 concepts, because you choose whether or not to include each dimension when forming your subspace.

I’m not even referring to the values within that subspace yet, and so once you pick a concept you still get the N degrees of freedom to create a complex manifold.

The main value of the mental model is to build an intuition for how “sparse” high dimensional vectors are without resorting to a 3D sphere.

bjornsing 2 hours ago [-]
> While that doesn’t mean their cosine distance is large

There’s a lot of devil in this detail.

Morizero 4 hours ago [-]
That number is far, far, far greater than the number of atoms in the universe (~10^43741 >>>>>>>> ~10^80).
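
Quick sanity check of the magnitudes (log base 10, using Python's standard math module):

    import math

    print(round(math.lgamma(12_001) / math.log(10)))   # log10(12000!)  -> about 43741
    print(round(12_000 * math.log10(3)))               # log10(3^12000) -> about 5725
    print(round(12_000 * math.log10(2)))               # log10(2^12000) -> about 3612
    # versus roughly 10^80 atoms in the observable universe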
bmacho 1 hour ago [-]
Say there are 10^80 atoms; then there are something like 2^(10^80) possible things, and 2^(2^(10^80)) groupings/categorizations/orderings of the things, and so on. You can keep going higher, and the number of possibilities grows really fast.
cleansy 3 hours ago [-]
Not surprising since concepts are virtual. There is a person, a person with a partner is a couple. A couple with a kid is a family. That’s 5 concepts alone.
am17an 3 hours ago [-]
Somehow that's still an understatement
jibal 37 minutes ago [-]
There are no "real-world concepts" or "semantic meaning" in LLMs, there are only syntactic relationships among text tokens.
singularity2001 3 hours ago [-]
If you've ever played 20 Questions, you know that you don't need 1000 dimensions for a billion concepts. These huge vectors can represent way more complex information than just a billion concepts.

In fact they can pack complete poems, with or without typos, and you can ask where in the poem the typo is, which is exactly what happens if you paste one into GPT: somewhere in an internal layer it will distinguish exactly that.

giveita 2 hours ago [-]
That's not the vector doing that, though; it's the model. The model is like a trillion-dimensional vector.
DougBTX 2 hours ago [-]
With binary vectors, 20 dimensions will get you just over a million concepts. For a billion you’ll need 30 questions.
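
Quick check:

    print(2**20, 2**30)   # 1048576 1073741824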
bigdict 5 hours ago [-]
What's the point of the relu in the loss function? Its inputs are nonnegative anyway.
Nevermark 5 hours ago [-]
Let's try to keep things positive.
andy_ppp 2 hours ago [-]
In reality it's probably not a ReLU; modern LLMs use GELU or something more advanced.
meindnoch 2 hours ago [-]
Sometimes a cosmic ray might hit the sign bit of the register and flip it to a negative value. So it is useful to pass it through a rectifier to ensure it's never negative, even in this rare case.
GolDDranks 4 hours ago [-]
I wondered the same. Seems like it would just make a V-shaped loss around the zero, but abs has that property already!
fancyfredbot 3 hours ago [-]
RELU would have made it flat below zero ( _/ not \/). Adding the abs first just makes RELU do nothing.
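
i.e., a trivial numpy check:

    import numpy as np

    x = np.random.randn(5)
    # relu(|x|) == |x| for every x: relu is the identity on non-negative inputs
    assert np.array_equal(np.maximum(np.abs(x), 0.0), np.abs(x))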
fancyfredbot 3 hours ago [-]
I thought the belt and braces approach was a valuable contribution to AI safety. Better safe than sorry with these troublesome negative numbers!
naniwaduni 2 hours ago [-]
Well, I guess it's helping to distinguish authors who are doing arithmetic they understand from ones who are copying received incantations around...
rini17 3 hours ago [-]
I got a bit lost: first "C is a constant that determines the probability of success", but then they set C between 4 and 8. Probability should be between 0 and 1, so how does it relate to C?
emil-lp 2 hours ago [-]
It's the epsilon^-2 term that controls the distortion; C is tied to the success probability, and the two multiply together in the bound on the target dimension. If you want to decrease epsilon (or push the success probability up), the required dimension goes up.
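
A sketch of the usual form of the bound (this is the generic JL statement, not necessarily the article's exact constants): with k >= C * ln(n) / eps^2 target dimensions, a random projection preserves all pairwise distances among n points to within a factor of 1 ± eps, and a larger C pushes the failure probability (over the random choice of projection) down.

    import math

    n, eps = 1_000_000, 0.1
    for C in (4, 8):
        k = math.ceil(C * math.log(n) / eps**2)
        print(f"C={C}, eps={eps}: need k >= {k} dimensions for n={n} points")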
niemandhier 4 hours ago [-]
Wow, I think I might just have grasped one of the sources of the problems we keep seeing with LLMs.

Johnson-Lindenstrauss guarantees a distance-preserving embedding for a finite set of points into a space whose dimension depends on the number of points.

It does not say anything about preserving the underlying topology of the continuous high-dimensional manifold; that would be Takens/Whitney-style embedding results (and Sauer–Yorke for attractors).

The embedding dimensions needed to fulfil Takens are related to the original manifold's dimension and not the number of points.

It's quite probable that we observe violations of topological features of the original manifold when using our too-low-dimensional embedded version to interpolate.

I used AI to sort the hodgepodge of math in my head into something another human could understand; the edited result is below:

=== AI in use ===

If you want to resolve an attractor down to a spatial scale rho, you need about n ≈ C * rho^(-d_B) sample points (here d_B is the box-counting/fractal dimension).

The Johnson–Lindenstrauss (JL) lemma says that to preserve all pairwise distances among n points within a factor 1±ε, you need a target dimension

k ≳ (d_B / ε^2) * log(C / rho).

So as you ask for finer resolution (rho → 0), the required k must grow. If you keep k fixed (i.e., you embed into a dimension that’s too low), there is a smallest resolvable scale

rho* (roughly rho* ≳ C * exp(-(ε^2/d_B) * k), up to constants),

below which you can't keep all distances separated: points that are far apart on the true attractor will show up close together after projection. That's called "folding" and might be the source of some of the problems we observe.

=== AI end ===

Bottom line: JL protects distance geometry for a finite sample at a chosen resolution; if you push the resolution finer without increasing k, collisions are inevitable. This is perfectly consistent with the embedding theorems for dynamical systems, which require higher dimensions to get a globally one-to-one (no-folds) representation of the entire attractor.

If someone is bored and would like to discuss this, feel free to email me.

sdl 3 hours ago [-]
So basically the map projection problem [1] in higher dimensions?

[1] https://en.m.wikipedia.org/wiki/Map_projection

niemandhier 1 hour ago [-]
Worse. Map projection means that you cannot have a mapping that preserves elements of the internal geometry: angles and such.

Violation of topology means that a surface is wrongly mapped to one that intersects itself: think Klein bottle.

https://en.wikipedia.org/wiki/Klein_bottle

fedeb95 2 hours ago [-]
*string representations of concepts
js8 5 hours ago [-]
You can also imagine a similar thing with binary vectors. There, two vectors are "orthogonal" if they share no bits that are set to one. So you can encode a huge number of concepts using only a small number of set bits in modestly sized vectors, and most of them will be orthogonal.
phreeza 4 hours ago [-]
If they are only orthogonal if they share no bits that are set to one, only one vector, the complement, will be orthogonal, no?

Edit: this is wrong as respondents point out. Clearly I shouldn't be commenting before having my first coffee.

yznovyak 4 hours ago [-]
I don't think so. For n=3 you can have 000, 001, 010, 100. All 4 (n+1) are pairwise orthogonal. However, I don't think js8 is correct, as it looks like in {0,1}^n you can't have more than n+1 mutually orthogonal vectors: if any vector has a 1 in some position, no other vector can have a 1 in the same position.
js8 4 hours ago [-]
It's not correct to call them orthogonal, because I don't think the definition is a dot product. But that aside, yes, an orthogonal basis can only have as many elements as there are dimensions. The article also mentions that, and then introduces "quasi-orthogonality", which means the dot product is not zero but very small. On bitstrings, it would correspond to overlapping on only a small number of bits. I should have been clearer in my offhand remark. :-)
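
A quick sketch of how rarely random sparse bitstrings overlap (the sizes here are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    d, bits, n = 12_288, 32, 1_000            # dimension, set bits per vector, number of vectors
    vecs = np.zeros((n, d), dtype=np.int32)
    for v in vecs:
        v[rng.choice(d, size=bits, replace=False)] = 1

    overlaps = vecs @ vecs.T                   # shared set bits for each pair
    off_diag = overlaps[~np.eye(n, dtype=bool)]
    print(off_diag.mean(), off_diag.max())     # mean is about bits^2/d, i.e. ~0.08 shared bits

Most random pairs share no set bits at all ("orthogonal" in this loose sense), and the rest overlap on only a bit or two.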
prerok 4 hours ago [-]
Hmm, I think one correction: is (0,0,0) actually a vector? I think that, by definition, an n-dimensional space can have at most n vectors which are all orthogonal to one another.
asplake 4 hours ago [-]
By the original definition, they can share bits that are set to zero and still be orthogonal. Think of the bits as basis vectors – if they have none in common, they are orthogonal.
js8 4 hours ago [-]
For example, 1010 and 0101 are orthogonal, but 1010 and 0011 are not (share the 3rd bit). Though calling them orthogonal is not quite right.
henearkr 4 hours ago [-]
Your definition of orthogonal is incorrect, in this case.

In the case of binary vectors, don't forget you are working with the finite field of two elements {0, 1}, and use XOR.