Humanist Discussion Group, Vol. 39, No. 82.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
www.dhhumanist.org
Submit to: humanist@dhhumanist.org
Date: 2025-07-16 09:58:29+00:00
From: Tim Smithers <tim.smithers@cantab.net>
Subject: Re: [Humanist] 39.49: repetition vs intelligence
First. Happy Birthday Willard! May you have many more!
Second. I'm taking the liberty to wind the DH List back to
Humanist Discussion Group, Vol. 39, No. 49, 12 June, 2025,
at 08:07.
Dear Gabriel,
I see you're still wedded to, or could that be welded to, your
claim that LLMs represent and process the meanings and
semantic relationships we [humans] use and make in our
word-based languaging. You are not alone, of course. This
same fairy story is told by many working on, or commenting on,
or trying to explain LLMs and their associated computational
techniques. It's as if we're not allowed to work on, or
comment upon, LLMs unless we first swear unending belief in
this fantasy.
This fairy story is, in my view, mistaken and misleading. It
is like an Emperor who has no clothes on. Good scholarship
should dismiss such fantasies, not propagate them. The truth
of the matter is not, I think, up for debate, it is to be
understood and carefully shown to others, so that mistakes and
misunderstanding may be identified and corrected, and our
understanding of what is really going on, why, and how,
strengthened.
So, I am not going to respond point by point to your last
post. I'm going to do something different. I'm going to show
in some detail what is implemented by an LLM. Not all of it,
just the initial front-end stages. And, I'm going to ask some
questions of you and any other fairy story believers as I go.
I do this so that you, and anybody else here, may then point
out where my mistakes and misunderstandings are, and how to
correct these. I am confident of my understanding, but this
does not mean it is therefore all well formed and correct.
There are plenty of other people here who know, understand,
and use systems with LLMs inside them, and probably some who
build these kinds of systems, or parts of them. I would ask
these people to join in here, to say where my understanding is
not correct, or not sufficient, and to say where it is
correct, in their judgement. And I would ask others here not
so technically aware of the insides of LLM implementations to
tell us where you need clarification and further explanations,
so that we might arrive at a good explanation of what really
goes on inside an LLM, and thus clear away the dominant fairy
story and its associated fantasies.
Here goes. It's long. [After this I'll take lessons from
anybody who offers them on how to write short Humanist posts.]
Preface
In what follows I am guided by the following quotations.
"Ostension, showing things, plays an important role in
discussions of representation and meaning."
-- David Zeitlyn [1]
"An explanation is a statement of what is there in reality
and how it works and why ..."
"A good explanation is one that is hard to vary while
still explaining what it purports to explain. ..."
-- David Deutsch [2]
"Writing is the solid form of language, the precipitate."
-- Robert Bringhurst [3]
[Numbered notes are at the bottom.]
1 What are we really talking about?
Good ostension is achieved, I assert, by first being clear
about what it is we want to show things about.
Human languaging is first and foremost sound based, it's
spoken, or it's sign based and thus signed. In the case of
[most?] spoken languages, words are formed by combining
standard and shared sounds in specific ways: we put together
the phonemes of the language(s) we speak to say words, and we
put words together in linear sequences to form meaningful
phrases and sentences, using the grammar(s) of the
language(s) we speak in. Right?
Well, no, not really. Spoken language is often not
grammatical, not strictly, but nor is it ungrammatical. In
successful spoken languaging there is enough grammatical
structure for meaning to be built from what is heard. And,
enough recognisable sounds for a listener to build words,
then phrases and sentences, from what the speaker is saying,
if they understand the language(s) being used well enough.
And, there's usually plenty of hand, arm, and body motions,
facial expressions, and other sound making, which are all a
natural part of the speaking and listening in our languaging.
Real human languaging is full theatre. We should keep this
in mind.
Writing is another kind of human languaging, a different
kind. Writing is an approximation of the full theatre of
spoken languaging. Writing necessarily simplifies out much
of this theatre, but it leaves marks of what is said and how
it is said: Bringhurst's precipitate. And these marks can be
read so that what was said may be heard again [perhaps only
silently in our heads], and from this hearing we may build
understandings of what was written.
This simplification in written languaging is mostly achieved
using what is sometimes called the Alphabetic Principle: the
phonemes of a spoken language are each associated with a
particular symbol. We call these special symbols the letters
of the alphabets of the languages we write in. These
letters, graphemes is the posh name, form the alphabets we
use to write with. They are symbols for the standard shared
units of sound we use to make spoken words with. But letters
are only one of the functional units we need to write with.
Usable writing systems have more kinds of symbols.
Alphabets are not the only symbols we need and use to write
what we want to say in the way we want to say it, and to make
our writing readable. We use other [mostly] standard symbols
to add punctuation to the sequences of words we write using
letters. "[P]unctuation tells the reader how to hear your
writing" is how Ursula Le Guin nicely explains what
punctuation is for [6]. And, in our writing, we also use
numerals, symbols for numbers, sometimes from different
numeral systems; Roman and Arabic, for example. We use the
symbols of mathematics, when we write about things
mathematical. And we use a variety of special symbols for a
variety of other things we talk about in our writing, more
and more, it seems; for currencies; trade marks; emojis; and
more. Our writing systems are thus systems made of the
different symbol sets we need to write what we want to talk
about, in the way we want to say it.
In digital form we use a binary character system to denote
all these different kinds of writing system symbols. This
character system needs to cover all the different symbols we
use in our writing, and this can be a lot, especially if we
write in different languages, as we often do.
All this is here to make a distinction we will need in what
follows: the distinction between the different symbol systems
we combine in our writing systems, and the characters we use
to digitally denote all these different kinds of symbols.
Text is all the marks left when we [humans] write, the
precipitate, it's not just combinations of letters we read as
words. Text is all the marks of all the symbols from the
different symbol systems we need to write with. The
alphabetic letter constructions in text are not the only
things readers need to build an understanding of what is being
said. All the other symbols are needed. As is all type
setting and typographical design needed to make any text
readable. That's why they are there, and need to be there.
Text is not just made up of what look like words to us as
readers.
Writing in different languages may use the same characters,
but not the same letters; not the same alphabet. The English
alphabet, for example, has no double character letters, but
the Welsh alphabet does. Written Welsh uses dd, ff, ng, ll,
ph, and rh for specific sounds in spoken Welsh. Digraphs is
the posh name for these double character letters. (Saying
that these letters are made of two letters, as we often see,
is confusing; they are made of two characters, and often use
the same characters used to make single character letters in
an alphabet.) The alphabet for written Basque uses the
double character letters dd, ll, rr, ts, tt, tx, tz, with the
sounds of the tx and tz letters being hard for me to both
hear and to make, for not being a skilled speaker of Basque.
It's the same for the Welsh sounds. And written Basque
doesn't use c, ç, q, v, w, y, except for loanwords.
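The difference between counting characters and counting letters can be sketched in a few lines of Python. This is only an illustration: the digraph list is the one given above for Welsh, the function name is made up here, and the greedy matching is an assumption that will misread any word in which the two characters happen to be separate letters.

```python
# Double character letters (digraphs) as listed above for written Welsh.
WELSH_DIGRAPHS = ("dd", "ff", "ng", "ll", "ph", "rh")

def split_letters(word, digraphs=WELSH_DIGRAPHS):
    """Greedily split a word into alphabet letters, preferring digraphs."""
    lowered = word.lower()
    letters = []
    i = 0
    while i < len(lowered):
        pair = lowered[i:i + 2]
        if pair in digraphs:
            letters.append(pair)      # two characters, one letter
            i += 2
        else:
            letters.append(lowered[i])
            i += 1
    return letters

word = "Llanelli"
print(len(word), "characters")
print(len(split_letters(word)), "letters:", split_letters(word))
```

The same string of characters thus has a different length depending on which alphabet we read it with.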
It's important in what follows to remember that the letters
of the alphabets we use to write with are symbols for sounds.
They are not symbols for bits of meaning: the meanings, to us
as readers and writers of words, are not built from the
letters we use to write them with. It is the sound of words
that is built from the letters we write them with, and read
them from. In all our different writing systems, there are
no symbols for meanings or semantic relationships from which
we build the meanings of things we read as words. Writing
systems are not semantic representation building systems.
Writing systems are made of multiple systems of symbols
needed to present well enough in written form the sound of
spoken languaging.
Text is thus sequences, usually long sequences, of the
characters used to make all the different kinds of symbols
we, as writers, use to write what we want to say in the way
we want to say it, and what we, as readers, use to hear, out
loud or silently in our heads, what the writer wanted to say
in the way they wanted us to hear it.
Text is thus not words. Nor is text just letters from the
alphabets we combine to write words with. Text is
characters, lots of them, from the different symbol systems
we need to write and read with. In digital form all these
different symbols, including embedded type setting and
formatting instructions, are encoded using the same digital
character system. We do not use different binary character
systems for the different symbol systems in our writing
systems.
2 The gigantic collection of text
To build an LLM we must first assemble a gigantic collection
of text (GCT) from human writing in all the different
languages we want our LLM to deal in, often, including
computer programming languages. From all this writing we
must remove all non-text material, images, pictures,
photographs, diagrams, etc, and rip out all font selection
and typographical formatting instructions; all the setting of
the text on the page or screen; everything we put there to
make the text readable and understandable by us readers.
So, first question: why is it okay to rip out all this type
setting and typographical design? If you think good
typographic design is not important for good reading and thus
good understanding of what we read, read more Bringhurst, and
disagree with him!
Good typography is not just decoration. It's needed for good
accurate reading and understanding of what was written, and a
lot of time and effort goes into making this work well. If
LLMs really deal in words and meanings, why is all this
needed aid to reading the text first removed? Why is it not
needed, if, what LLMs really do is represent the semantics of
text? This is a question for the LLM builders to answer,
and, I would say, others working in computational
linguistics. And, they need to explain why they then need to
add at least some of this text formatting back in when
presenting the output of LLMs to users of systems like
ChatGPT, and other automatic text generators. The simple,
and boring, type setting we currently get from things like
ChatGPT is not generated by the LLM inside it. It's all
added on by some post-processing of the generated text
tokens. Or, are we all now to believe that good
typographical design plays no part in real language
understanding from writing?
3 On token making
All the text in our GCT must first be turned into a [very
long] sequence of text tokens. As must be the text of any
input prompt. To do this so called text tokenisation we must
first build a set of text tokens to use. Here I will show
how this is done for OpenAI GPT systems, including ChatGPT.
Other automatic text generating systems may use different
token building procedures, but they share the same basic
ideas for how this is done.
First, all the GCT text is encoded using one standard, and
pre-specified, single byte character set. For this UTF-8 is
used (see <https://en.wikipedia.org/wiki/UTF-8>) and, in the
case of the GPT systems, a basic set of 245 single 8 bit byte
UTF-8 characters is used to encode all the different kinds of
writing system symbols found in our GCT. (Most of the text in
our GCT will already be in UTF-8 since almost all webpages
are transmitted as UTF-8 characters.) These 245 UTF-8
characters are then made our first GCT text tokens, so we
start with all the text in our GCT encoded using only these
basic tokens.
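What byte-level encoding does to text can be seen with Python's built-in UTF-8 encoder. This is a minimal sketch, not the GPT pre-processing code; the function name is an assumption made for illustration. Note that a single byte can take 256 different values, and that symbols outside the basic Latin range occupy two or more bytes each.

```python
def atomic_tokens(text):
    """Return the UTF-8 byte values of text, one integer per byte."""
    return list(text.encode("utf-8"))

# Letters, a letter with a diacritic, a currency symbol, and some arithmetic
# all come out as the same kind of thing: a list of byte values.
for sample in ["the", "ç", "€", "2+2"]:
    print(repr(sample), "->", atomic_tokens(sample))
```

Nothing in the resulting byte values says which of them came from a letter, a numeral, or a punctuation mark.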
All the distinctions between alphabetic letters, punctuation
marks, numerals, maths, and all the other things we find in
written text, are removed by this tokenisation. All the text
is turned into single character tokens using only the 245
UTF-8 characters we define as our basic set of characters.
So, how, I would ask the fairy story tellers, do these text
tokens carry, or magically possess, any meaning of words,
when we've lost all sight of things we call letters, let
alone words, in this basic text token representation of our
GCT?
These basic text tokens are treated as "atomic" text tokens,
and are used to build multi-character text tokens using a
Byte-Pair Encoding (BPE) procedure [4], which works as
follows.
An automated procedure is applied to our GCT, now all in
atomic text token form, to count how many times adjacent
pairs of atomic text tokens occur in our GCT. The adjacent
pairs of atomic text tokens which have the highest counts are
then made into new two-character text tokens and added to our
token set. Say, for example, the combination of a <t> token
followed immediately by a <h> token is one such often
occurring pair of atomic tokens in our GCT, then a <th> token
is made and added to our token set, and, all occurrences of a
<t> followed by a <h> in our GCT are replaced by this new
text token.
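A single counting and merging step of this kind can be sketched as follows. It is a toy version under stated assumptions: it works on the characters of a short string rather than on the bytes of a GCT, it merges only the single most frequent pair, and the function names are made up for illustration.

```python
from collections import Counter

def count_pairs(tokens):
    """Count how often each adjacent pair of tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))

def merge_pair(tokens, pair):
    """Replace every occurrence of the given adjacent pair with one joined token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("the thin thing")          # one atomic token per character
best_pair, count = count_pairs(tokens).most_common(1)[0]
tokens = merge_pair(tokens, best_pair)   # every <t><h> becomes <th>
print(best_pair, count)
print(tokens)
```

Only counts of adjacent tokens enter into the decision about which new token to make.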
This way we make lots of new two character text tokens, and
add them to our text token set. How many depends upon what
we decide is the threshold of "frequently occurring" adjacent
pairs of atomic text tokens: a decision we as the LLM
builders need to make. It's part of the Dark Art of LLM
building. I don't know what this threshold is for the GPT
systems. If someone here does, please tell us.
Next, the same counting procedure is used again to count how
often sequences of three adjacent text characters occur in
the (now modified) GCT, where these three character sequences
are all built by pairing one atomic token with one of the
just made two character tokens. So, again, it's a pair of
tokens that are put together, but this time from one atomic
single character token and one two character token.
Say, for example, one of these new three character sequences
that occur many times in our GCT is "thr", made from <th> +
<r>, then <thr> is added as a new token to our token set,
and, again, all occurrences of <th> followed by <r> in our
GCT are replaced by our new <thr> token. Another common three
character sequence that will be built at this stage will
likely be <the>, from <th> + <e>, and this too will be added
to our token set, and used to replace all <th> and <e>
sequences in our GCT. As before, what counts as a frequent
enough three character sequence for it to be made a new text
token must be decided and specified by us the LLM builders,
and it doesn't have to be the same number we used in the
building of the two character text tokens. More Dark Art.
This procedure is repeated, each time counting the occurrence
of longer sequences of pairs of tokens made from the last
made compound tokens and the set of atomic tokens. Pairs of
multi-character tokens are not used to make longer character
sequence tokens, only atomic tokens and the latest compound
tokens are used. This repetition continues until the magic
number of needed text tokens is built.
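The repeated procedure can be sketched as a loop that stops after a chosen number of merges. This follows the generic BPE scheme of [4] as it is commonly implemented, and so differs from the description above in two ways that should be read as assumptions: it merges one pair per pass (the most frequent) rather than all pairs above a threshold, and it allows any two adjacent tokens, atomic or compound, to be merged. The function name and the toy text are made up for illustration.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Learn up to num_merges merges from text; return the merges and the final tokens."""
    tokens = list(text)                  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break                        # no pair repeats, so nothing is gained
        merges.append((left, right))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = train_bpe("the thin thing then", num_merges=4)
print(merges)
print(tokens)
```

The stopping condition here is simply the number of merges asked for, which is where the "magic number" comes in.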
In the case of GPT-3, this magic number is 100,245 text
tokens. Why are 100,000 compound text tokens made and added
to the atomic text token set? I don't know the answer to
this. It's hidden in the Dark Art of LLM building. But, we
can see that there is a trade-off here. The more compound
tokens we have, the more compact any processed text will be,
which saves memory and computation, but, the more tokens we
have, the more embedding vectors we will need to build and
use, which uses more memory and computation. Somehow we need
to find a "Goldilocks" value: not too large, not too small,
but just right, to borrow from another fairy story. How, I
wonder, does all the meaning and semantics of the text we're
processing here get to help decide on this Goldilocks value?
Please tell me. And why is 100,000 compound text tokens
"semantically" adequate here, and how is it "semantically"
adequate, and not just a system engineering decision hidden
in the Dark Art of LLM building, given that BPE was
originally designed and used as a text compression procedure,
and not as some kind of magic semantic representation
building procedure? I see plenty of binary character
constructing going on here, but no word meaning and semantic
relationships being identified and represented.
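One side of this trade-off is simple arithmetic: the embedding table needs one vector per token. The sketch below uses the GPT-3 vector length of 12,288 quoted elsewhere in this post and the token counts given above; the middle vocabulary size is a made-up comparison point, and the function name is an assumption. The other side of the trade-off, how much shorter the token sequences become, depends on the text and is not estimated here.

```python
def embedding_table_size(vocab_size, vector_length):
    """Number of stored values needed for one vector per token."""
    return vocab_size * vector_length

VECTOR_LENGTH = 12_288   # GPT-3 vector length as quoted in this post

# 245 atomic tokens only, a hypothetical mid-sized set, and the full set.
for vocab_size in (245, 50_245, 100_245):
    size = embedding_table_size(vocab_size, VECTOR_LENGTH)
    print(f"{vocab_size:>7} tokens -> {size:>13,} stored values")
```

With 100,245 tokens this comes to a little over 1.2 thousand million stored values for the token vectors alone.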
There are some restrictions applied to this text token
building procedure. For example, the tokens <’s>, <’t>,
<’re>, <’ve>, <’m>, <’ll> and <’d> are not combined with any
atomic tokens, so the GPT token set cannot include tokens
like <we’ll> or <they’d>. Why not? What's not good about
having these as text tokens? Another restriction is that
tokens for numerals are only combined with other numeral
tokens. So, some distinction between symbols from the
different symbol systems of our writing system, is added back
in here, via the back door. But, combinations like < a> -- a
single space character followed by the character 'a' -- do
get built: single space is not an alphabet symbol. You can
check on the final GPT token set here
<https://emaggiori.com/chatgpt-all-tokens/>. Use the search
option to see if your favourite text tokens are there! Such
as the common "words" like <;a>, or <**> and <****>, or <()>,
or <indow>, just to pick a tiny number from the 100,245
tokens that all, so the fairy story tells us, have perfectly
clear meanings and semantics all now safely captured in their
respective BPE built text tokens, right? No, of course not.
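Restrictions of this kind are commonly imposed by splitting the text with a regular expression before any pairs are counted, so that no merge can cross a split boundary. The pattern below is a simplified, ASCII-only assumption written for illustration (with straight apostrophes); it is not the pattern used in any GPT system.

```python
import re

# Hypothetical, simplified pre-tokenisation pattern (illustration only).
PRE_TOKEN_PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # contraction endings are split off
    r"| ?[A-Za-z]+"              # runs of letters, with an optional leading space
    r"| ?[0-9]+"                 # runs of numerals, kept apart from letters
    r"| ?[^\sA-Za-z0-9]+"        # runs of punctuation and other symbols
    r"|\s+"                      # any remaining whitespace
)

def pre_tokenise(text):
    """Split text into chunks; pair merging would then happen only inside a chunk."""
    return PRE_TOKEN_PATTERN.findall(text)

print(pre_tokenise("we'll pay 100 euros"))
```

Because "we" and "'ll" land in different chunks, no amount of pair counting can ever build a <we'll> token, and the leading space stays attached to the chunk that follows it.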
There are, as you'll notice, many text tokens that look like
words to us readers, which the tokenisation procedure has
constructed, and which happen to be combinations of
characters we know as symbols of letters used to spell words
with. But this is an artefact of the token making procedure,
and not an intended outcome. And, as we saw at the start,
letters carry no semantics, not even bits of meaning we build
from words when we read them. So, even for these word-like
text tokens, how is any "meaning" put into them?
In the standard fairy story versions of this token building
procedure, the compound text tokens are called "subwords",
sometimes with the added hand-wave that they capture the
sub-meaning of whole words. How they do this is, of course,
never explained. The reader, as far as I can see, is left to
just think "Oh yes, of course, these tokens pick up bits of
meaning which can just be added together in a simple linear
way to make bigger meanings", or should that be
"super-meaning"?
This "subword" claim is, in my view, plain rubbish. Meaning,
as we understand and use in our reading and writing does not
work this way. The term "subword" is an example of what I
call Hopeful Terminology; a term full of hope. The hope here
is that if we call these tokens "subwords" -- the vast
majority of which can't be read as having any meaning, or
sub-meaning -- they will magically become what we call them,
and, like whole words, thus somehow carry bits of meaning,
whatever that can mean. There is no more than hope and
hand-waving here. There is, I would say, no good enough
[David Deutsch] explanation for this. This is an empty
explanation; the weakest kind of explanation we can have, and
thus useless. But! If I am mistaken in calling all this
rubbish, please, someone, show us how and why I am mistaken.
I'll be happy to see this.
(There are plenty more Hopeful Terms used in all this
Generative AI business, such as "Language Model", "language
processing", "context", and more, but see [5].)
In summary, in all this token making no explicit account is
taken of the alphabets of the languages used to write the text
in our GCT, and nor is any account taken of the syntax and
morphology of these languages. No distinction is made
between the letters used to write words with and all the
other kinds of symbols needed in writing. No attempt is made
to build any kind of low level representation of what results
from our reading of text. So, what, exactly is the basis for
the claim that these text tokens can be used to represent
words and the meanings we, as real languagers, build from
reading them? What, exactly, is meant by "meaning" here,
other than a low level encoding of the common character
sequences found in a very large quantity of text from human
writing and programming? Which, of course, is not meaning.
This BPE text token building works well as a way to turn just
about everything we find in our GCT into binary encoded
characters of one, two, three, and more, bytes each. But it
squashes out all of the distinctions between the different
kinds of symbols in the writing systems we use to write and
read with. We just have combinations of one, two, three, and
more, UTF-8 binary characters.
4 On putting vectors to bed
Having built our text token set we next need to build a so
called embedding vector for each token. Here I will simplify
out some technical details, but still try to show well enough
how embedding vector building is done. We build a two layer
Connectionist system to take as input a text token, and to
produce as output a big so called embedding vector.
Older procedures used word2vec developed by people at Google,
but more recent procedures use the GloVe (Global Vectors)
algorithm developed at Stanford [7], so this is what I will
describe. From [7] we read that the "... main intuition
underlying the [GloVe] model is the simple observation that
ratios of word-word co-occurrence probabilities have the
potential for encoding some form of meaning." Notice here,
and remember, the "potential for" and the "some form of
meaning". The "potential" claimed here is not demonstrated;
it has no substantial, multi-language, shared, and accepted,
empirical basis, as far as I can find, but if someone here
knows of this evidence, please tell us. And the "some form
of meaning" is not given any specification we can actually
use to assess this in plenty of representative empirical
cases, as far as I can find, but, again, if anybody here
knows of such specification, please tell us.
The GloVe procedure starts by first building a token pair
co-occurrence table. For each token in our token set, the
GloVe procedure builds an estimate of the pointwise mutual
information (PMI) between a token <x> and the "context"
containing another token <y>. This results in a 100,245 by
100,245 table of estimated PMI values for all pairs of
tokens. The "context" here is the set of n text tokens
sequentially either side of token <y> each time token <y>
occurs in our GCT. Calling this set of 2n surrounding tokens
the "context" of token <y> is yet another Hopeful Term used
in the fairy story which says GloVe builds into the embedding
vectors "the context sensitive semantics of words". How this
actually happens is not, of course, explained; it's just
asserted; the unevidenced "potential" is turned into a
"fact": the magic of the fairy story at work. PMI is a
Shannon information theoretic construct, a useful one. But,
Shannon information is not about the meaning or semantics of
the words we use in our languaging. So, the fairy story
tellers here need to explain how these token pair PMI values
capture any actual semantics. I don't see any word meaning
or semantic relationships anywhere in this statistical data
about our text tokens as they occur in our GCT.
The pointwise mutual information (PMI) between a token <x>
and a sequence of (2n+1) tokens containing token <y> in the
middle, denoted <n-y-n>, is given by:
PMI(<x>,<n-y-n>) = log2( P(<x>,<n-y-n>) / (P(<x>) . P(<n-y-n>)) )
where P(<x>,<n-y-n>) is the probability of <x> occurring in
<n-y-n>, P(<x>) is the probability of <x> occurring in our
GCT, and P(<n-y-n>) is the probability of <n-y-n> occurring
in our GCT. The GloVe procedure estimates these probabilities
by counting actual occurrences of the token <x>, sequences of
tokens <n-y-n>, and occurrences of <x> in <n-y-n>, in our
GCT, a computationally expensive task, but one we only need
to do once.
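Estimating PMI values by counting can be sketched as below. This is a simplification under stated assumptions, not the GloVe code: each window <n-y-n> is identified with its centre token <y>, the probabilities are estimated from the windowed pair counts alone, and the function name and toy token sequence are made up for illustration.

```python
import math
from collections import Counter

def windowed_pmi(tokens, n):
    """Estimate PMI for every pair (x, y) where x occurs within n tokens of y."""
    pair_counts = Counter()
    for i, y in enumerate(tokens):
        start, stop = max(0, i - n), min(len(tokens), i + n + 1)
        for j in range(start, stop):
            if j != i:
                pair_counts[(tokens[j], y)] += 1
    total = sum(pair_counts.values())
    x_counts, y_counts = Counter(), Counter()
    for (x, y), count in pair_counts.items():
        x_counts[x] += count
        y_counts[y] += count
    # log2 of P(x, y) / (P(x) * P(y)), with every probability a count over total
    return {
        (x, y): math.log2(count * total / (x_counts[x] * y_counts[y]))
        for (x, y), count in pair_counts.items()
    }

pmi = windowed_pmi(["a", "b", "a", "b"], n=1)
print(pmi)
```

Every number that comes out is a function of occurrence counts and nothing else.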
But how big is n here, and how big does it need to be to
capture any real "semantic context", if that's what it's
supposed to do? And, is one size of n good enough for all
the text in our GCT? As far as I know it is always a
constant, and, from what I have seen mentioned, but not well
explained, n is set to something like 5 tokens. (If anybody
here knows more on this, please tell us.) This might be
enough for English text, given its typical grammatical
constructions, but for languages, such as German and Basque,
which have stack-like constructions where the verb goes at
the end of what can be a long sentence, 5 tokens before and
five after our token <y> isn't going to collect anything
about these longer range relationships, semantic or any other
kinds of relationship, not every time, at least. But, even
in English text, meaningful phrases like epistemic qualifiers
can appear almost anywhere in a sentence and still do the
same semantic qualifying, though where they are placed may
modulate the force of this qualification. Is an n=5 always
going to be big enough to cover this kind of semantic
context? I would say not. So, real semantic considerations
don't seem to come into the setting of n here. I've not
found any demonstration of how this magical "meaning context"
is made to have anything to do with this estimation of PMI
values. I strongly suspect that n is chosen on the basis of
what makes the implementation do what we think it should do,
and not on any aspects of real languaging, meaning, semantic,
or anything else. It's LLM Dark Art stuff.
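The reach of a fixed window is easy to show. In the sketch below whole words stand in for tokens, which is an assumption that flatters the window, since real text tokens are often shorter than words; the sentence and the function name are made up for illustration.

```python
def context_window(tokens, index, n):
    """Return the tokens up to n places either side of tokens[index]."""
    start = max(0, index - n)
    return tokens[start:index] + tokens[index + 1:index + n + 1]

# A German sentence whose auxiliary verb and main verb sit far apart.
sentence = ("Ich habe gestern mit meinem alten Freund "
            "aus Bilbao lange telefoniert").split()

window = context_window(sentence, sentence.index("habe"), n=5)
print(window)
print("telefoniert" in window)
```

With n=5 the window around "habe" stops at "Freund", so the verb that completes it never enters the count.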
Next, this large co-occurrence table of PMI values is used to
program a two layer Connectionist system [that's train a
Neural Network in Hopeful Terminology] using standard Back
Propagation to minimise the square of the difference between
the dot product of two token vectors and the [Log base 2 of
the] PMI value for the same two tokens in the big table.
This results, after enough "training", in a Connectionist
system which maps a text token, as input, into a large
numerical vector as output. In GPT-3 each token vector is
made 12,288 elements long, but I think they are bigger in
more recent versions of GPT. (Does anybody here know?). And
why 12,288, I hear you ask. I don't know, but, again, you
can see there are strong computational considerations at play
here. And, as far as I can see, no meaning or semantic
considerations. But if anybody here can show us how there
is, please do so.
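The minimisation described here can be sketched with plain gradient descent on a toy table. This is an illustration under stated assumptions: the target values are made up, the vectors are adjusted directly rather than through a two layer network, and the bias terms, weighting function, and separate "context" vectors of the published GloVe procedure are all left out.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def squared_error(vectors, targets):
    """Sum over token pairs of (dot product - target value) squared."""
    return sum((dot(vectors[i], vectors[j]) - t) ** 2
               for (i, j), t in targets.items())

def fit_vectors(targets, num_tokens, dim=3, steps=2000, rate=0.05, seed=0):
    """Adjust one vector per token so that dot products approach the targets."""
    rng = random.Random(seed)
    vectors = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
               for _ in range(num_tokens)]
    for _ in range(steps):
        for (i, j), target in targets.items():
            error = dot(vectors[i], vectors[j]) - target
            for k in range(dim):
                vi, vj = vectors[i][k], vectors[j][k]
                vectors[i][k] -= rate * 2 * error * vj   # gradient w.r.t. vectors[i][k]
                vectors[j][k] -= rate * 2 * error * vi   # gradient w.r.t. vectors[j][k]
    return vectors

# Made-up target values for three tokens, indexed 0, 1, 2.
targets = {(0, 1): 1.0, (0, 2): -0.5, (1, 2): 0.2}
before = squared_error(fit_vectors(targets, 3, steps=0), targets)
after = squared_error(fit_vectors(targets, 3), targets)
print(before, after)
```

The only thing the fitting "knows" about a token is the column of numbers it is asked to reproduce.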
What this minimisation does, according to the fairy story, is
"capture semantic relationships" between the text tokens
[which are always called "words" not text tokens to help us
keep our belief in the fairy story]. In more detail, the
fairy story is that the co-occurrence table "contains a
quantitative measure of the semantic affinity between words
in terms of the frequency with which they appear together in
a given context", 2n+1 tokens "context"! Except, nothing is
presented to show that PMI values do work as well defined,
reliable, and robust measures of the "semantic affinity
between words." This is just asserted, and the magic of the
fairy story then makes it so. Nor, of course, is there ever
any attempt to define what "semantic affinity of words" is in
quantitative terms. This is just more of the typical hopeful
hand waving we see in the fairy story versions of how token
vectors are built, despite the fact we're not dealing with
words here, we're dealing with text tokens, most of which
don't even look like words to us readers.
Given how this text token vector making is really done, what
we might be able to say is that there is a correspondence
between the way two different vectors relate and the
estimated PMI value of the two tokens as found in our GCT.
But mere correspondence is not enough to say anything is
being represented. Representation requires satisfying the
Representation Relationship, and this requires explicitly
showing that, in all cases without exception, each token's
"meaning" is mapped by the same mapping function into its
embedding vector, and that every embedding vector's "meaning"
is mapped back correctly into the meaning of its
corresponding token, and that everything done to these
vectors by the LLM preserves these mappings at all times.
All this is usually hard to do in any good designing and
building of a representation system. Representations are not
made just by calling something a representation, as the fairy
story would have us believe. It takes plenty of difficult
designing, specifying, and careful, well verified, validated,
and tested, implementation, to build a correctly working
representation system that can support some well specified
sound computational reasoning process.
We need to look some more here. There are details of how the
GloVe procedure works which make it even harder to understand
how it builds any real vector representation of word
meanings.
The objective function of the optimisation done in building
the Connectionist system we then use to make our token
vectors with, contains another term, a so called "weighting
function" which is multiplied with the square of the
difference between the vector dot product and corresponding
PMI values for two tokens. This weighting function needs to
be continuous [but not necessarily smooth] so that pairs of
tokens with PMI values of zero, or very small, are weighted
zero. This saves lots of computations with zero or very
small values. Also, this weighting function is designed to
limit the maximum size of PMI values used in the
minimisation. This is needed to prevent very common <x> in
<n-y-n> pairs dominating the proceedings. In other words,
it's a hack to stop undesirable behaviour in our vector
building system. There is also mention in some places that
implementations of this function contain hidden extras, like
rules to prevent certain token pairs being treated, or being
given certain PMI values, no matter what they have in the
table. In other words, more hacks. As an engineer I call
this approach to designing and building systems Christmas
Pudding making: as long as you add plenty of Brandy and serve
it hot with plenty of custard, everybody will like it, and
nobody will ask what's in it. It's certainly no way to build
a transparently working, reliable and robust, representation
system. Yet, this issue is swept aside by fairy story tellers
with claims like "this does not tend to be a problem", see
video 1 in [8]. In my view, this is a "hack it 'til it works"
attitude on display, where "works" is what we think it should
do.
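For reference, the weighting function published in [7] can be written in a few lines. Note that in [7] its argument is the raw co-occurrence count of a token pair, and the cut-off of 100 and exponent of 0.75 are the values reported there; the function name is an assumption, and any extra rules in particular implementations are not shown.

```python
def glove_weight(count, x_max=100.0, alpha=0.75):
    """Weight for a pair: zero at zero, rising as a power law, capped at one."""
    if count <= 0:
        return 0.0
    if count < x_max:
        return (count / x_max) ** alpha
    return 1.0

for count in (0, 1, 10, 100, 10_000):
    print(count, round(glove_weight(count), 4))
```

Both the cap and the exponent are tuning choices made to get the optimisation to behave.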
More. As we have seen, the GloVe procedure uses the dot
product of two token vectors, which gives a scalar value.
But it does not render a surface with one global minimum;
it's not convex, is the posh way of saying this. This means
the minimisation can only find local minima in the building
of the Connectionist system we use to make our token vectors
with, and this means tokens do not necessarily have unique
vectors: two or more tokens can have similar vectors: similar
in direction and magnitude. But these tokens will not
necessarily have the same or even similar "meanings"! This
is not a way to build a well working representation system.
And it means the favourite fairy story example of so called
"semantic vector algebra", that
vector(<king>) - vector(<man>) + vector(<woman>)
results in a vector the "same as" the vector(<queen>), can
appear to work for these tokens [always called words, of
course], but may also work for other tokens too, but which
have no sensible semantic relationships. This is no good for
any usable and understandable representation building.
Nobody goes looking to see if this happens because in the
fairy story explanation just showing that one carefully
chosen example appears to work is all, it seems, that is
needed to prove vector representations of word meanings
works. In any real representation system building we would
have the verification, validation, and testing work done and
presented which shows that the designed representations
happen as designed, only as designed, and can only happen as
designed, and always happen well enough in all the conditions
and situations they are designed to work in, particularly
when used in any computational reasoning processes these
representations are used by. Doing all this often takes more
work than the designing and implementing does.
To show that the vector(<king>) - vector(<man>) +
vector(<woman>) example really is an example of the vector
representation of word meaning the fairy story people say it
is, we need to see that a generous set of representative
examples of vector algebra combinations of text token vectors
all, without fail, result in sensible semantic outcomes. And
not just combinations of three vectors, combinations of any
number of vectors. And, if there are any such combinations
that are not supposed to result in semantically sensible
outcomes, we need to see that they don't. And we'd need to
see what extra machinery is added to our vector
representation system to stop these particular cases of
vector combinations from ever being treated as semantic
representations in some reasoning process. We need to see
the circumscribing machinery for this. Needless to say, I've
never seen anything like this in any reporting of any LLM
building.
There's more. How, exactly is the resulting vector of the
vector(<king>) - vector(<man>) + vector(<woman>)
example shown to be the same, or sufficiently the same, as
the vector for the token <queen>? I've not found any clear
explanation of this, just statements like "you can see they
are the same", or "we check to see if they are the same", or
we see nice neat drawings showing how all the vectors join up
as they are supposed to, but no numerical data for the
vectors drawn is offered. Is the sameness here really as
exact as these drawings make it? No, of course not. These
drawings are cartoons, not demonstrations of sameness. (And,
they are not even proper drawings of vectors in a vector
space, they are drawings of lines with arrows on one end in a
coordinate space.) Remember, these vectors are of the order
of 12,000 elements long, so giving us all the numbers to look
at would need plenty of space, and not be easy to work with.
But, we could then check for ourselves. Better, of course,
would be that the fairy story tellers told us exactly how
this sameness or sufficient similarity is reliably assessed.
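One procedure often used in published analogy tests, and
assumed here, is to rank every other token in the token set by
vector cosine against the combined vector and take the top one,
leaving out the three input tokens. The following sketch does
this with made-up three-element vectors; the names and numbers
are purely illustrative and come from no real system.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-element vectors, for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.8, 0.1, 0.8],
    "apple": [0.2, 0.3, 0.2],
}


def analogy(a, b, c, vectors):
    """Rank the remaining tokens by cosine with vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    scores = {
        token: cosine(target, vec)
        for token, vec in vectors.items()
        if token not in (a, b, c)  # the input tokens are excluded by rule
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = analogy("king", "man", "woman", vectors)
for token, score in ranking:
    print(f"{token:6s} {score:.3f}")
```

With these toy numbers <queen> comes out top with a cosine just
below one: it is the nearest remaining token, not the same
vector. The input tokens are left out because, without that
rule, one of them can turn out to be the nearest.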
There are different ways to assess the similarity of two
vectors. Commonly used ones in computational linguistics are
Cartesian distance, vector dot product, and vector cosine.
Cartesian Distance is not a vector space quantity, it's a
[Cartesian] coordinate space property, in which we have
points, lines, and [hyper]planes, so it does not give us a
vector comparison; it gives us how far apart two points are
in the coordinate space. If you take the end points of
vectors as points in a coordinate space overlaid on our
vector space so that the origins and axes of both spaces
coincide, we can apply this method, but you are then not
dealing with vectors, just points. Or, is this what we are
really dealing with, embedding points in a coordinate space,
and not vectors in a vector space? It often looks like it.
Look at the picture in the Wikipedia page for Word embedding,
for example (see
<https://en.wikipedia.org/wiki/Word_embedding>)
This shows two pairs, France --> Paris and Germany -->
Berlin, and describes "France", "Germany", "Paris", and
"Berlin", as being points in a coordinate space, not vectors
in a vector space! What look like vectors in this drawing --
lines with arrows on one end -- are drawn to indicate the
translations from the point labelled "France" to the point
labelled "Paris", and from the point labelled "Germany" to
the point labelled "Berlin", together with the claim that
these two translations are the same, or, at least, near
enough the same. But these translations in this coordinate
space are not vectors, they need a matrix to define them, yet
there's no mention of this, let alone any indication of how
the similarity of these matrices is to be assessed. This is
a common confusion on display in many web pages and
publications on so called word embedding and the "semantic
relationships" captured by so called word vector embeddings.
To me, this kind of basic confusion about what are points and
translations in a coordinate space and what are vectors in a
vector space, suggests the people who do this stuff don't
understand what they are doing, or don't care to present and
explain it properly.
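To make the differences between the three measures concrete,
here is a small sketch that computes all of them for one pair
of two-element vectors which point the same way but differ in
magnitude. The vectors and function names are illustrative
assumptions.

```python
import math


def euclidean_distance(u, v):
    """Distance between the end points of u and v, treated as points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


u = [1.0, 2.0]
v = [3.0, 6.0]   # same direction as u, three times the magnitude

print("distance:", euclidean_distance(u, v))
print("dot product:", dot(u, v))
print("cosine:", cosine(u, v))
```

The distance is about 4.47, the dot product is 15, and the
cosine is 1: three different answers to the question of how
similar these two vectors are.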
The vector dot product and vector cosine are vector space
quantities, so these do offer ways we might compare two token
vectors, but they have importantly different qualities which
we need to know about and understand. Take the dot product
first. This takes into account both the direction and the
magnitude of the vectors involved; the two properties of
vectors. But, it does not give us a result which strongly
corresponds to how well aligned the two vectors are. Two
vectors with the same direction, but quite different
magnitudes, will have a different dot product to two aligned
vectors which have similar magnitudes. This may be what we
want, of course, but then we must say what the direction of
token vectors represents, and what the magnitudes of token
vectors represent. If there really are two important aspects
of text tokens as they occur in the text of our GCT which we
need to represent, then using the two properties of vectors
to represent these would make sense, but if there is only one
aspect to represent, using vectors introduces distorting
artefacts into our representation system. The fairy story
doesn't tell us what two aspects of text tokens are well
represented using the magnitude and direction of vectors, and
which aspect is well represented by the magnitude and which
by the direction of token vectors. We need this to properly
understand what is really going on and why. Furthermore, if
the magnitude of token vectors does do representation work,
not just the direction, we need to know what all the
dimensions of the vector space represent, all 12,000 of them,
and how different amounts of these vector space dimensions do
needed representation work, and how what each dimension
represents is really orthogonal to what all the other
dimensions represent. Something else the fairy story is
silent about.
A typical simple illustration of this idea can be seen in the
diagram on the Wikipedia page for Distributional semantics,
the [hardly ever mentioned] origin of the fairy story. It's
one of many similar diagrams to be found on the web (see
<https://en.wikipedia.org/wiki/Distributional_semantics>).
Here the two axes of a two-dimensional "semantic space" are
labeled "political" (from 0.0 to 0.3 horizontally) and
"dangerous" (from 0.0 to 0.3 vertically), and we have example
vectors in this space for "animal", "shark", and
"dictatorship". If we are to take this cartoon seriously,
this illustration says that the meaning of the word "shark"
is composed of, or shares, about 0.1 of the "meaning" of
"political" and about 0.22 of the "meaning" of "dangerous",
and the "meaning" of "dictatorship" is composed of, or
shares, about 0.27 of the "meaning" of "political" and about
0.13 of the "meaning" of "dangerous". Is this how the real
meanings we build from reading text to build words in our
heads work? If so, where is the plentiful reliable empirical
evidence for this, across a good selection of languages,
cultures, and domains? Obviously this particular example is
made up for "illustrative purposes", but this does not excuse
the silliness here. Nor does it serve to establish any real
basis for this kind of idea about how text, words, and
meanings work in real human languaging. Yet, the fairy story
is based upon this idea; it all starts with the so called
Distributional hypothesis: "words that are used and occur in
the same contexts tend to purport similar meanings." Let's
see this story told with the 12,000 or more dimensions of the
vector spaces used in today's LLMs. Let's see what are the
"words" put on the 12,000 or more supposedly orthogonal
dimensions of these vector spaces. Hand waving with simple
2D examples may be good enough for fairy stories, but it's not
good enough, I would say, for a good explanation of what is
really going on in the vector spaces built and used in LLMs
today.
Another issue with the vector dot product as a vector
similarity test is that pairs of vectors which are not well
aligned can have the same, or similar, dot product values as
two vectors which are well aligned, so the dot product
doesn't strongly distinguish between vectors which are well
aligned and vectors which are not aligned. So, if vector
alignment is important in the story of how text token
"meaning" is captured in vectors, we need a good explanation
of how this works well enough, given that the way the vectors
are built by the GloVe procedure using dot product
comparisons cannot make vectors of tokens with similar
"meaning" always be closely aligned. But, according to the
always quoted example of how this "semantic vector
arithmetic" works, about "king", "man", "woman", and "queen",
vector alignment is important. This needs proper
explanation, not the usual "hey look, it works" treatment.
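A small numerical sketch of this point, with made-up
two-element vectors: one pair is perfectly aligned, the other
is far from aligned, yet both pairs have the same dot product.
The names and values are illustrative assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


a = [1.0, 0.0]
aligned = [2.0, 0.0]       # same direction as a
not_aligned = [2.0, 5.0]   # roughly 68 degrees away from a

# The dot product is the same for both pairs.
print(dot(a, aligned), dot(a, not_aligned))

# The cosine tells the two pairs apart.
print(cosine(a, aligned), cosine(a, not_aligned))
```

Both dot products are 2, while the cosines are 1 and about
0.37. The larger magnitude of the second vector makes up for
its poor alignment.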
What about vector cosine? This measures the relative angle
between two vectors. So, if it is vector direction which is
important, and not magnitude, then this looks like what we
need to assess token vector similarity. Although to
calculate vector cosine we need to calculate the magnitudes
of the vectors involved, vector cosine values are independent
of the vector magnitudes. Essentially, the vector cosine
treats all vectors as unit vectors: vectors all of magnitude
one. If this is all we need to compare our token vectors,
why don't we build token-to-vector making systems that
generate unit vectors? This would make these systems easier
to verify and validate, and understand. The reason this is
not done, I suspect, is that the "semantic vector algebra"
would not work, or not work as well. This is speculation on
my part, but why are vectors, with their two defining
quantities, direction and magnitude, built when we only use
one of these quantities to build the so called vector
representations of meaning? Building unused properties into
a representation system is not good representation design:
you need to test that despite not being used such properties
do not result in unwanted artefacts in the representation
processing. More, if it is only vector direction that is
important for the claimed semantic representation, why not
use points in a coordinate system? Using unit vectors adds
nothing, except, apparently, confusion in the minds of the
representation builders. But, of course, if all you need to
provide is a hand waving fairy story about what goes on, none
of these kinds of important representation system design and
implementation issues need to be explained, let alone
admitted to. You can't see them in the fairy story.
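That the vector cosine ignores magnitude can be checked
directly. In this sketch, with made-up vectors and illustrative
function names, scaling one vector by ten leaves the cosine
unchanged, and the cosine equals the plain dot product of the
two vectors once each has been rescaled to magnitude one.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalise(v):
    """Rescale v to magnitude one, keeping its direction."""
    magnitude = math.sqrt(dot(v, v))
    return [x / magnitude for x in v]


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


u = [3.0, 4.0]
v = [1.0, 7.0]
scaled_u = [10.0 * x for x in u]   # same direction, ten times the magnitude

print(cosine(u, v))
print(cosine(scaled_u, v))
print(dot(normalise(u), normalise(v)))
```

All three printed values agree, up to rounding in the last
decimal places.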
A question that seems obvious to me is why the GloVe
procedure doesn't use vector cosine, rather than the vector
dot product, as it does. The vector cosine does render a
convex minimisation surface so should, on the face of it,
work better to minimise the objective function. Jurafsky and
Martin, in their book on Speech and Language Processing hint
at people doing this, see [9], page 110, and briefly mention
the difficulties of using vector dot product, but I suspect
that vector cosine is not used because it is computationally
more expensive than vector dot product. As we have seen,
building the token-to-vector Connectionist system we need
involves doing lots of vector dot products, and this seems,
to the LLM builders, to work well enough, see video one in
[8], so this is what is mostly done. Yet more, hack it 'til
it works, I'd call this. Or, is there some sound argument
about how word meanings are captured by embedding vectors that
I have missed here? I'd be happy to be enlightened about
this.
In note [8] I list three YouTube videos which offer fairy
story explanations of some of what I talk about in this part,
and illustrate the kinds of confusions I have identified.
4 It's time to stop
What I have explained here is only the beginnings of what
happens inside systems like ChatGPT. There's plenty more to
understand, but I'm not going to show this here. However, I
do want to mention a few things to try to clarify particular
aspects of the way systems like ChatGPT work.
The input to the LLM inside ChatGPT is not a simple list of
tokens made from a prompt we enter via the ChatGPT interface.
The input is a large matrix which is as wide as the token
vectors are long, which for GPT-3 is 12,288, and as deep as
the ChatGPT "Context Window", that's the total number of text
tokens it works with at any time, which for GPT-3 is 2,048,
but much bigger for later versions of GPT used in ChatGPT
today.
This matrix is populated with the token vectors made for the
tokens from our prompt, after it has been tokenised, and,
after these token vectors have had their respective position
vectors added to them. Position vectors encode the position
each token comes in in the linear sequence of tokens made
from our prompt. Each position vector is the same length as
the token vectors, 12,288 for GPT-3, and is filled with
values from a sine function, whose frequency is incrementally
increased as we go along the token sequence: later position
vectors have values from high frequency sine waves. This is
how token sequence ordering is encoded and added into the
input to the LLM. But, I would ask, what does adding a 12,288
long vector of sine wave values do to the supposed "semantic
representation" our text token vectors are supposed to do?
Nothing? I'd like to see this explained. Adding these
position vectors to our token vectors changes all the values.
How can this not change what they represent, if they
represent anything?
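As a rough sketch of what this addition does to the numbers,
here is the fixed sine and cosine scheme from the original
Transformer design, which is assumed here for illustration. In
that scheme the frequency of the waves varies along the
elements of the position vector, and the position sets how far
along each wave the value is read. Some models are reported to
learn their position vectors instead, so this should not be
read as a description of any particular GPT version. The toy
width of 8 and the token vector values are made up.

```python
import math


def position_vector(pos, dim):
    """Sinusoidal position vector (assumed scheme, for illustration)."""
    vec = []
    for k in range(dim):
        # Elements come in sine/cosine pairs that share one frequency.
        freq = 1.0 / (10000 ** ((2 * (k // 2)) / dim))
        angle = pos * freq
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec


dim = 8  # toy width; the text above gives 12,288 for GPT-3
token_vector = [0.5, -0.2, 0.1, 0.9, -0.7, 0.3, 0.0, 0.4]  # made-up values

for pos in (0, 1, 5):
    summed = [t + p for t, p in zip(token_vector, position_vector(pos, dim))]
    print(pos, [round(x, 3) for x in summed])
```

The same token vector comes out as a different list of numbers
at each position, which is exactly the change the questions
above are asking about.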
To the tokens of our input prompt ChatGPT adds what is called
a system prompt. This is quite a lot of hidden stuff. In the
case of Claude, this system prompt is reported to be 16,739
tokens, see [10]. The
system prompt basically configures the behaviour of ChatGPT
and the way the LLM is made to process our prompt token
vectors. We never see any of this system prompt stuff, and
what it is composed of, or what it's there for. It's more LLM
Dark Art.
If the input matrix is not completely filled
with tokens from our prompt and the system prompt, the rest
of this matrix is filled with null tokens, which have a
special vector. Then, after some prompt and response cycle,
the output of ChatGPT is added into this input matrix, after
our first prompt and the system prompt token vectors. And
this keeps going until the input matrix, the so called
ChatGPT Context Window, is filled up. From then on, tokens
are removed on a first-in first-out basis, to make room for
the latest generated output and any more prompts from us.
This is how ChatGPT appears to "remember" our previous
prompts and its previous responses, and then starts to forget
these. And this is why making this so called "Context
Window" bigger seems to make ChatGPT generate better text in
response to our prompts.
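The first-in first-out behaviour described above can be
sketched in a few lines. The window size of 8 and the token
labels are made up for illustration, and real systems may well
manage their Context Window differently.

```python
from collections import deque

CONTEXT_WINDOW = 8  # toy size, for illustration only

# A deque with maxlen silently drops the oldest items when full.
window = deque(maxlen=CONTEXT_WINDOW)


def add_tokens(tokens):
    """Append tokens to the window; the oldest fall out once it is full."""
    window.extend(tokens)


add_tokens(["sys1", "sys2"])            # system prompt tokens
add_tokens(["p1", "p2", "p3"])          # our prompt
add_tokens(["r1", "r2", "r3", "r4"])    # the generated response

print(list(window))
```

Nine tokens have been added to a window of eight, so the
oldest one, "sys1", has been forgotten.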
The output of the LLM inside ChatGPT is not the text we get
back from ChatGPT in response to our prompt. This is
assembled from repeated cycles of presenting the input matrix
to the LLM and adding one more token to this matrix selected
from what the LLM actually outputs, which is an estimated
probability distribution over all the tokens in the text
token set. The LLM does not predict the next token, nor
"predict the next word". It builds a complete estimated
probability distribution over all the text tokens. It is an
extra piece of machinery inside ChatGPT which selects which
token to take out of this estimated probability distribution
to add to the input matrix. And it's another piece of
machinery which decides when to stop doing this, and prepare
the accumulated newly generated tokens to present back as
readable text.
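The separation described above, between the LLM, the token
selection machinery, and the stopping machinery, can be
sketched as follows. The four-token set, the scoring rule
inside toy_llm, and all the function names are made up for
illustration; nothing here is taken from ChatGPT itself.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "<end>"]  # toy token set


def toy_llm(tokens):
    """Stand-in for the LLM: one score per token in VOCAB (made-up rule)."""
    n = len(tokens)
    return [1.0, 2.0 - 0.1 * n, 0.5 + 0.1 * n, -2.0 + n]


def softmax(scores):
    """Turn scores into an estimated probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_token(probs, rng):
    """Separate selection machinery: sample one token from the distribution."""
    return rng.choices(VOCAB, weights=probs, k=1)[0]


def generate(prompt, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    new_tokens = []
    for _ in range(max_new_tokens):
        probs = softmax(toy_llm(tokens))   # what the LLM outputs
        token = select_token(probs, rng)   # what the sampler picks
        if token == "<end>":               # separate stopping rule
            break
        tokens.append(token)               # fed back in for the next cycle
        new_tokens.append(token)
    return new_tokens


print(generate(["the"]))
```

Changing the seed, or replacing the sampling in select_token
with some other rule, changes the generated tokens without any
change to what the stand-in LLM outputs.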
5 Some last remarks
LLMs do not, despite their name, model language. If they
model anything, they model statistical patterns found in the
text token form of our GCT. Yes, at least some of these
patterns, perhaps many of them, will have some kind of
correspondence to patterns of words we make in our
languaging, which we speak and write down. But, the LLM has
no knowledge or understanding of these correspondences, only
we do. LLMs are not built, so called "trained", with any
data about patterns found in our languaging. The data used
is only tokenised forms of text, and text, as I have
explained, is what is left after we do some writing. Text is
not the writing. Text is not languaging. Text is the marks
left when we write something we want to say. It's a
precipitate of writing. The words we chose to use to say
something with, in the way we decide to say it, are in our
heads, nowhere else. The text left over from writing the
words we decide upon are signs for how the words we chose to
say something with sound. Text is not composed of signs for
the meaning of what we want to say. To get the meaning
again, you need a listener or a reader who understands the
language(s) we speak and write in. LLMs don't do any writing
or any reading; they don't do any language understanding.
LLMs do text token processing, a lot of it, but that is all.
Systems like ChatGPT with an LLM inside, do text-to-text
transformation. It requires us to read and understand
anything from the text generated by such text token
processing. It's clever how LLMs are used to generate text
which is readable and understandable by us, but not really a
surprise, given they are built with gigantic amounts of text
left from human writing. But, just because we do get
readable and understandable text from systems like ChatGPT
does not mean these systems understand language, or know
about, reason about, or understand anything that the text
seems to be about to us readers. ChatGPT, and other
automatic text generating systems like it, don't know,
reason about, or understand anything. They are just built and
described and talked about in ways to make it look like they
do. We should take care not to be fooled by this kind of
fairy story.
6 To end
I have tried to get clear, and be clear, about (some of) what
really goes on inside Generative AI systems like ChatGPT, but
I am not welded to what I have put here. I would be more
than happy to receive any and all corrections and
clarifications to anything above, or even just indications of
where corrections and clarifications are needed. I am,
however, unlikely to respond to further attempts to push the
Generative AI fairy story. For me, this has no clothes on,
and I think we need to say so.
If you've read all the way to here, you have my appreciation
and a big thank you. Telling real stories that offer good
explanations, it turns out, takes longer than telling fairy
stories full of fantasies.
-- Tim
Notes
[1] David Zeitlyn, 2022. An Anthropological Toolkit, Berghahn
Books, Chapter 42. Ostension, pp 90, opening sentence.
[2] David Deutsch: Explanations. Why David Deutsch believes
good explanations are the antidote to bad philosophy,
video interview, Aeon, 25 June 2025, opening remarks.
<https://aeon.co/videos/why-david-deutsch-believes-good-explanations-are-
the-antidote-to-bad-philosophy>
[3] Robert Bringhurst, 2004: The Solid Form of Language, An
Essay on Writing and Meaning, Gaspereau Press, pp 9.
[4] There are plenty of places to find the BPE presented in
more detail, here is a simple and, I think, sufficient,
explanation.
<https://en.wikipedia.org/wiki/Byte-pair_encoding>.
[5] Hopeful Terminology is not special to Connectionist AI,
and today's Generative AI business. All kinds of AI
research have plenty of their own versions of Hopeful
Terms. It's been something that few have commented upon
and criticised, but the first person I know to do this was
Drew McDermott in a short paper called "Artificial
intelligence meets natural stupidity", in ACM SIGART
Bulletin, Issue 57, pp 4-9,
<https://doi.org/10.1145/1045339.1045340>. A direct
consequence of this bad practice is that the builders of
AI systems then use these hopeful terms to present what
can only be poor explanations of what their systems do and
how, and also fail to recognise that their understanding
of their own system is badly flawed.
[6] Ursula K Le Guin, 1998: Steering the Craft, A 21st-century
guide to sailing the sea of story, first Marina Books
edition 2015.
[7] GloVe: Global Vectors for Word Representation Jeffrey
Pennington, Richard Socher, Christopher D Manning
<https://nlp.stanford.edu/projects/glove/>. This page
presents more examples of the so called "vector difference
between the two word vectors" whereas what is actually
drawn are points labelled as words in a 2D coordinate
space with lines indicating translations from one word
point to another word point. In vector space there is no
difference operation, just vector addition (and
subtraction) and scalar multiplication. What is presented
here is confused, and confusing, and no good explanation
of what is really going on. Nor is anything said about
how the 50,000 or more dimensions of the illustrated "word
vectors" are projected into the 2D space of this illustration.
The authors also seem not to understand that translations
in a coordinate space need square matrices, not vectors to
specify them.
[8] The three videos. There are millions of videos on the web
about "word embedding", GloVe, and LLMs -- well, okay, not
millions, but it feels like there are millions -- to save
you looking at all these, here are just three which are
useful in the context of the story I tell here.
Lecture 3: GloVe: Global Vectors for Word Representation
Stanford University School of Engineering, presented by
Richard Socher, see
<https://youtu.be/ASn7ExxLZws?si=e8ag__SacXOp1Tlr> (April
2017). In this lecture we get the claim, in response to a
question from someone in the class, that the fact the
vector dot product does not render a convex optimisation
surface doesn't seem to present any problems, so it's just
ignored.
Vectoring Words (Word Embeddings) by Rob Miles on
Computerphile (5 years ago)
<https://youtu.be/gQddtTdmG_8?si=zz4b6zBb2RwsnOiL>. This
presents the fairy story idea that two words are similar
if they are often used in the same "context", but without
any clear explanation of what "context" means here, and
how we can identify the context of any particular word.
And Miles talks about words being points, not vectors, thus
illustrating this common confusion in all this so called
"word embedding" stuff.
Glitch Tokens by Rob Miles on Computerphile (again)
<https://youtu.be/WO2X3oZEJOA?si=154hJuyCMVLcWnKC> (2023).
This presents some entertaining examples of strange text
tokens and weird ChatGPT behaviour, and, along the way,
displays the usual calling tokens words, when what they
really are are text tokens, and other hand waving ways of
talking we see in the fairy story explanations of the
workings of LLMs.
[9] Daniel Jurafsky and James H Martin, 2024. Speech and
Language Processing, Third Edition (Draft of January 12,
2025)
<https://web.stanford.edu/~jurafsky/slp3/ed3book_Jan25.pdf>
[10] Claude's System Prompt: Chatbots Are More Than Just
Models, from dbreunig.com
<https://www.dbreunig.com/2025/05/07/claude-s-system-prompt-chatbots-are-
more-than-just-models.html>
[11] Here's a late extra note not cited above, but which
confirms, I think, that it's all only text tokens, never
anything like words and meanings, no matter what it might
all be made to look like.
The mystery of em‑dashes: part two with quantitative
evidence By Maria Sukhareva, at AI Realist, July 05, 2025
<https://msukhareva.substack.com/p/the-mystery-of-emdashes-part-two>