Humanist Discussion Group

Humanist Archives: July 17, 2025, 7:56 a.m. Humanist 39.82 - repetition vs intelligence: on LLMs

				
              Humanist Discussion Group, Vol. 39, No. 82.
        Department of Digital Humanities, University of Cologne
                      Hosted by DH-Cologne
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org




        Date: 2025-07-16 09:58:29+00:00
        From: Tim Smithers <tim.smithers@cantab.net>
        Subject: Re: [Humanist] 39.49: repetition vs intelligence

First. Happy Birthday Willard!  May you have many more!

Second.  I'm taking the liberty to wind the DH List back to
Humanist Discussion Group, Vol.  39, No.  49, 12 June, 2025,
at 08:07.


Dear Gabriel,

I see you're still wedded to, or could that be welded to, your
claim that LLMs represent and process the meanings and
semantic relationships we [humans] use and make in our
word-based languaging.  You are not alone, of course.  This
same fairy story is told by many working on, or commenting on,
or trying to explain LLMs and their associated computational
techniques.  It's as if we're not allowed to work on, or
comment upon, LLMs unless we first swear unending belief in
this fantasy.

This fairy story is, in my view, mistaken and misleading.  It
is like an Emperor who has no clothes on.  Good scholarship
should dismiss such fantasies, not propagate them.  The truth
of the matter is not, I think, up for debate, it is to be
understood and carefully shown to others, so that mistakes and
misunderstanding may be identified and corrected, and our
understanding of what is really going on, why, and how,
strengthened.

So, I am not going to respond point by point to your last
post.  I'm going to do something different.  I'm going to show
in some detail what is implemented by an LLM. Not all of it,
just the initial front-end stages.  And, I'm going to ask some
questions of you and any other fairy story believers as I go.
I do this so that you, and anybody else here, may then point
out where my mistakes and misunderstandings are, and how to
correct these.  I am confident of my understanding, but this
does not mean it is therefore all well formed and correct.
There are plenty of other people here who know, understand,
and use systems with LLMs inside them, and probably some who
build these kinds of systems, or parts of them.  I would ask
these people to join in here, to say where my understanding is
not correct, or not sufficient, and to say where it is
correct, in their judgement.  And I would ask others here not
so technically aware of the insides of LLM implementations to
tell us where you need clarification and further explanations,
so that we might arrive at a good explanation of what really
goes on inside an LLM, and thus clear away the dominant fairy
story and its associated fantasies.

Here goes.  It's long.  [After this I'll take lessons from
anybody who offers them on how to write short Humanist posts.]


Preface

 In what follows I am guided by the following quotations.

   "Ostension, showing things, plays an important role in
    discussions of representation and meaning."
    -- David Zeitlyn [1]

   "An explanation is a statement of what is there in reality
    and how it works and why ..."

   "A good explanation is one that is hard to vary while
    still explaining what it purports to explain.  ..."
    -- David Deutsch [2]

   "Writing is the solid form of language, the precipitate."
    -- Robert Bringhurst [3]

 [Numbered notes are at the bottom.]


1 What are we really talking about?

 Good ostension is achieved, I assert, by first being clear
 about what it is we want to show things about.

 Human languaging is first and foremost sound based, it's
 spoken, or it's sign based and thus signed.  In the case of
 [most?]  spoken languages, words are formed by combining
 standard and shared sounds in specific ways: we put together
 the phonemes of the language(s) we speak to say words, and we
 put words together in linear sequences to form meaningful
 phrases and sentences, using the grammar(s) of the
 language(s) we speak in.  Right?

 Well, no, not really.  Spoken language is often not
 grammatical, not strictly, but nor is it ungrammatical.  In
 successful spoken languaging there is enough grammatical
 structure for meaning to be built from what is heard.  And,
 enough recognisable sounds for a listener to build words,
 then phrases and sentences, from what the speaker is saying,
 if they understand the language(s) being used well enough.
 And, there's usually plenty of hand, arm, and body motions,
 facial expressions, and other sound making, which are all a
 natural part of the speaking and listening in our languaging.
 Real human languaging is full theatre.  We should keep this
 in mind.

 Writing is another kind of human languaging, a different
 kind.  Writing is an approximation of the full theatre of
 spoken languaging.  Writing necessarily simplifies out much
 of this theatre, but it leaves marks of what is said and how
 it is said: Bringhurst's precipitate.  And these marks can be
 read so that what was said may be heard again [perhaps only
 silently in our heads], and from this hearing we may build
 understandings of what was written.

 This simplification in written languaging is mostly achieved
 using what is sometimes called the Alphabetic Principle: the
 phonemes of a spoken language are each associated with a
 particular symbol.  We call these special symbols the letters
 of the alphabets of the languages we write in.  These
 letters, graphemes is the posh name, form the alphabets we
 use to write with.  They are symbols for the standard shared
 units of sound we use to make spoken words with.  But letters
 are only one of the functional units we need to write with.
 Usable writing systems have more kinds of symbols.

 Alphabets are not the only symbols we need and use to write
 what we want to say in the way we want to say it, and to make
 our writing readable.  We use other [mostly] standard symbols
 to add punctuation to the sequences of words we write using
 letters.  "[P]unctuation tells the reader how to hear your
 writing" is how Ursula Le Guin nicely explains what
 punctuation is for [6].  And, in our writing, we also use
 numerals, symbols for numbers, sometimes from different
 numeral systems; Roman and Arabic, for example.  We use the
 symbols of mathematics, when we write about things
 mathematical.  And we use a variety of special symbols for a
 variety of other things we talk about in our writing, more
 and more, it seems; for currencies; trade marks; emojis; and
 more.  Our writing systems are thus systems made of the
 different symbol sets we need to write what we want to talk
 about, in the way we want to say it.

 In digital form we use a binary character system to denote
 all these different kinds of writing system symbols.  This
 character system needs to cover all the different symbols we
 use in our writing, and this can be a lot, especially if we
 write in different languages, as we often do.

 All this is here to make a distinction we will need in what
 follows: the distinction between the different symbol systems
 we combine in our writing systems, and the characters we use
 to digitally denote all these different kinds of symbols.

 Text is all the marks left when we [humans] write, the
 precipitate, it's not just combinations of letters we read as
 words.  Text is all the marks of all the symbols from the
 different symbol systems we need to write with.  The
 alphabetic letter constructions in text are not the only
 things readers need to build an understanding of what is being
 said.  All the other symbols are needed.  As is all type
 setting and typographical design needed to make any text
 readable.  That's why they are there, and need to be there.
 Text is not just made up of what look like words to us as
 readers.

 Writing in different languages may use the same characters,
 but not the same letters; not the same alphabet.  The English
 alphabet, for example, has no double character letters, but
 the Welsh alphabet does.  Written Welsh uses dd, ff, ng, ll,
 ph, and rh for specific sounds in spoken Welsh.  Digraphs is
 the posh name for these double character letters.  (Saying
 that these letters are made of two letters, as we often see,
 is confusing; they are made of two characters, and often use
 the same characters used to make single character letters in
 an alphabet.)  The alphabet for written Basque uses the
 double character letters dd, ll, rr, ts, tt, tx, tz, with the
 sounds of the tx and tz letters being hard for me to both
 hear and to make, for not being a skilled speaker of Basque.
 It's the same for the Welsh sounds.  And written Basque
 doesn't use c, ç, q, v, w, y, except for loanwords.

 It's important in what follows to remember that the letters
 of the alphabets we use to write with are symbols for sounds.
 They are not symbols for bits of meaning: the meanings, to us
 as readers and writers of words, are not built from the
 letters we use to write them with.  It is the sound of words
 that is built from the letters we write them with, and read
 them from.  In all our different writing systems, there are
 no symbols for meanings or semantic relationships from which
 we build the meanings of things we read as words.  Writing
 systems are not semantic representation building systems.
 Writing systems are made of multiple systems of symbols
 needed to present well enough in written form the sound of
 spoken languaging.

 Text is thus sequences, usually long sequences, of the
 characters used to make all the different kinds of symbols
 we, as writers, use to write what we want to say in the way
 we want to say it, and what we, as readers, use to hear, out
 loud or silently in our heads, what the writer wanted to say
 in the way they wanted us to hear it.

 Text is thus not words.  Nor is text just letters from the
 alphabets we combine to write words with.  Text is
 characters, lots of them, from the different symbol systems
 we need to write and read with.  In digital form all these
 different symbols, including embedded type setting and
 formatting instructions, are encoded using the same digital
 character system.  We do not use different binary character
 systems for the different symbol systems in our writing
 systems.


2 The gigantic collection of text

 To build an LLM we must first assemble a gigantic collection
 of text (GCT) from human writing in all the different
 languages we want our LLM to deal in, often including
 computer programming languages.  From all this writing we
 must remove all non-text material, images, pictures,
 photographs, diagrams, etc, and rip out all font selection
 and typographical formatting instructions; all the setting of
 the text on the page or screen; everything we put there to
 make the text readable and understandable by us readers.

 So, first question: why is it okay to rip out all this type
 setting and typographical design?  If you think good
 typographic design is not important for good reading and thus
 good understanding of what we read, read more Bringhurst, and
 disagree with him!

 Good typography is not just decoration.  It's needed for good
 accurate reading and understanding of what was written, and a
 lot of time and effort goes into making this work well.  If
 LLMs really deal in words and meanings, why is all this
 needed aid to reading the text first removed?  Why is it not
 needed, if what LLMs really do is represent the semantics of
 text?  This is a question for the LLM builders to answer,
 and, I would say, others working in computational
 linguistics.  And, they need to explain why they then need to
 add at least some of this text formatting back in when
 presenting the output of LLMs to users of systems like
 ChatGPT, and other automatic text generators.  The simple,
 and boring, type setting we currently get from things like
 ChatGPT is not generated by the LLM inside it.  It's all
 added on by some post-processing of the generated text
 tokens.  Or, are we all now to believe that good
 typographical design plays no part in real language
 understanding from writing?


3 On token making

 All the text in our GCT must first be turned into a [very
 long] sequence of text tokens.  As must be the text of any
 input prompt.  To do this so-called text tokenisation we must
 first build a set of text tokens to use.  Here I will show
 how this is done for OpenAI GPT systems, including ChatGPT.
 Other automatic text generating systems may use different
 token building procedures, but they share the same basic
 ideas for how this is done.

 First, all the GCT text is encoded using one standard, and
 pre-specified, single byte character set.  For this UTF-8 is
 used (see <https://en.wikipedia.org/wiki/UTF-8>) and, in the
 case of the GPT systems, a basic set of 245 single 8 bit byte
 UTF-8 characters is used to encode all the different kinds of
 writing system symbols found in our GCT. (Most of the text in
 our GCT will already be in UTF-8 since almost all webpages
 are transmitted as UTF-8 characters.)  These 245 UTF-8
 characters are then made our first GCT text tokens, so we
 start with all the text in our GCT encoded using only these
 basic tokens.
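
 A minimal Python sketch of what this first step looks like.
 It only shows text being turned into UTF-8 bytes; the sample
 string is made up, and nothing here reproduces the actual GPT
 pre-processing.

```python
# Turn a short piece of text into its UTF-8 bytes.
# The sample string is hypothetical; any text would do.
sample = "Tx and £5"

byte_values = list(sample.encode("utf-8"))

# Letters, the space, the currency sign and the numeral all end
# up as the same kind of thing: small integers in 0..255.
for ch in sample:
    print(repr(ch), list(ch.encode("utf-8")))

print(len(sample), "characters ->", len(byte_values), "bytes")
```

 Symbols outside the basic Latin range, like the pound sign
 here, take more than one byte, so the byte count can be larger
 than the character count.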

 All the distinctions between alphabetic letters, punctuation
 marks, numerals, maths, and all the other things we find in
 written text, are removed by this tokenisation.  All the text
 is turned into single character tokens using only the 245
 UTF-8 characters we define as our basic set of characters.
 So, how, I would ask the fairy story tellers, do these text
 tokens carry, or magically possess, any meaning of words,
 when we've lost all sight of things we call letters, let
 alone words, in this basic text token representation of our
 GCT?

 These basic text tokens are treated as "atomic" text tokens,
 and are used to build multi-character text tokens using a
 Byte-Pair Encoding (BPE) procedure [4], which works as
 follows.

 An automated procedure is applied to our GCT, now all in
 atomic text token form, to count how many times adjacent
 pairs of atomic text tokens occur in our GCT. The adjacent
 pairs of atomic text tokens which have the highest counts are
 then made into new two-character text tokens and added to our
 token set.  Say, for example, the combination of a <t> token
 followed immediately by a <h> token is one such often
 occurring pair of atomic tokens in our GCT, then a <th> token
 is made and added to our token set, and, all occurrences of a
 <t> followed by a <h> in our GCT are replaced by this new
 text token.
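
 One round of this count-and-replace step can be sketched in
 Python as follows.  This is an illustration under assumptions,
 not the GPT code: it works on characters instead of bytes for
 readability, the toy text stands in for the GCT, and the
 function names are my own.

```python
from collections import Counter

def count_pairs(tokens):
    """Count how often each adjacent pair of tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy stand-in for the GCT, split into single-character tokens.
text = "the thin thing then"
tokens = list(text)

pairs = count_pairs(tokens)
best_pair, best_count = pairs.most_common(1)[0]
tokens = merge_pair(tokens, best_pair)

print(best_pair, best_count)
print(tokens)
```

 Notice that nothing in the counting looks at what the
 characters are; only how often they sit next to each other.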

 This way we make lots of new two character text tokens, and
 add them to our text token set.  How many depends upon what
 we decide is the threshold of "frequently occurring" adjacent
 pairs of atomic text tokens: a decision we as the LLM
 builders need to make.  It's part of the Dark Art of LLM
 building.  I don't know what this threshold is for the GPT
 systems.  If someone here does, please tell us.

 Next, the same counting procedure is used again to count how
 often sequences of three adjacent text characters occur in
 the (now modified) GCT, where these three character sequences
 are all built by pairing one atomic token with one of the
 just made two character tokens.  So, again, it's a pair of
 tokens that are put together, but this time from one atomic
 single character token and one two character token.

 Say, for example, one of these new three character sequences
 that occur many times in our GCT is "thr", made from <th> +
 <r>, then <thr> is added as a new token to our token set,
 and, again, all occurrences of <th> followed by <r> in our
 GCT are replaced by our new <thr> token.  Another common three
 character sequence that will be built at this stage will
 likely be <the>, from <th> + <e>, and this too will be added
 to our token set, and used to replace all <th> and <e>
 sequences in our GCT. As before, what counts as a frequent
 enough three character sequence for it to be made a new text
 token must be decided and specified by us the LLM builders,
 and it doesn't have to be the same number we used in the
 building of the two character text tokens.  More Dark Art.

 This procedure is repeated, each time counting the occurrence
 of longer sequences of pairs of tokens made from the last
 made compound tokens and the set of atomic tokens.  Pairs of
 multi-character tokens are not used to make longer character
 sequence tokens, only atomic tokens and the latest compound
 tokens are used.  This repetition continues until the magic
 number of needed text tokens is built.
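
 The repetition can be sketched in Python as a loop that stops
 when the token set reaches a chosen size.  Two simplifications
 here are mine: the sketch merges the single most frequent pair
 per round instead of applying a frequency threshold, and it
 counts every adjacent pair of current tokens, so it does not
 enforce the atomic-plus-latest-compound restriction described
 above.  The toy text and the target size are made up.

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def build_token_set(tokens, target_size):
    """Repeat count-and-merge until the token set has target_size."""
    vocab = set(tokens)
    while len(vocab) < target_size:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats any more
        tokens = merge_pair(tokens, pair)
        vocab.add(pair[0] + pair[1])
    return vocab, tokens

text = "the thin thing then"
target = len(set(text)) + 3  # hypothetical "magic number"
vocab, tokens = build_token_set(list(text), target)

print(sorted(vocab, key=len))
print(len(text), "characters ->", len(tokens), "tokens")
```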

 In the case of GPT-3, this magic number is 100,245 text
 tokens.  Why are 100,000 compound text tokens made and added
 to the atomic text token set?  I don't know the answer to
 this.  It's hidden in the Dark Art of LLM building.  But, we
 can see that there is a trade-off here.  The more compound
 tokens we have, the more compact any processed text will be,
 which saves memory and computation, but, the more tokens we
 have, the more embedding vectors we will need to build and
 use, which uses more memory and computation.  Somehow we need
 to find a "Goldilocks" value: not too large, not too small,
 but just right, to borrow from another fairy story.  How, I
 wonder, does all the meaning and semantics of the text we're
 processing here get to help decide on this Goldilocks value?
 Please tell me.  And why is 100,000 compound text tokens
 "semantically" adequate here, and how is it "semantically"
 adequate, and not just a system engineering decision hidden
 in the Dark Art of LLM building, given that BPE was
 originally designed and used as a text compression procedure,
 and not as some kind of magic semantic representation
 building procedure?  I see plenty of binary character
 constructing going on here, but no word meaning and semantic
 relationships being identified and represented.

 There are some restrictions applied to this text token
 building procedure.  For example, the tokens <’s>, <’t>,
 <’re>, <’ve>, <’m>, <’ll> and <’d> are not combined with any
 atomic tokens, so the GPT token set cannot include tokens
 like <we’ll> or <they’d>.  Why not?  What's not good about
 having these as text tokens?  Another restriction is that
 tokens for numerals are only combined with other numeral
 tokens.  So, some distinction between symbols from the
 different symbol systems of our writing system, is added back
 in here, via the back door.  But, combinations like < a> -- a
 single space character followed by the character 'a' -- do
 get built: single space is not an alphabet symbol.  You can
 check on the final GPT token set here
 <https://emaggiori.com/chatgpt-all-tokens/>.  Use the search
 option to see if your favourite text tokens are there!  Such
 as the common "words" like <;a>, or <**> and <****>, or <()>,
 or <indow>, just to pick a tiny number from the 100,245
 tokens that all, so the fairy story tells us, have perfectly
 clear meanings and semantics all now safely captured in their
 respective BPE built text tokens, right?  No, of course not.

 There are, as you'll notice, many text tokens that look like
 words to us readers, which the tokenisation procedure has
 constructed, and which happen to be combinations of
 characters we know as symbols of letters used to spell words
 with.  But this is an artefact of the token making procedure,
 and not an intended outcome.  And, as we saw at the start,
 letters carry no semantics, not even bits of meaning we build
 from words when we read them.  So, even for these word-like
 text tokens, how is any "meaning" put into them?

 In the standard fairy story versions of this token building
 procedure, the compound text tokens are called "subwords",
 sometimes with the added hand-wave that they capture the
 sub-meaning of whole words.  How they do this is, of course,
 never explained.  The reader, as far as I can see, is left to
 just think "Oh yes, of course, these tokens pick up bits of
 meaning which can just be added together in a simple linear
 way to make bigger meanings", or should that be
 "super-meaning"?

 This "subword" claim is, in my view, plain rubbish.  Meaning,
 as we understand and use it in our reading and writing, does not
 work this way.  The term "subword" is an example of what I
 call Hopeful Terminology; a term full of hope.  The hope here
 is that if we call these tokens "subwords" -- the vast
 majority of which can't be read as having any meaning, or
 sub-meaning -- they will magically become what we call them,
 and, like whole words, thus somehow carry bits of meaning,
 whatever that can mean.  There is no more than hope and
 hand-waving here.  There is, I would say, no good enough
 [David Deutsch] explanation for this.  This is an empty
 explanation; the weakest kind of explanation we can have, and
 thus useless.  But!  If I am mistaken in calling all this
 rubbish, please, someone, show us how and why I am mistaken.
 I'll be happy to see this.

 (There are plenty more Hopeful Terms used in all this
 Generative AI business, such as "Language Model", "language
 processing", "context", and more, but see [5].)

 In summary, in all this token making no explicit account is
 taken of the alphabets of the languages used to write the text
 in our GCT, and nor is any account taken of the syntax and
 morphology of these languages.  No distinction is made
 between the letters used to write words with and all the
 other kinds of symbols needed in writing.  No attempt is made
 to build any kind of low level representation of what results
 from our reading of text.  So, what, exactly is the basis for
 the claim that these text tokens can be used to represent
 words and the meanings we, as real languagers, build from
 reading them?  What, exactly, is meant by "meaning" here,
 other than a low level encoding of the common character
 sequences found in a very large quantity of text from human
 writing and programming?  Which, of course, is not meaning.

 This BPE text token building works well as a way to turn just
 about everything we find in our GCT into binary encoded
 characters of one, two, three, and more, bytes each.  But it
 squashes out all of the distinctions between the different
 kinds of symbols in the writing systems we use to write and
 read with.  We just have combinations of one, two, three, and
 more, UTF-8 binary characters.


4 On putting vectors to bed

 Having built our text token set we next need to build a so
 called embedding vector for each token.  Here I will simplify
 out some technical details, but still try to show well enough
 how embedding vector building is done.  We build a two layer
 Connectionist system to take as input a text token, and to
 produce as output a big so called embedding vector.

 Older procedures used word2vec developed by people at Google,
 but more recent procedures use the GloVe (Global Vectors)
 algorithm developed at Stanford [7], so this is what I will
 describe.  From [7] we read that the "...  main intuition
 underlying the [GloVe] model is the simple observation that
 ratios of word-word co-occurrence probabilities have the
 potential for encoding some form of meaning."  Notice here,
 and remember, the "potential for" and the "some form of
 meaning".  The "potential" claimed here is not demonstrated;
 it has no substantial, multi-language, shared, and accepted,
 empirical basis, as far as I can find, but if someone here
 knows of this evidence, please tell us.  And the "some form
 of meaning" is not given any specification we can actually
 use to assess this in plenty of representative empirical
 cases, as far as I can find, but, again, if anybody here
 knows of such specification, please tell us.

 The GloVe procedure starts by first building a token pair
 co-occurrence table.  For each token in our token set, the
 GloVe procedure builds an estimate of the pointwise mutual
 information (PMI) between a token <x> and the "context"
 containing another token <y>.  This results in a 100,245 by
 100,245 table of estimated PMI values for all pairs of
 tokens.  The "context" here is the set of n text tokens
 sequentially either side of token <y> each time token <y>
 occurs in our GCT. Calling this set of 2n surrounding tokens
 the "context" of token <y> is yet another Hopeful Term used
 in the fairy story which says GloVe builds into the embedding
 vectors "the context sensitive semantics of words".  How this
 actually happens is not, of course, explained; it's just
 asserted; the unevidenced "potential" is turned into a
 "fact": the magic of the fairy story at work.  PMI is a
 Shannon information theoretic construct, a useful one.  But,
 Shannon information is not about the meaning or semantics of
 the words we use in our languaging.  So, the fairy story
 tellers here need to explain how these token pair PMI values
 capture any actual semantics.  I don't see any word meaning
 or semantic relationships anywhere in this statistical data
 about our text tokens as they occur in our GCT.

 The pointwise mutual information (PMI) between a token <x>
 and a sequence of (2n+1) tokens containing token <y> in the
 middle, denoted <n-y-n>, is given by:

     PMI(<x>,<n-y-n>) = Log2( P(<x>,<n-y-n>) / (P(<x>).P(<n-y-n>)) )

 where P(<x>,<n-y-n>) is the probability of <x> occurring in
 <n-y-n>, P(<x>) is the probability of <x> occurring in our
 GCT, and P(<n-y-n>) is the probability of <n-y-n> occurring
 in our GCT. The GloVe procedure estimates these probabilities
 by counting actual occurrences of the token <x>, sequences of
 tokens <n-y-n>, and occurrences of <x> in <n-y-n>, in our
 GCT, a computationally expensive task, but one we only need
 to do once.
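
 A simplified Python sketch of estimating such PMI values by
 counting.  The assumptions are mine: the "context" of <y> is
 taken as the n tokens either side of each occurrence of <y>,
 the joint probability is the pair count divided by all counted
 pairs, and the toy token list stands in for the GCT.  Other
 normalisations are possible, and this is not taken from any
 GloVe implementation.

```python
import math
from collections import Counter

def pmi_table(tokens, n):
    """Estimate PMI for token pairs co-occurring within n positions."""
    token_counts = Counter(tokens)
    pair_counts = Counter()
    for i, y in enumerate(tokens):
        lo = max(0, i - n)
        hi = min(len(tokens), i + n + 1)
        for j in range(lo, hi):
            if j != i:
                pair_counts[(tokens[j], y)] += 1

    total_tokens = sum(token_counts.values())
    total_pairs = sum(pair_counts.values())
    table = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / total_pairs
        p_x = token_counts[x] / total_tokens
        p_y = token_counts[y] / total_tokens
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table

toy_tokens = "a b a b c d c d".split()
table = pmi_table(toy_tokens, n=1)
for pair, value in sorted(table.items()):
    print(pair, round(value, 3))
```

 Pairs that never fall inside the same window, such as <a> and
 <d> here, get no entry at all.  Everything in the table comes
 from counts of position; no part of the calculation refers to
 what a token might mean.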

 But how big is n here, and how big does it need to be to
 capture any real "semantic context", if that's what it's
 supposed to do?  And, is one size of n good enough for all
 the text in our GCT? As far as I know it is always a
 constant, and, from what I have seen mentioned, but not well
 explained, n is set to something like 5 tokens.  (If anybody
 here knows more on this, please tell us.)  This might be
 enough for English text, given its typical grammatical
 constructions, but for languages, such as German and Basque,
 which have stack-like constructions where the verb goes at
 the end of what can be a long sentence, 5 tokens before and
 five after our token <y> isn't going to collect anything
 about these longer range relationships, semantic or any other
 kinds of relationship, not every time, at least.  But, even
 in English text, meaningful phrases like epistemic qualifiers
 can appear almost anywhere in a sentence and still do the
 same semantic qualifying, though where they are placed may
 modulate the force of this qualification.  Is an n=5 always
 going to be big enough to cover this kind of semantic
 context?  I would say not.  So, real semantic considerations
 don't seem to come into the setting of n here.  I've not
 found any demonstration of how this magical "meaning context"
 is made to have anything to do with this estimation of PMI
 values.  I strongly suspect that n is chosen on the basis of
 what makes the implementation do what we think it should do,
 and not on any aspects of real languaging, meaning, semantic,
 or anything else.  It's LLM Dark Art stuff.

 Next, this large co-occurrence table of PMI values is used to
 program a two layer Connectionist system [that's train a
 Neural Network in Hopeful Terminology] using standard Back
 Propagation to minimise the square of the difference between
 the dot product of two token vectors and the [Log base 2 of
 the] PMI value for the same two tokens in the big table.
 This results, after enough "training", in a Connectionist
 system which maps a text token, as input, into a large
 numerical vector as output.  In GPT-3 each token vector is
 made 12,288 elements long, but I think they are bigger in
 more recent versions of GPT. (Does anybody here know?).  And
 why 12,288, I hear you ask.  I don't know, but, again, you
 can see there are strong computational considerations at play
 here.  And, as far as I can see, no meaning or semantic
 considerations.  But if anybody here can show us how there
 is, please do so.
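
 To give a feel for the computational side, here is the plain
 arithmetic for a table of token vectors, using the token count
 and vector length given above.  The 4 bytes per number is my
 assumption (32-bit floats); real systems may store numbers
 differently.

```python
n_tokens = 100_245    # token set size given above
vector_len = 12_288   # vector length given above for GPT-3
bytes_per_value = 4   # assumption: 32-bit floats

n_values = n_tokens * vector_len
n_bytes = n_values * bytes_per_value

print(f"{n_values:,} stored numbers")
print(f"{n_bytes / 1e9:.2f} GB at {bytes_per_value} bytes each")
```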

 What this minimisation does, according to the fairy story, is
 "capture semantic relationships" between the text tokens
 [which are always called "words" not text tokens to help us
 keep our belief in the fairy story].  In more detail, the
 fairy story is that the co-occurrence table "contains a
 quantitative measure of the semantic affinity between words
 in terms of the frequency with which they appear together in
 a given context", 2n+1 tokens "context"!  Except, nothing is
 presented to show that PMI values do work as well defined,
 reliable, and robust measures of the "semantic affinity
 between words."  This is just asserted, and the magic of the
 fairy story then makes it so.  Nor, of course, is there ever
 any attempt to define what "semantic affinity of words" is in
 quantitative terms.  This is just more of the typical hopeful
 hand waving we see in the fairy story versions of how token
 vectors are built, despite the fact we're not dealing with
 words here, we're dealing with text tokens, most of which
 don't even look like words to us readers.

 Given how this text token vector making is really done, what
 we might be able to say is that there is a correspondence
 between the way two different vectors relate and the
 estimated PMI value of the two tokens as found in our GCT.
 But mere correspondence is not enough to say anything is
 being represented.  Representation requires satisfying the
 Representation Relationship, and this requires explicitly
 showing that, in all cases without exception, each token's
 "meaning" is mapped by the same mapping function into its
 embedding vector, and that every embedding vector's "meaning"
 is mapped back correctly into the meaning of its
 corresponding token, and that everything done to these
 vectors by the LLM preserves these mappings at all times.
 All this is usually hard to do in any good designing and
 building of a representation system.  Representations are not
 made just by calling something a representation, as the fairy
 story would have us believe.  It takes plenty of difficult
 designing, specifying, and careful, well verified, validated,
 and tested, implementation, to build a correctly working
 representation system that can support some well specified
 sound computational reasoning process.

 We need to look some more here.  There are details of how the
 GloVe procedure works which make it even harder to understand
 how it builds any real vector representation of word
 meanings.

 The objective function of the optimisation done in building
 the Connectionist system we then use to make our token
 vectors with, contains another term, a so called "weighting
 function" which is multiplied with the square of the
 difference between the vector dot product and corresponding
 PMI values for two tokens.  This weighting function needs to
 be continuous [but not necessarily smooth] so that pairs of
 tokens with PMI values of zero, or very small, are weighted
 zero.  This saves lots of computations with zero or very
 small values.  Also, this weighting function is designed to
 limit the maximum size of PMI values used in the
 minimisation.  This is needed to prevent very common <x> in
 <n-y-n> pairs dominating the proceedings.  In other words,
 it's a hack to stop undesirable behaviour in our vector
 building system.  There is also mention in some places that
 implementations of this function contain hidden extras, like
 rules to prevent certain token pairs being treated, or being
 given certain PMI values, no matter what they have in the
 table.  In other words, more hacks.  As an engineer I call
 this approach to designing and building systems Christmas
 Pudding making: as long as you add plenty of Brandy and serve
 it hot with plenty of custard, everybody will like it, and
 nobody will ask what's in it.  It's certainly no way to build
 a transparently working, reliable and robust, representation
 system.  Yet, this issue is swept aside by fairy story tellers
 with claims like "this does not tend to be a problem", see
 video 1 in [8].  In my view, this is a "hack it 'til it works"
 attitude on display, where "works" is what we think it should
 do.
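
 To make the shape of this weighting function concrete, here
 is a minimal Python sketch of the form published in the GloVe
 paper [7].  Some assumptions to note: the paper defines the
 function over raw co-occurrence counts rather than PMI
 values, with a cap of 100 and an exponent of 0.75; the
 function and parameter names below are illustrative only; and
 the bias terms of the published objective, and any "hidden
 extras" in real implementations, are left out.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight for one token pair with co-occurrence count x.

    Assumed form, from the GloVe paper: (x / x_max) ** alpha below
    the cap x_max, and exactly 1.0 at or above it.
    """
    if x <= 0:
        return 0.0  # pairs never seen together contribute nothing
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0  # very common pairs are capped, so they cannot dominate


def weighted_squared_error(dot_product, target, x):
    """One term of the objective: weight times squared difference.

    Bias terms are omitted here for brevity (an assumption of this sketch).
    """
    return glove_weight(x) * (dot_product - target) ** 2
```

 The function is continuous at the cap but has a kink there,
 which is the "continuous but not necessarily smooth" property
 mentioned above.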

 More.  As we have seen, the GloVe procedure uses the dot
 product of two token vectors, which gives a scalar value.
 But it does not render a surface with one global minimum;
 it's not convex, is the posh way of saying this.  This means
 the minimisation can only find local minima in the building
 of the Connectionist system we use to make our token vectors
 with, and this means tokens do not necessarily have unique
 vectors: two or more tokens can have similar vectors: similar
 in direction and magnitude.  But these tokens will not
 necessarily have the same or even similar "meanings"!  This
 is not a way to build a well working representation system.

 And it means the favourite fairy story example of so called
 "semantic vector algebra", that

     vector(<king>) - vector(<man>) + vector(<woman>)

 results in a vector the "same as" the vector(<queen>), can
 appear to work for these tokens [always called words, of
 course], but may also work for other tokens too, but which
 have no sensible semantic relationships.  This is no good for
 any usable and understandable representation building.

 Nobody goes looking to see if this happens because in the
 fairy story explanation just showing that one carefully
 chosen example appears to work is all, it seems, that is
 needed to prove vector representations of word meanings
 work.  In any real representation system building we would
 have the verification, validation, and testing work done and
 presented which shows that the designed representations
 happen as designed, only as designed, and can only happen as
 designed, and always happen well enough in all the conditions
 and situations they are designed to work in, particularly
 when used in any computational reasoning processes these
 representations are used by.  Doing all this often takes more
 work than the designing and implementing does.

 To show that the vector(<king>) - vector(<man>) +
 vector(<woman>) example really is an example of the vector
 representation of word meaning the fairy story people say it
 is, we need to see that a generous set of representative
 examples of vector algebra combinations of text token vectors
 all, without fail, result in sensible semantic outcomes.  And
 not just combinations of three vectors, combinations of any
 number of vectors.  And, if there are any such combinations
 that are not supposed to result in semantically sensible
 outcomes, we need to see that they don't.  And we'd need to
 see what extra machinery is added to our vector
 representation system to stop these particular cases of
 vector combinations from ever being treated as semantic
 representations in some reasoning process.  We need to see
 the circumscribing machinery for this.  Needless to say, I've
 never seen anything like this in any reporting of any LLM
 building.

 There's more.  How, exactly is the resulting vector of the

     vector(<king>) - vector(<man>) + vector(<woman>)

 example shown to be the same, or sufficiently the same, as
 the vector for the token <queen>?  I've not found any clear
 explanation of this, just statements like "you can see they
 are the same", or "we check to see if they are the same", or
 we see nice neat drawings showing how all the vectors join up
 as they are supposed to, but no numerical data for the
 vectors drawn is offered.  Is the sameness here really as
 exact as these drawings make it?  No, of course not.  These
 drawings are cartoons, not demonstrations of sameness.  (And,
 they are not even proper drawings of vectors in a vector
 space, they are drawings of lines with arrows on one end in a
 coordinate space.)  Remember, these vectors are of the order
 of 12,000 elements long, so giving us all the numbers to look
 at would need plenty of space, and not be easy to work with.
 But, we could then check for ourselves.  Better, of course,
 would be that the fairy story tellers told us exactly how
 this sameness or sufficient similarity is reliably assessed.
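
 For reference, the check usually reported for this example
 can be sketched as follows: form the combined vector, then
 pick the token in the token set whose vector has the highest
 cosine with it, leaving out the three input tokens.  This is
 an assumption about common practice, not something taken from
 any particular LLM.  The four-element vectors are invented
 toy values, not real embedding data, and the function names
 are hypothetical.

```python
import math

# Invented toy vectors for illustration only, NOT real embedding values.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1, 0.3],
    "man":   [0.1, 0.9, 0.1, 0.2],
    "woman": [0.1, 0.1, 0.9, 0.2],
    "queen": [0.9, 0.1, 0.8, 0.3],
    "apple": [0.2, 0.2, 0.2, 0.9],
}


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Rank candidate tokens for a - b + c by cosine, excluding a, b, and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    scores = {
        token: cosine(target, vec)
        for token, vec in vectors.items()
        if token not in (a, b, c)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = analogy("king", "man", "woman", TOY_VECTORS)
best_token, best_cosine = ranking[0]
```

 With these toy values the top-ranked token is "queen", but
 its cosine with the combined vector is below 1.  So "the same
 as" here means "nearest of the tokens left in the set", and
 both the choice of similarity measure and the exclusion of
 the input tokens are design decisions that need to be stated.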

 There are different ways to assess the similarity of two
 vectors.  Commonly used ones in computational linguistics are
 Cartesian distance, vector dot product, and vector cosine.
 Cartesian distance is not a vector space quantity, it's a
 [Cartesian] coordinate space property, in which we have
 points, lines, and [hyper]planes, so it does not give us a
 vector comparison; it gives us how far apart two points are
 in the coordinate space.  If you take the end points of
 vectors as points in a coordinate space overlaid on our
 vector space so that the origins and axes of both spaces
 coincide, we can apply this method, but you are then not
 dealing with vectors, just points.  Or, is this what we are
 really dealing with, embedding points in a coordinate space,
 and not vectors in a vector space?  It often looks like it.
 Look at the picture in the Wikipedia page for Word embedding,
 for example (see
 <https://en.wikipedia.org/wiki/Word_embedding>)

 This shows two pairs, France --> Paris and Germany -->
 Berlin, and describes "France", "Germany", "Paris", and
 "Berlin", as being points in a coordinate space, not vectors
 in a vector space!  What look like vectors in this drawing --
 lines with arrows on one end -- are drawn to indicate the
 translations from the point labelled "France" to the point
 labelled "Paris", and from the point labelled "Germany" to
 the point labelled "Berlin", together with the claim that
 these two translations are the same, or, at least, near
 enough the same.  But these translations in this coordinate
 space are not vectors, they need a matrix to define them, yet
 there's no mention of this, let alone any indication of how
 the similarity of these matrices is to be assessed.  This is
 a common confusion on display in many web pages and
 publications on so called word embedding and the "semantic
 relationships" captured by so called word vector embeddings.
 To me, this kind of basic confusion about what are points and
 translations in a coordinate space and what are vectors in a
 vector space, suggests the people who do this stuff don't
 understand what they are doing, or don't care to present and
 explain it properly.

 The vector dot product and vector cosine are vector space
 quantities, so these do offer ways we might compare two token
 vectors, but they have importantly different qualities which
 we need to know about and understand.  Take the dot product
 first.  This takes into account both the direction and the
 magnitude of the vectors involved; the two properties of
 vectors.  But, it does not give us a result which strongly
 corresponds to how well aligned the two vectors are.  Two
 vectors with the same direction, but quite different
 magnitudes, will have a different dot product to two aligned
 vectors which have similar magnitudes.  This may be what we
 want, of course, but then we must say what the direction of
 token vectors represent, and what the magnitudes of token
 vectors represent.  If there really are two important aspects
 of text tokens as they occur in the text of our GCT which we
 need to represent, then using the two properties of vectors
 to represent these would make sense, but if there is only one
 aspect to represent, using vectors introduces distorting
 artefacts into our representation system.  The fairy story
 doesn't tell us what two aspects of text tokens are well
 represented using the magnitude and direction of vectors, and
 which aspect is well represented by the magnitude and which
 by the direction of token vectors.  We need this to properly
 understand what is really going on and why.  Furthermore, if
 the magnitude of token vectors does do representation work,
 not just the direction, we need to know what all the
 dimensions of the vector space represent, all 12,000 of them,
 and how different amounts of these vector space dimensions do
 needed representation work, and how what each dimension
 represents is really orthogonal to what all the other
 dimensions represent.  Something else the fairy story is
 silent about.

 A typical simple illustration of this idea can be seen in the
 diagram on the Wikipedia page for Distributional semantics,
 the [hardly ever mentioned] origin of the fairy story.  It's
 one of many similar diagrams to be found on the web (see
 <https://en.wikipedia.org/wiki/Distributional_semantics>).
 Here the two axes of a two-dimensional "semantic space" are
 labeled "political" (from 0.0 to 0.3 horizontally) and
 "dangerous" (from 0.0 to 0.3 vertically), and we have example
 vectors in this space for "animal", "shark", and
 "dictatorship".  If we are to take this cartoon seriously,
 this illustration says that the meaning of the word "shark"
 is composed of, or shares, about 0.1 of the "meaning" of
 "political" and about 0.22 of the "meaning" of "dangerous",
 and the "meaning" of "dictatorship" is composed of, or
 shares, about 0.27 of the "meaning" of "political" and about
 0.13 of the "meaning" of "dangerous".  Is this how the real
 meanings we build from reading text to build words in our
 head work?  If so, where is the plentiful reliable empirical
 evidence for this, across a good selection of languages,
 cultures, and domains?  Obviously this particular example is
 made up for "illustrative purposes", but this does not excuse
 the silliness here.  Nor does it serve to establish any real
 basis for this kind of idea about how text, words, and
 meanings work in real human languaging.  Yet, the fairy story
 is based upon this idea; it all starts with the so called
 Distributional hypothesis: "words that are used and occur in
 the same contexts tend to purport similar meanings."  Let's
 see this story told with the 12,000 or more dimensions of the
 vector spaces used in today's LLMs.  Let's see what are the
 "words" put on the 12,000 or more supposedly orthogonal
 dimensions of these vector spaces.  Hand waving with simple
 2D examples may be good enough for fairy stories, but it's not
 good enough, I would say, for a good explanation of what is
 really going on in the vector spaces built and used in LLMs
 today.

 Another issue with the vector dot product as a vector
 similarity test is that pairs of vectors which are not well
 aligned can have the same, or similar, dot product values as
 two vectors which are well aligned, so the dot product
 doesn't strongly distinguish between vectors which are well
 aligned and vectors which are not aligned.  So, if vector
 alignment is important in the story of how text token
 "meaning" is captured in vectors, we need a good explanation
 of how this works well enough, given that the way the vectors
 are built by the GloVe procedure using dot product
 comparisons cannot make vectors of tokens with similar
 "meaning" always be closely aligned.  But, according to the
 always quoted example of how this "semantic vector
 arithmetic" works, about "king", "man", "woman", and "queen",
 vector alignment is important.  This needs proper
 explanation, not the usual "hey look, it works" treatment.
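
 A small numerical illustration of this point, as a Python
 sketch.  The two-element vectors are invented for the
 purpose, and the helper names are illustrative.  It shows one
 vector paired first with a perfectly aligned vector and then
 with a vector 45 degrees away, both giving the same dot
 product.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def angle_degrees(a, b):
    """Angle between two vectors, in degrees."""
    cos = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding just outside [-1, 1]
    return math.degrees(math.acos(cos))


a = [1.0, 0.0]
aligned = [2.0, 0.0]  # same direction as a, twice the magnitude
skewed = [2.0, 2.0]   # 45 degrees away from a

dot_aligned = dot(a, aligned)
dot_skewed = dot(a, skewed)
angles = (angle_degrees(a, aligned), angle_degrees(a, skewed))
```

 Both dot products come out as 2.0, while the angles are 0 and
 45 degrees.  The dot product alone cannot tell these two
 cases apart.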

 What about vector cosine?  This measures the relative angle
 between two vectors.  So, if it is vector direction which is
 important, and not magnitude, then this looks like what we
 need to assess token vector similarity.  Although to
 calculate vector cosine we need to calculate the magnitudes
 of the vectors involved, vector cosine values are independent
 of the vector magnitudes.  Essentially, the vector cosine
 treats all vectors as unit vectors: vectors all of magnitude
 one.  If this is all we need to compare our token vectors,
 why don't we build token-to-vector making systems that
 generate unit vectors?  This would make these systems easier
 to verify and validate, and understand.  The reason this is
 not done, I suspect, is that the "semantic vector algebra"
 would not work, or not work as well.  This is speculation on
 my part, but why are vectors, with their two defining
 quantities, direction and magnitude, built when we only use
 one of these quantities to build the so called vector
 representations of meaning?  Building unused properties into
 a representation system is not good representation design:
 you need to test that despite not being used such properties
 do not result in unwanted artefacts in the representation
 processing.  More, if it is only vector direction that is
 important for the claimed semantic representation, why not
 use points in a coordinate system?  Using unit vectors adds
 nothing, except, apparently, confusion in the minds of the
 representation builders.  But, of course, if all you need to
 provide is a hand waving fairy story about what goes on, none
 of these kinds of important representation system design and
 implementation issues need to be explained, let alone
 admitted to.  You can't see them in the fairy story.
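
 The claim that vector cosine ignores magnitude can be checked
 with a few lines of Python.  This is only a sketch with
 invented two-element vectors and illustrative helper names.
 It shows that scaling a vector changes the dot product but
 not the cosine, and that the cosine is the dot product of the
 corresponding unit vectors.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Magnitude (length) of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (norm(a) * norm(b))


def unit(a):
    """The unit vector with the same direction as a."""
    n = norm(a)
    return [x / n for x in a]


a = [3.0, 4.0]
b = [1.0, 2.0]
scaled_a = [10 * x for x in a]  # same direction as a, ten times the magnitude

cos_ab = cosine(a, b)
cos_scaled = cosine(scaled_a, b)          # unchanged by the scaling
dot_scaled = dot(scaled_a, b)             # ten times dot(a, b)
dot_of_units = dot(unit(a), unit(b))      # equals the cosine
```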

 A question that seems obvious to me is why the GloVe
 procedure doesn't use vector cosine, rather than the vector
 dot product, as it does.  The vector cosine does render a
 convex minimisation surface so should, on the face of it,
 work better to minimise the objective function.  Jurafsky and
 Martin, in their book on Speech and Language Processing hint
 at people doing this, see [9], page 110, and briefly mention
 the difficulties of using vector dot product, but I suspect
 that vector cosine is not used because it is computationally
 more expensive than vector dot product.  As we have seen,
 building the token-to-vector Connectionist system we need
 involves doing lots of vector dot products, and this seems,
 to the LLM builders, to work well enough, see video one in
 [8], so this is what is mostly done.  Yet more, hack it 'til
 it works, I'd call this.  Or, is there some sound argument
 about how word meanings are captured by embedding vectors that
 I have missed here?  I'd be happy to be enlightened about
 this.

 In note [8] I list three YouTube videos which offer fairy
 story explanations of some of what I talk about in this part,
 and illustrate the kinds of confusions I have identified.


4 It's time to stop

 What I have explained here is only the beginnings of what
 happens inside systems like ChatGPT. There's plenty more to
 understand, but I'm not going to show this here.  However, I
 do want to mention a few things to try to clarify particular
 aspects of the way systems like ChatGPT work.

 The input to the LLM inside ChatGPT is not a simple list of
 tokens made from a prompt we enter via the ChatGPT interface.
 The input is a large matrix which is as wide as the token
 vectors are long, which for GPT-3 is 12,288, and as deep as
 the ChatGPT "Context Window", that's the total number of text
 tokens it works with at any time, which for GPT-3 is 2,048,
 but much bigger for later versions of GPT used in ChatGPT
 today.

 This matrix is populated with the token vectors made for the
 tokens from our prompt, after it has been tokenised, and,
 after these token vectors have had their respective position
 vectors added to them.  Position vectors encode the position
 each token comes in in the linear sequence of tokens made
 from our prompt.  Each position vector is the same length as
 the token vectors, 12,288 for GPT-3, and is filled with
 values from a sine function, whose frequency is incrementally
 increased as we go along the token sequence: later position
 vectors have values from high frequency sine waves.  This is
 how token sequence ordering is encoded and added into the
 input to the LLM. But, I would ask, what does adding a 12,288
 long vector of sine wave values do to the supposed "semantic
 representation" our text token vectors are supposed to do?
 Nothing?  I'd like to see this explained.  Adding these
 position vectors to our token vectors changes all the values.
 How can this not change what they represent, if they
 represent anything?
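
 The effect asked about here can be shown with a small Python
 sketch.  It assumes the sinusoidal scheme of the original
 Transformer paper, in which each element of a position vector
 comes from a sine or cosine whose frequency depends on the
 element's index; whether any particular GPT model uses
 exactly this scheme is not assumed.  The four-element token
 vectors are invented toy values, and the function names are
 illustrative.

```python
import math


def position_vector(pos, dim):
    """Sinusoidal position vector (assumed original-Transformer scheme)."""
    vec = []
    for k in range(dim):
        # Elements come in sine/cosine pairs that share one frequency.
        freq = 1.0 / (10000 ** ((k - k % 2) / dim))
        vec.append(math.sin(pos * freq) if k % 2 == 0 else math.cos(pos * freq))
    return vec


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy token vectors, not real embedding values.
token_a = [0.5, -0.2, 0.1, 0.7]
token_b = [0.4, -0.1, 0.2, 0.6]

with_pos_a = add(token_a, position_vector(1, 4))  # token_a at position 1
with_pos_b = add(token_b, position_vector(7, 4))  # token_b at position 7

cosine_before = cosine(token_a, token_b)
cosine_after = cosine(with_pos_a, with_pos_b)
```

 In this toy case every element of the token vector changes
 once its position vector is added, and the cosine between the
 two token vectors changes with it.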

 To the tokens of our input prompt ChatGPT adds what is called
 a system prompt.  This is quite a lot of hidden stuff.  In the
 case of Claude, for example, the system prompt is reported to
 be 16,739 tokens, see [10].  The system prompt basically
 configures the behaviour of ChatGPT and the way the LLM is
 made to process our prompt token vectors.  We never see any of
 this system prompt stuff, what it is composed of, or what
 it's there for.  It's more LLM Dark Art.

 If the 12,288 by 2,048 input matrix is not completely filled
 with tokens from our prompt and the system prompt, the rest
 of this matrix is filled with null tokens, which have a
 special vector.  Then, after some prompt and response cycle,
 the output of ChatGPT is added into this input matrix, after
 our first prompt and the system prompt token vectors.  And
 this keeps going until the input matrix, the so called
 ChatGPT Context Window, is filled up.  From then on, tokens
 are removed on a first-in first-out basis, to make room for
 the latest generated output and any more prompts from us.
 This is how ChatGPT appears to "remember" our previous
 prompts and its previous responses, and then starts to forget
 these.  And this is why making this so called "Context
 Window" bigger seems to make ChatGPT generate better text in
 response to our prompts.
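
 The first-in first-out behaviour described above can be
 sketched in Python with a fixed-length queue.  This is only
 an illustration of the description given here: the window
 size of 8, the token strings, and the ordering of system
 prompt before user prompt are all invented, and real systems
 may manage their context differently.

```python
from collections import deque

WINDOW_SIZE = 8  # invented; real context windows hold thousands of tokens


def make_window(size=WINDOW_SIZE):
    """An empty context window; a deque with maxlen drops its oldest items."""
    return deque(maxlen=size)


def add_tokens(window, tokens):
    """Append tokens to the window and return its current contents."""
    window.extend(tokens)
    return list(window)


window = make_window()
add_tokens(window, ["sys1", "sys2", "p1", "p2", "p3"])  # system prompt, then prompt
add_tokens(window, ["r1", "r2", "r3"])                  # response fills the window
contents = add_tokens(window, ["p4", "p5"])             # the two oldest tokens drop out
```

 After the last call the window holds p1, p2, p3, r1, r2, r3,
 p4, p5: the two oldest tokens have been "forgotten".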

 The output of the LLM inside ChatGPT is not the text we get
 back from ChatGPT in response to our prompt.  This is
 assembled from repeated cycles of presenting the input matrix
 to the LLM and adding one more token to this matrix selected
 from what the LLM actually outputs, which is an estimated
 probability distribution over all the tokens in the text
 token set.  The LLM does not predict the next token, nor
 "predict the next word".  It builds a complete estimated
 probability distribution over all the text tokens.  It is an
 extra piece of machinery inside ChatGPT which selects which
 token to take out of this estimated probability distribution
 to add to the input matrix.  And it's another piece of
 machinery which decides when to stop doing this, and prepare
 the accumulated newly generated tokens to present back as
 readable text.
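
 The separation between the distribution the LLM outputs and
 the machinery that picks a token from it can be sketched as
 below.  The assumptions: the distribution is taken to come
 from a softmax over raw scores, as is usual; the four-token
 set and the scores are invented; and the two selection
 functions are just two common options, not a description of
 what ChatGPT actually does.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a probability distribution over all tokens."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(probs):
    """Selection rule 1: always take the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)


def pick_sampled(probs, rng):
    """Selection rule 2: draw a token at random, weighted by probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


TOKENS = ["the", "cat", "sat", "<end>"]  # invented four-token set
scores = [2.0, 1.0, 0.5, 0.1]            # invented model outputs

probs = softmax(scores)                  # this is what the model provides
greedy_token = TOKENS[pick_greedy(probs)]
sampled_token = TOKENS[pick_sampled(probs, random.Random(0))]
```

 The same distribution can yield different tokens depending on
 which selection rule is bolted on, which is the point: the
 choice of token is not made by the LLM itself.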


5 Some last remarks

 LLMs do not, despite their name, model language.  If they
 model anything, they model statistical patterns found in the
 text token form of our GCT. Yes, at least some of these
 patterns, perhaps many of them, will have some kind of
 correspondence to patterns of words we make in our
 languaging, which we speak and write down.  But, the LLM has
 no knowledge or understanding of these correspondences, only
 we do.  LLMs are not built, so called "trained", with any
 data about patterns found in our languaging.  The data used
 is only tokenised forms of text, and text, as I have
 explained, is what is left after we do some writing.  Text is
 not the writing.  Text is not languaging.  Text is the marks
 left when we write something we want to say.  It's a
 precipitate of writing.  The words we chose to use to say
 something with, in the way we decide to say it, are in our
 heads, nowhere else.  The text left over from writing the
 words we decide upon are signs for how the words we chose to
 say something with sound.  Text is not composed of signs for
 the meaning of what we want to say.  To get the meaning
 again, you need a listener or a reader who understands the
 language(s) we speak and write in.  LLMs don't do any writing
 or any reading; they don't do any language understanding.
 LLMs do text token processing, a lot of it, but that is all.
 Systems like ChatGPT with an LLM inside, do text-to-text
 transformation.  It requires us to read and understand
 anything from the text generated by such text token
 processing.  It's clever how LLMs are used to generate text
 which is readable and understandable by us, but not really a
 surprise, given they are built with gigantic amounts of text
 left from human writing.  But, just because we do get
 readable and understandable text from systems like ChatGPT
 does not mean these systems understand language, or know
 about, reason about, or understand anything that the text
 seems to be about to us readers.  ChatGPT, and other
 automatic text generating systems like it, don't know,
 reason about, or understand anything.  They are just built and
 described and talked about in ways to make it look like they
 do.  We should take care not to be fooled by this kind of
 fairy story.


6 To end

 I have tried to get clear, and be clear, about (some of) what
 really goes on inside Generative AI systems like ChatGPT, but
 I am not welded to what I have put here.  I would be more
 than happy to receive any and all corrections and
 clarifications to anything above, or even just indications of
 where corrections and clarifications are needed.  I am,
 however, unlikely to respond to further attempts to push the
 Generative AI fairy story.  For me, this has no clothes on,
 and I think we need to say so.

 If you've read all the way to here, you have my appreciation
 and a big thank you.  Telling real stories that offer good
 explanations, it turns out, takes longer than telling fairy
 stories full of fantasies.

 -- Tim


Notes

[1] David Zeitlyn, 2022.  An Anthropological Toolkit, Berghahn
    Books, Chapter 42.  Ostension, pp 90, opening sentence.

[2] David Deutsch: Explanations.  Why David Deutsch believes
    good explanations are the antidote to bad philosophy,
    video interview, Aeon, 25 June 2025, opening remarks.
    <https://aeon.co/videos/why-david-deutsch-believes-good-explanations-are-the-antidote-to-bad-philosophy>

[3] Robert Bringhurst, 2004: The Solid Form of Language, An
    Essay on Writing and Meaning, Gaspereau Press, pp 9.

[4] There are plenty of places to find the BPE presented in
    more detail, here is a simple and, I think, sufficient,
    explanation.
    <https://en.wikipedia.org/wiki/Byte-pair_encoding>.

[5] Hopeful Terminology is not special to Connectionist AI,
    and today's Generative AI business.  All kinds of AI
    research have plenty of their own versions of Hopeful
    Terms.  It's been something that few have commented upon
    and criticised, but the first person I know to do this was
    Drew McDermott in a short paper called "Artificial
    intelligence meets natural stupidity", in ACM SIGART
    Bulletin, Issue 57, pp 4-9,
    <https://doi.org/10.1145/1045339.1045340>.  A direct
    consequence of this bad practice is that the builders of
    AI systems then use these hopeful terms to present what
    can only be poor explanations of what their systems do and
    how, and also fail to recognise that their understanding
    of their own system is badly flawed.

[6] Ursula K Le Guin, 1998: Steering the Craft, A 21st-century
    guide to sailing the sea of story, first Mariner Books
    edition 2015.

[7] GloVe: Global Vectors for Word Representation, Jeffrey
    Pennington, Richard Socher, Christopher D Manning
    <https://nlp.stanford.edu/projects/glove/>.  This page
    presents more examples of the so called "vector difference
    between the two word vectors" whereas what is actually
    drawn are points labelled as words in a 2D coordinate
    space with lines indicating translations from one word
    point to another word point.  In vector space there is no
    difference operation, just vector addition (and
    subtraction) and scalar multiplication.  What is presented
    here is confused, and confusing, and no good explanation
    of what is really going on.  Nor is anything said about
    how the 50,000 or more dimensions of the illustrated "word
    vectors" are projected into the 2D space of this illustration.
    The authors also seem not to understand that translations
    in a coordinate space need square matrices, not vectors to
    specify them.

[8] The three videos.  There are millions of videos on the web
    about "word embedding", GloVe, and LLMs -- well, okay, not
    millions, but it feels like there are millions -- to save
    you looking at all these, here are just three which are
    useful in the context of the story I tell here.

    Lecture 3: GloVe: Global Vectors for Word Representation
    Stanford University School of Engineering, presented by
    Richard Socher, see
    <https://youtu.be/ASn7ExxLZws?si=e8ag__SacXOp1Tlr> (April
    2017).  In this lecture we get the claim, in response to a
    question from someone in the class, that the fact the
    vector dot product does not render a convex optimisation
    surface doesn't seem to present any problems, so it's just
    ignored.

    Vectoring Words (Word Embeddings) by Rob Miles on
    Computerphile (5 years ago)
    <https://youtu.be/gQddtTdmG_8?si=zz4b6zBb2RwsnOiL>.  This
    presents the fairy story idea that two words are similar
    if they are often used in the same "context", but without
    any clear explanation of what "context" means here, and
    how we can identify the context of any particular word.
    And Miles talks about words being points, not vectors, thus
    illustrating this common confusion in all this so called
    "word embedding" stuff.

    Glitch Tokens by Rob Miles on Computerphile (again)
    <https://youtu.be/WO2X3oZEJOA?si=154hJuyCMVLcWnKC> (2023).
    This presents some entertaining examples of strange text
    tokens and weird ChatGPT behaviour, and, along the way,
    displays the usual calling tokens words, when what they
    really are are text tokens, and other hand waving ways of
    talking we see in the fairy story explanations of the
    workings of LLMs.

[9] Daniel Jurafsky and James H Martin, 2024.  Speech and
    Language Processing, Third Edition (Draft of January 12,
    2025)
    <https://web.stanford.edu/~jurafsky/slp3/ed3book_Jan25.pdf>

[10] Claude's System Prompt: Chatbots Are More Than Just
     Models, from dbreunig.com
     <https://www.dbreunig.com/2025/05/07/claude-s-system-prompt-chatbots-are-more-than-just-models.html>

[11] Here's a late extra note not cited above, but which
     confirms, I think, that it's all only text tokens, never
     anything like words and meanings, no matter what it might
     all be made to look like.

     The mystery of em‑dashes: part two with quantitative
     evidence By Maria Sukhareva, at AI Realist, July 05, 2025
     <https://msukhareva.substack.com/p/the-mystery-of-emdashes-part-two>

==============================================================

