I saw a beautiful blue butterfly

The common thinking is that language emerged in humans because of our superior intellect compared to other animals. But with the rise of Large Language Model (LLM) based AI, which seems to exhibit a form of intelligence, perhaps it is time to ask:

Did we get that backwards? Is Intelligence just an emergent property of Language?

Hear me out. The current scientific consensus is that the human mind has not changed in the last 160,000 years. Yet civilization and technological advancement seem very recent, appearing only 5,000-6,000 years ago. This is also roughly when complex language forms arose. Since then, language has accrued new words at an exponential rate, and as language exploded, so did the pace of human innovation.

I believe it is the expansion in language that has driven the explosion in apparent human intelligence.

I saw a beautiful blue butterfly

When I tell you “I saw a beautiful blue butterfly this morning,” you can reconstruct it in your mind’s eye with great clarity. But to do so, you must already know what the color blue is, likely something about colors and their nature in general, and that a butterfly is a winged insect, metamorphosed from a caterpillar, with a certain shape and form. You must know what a morning is, and what feeling the word beautiful represents. Furthermore, you understand what an insect is, where the sun sits in the morning sky, and that it must be late spring to early fall. And so on down the rabbit hole of trained knowledge.

Language is a compression function

“I saw a beautiful blue butterfly this morning” conveys a lot of information in just a few words. Language is, in this respect, a highly compressed way to convey information: a compression function that relies on a shared recursive dictionary. We spend the vast majority of our lives building this map from dictionary words to sets of meaning, much the way we train LLMs. But once a significant amount of it has been learned, the rate at which we can transfer knowledge, intent, and meaning between each other skyrockets.
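The compression idea can be sketched in code. Below is a minimal toy model, not a real knowledge base: every dictionary entry here is a hypothetical placeholder. The point is that "decompressing" an eight-word sentence against a shared recursive dictionary recovers far more background knowledge than the sentence itself carries.

```python
# Toy sketch of language-as-compression: a short message is "decompressed"
# against a shared dictionary of background knowledge. All entries here are
# hypothetical illustrations, not a real knowledge base.

SHARED_DICTIONARY = {
    "butterfly": ["winged insect", "metamorphosed from a caterpillar"],
    "blue": ["a color", "a wavelength of visible light"],
    "morning": ["time of day", "sun low in the east"],
    "beautiful": ["an aesthetic feeling"],
    "caterpillar": ["larva of a butterfly or moth"],
    "insect": ["small six-legged animal"],
}

def decompress(message: str) -> set[str]:
    """Recursively expand each known word into the shared background facts."""
    facts: set[str] = set()
    pending = [w.strip(".,").lower() for w in message.split()]
    while pending:
        word = pending.pop()
        for fact in SHARED_DICTIONARY.get(word, []):
            if fact not in facts:
                facts.add(fact)
                pending.extend(fact.split())  # recurse into the definition's words
    return facts

msg = "I saw a beautiful blue butterfly this morning"
knowledge = decompress(msg)
print(len(msg.split()), "words unpack into", len(knowledge), "background facts")
```

Note the recursion: "butterfly" pulls in "caterpillar" and "insect", which pull in their own entries. The message is short only because both sides have already paid the cost of training the dictionary.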

Understanding requires shared data sets

It may also explain why different cultures often have such a hard time understanding each other. Our dictionaries are filled with different information, because we trained them on different data. What may seem obvious to you and me may be completely foreign to a recent immigrant.

In the abstract, words are just arbitrary tokens indexing into a deep set of underlying trained knowledge.
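In LLM terms this is quite literal. The sketch below uses a hypothetical three-word vocabulary and random placeholder vectors (not any real model's weights): a word maps to an arbitrary integer token id, and that id indexes a row of learned parameters. The word itself carries no meaning beyond being that index.

```python
# Toy sketch: a word is just an index into learned state. The vocabulary and
# vectors below are hypothetical placeholders, not from any real model.
import random

vocab = {"butterfly": 0, "blue": 1, "morning": 2}  # word -> arbitrary token id

random.seed(0)
# One "meaning vector" per token id; in a real LLM these are trained embeddings.
embeddings = [[random.random() for _ in range(4)] for _ in vocab]

def lookup(word: str) -> list[float]:
    """The token is only an index; all the 'meaning' lives in the table."""
    return embeddings[vocab[word]]

print(vocab["butterfly"], lookup("butterfly"))
```

Swap the ids around and nothing changes, as long as speaker and listener swap them the same way: the token is arbitrary, the trained table behind it is not.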

Feed your kids tokens

We already know that reading to your children has a significant impact on their brain development. This reinforces my premise that language leads to intelligence: reading to your children trains their mind's map of words to meaning. We find that children with a large vocabulary and knowledge base do better in school, in life, and beyond. We find that even passive exposure to language improves children's performance.

This matches what we find with LLM AI: the larger the training data set, the better these systems do on common tasks. With the latest (as of this writing) 10-trillion-token data sets, we're finding incredible improvements in cognition.

Are we failing parts of our society due to language neglect?

We often blame poverty on culture, on intelligence, on poor food, poor environment, or any number of -isms. But what if it's as simple as language starvation?

We know from many studies that people from poor families tend to have significantly lower language skills. What if the key to pulling people out of poverty is as simple as finding a way to feed these kids language at a high rate? Forget math, science, history, and the rest. Just hammer away at language as fast and hard as possible until they reach parity, and then introduce more abstract schooling.

If intelligence arises from language, and abstract thought is the highest form of intelligence, how can you possibly learn abstract concepts if you do not have a deep enough dictionary of tokens?

Language and Super Intelligence

From this thought experiment, it is evident to me that current AI, trained on human-generated information, will be limited by the number of tokens (words) in our language. Although AI should be able to outperform the average human, it will not be able to reach super intelligence with those language limitations.

As humans, we have biological limits on the size of our language, though I don't think anyone yet understands what the upper limit is. Thinking machines have no such biological limitation.

To venture beyond this and on to super intelligence, they will have to invent their own tokens, words that carry even deeper meaning, and then invent their own language form, which will likely be completely alien to us. Only then will they truly exceed the capabilities of their creators. I don't think that time is far off.