
TL;DR:

In this post, I lay out a compelling case that the Transformer architecture is a holographic associative memory. The attention mechanism performs a Fourier Transform, the mathematical operation that describes holography. The MLP layer performs the inverse Fourier Transform, returning information back to its original format. And the trainable parameters inside the Transformer act as the holographic film that records interference patterns. Transformers make predictions in much the same way that Hopfield networks and holographic films perform pattern completion. If Transformers work via holography, that opens a huge avenue for optimizing the architecture, for example by using physical light waves directly instead of simulating them, which could yield a ~1000x increase in training and processing speed.

Beginning

I was curious about how the LLM architecture works, and why it works. So I started researching it.

LLMs are based on the Transformer architecture. A single-layer Transformer essentially consists of a sequence of information-processing modules, passing information in one direction (a minimal code sketch follows the list):

  1. Word embedding layer. It takes a sentence, a sequence of words, and essentially turns every word into a vector in a high-dimensional space.
  2. Self-attention mechanism.
  3. Multi-layer perceptron (MLP), or feedforward network.
  4. Unembedding layer. It takes the high-dimensional vectors and essentially turns them back into words, producing a prediction of the next word.
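To make the flow concrete, here is a minimal sketch of a single Transformer layer in NumPy, using toy dimensions and random weights purely for illustration (this is not the actual code of any real LLM):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
vocab, d_model, d_ff, seq_len = 50, 16, 64, 5

# 1. Embedding: each token id becomes a d_model-dimensional vector (+ position info)
E = rng.normal(size=(vocab, d_model))
pos = rng.normal(size=(seq_len, d_model))
tokens = np.array([3, 14, 7, 7, 42])
x = E[tokens] + pos                      # (seq_len, d_model)

# 2. Self-attention: every token mixes in information from every other token
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
x = x + softmax(Q @ K.T / np.sqrt(d_model)) @ V   # residual connection

# 3. MLP / feedforward network, applied to each position independently
W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))
x = x + np.maximum(x @ W1, 0) @ W2                # residual connection

# 4. Unembedding: project back to vocabulary logits and predict the next word
logits = x @ E.T
next_token = logits[-1].argmax()
```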

Attention mechanism is a Hopfield network

I wanted to better understand what the attention mechanism does, so I started researching it. And I learned that the attention mechanism in Transformers is equivalent to the update rule of modern Hopfield networks.

[2008.02217] Hopfield Networks is All You Need

“We introduce a modern Hopfield network with continuous states and a corresponding update rule [...] The new update rule is equivalent to the attention mechanism used in transformers.”

A Hopfield network is a neural network that works like an associative memory. It can record many different patterns into itself, and it can then recall the correct complete pattern if you input an incomplete version of that pattern. You can see an illustration of how it works below:

Hopfield Networks is All You Need – Github

https://preview.redd.it/18qq06xhnqie1.png?width=1672&format=png&auto=webp&s=dd2475174bf51ab4ec37772ce7f536503a400d08

Here, it has memorized different characters from The Simpsons, and is then able to recall the complete character from an incomplete input.
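A minimal numerical sketch of that retrieval behavior (a toy example of my own, using the modern Hopfield update rule ξ_new = X·softmax(β·Xᵀξ) from the paper, which has exactly the shape of softmax attention):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)

# Stored patterns: one "memory" per column of X
X = rng.choice([-1.0, 1.0], size=(64, 5))

# Query: a corrupted copy of stored pattern 2 (about a third of its entries flipped)
query = X[:, 2].copy()
query[:20] *= -1

# Modern Hopfield update: xi_new = X @ softmax(beta * X.T @ xi)
beta = 2.0
retrieved = X @ softmax(beta * X.T @ query)

print(np.corrcoef(retrieved, X[:, 2])[0, 1])  # ~1.0: the full pattern is recalled
```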

I became hugely curious about Hopfield networks, because they are much more biologically plausible, closer to how the real neurons in our brains work. So I researched them further, and I found out that Hopfield networks are essentially holographic data storage systems. Specifically, they are a type of holographic associative memory.

Hopfield network is a holographic associative memory

But what is a hologram? How do holograms work?

Introduction To Holography – 1972

If you want to better understand holography, I would recommend watching this YouTube video about it.

Chapter 12: Phase Conjugation and Real Image Projection

https://preview.redd.it/bzpk1l3mnqie1.png?width=465&format=png&auto=webp&s=90267544c59514c5a30a72c04024c8b1db1f1665

Here is how holography works.

Say we expose two physical objects, a toy car and a rubber duck, to a laser beam, and record the result in a holographic film. The light waves traveling from those two objects interfere with each other, creating an interference pattern. This interference pattern gets recorded and stored inside the holographic film.

The next time you shine the laser at the first object, the toy car, and its light waves reach the already-recorded holographic film, the interference lines that previously formed inside the film act as partially transparent mirrors. Those mirrors redirect the light waves of the first object in such a way that it is as if the light waves that traveled from the second object were being projected, creating the image of the rubber duck.

So in a sense, holography creates an associative memory between those two physical objects. An interesting property of the holographic film is that even if you only send a projection of part of the toy car to the film, it will recreate the whole image of the rubber duck, showing how it can complete incomplete patterns, or “predict” their missing parts. And a single holographic film can store the associations of many pairs of objects.

You can already see the similarities with Hopfield networks. To make the analogy even closer, imagine a holographic film that stores two light projections of the exact same object. Then, if you project an incomplete version of that object, the film will recreate the fully complete version, as long as the incomplete input matches the stored object.
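Here is a toy numerical version of that story (my own simplified 1D model, not a physically accurate simulation): the film stores the interference intensity |A + B|² of two waves, and re-illuminating the film with wave A alone produces a term proportional to wave B, so the associated pattern reappears.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 256

# Two complex wavefronts arriving at the film: "toy car" (A) and "rubber duck" (B)
A = np.exp(1j * rng.uniform(0, 2 * np.pi, n))
B = np.exp(1j * rng.uniform(0, 2 * np.pi, n))

# Recording: the film stores the interference intensity of the two waves
film = np.abs(A + B) ** 2          # = |A|^2 + |B|^2 + A*conj(B) + conj(A)*B

# Reconstruction: shine wave A back through the recorded film
out = film * A                     # contains the term |A|^2 * B = B (plus noise terms)

# The output overlaps strongly with B, the "other" object: the duck is reconstructed
print(np.abs(np.vdot(B, out)) / n)   # close to 1.0
```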

“Holographic associative memory with nonlinearities in the correlation domain”

https://opg.optica.org/ao/viewmedia.cfm?uri=ao-26-10-1900

https://sci-hub.se/https://doi.org/10.1364/AO.26.001900

https://preview.redd.it/j3u3a31pnqie1.png?width=940&format=png&auto=webp&s=9c4bd8eba74e78f63be03decf7d3480e98aac8a7

Here is a direct example of a holographic associative memory completing an incomplete pattern, just like the earlier Hopfield network example.

But those are just similarities. How is the Hopfield network actually equivalent to a holographic associative memory?

Willshaw associative net

Holography, Associative Memory, and Inductive Generalization

David Willshaw wrote this chapter in a 1981 book about artificial intelligence.

In it, he showed that holographic film can act as an associative memory. And by making a couple of modifications to the mathematical model of holography, he derived the Willshaw associative net, which works very similarly to the Hopfield network.

https://preview.redd.it/j8lmd72rnqie1.png?width=751&format=png&auto=webp&s=c31d49f3dddf912e69fadc1710fc1f92a830039f

The Hippocampus and Associative Memory

This brilliant presentation shows that the human hippocampus, a region of the brain, works like a Hopfield network.

https://preview.redd.it/tslsl56tnqie1.png?width=940&format=png&auto=webp&s=62bbf3dccf821918a6346bab260f65b5bc2d8959

And it derives a simpler, more biologically plausible version of the Hopfield network by slightly modifying the Willshaw associative net, which itself was based directly on holography. This creates a direct link between holography and Hopfield networks.

https://preview.redd.it/ei2rx71roqie1.png?width=940&format=png&auto=webp&s=fbb5798a873db3834babbdf90c37c0615284af56

And because Hopfield networks are equivalent to the attention mechanism in the Transformer architecture, this means the attention mechanism works by the principles of holography.

Attention mechanism works via principles of holography

If the attention mechanism worked by the principles of holography, we would expect more evidence for this, right? So I started searching, and I found it.

[2105.03824] FNet: Mixing Tokens with Fourier Transforms

In this paper, researchers from Google replaced the attention mechanism in Transformers with a mathematical operation called the Fourier Transform. It hugely improved training speed while maintaining comparable performance.

“Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths.”

“With the addition of just two self-attention sublayers, the hybrid FNet models achieve 97% and 99% of their respective BERT counterpart’s accuracies with only limited speed degradations”

So by retaining only the final two attention sublayers and replacing all the others with a Fourier Transform, they were able to obtain up to 99% of the original performance.

“Interestingly, the gap between BERT and FNet shrinks to just 3% for Large models; this is likely due to FNet-Large being more stable during training than BERT-Large.”

Meaning that as the trained model gets bigger, the performance gap between the attention mechanism and the Fourier Transform keeps shrinking. They were able to shrink it to 3%, and it would probably shrink further for even bigger models.
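For reference, the token-mixing sublayer FNet uses is tiny. Here is a sketch of it with toy shapes, following the formula in the paper: a 2D discrete Fourier transform over the sequence and hidden dimensions, keeping only the real part.

```python
import numpy as np

def fnet_mixing(x):
    """FNet token-mixing sublayer: 2D DFT over the sequence and hidden
    dimensions, keeping only the real part (no trainable parameters)."""
    return np.fft.fft2(x).real

# Toy input: 8 token embeddings of width 16 (random stand-ins for real embeddings)
x = np.random.default_rng(3).normal(size=(8, 16))
mixed = fnet_mixing(x)          # every output position now depends on every token
print(mixed.shape)              # (8, 16): same shape, a drop-in replacement for attention
```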

What is a Fourier Transform? The Fourier Transform is the mathematical operation that describes how information gets recorded into a holographic film. It essentially describes the interference pattern, the interference image, that gets recorded inside the physical holographic film.

Fourier Transform Holography: A Lensless Imaging Technique, Its Principles and Applications

This is strong evidence that the attention mechanism, and the Transformer architecture as a whole, works by the principles of holography.

The Fourier Transform is a neural network

But how could this be? I was curious why a simple mathematical operation like the Fourier Transform was able to replace the attention mechanism so successfully. As I researched, I learned that the Fourier Transform can actually be considered a type of neural network.

The Fourier transform is a neural network | sidsite

“We can consider the discrete Fourier transform (DFT) to be an artificial neural network: it is a single layer network, with no bias, no activation function, and particular values for the weights. The number of output nodes is equal to the number of frequencies we evaluate.”
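You can verify that claim in a few lines (my own check, in the spirit of the linked post): build the DFT matrix explicitly and use it as the weight matrix of a bias-free, activation-free linear layer.

```python
import numpy as np

n = 8
k = np.arange(n)
# DFT matrix: the fixed "weights" of the single-layer network
W = np.exp(-2j * np.pi * np.outer(k, k) / n)

signal = np.random.default_rng(4).normal(size=n)

layer_output = W @ signal          # linear layer: no bias, no activation
fft_output = np.fft.fft(signal)    # library FFT

print(np.allclose(layer_output, fft_output))   # True
```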

But then I wondered: even if the Fourier Transform can be considered a type of neural network, it has no trainable parameters that can be learned. So how is it able to replace the attention mechanism, which has a ton of trainable parameters?

And then it clicked. You can implement trainable parameters in a Fourier Transform setup if you are performing physical holography, because the modifiable holographic film that records the interference patterns can be considered a layer of trainable parameters. A neural network that has its trainable parameters spread across the whole network, and one that has them compressed into the single layer of the holographic film, would give the exact same outputs for the same inputs; the trainable parameters were simply moved and compressed into the holographic film.

What if, when we replace the attention mechanism with the Fourier Transform, the trainable parameters that usually reside inside the attention mechanism get compressed and moved into the next layer, the MLP layer?

That seemed reasonable to me. So I became curious about the MLP layer.

Multi-layer perceptrons perform the inverse Fourier Transform

If the Transformer architecture works via holography, what role does the MLP layer have?

Also, the FNet paper showed a big improvement in Transformer speed, but the improvement was limited, because a huge number of trainable parameters still remained in the MLP layers.

That made me curious whether there was a way to reduce the number of trainable parameters in the MLP layer too, just as we did by replacing the attention mechanism with the Fourier Transform.

So I started researching MLPs, and found some interesting papers.

[2012.14913] Transformer Feed-Forward Layers Are Key-Value Memories

“Feed-forward layers constitute two-thirds of a transformer model’s parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary.”

So essentially, the MLP layer in Transformers is also a type of key-value associative memory, which is equivalent to a Hopfield network, which is equivalent to a holographic associative memory.
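A sketch of that key-value reading of a feed-forward layer (toy sizes and random weights of my own, just to show the structure the paper describes): the input is matched against every key, and the output is a weighted sum of the corresponding values.

```python
import numpy as np

rng = np.random.default_rng(5)
d_model, n_memories = 16, 64

K = rng.normal(size=(n_memories, d_model))   # keys: one per "memory cell"
V = rng.normal(size=(n_memories, d_model))   # values: one per "memory cell"

def feed_forward(x):
    """FF(x) = f(x @ K.T) @ V: match the input against every key,
    then blend the values of the keys that fire."""
    scores = np.maximum(x @ K.T, 0)          # how strongly each key is triggered
    return scores @ V                        # weighted sum of the stored values

x = rng.normal(size=d_model)
print(feed_forward(x).shape)                 # (16,): same width as the input
```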

[2410.02675] FAN: Fourier Analysis Networks

In this paper the researchers created an alternative to the MLP layer using a mechanism built on concepts from Fourier analysis, the Fourier series. That made me think even more strongly that MLPs work by the principles of holography too.

[2410.13732] Reducing the Transformer Architecture to a Minimum

“The tests have shown that simplified transformer architectures (a) without MLP, (b) with collapsed matrices, and (c) symmetric similarity matrices exhibit similar performance as the original architecture, saving up to 90 % of parameters without hurting the classification performance.”

This paper essentially shows that you can remove a huge number of trainable parameters from the MLP layer with a minimal loss of performance.

To better understand the role of MLPs in Transformers, I again thought about the real, physical holographic film.

Fourier Transform Holography: A Lensless Imaging Technique, Its Principles and Applications

https://preview.redd.it/ss8sgm43oqie1.png?width=1787&format=png&auto=webp&s=dde49574e8657872fe275f088635e88c247382f0

Here, the image of a cat (b) gets recorded into the holographic film as an interference pattern (d). The same interference pattern can be obtained by performing the Fourier Transform operation on the image (b). The resulting interference pattern looks nothing like the original image of the cat. To obtain the original image of the cat back from this interference pattern, the holographic film has to be exposed to light waves again and projected onto a surface, creating the image (e). The same image (e) can be obtained from the interference pattern (d) by applying the Fourier Transform to it again, which in effect acts as the inverse Fourier Transform.
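That round trip is easy to reproduce numerically (with a random array as a stand-in for the cat photo): the 2D Fourier transform of an image looks nothing like the image, and the inverse transform brings it back exactly.

```python
import numpy as np

rng = np.random.default_rng(6)
image = rng.random((64, 64))            # stand-in for the cat photo (b)

hologram = np.fft.fft2(image)           # "interference pattern" (d): unrecognizable
recovered = np.fft.ifft2(hologram).real # inverse transform: the image (e) comes back

print(np.allclose(recovered, image))    # True
```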

And then it clicked. What if the MLP layer is simply performing the inverse Fourier Transform?

In the FNet paper, where the attention mechanism was replaced by the Fourier Transform, the set of high-dimensional vectors representing the words of the sentence gets scrambled by the Fourier Transform, analogous to the scrambled image (d) above. For us to continue working with this information, it needs to be converted back into its original format, and that can be done via the inverse Fourier Transform, which is essentially a regular Fourier Transform applied to the interference pattern in the holographic film.

When you replace the attention mechanism with the Fourier Transform, it seems that the trainable parameters that previously resided in the attention mechanism get delegated, moved into the MLP layer. And if you replace the MLP layer with the inverse Fourier Transform, then you should be able to compress all the trainable parameters of the Transformer into a single layer of trainable parameters located between those two Fourier Transforms. That would make this thin layer of trainable parameters closely analogous to the thin physical holographic film, which has a Fourier Transform performed before it and after it.
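A minimal sketch of that proposed block (my own illustration of the idea, not a published architecture): a fixed Fourier transform, one trainable elementwise "film" in the frequency domain, and a fixed inverse Fourier transform.

```python
import numpy as np

rng = np.random.default_rng(7)
seq_len, d_model = 8, 16

# The only trainable parameters: a complex "holographic film" in the Fourier domain
film = rng.normal(size=(seq_len, d_model)) + 1j * rng.normal(size=(seq_len, d_model))

def holographic_block(x):
    """Fixed FFT -> trainable elementwise filter -> fixed inverse FFT."""
    freq = np.fft.fft2(x)                 # plays the role of the attention sublayer
    filtered = freq * film                # the "film" modulates the interference pattern
    return np.fft.ifft2(filtered).real    # plays the role of the MLP sublayer

x = rng.normal(size=(seq_len, d_model))
print(holographic_block(x).shape)          # (8, 16)
```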

If this is truly how the Transformer architecture works, what is the role of the embedding layer at the start and the unembedding layer at the end?

The role of the word embedding layer

The embedding layer turns the sentence, a sequence of words, into a sequence of high-dimensional vectors. It also adds a positional encoding, which captures the position of each word relative to the others.

To make it easier to understand, think of each of those vectors as a vector in a 2-dimensional space, so the embedding layer creates a sequence of 2-dimensional vectors. Then think of the positional encoding as mirroring any vector that points to the left so that it points to the right, making all the 2-dimensional vectors point rightward. You can also think of the positional encoding as adding an extra dimension to each vector, a dimension along which none of them ever point backward: they all point in one general direction.

Then you can connect those 2-dimensional vectors, all of which point to the right, into a continuous line, where the end of one vector becomes the beginning of the next. As a result, you have basically constructed a complex varying line, a complex function. The red line below is roughly what I mean: think of it as a continuous chain of 2D vectors forming that red line.

https://preview.redd.it/kmplvukboqie1.png?width=540&format=png&auto=webp&s=613ce0e9eea421c48439e77129104b7f8cc6c0ba

From this perspective, the role of the word embedding layer becomes very clear. The word embedding layer basically turns the word sequence into a data format that is ideal for the Fourier Transform, since the Fourier Transform is designed to take complex functions and decompose them into their frequency components.

https://en.wikipedia.org/wiki/Fourier_transform
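To make that concrete, here is a toy construction of my own (not the actual embedding pipeline): treat each 2D embedding as a complex number, chain them into one signal, and let the Fourier transform read off its frequency components.

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy 2D word embeddings for a 6-word sentence
embeddings = rng.normal(size=(6, 2))

# Treat each 2D vector as a complex number -> the sentence becomes one complex signal
signal = embeddings[:, 0] + 1j * embeddings[:, 1]

# The Fourier transform decomposes that signal into its frequency components
spectrum = np.fft.fft(signal)
print(np.abs(spectrum))   # strength of each frequency in the "sentence line"
```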

If you are curious how the Fourier Transform applies to 2D images, as in the earlier example of Fourier-transforming the image of a cat, this link provides a very good illustrated explanation.

https://physics.stackexchange.com/a/684000

So let's make an analogy with the process of physically recording objects into holographic films. Without the embedding layer, words are an incomprehensible set of data to the Transformer. The embedding layer turns them into something like the shape of a 3D object, a rubber duck for example, which can then be recorded into the holographic film inside the Transformer. The unembedding layer is analogous to taking the holographic projection of an image and turning it back into a data format the Transformer itself can't interpret, but we can. In short, the embedding layer turns words into a format the Transformer architecture can work with, a format it can perform Fourier Transform operations on, and the unembedding layer takes the final output of the Transformer and turns it back into words, the format we understand.

Transformers perform sophisticated Hopfield-network-like pattern completion

This complex-function perspective on word embeddings is hugely valuable for another reason: it makes very clear what the Transformer architecture is essentially doing. It takes a complex varying line and tries to predict the continuation of that line. In the holographic analogy, it treats the initial complex varying line as an incomplete pattern and uses its memory to autocomplete it, to holographically project the completion of this line.

The First Nobel Prize for Insidious Software Degradation | by Terry Bollinger (Apabistia Press) | Medium

Here is a post that also concluded that LLMs are essentially huge holographic storage systems, with a lot of interesting insights.

How to improve the Transformer architecture

If Transformers truly work via holographic principles, then you can hugely optimize them.

First, it would let you squeeze all the trainable parameters into a single layer instead of spreading them all over the network, which would make training easier, faster, and much more efficient.

As I proposed above, you could create a new Transformer architecture that replaces the attention mechanism with a Fourier Transform, replaces the MLP layer with the inverse Fourier Transform, and leaves a single layer of compressed trainable parameters between them. This would remove around 90% of the trainable parameters, hugely increasing the speed of training.

If Transformers work like holographic film, then it should be possible to train them without backpropagation, because recording information into a physical holographic film does not involve any backward propagation of light signals. This would hugely speed up training, learning, inference, and compute efficiency.
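The kind of write operation I mean is the one classical associative memories use (a sketch of my own, not an existing Transformer training method): storing an association is a single outer-product update, with no backward pass at all.

```python
import numpy as np

rng = np.random.default_rng(9)
d = 128

W = np.zeros((d, d))                       # the "film": starts blank

# Store three key -> value associations, each with one outer-product write
pairs = [(rng.choice([-1.0, 1.0], d), rng.choice([-1.0, 1.0], d)) for _ in range(3)]
for key, value in pairs:
    W += np.outer(value, key) / d          # one-shot write, no backpropagation

# Recall: present a key, read out (a slightly noisy copy of) its value
key, value = pairs[0]
readout = np.sign(W @ key)
print((readout == value).mean())           # close to 1.0: the association is recalled
```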

This also has the potential to unify training and inference into a single stage, allowing LLMs to learn efficiently in real time.

If Transformers work via the principles of holography, then they are essentially simulating, computationally, how light works and how holographic films record information. Instead of simulating light, we could use light and holography directly to train Transformers, which could yield a roughly 1000x improvement in speed, since real light performs these operations far faster than simulated light.

Light-based chip: China’s Taichi could power artificial general intelligence

AI Chip Trims Energy Budget Back by 99+ Percent – IEEE Spectrum

Large-scale photonic chiplet Taichi empowers 160-TOPS/W artificial general intelligence | Science

“Researchers at Tsinghua University in China have developed a revolutionary new artificial intelligence (AI) chip that uses light instead of electricity to process data.”

“Dubbed “Taichi,” the chip is reportedly over 1,000 times more energy-efficient than Nvidia’s high-performance H100 GPU chip. Taichi is especially relevant given export restrictions to China due to US trade policies.”

“Taichi produced music clips in the style of Bach and art in the style of Van Gogh and Munch.”

China is already taking huge steps toward using light directly for computing operations in neural networks.

So this is how you can hugely improve the Transformer Architecture.
