I believe the key realization here is that we're applying a rotation matrix as part of the encoding. Why does this work? I've found that it's helpful to consider just the two dimensional case. Say we have two vectors. When designing a positional encoding scheme, what we're trying to do is somehow modify each vector so that it contains some information about its position.
The idea is that we can just rotate the vector by an amount proportional to its position - this has the property that the dot product of two vectors encoded this way only depends on the difference in positions, not the absolute positions themselves (since the dot product is based on angle between the vectors). And we care about the dot product, since that's what the attention operation ultimately applies to the vectors.
I believe the key realization here is that we're applying a rotation matrix as part of the encoding. Why does this work? I've found that it's helpful to consider just the two dimensional case. Say we have two vectors. When designing a positional encoding scheme, what we're trying to do is somehow modify each vector so that it contains some information about its position.
The idea is that we can just rotate the vector by an amount proportional to its position - this has the property that the dot product of two vectors encoded this way only depends on the difference in positions, not the absolute positions themselves (since the dot product is based on angle between the vectors). And we care about the dot product, since that's what the attention operation ultimately applies to the vectors.
I've written up a _somewhat_ first principles derivation of this here: https://mfaizan.github.io/2023/04/02/sines.html, if interested!