# Fusic Tech Blog

Fusion of Society, IT and Culture 2021/08/03

# Sinusoidal Representation Networks (SIREN)

My name is Teodor Toshkov and I am an intern at Fusic, Fukuoka. I am going to introduce a paper published on 17 June 2020 by Vincent Sitzmann et al., named "Implicit Neural Representations with Periodic Activation Functions".

*(Figure: comparison of different activation functions for continuous representation of a signal. Credit: Implicit Neural Representations with Periodic Activation Functions)*

Sinusoidal representation networks ($\textup{SIREN}$-s) are networks that leverage the periodic nature of the $sine$ function, leading to Fourier-like behavior. This makes $\textup{SIREN}$-s extremely efficient at representing continuous signals, automatically fitting their derivatives as well.

Training a network with a $sine$ activation function has notoriously been difficult due to its periodicity, but Vincent Sitzmann et al. propose a special initialization method which ensures consistency of the output distribution when propagating through layers. This alleviates the issue of creating deep models with $sine$ activations.

$\textup{SIREN}$-s are used for representing natural signals (e.g. images, sound, video, etc.).

Unlike traditional networks, $\textup{SIREN}$ does not take signals as input; it is used for creating implicit representations of a signal, i.e. a mapping between two spaces. The input could simply be $(x,y)$ coordinates, or it could contain a positional encoding in any form. Due to its use of the $sine$ function, the output is a continuous translation of the input to a natural signal.

Since the output is a continuous signal, $\textup{SIREN}$ provides a very powerful tool: we can set any type of constraint on the output or its derivatives. We can now fit not only a given signal, but also its gradient:

$\|\nabla_x\Phi(x)-\nabla_xf(x)\|$

An example of this is described in detail in section Image representation using gradient as loss.

## Structure

$\textup{SIREN}$ utilizes a $sine$ activation function of the form:

$\sin(\omega_0\cdot Wx+b)$

$\omega_0$ is a parameter magnifying the frequency of the $sine$ function; it is set to $30$ for the examples provided in the original paper and in this article. It should be adjusted depending on the application.

Generally, the input to a $\textup{SIREN}$ is the position of a single element of the output, e.g. the position of a single pixel. Based on the location of the element, a $\textup{SIREN}$ predicts the output at that location.

Below is an example of a $\textup{SIREN}$ used for predicting the color, i.e. RGB values, of a pixel, located at position $[100,120]$ on the image.

$\textup{SIREN}([100,120]) = [r, g, b]$

If we do this for each pixel of a given signal, we will get its internal representation.
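To make this mapping concrete, here is a minimal untrained forward pass (a numpy sketch with hypothetical layer widths; the weights are random placeholders, so the output is meaningless until trained):

```python
import numpy as np

def siren_layer(x, W, b, omega0=30.0):
    """One SIREN layer: sin(omega0 * (W x + b))."""
    return np.sin(omega0 * (x @ W + b))

rng = np.random.default_rng(0)

# Hypothetical 2-hidden-layer SIREN mapping (x, y) -> (r, g, b).
W1, b1 = rng.uniform(-0.5, 0.5, (2, 64)), np.zeros(64)
W2, b2 = rng.uniform(-0.05, 0.05, (64, 64)), np.zeros(64)
W3, b3 = rng.uniform(-0.05, 0.05, (64, 3)), np.zeros(3)

# Pixel (100, 120) of a 256x256 image, mapped to [-1, 1] coordinates.
coords = np.array([[100 / 128 - 1, 120 / 128 - 1]])

h = siren_layer(coords, W1, b1)
h = siren_layer(h, W2, b2)
rgb = h @ W3 + b3  # the final layer is linear, without a sine
print(rgb.shape)   # (1, 3)
```

Evaluating this network at every pixel coordinate of the grid reproduces the whole image.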

## Training

An important feature of the $sine$ function is that all derivatives of $sine$ are in practice phase-shifted $sine$ functions. This means that any-order derivative of a $\textup{SIREN}$ is a $\textup{SIREN}$ as well, thus inheriting the properties, regardless of the depth of the network. This leads to derivatives of any order being accessible and alleviates vanishing and exploding gradients.

Thanks to this property, when the network is trained, it fits not only the desired output signal but its derivatives as well. This allows us to set any constraint on any derivative we desire. Given $M$ constraints, they can be implemented in the form of losses:

$\mathcal{L}=\int_\Omega\sum_{m=1}^M{1_{\Omega_m}(x) \|\mathcal{C}_m(a(x),\Phi(x),\nabla\Phi(x),...)\| } dx$

where $\Omega_m$ is the domain on which constraint $m$ holds, with $1_{\Omega_m}(x)=1, \forall x\in \Omega_m$ and $1_{\Omega_m}(x)=0, \forall x\notin \Omega_m$. Each $\mathcal{C}_m$ is a constraint on any combination of the output signal and its derivatives.
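A discretized sketch of this masked loss in numpy (the indicator masks and constraint functions here are toy placeholders, not from the paper):

```python
import numpy as np

def constrained_loss(x, constraints):
    """Discretized version of the loss: each constraint's residual norm
    is averaged only where its indicator mask is 1.
    constraints: list of (indicator, C) pairs."""
    total = 0.0
    for indicator, C in constraints:
        total += (indicator(x) * np.linalg.norm(C(x), axis=-1)).mean()
    return total

x = np.linspace(-1, 1, 101)[:, None]
phi = lambda v: v ** 2  # stand-in for the network output Phi
f = lambda v: v ** 2    # target signal

# Constraint 1: match f everywhere; constraint 2: match f only near x = 0.
c1 = (lambda v: np.ones(len(v)), lambda v: phi(v) - f(v))
c2 = (lambda v: (np.abs(v[:, 0]) < 0.1).astype(float), lambda v: phi(v) - f(v))
print(constrained_loss(x, [c1, c2]))  # 0.0, since phi already satisfies both
```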

Such constraints could be used for encouraging smooth surface of a 3D mesh, or training representations of natural signals according to their gradients, instead of the actual signals. Detailed examples are explained in Applications.

Because little information is lost when propagating through derivatives, we can infer information about the output using only its derivatives. For example, using the function below as a loss, we could fit an image based purely on its second derivative, i.e. its Laplacian, and the $\textup{SIREN}$ network will closely infer the original signal. Note that the information about the constant $+C$ is lost, as it vanishes under differentiation.

$\mathcal{L}_{lapl.}=\int_\Omega\|\Delta\Phi(x)-\Delta f(x)\|dx$
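To see why the constant is lost, here is a toy check with a discrete Laplacian (a numpy sketch using periodic boundaries via `np.roll`, not the paper's implementation):

```python
import numpy as np

def laplacian(img):
    """Discrete Laplacian: standard 5-point stencil, periodic boundaries."""
    return (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
            np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)

def laplace_loss(pred, target):
    return np.abs(laplacian(pred) - laplacian(target)).mean()

target = np.outer(np.sin(np.linspace(0, 3, 32)), np.ones(32))
shifted = target + 5.0  # same signal plus a constant C
print(np.isclose(laplace_loss(shifted, target), 0.0))  # True: C is invisible to this loss
```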

In my opinion, the possibility to constrain the output signal based on any combination of derivatives is the most powerful property of $\textup{SIREN}$.

## Special Initialization

In contrast to activation functions like $\textup{ReLU}$, $\textup{GELU}$, $\textup{Sigmoid}$ or $\tanh$, $sine$ is a periodic function, i.e. its gradient alternates between intervals of positive and negative values. This introduces instability when propagating deep through the network, which the authors of the $\textup{SIREN}$ paper mitigate by careful initialization of the weights, as follows:

The first layer is initialized with:

$w_0\sim \textup{U}(-1/n, 1/n)$

where $n$ is the number of input channels.

The weights of all subsequent layers are initialized with:

$w_i\sim \textup{U}(-\frac{\sqrt{6/n}}{\omega_0}, \frac{\sqrt{6/n}}{\omega_0})$

The reason for choosing a uniform distribution is that it leads to normally distributed outputs before each sine nonlinearity and arcsine distributed outputs after applying $sine$. Choosing boundaries of $[-\frac{\sqrt{6/n}}{\omega_0}, \frac{\sqrt{6/n}}{\omega_0}]$ ensures that these properties are preserved, regardless of the depth of the network.
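A sketch of this initialization scheme in numpy (the function name is my own):

```python
import numpy as np

def siren_uniform(fan_in, fan_out, omega0=30.0, first_layer=False, seed=None):
    """Draw SIREN weights: U(-1/n, 1/n) for the first layer,
    U(-sqrt(6/n)/omega0, sqrt(6/n)/omega0) for all later layers,
    where n is the number of input channels (fan_in)."""
    rng = np.random.default_rng(seed)
    if first_layer:
        bound = 1.0 / fan_in
    else:
        bound = np.sqrt(6.0 / fan_in) / omega0
    return rng.uniform(-bound, bound, (fan_in, fan_out))

W1 = siren_uniform(2, 256, first_layer=True)  # input layer: n = 2
W2 = siren_uniform(256, 256)                  # hidden layer: n = 256
print(np.abs(W1).max() < 1 / 2, np.abs(W2).max() < np.sqrt(6 / 256) / 30)  # True True
```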

## Properties

A $\textup{SIREN}$, being built from $sine$ functions, exhibits Fourier-like behavior. This means that with just the frequency and phase parameters of each $sine$, we can describe a natural signal of any form. Thanks to $\textup{SIREN}$'s suitability to natural continuous signals and their derivatives, it converges significantly faster than networks with other activation functions.

The continuous representation learned by $\textup{SIREN}$ gives us the ability to sample values at arbitrary points; we are not constrained by the resolution of the training data or its boundaries. Let us consider an image of size $256\times256$ pixels and generate two $1024\times1024$ images from its representation. If we take the $x$ and $y$ coordinates of the original image to be in the range $[-1,1]$, the first image samples in the range $[-0.5,0.5]$ and the second in the range $[-2,2]$.

*(Figure: interpolation in $[-0.5,0.5]$ and extrapolation in $[-2,2]$. $\textup{SIREN}$ is good at interpolation, but not really good at extrapolation.)*

We can sample pixel values at any point, allowing for increasing the resolution of the original input. Thanks to the smooth nature of $sine$, interpolation between two pixels produces viable results with little noise.

Having a continuous representation, we can also sample pixel values at points which extend beyond the original boundaries of the signal. However, a $\textup{SIREN}$ by itself, without additional constraints, does not generalize well beyond those boundaries.
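Since the representation is just a function of coordinates, resampling only requires building a new coordinate grid and evaluating the trained network on it (a numpy sketch; `make_grid` is a name of my choosing):

```python
import numpy as np

def make_grid(lo, hi, size):
    """A size x size grid of (x, y) coordinates covering [lo, hi]^2."""
    xs = np.linspace(lo, hi, size)
    xx, yy = np.meshgrid(xs, xs)
    return np.stack([xx.ravel(), yy.ravel()], axis=1)

# The training image occupied [-1, 1]^2; evaluate the trained SIREN on:
interp = make_grid(-0.5, 0.5, 1024)  # a 4x zoom into the center (interpolation)
extrap = make_grid(-2.0, 2.0, 1024)  # coordinates beyond the training range (extrapolation)
print(interp.shape, extrap.shape)    # (1048576, 2) (1048576, 2)
```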

## Applications

$\textup{SIREN}$ can be used with any natural signal, e.g. images, videos, 3D objects, audio, etc. It can be used to convert any positional encoding into natural signals. I will introduce three examples from the original paper. For more example uses, including video and audio, please feel free to visit Implicit Neural Representations with Periodic Activation Functions (vsitzmann.github.io). Another use of $\textup{SIREN}$ will be described in detail in a following article, named ViTGAN.

### Image representation using gradient as loss

Instead of calculating the loss based on the actual target:

$\mathcal{L}=\int_\Omega\|\Phi(x)-f(x)\|dx$

It is possible to calculate it based on the gradient, i.e. first-order derivative:

$\mathcal{L}_{grad.}=\int_\Omega\|\nabla_x\Phi(x)-\nabla_xf(x)\|dx$

Or even based on the Laplacian, i.e. the second-order derivative:

$\mathcal{L}_{lapl.}=\int_\Omega\|\Delta\Phi(x)-\Delta f(x)\|dx$

Because a whole range of solutions exists when integrating, color intensity information is lost in $\nabla_xf(x)$. When using $\Delta f(x)$, we lose even more local information. Despite these shortcomings, $\textup{SIREN}$ manages to fit the output well.
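The loss of intensity information can be checked with a finite-difference sketch (numpy; `np.gradient` stands in for the analytic SIREN gradient):

```python
import numpy as np

def grad_loss(pred, target):
    """L1 loss between image gradients (finite differences)."""
    dpy, dpx = np.gradient(pred)
    dty, dtx = np.gradient(target)
    return np.abs(dpx - dtx).mean() + np.abs(dpy - dty).mean()

target = np.linspace(0, 1, 64 * 64).reshape(64, 64)
brighter = target + 0.3  # same image with a constant intensity offset
print(np.isclose(grad_loss(brighter, target), 0.0))  # True: intensity is lost
```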

Fitting on the gradient allows us to estimate combinations of several images by fitting on a weighted sum of the gradients of the images.

### 3D reconstruction

Another interesting application of $\textup{SIREN}$ is reconstructing 3D objects from point clouds. The loss function used to generate the representations penalizes off-surface points and constrains the norm of spatial gradients $|\nabla_x\Phi|$ to $1$:

$\mathcal{L}_{sdf} = \int_\Omega{\||\nabla_x\Phi(x)|-1\|dx}+\int_{\Omega_0}\|\Phi(x)\|+(1-\langle \nabla_x\Phi(x),\nabla f(x) \rangle)dx + \int_{\Omega\setminus\Omega_0}{\textup{exp}(-\alpha\cdot|\Phi(x)|)}dx$

### Generalizing images from a subset of pixels

Using $\textup{SIREN}$-s by themselves, we can represent only a single output. However, a $\textup{SIREN}$ can be used in combination with a CNN with $\textup{ReLU}$, which acts as a hypernetwork, mapping a latent code to the weights of a $\textup{SIREN}$. Below are the results from using an image encoder in combination with a $\textup{SIREN}$. The encoder converts input images, or a subset of pixels from an image, to a latent-space representation, which is then translated to the weights of a $\textup{SIREN}$.

## Summary

$\textup{SIREN}$ networks take positional encoding as input and output a natural continuous signal, including images, video, sound, 3D objects, etc.

They can be used by themselves to represent a single input by a combination of $sine$ waves, or a neural network could be trained to infer the necessary weights of a $\textup{SIREN}$ to reproduce the desired output. Another possibility is to use a combination of the positional encoding and an embedding, to train a homogeneous network to produce conditional outputs directly.

I think that $\textup{SIREN}$ should not be considered a possible replacement for $\textup{ReLU}$ or other activation functions. Instead, it should be used in combination with traditional activation functions, utilizing the best of both worlds.

The input position encoding and $\omega_0$ should be scaled according to the application, to produce optimal results.

## Teodor TOSHKOV

I am an intern at Fusic, a company in Fukuoka, Japan. From 2022, I will be joining the Machine Learning team. I develop mostly deep learning models, using PyTorch and TensorFlow.