

Author Teodor TOSHKOV

Sinusoidal Representation Networks (SIREN)

2021/08/03

Structure

A SIREN utilizes a sine activation function of the form:

$$\sin(\omega_0\cdot Wx+b)$$

$\omega_0$ is a parameter that magnifies the frequency of the sine function. It is set to $30$ for the examples provided in the original paper and in this article, and should be adjusted depending on the application.

Generally, the input to a SIREN is the position of a single element of the output, e.g. the position of a single pixel. Based on the location of that element, the SIREN predicts what the output at that location should be.

Below is an example of a SIREN used for predicting the color, i.e. the RGB values, of the pixel located at position $[100,120]$ in the image:

$$\textup{SIREN}([100,120]) = [r, g, b]$$

If we do this for each pixel of a given signal, we will get its internal representation.
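As a minimal sketch of this idea, the PyTorch model below maps 2D pixel coordinates to RGB values with sine activations. The layer sizes, the coordinate normalization to $[-1,1]$, and folding $\omega_0$ into the whole affine term are illustrative assumptions, not details prescribed by the paper.

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """A linear layer followed by a sine activation, sin(omega_0 * (Wx + b))."""
    def __init__(self, in_features, out_features, omega_0=30.0):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

class SIREN(nn.Module):
    """Maps coordinates (e.g. pixel positions) to signal values (e.g. RGB)."""
    def __init__(self, in_features=2, hidden_features=256, hidden_layers=3, out_features=3):
        super().__init__()
        layers = [SineLayer(in_features, hidden_features)]
        for _ in range(hidden_layers):
            layers.append(SineLayer(hidden_features, hidden_features))
        layers.append(nn.Linear(hidden_features, out_features))  # final layer is linear
        self.net = nn.Sequential(*layers)

    def forward(self, coords):
        return self.net(coords)

# Pixel [100, 120] of a 256x256 image, with coordinates normalized to [-1, 1]:
model = SIREN()
rgb = model(torch.tensor([[100 / 128 - 1, 120 / 128 - 1]]))  # -> tensor of shape (1, 3)
```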

Training

An important feature of the sine function is that all of its derivatives are, in practice, phase-shifted sine functions. This means that any-order derivative of a SIREN is itself a SIREN, inheriting the same properties regardless of the depth of the network. As a result, derivatives of any order are accessible, and vanishing and exploding gradients are alleviated.

Thanks to this property, when the network is trained, it fits not only the desired output signal but its derivatives as well. This allows us to set any constraint on any derivative we desire. Considering $M$ constraints, they can be implemented in the form of losses:

$$\mathcal{L}=\int_\Omega \sum_{m=1}^M 1_{\Omega_m}(x)\,\big|\mathcal{C}_m\big(a(x),\Phi(x),\nabla\Phi(x),\dots\big)\big|\,dx$$

Here $\Omega_m$ is the domain where the constraint holds, with $1_{\Omega_m}(x)=1\ \forall x\in \Omega_m$ and $1_{\Omega_m}(x)=0\ \forall x\notin \Omega_m$. Each $\mathcal{C}_m$ is a constraint on any combination of the output signal and its derivatives.

Such constraints could be used for encouraging a smooth surface of a 3D mesh, or for training representations of natural signals according to their gradients instead of the signals themselves. Detailed examples are explained in Applications.

Because little information is lost when propagating through derivatives, we can infer the output using only its derivatives. For example, using the function below as a loss, we could fit an image based purely on its second derivative, i.e. its Laplacian, and the SIREN network will closely recover the original signal. Note that the information about the integration constant $+C$ is lost due to the unbounded integral.

$$\mathcal{L}_{\textup{lapl.}}=\int_\Omega|\Delta\Phi(x)-\Delta f(x)|\,dx$$

In my opinion, the possibility to constrain the output signal based on any combination of derivatives is the most powerful property of SIREN\textup{SIREN}.
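As a hedged sketch of how such derivative constraints might be set up, the snippet below obtains first- and second-order derivatives of a SIREN output with respect to its input coordinates via `torch.autograd`, reusing the `SIREN` class sketched above with a single output channel. `target_laplacian` is a hypothetical tensor standing in for the ground-truth Laplacian at the sampled points.

```python
import torch

def gradient(y, x):
    """Derivative of y with respect to x, keeping the graph for higher orders."""
    return torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y), create_graph=True)[0]

model = SIREN(out_features=1)                                  # scalar-valued signal
coords = (torch.rand(1024, 2) * 2 - 1).requires_grad_(True)    # sample points in [-1, 1]^2
output = model(coords)                                         # shape (1024, 1)

grad = gradient(output, coords)                                # first-order derivative, (1024, 2)
laplacian = sum(gradient(grad[:, i:i + 1], coords)[:, i:i + 1] # second order: sum of d^2/dx_i^2
                for i in range(coords.shape[1]))

# Any constraint on the signal or its derivatives becomes a loss term, e.g. fitting
# a target Laplacian (`target_laplacian` is a hypothetical ground-truth tensor):
loss = (laplacian - target_laplacian).abs().mean()
```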

Special Initialization

In contrast to activation functions like ReLU, GELU, sigmoid, or tanh, sine is a periodic function, i.e. it cycles through intervals of positive and negative gradients. This introduces instability when propagating deep through the network, which the authors of the SIREN paper mitigate by careful initialization of the weights, as follows.

The first layer is initialized with:

$$w_0\sim \textup{U}(-1/n,\ 1/n)$$

where $n$ is the number of input channels.

The weights of all subsequent layers are initialized with:

$$w_i\sim \textup{U}\left(-\frac{\sqrt{6/n}}{\omega_0},\ \frac{\sqrt{6/n}}{\omega_0}\right)$$

The reason for choosing a uniform distribution is that it leads to normally distributed outputs before each sine nonlinearity and arcsine-distributed outputs after applying sine. Choosing the boundaries $\left[-\frac{\sqrt{6/n}}{\omega_0}, \frac{\sqrt{6/n}}{\omega_0}\right]$ ensures that these properties are preserved regardless of the depth of the network.
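A possible way to apply this scheme to the model sketched earlier is given below. It assumes the first `nn.Linear` module encountered is the first layer; applying the second rule to the final (purely linear) layer as well is an assumption here, although it matches common reference implementations.

```python
import math
import torch
import torch.nn as nn

def init_siren(model, omega_0=30.0):
    """Apply the SIREN initialization scheme to all linear layers of the model."""
    linear_layers = [m for m in model.modules() if isinstance(m, nn.Linear)]
    with torch.no_grad():
        for i, layer in enumerate(linear_layers):
            n = layer.in_features
            if i == 0:
                layer.weight.uniform_(-1 / n, 1 / n)        # first layer: U(-1/n, 1/n)
            else:
                bound = math.sqrt(6 / n) / omega_0           # U(-sqrt(6/n)/w0, sqrt(6/n)/w0)
                layer.weight.uniform_(-bound, bound)

init_siren(model)  # `model` is the SIREN sketched in the Structure section
```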

Properties

A SIREN, being a composition of sine functions, exhibits Fourier-like behavior. This means that with just the frequency and shift parameters of each sine, we can describe a natural signal of any form. Thanks to SIREN's suitability to natural continuous signals and their derivatives, it converges significantly faster than networks with other activation functions.

SIREN image representation comparison
SIREN image representation loss


The continuous representation learned by a SIREN gives us the ability to sample pixel values at arbitrary points; we are not constrained by the resolution of the training data or its boundaries. Let us consider the following image of size $256\times256$ pixels:

Chess pieces

Let us generate two $1024\times1024$ images. If we consider the $x$ and $y$ coordinates of the original image to be in the range $[-1,1]$, the first image samples in the range $[-0.5,0.5]$ and the other in the range $[-2,2]$.

Interpolation $[-0.5,0.5]$: Chess pieces high resolution
Extrapolation $[-2,2]$: Chess pieces extrapolation

Good at interpolation, not really good at extrapolation.

We can sample pixel values at any point, which allows us to increase the resolution of the original input. Thanks to the smooth nature of sine, interpolation between two pixels produces viable results with little noise.

Having a continuous representation, we can also sample pixels at points that extend beyond the original boundaries of the signal. However, a SIREN by itself, without additional constraints, does not generalize well beyond those boundaries.
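A rough sketch of such sampling is shown below, assuming the `SIREN` model sketched earlier was trained on coordinates normalized to $[-1,1]$; the grid construction and ranges are illustrative.

```python
import torch

def render(model, resolution=1024, coord_range=(-0.5, 0.5)):
    """Sample the learned continuous signal on a square grid of arbitrary extent."""
    lo, hi = coord_range
    axis = torch.linspace(lo, hi, resolution)
    coords = torch.cartesian_prod(axis, axis)          # (resolution**2, 2) grid of (y, x) points
    with torch.no_grad():
        rgb = model(coords)                             # (resolution**2, 3)
    return rgb.view(resolution, resolution, 3)

zoomed = render(model, 1024, (-0.5, 0.5))        # interpolation: 4x the original resolution
extrapolated = render(model, 1024, (-2.0, 2.0))  # extrapolation: beyond the training domain
```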

Applications

SIREN can be used with any natural signal, e.g. images, videos, 3D objects, audio, etc. It can be used to convert any positional encoding into a natural signal. I will introduce three examples from the original paper. For more example uses, including video and audio, please visit Implicit Neural Representations with Periodic Activation Functions (vsitzmann.github.io). Another use of SIREN will be described in detail in a following article, on ViTGAN.

Image representation using gradient as loss

Instead of calculating the loss based on the actual target:

$$\mathcal{L}=\int_\Omega|\Phi(x)-f(x)|\,dx$$

it is possible to calculate it based on the gradient, i.e. the first-order derivative:

$$\mathcal{L}_{\textup{grad.}}=\int_\Omega|\nabla_x\Phi(x)-\nabla_x f(x)|\,dx$$

Or even based on the Laplacian, i.e. the second-order derivative:

$$\mathcal{L}_{\textup{lapl.}}=\int_\Omega|\Delta\Phi(x)-\Delta f(x)|\,dx$$

Because there is a range of solutions when integrating, color intensity is lost when fitting on $\nabla_x f(x)$. When using $\Delta f(x)$, we lose even more local information. Despite these shortcomings, the SIREN manages to fit the output well.

Fitting on the gradient allows us to estimate combinations of several images, by fitting on a weighted sum of the gradients of those images.

Poisson Image Reconstruction
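A hedged sketch of such a gradient loss is given below, reusing the single-channel `SIREN` sketch from earlier. `target_grad` is a hypothetical tensor holding the (possibly composite) ground-truth gradient at the sampled coordinates.

```python
import torch

coords = (torch.rand(4096, 2) * 2 - 1).requires_grad_(True)
pred = model(coords)                                           # predicted intensity, (4096, 1)
pred_grad = torch.autograd.grad(pred, coords,
                                grad_outputs=torch.ones_like(pred),
                                create_graph=True)[0]          # (4096, 2)

# `target_grad` is a hypothetical tensor with the ground-truth image gradient at `coords`.
# For Poisson-style composites it could be a weighted sum of gradients of several images,
# e.g. target_grad = 0.5 * grad_img_a + 0.5 * grad_img_b.
loss_grad = (pred_grad - target_grad).abs().mean()
```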

3D reconstruction

Another interesting application of SIREN is reconstructing 3D objects from point clouds:

3D Reconstruction


The loss function used to generate the representations penalizes off-surface points and constrains the norm of the spatial gradient $|\nabla_x\Phi|$ to $1$:

$$\mathcal{L}_{\textup{sdf}} = \int_\Omega \big|\,|\nabla_x\Phi(x)|-1\,\big|\,dx + \int_{\Omega_0} |\Phi(x)| + \big(1-\langle \nabla_x\Phi(x),\nabla f(x) \rangle\big)\,dx + \int_{\Omega\setminus\Omega_0} \exp\!\big(-\alpha\cdot|\Phi(x)|\big)\,dx$$
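Below is a hedged sketch of these loss terms for a SIREN with 3D input and scalar output. The relative weights between the terms (which the paper tunes) are omitted, and `surface_pts`, `surface_normals`, and `offsurface_pts` are assumed tensors of on-surface points, their unit normals, and randomly sampled off-surface points.

```python
import torch

def sdf_loss(model, surface_pts, surface_normals, offsurface_pts, alpha=100.0):
    """Sketch of the SDF loss: eikonal, on-surface, normal-alignment, off-surface terms."""
    coords = torch.cat([surface_pts, offsurface_pts], dim=0).requires_grad_(True)
    phi = model(coords)                                        # predicted signed distance, (N, 1)
    grad = torch.autograd.grad(phi, coords,
                               grad_outputs=torch.ones_like(phi),
                               create_graph=True)[0]           # spatial gradient, (N, 3)

    n_surf = surface_pts.shape[0]
    eikonal = (grad.norm(dim=-1) - 1).abs().mean()                               # | |grad Phi| - 1 |
    on_surface = phi[:n_surf].abs().mean()                                       # Phi(x) ~ 0 on surface
    normal_align = (1 - (grad[:n_surf] * surface_normals).sum(dim=-1)).mean()    # <grad Phi, n> ~ 1
    off_surface = torch.exp(-alpha * phi[n_surf:].abs()).mean()                  # penalize Phi ~ 0 off surface

    return eikonal + on_surface + normal_align + off_surface
```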

Generalizing images from a subset of pixels

Image inpainting diagram

Using SIRENs by themselves, we can represent only a single output. However, a SIREN can be combined with a ReLU CNN acting as a hypernetwork that maps a latent code to the weights of the SIREN. Below are the results of using an image encoder in combination with a SIREN. The encoder converts input images, or a subset of pixels from an image, to a latent representation, which is then translated into the weights of a SIREN.

Image inpainting results
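A minimal, hypothetical sketch of the hypernetwork idea is given below: a small MLP maps a latent code (produced by the assumed CNN encoder, not shown) to the flattened weights of a one-hidden-layer SIREN. The sizes and the unbatched latent are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HyperSIREN(nn.Module):
    """A hypernetwork mapping a latent code to the weights of a tiny one-hidden-layer SIREN."""
    def __init__(self, latent_dim=256, in_features=2, hidden=64, out_features=3, omega_0=30.0):
        super().__init__()
        self.in_features, self.hidden, self.out_features = in_features, hidden, out_features
        self.omega_0 = omega_0
        n_params = (in_features + 1) * hidden + (hidden + 1) * out_features
        self.hyper = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, n_params))

    def forward(self, latent, coords):
        p = self.hyper(latent)                                # all SIREN parameters for one image
        i = 0
        w1 = p[i:i + self.in_features * self.hidden].view(self.hidden, self.in_features)
        i += w1.numel()
        b1 = p[i:i + self.hidden]
        i += self.hidden
        w2 = p[i:i + self.hidden * self.out_features].view(self.out_features, self.hidden)
        i += w2.numel()
        b2 = p[i:]
        h = torch.sin(self.omega_0 * (coords @ w1.T + b1))    # generated sine layer
        return h @ w2.T + b2                                  # generated linear output layer
```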

Summary

SIREN networks take positional encodings as input and output a natural continuous signal, such as an image, video, sound, 3D object, etc.

They can be used by themselves to represent a single input as a combination of sine waves, or a neural network can be trained to infer the SIREN weights needed to reproduce the desired output. Another possibility is to use a combination of the positional encoding and an embedding, to train a single network to produce conditional outputs directly.

I think it should not be considered a replacement for ReLU or other activation functions. Instead, it should be used in combination with traditional activation functions, utilizing the best of both worlds.

The input positional encoding and $\omega_0$ should be scaled according to the application to produce optimal results.

Teodor TOSHKOV

I am an intern at Fusic, a company in Fukuoka, Japan. From 2022, I will be joining the Machine Learning team. I mostly develop deep learning models, using PyTorch and TensorFlow.