
Sinusoidal Representation Networks (SIREN)
2021/08/03


My name is Teodor Toshkov and I am an intern at Fusic, Fukuoka. I am going to introduce a paper published on 17 June 2020 by Vincent Sitzmann et al., named Implicit Neural Representations with Periodic Activation Functions.

SIREN image representation comparison

Comparison of different activation functions for continuous representation of a signal.
Credit: Implicit Neural Representations with Periodic Activation Functions

Sinusoidal representation networks (SIRENs) are networks that leverage the periodic nature of the sine function, leading to Fourier-like behavior. This makes SIRENs extremely efficient at representing continuous signals, automatically fitting their derivatives as well.

Training a network with a sine activation function has notoriously been difficult, due to the function being periodic, but Vincent Sitzmann et al. propose a special initialization method which ensures consistency of the output distribution when propagating through layers. This alleviates the difficulty of creating deep models with sine activations.

Used for representing natural signals (e.g. images, sound, video, etc.)

Contrary to traditional networks, SIREN does not take signals as input. It is used for creating implicit representations of a signal.

SIREN image representation comparison

SIREN creates a mapping between two spaces. The input could simply be (x, y) coordinates, or it could contain a positional encoding in any form. Due to its use of the sine function, the output is a continuous translation of the input to a natural signal.

Because the output is a continuous signal, SIREN provides a very powerful tool: we can set any type of constraint on the output or on its derivatives. We can fit not only a given signal, but also its gradient:

\|\nabla_x\Phi(x)-\nabla_x f(x)\|

An example of this is described in detail in the section Image representation using gradient as loss.

Structure

A SIREN utilizes a sine activation function of the form:

\sin(\omega_0\cdot Wx+b)

\omega_0 is a parameter magnifying the frequency of the sine function, which is set to 30 for the examples provided in the original paper and in this article. It should be altered depending on the application.

Generally, the input to a SIREN is the position of a single element of the output, e.g. the position of a single pixel. Based on the location of the element, a SIREN predicts what the output at that location is.

Below is an example of a SIREN used for predicting the color, i.e. the RGB values, of the pixel located at position [100, 120] in the image.

\textup{SIREN}([100, 120]) = [r, g, b]

If we do this for each pixel of a given signal, we obtain its implicit representation.
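As a rough illustration, below is a minimal PyTorch sketch of this structure. The layer sizes, the final linear layer, \omega_0 = 30 and the normalization of pixel coordinates to [-1, 1] are assumptions chosen for the example, not values prescribed by this article:

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """One SIREN layer: sin(omega_0 * (W x + b))."""
    def __init__(self, in_features, out_features, omega_0=30.0):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

# A small SIREN mapping (x, y) coordinates to (r, g, b) values.
siren = nn.Sequential(
    SineLayer(2, 256),
    SineLayer(256, 256),
    SineLayer(256, 256),
    nn.Linear(256, 3),  # the last layer is linear, without a sine
)

# Query the color of the pixel at (100, 120) in a 256x256 image,
# with coordinates normalized to [-1, 1].
coord = torch.tensor([[100 / 255 * 2 - 1, 120 / 255 * 2 - 1]])
rgb = siren(coord)  # shape (1, 3)
```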

Training

An important feature of the sine function is that all of its derivatives are, in practice, phase-shifted sine functions. This means that any-order derivative of a SIREN is itself a SIREN, inheriting its properties regardless of the depth of the network. As a result, derivatives of any order are accessible, and vanishing and exploding gradients are alleviated.

Thanks to this property, when the network is trained, it fits not only the desired output signal but its derivatives as well. This allows us to set any constraint on any derivative we desire. Considering M constraints, they can be implemented in the form of losses:

\mathcal{L}=\int_\Omega\sum_{m=1}^M 1_{\Omega_m}(x)\,\|\mathcal{C}_m(a(x),\Phi(x),\nabla\Phi(x),\dots)\|\,dx

where \Omega_m is the domain on which constraint m holds, with 1_{\Omega_m}(x)=1 for x\in\Omega_m and 1_{\Omega_m}(x)=0 for x\notin\Omega_m. Each \mathcal{C}_m is a constraint on any combination of the output signal and its derivatives.

Such constraints could be used for encouraging a smooth surface of a 3D mesh, or for training representations of natural signals from their gradients instead of the signals themselves. Detailed examples are explained in Applications.

Since little information is lost when propagating through derivatives, we can infer information about the output using only its derivatives. For example, using the function below as a loss, we could fit an image based purely on its second derivative, i.e. its Laplacian, and the SIREN network will infer the original signal closely. Note that the information about the constant +C is lost, since constants vanish under differentiation.

\mathcal{L}_{lapl.}=\int_\Omega\|\Delta\Phi(x)-\Delta f(x)\|dx
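To make this concrete, the derivatives of the network output with respect to the input coordinates can be obtained with automatic differentiation. Below is a sketch of how a Laplacian-based loss like the one above could be computed in PyTorch, assuming a scalar (e.g. grayscale) output; the helper names are hypothetical and the paper's actual implementation may differ:

```python
import torch

def laplacian(model, coords):
    """Delta Phi(x): sum of second derivatives of the scalar output
    with respect to each input coordinate."""
    coords = coords.clone().requires_grad_(True)
    y = model(coords)                      # Phi(x), shape (N, 1)
    grad = torch.autograd.grad(y.sum(), coords, create_graph=True)[0]  # (N, D)
    lap = 0.0
    for i in range(coords.shape[1]):
        lap = lap + torch.autograd.grad(grad[:, i].sum(), coords,
                                        create_graph=True)[0][:, i:i + 1]
    return lap                             # (N, 1)

def laplacian_loss(model, coords, target_laplacian):
    """Fit the image using only its Laplacian as supervision."""
    return ((laplacian(model, coords) - target_laplacian) ** 2).mean()
```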

In my opinion, the ability to constrain the output signal based on any combination of derivatives is the most powerful property of SIREN.

Special Initialization

In contrast to activation functions like ReLU, GELU, sigmoid or tanh, sine is a periodic function, i.e. it cycles through intervals of positive and negative gradients. This introduces instability when propagating deep through the network, which the authors of the SIREN paper mitigate by careful initialization of the weights, as follows:

The first layer is initialized with:

w_0\sim \textup{U}(-1/n, 1/n)

where n is the number of input channels.

The weights of all subsequent layers are initialized with:

w_i\sim \textup{U}\left(-\frac{\sqrt{6/n}}{\omega_0}, \frac{\sqrt{6/n}}{\omega_0}\right)

The reason for choosing a uniform distribution is that it leads to normally distributed outputs before each sine nonlinearity and arcsine-distributed outputs after applying the sine. Choosing boundaries of \left[-\frac{\sqrt{6/n}}{\omega_0}, \frac{\sqrt{6/n}}{\omega_0}\right] ensures that these properties are preserved regardless of the depth of the network.
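A minimal sketch of this initialization scheme in PyTorch (the helper name and the default \omega_0 = 30 are my own choices for illustration):

```python
import numpy as np
import torch
import torch.nn as nn

def init_siren_linear(linear: nn.Linear, is_first: bool, omega_0: float = 30.0):
    """Initialize a linear layer's weights according to the SIREN scheme."""
    n = linear.in_features
    with torch.no_grad():
        if is_first:
            bound = 1.0 / n                     # first layer: U(-1/n, 1/n)
        else:
            bound = np.sqrt(6.0 / n) / omega_0  # later layers: U(-sqrt(6/n)/omega_0, ...)
        linear.weight.uniform_(-bound, bound)
```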

Properties

A SIREN, being a sum of sine functions, exhibits Fourier-like behavior. This means that with just the frequency and shift parameters of each sine, we can describe a natural signal of any form. Thanks to SIREN's suitability to natural continuous signals and their derivatives, it converges significantly more quickly than networks with other activation functions.

SIREN image representation comparison SIREN image representation loss


The continuous representation learned by SIREN provides us with the ability to sample pixel values at arbitrary points; we are not constrained by the resolution of the training data or its boundaries. Let us consider the following image of size 256×256 pixels:

Chess pieces

Let us generate two 1024×1024 images. If we consider the x and y coordinates of the original image to be in the range [-1, 1], the first image samples the range [-0.5, 0.5] and the other the range [-2, 2].

Interpolation [-0.5, 0.5] / Extrapolation [-2, 2]
Chess pieces high resolution Chess pieces extrapolation

Good at interpolation, not really good at extrapolation.

We can sample pixel values at any point, allowing us to increase the resolution of the original input. Thanks to the smooth nature of the sine function, interpolation between two pixels produces viable results with little noise.

Having a continuous representation, we can also sample pixels at points which extend beyond the original boundaries of the signal. However, SIREN by itself, without additional constraints, does not generalize well beyond those boundaries.
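The interpolated and extrapolated images above can be produced simply by evaluating the trained network on denser or wider coordinate grids. A sketch, reusing the hypothetical `siren` model from the earlier snippet:

```python
import torch

def sample_grid(model, x_range, y_range, resolution=1024):
    """Evaluate a trained SIREN on a dense grid of (x, y) coordinates."""
    xs = torch.linspace(x_range[0], x_range[1], resolution)
    ys = torch.linspace(y_range[0], y_range[1], resolution)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([grid_x, grid_y], dim=-1).reshape(-1, 2)
    with torch.no_grad():
        rgb = model(coords)
    return rgb.reshape(resolution, resolution, 3)

interpolated = sample_grid(siren, (-0.5, 0.5), (-0.5, 0.5))  # higher resolution
extrapolated = sample_grid(siren, (-2.0, 2.0), (-2.0, 2.0))  # beyond the image
```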

Applications

SIREN can be used with any natural signal, e.g. images, videos, 3D objects, audio, etc. It can be used to convert any positional encoding into a natural signal. I will introduce three examples from the original paper. For more example uses, including video and audio, please feel free to visit Implicit Neural Representations with Periodic Activation Functions (vsitzmann.github.io). Another use of SIREN will be described in detail in a following article, on ViTGAN.

Image representation using gradient as loss

Instead of calculating the loss based on the actual target:

\mathcal{L}=\int_\Omega\|\Phi(x)-f(x)\|dx

it is possible to calculate it based on the gradient, i.e. the first-order derivative:

\mathcal{L}_{grad.}=\int_\Omega\|\nabla_x\Phi(x)-\nabla_x f(x)\|dx

Or even based on the Laplacian, i.e. the second-order derivative:

\mathcal{L}_{lapl.}=\int_\Omega\|\Delta\Phi(x)-\Delta f(x)\|dx

Because integrating the gradient admits a whole family of solutions, color intensity is lost when fitting on \nabla_x f(x). When using \Delta f(x), we lose even more local information. Despite these shortcomings, SIREN manages to fit the output well.
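In practice, the target gradients \nabla_x f(x) can be precomputed from the ground-truth image, e.g. with finite differences or Sobel filters, and compared with the autograd gradient of the network output (as in the Laplacian sketch above). A hedged sketch of the target-gradient side, with my own helper names:

```python
import torch

def image_gradients(img):
    """Finite-difference approximation of the gradient of a ground-truth
    grayscale image of shape (H, W); returns per-pixel (df/dx, df/dy)."""
    dfdx = torch.zeros_like(img)
    dfdy = torch.zeros_like(img)
    dfdx[:, :-1] = img[:, 1:] - img[:, :-1]
    dfdy[:-1, :] = img[1:, :] - img[:-1, :]
    return torch.stack([dfdx, dfdy], dim=-1).reshape(-1, 2)

def grad_loss(model_grad, target_grad):
    """L_grad: match the autograd gradient of the SIREN output to the
    precomputed image gradients."""
    return ((model_grad - target_grad) ** 2).mean()
```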

Fitting on the gradient allows us to estimate combinations of several images, by fitting on a weighted sum of the gradients of the images (Poisson image reconstruction).

3D reconstruction

Another interesting application of SIREN is reconstructing 3D objects from point clouds:

3D Reconstruction


The loss function used to generate the representations penalizes off-surface points and constrains the norm of the spatial gradient |\nabla_x\Phi| to 1:

\mathcal{L}_{sdf} = \int_\Omega{\big\||\nabla_x\Phi(x)|-1\big\|\,dx}+\int_{\Omega_0}\|\Phi(x)\|+(1-\langle \nabla_x\Phi(x),\nabla f(x) \rangle)\,dx + \int_{\Omega\setminus\Omega_0}{\exp(-\alpha\cdot|\Phi(x)|)}\,dx
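A rough sketch of how the three terms could be assembled in PyTorch. The sampling of on-/off-surface points, the absence of per-term weights and the value of \alpha are simplifications for illustration; the paper's implementation differs in the details:

```python
import torch

def sdf_loss(model, on_pts, normals, off_pts, alpha=100.0):
    """Eikonal term |grad Phi| = 1 everywhere, Phi = 0 and grad Phi aligned
    with the normal on the surface, and a penalty keeping |Phi| away from 0
    off the surface."""
    coords = torch.cat([on_pts, off_pts], dim=0).requires_grad_(True)
    phi = model(coords)                                             # (N, 1)
    grad = torch.autograd.grad(phi.sum(), coords, create_graph=True)[0]

    eikonal = (grad.norm(dim=-1) - 1).abs().mean()

    n_on = on_pts.shape[0]
    on_phi, on_grad = phi[:n_on], grad[:n_on]
    surface = on_phi.abs().mean() + (1 - (on_grad * normals).sum(dim=-1)).mean()

    off_surface = torch.exp(-alpha * phi[n_on:].abs()).mean()

    return eikonal + surface + off_surface
```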

Generalizing images from a subset of pixels

Image inpainting diagram

Using SIRENs by themselves, we can represent only a single signal. However, a SIREN can be combined with a ReLU CNN, which acts as a hypernetwork mapping a latent code to the weights of a SIREN. Below are the results of using an image encoder in combination with a SIREN. The encoder converts an input image, or a subset of its pixels, into a latent representation, which is then translated into the weights of a SIREN.

Image inpainting results
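Conceptually, the hypernetwork part could look something like the following sketch, where an encoder's latent code is mapped by a small ReLU MLP to the weights of a toy single-hidden-layer SIREN. The sizes and layer shapes are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class HyperSIREN(nn.Module):
    """Maps a latent code z to the weights of a tiny 2 -> hidden -> 3 SIREN
    and evaluates it at the given coordinates."""
    def __init__(self, latent_dim=256, hidden=64, omega_0=30.0):
        super().__init__()
        self.hidden, self.omega_0 = hidden, omega_0
        n_params = (2 * hidden + hidden) + (hidden * 3 + 3)  # weights + biases
        self.hyper = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                   nn.Linear(512, n_params))

    def forward(self, z, coords):
        # z: latent code of shape (latent_dim,); coords: (N, 2)
        p = self.hyper(z)
        h = self.hidden
        w1, b1 = p[:2 * h].view(h, 2), p[2 * h:3 * h]
        w2, b2 = p[3 * h:6 * h].view(3, h), p[6 * h:]
        x = torch.sin(self.omega_0 * (coords @ w1.t() + b1))
        return x @ w2.t() + b2  # (N, 3) RGB predictions
```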

Summary

SIREN networks take a positional encoding as input and output a natural continuous signal, such as images, video, sound, 3D objects, etc.

They can be used by themselves to represent a single signal as a combination of sine waves, or a neural network can be trained to infer the SIREN weights needed to reproduce the desired output. Another possibility is to use a combination of the positional encoding and an embedding, to train a single, homogeneous network to produce conditional outputs directly.

I think that SIREN should not be considered a possible replacement for ReLU or other activation functions. Instead, it should be used in combination with traditional activation functions, utilizing the best of both worlds.

The input positional encoding and \omega_0 should be scaled according to the application to produce optimal results.

Teodor TOSHKOV

I am an intern at Fusic, a company in Fukuoka, Japan. From 2022, I will be joining the Machine Learning team. I mostly develop deep learning models, using PyTorch and TensorFlow.