Structure
The SIREN utilizes a sine activation function of the form:

$\phi_i(x_i) = \sin(\omega_0 \cdot W_i x_i + b_i)$
$\omega_0$ is a parameter magnifying the frequency of the function; it is set to $\omega_0 = 30$ for the examples provided in the original paper and in this article. It should be altered depending on the application.
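As a minimal PyTorch sketch (the class and argument names here are mine, not from the paper), such a layer is simply a linear layer followed by a sine, with $\omega_0$ scaling the pre-activation:

```python
import torch
from torch import nn

class SineLayer(nn.Module):
    """A linear layer followed by a sine activation: sin(omega_0 * (Wx + b))."""
    def __init__(self, in_features, out_features, omega_0=30.0):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        # omega_0 magnifies the frequency of the sine activation.
        return torch.sin(self.omega_0 * self.linear(x))
```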
Generally, the input to a SIREN is the position of a single element of the output, e.g. the position of a single pixel. Based on the location of that element, the SIREN predicts what the output at that location is.
Below is an example of a SIREN used for predicting the color, i.e. RGB values, of a pixel located at position $(x, y)$ on the image.
If we do this for each pixel of a given signal, we will get its internal representation.
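Building on the `SineLayer` sketched above, a full SIREN for this task might look as follows. The layer sizes, the coordinate range $[-1, 1]$, and the image size are illustrative assumptions of mine:

```python
import torch
from torch import nn

class Siren(nn.Module):
    """Maps a pixel position (x, y) to its RGB value; reuses SineLayer from above."""
    def __init__(self, in_features=2, hidden_features=256, hidden_layers=3, out_features=3):
        super().__init__()
        layers = [SineLayer(in_features, hidden_features)]
        for _ in range(hidden_layers):
            layers.append(SineLayer(hidden_features, hidden_features))
        # A plain linear output layer gives unbounded RGB predictions.
        layers.append(nn.Linear(hidden_features, out_features))
        self.net = nn.Sequential(*layers)

    def forward(self, coords):
        return self.net(coords)

# Query every pixel of an HxW image: build a grid of (x, y) positions in [-1, 1].
H, W = 64, 64
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (H*W, 2)
rgb = Siren()(coords).reshape(H, W, 3)                   # predicted image
```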
Training
An important feature of the sine function is that all of its derivatives are, in practice, phase-shifted sines. This means that any-order derivative of a SIREN is itself a SIREN, inheriting its properties regardless of the depth of the network. As a result, derivatives of any order are accessible, and vanishing and exploding gradients are alleviated.
Thanks to this property, when the network is trained, it fits not only the desired output signal but its derivatives as well. This allows us to set any constraint on any derivative we desire. Given $M$ constraints, they can be implemented in the form of losses:

$\mathcal{L} = \int_\Omega \sum_{m=1}^{M} \mathbf{1}_{\Omega_m}(x)\,\big\|\mathcal{C}_m\big(a(x), \Phi(x), \nabla\Phi(x), \dots\big)\big\|\,dx$

where $\Omega$ is the domain on which the constraints hold, $\Omega_m \subseteq \Omega$ is the region where constraint $\mathcal{C}_m$ applies, and $\mathbf{1}_{\Omega_m}$ is its indicator function. Each $\mathcal{C}_m$ is a constraint on any combination of the output signal $\Phi(x)$ and its derivatives.
Such constraints could be used to encourage a smooth surface on a 3D mesh, or to train representations of natural signals according to their gradients instead of the actual signals. Detailed examples are explained in Applications.
Thanks to there being little loss of information when propagating through derivatives, we can infer information about the output using only its derivatives. For example, using the function below as a loss, we could fit an image based purely on its second derivative, i.e. its Laplacian, and the network will infer the original signal closely. Note that information about the constant offset is lost, since a signal can only be recovered from its derivatives up to an integration constant.

$\mathcal{L}_{lapl} = \int_\Omega \big\|\Delta\Phi(x) - \Delta f(x)\big\|\,dx$
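A rough sketch of how such derivatives can be obtained in PyTorch with automatic differentiation; the helper names are mine, and the commented usage assumes a trained `siren` and a precomputed `target_laplacian`:

```python
import torch

def gradient(y, x):
    """dy/dx for a scalar field y evaluated at coordinates x, via autograd."""
    return torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y), create_graph=True)[0]

def laplacian(y, x):
    """Sum of second derivatives of y w.r.t. each coordinate of x."""
    grad = gradient(y, x)
    lap = 0.0
    for i in range(x.shape[-1]):
        lap = lap + gradient(grad[..., i:i+1], x)[..., i:i+1]
    return lap

# Hypothetical usage: fit a SIREN to the Laplacian of a target image.
# coords must require gradients so we can differentiate the output w.r.t. them.
# coords = coords.clone().requires_grad_(True)
# pred = siren(coords)                      # grayscale prediction, shape (N, 1)
# loss = ((laplacian(pred, coords) - target_laplacian) ** 2).mean()
```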
In my opinion, the possibility to constrain the output signal based on any combination of derivatives is the most powerful property of SIREN.
Special Initialization
In contrast to activation functions like ReLU, sigmoid, or tanh, the sine is a periodic function, i.e. it cycles through intervals of positive and negative gradients. This introduces instability when propagating deep through the network, which the authors of the paper mitigate by a careful initialization of the weights, as follows:
The first layer is initialized with:

$w \sim \mathcal{U}\!\left(-\tfrac{1}{n},\ \tfrac{1}{n}\right)$

where $n$ is the number of input channels.
The weights of all subsequent layers are initialized with:

$w \sim \mathcal{U}\!\left(-\sqrt{\tfrac{6}{n}},\ \sqrt{\tfrac{6}{n}}\right)$
The reason for choosing a uniform distribution is that it leads to normally distributed pre-activations before each sine nonlinearity and arcsine-distributed outputs after applying the sine. Choosing bounds of $\pm\sqrt{6/n}$ ensures that these properties are preserved, regardless of the depth of the network.
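A small PyTorch sketch of this scheme. Dividing the hidden-layer bound by $\omega_0$ follows the authors' released implementation, which compensates for the frequency scaling in the forward pass; treat that detail as an assumption on my part:

```python
import math
import torch
from torch import nn

def init_siren_linear(linear: nn.Linear, is_first: bool, omega_0: float = 30.0):
    """Initialize a SIREN linear layer with the uniform scheme described above."""
    n = linear.in_features
    with torch.no_grad():
        if is_first:
            # First layer: U(-1/n, 1/n).
            bound = 1.0 / n
        else:
            # Subsequent layers: U(-sqrt(6/n), sqrt(6/n)), divided by omega_0
            # to compensate for the frequency scaling applied in the forward pass.
            bound = math.sqrt(6.0 / n) / omega_0
        linear.weight.uniform_(-bound, bound)
```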
Properties
A SIREN, being built from sums of sine functions, exhibits Fourier-like behavior. This means that with just the frequency and shift parameters of each sine, we can describe a natural signal of virtually any form. Thanks to the sine's suitability to natural continuous signals and their derivatives, a SIREN converges significantly faster than networks with other activation functions.
The continuous representation learned by a SIREN gives us the ability to sample values at arbitrary points; we are not constrained by the resolution of the training data or by its boundaries. Let us consider the following image:
Let us generate two images. Taking the $x$ and $y$ coordinates of the original image to span a fixed range, the first image samples coordinates within that range at a higher density, and the other samples coordinates extending beyond it.
*Left: interpolation within the original coordinate range. Right: extrapolation beyond it.*
Good at interpolation, not really good at extrapolation.
We can sample pixel values at any point, which allows increasing the resolution of the original input. Thanks to the smooth nature of the sine, interpolation between two pixels produces viable results with little noise.
Having a continuous representation, we can also sample pixels at positions that extend beyond the original boundaries of the signal. However, a SIREN by itself, without additional constraints, does not generalize well beyond those boundaries.
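As an illustration (the coordinate ranges and scale factors here are arbitrary choices of mine), sampling a denser or wider grid is just a matter of feeding different coordinates to the trained network:

```python
import torch

def coordinate_grid(h, w, x_range=(-1.0, 1.0), y_range=(-1.0, 1.0)):
    """Build an (h*w, 2) grid of (x, y) coordinates spanning the given ranges."""
    ys = torch.linspace(y_range[0], y_range[1], h)
    xs = torch.linspace(x_range[0], x_range[1], w)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gx, gy], dim=-1).reshape(-1, 2)

# Interpolation: sample the trained range at 4x the original resolution.
# upscaled = siren(coordinate_grid(4 * H, 4 * W)).reshape(4 * H, 4 * W, 3)
# Extrapolation: sample a range twice as wide as the one seen during training.
# widened = siren(coordinate_grid(H, W, x_range=(-2, 2), y_range=(-2, 2)))
```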
Applications
SIREN can be used with any natural signal, e.g. images, videos, 3D objects, audio, etc. It can be used to convert a positional encoding into a natural signal. I will introduce three examples from the original paper. For more example uses, including video and audio, please feel free to visit Implicit Neural Representations with Periodic Activation Functions (vsitzmann.github.io). Another use of SIREN will be described in detail in a following article, on ViTGAN.
Image representation using gradient as loss
Instead of calculating the loss based on the actual target:

$\mathcal{L} = \int_\Omega \big\|\Phi(x) - f(x)\big\|\,dx$
It is possible to calculate it based on the gradient, i.e. the first-order derivative:

$\mathcal{L}_{grad} = \int_\Omega \big\|\nabla_x\Phi(x) - \nabla_x f(x)\big\|\,dx$
Or even based on the Laplacian, i.e. the second-order derivative:

$\mathcal{L}_{lapl} = \int_\Omega \big\|\Delta\Phi(x) - \Delta f(x)\big\|\,dx$
Because the signal can be recovered from its derivatives only up to a constant, color intensity is lost when fitting with $\mathcal{L}_{grad}$. When using $\mathcal{L}_{lapl}$, we lose even more local information. Despite these shortcomings, the SIREN manages to fit the output well.
Fitting on the gradient allows us to estimate combinations of several images by fitting on a weighted sum of the images' gradients, as sketched below.
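A rough sketch of the idea, using a simple finite-difference helper of my own for the target gradients (the SIREN's own gradients would be taken with autograd, as shown earlier):

```python
import torch

def finite_diff_grad(img):
    """Approximate per-pixel spatial gradients (dy, dx) of an (H, W) image."""
    dy = img[1:, :-1] - img[:-1, :-1]
    dx = img[:-1, 1:] - img[:-1, :-1]
    return torch.stack([dy, dx], dim=-1)      # (H-1, W-1, 2)

# Hypothetical composite target: a weighted sum of two images' gradients.
# alpha = 0.5
# target_grad = alpha * finite_diff_grad(img_a) + (1 - alpha) * finite_diff_grad(img_b)
# The SIREN is then trained so that the gradient of its output matches target_grad.
```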
3D reconstruction
Another interesting application of SIREN is reconstructing 3D objects from point clouds:
The loss function used to generate these representations penalizes off-surface points and constrains the norm of the spatial gradients to 1, as expected of a signed distance function:
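A rough PyTorch sketch of such a loss; the individual term weights and the off-surface penalty constant are illustrative choices of mine, not the paper's:

```python
import torch

def sdf_loss(siren, surface_pts, surface_normals, free_pts):
    """Fit a signed distance function: zero on the surface, |grad| = 1 everywhere,
    gradients aligned with surface normals, and off-surface points pushed away from zero."""
    surface_pts = surface_pts.clone().requires_grad_(True)
    free_pts = free_pts.clone().requires_grad_(True)

    def grad(y, x):
        return torch.autograd.grad(y, x, torch.ones_like(y), create_graph=True)[0]

    sdf_surf = siren(surface_pts)             # should be ~0 on the surface
    sdf_free = siren(free_pts)                # should be far from 0 off the surface
    g_surf = grad(sdf_surf, surface_pts)
    g_free = grad(sdf_free, free_pts)

    on_surface = sdf_surf.abs().mean()
    normal_align = (1 - torch.nn.functional.cosine_similarity(g_surf, surface_normals, dim=-1)).mean()
    eikonal = (torch.cat([g_surf, g_free]).norm(dim=-1) - 1).abs().mean()
    off_surface = torch.exp(-100.0 * sdf_free.abs()).mean()  # penalty constant chosen illustratively

    # Term weights are illustrative, not the paper's.
    return on_surface + normal_align + eikonal + off_surface
```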
Generalizing images from a subset of pixels
Using SIRENs by themselves, we can represent only a single output. However, a SIREN can be used in combination with a CNN with ReLU activations, which acts as a hypernetwork mapping a latent code to the weights of a SIREN. Below are the results of using an image encoder in combination with a SIREN. The encoder converts input images, or a subset of pixels from an image, into a latent representation, which is then translated into the weights of a SIREN.
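A heavily simplified sketch of the idea, with a tiny ReLU hypernetwork that predicts the weights of a single sine layer from a latent code (the actual model predicts the weights of the entire SIREN, and the latent would come from a CNN encoder):

```python
import torch
from torch import nn

class SineHyperLayer(nn.Module):
    """A sine layer whose weights are predicted from a latent code by a hypernetwork."""
    def __init__(self, latent_dim, in_features, out_features, omega_0=30.0):
        super().__init__()
        self.in_f, self.out_f, self.omega_0 = in_features, out_features, omega_0
        # ReLU hypernetwork: latent code -> flattened weight matrix and bias.
        self.hyper = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, out_features * in_features + out_features),
        )

    def forward(self, coords, latent):
        params = self.hyper(latent)
        w = params[: self.out_f * self.in_f].reshape(self.out_f, self.in_f)
        b = params[self.out_f * self.in_f:]
        return torch.sin(self.omega_0 * nn.functional.linear(coords, w, b))

# latent would be produced by an image encoder applied to the input image (or pixel subset).
```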
Summary
SIREN networks take a positional encoding as input and output a natural continuous signal: images, video, sound, 3D objects, etc.
They can be used by themselves to represent a single signal as a combination of sine waves, or a separate neural network can be trained to infer the weights of a SIREN that reproduces the desired output. Another possibility is to combine the positional encoding with an embedding, training a single homogeneous network to produce conditional outputs directly.
I think that SIREN should not be considered a replacement for ReLU or other traditional activation functions. Instead, it should be used in combination with them, utilizing the best of both worlds.
The input positional encoding and $\omega_0$ should be scaled according to the application to produce optimal results.
Teodor TOSHKOV
I am an intern at Fusic, a company in Fukuoka, Japan. From 2022, I will be joining the Machine Learning team. I develop mostly deep learning models, using PyTorch and Tensorflow.