Improving Interpolation Performance in Large 3D Arrays with Numba Optimization

Fast 1D Linear NaN Interpolation over Large 3D Array

Introduction

In this article, we will explore the problem of interpolating missing values in a large 3D array. The data is structured such that each step along the first axis represents a different point in time. Some values are missing due to acquisition failures or other reasons; because the data is stored as 16-bit integers, these gaps are encoded with a no-data sentinel (-32768) and treated as NaN (Not a Number) values during interpolation.
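
To make the layout concrete, here is a minimal sketch of such a dataset. The dimensions and missing-value fraction are hypothetical, chosen only for illustration:

import numpy as np

# Hypothetical stack: 50 time steps over a 500 x 500 spatial grid.
arr = np.random.randint(0, 10_000, size=(50, 500, 500), dtype=np.int16)

# Mark roughly 10% of the values as missing with the int16 no-data
# sentinel -32768 to simulate acquisition failures.
mask = np.random.random(arr.shape) < 0.1
arr[mask] = -32768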

We will discuss the current approach using scipy.interpolate.interp1d with pandas, and then explore an optimized solution using numba.

Background

The interp1d function from scipy.interpolate is a standard tool for performing linear interpolation over 1D arrays. pandas exposes comparable machinery through DataFrame.interpolate, which provides a convenient interface for interpolating missing values along the time axis.
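
On a single column, the pattern looks like this (a minimal sketch with arbitrary values):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
print(s.interpolate(method="linear", limit=3).tolist())
# [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]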

However, as we will see later, this approach can be computationally expensive, especially when dealing with large datasets like our 3D array.

Current Approach

The current implementation uses the following code snippet:

import numpy as np
import pandas as pd

def interpolate_nan(arr, method="linear", limit=3):
    """return array interpolated along time-axis to fill missing values"""
    result = np.zeros_like(arr, dtype=np.int16)

    for i in range(arr.shape[1]):
        # slice along y axis, interpolate with pandas wrapper to interp1d
        line_stack = pd.DataFrame(data=arr[:, i, :], dtype=np.float32)
        # replace the no-data sentinel with NaN so pandas will interpolate it
        line_stack.replace(to_replace=-32768, value=np.nan, inplace=True)
        line_stack.interpolate(method=method, axis=0, inplace=True, limit=limit)
        # convert any remaining NaNs back to the sentinel
        line_stack.replace(to_replace=np.nan, value=-32768, inplace=True)
        result[:, i, :] = line_stack.values.astype(np.int16)
    return result

As we can see, this implementation uses a for loop to iterate over each slice along the second axis of the 3D array. For each slice, it creates a pandas.DataFrame from the corresponding values, replaces the no-data sentinel -32768 with NaN, interpolates the missing values along the time axis, and finally converts any remaining NaNs back to the sentinel.

However, as mentioned in the question, this approach is computationally expensive, taking around 2 seconds per slice on the original dataset.
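
Assuming the synthetic arr from the introduction, a simple timing of this version might look like:

import time

start = time.perf_counter()
filled = interpolate_nan(arr, method="linear", limit=3)
print(f"pandas version: {time.perf_counter() - start:.1f} s, dtype={filled.dtype}")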

Optimized Solution

To improve performance, we can use numba to optimize the interpolation process. The optimized code snippet is:

import numpy as np
from numba import jit

@jit(nopython=True)
def interpolate_numba(arr, no_data=-32768):
    """return array interpolated along time-axis to fill missing values"""
    result = np.zeros_like(arr, dtype=np.int16)

    for x in range(arr.shape[2]):
        # slice along x axis
        for y in range(arr.shape[1]):
            # slice along y axis
            for z in range(arr.shape[0]):
                value = arr[z,y,x]
                if z == 0:  # don't interpolate first value
                    new_value = value
                elif z == arr.shape[0] - 1:  # don't interpolate last value
                    new_value = value

                elif value == no_data:  # interpolate

                    left = arr[z-1,y,x]
                    right = arr[z+1,y,x]
                    # look for valid neighbours
                    if left != no_data and right != no_data:  # left and right are valid
                        new_value = (left + right) / 2

                    elif left == no_data and z == 1:  # boundary condition left
                        new_value = value
                    elif right == no_data and z == arr.shape[0] - 2:  # boundary condition right
                        new_value = value

                    elif left == no_data and right != no_data:  # take second neighbour to the left
                        more_left = arr[z-2,y,x]
                        if more_left == no_data:
                            new_value = value
                        else:
                            new_value = (more_left + right) / 2

                    elif left != no_data and right == no_data:  # take second neighbour to the right
                        more_right = arr[z+2,y,x]
                        if more_right == no_data:
                            new_value = value
                        else:
                            new_value = (more_right + left) / 2

                    elif left == no_data and right == no_data:  # take second neighbour on both sides
                        more_left = arr[z-2,y,x]
                        more_right = arr[z+2,y,x]
                        if more_left != no_data and more_right != no_data:
                            new_value = (more_left + more_right) / 2
                        else:
                            new_value = value
                    else:
                        new_value = value
                else:
                    new_value = value
                result[z,y,x] = int(new_value)
    return result
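
Because numba compiles the function on its first call, any benchmark should use a warmed-up call. A sketch of a simple comparison on the synthetic arr from the introduction:

import time

# Warm-up call on a small slice triggers JIT compilation.
interpolate_numba(arr[:, :10, :10])

start = time.perf_counter()
filled_fast = interpolate_numba(arr)
print(f"numba version: {time.perf_counter() - start:.2f} s")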

As we can see, this implementation uses three nested for loops to visit every element of the 3D array exactly once, so the total work is linear in the number of elements. The speedup over the pandas approach comes not from a lower asymptotic complexity but from numba compiling the loops to machine code, which eliminates the per-slice DataFrame construction and Python-level overhead of the original implementation.
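
Since each (y, x) column is independent, the outer loop can additionally be parallelized across CPU cores with numba's prange. The following is a sketch, not part of the original solution, simplified to fill only single-value gaps:

from numba import njit, prange

@njit(parallel=True)
def interpolate_numba_parallel(arr, no_data=-32768):
    """simplified parallel variant: fills only single-value gaps"""
    result = arr.copy()
    for x in prange(arr.shape[2]):  # independent columns, safe to parallelize
        for y in range(arr.shape[1]):
            for z in range(1, arr.shape[0] - 1):
                if arr[z, y, x] == no_data:
                    left = arr[z - 1, y, x]
                    right = arr[z + 1, y, x]
                    if left != no_data and right != no_data:
                        result[z, y, x] = int((left + right) / 2)
    return result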

Conclusion

In conclusion, while the original approach using scipy.interpolate.interp1d with pandas provides a convenient interface for interpolating missing values, it can be computationally expensive. The optimized solution using numba significantly improves performance by compiling the element-wise loops to machine code and avoiding the pandas overhead, rather than by changing the asymptotic complexity.

We hope this article has provided valuable insights into optimizing interpolation in large 3D arrays. With numba, we can achieve significant performance improvements while matching linear interpolation exactly for single-value gaps; wider gaps are filled by a neighbour-averaging approximation.

Additional Considerations

While the optimized solution using numba provides a significant improvement in performance, there are additional considerations to keep in mind:

  • Boundary Conditions: In the optimized solution, we handle boundary conditions by not interpolating the first and last values. Depending on the specific requirements of your application, you may need a different rule; one possible variant is sketched after this list.
  • Neighbour Selection: The optimized solution averages the nearest valid neighbours. This reproduces linear interpolation exactly for single-value gaps, but only approximates it when a second neighbour has to be used (a two-wide gap would need weights of 1/3 and 2/3 to be exactly linear). You can modify this strategy to suit your specific needs.
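
As a concrete example of an alternative boundary rule (a sketch under the same sentinel convention, not part of the original code), the first and last time steps could be filled from their single valid neighbour instead of being left untouched:

import numpy as np
from numba import jit

@jit(nopython=True)
def fill_edges(arr, no_data=-32768):
    """nearest-neighbour fill for missing first/last time steps"""
    result = arr.copy()
    last = arr.shape[0] - 1
    for x in range(arr.shape[2]):
        for y in range(arr.shape[1]):
            if result[0, y, x] == no_data and result[1, y, x] != no_data:
                result[0, y, x] = result[1, y, x]
            if result[last, y, x] == no_data and result[last - 1, y, x] != no_data:
                result[last, y, x] = result[last - 1, y, x]
    return result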

By understanding the trade-offs involved in optimizing interpolation processes, you can make informed decisions about which approach best suits your application’s requirements.


Last modified on 2023-10-30