Probabilistic metaplasticity for continual learning with memristors in spiking networks

Zohora, Fatima Tuz; Karia, Vedant; Soures, Nicholas; Kudithipudi, Dhireesha

doi:10.1038/s41598-024-78290-w

Download PDF

Article
Open access
Published: 27 November 2024

Probabilistic metaplasticity for continual learning with memristors in spiking networks

Fatima Tuz Zohora¹,
Vedant Karia¹,
Nicholas Soures¹ &
…
Dhireesha Kudithipudi¹

Scientific Reports volume 14, Article number: 29496 (2024) Cite this article

1970 Accesses
Metrics details

Subjects

Abstract

Edge devices operating in dynamic environments critically need the ability to continually learn without catastrophic forgetting. The strict resource constraints in these devices pose a major challenge to achieve this, as continual learning entails memory and computational overhead. Crossbar architectures using memristor devices offer energy efficiency through compute-in-memory and hold promise to address this issue. However, memristors often exhibit low precision and high variability in conductance modulation, rendering them unsuitable for continual learning solutions that require precise modulation of weight magnitude for consolidation. Current approaches fall short to address this challenge directly and rely on auxiliary high-precision memory, leading to frequent memory access, high memory overhead, and energy dissipation. In this research, we propose probabilistic metaplasticity, which consolidates weights by modulating their update probability rather than magnitude. The proposed mechanism eliminates high-precision modification to weight magnitudes and, consequently, the need for auxiliary high-precision memory. We demonstrate the efficacy of the proposed mechanism by integrating probabilistic metaplasticity into a spiking network trained on an error threshold with low-precision memristor weights. Evaluations of continual learning benchmarks show that probabilistic metaplasticity achieves performance equivalent to state-of-the-art continual learning models with high-precision weights while consuming ~ 67% lower memory for additional parameters and up to ~ 60× lower energy during parameter updates compared to an auxiliary memory-based solution. The proposed model shows potential for energy-efficient continual learning with low-precision emerging devices.

Single neuromorphic memristor closely emulates multiple synaptic mechanisms for energy efficient neural networks

Article Open access 13 August 2024

Analog monolayer SWCNTs-based memristive 2D structure for energy-efficient deep learning in spiking neural networks

Article Open access 04 December 2023

Experimental Demonstration of Supervised Learning in Spiking Neural Networks with Phase-Change Memory Synapses

Article Open access 15 May 2020

Introduction

Autonomous edge devices constitute a major application domain of artificial intelligence–ranging from indoor robots to autonomous drones/vehicles¹. These devices often need on-device training for real-time adaptation to dynamic environments². However, training neural networks requires energy-intensive data movement and memory accesses, which can strain the energy budget of these resource-constrained platforms³. The challenge exacerbates in real-world scenarios where data is non-stationary and does not maintain independent and identical distribution. Conventional artificial intelligence fails to learn effectively in this scenario and catastrophically forgets previously learned information⁴. The ability to learn and adapt to non-stationary data streams requires continual learning mechanisms, which entail additional memory and computational overhead, further straining the strict resource budget on edge devices⁵. The constraints in autonomous edge applications present unique challenges, which lead to certain desirable criteria for continual learning solutions. First, the continual learning solution should be task-agnostic, i.e., it should be effective with no supervision or task-boundary awareness for autonomous applications. Second, edge devices have limited on-chip memory, which prohibits numerous iterative training over a large dataset. Hence, these devices should learn in an online scenario where data are encountered only once¹. Finally, the memory and computational overhead of the continual learning mechanisms should be minimized considering the strict energy constraints of edge devices⁶.

Recent research primarily focuses on alleviating catastrophic forgetting to achieve continual learning while overlooking the above-mentioned criteria. In our research, we strive to achieve continual learning in the online task-agnostic scenario while minimizing memory and computational overhead. We address this multifaceted challenge by taking inspiration from the biological brain, which shows the ability to learn continually throughout its lifetime with remarkable energy efficiency⁷. Neuromorphic or brain-inspired approaches can help in two ways. First, we can take inspiration from the neural mechanisms deemed to help memory consolidation in the brain to address catastrophic forgetting⁸. One such mechanism is metaplasticity or plasticity of plasticity⁹, whereby synapses modify their ability to change based on their history of activity. Recent research in continual learning abstracted this concept by detecting important weights and restricting their update in magnitude by modulating weight gradients. Metaplasticity-inspired weight consolidation mechanisms have shown promising performance in addressing continual learning in neural networks^{10,11,12,13,14,15}. Second, we can harness the computational and structural principles of the brain to optimize the energy efficiency of on-device continual learning. This can translate to adopting spiking models that emulate the computational primitives of the brain with sparse binary encoding. This feature replaces energy-intensive floating-point operations with low-power event-driven computations¹⁶. Structural primitives of the brain are emulated with architectures featuring co-located memory and processing with nanoscale emerging memristor devices¹⁷. In a crossbar arrangement, the conductance of these devices can emulate weights in a network and perform vector-matrix multiplication in a single timestep¹⁸. Such architectures reduce data movement, enable parallel computations and can lead to up to two orders of magnitude greater energy efficiency compared to conventional architectures¹⁹.

Although spiking models incorporated in memristor crossbar architectures are promising, they show non-ideal characteristics which can limit their suitability for continual learning mechanisms. The weight consolidation mechanism mentioned above is a prominent method of continual learning that relies on precise modulation of the weight magnitude. This is extremely challenging to achieve with memristors since these nanoscale devices show low precision (~ 3–6 bits^20,21) and inherent variability during conductance modulation. Recent research has circumvented this issue by incorporating auxiliary high-precision memory (usually implemented with Static Random Access Memory (SRAM)) alongside memristor crossbars²². Binary memristor devices have been used to accelerate forward pass in continual learning tasks while training with backpropagation is carried out in high-precision memory²³. Memristors with multi-level conductance have also been utilized for continual learning with metaplasticity, where auxiliary high-precision memory accumulates weight gradients, and memristor weights are updated when the accumulated gradient is equivalent to the device conductance resolution²⁴. These approaches have shown promising performance with low-precision memristor weights in continual learning benchmarks. However, these solutions rely on multi-epoch training and knowledge of task progression in order to either assign separate output neurons for different tasks or to modulate hyperparameters. These prerequisites are not suitable for autonomous applications. Moreover, high-precision memory with SRAMs lead to a large silicon footprint and idle standby leakage power (tens to hundreds of picoWatts per bit)²⁵. Finally, while crossbar architectures with auxiliary high-precision memory retain energy efficiency during inference, excessive memory access during training can potentially void the benefits of the compute-in-memory architecture.

In this research, we explore a novel weight consolidation mechanism suitable for low-precision memristor weights. Neuroscience literature has shown that binary metaplastic synapses characterized by probabilistic transitions help memory retention²⁶. Inspired by this, we propose the probabilistic metaplasticity model, where a weight’s plasticity translates to its probability of update rather than its magnitude of update. The probability of a weight’s update evolves during training such that important weights are assigned lower update probability to prevent catastrophic forgetting. Since this model does not need precise modulation in weight magnitude, the need for auxiliary high-precision memory is eliminated. Previous research explored probabilistic metaplasticity in the context of binary synapses in models of memory²⁶ and binarized networks that only support training a single-layer network²⁷. We generalize this concept to memristor weights with multi-level conductance incorporated in a multilayer spiking network trained based on an error threshold. We evaluate the proposed solution on image detection continual learning benchmarks considering single and multi-memristor weights. Evaluations show that low-precision memristor weights with probabilistic metaplasticity can achieve performance equivalent to state-of-the-art networks with high-precision weights in the challenging online task-agnostic continual learning scenario. Compared to an equivalent solution with high-precision auxiliary memory, probabilistic metaplasticity reduces memory overhead for additional parameters by $\sim 67\%$ while requiring up to $\sim 60 \times$ lower energy during the parameter update phase. The proposed model is promising for memory- and energy-efficient continual learning for autonomous edge applications.

Probabilistic metaplasticity with error threshold-based training

Metaplasticity, or plasticity of plasticity, refers to the modification of a synapse or weight’s ability/tendency to change based on previous activity⁹. It is deemed a key mechanism for memory consolidation in the brain²⁸. Fusi et al. proposed a cascade metaplasticity model, where binary synapses contain an additional discrete metaplastic state variable²⁶. The metaplastic state of a synapse changes probabilistically in response to input stimuli. Synapses with higher metaplastic states are assigned a lower probability of change to signify greater stability.

Similar to the discrete metaplastic states in the cascade model, memristors can be programmed to a limited number of distinguishable conductance levels (Fig. 1b). In our work, we consolidate memristor weights in a similar probabilistic manner. In a typical network, weights realized with memristors are updated deterministically by increasing or decreasing their conductances based on the error. In the proposed probabilistic metaplasticity model, the weights are updated probabilistically, where the important weights exhibit a lower update probability. The importance of a weight is a function of its magnitude and the activity levels of adjacent neurons¹¹. High magnitude and connections to highly active pre- and post-synaptic neurons lead to a lower probability of updating or higher weight consolidation. This is captured in the metaplasticity function, which determines the update probability $p_{\text {update}}$ of a weight $w_{ij}$ connected between $i^{\text {th}}$ pre-synaptic and $j^{\text {th}}$ post-synaptic neurons -

$$\begin{aligned} p_{\text {update}} = e^{-|m_{ij}w_{{ij}}|}. \end{aligned}$$

(1)

Here, $m_{ij}$ is a metaplasticity coefficient assigned to each weight. As the network learns, the metaplasticity coefficient evolves based on the activity trace of the adjacent neurons as -

$$\begin{aligned} m_{ij}(t+1) = {\left\{ \begin{array}{ll} m_{ij}(t) + {\Delta }m, & \text {~if} \, X^{\text {tr}}_{i} \ge m^{\text {pre}}_{\text {th}} \& X^{\text {tr}}_{j} \ge m^{\text {post}}_{\text {th}} \\ m_{ij}(t) & \text {~otherwise} \end{array}\right. }. \end{aligned}$$

(2)

We see that high activity in the adjacent neurons increases the metaplasticity coefficient of a weight. Figure 1c shows the update probability for different values of weights and metaplasticity coefficients. We see that weights with higher magnitude and metaplasticity coefficient exhibit lower update probability. Throughout training, a weight’s magnitude and metaplasticity coefficient adapt in response to the activity of the network. This changes the update probability/plasticity of a weight, leading to a metaplastic effect to consolidate weights.

We incorporate probabilistic metaplasticity into a multi-layer spiking network trained based on an error threshold (shown in Fig. 1a). The network is trained with event-driven random backpropagation (eRBP) where the network error is projected to neurons (details in Methods)²⁹. Each neuron includes a dendritic compartment that accumulates the error. In eRBP, a weight is considered eligible for update if the pre-synaptic neuron has an active spike and the post-synaptic current is within a range. When these criteria are met, the weights are updated in proportion to the accumulated error. However, low-precision weights cannot be updated with arbitrary resolution. We resolve this issue by updating the weights when the accumulated error reaches a threshold. The threshold corresponds to the magnitude of error that warrants a change in the weights equivalent to their resolution. This update criteria is similar to the gradient accumulation approach where high-precision memory accumulates the gradients, and weights are updated when their gradients reach a certain threshold²². The distinctive factor in our approach is that the weight update decision depends on the local error estimated at the neurons instead of gradients accumulated for every weight. This avoids gradient computation and storage and leads to lower memory overhead and accesses. Spiking networks trained with error-triggered learning can effectively learn a single task³⁰. We combine probabilistic metaplasticity with error threshold-based training to continually learn sequential tasks. Eligible weights are probabilistically updated, where the weight update involves programming the memristor device to its adjacent higher or lower conductance level, depending on the sign of the error. Important weights exhibit low update probability and hence retain their state and help prevent catastrophic forgetting. The detailed learning algorithm is described in Algorithm 1.

Results

Continual learning with probabilistic metaplasticity

We evaluate the proposed probabilistic metaplasticity model considering weights realized with a Hafnium Oxide-based memristor device in the IBM 65nm 10LPe CMOS/Memristor process²⁰. The device can be programmed to ten distinguishable low resistance levels which are used to map the weights (shown in Fig. 1b). The average resolution of conductance is $\sim$ 3 bits considering the average conductance at different levels. To realize both positive and negative weights, memristor crossbars are assigned a bias column whose contributions to the output are subtracted from that of the weights with inverting and summing amplifiers³¹. If $g_\text {p}$ is the conductance of the memristor weight, $g_\text {b}$ is the conductance of the bias memristor, then the synaptic weight can be expressed with Eq. 3 where $g_\text {f}$ is the scaling factor implemented with the conductance of the feedback resistor in the inverting amplifier.

$$\begin{aligned} w = \frac{1}{g_\text {f}}(g_\text {p}-g_\text {b}) \end{aligned}$$

(3)

We evaluate a spiking network with memristor weights on image detection continual learning benchmarks. The network is trained sequentially on tasks without any knowledge of the identity of the task or its boundaries during training or inference. This scenario is known as task-agnostic domain incremental learning (Domain-IL), which presents two major challenges^32,33. First, if the task boundary is known, the network can expect a change in data distribution and take measures to reduce catastrophic forgetting. This is not possible in task-agnostic domain-IL setting. Second, all tasks share the same output layer, which leads to increased interference and catastrophic forgetting. We also consider an online scenario where the network sees each sample only once (the network is trained on each task for one epoch).

We first evaluate probabilistic metaplasticity on the split-MNIST benchmark, which consists of five sequential image classification tasks, each composed of two classes from the MNIST dataset³⁴. We conduct network-level simulations in Python considering a spiking network with 200 hidden neurons and 2 output neurons. The simulation emulates in situ learning with memristor weights and captures the quantized nonuniform distribution and the associated variance of memristor conductance during weight updates. Figure 2 shows the evolution of the network performance in each of the benchmark tasks as the network trains on them sequentially. When trained with error threshold-based eRBP with no probabilistic metaplasticity, we see that the network learns each task it encounters with high accuracy (Fig. 2a). However, after training on all tasks, it retains high accuracy only on the recently encountered tasks, and the earlier task accuracies drop to random guess level.

Probabilistic metaplasticity consolidates important weights to alleviate such catastrophic forgetting. The degree of consolidation depends on the activity threshold of pre-synaptic and post-synaptic neurons ($m^{\text {pre}}_{\text {th}}$ and $m^{\text {th}}_{\text {post}}$ in Eq. 2). A low activity threshold leads to high weight consolidation, which helps retain high accuracy in earlier tasks, as shown in Fig. 2b. However, the network cannot adapt when later tasks are encountered due to the lower plasticity of the weights, leading to poor performance. A high activity threshold leads to the opposite effect, where the network can learn later tasks better with a compromise in earlier task accuracies (Fig. 2c). Optimizing the activity threshold balances the stability-plasticity trade-off to retain high accuracy in earlier tasks with minimal degradation in the ability to learn later tasks (Fig. 2d).

We further evaluate probabilistic metaplasticity on the split-Fashion MNIST benchmark, which also consists of five sequential image classification tasks, each with two classes³⁵. We compare the performance of probabilistic metaplasticity with state-of-the-art models which also consolidate important weights or regularize weight updates to achieve continual learning. These approaches differ in the criteria that determine weight importance. “Synaptic Intelligence” (SI)¹³ measures weight importance based on the contribution of weights to improve network performance, whereas “Memory Aware Synapses” (MAS)³⁶ measures it through the sensitivity of the network’s output function to changes in weights. “Stochastic Synapses” (SS)³⁷ adopts weights which transmit information probabilistically and weights with high transmission probability are consolidated. “Bayesian Gradient Descent” (BGD)³⁸ is a Bayesian continual learning mechanism that assesses the importance of parameters based on the uncertainty associated with their distributions. “Learning without forgetting” (LwF)³⁹ preserves the response of the network to previous tasks through knowledge distillation. “TACOS”¹¹ introduces activity-dependent metaplasticity where weights are consolidated based on their magnitude and the activity level of adjacent neurons. Except for TACOS, which was evaluated in a spiking network, the rest of the models were evaluated considering multi-layer perceptrons. Weight consolidation models typically require additional memory to store information for consolidating weights, which can be implemented with on-chip memory in hardware. In this work, we model this memory as on-chip SRAM during the energy analysis in the following subsection. Table 1 shows the memory required by each continual learning model for parameters, excluding weights, considering the same network architecture used to evaluate probabilistic metaplasticity (see Supplementary note S1 for details). We exclude the network weights in the memory overhead computation since the aforementioned continual learning models were evaluated considering full-precision weights, whereas the proposed mechanism realizes weights with memristor devices. We also evaluate a spiking neural network with full-precision weights trained with eRBP as a baseline.

Table 1 lists the mean accuracy across tasks for the different models after sequentially training on multiple tasks. To observe the effect of weight resolutions on the performance of probabilistic metaplasticity, we consider both single- and multi-memristor weights⁴⁰. Multi-memristor weights are configured by connecting $n_{\text {mem}}$ number of memristors in parallel. Figure 1a shows a 1T1R crossbar with multi-memristor weights consisting of three 1T1R devices. During training, only one memristor arbitrated by a global counter is updated to reduce training overhead. The parallel combination of devices increases the dynamic range of the weight conductance by a factor of $n_{\text {mem}}$ and increases the weight resolution. We note an improvement in the mean accuracy across tasks with increased weight resolution, with the best performance when $n_{\text {mem}}$ = 7, leading to a mean weight resolution $\sim 6$-bit. Results show that with probabilistic metaplasticity, a spiking network with noisy low-precision multi-memristor weights achieves performance equivalent to state-of-the-art networks with full-precision weights in both benchmark tasks. For reference, we consider a spiking network that consolidates weights with activity-dependent metaplasticity (details in Methods)¹¹. Since this weight consolidation mechanism requires precise weight updates, we accumulate the weight gradients in auxiliary memory, which is reset after every sample. This model is evaluated with multi-memristor weights ($n_{\text {mem}}$ = 7) considering the same network architecture as probabilistic metaplasticity. We see that activity-dependent metaplasticity leads to performance similar to probabilistic metaplasticity in benchmark tasks. However, the latter requires $\sim 67\%$ lower memory overhead. We extend the performance analysis by evaluating probabilistic metaplasticity on the split-CIFAR-10 benchmark⁴¹, which consists of five sequential image detection tasks. It is a more complex dataset compared to the previous benchmarks, containing larger images, more complex image types and colored images instead of grayscale. To address this complexity, we consider a ResNet-18 backbone pre-trained on ImageNet to process the images and use the extracted features as inputs to the spiking networks^32,42. Probabilistic metaplasticity was evaluated considering a spiking network with 200 hidden neurons and multi-memristor weights ($n_{mem}$ =7). We also evaluate a baseline spiking network with multi-memristor weights and a spiking network trained with TACOS. Table 2 lists the accuracies of individual tasks and mean accuracy across tasks. We see that although probabilistic metaplasticity adopts low-precision multi-memristor weights, it demonstrates performance equivalent to TACOS, which adopts full-precision weights.

We further analyze the robustness of the proposed probabilistic metaplasticity model by evaluating its performance on a smaller set of training data. An autonomous agent learning in an online setting may need to learn from fewer samples. To explore this scenario, we trained and optimized the model on a subset of the split-MNIST dataset and tested it on the complete test set. We observed some degradation in the mean accuracy across tasks with reduced dataset size (shown in Fig. 3). The degradation may stem from the probabilistic consolidation mechanism’s reliance on the law of large numbers. However, we observed a similar degradation when training the network with activity-dependent metaplasticity. This suggests that the decline in performance could also be attributed to the general performance degradation in machine learning models when the number of training samples is insufficient.

Finally, we perform two controlled experiments to show the contribution of probabilistic metaplasticity for continual learning. First, we study the contribution of consolidating weights with high magnitude and high adjacent neural activity. For this study, we explore random consolidation in which the update probabilities of the weights are calculated with Eq. 1, but the results are randomly shuffled among the weights. Second, we study the contribution of probabilistic metaplasticity in contrast to probabilistic plasticity. The controlled experiment evaluates this by updating all weights with the same update probability. We also reduce the weight update probability or plasticity by a factor with each incoming task to help retain previous knowledge. Table 3 shows the results of the controlled experiments on the split-MNIST benchmark. We carried out the experiments considering multi-memristor weights ($n_{\text {mem}}$ = 7) in the same spiking network as in the previous experiments. The baseline network is trained based on an error threshold with no additional mechanism. The table lists the accuracies of each task in the benchmark and the mean accuracy across tasks after sequential training. We see that random consolidation fails to learn continually and suffers from catastrophic forgetting. The performance of decaying probabilistic plasticity varies with the update probability reduction factor. A high reduction factor leads to consolidation in weights, which retains high accuracies in earlier tasks. However, the network fails to learn the last task. A low reduction factor shows the opposite characteristic, where the network catastrophically forgets the earlier tasks. Controlled experiments on the split-Fashion MNIST benchmark show the same response (Supplementary note S2 and Supplementary Table S2). Contrary to the experiments, we see that probabilistic metaplasticity retains high accuracy in earlier tasks with much lower degradation in the accuracy of the last task compared to decaying probabilistic plasticity. Since the proposed mechanism consolidates weights selectively based on their magnitude and adjacent neural activity, it retains high plasticity in weights assigned low importance, leading to a better balance of stability and plasticity.

In summary, probabilistic metaplasticity demonstrates continual learning with low-precision memristor weights in the challenging online task-agnostic continual learning scenario, which has not been addressed before in the literature. It achieves state-of-the-art performance with low memory overhead, which is highly suitable for edge applications.

Energy analysis

We evaluate energy consumption for continual learning with probabilistic metaplasticity by analyzing the computations and memory accesses required to train on the benchmark tasks. We then determine the energy needed to train on a sample by analyzing the average number of operations and the individual operation energy requirements (details in Supplementary note S3). We also perform the analysis for activity-dependent metaplasticity with gradient accumulation as a reference. Figure 4 shows the architecture and computational flow for the analysis. We focus on the energy consumption during the parameter update phase (the operations in the gray box in Fig. 4) since the operations incurred during the forward pass and error computations are the same in both approaches. We analyze the activities of a spiking network with 200 hidden neurons and multi-memristor weights ($n_{\text {mem}}$ = 7). The mixed-signal and digital components for different network operations were designed in the IBM 65nm 10LPe process. We designed a $16\times 16$ 1T1R crossbar array with peripherals to estimate the energy to read and write memristor weights. The metaplasticity coefficients for both mechanisms and the gradients for the activity-dependent metaplasticity mechanism were modeled as on-chip SRAM with 16-bit and 32-bit bus widths, respectively (more details in Supplementary note S3).

Figure 5a shows the energy consumption per sample during the parameter update phase and its breakdown for the split-MNIST task. We highlight the energy consumption due to memory access with memristor and SRAM read/write and subsume the energy required for the rest of the operations (post-synaptic current comparison, random number generation, error computation, etc.) in the “Other operations” category. We observe that probabilistic metaplasticity leads to a significant reduction in energy. The reason behind this is mainly twofold. First, the operations required for updating weights with probabilistic metaplasticity require lower energy compared to the gradient accumulation-based approach. Each time a weight is eligible for update, activity-dependent metaplasticity with gradient accumulation requires computing the gradient and the metaplasticity function. The latter requires reading the metaplasticity coefficient from SRAM and reading memristor weights. The computed gradient is multiplied by the output of the metaplasticity function and accumulated, which requires an SRAM read and write. On the other hand, probabilistic metaplasticity avoids memory access due to gradient accumulation. It also updates weights based on the comparison of a random number with the update probability (computed with the metaplasticity function), which requires much less energy compared to multiplication. Second, in the gradient accumulation-based approach, the weights are updated based on the criteria set in eRBP, while in probabilistic metaplasticity, the eligibility for weight update includes an additional condition of the error threshold. This leads to much fewer weights eligible for update in the latter (up to three orders of magnitude reduction on average), which consequently leads to less frequent execution of the operations needed for weight update. In summary, gradient accumulation leads to a higher fraction of weights eligible for update, and each eligible weight requires a higher number of memory accesses and computations for gradient computation, leading to higher energy consumption. We observe up to $\sim 60 \times$ reduction in total energy during parameter updates with probabilistic metaplasticity compared to activity-dependent metaplasticity with gradient accumulation.

Memory optimization with parameter sharing

Energy analysis shows that SRAM read/write dominates energy consumption in both continual learning approaches. In the probabilistic metaplasticity model, the SRAM read/write operations account for reading and updating the metaplasticity coefficients assigned to the weights. Since a metaplasticity coefficient is assigned to each weight, the area overhead due to this memory can be substantial. Here, we optimize this memory overhead by sharing metaplasticity coefficients among weights. We explore sharing the metaplasticity coefficients based on neural connectivity where weights connected to the same post-synaptic neuron share a metaplasticity coefficient (neuron-shared m)⁴³. The metaplasticity coefficient associated with the weights connected to the $j^{\text {th}}$ post-synaptic neuron evolves as -

$$\begin{aligned} m_{j}(t+1) = {\left\{ \begin{array}{ll} m_{j}(t) + {\Delta }m, & \text {~if} \, X^{\text {tr}}_{j} \ge m^{\text {post}}_{\text {th}} \\ m_{j}(t), & \text {~otherwise} \end{array}\right. }, \end{aligned}$$

(4)

where $X^{\text {tr}}_j$ is the activity trace of the $j^{\text {th}}$ post-synaptic neuron. This configuration reduces the memory overhead substantially since only one coefficient is required per post-synaptic neuron in a layer. To further optimize the memory overhead, we also consider assigning all weights in the same layer the same metaplasticity coefficient (layer-shared m). In this configuration, the metaplasticity coefficient is updated based on the average activity trace of all post-synaptic neurons in the layer -

$$\begin{aligned} m(t+1) = {\left\{ \begin{array}{ll} m(t) + {\Delta }m, & \text {~if} \, \frac{(\sum _{j=1}^{n}X^{\text {tr}}_{j})}{n} \ge m^{\text {layer}}_{\text {th}} \\ m(t), & \text {~otherwise} \end{array}\right. }, \end{aligned}$$

(5)

where n is the number of post-synaptic neurons in a layer.

We evaluate the probabilistic metaplasticity model with shared metaplasticity coefficients on a spiking network with 200 hidden neurons and multi-memristor weights ($n_{\text {mem}}$ = 7). Table 4 lists the evaluation results on the benchmark tasks. We see that probabilistic metaplasticity with neuron-shared and layer-shared m performs on par or better compared to state-of-the-art models with weight consolidation based on neural activity^42,44. Furthermore, sharing metaplasticity coefficients leads to a substantial decrease in memory overhead and energy consumption during training. We model the metaplasticity coefficients in neuron-shared and layer-shared m with on-chip SRAM and registers, respectively, during energy analysis. Figure 5a shows, in both neuron-shared and layer-shared configurations, energy consumption is no longer dominated by SRAM read/write, rather most of the energy is spent in operations such as assessing the weight update criteria and memristor read/write. Hence, although layer-shared m drastically reduces memory overhead and leads to three orders of magnitude reduction in energy dissipation to read/write the metaplasticity coefficients compared to neuron-shared m, in both cases we observe $\sim 18 \times$ reduction in total energy consumption during the parameter update phase.

The reduction in energy consumption, however, is accompanied by a degradation in performance compared to the probabilistic metaplasticity model with individual metaplasticity coefficients. This can be attributed to the limited flexibility of the network to consolidate weights, which leads to a poorer balance of the stability-plasticity dilemma. In the split-MNIST task, sharing metaplasticity coefficients leads to reduced plasticity after sequential training. This can be observed from the update probability of the weights, which is inversely related to the product of the metaplasticity coefficient and the weight magnitude (|mw|). The distribution of |mw| in hidden weights (Fig. 5b) shows that shared metaplasticity coefficients lead to a lower fraction of weights with |mw| close to zero, indicating reduced plasticity. This reduction is more pronounced in layer-shared m. Individual task accuracies after sequential training also show that shared coefficients result in poorer accuracies in later tasks (Supplementary Table S5). This demonstrates a trade-off in the memory overhead and the network’s ability to balance stability and plasticity.

Despite the reduction in plasticity, the substantial gain in energy consumption incentivizes shared metaplasticity coefficients in extreme-edge applications. The trade-off between memory overhead and the stability-plasticity balance can be further optimized by sharing the metaplasticity coefficients among blocks of weights within a layer. We investigated a modular sharing of metaplasticity coefficients (module-shared m), where weights in a block, connecting a post-synaptic neuron to a set of pre-synaptic neurons, share a metaplasticity coefficient. The pre-synaptic neurons can be selected based on different criteria such as the sign or magnitude of incoming weights. Here, we group pre-synaptic neurons based on proximity where adjacent pre-synaptic neurons are grouped together. This configuration is well-suited for hardware implementation, as the activity of adjacent neurons can be computed locally, minimizing communication overhead. The coefficient $m_{kj}$, assigned to module k, consisting of post-synaptic neuron j and $n_{\text {block}}$ number of adjacent pre-synaptic neurons, are updated based on the post-synaptic activity trace and the average of the traces of the pre-synaptic neurons in the module as follows -

$$\begin{aligned} m_{kj}(t+1) = {\left\{ \begin{array}{ll} m_{kj}(t) + {\Delta }m, & \text {~if} \, \frac{(\sum _{i=1}^{n_{\text {block}}}X^{\text {tr}}_{i})}{n_{\text {block}}} \ge m^{\text {pre}}_{\text {th}} \& X^{\text {tr}}_{j} \ge m^{\text {post}}_{\text {th}} \\ m_{kj}(t) & \text {~otherwise} \end{array}\right. }. \end{aligned}$$

(6)

We evaluate the performance of probabilistic metaplasticity with module-shared m in the split-MNIST task considering sharing m among 8 and 4 adjacent weights in the hidden and the output layer, respectively. Figure 5c shows the mean accuracy across tasks for different configurations of shared m vs. the energy consumption per sample during the parameter update phase. We see that while the performance of module-shared m is similar to that of neuron-shared m, the former shows greater plasticity compared to neuron-shared m (Fig. 5b). It also reduces memory overhead by $\sim 87.5\%$ and reduces energy consumption by $\sim 4.7\times$ during the parameter update phase compared to individual m. This configuration can be a promising approach to optimize the trade-off between balance of stability and plasticity and energy efficiency. Introducing mechanisms to reduce the magnitude of metaplasticity coefficients based on neural activity can also potentially improve the balance of stability and plasticity⁴⁵.

Discussion

Emerging nonvolatile memristor devices offer immense opportunities for energy-efficient continual learning on the edge. However, their limited precision and high variability may constrain their suitability for continual learning solutions. Our research takes inspiration from the computational principles of the brain to devise a continual learning mechanism suited to non-ideal low-precision memory devices. Computations in the brain are characterized by stochastic and low-precision synapses^46,47. Research in computational memory models demonstrates that stochastic binary synapses with low-precision metaplastic states can help memory retention²⁶. Our work draws a parallel between the low-precision hidden metaplastic states of a cascade metaplastic synapse and the conductance levels of a memristor device. Abstracting this concept, we interpret the plasticity of weights as their probability of update rather than the magnitude of update. This interpretation distinguishes the proposed work from previous continual learning models with metaplasticity. Previous work has also explored probabilistic updates in memristor weights considering learning settings with independent and identically distributed (i.i.d.) data^48,49. In a continual learning setting, where the distribution of data changes, such mechanisms are prone to catastrophic forgetting. In contrast, probabilistic metaplasticity effectively alleviates catastrophic forgetting by dynamically adjusting the update probability of weights based on their importance.

In this work, we demonstrated the efficacy of probabilistic metaplasticity in a spiking network trained with error threshold-based eRBP, where weights were realized with a Hafnium oxide-based memristor. However, the proposed mechanism is not limited to the setting above and can be explored in the context of different network architectures and memory devices. The proposed work naturally extends to other surrogate gradient-based training algorithms for spiking networks, where the local error at the neurons can be estimated⁵⁰. Additionally, it can be applied to deep neural networks with low-precision weights updated using stochastic rounding⁵¹. Incorporating probabilistic metaplasticity with gradient accumulation-based training is also possible. In this setting, probabilistic metaplasticity will eliminate the need to multiply metaplasticity factors with gradients. However, the high frequency of memory access may lead to a lower gain in energy consumption. The proposed mechanism also extends to networks that realize low-precision weights with other memory devices. It shows promise to enable continual learning in neuromorphic accelerators with low-precision weights⁵². Recent research has explored probabilistic training algorithms for emerging low-precision memory devices in deep neural networks considering learning settings with independent and identically distributed data^48,49. Probabilistic metaplasticity can be explored in these networks to alleviate catastrophic forgetting during sequential learning with non-stationary data distribution.

We explored the effect of different weight resolutions by employing multi-memristor weights alongside single-memristor weights. Although we observed improved performance with higher resolution weights, it increases the area footprint of memristor crossbars. Improving the resolution of a single memristor device can lead to better performance with lower area overhead. Recent research has shown memristor devices with up to $\sim$ 6-bit precision²¹. In addition to the resolution of conductance modulation, the nonlinearity and variability in memristor devices can affect the performance of error threshold-based training. Due to these non-ideal characteristics, weight updates vary in response to the same local error depending on the current conductance level of the memristor. To minimize this effect, we restricted weight mapping to only ten low-resistance levels of the 1T1R device. This choice avoids the higher variability observed in the high-resistance levels along with the substantial difference in conductance between the highest low-resistance and high-resistance levels during weight updates²⁰. However, it should be noted that choosing low-resistance levels increases parasitic voltage drop and sneak currents, which can also introduce nonlinearity and degrade the network performance. Similarly, extreme non-uniformity in the distribution of conductance levels can potentially degrade the performance of error threshold-based training. Future studies incorporating optimized circuits and system demonstrations of the proposed mechanism can further investigate the effect of these non-idealities.

This work limits the exploration of probabilistic metaplasticity in fully-connected spiking networks. Future research can expand the proposed mechanism to larger networks and tackle more complex continual learning tasks with longer sequences. It would be interesting to extend the proposed mechanism to convolutional spiking networks where weights are shared among inputs. We identified a trade-off between memory overhead and the network’s ability to balance stability and plasticity when metaplasticity coefficients are shared among network weights. Further theoretical analysis could offer valuable insights and pave the way for more efficient continual learning solutions. Further investigations into the contrasting learning dynamics when training the network based on error accumulated at neurons instead of gradients can also lead to a deeper understanding of the underlying mechanisms. Our empirical observations on continual learning tasks did not show any significant differences in performance (Supplementary Table S6). However, we note that gradient accumulation may allow for more gradual weight changes over time as the gradients can be accumulated over samples. During error threshold-based training, integrating error across samples may lead to incorrect attribution, especially when weights are updated based on immediate input activity. For settings that require gradual updates, tracing input activity over time or using low-precision counters to track smaller changes may prove beneficial. Error-triggered training is promising for resource-constrained platforms as it reduces memory overhead and merits robust study under various conditions.

In summary, we propose probabilistic metaplasticity for continual learning with low-precision memristor weights. The proposed approach consolidates important weights by modulating the probability of weight update rather than its magnitude. This eliminates the need for auxiliary high-precision memory for weight consolidation, leading to energy-efficient training with fewer memory accesses and computations. Benchmark evaluations demonstrated that probabilistic metaplasticity effectively prevents catastrophic forgetting in the online continual learning scenario with no task supervision, satisfying the criteria for autonomous applications. Moreover, it enables a spiking network with low-precision memristor weights to achieve accuracies equivalent to state-of-the-art continual learning models with high-precision weights while requiring lower memory overhead. Probabilistic metaplasticity can thus lead to efficient solutions for autonomous continual learning on edge.

Methods

Memristor device

We evaluate the proposed probabilistic metaplasticity model considering Hafnium Oxide-based 1T1R cells as weights. The Resistive Random Access Memory (RRAM) device consists of a $\text {TiN/Ti/HfO}_\text {2}/\text {Ti}$ stack connected to a transistor in the IBM 65nm 10LPe CMOS/Memristor process²⁰. The device can be used in binary mode by setting and resetting it to high and low conductive states. During SET operation, modulating the compliance currents by controlling the gate voltage of the transistor gives rise to 10 distinct levels in the high-conductive state, with mean conductance in the range of $\sim$ 40 μS–283 μS with an average conductance resolution of $\sim$ 27 μS²⁰. When used as a synaptic weight, the resolution of this device (calculated as (maximum conductance − minimum conductance)/average conductance resolution) is close to 3 bits of precision. In the multi-memristor weight, due to the parallel combination of memristors, the maximum and minimum conductance of the weight increases. However, since only one device is updated during training, the average conductance resolution during weight update is the same as a single memristor weight. We use scaling to map the multi-memristor weights to the desired range. With four and seven memristors per weight, the weight precision is close to 4 and 6 bits, respectively.

Spiking neural network training algorithm

We train the multilayer spiking neural network with event-driven random backpropagation (eRBP) combined with error threshold, which uses surrogate gradients²⁹. This section describes the eRBP training algorithm. The input stimuli to the network are encoded as Poisson spike trains, where the spike rate is proportional to the strength of the input. The spike train propagates through the weights and accumulates as the input current to the Leaky Integrate and Fire neurons. For the $j^{\text {th}}$ neuron, the accumulated current can be expressed as -

$$\begin{aligned} I_j(t + 1) = I_j(t) + \frac{{\Delta }t}{\tau _{\text {syn}}} \Big (\sum _{i=1}^N {w}_{ij} S_i(t) - I_j(t)\Big ). \end{aligned}$$

(7)

Here $S_{i}(t)$ is the input spike at the $i^{\text {th}}$ pre-synaptic neuron at timestep t, N is the total number of input neurons, $\tau _{\text {syn}}$ is the synaptic time constant, and $w_{ij}$ is the synaptic weight between $i^{\text {th}}$ and $j^{\text {th}}$ pre- and post-synaptic neurons. This current is integrated at the neuron membrane potential as shown in Eq. 8, where $\tau _{\text {mem}}, R, C$ and $V_{\text {rest}}$ are the membrane time constant, membrane resistance, capacitance and the resting potential of the neuron, respectively.

$$\begin{aligned} V_j(t+1) = V_j(t) + \frac{{\Delta }t}{\tau _{\text {mem}}}\bigg (\Big (V_{\text {rest}}-V_j(t)\Big )+I_j(t)R\bigg ) \end{aligned}$$

(8)

When the membrane potential reaches its threshold $V_{\text {th}}$, the neuron emits a spike $S_j$, and its membrane potential is reset to the resting potential. A refractory period follows every time a neuron spikes, during which the neuron remains inactive.

The spike train progresses through the network to produce the output layer spike train $S^{\text {out}}$. It is compared to the target spike train L to produce the error current $I^{\text {err}}$ -

$$\begin{aligned} {I}^{\text {err}}(t) = {S}^{\text {out}}(t) - {L}(t). \end{aligned}$$

(9)

Two sets of error neurons, for false positive and false negative errors, spike in response to $I^{\text {err}}$. The output spikes of error neurons ${S}^{\text {fp/fn}}$ are used to compute the error in the output and hidden layers. At the output layer o, the error at the $j^{\text {th}}$ neuron is -

$$\begin{aligned} E^{o}_j(t) = {S}^{\text {fp}}_j(t) - {S}^{\text {fn}}_j(t). \end{aligned}$$

(10)

While for the hidden layer h, the error is backpropagated to the $i^{\text {th}}$ neuron by -

$$\begin{aligned} E^{h}_i(t) = \sum _{j=1}^{n_{\text {out}}} \Big ({w}^{\text {fp},h}_{ij}{S}^{\text {fp}}_j(t) - {w}^{\text {fn},h}_{ij}{S}^{\text {fn}}_j(t)\Big ), \end{aligned}$$

(11)

where $n_{\text {out}}$ is the number of output neurons, and ${w}^{\text {fp},h}_{ij}$ and ${w}^{\text {fn},h}_{ij}$ are random feedback weights from error neurons encoding false positives and false negatives, respectively.

The neurons in the hidden and output layers integrate the errors as -

$$\begin{aligned} U(t+1) = U(t) + \frac{{\Delta }t}{\tau _{\text {mem}}}\Big (-U(t) + E(t)R\Big ). \end{aligned}$$

(12)

As the dendritic compartment accumulates the error, the weights are updated in proportion to it whenever there is an active pre-synaptic spike, and the post-synaptic current is within a range. The update in weight can be expressed as -

$$\begin{aligned} \Delta w_{ij} = -\eta S_iU_j\Theta (I_j), \end{aligned}$$

(13)

where $\eta$ is the learning rate, $S_i$ is the pre-synaptic spike, $U_j$ is the error integrated at the dendritic compartment of the $j^{\text {th}}$ post-synaptic neuron, and $\Theta$ is a boxcar function, which indicates that the current magnitude at post-synaptic neuron j falls within the threshold.

Table 1 The mean accuracies across tasks with standard deviations of different continual learning models on the split-MNIST and split-Fashion MNIST tasks after sequential training in the domain-IL scenario.

Full size table

Table 2 Evaluation of a spiking neural network with probabilistic metaplasticity and TACOS on the split-CIFAR-10 task in the domain-IL scenario. Probabilistic metaplasticity is evaluated considering multi-memristor weights $n_{mem}$ = 7 and TACOS considers full-precision weights. The table lists the mean and standard deviation of the individual task accuracies and the mean accuracy across tasks over 5 runs after sequential training.

Full size table

Table 3 Performance of a spiking network on the split-MNIST task in different settings. The table shows the mean and standard deviation of the individual task accuracies and the mean accuracy across tasks over 5 runs after sequential training.

Full size table

Table 4 Evaluation of probabilistic metaplasticity and neural activity-dependent weight consolidation models on the split-MNIST and split-Fashion MNIST tasks in the domain-IL scenario.

Full size table

Activity-dependent metaplasticity with gradient accumulation for continual learning

Metaplasticity in neural networks is usually abstracted with a metaplasticity factor, which modulates the magnitude of weight gradients^11,15. The factor assumes a low value for important weight, which restricts subsequent changes and preserves previously learned information. In this work, we adopt activity-dependent metaplasticity¹¹ for continual learning with gradient accumulation. According to this mechanism, the importance of a weight is determined by its magnitude and the activity of the adjacent neurons. The function is expressed as -

$$\begin{aligned} f(m_{ij},w_{ij}) = e^{-|m_{ij}w_{{ij}}|} \end{aligned}$$

(14)

Here, $w_{ij}$ is the weight value, and $m_{ij}$ is a metaplasticity coefficient assigned to a weight that captures the activity of its adjacent neurons. It is updated by computing the trace of the neuron activity as -

$$\begin{aligned} m_{ij}(t+1) = {\left\{ \begin{array}{ll} m_{ij}(t) + {\Delta }m, & \text {~if} \, X^{\text {tr}}_{i} \ge m^{\text {pre}}_{\text {th}} \& X^{\text {tr}}_{j} \ge m^{\text {post}}_{\text {th}} \\ m_{ij}(t), & \text {~otherwise} \end{array}\right. }, \end{aligned}$$

(15)

The weight gradients are modulated as -

$$\begin{aligned} \Delta w_{ij} = -\eta S_iU_j\Theta (I_j)f(m_{ij},w_{ij}) \end{aligned}$$

(16)

The required weight updates are highly precise and cannot be directly incorporated into low-precision memristor devices. Therefore, the weight gradients are accumulated in high-precision memory during sample spike-train presentations. Whenever the gradient reaches a threshold corresponding to the memristor resolution, the weights are updated. The gradients are reset after each sample presentation. The complete training procedure is described in Supplementary Algorithm S1.

Data availability

All datasets used in this work (MNIST³⁴, Fashion-MNIST³⁵ and CIFAR-10⁴¹) are publicly available.

Code availability

The source code supporting the figures and experiments in this work is available online in the Github repository: https://github.com/Nu-AI/Probabilistic_metaplasticity.

References

Hayes, T. L. & Kanan, C. Online Continual Learning for Embedded Devices. arXiv preprint arXiv:2203.10681 (2022).
Kukreja, N. et al. Training on the Edge: The why and the how. In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 899–903. https://doi.org/10.1109/IPDPSW.2019.00148 (2019).
Dally, W. On the model of computation: Point. Commun. ACM 65, 30–32 (2022).
Article Google Scholar
French, R. M. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 3, 128–135, https://doi.org/10.1016/S1364-6613(99)01294-2 (1999).
Verwimp, E. et al. Continual learning: Applications and the road forward. arXiv preprint arXiv:2311.11908 (2023).
Kudithipudi, D. et al. Design principles for lifelong learning AI accelerators. Nat. Electron. 1–16 (2023).
Balasubramanian, V. Heterogeneity and efficiency in the brain. Proc. IEEE 103, 1346–1358. https://doi.org/10.1109/JPROC.2015.2447016 (2015).
Article CAS Google Scholar
Kudithipudi, D. et al. Biological underpinnings for lifelong learning machines. Nat. Mach. Intell. 4, 196–210. https://doi.org/10.1038/s42256-022-00452-0 (2022).
Article Google Scholar
Abraham, W. C. & Bear, M. F. Metaplasticity: The plasticity of synaptic plasticity. Trends Neurosci. 19, 126–130. https://doi.org/10.1016/S0166-2236(96)80018-X (1996).
Article CAS PubMed Google Scholar
Benna, M. K. & Fusi, S. Computational principles of synaptic memory consolidation. Nat. Neurosci. 19, 1697–1706. https://doi.org/10.1038/nn.4401 (2016).
Article CAS PubMed Google Scholar
Soures, N., Helfer, P., Daram, A., Pandit, T. & Kudithipudi, D. TACOS: Task agnostic continual learning in spiking neural network. In Theory and Foundation of Continual Learning Workshop at ICML’2021 (2021).
Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. 114, 3521–3526. https://doi.org/10.1073/pnas.1611835114 (2017).
Article ADS MathSciNet CAS PubMed PubMed Central Google Scholar
Zenke, F., Poole, B. & Ganguli, S. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70 of Proceedings of Machine Learning Research (eds. Precup, D. & Teh, Y. W.) 3987–3995 (PMLR, 2017).
Kaplanis, C., Shanahan, M. & Clopath, C. Continual reinforcement learning with complex synapses. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80 of Proceedings of Machine Learning Research (eds. Dy, J. & Krause, A.) 2497–2506 (PMLR, 2018).
Laborieux, A., Ernoult, M., Hirtzlin, T. & Querlioz, D. Synaptic metaplasticity in binarized neural networks. Nat. Commun. 12, 2549 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Han, B., Sengupta, A. & Roy, K. On the energy benefits of spiking deep neural networks: A case study. In 2016 International Joint Conference on Neural Networks (IJCNN) 971–976. https://doi.org/10.1109/IJCNN.2016.7727303 (2016).
Yu, S. Neuro-Inspired computing with emerging nonvolatile memory. Proc. IEEE 106, 260–285. https://doi.org/10.1109/JPROC.2018.2790840 (2018).
Article CAS Google Scholar
Xia, Q. & Yang, J. J. Memristive crossbar arrays for brain-inspired computing. Nat. Mater. 18, 309–323. https://doi.org/10.1038/s41563-019-0291-x (2019).
Article ADS CAS PubMed Google Scholar
Cheng, M. et al. TIME: A training-in-memory architecture for memristor-based deep neural networks. In Proceedings of the 54th Annual Design Automation Conference 2017, DAC ’17. https://doi.org/10.1145/3061639.3062326 (Association for Computing Machinery, 2017).
Liehr, M., Hazra, J., Beckmann, K., Rafiq, S. & Cady, N. Impact of switching variability of 65 nm CMOS integrated hafnium dioxide-based ReRAM devices on distinct level operations. In 2020 IEEE International Integrated Reliability Workshop (IIRW) 1–4 (IEEE, 2020).
Park, J. et al. TiOx-based RRAM synapse with 64-levels of conductance and symmetric conductance change by adopting a hybrid pulse scheme for neuromorphic computing. IEEE Electron Device Lett. 37, 1559–1562. https://doi.org/10.1109/LED.2016.2622716 (2016).
Article ADS CAS Google Scholar
Nandakumar, S. R. et al. Mixed-precision deep learning based on computational memory. Front. Neurosci. 14. https://doi.org/10.3389/fnins.2020.00406 (2020).
Li, Y. et al. Mixed-precision continual learning based on computational resistance random access memory. Adv. Intell. Syst. 4, 2200026. https://doi.org/10.1002/aisy.202200026 (2022).
Article Google Scholar
D’Agostino, S. et al. Synaptic metaplasticity with multi-level memristive devices. In 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS) 1–5. https://doi.org/10.1109/AICAS57966.2023.10168563 (2023).
Lu, A. et al. High-speed emerging memories for AI hardware accelerators. Nat. Rev. Electr. Eng. 1, 24–34. https://doi.org/10.1038/s44287-023-00002-9 (2024).
Article ADS Google Scholar
Fusi, S., Drew, P. J. & Abbott, L. F. Cascade models of synaptically stored memories. Neuron 45, 599–611 (2005).
Article CAS PubMed Google Scholar
Zohora, F. T., Karia, V., Daram, A. R., Zyarah, A. M. & Kudithipudi, D. MetaplasticNet: Architecture with probabilistic metaplastic synapses for continual learning. In 2021 IEEE International Symposium on Circuits and Systems (ISCAS) 1–5. https://doi.org/10.1109/ISCAS51556.2021.9401262 (2021).
Crestani, A. P. et al. Metaplasticity contributes to memory formation in the hippocampus. Neuropsychopharmacology 44, 408–414. https://doi.org/10.1038/s41386-018-0096-7 (2019).
Article CAS PubMed Google Scholar
Neftci, E. O., Augustine, C., Paul, S. & Detorakis, G. Event-driven random back-propagation: Enabling neuromorphic deep learning machines. Front. Neurosci. 11, 324 (2017).
Article PubMed PubMed Central Google Scholar
Payvand, M., Fouda, M. E., Kurdahi, F., Eltawil, A. M. & Neftci, E. O. On-chip error-triggered learning of multi-layer memristive spiking neural networks. IEEE J. Emerg. Sel. Top. Circuits Syst. 10, 522–535. https://doi.org/10.1109/JETCAS.2020.3040248 (2020).
Article ADS Google Scholar
Zyarah, A. M. & Kudithipudi, D. Semi-Trained memristive crossbar computing engine with in situ learning accelerator. J. Emerg. Technol. Comput. Syst. 14. https://doi.org/10.1145/3233987 (2018).
van de Ven, G. M., Tuytelaars, T. & Tolias, A. S. Three types of incremental learning. Nat. Mach. Intell. 4, 1185–1197. https://doi.org/10.1038/s42256-022-00568-3 (2022).
Article PubMed PubMed Central Google Scholar
Hsu, Y.-C., Liu, Y.-C., Ramasamy, A. & Kira, Z. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. In NeurIPS Continual learning Workshop (2018).
Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324. https://doi.org/10.1109/5.726791 (1998).
Article Google Scholar
Xiao, H., Rasul, K. & Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M. & Tuytelaars, T. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV) (2018).
Schug, S., Benzing, F. & Steger, A. Presynaptic stochasticity improves energy efficiency and helps alleviate the stability-plasticity dilemma. eLife 10, e69884. https://doi.org/10.7554/eLife.69884 (2021).
Zeno, C., Golan, I., Hoffer, E. & Soudry, D. Task agnostic continual learning using online variational bayes. arXiv preprint arXiv:1803.10123 (2018).
Li, Z. & Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 40, 2935–2947. https://doi.org/10.1109/TPAMI.2017.2773081 (2018).
Article CAS PubMed Google Scholar
Boybat, I. et al. Neuromorphic computing with multi-memristive synapses. Nat. Commun. 9, 1–12 (2018).
Article ADS CAS Google Scholar
Krizhevsky, A. Learning multiple layers of features from tiny images. Master’s thesis, University of Tront (2009).
Daram, A. & Kudithipudi, D. NEO: Neuron state dependent mechanisms for efficient continual learning. In Proceedings of the 2023 Annual Neuro-Inspired Computational Elements Conference, NICE ’23 11–19. https://doi.org/10.1145/3584954.3584960 (Association for Computing Machinery, 2023).
Karia, V., Zohora, F. T., Soures, N. & Kudithipudi, D. Scolar: A spiking digital accelerator with dual fixed point for continual learning. In 2022 IEEE International Symposium on Circuits and Systems (ISCAS) 1372–1376. https://doi.org/10.1109/ISCAS48785.2022.9937294 (2022).
Kim, S. & Lee, S. Continual learning with neuron activation importance. In Image Analysis and Processing – ICIAP 2022 (eds. Sclaroff, S., Distante, C., Leo, M., Farinella, G. M. & Tombari, F.) 310–321 (Springer International Publishing, 2022).
Soures, N. Lifelong learning in spiking neural networks through neural plasticity. Ph.D. thesis, Rochester Institute of Technology (2023).
Bartol, T. M. et al. Hippocampal spine head sizes are highly precise. bioRxiv https://doi.org/10.1101/016329. https://www.biorxiv.org/content/early/2015/03/11/016329.full.pdf (2015).
O’Connor, D. H., Wittenberg, G. M. & Wang, S.S.-H. Graded bidirectional synaptic plasticity is composed of switch-like unitary events. Proc. Natl. Acad. Sci. 102, 9679–9684. https://doi.org/10.1073/pnas.0502332102 (2005) https://www.pnas.org/doi/pdf/10.1073/pnas.0502332102.
Article ADS CAS PubMed PubMed Central Google Scholar
Misba, W. A., Lozano, M., Querlioz, D. & Atulasimha, J. Energy efficient learning with low resolution stochastic domain wall synapse for deep neural networks. IEEE Access 10, 84946–84959. https://doi.org/10.1109/ACCESS.2022.3196688 (2022).
Article Google Scholar
Zhang, Y., He, G., Tang, K.-T. & Wang, G. On-chip learning of multilayer perceptron based on memristors with limited multilevel states. In 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS) 11–12. https://doi.org/10.1109/AICAS.2019.8771513 (2019).
Kaiser, J., Mostafa, H. & Neftci, E. Synaptic plasticity dynamics for deep continuous local learning (DECOLLE). Front. Neurosci. 14. https://doi.org/10.3389/fnins.2020.00424 (2020).
Gupta, S., Agrawal, A., Gopalakrishnan, K. & Narayanan, P. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37 of Proceedings of Machine Learning Research (eds. Bach, F. & Blei, D.) 1737–1746 (PMLR, 2015).
Davies, M. et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 82–99. https://doi.org/10.1109/MM.2018.112130359 (2018).
Article Google Scholar

Download references

Acknowledgements

This effort is partially supported by NSF EFRI BRAID Award #2317706, NSF NAIAD Award #2332744 and Air Force Research Laboratory under agreement number FA8750-20-2-1003 through BAA FA8750-19-S-7010. We thank Dr. Maximilian Liehr and Dr. Nathaniel Cady for helpful discussions on the RRAM device characteristics. We thank Dr. Abdullah Zyarah for helpful discussions and feedback. We thank Matrix AI Consortium for supporting with computing resources for this research. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF, Air Force Research Laboratory or the U.S. Government.

Author information

Authors and Affiliations

Neuromorphic Artificial Intelligence Lab, University of Texas at San Antonio, San Antonio, TX, 78249, USA
Fatima Tuz Zohora, Vedant Karia, Nicholas Soures & Dhireesha Kudithipudi

Authors

Fatima Tuz Zohora
View author publications
You can also search for this author inPubMed Google Scholar
Vedant Karia
View author publications
You can also search for this author inPubMed Google Scholar
Nicholas Soures
View author publications
You can also search for this author inPubMed Google Scholar
Dhireesha Kudithipudi
View author publications
You can also search for this author inPubMed Google Scholar

Contributions

F.Z. conceived the idea, conducted the experiments, and wrote the manuscript. V.K. contributed to the energy estimation for the digital components and memory models. N.S. helped develop the experiments. D.K. oversaw the development and execution of the research and offered critical feedback. All authors reviewed the manuscript.

Corresponding author

Correspondence to Fatima Tuz Zohora.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Zohora, F.T., Karia, V., Soures, N. et al. Probabilistic metaplasticity for continual learning with memristors in spiking networks. Sci Rep 14, 29496 (2024). https://doi.org/10.1038/s41598-024-78290-w

Download citation

Received: 16 April 2024
Accepted: 29 October 2024
Published: 27 November 2024
DOI: https://doi.org/10.1038/s41598-024-78290-w