Tsinghua's Brain-Inspired Chip Returns to a Top International Journal, Powering a Robot That Plays Cat and Mouse

Tech · Updated 2022 · Source: 芯东西

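In the early hours of this Thursday, the leading international journal Science published the latest results on a neuromorphic chip (also called a "brain-inspired chip") from Tsinghua University. The research was led by Shi Luping, professor in the Department of Precision Instrument and director of the Center for Brain-Inspired Computing Research at Tsinghua University. All of the paper's authors are affiliated with Tsinghua's Department of Precision Instrument, the Optical Memory National Engineering Research Center, the Beijing Innovation Center for Future Chips, and the Center for Brain-Inspired Computing Research.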

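In August 2019, "Tianjic", the world's first heterogeneous-fusion brain-inspired computing chip, developed by Professor Shi Luping's team, appeared on the cover of Nature. The team showed the chip driving a self-driving bicycle that balanced itself, recognized voice commands, detected pedestrians ahead, and avoided obstacles automatically.

Dr. Skipper, Nature's editor-in-chief at the time, praised the achievement as "an important milestone in the field of artificial intelligence," and it became one of the most discussed pieces of research in the scientific community in 2019.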

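This time, building on that earlier work, the team has developed a 28 nm neuromorphic chip named TianjicX.

TianjicX has a peak dynamic energy efficiency of 3.2 TOPS/W, an on-chip memory bandwidth of 5.12 Tb/s, and a compute density of up to 0.2 TOPS/mm². It supports adaptive allocation of computing resources and scheduling of execution time for each task.

The research team built Tianjicat, a multi-intelligent-tasking mobile robot equipped with the chip, and had it play the cat in a game of cat and mouse.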

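Experimental results show that, compared with the NVIDIA Jetson TX2, running multiple networks on TianjicX reduced latency by about 98.74% and dynamic power by 50.66%.

The paper's authors argue that TianjicX opens a new path for computing hardware for mobile intelligent robots: it lets a robot execute intensive, complex tasks locally with low latency and low power, and it supports the parallel execution of multiple cross-paradigm neural network models in a robot under various forms of coordination.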

01.

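Designing hardware for mobile intelligent robots

Three core requirements must be met

The long-term goal for mobile robots is to reach near-human-level intelligence when dealing with complex and unknown environments.

In recent years, with the rapid development of artificial intelligence (AI), a variety of neural network algorithms have been widely applied to robots. These algorithms are usually computationally intensive, so implementing multiple neural networks efficiently on a mobile robot calls for innovation in computing hardware. There are three core requirements:

1. The hardware must support local, concurrent execution of multiple neural models at low latency, which is the key to better real-time processing.

2. It must allow multiple models to be deployed flexibly, so that low latency, high efficiency, and high concurrency can be balanced in dynamic scenarios.

3. It must support asynchronous execution and flexible interaction, to achieve high hardware utilization and to adapt to open environments.

Existing computing hardware, however, struggles to meet these requirements in different ways. Inherent bottlenecks in the underlying architecture or execution model prevent existing hardware from running multiple compute-intensive algorithms locally with low latency and high efficiency.

General-purpose processors usually cannot provide massively parallel computation, which leads to poor cost-effectiveness and high power consumption when running neural networks in a robot system. Graphics processing units (GPUs) are highly programmable and highly parallel, but frequent off-chip memory accesses and interaction with the CPU cause high power consumption and low utilization.

In recent years, many AI computing devices with greatly improved performance have emerged. Deep learning accelerators, whether based on field-programmable gate arrays (FPGAs) or designed as application-specific chips, can provide more efficient acceleration through customized architectural optimization.

These accelerators are still based on the traditional von Neumann architecture. As neural networks become more diverse and multitasking increases, their hardware utilization and scheduling flexibility come under great pressure, making it hard to run several different high-performance algorithms on a robot at the same time.

By contrast, neuromorphic chips with a non-von Neumann architecture are better suited to running multiple neural network models simultaneously.

Current neuromorphic chips, however, usually rely on pre-configured cores and pipeline a neural network through spatial slicing. Each core repeats its pre-configured operations in every execution cycle, which makes resource allocation inflexible and leaves resources underutilized.

To address this, the Tsinghua researchers developed the neuromorphic chip TianjicX.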

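The TianjicX chip

The chip offers spatiotemporal elasticity during task execution and cooperation: the hardware can adaptively allocate both the computing resources and the execution time given to each task.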

02.

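Three key levels in detail:

architecture, chip, and robot system

TianjicX achieves truly concurrent execution of neural network models across computing paradigms, including artificial neural networks (ANNs), spiking neural networks (SNNs), and hybrids of the two, for use in multi-intelligent-tasking (MIT) robots.

Designing computing hardware for multi-intelligent-tasking robots poses two key challenges. The first is meeting the performance requirements for latency, concurrency, and power (LCP), especially across different neural network implementations. The second is keeping each task's independent execution free from interference while still supporting interaction between tasks.

To overcome these challenges, the researchers made a series of design choices at different levels, including architecture, chip, and model deployment.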

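Neuromorphic computing platform for multi-intelligent-tasking robots

The key design of the neuromorphic computing platform for multi-intelligent-tasking robots spans three levels: 1) the Rivulet execution model; 2) the TianjicX chip with a dedicated compiler; 3) a TianjicX-based robot system.

1. Architecture: designing the Rivulet execution model

The researchers first developed the Rivulet execution model. It uses configurable primitive sequences and a hybrid synchronous-asynchronous execution mechanism to resolve the key conflicts among efficiency, elasticity, and adaptability, bridging the gap between robot requirements and a concrete hardware implementation.

The Rivulet model abstracts the basic execution activity of a neural network, unifying ANNs and SNNs in terms of "static data" and "dynamic data". This gives resource allocation and task scheduling a concrete entity that can be operated on and described.

On this basis, the team built a resource model that combines temporal and spatial slicing to manage multiple rivulets. Using virtual grouping, they designed rivulet execution around hybrid synchronous-asynchronous groups, which supports multiple rivulets whether they run independently or interact.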

Illustration of the Rivulet execution model

This model forms the architectural foundation of the TianjicX chip.
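The benefit of combining temporal with spatial slicing can be illustrated with a toy resource model. The sketch below is not the Rivulet model itself: the `Task` class, the task names, the core counts, and the time slots are all hypothetical, chosen only to show why letting tasks that never run in the same slot share cores needs fewer cores than giving every task its own permanent set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task: cores needed while running, and the slots it runs in."""
    name: str
    cores: int
    active_slots: frozenset


def cores_spatial_only(tasks):
    # Spatial slicing only: every task keeps its cores for the whole period.
    return sum(t.cores for t in tasks)


def cores_spatiotemporal(tasks, n_slots):
    # Spatial + temporal slicing: size the chip for the busiest time slot.
    return max(
        sum(t.cores for t in tasks if slot in t.active_slots)
        for slot in range(n_slots)
    )


def utilization(tasks, total_cores, n_slots):
    # Fraction of (core, slot) pairs that actually do work.
    busy = sum(t.cores * len(t.active_slots) for t in tasks)
    return busy / (total_cores * n_slots)


N_SLOTS = 4
tasks = [
    Task("vision", 4, frozenset({0, 1, 2, 3})),   # runs in every slot
    Task("audio_gate", 2, frozenset({0, 2})),     # runs in even slots
    Task("localize", 2, frozenset({1, 3})),       # runs in odd slots
]

spatial = cores_spatial_only(tasks)
shared = cores_spatiotemporal(tasks, N_SLOTS)
print(spatial, utilization(tasks, spatial, N_SLOTS))
print(shared, utilization(tasks, shared, N_SLOTS))
```

In this made-up workload the two audio tasks alternate, so they can occupy the same two cores in turn; the total drops from 8 cores to 6 and no core is left idle.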

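2. Chip: a 28 nm process and 160 FCores

To implement the Rivulet model effectively, the computing hardware must support efficient multi-rivulet execution, in-core process control, and virtual grouping, as well as communication, modulation, and mutual scheduling between rivulets.

To this end, the researchers developed TianjicX, a neuromorphic chip built on 28 nm complementary metal-oxide-semiconductor (CMOS) technology.

The chip integrates 160 configurable cross-paradigm cores with massively parallel computing units and abundant on-chip memory. It adopts a highly parallel, decentralized many-core non-von Neumann architecture that meets the requirements of Rivulet operation.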

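TianjicX hardware architecture

To support efficient cross-paradigm computing together with flexible programmability and scheduling, the researchers drew inspiration from biological neurons to design the microarchitecture of a unified functional core (FCore). Within each core, a dedicated controller that balances computation and scheduling manages the core locally, improving both generality and efficiency.

The researchers also designed a unified primitive instruction set with multi-precision computation to support highly programmable and versatile ANN (artificial neural network), SNN, and hybrid modeling. The primitives are further decomposed into separate hardware modules so that hardware resources are shared as much as possible.

In addition, through event-driven design and multi-level grouping of cores, different neural networks can execute asynchronously in response to dynamic changes in the environment and can interact globally. The team also developed a compiler stack that uses a spatiotemporal mapping method for model deployment. It exploits TianjicX's flexibility and can configure multiple tasks to fit the actual needs of different scenarios.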

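The compiler stack comprises a converter, a mapper, and a code generator

3. Robot system: completing multiple intelligent tasks in real time

The TianjicX chip improves a robot's ability to handle multipurpose, multi-intelligent tasks in complex and dynamic environments.

Specifically, elastic resource allocation improves hardware utilization and meets the robot's varying performance requirements. Independent execution contexts let the robot execute multiple tasks concurrently and asynchronously. Support for interaction between tasks ensures that the robot's modules cooperate smoothly, for example in coordinating multimodal perception with multiple motion modules.

The team built Tianjicat, a mobile robot equipped with TianjicX chips and multimodal sensors, and designed a cat-and-mouse game in which Tianjicat plays the cat.

By implementing different ANN and SNN models, the robot can perform sound recognition, sound source localization, object detection and recognition, obstacle avoidance, and decision-making in real time.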

03.

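Chip features in detail: in the cat-and-mouse game,

energy efficiency surpasses the NVIDIA TX2

The TianjicX chip is fabricated in UMC's 28 nm High Performance Compact Plus (HPC+) CMOS process and packaged in an FBGA-225 package. The chip's physical layout is shown in the figure below.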

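Chip layout of TianjicX, the FCore area, and the partitioning of the FCore area

The chip comprises 160 FCores and one high-speed serializer/deserializer (SerDes) interface for inter-chip communication. The controller occupies only about 1% of an FCore's area yet significantly improves the flexibility and efficiency of task execution and interaction. The core's memory module consists of five static random-access memory (SRAM) blocks with a total capacity of 144 KB. Through wide parallel read/write interfaces, the chip as a whole reaches a memory access bandwidth of 5.12 Tb/s at a 400 MHz clock frequency.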
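As a quick sanity check on these figures, the arithmetic below derives two numbers the article does not state. It rests on two assumptions: that the 144 KB capacity is per FCore, and that the 5.12 bandwidth figure is spread evenly over all 160 cores and their five SRAM blocks. The article writes the bandwidth unit as "tb/s", so the per-cycle results are left in that same unit (bits or bytes) rather than resolved here.

```python
N_CORES = 160
SRAM_BLOCKS_PER_CORE = 5
SRAM_PER_CORE_KB = 144                   # assumption: 144 KB is a per-FCore figure
CLOCK_HZ = 400_000_000                   # 400 MHz
CHIP_BANDWIDTH = 5_120_000_000_000       # 5.12e12 units/s, unit as given in the article

# Total on-chip SRAM if every FCore carries 144 KB.
total_sram_kb = N_CORES * SRAM_PER_CORE_KB
total_sram_mb = total_sram_kb / 1024

# Data moved per core per clock cycle, in the article's unit.
per_core_per_cycle = CHIP_BANDWIDTH / (N_CORES * CLOCK_HZ)
per_block_per_cycle = per_core_per_cycle / SRAM_BLOCKS_PER_CORE

print(total_sram_kb, total_sram_mb)
print(per_core_per_cycle, per_block_per_cycle)
```

Under these assumptions the chip holds 22.5 MB of SRAM in total, and each core moves 80 units per cycle, or 16 units per SRAM block, which is consistent with the "wide parallel interface" described above.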

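Key features and performance of the TianjicX chip

The researchers evaluated the TianjicX chip experimentally, focusing on power consumption, latency, and throughput.

Two mainstream neural network models, MobileNet and ResNet50, were implemented with different mapping strategies. The figure below shows the relationship between processing speed and power consumption on the TianjicX chip.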

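Performance and power evaluation for MobileNet and ResNet50, alongside the performance of GPUs, CPUs, and neural network accelerators

For comparison, the researchers also plotted the performance of various GPUs, CPUs, and deep learning accelerators.

As the figure shows, TianjicX achieves high-speed image processing for MobileNet at moderate power consumption, which indicates that it is competitive for edge applications. TianjicX can also run ResNet50 at low power.

The experimental results demonstrate TianjicX's potential to support large-scale ANNs, SNNs, and hybrid spiking/non-spiking models.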

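Performance and power consumption of representative SNN and hybrid models

Optimizing the mapping strategy can improve TianjicX's performance further. The results indicate that TianjicX can process individual neural network models across multiple scales and paradigms, which is the key to supporting multitasking with neural network models.

To demonstrate TianjicX in multi-intelligent-tasking robot applications, the team built the mobile robot development platform Tianjicat on a modified mobile vehicle and equipped it with a TianjicX chip array and multimodal sensors, as shown in the figure below.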

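Photographs of the Tianjicat robot, the development board, and the TianjicX chip array

The development board carries four TianjicX chips in a 4×1 array, and one or more chips can be used as needed because each chip can be activated individually. In the experiments that follow, only one TianjicX chip was activated to run the various neural networks; the other three remained inactive.

The researchers then designed a challenging cat-and-mouse game in a complex, dynamic environment.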

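Schematic of the robot scenario in Tianjicat's five main states

Tianjicat plays the cat and tries to catch an electronic mouse that runs around at random. Various obstacles are placed randomly and dynamically at different positions.

The "robot cat" must track the mouse by visual recognition, sound tracking, or a combination of the two, then move toward the mouse without colliding with obstacles and finally catch it.

Throughout this process, multiple neural network algorithms perform speech recognition, sound source localization, object detection, obstacle avoidance, and decision-making in real time, so cooperative and concurrent processing of these algorithms is essential.

In this work, a lightweight detection convolutional neural network (CNN) performs end-to-end multi-object detection; an SNN acts as an event-driven switch for the sound-processing networks; a hybrid CNN-GRU (gated recurrent unit) network estimates the position of the sound source; and an SNN-based neural state machine (NSM) handles multi-network scheduling and policy decisions.

An SNN can retain temporal information through its intrinsic neuronal dynamics and encode information as binary spike trains, and its threshold-based mechanism naturally resembles a switch that decides whether to activate the sound localization network. The researchers therefore chose an SNN as the switching network for this task.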
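The switch-like behavior described here can be seen in a single leaky integrate-and-fire (LIF) neuron, a standard textbook spiking model. The sketch below is only an illustration of the principle, not the network used on Tianjicat; the function name `lif_gate`, the leak and threshold values, and the input sequences are all made up for the example.

```python
def lif_gate(inputs, leak=0.9, threshold=1.0):
    """Run one leaky integrate-and-fire neuron and return its binary spike train."""
    potential = 0.0
    spikes = []
    for x in inputs:
        # Leaky integration: the membrane potential remembers recent input.
        potential = leak * potential + x
        if potential >= threshold:
            spikes.append(1)      # threshold crossed: emit a spike ("switch on")
            potential = 0.0       # reset after firing
        else:
            spikes.append(0)
    return spikes


# Hypothetical sound-energy sequences.
quiet = [0.05] * 6
burst = [0.05, 0.6, 0.6, 0.05, 0.05, 0.05]

print(lif_gate(quiet))   # no spike: background noise never reaches the threshold
print(lif_gate(burst))   # one spike, on the second loud sample
```

Neither loud sample in `burst` crosses the threshold on its own; the neuron fires only because it still holds part of the first one when the second arrives. That retained state is the "temporal memory" the paragraph refers to, and the resulting spike is what would trigger a downstream network.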

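The SNN and the GRU share the same CNN-based feature extractor (called audio-CNN) for speech preprocessing. The algorithms are deployed asynchronously and in parallel on a single TianjicX chip. Based on each network's computing performance requirements, a hybrid spatiotemporal mapping method optimizes the allocation of the chip's limited hardware resources.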

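Resource usage of the multiple tasks: all of the networks are mapped onto one TianjicX chip, enabling flexible resource sharing

The whole system occupies 128 FCores, which are assigned to four step groups and executed in an event-driven manner.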

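Externally event-driven, asynchronous, parallel multitask execution on TianjicX

Notably, the features extracted by audio-CNN are shared by the GRU and the SNN. By slicing rivulets in both space and time, the two networks can therefore reuse the same cluster of FCores (four in total). Compared with two separate, unoptimized FCore clusters, memory consumption falls by 8.2% with no increase in processing time.

Thanks to the instant primitive mechanism, cooperation and data transfer between tasks need no scheduling by other hardware. Compared with the NVIDIA Jetson TX2, the computational response latency of the multiple networks on TianjicX is reduced by about 98.74%.

Whereas a GPU executes multiple neural networks serially, the whole robot system here runs flexibly in an event-driven manner with a high degree of parallelism.

Each neural network executes only when its corresponding external sensor input is updated, so power consumption is low. Moreover, because multitasking relies on efficient asynchronous parallel execution and interaction, its effect on any single task's response to its own input is negligible.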
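The event-driven idea, where a network runs only when its own sensor delivers new input and shared features are computed once, can be sketched in software as a small dispatcher. This is a plain-Python analogy under loose assumptions, not the on-chip mechanism: the function `run_events`, the sensor labels, the network labels, and the gate threshold are hypothetical, and each "network" is reduced to a counter.

```python
from collections import Counter, deque


def run_events(events, gate_threshold=0.5):
    """Dispatch sensor events and count how often each (stand-in) network runs."""
    runs = Counter()
    queue = deque(events)
    while queue:
        sensor, value = queue.popleft()
        if sensor == "camera":
            # A new frame triggers only the detection network.
            runs["detector_cnn"] += 1
        elif sensor == "microphone":
            # Audio features are extracted once and shared downstream.
            runs["audio_cnn"] += 1
            features = value              # stand-in for the extracted features
            runs["snn_gate"] += 1         # the gate always inspects the features
            if features >= gate_threshold:
                # The localizer runs only when the gate fires.
                runs["cnn_gru_localizer"] += 1
    return runs


# Hypothetical event stream: (sensor, signal strength).
events = [
    ("camera", 0.0),
    ("microphone", 0.1),   # quiet: gate stays closed
    ("camera", 0.0),
    ("microphone", 0.9),   # loud: gate opens, localizer runs
]
counts = run_events(events)
print(dict(counts))
```

With this stream the localizer runs once even though two audio events arrived, and the feature extractor runs once per audio event rather than once per consumer. Work that is gated or shared in this way is work, and power, that is not spent.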

The latency, power consumption, and power efficiency of each neural network model are shown in the figure below.

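When processing five neural networks simultaneously, the cat-and-mouse game uses 128 cores within one TianjicX chip, with a total dynamic power of about 0.6 W. Compared with the TX2, TianjicX reduces dynamic power by 50.66%.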
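Percentage reductions are easier to compare once converted into "times" factors. The short calculation below does that for the two figures quoted in this article. The implied TX2 power is an inference, valid only if the 50.66% reduction is measured against the same workload as the roughly 0.6 W figure; the article does not state the TX2 number directly.

```python
def reduction_to_factor(reduction_pct):
    """Convert an 'X% lower' claim into a 'baseline is N times larger' factor."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


latency_factor = reduction_to_factor(98.74)   # roughly 79x lower latency
power_factor = reduction_to_factor(50.66)     # roughly 2x lower dynamic power

# Assumption: 0.6 W is the TianjicX value the 50.66% reduction refers to.
implied_tx2_power_w = 0.6 * power_factor      # roughly 1.2 W

# Share of the chip's 160 FCores used by the five-network workload.
core_fraction = 128 / 160

print(round(latency_factor, 2), round(power_factor, 3))
print(round(implied_tx2_power_w, 3), core_fraction)
```

The asymmetry is worth noting: a 98.74% latency reduction corresponds to a factor of about 79, while a 50.66% power reduction is only about a factor of 2, so the headline gain is chiefly in latency.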

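These results show that, for multi-intelligent-tasking robot applications, TianjicX delivers high real-time performance while maintaining both low latency and high energy efficiency.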

04.

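Conclusion: future work will explore more possibilities

for neuromorphic hardware and robot computing

The rapid development of mobile robots creates an opportunity to design alternative computing hardware around their unique requirements.

Neuromorphic architectures can be used not only to raise the level of intelligence but also to inspire alternative approaches to computer architecture design. Examples include allocating resources in a decentralized, distributed way, adopting event-driven execution and scheduling, achieving approximate computing through neural-network-like activity, and using specialized hardware architectures to build general-purpose systems.

Building on these ideas, the paper's authors designed the Rivulet execution model and its implementation in the TianjicX chip. In designing TianjicX they made a number of trade-offs among task execution, resource allocation, and task cooperation to achieve high spatiotemporal elasticity.

Compared with conventional neuromorphic chips, TianjicX makes full use of the data locality of intelligent algorithms, improves memory utilization, supports multiple data movement patterns, and is more programmable.

The authors write: "In the future, we will continue to study the combination of neuromorphic hardware and robot computing and explore more possibilities for unmanned robots."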
