UniSIMD assembler 1.0.0

UniSIMD provides a unified, low-level macro assembler for ARM and x86 architectures. It declares a subset of shared SIMD instructions and a common API to reduce code duplication and variation. Currently Intel SSE2 (32-bit x86 ISA) and ARM NEON (32-bit ARMv7 ISA) are supported. 64-bit wide SIMD with longer registers and addressing will be added later. UniSIMD is a C/C++ macro collection, so it can be easily included from header files.
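The idea of one macro vocabulary over several SIMD backends can be illustrated with a minimal sketch. All names here (`vec4f`, `VEC4F_LOAD`, `VEC4F_ADD`, `VEC4F_STORE`, `add_arrays`) are hypothetical and are not UniSIMD's actual API. The sketch also swaps in compiler intrinsics for the macro-assembler approach, purely to stay short and portable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unified macro layer: the same three macros map to
 * SSE2, NEON, or a plain scalar fallback, chosen at compile time. */
#if defined(__SSE2__)
  #include <emmintrin.h>
  typedef __m128 vec4f;
  #define VEC4F_LOAD(p)      _mm_loadu_ps(p)
  #define VEC4F_ADD(a, b)    _mm_add_ps((a), (b))
  #define VEC4F_STORE(p, v)  _mm_storeu_ps((p), (v))
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
  typedef float32x4_t vec4f;
  #define VEC4F_LOAD(p)      vld1q_f32(p)
  #define VEC4F_ADD(a, b)    vaddq_f32((a), (b))
  #define VEC4F_STORE(p, v)  vst1q_f32((p), (v))
#else
  /* Scalar fallback so the sketch builds on any target. */
  typedef struct { float f[4]; } vec4f;
  static vec4f vec4f_load_scalar(const float *p)
  {
      vec4f v;
      for (int i = 0; i < 4; i++) v.f[i] = p[i];
      return v;
  }
  static vec4f vec4f_add_scalar(vec4f a, vec4f b)
  {
      vec4f v;
      for (int i = 0; i < 4; i++) v.f[i] = a.f[i] + b.f[i];
      return v;
  }
  static void vec4f_store_scalar(float *p, vec4f v)
  {
      for (int i = 0; i < 4; i++) p[i] = v.f[i];
  }
  #define VEC4F_LOAD(p)      vec4f_load_scalar(p)
  #define VEC4F_ADD(a, b)    vec4f_add_scalar((a), (b))
  #define VEC4F_STORE(p, v)  vec4f_store_scalar((p), (v))
#endif

/* Target-independent code written once against the macro layer. */
static void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
    /* Process four floats per iteration using the selected backend. */
    for (; i + 4 <= n; i += 4) {
        vec4f va = VEC4F_LOAD(a + i);
        vec4f vb = VEC4F_LOAD(b + i);
        VEC4F_STORE(dst + i, VEC4F_ADD(va, vb));
    }
    /* Handle the remaining 0..3 elements one at a time. */
    for (; i < n; i++) dst[i] = a[i] + b[i];
}
```

Calling `add_arrays(d, a, b, 6)` uses one vector iteration for the first four elements and the scalar tail for the last two, whichever backend was selected.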

Tags c c++ assembler simd macro header-files
License MITL
State alpha

Recent Releases

1.0.0 (22 Nov 2018 19:05) major feature: UniSIMD assembler, code name "ENsed", base for future SIMD enhancements
- Renewed directory structure, move BASE/SIMD header files to core/config.
- Add new fp-compatibility and feature tasks, rename TASKS file to ROADMAP.
- Add support for 30 SIMD register pairs (2x128) backend on POWER7/8.
- Add support for 30 SIMD registers (scalar+128+256) backend on Skylake-X.
- Drop standalone SSE2 target from x64, reuse SSE4 (v4) slot, add compat flag.
- Add support for 128-bit AVX1+FMA3 (v16) and AVX2+FMA3 (v32) targets for AMD.
- Compactify POWER7/8 targets into one slot, add new RT_SIMD_COMPAT_PW8 flag.
- Swap legacy PowerPC G4/POWER6 VMX (now v4) with POWER7/8 VSX1/2 (now v1).
  - 64-bit POWER6 now matches 64-bit Nehalem target (both v4), 15x128/8x256-bit.
- Add support for POWER9 backend (v2) with immediate vector loads/stores.
- Move 128-bit 30 SIMD registers Skylake-X target from v1 to v2, match POWER9.
- Reserve 128-bit v1 and 256-bit v4 for 30 SIMD registers emulation on AVX1/2.
- Implement plain ARM-SVE backend (v4) for 256/512/1K4/2K8-bit vector lengths.
- Implement paired ARM-SVE backend (v1) for 512/1K4/2K8-bit SIMD target slots.
- New scheme: RT_128=4+8, RT_256=1+2, RT_512=1+2, RT_1K4=1+2 are 15 registers.
- New scheme: RT_128=1+2, RT_256=4+8, RT_512=4+8, RT_1K4=4+8 are 30 registers.
- Add elm*x_st instruction to detach scalar subset from vectors (via mem).
- Add support for horizontal pairwise/reductive add/mul/min/max instructions.
- Patch system allocators to compile on macOS, widen OS support in makefiles.
- Clean up SIMD tests to support PIE (also macOS).
- Separate 64-bit Linux from multilib build scripts, add for macOS.
- Add VMX-compatible scalar SIMD subset on PPC G4 and POWER family of CPUs.
- Add MSA/scalar compatibility on big-endian MIPS, support for fp32 11-bit DP.
- Rename sections in target-specific headers to BASE, SIMD, ELEM (for scalar).
- Optimize long displacements for BASE, SIMD, ELEM on RISCs where applicable.
- Implement proper SIMD-scaling for displacement types (a
0.9.1 (22 Mar 2017 14:25) major feature: Unified SIMD Assembler, 3-operand + basic scalar SIMD, extra backends
- Expose 128/256-bit SIMD subsets (cmd i/j/l *, cmd c/d/f *) simultaneously.
- Add 3-operand SIMD instructions to all targets, emulate where not present.
- Implement basic scalar SIMD support (arithmetic + compare-to-mask-elem).
- Implement additional paired/quaded 8-register SIMD backends on x86_64.
- Add 8-register makefile flags RT_256_R8, RT_512_R8, RT_1K4_R8, RT_2K8_R8.
- Original 15-register makefile flags RT_128, RT_256, RT_512 remain.
- Add new makefile flag RT_1K4 for 15-register code-bases on paired AVX-512.
- Expose 30 registers as an extension to common baseline of 15 where present.
- Each major architecture has at least one SIMD target with 30 registers.
- Add new RT_SIMD selector flag to remap vector-length-agnostic subsets.
- Add new RT_REGS selector flag to choose targets within given RT_SIMD width.
- Rename SIMD target headers to reflect size-factor/sub-variant, move legacy.
- Add new internal flags RT_128X*, RT_256X*, RT_512X to match SIMD headers.
- New internal flags keep SIMD sub-variant value in format for native width.
- Implement SIMD flags compatibility layer in rtzero to map makefile flags.
- Rtarch main header selects appropriate BASE/SIMD target from flags above.
- Implement SIMD target format converters in rtbase for runtime selection.
- Change SIMD target reporting to native-size x size-factor v version format.
- Reserve _RX slots in SIMD target mask for predicated backends (30+8 regs).
- Clean up (drop) legacy SSE(1) support from x32 headers/makefiles.
  - SIMD registers save/restore for 128-bit AVX targets (backported down).
- Buffer allocation in SIMD tests (for 64-bit elems).
- Allow external override for SIMD compatibility modes.
- Minor in rtarch, accelerate release builds on multi-core machines.
0.9.0 (22 Nov 2016 07:45) major feature: Unified SIMD Assembler, 256-bit SIMD on RISCs, basic AVX-512 support
- Adjust root rt_SIMD_INFO struct to contain both 32-bit and 64-bit constants.
- Add new sign-mask and full-mask general purpose constants to rt_SIMD_INFO.
- Expose 32/64-bit SIMD-element-size subsets (cmdo*, cmdq*) simultaneously.
- Element size in existing cmdp subset remains configurable with RT_ELEMENT.
- All three SIMD subsets (cmdo*, cmdp*, cmdq*) are still SIMD-width-agnostic.
- Expose bit BASE subset cmdz for 64-bit targets only.
- Existing address-size cmdx*, element-size cmdy and 32-bit cmdw remain.
- Add BASE move instructions for 64-bit immediates as pairs of 32-bit types.
- Add new rotate-right and inverse-logic BASE instructions (ror, ann, orn).
- Add new BMI1/BMI2 implementations for existing BASE instructions on x86.
- Implement non-portable x87 ISA subset for x86 targets internally.
- Implement fused-multiply-accumulate (fma/fms) on all SIMD targets.
- Add new mask-move SIMD instructions to common SIMD ISA (was x86 only).
- Add new fp-negate and inverse-logic SIMD instructions (neg, orn, not).
- Add new variable SIMD shifts with per-element count to all targets.
- Implement 256-bit SIMD support (2x128-bit, 15 regs) on modern RISC targets.
- Implement 512-bit SIMD support (4x128-bit, 15 regs) on modern Power targets.
- Implement 512-bit SIMD support (1x512-bit, 16 regs) on future x86 targets.
  - AVX1/AVX2 256-bit SIMD for x86 (1x256-bit, 16 regs) remains supported.
  - 256-bit SIMD with 15 regs becomes new common baseline for modern hardware.
- Improve test coverage for BASE and SIMD load-op instructions.
- Add tests for new rotate, logic, shifts, fma/fms instructions, run level 24.
- Add rtzero header file to clean up assembler definitions after use.
- Rename instruction parameters to better reflect their use as source/dest.
- Add formulas for all BASE and SIMD instructions for better clarity.
- Reserve the whole alphabet for future BASE and SIMD instruction subsets.
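The 0.9.0 notes mention BASE move instructions that take a 64-bit immediate as a pair of 32-bit values. The arithmetic behind such a split can be sketched as follows. The type and function names (`imm64_pair`, `imm64_split`, `imm64_join`, `imm64_selfcheck`) are hypothetical and only illustrate the high/low decomposition. They are not taken from UniSIMD's headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for a 64-bit immediate stored as two 32-bit halves. */
typedef struct {
    uint32_t hi; /* bits 63..32 */
    uint32_t lo; /* bits 31..0  */
} imm64_pair;

/* Split a 64-bit value into its high and low 32-bit halves. */
static imm64_pair imm64_split(uint64_t v)
{
    imm64_pair p;
    p.hi = (uint32_t)(v >> 32);
    p.lo = (uint32_t)(v & 0xFFFFFFFFu);
    return p;
}

/* Rebuild the 64-bit value from the two halves. */
static uint64_t imm64_join(imm64_pair p)
{
    return ((uint64_t)p.hi << 32) | (uint64_t)p.lo;
}

/* Round-trip check: splitting and joining must give back the input. */
static void imm64_selfcheck(void)
{
    const uint64_t samples[] = { 0u, 1u, 0xFFFFFFFFu,
                                 0x100000000ull, 0xFFFFFFFFFFFFFFFFull };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(imm64_join(imm64_split(samples[i])) == samples[i]);
    }
}
```

Passing the halves separately lets an encoder that only accepts 32-bit immediate fields still describe a full 64-bit constant.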
0.8.1 (04 Sep 2016 22:05) minor feature: Unified SIMD Assembler, full 64-bit fp/int SIMD compute elements
- Add element-sized BASE ISA subset to -32-bit and address-sized subsets.
- New instruction mnemonics introduced for element-sized BASE subset (cmdy*).
- Add new rtarch headers to house element-sized SIMD subset for 64-bit targets.
- Support for 64-bit SIMD elements currently requires 64-bit addresses as well.
- Enable full-precision SIMD rcpps/rsqps and rceps/rseps instructions.
- Add new offset corrections for endianness related to element-sized subset.
- Add new SIMD width short names for and element-sized SIMD fields.
- Add new custom-sized integer types (address, element) with printf mods.
- Make current adjustable fp types follow SIMD element size (RT_ELEMENT).
- Adjust math macros and definitions to support double-precision arithmetic.
- Non-setting-flags instructions to not interfere with cmp on MIPS, Power.
0.8.0 (15 Aug 2016 00:05) major feature: Unified SIMD Assembler, full 64-bit addressing for BASE and SIMD
- Double original 32-bit BASE ISA to -32-bit and address-sized subsets.
- Original instruction mnemonics follow in-heap/code-segment address size.
- New instruction mnemonics introduced for -32-bit subset (cmdw*).
- Setting-flags instruction mnemonics remapped from (cmdz*) to (cmd*z).
- Add combined-arithmetic-jump wrapper for better API stability/efficiency.
- Add new rtarch headers to house address-sized subset for 64-bit targets.
- Move original (now address-sized) mappings to rtbase for 32-bit targets.
- Add canonical forms for BASE div/rem and shifts (not always efficient).
- Add setting-flags versions for BASE orr/xor and unsigned shifts.
- Remap one-operand instructions from cmd_rr/mm to rx/mx and xr/xm.
- Move stack instructions to their own section at the end of rtarch headers.
- Move sregs instructions to their own section at the end of rtarch headers.
- Add config flags for full-precision SIMD rcpps/rsqps instructions.
- Add master flags for SIMD compatibility modes to rtarch main header.
- Add new offset corrections for endianness.
- Add Win64 support via TDM64-GCC toolchain (tdm64-gcc-5.1.0-2.exe).
- Add NULL-ptr checks to custom allocators (Linux/mmap, Win64/VirtualAlloc).
- Setting-flags instructions for 64-bit Power running 32-bit ISA.
- Non-setting-flags instructions (neg*x) to not set flags on MIPS.
0.7.1 (24 Jun 2016 19:45) minor feature: Unified SIMD Assembler, 64/32-bit hybrid mode for native 64-bit ABI
- Use -sized and adjustable integer types in rtbase and SIMD test.
- Add a64 (AArch64 native ABI) and x64 (x86_64 native ABI) targets/makefiles.
- Add m64 (MIPS64 native ABI) and p64 (Power64 native ABI) targets/makefiles.
- Most of the current ISA remains 32-bit for BASE and SIMD with few exceptions.
- Adjust backend structures to support 64-bit pointer types in select places.
- Move sys_alloc/sys_free to platform-specific sections in SIMD test.
- Implement custom allocators (mmap) to limit address range to 32-bit (Linux).
- Limit address range to 2GB boundary as MIPS64 sign-extends 32-bit mem-loads.
- Treat code labels as 64-bit in label_ld/st and jmpxx_mm instructions.
- Implement 64-bit versions of stack_sa/la instructions on MIPS and Power.
- Variable SIMD shifts to support little-endian on Power targets.
  - ASM blocks to only use SIMD registers within VRSAVE segment on Power.
- Remove ASM block's zeroing of r15 as unnecessary on x32/x64 targets.
- Reformat/rework ASM blocks to better respect internal register mapping.
- Explicitly save/load SIMD registers in ASM blocks across all targets.
- Drop ASM clobber lists for lack of consistency across targets/SIMD-widths.
- Clang's ASM block l-value errors and other warnings, official support.
- Add build instructions to makefiles for Ubuntu 16.04 LTS 64-bit Live CD.
0.7 (14 Apr 2016 09:25) minor feature: Unified SIMD Assembler, additional 32-bit CPU architectures
- Add a32 (AArch64:ILP32 ABI) and x32 (x86_64:mx32 ABI) targets/makefiles.
- Add m32 (MIPS32r5/r6 + MSA) and p32 (Power + VMX/VSX) targets/makefiles.
- Add yet another SIMD variant (v4) for x86/SSE4.1 and ARMv8/AArch32.
- Separate ARMv7/ASIMDv2 (v2) and ARMv8/AArch32 (v4) SIMD variants on ARM.
- Add ARM builds for Raspberry Pi 2 and 3 in addition to Nokia N900.
- Use static linking in SIMD tests for QEMU emulation.
- Add mmv (blendvps) to x86/x32 SSE4.1 for fast conditional loads.
- Add combined-compare-jumps to rtarch for better efficiency (MIPS, Power).
- Remove limitation for BASE instructions to only accept DP offsets.
- Add new immediate/displacement types, add comment that they are unsigned.
- Add comments throughout rtarch about instructions' set-flags behavior.
- Implement full-range 32-bit integer divide on ARMv7 (v1) as 64-bit fp-div.
- Add widening versions of integer multiply instructions to rtarch definitions.
- Add remainder wrappers for integer divide instructions to rtarch definitions.
- Add IEEE-compatible versions of fp div sqr for ARMv7 and Power targets.
- Add "residual correction" to non-IEEE fp div on ARMv7 and Power targets.
- Add SIMD tests for fp-to-int round and int-div remainder, run level 18.
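The 0.7 notes describe implementing a full-range 32-bit integer divide on ARMv7 as a 64-bit fp-div, plus remainder wrappers. The reason this works can be shown in portable C. Every 32-bit integer is exactly representable in a double (53-bit significand), and the rounded double quotient never crosses an integer boundary, so truncating it gives the exact integer quotient. The function names `div_i32_via_f64` and `rem_i32_via_f64` are hypothetical; this is a sketch of the arithmetic, not the project's ARMv7 code.

```c
#include <assert.h>
#include <stdint.h>

/* Signed 32-bit division carried out as a double-precision divide.
 * The cast back to int32_t truncates toward zero, which matches the
 * behavior of C integer division. */
static int32_t div_i32_via_f64(int32_t a, int32_t b)
{
    assert(b != 0);                        /* division by zero is undefined */
    assert(!(a == INT32_MIN && b == -1));  /* quotient 2^31 does not fit    */
    return (int32_t)((double)a / (double)b);
}

/* Remainder wrapper built on top of the divide: r = a - q * b.
 * The product q * b never exceeds |a| in magnitude, so it cannot overflow. */
static int32_t rem_i32_via_f64(int32_t a, int32_t b)
{
    int32_t q = div_i32_via_f64(a, b);
    return a - q * b;
}
```

For example, `div_i32_via_f64(-7, 2)` yields `-3` and `rem_i32_via_f64(-7, 2)` yields `-1`, the same results as the C operators `/` and `%`.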
0.6 (22 Nov 2015 12:25) minor feature: Unified SIMD Assembler, additional SIMD targets
- Add new SIMD targets for SSE1, AVX1, AVX2.
- Add shifts by core register instructions.
- Add float-to-integer convert with mode parameter (x86, ARM + A32).
- Add signed-integer divide for ARM's A32 mode.
- Remap SIMD registers for ARM to avoid collision with div VFP fallback.
- Additional SIMD tests, run level 15.
0.5 (30 Sep 2014 19:18) major feature:
- instructions naming scheme finalized
- change ARM instructions to set flags
- added framework for internal constants (used by reciprocals)
- added SIMD instruction for cube root, reciprocal steps redesigned
- additional SIMD tests, run level 14