Exynos 5433 64-bit Cortex A57/A53: Krait Killer?

What is Exynos?

“Exynos” is the family of mobile SoCs from Samsung. Unlike Qualcomm’s competing Snapdragon SoCs, whose CPU cores are Qualcomm’s own “Krait” design and whose (GP)GPU core is Qualcomm’s own “Adreno”, the Exynos 5433 tested here contains standard ARM designs: Cortex A57+A53 CPU cores and a Mali GPU.

As the Snapdragon is the main rival here, a note on its naming: there are various series, with series 800 (Prime) representing the top of the range and lower numbered series (e.g. 600, 400, 200, etc.) representing lower performance. Within the same series, higher numbers (e.g. 805, 801, 800, etc.) represent a newer generation with generally better performance and more features.

Snapdragon’s CPU cores are called “Krait” and are Qualcomm’s own design under ARM licence; they are not standard ARM Cortex cores. The latest Krait 400 series shares many features with the Cortex A15, though some features are similar to the older Cortex A8/A9.

In this article we test CPU core (Cortex A57/A53) performance; GPGPU performance and cache and memory performance are covered in our other articles.

Hardware Specifications

We are comparing the internal CPU cores of various modern SoCs in the latest phones and tablets.

| SoC Specifications | Samsung Exynos 5433 / Samsung Galaxy Note 4C | Qualcomm Snapdragon 805 / Samsung Galaxy Note 4F | Qualcomm Snapdragon 801 / Sony Xperia Z3 | Qualcomm Snapdragon 600 / Samsung Galaxy S4 LTE | Samsung Exynos 5420 / Samsung Note 10 – 2014 Edition | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| CPU Arch / ARM Arch | Cortex A57+A53 ARMv8-A | Krait 450 (APQ8084) ARMv7-A | Krait 400 (MSM8974-AC) ARMv7-A | Krait 300 (MSM8960) ARMv7-A | Cortex A15+A7 ARMv7-A | While the Cortex A5x series is 64-bit, the OS of the Note 4 runs in 32-bit mode and thus runs normal ARMv7 code. It is unclear whether there will ever be a 64-bit version for this phone. |
| Cores (CU) / Threads (SP) | 4C + 4c / 8 threads simultaneously | 4C / 4 threads | 4C / 4 threads | 4C / 4 threads | 4C + 4c / 4 threads | Except for the Exynos SoCs, which are big.LITTLE with 4 big and 4 LITTLE cores, all CPUs are quad-core. However, the Exynos 5433 can actually run 8 threads at the same time vs. 4 threads for the other CPUs, including the older 5420. |
| Speed (Min / Max / Turbo) (MHz) | 400-1900 (400-1300 / 700-1900) | 300-2650 | 300-2466 | 384-1890 | 250-1900 (500-1300 / 600-1900) | We see the Krait 400-series pushing close to 3GHz while the Cortex designs hover around 2GHz, relying on per-clock compute power rather than clock speed. |
| L0D / L0I Caches (kB) | n/a | 4x 4kB | 4x 4kB | 4x 4kB | n/a | All Kraits have very small L0 caches while Cortex is a more traditional design. |
| L1D / L1I Caches (kB) | 2x 4x 32kB | 4x 16kB | 4x 16kB | 4x 16kB | 2x 4x 32kB | Cortex has 2x larger L1 caches than Krait, but they are supposedly a bit slower. |
| L2 Caches (MB) | 2MB | 2MB | 2MB | 2MB | 2MB | All designs have the same size L2 cache. |

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (Neon2, Neon, etc.).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.
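
The bracketed deltas quoted next to the results (e.g. [+45%]) can be reproduced from the raw scores. The following is a minimal Python sketch; the helper name `relative_delta` is our own for illustration, and the scores are copied from the results in this article. Rounding may differ by a point from the bracketed values.

```python
def relative_delta(winner: float, loser: float) -> float:
    """Return how much faster `winner` is than `loser`, in percent."""
    return (winner / loser - 1.0) * 100.0


# Scores quoted in this article: (Exynos 5433, Snapdragon 805).
scores = {
    "Native Dhrystone (GIPS)": (17.07, 17.24),
    "Native Int32 Multi-Media (Mpix/s)": (22.48, 15.5),
    "Native Binomial FP32 (kOPT/s)": (1.26, 2.03),
}

for name, (exynos, snapdragon) in scores.items():
    # Higher is better, so the larger score is the winner.
    if exynos >= snapdragon:
        print(f"{name}: 5433 ahead by {relative_delta(exynos, snapdragon):.0f}%")
    else:
        print(f"{name}: 805 ahead by {relative_delta(snapdragon, exynos):.0f}%")
```

Note that a delta of +100% corresponds to "2x as fast", which is why the larger wins are quoted as multipliers rather than percentages.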

Environment: Android 5.x.x, latest updates (May 2015).

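Native Benchmarks (result columns, left to right): Samsung Exynos 5433 / Cortex A57+A53; Qualcomm Snapdragon 805 / Krait 450; Qualcomm Snapdragon 801 / Krait 400; Qualcomm Snapdragon 600 / Krait 300; Samsung Exynos 5420 / Cortex A15+A7; then Comments. The bracketed figure marks the winner between the 5433 and the 805 and its lead.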
Exynos 5433 CPU Arithmetic
| CPU Arithmetic Benchmark | Exynos 5433 | Snapdragon 805 | Snapdragon 801 | Snapdragon 600 | Exynos 5420 | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| Native Dhrystone (GIPS) | 17.07 | 17.24 [+1%] | 14.7 | 10.3 | 14.5 | Here the 5433 and 805 are neck-and-neck, within 1% of each other. Despite its much higher clock (+40%), the Krait 450 just keeps up with the latest Cortex A57. |
| Native FP64 (Double) Whetstone (GFLOPS) | 90 [+7%] | 84 | 74 | 62 | 92 | The 5433 flexes its FP muscles here, being 7% faster than the 805 despite the latter’s much higher clock. While double-precision floating-point workloads are uncommon on mobiles/tablets, their use is increasing as more complex apps are ported. |
| Native FP32 (Float) Whetstone (GFLOPS) | 162 [+3%] | 157 | 136 | 108 | 73 | With FP32 VFP code, the 5433 is only 3% faster. |

Despite the 805’s much higher clock (+40%), the two CPUs are within 1-7% of each other. Naturally the 5433 also supports ARMv8 64-bit but is forced to run in legacy ARMv7 mode.
Exynos 5433 CPU Multi-Media
| CPU Multi-Media Vectorised SIMD | Exynos 5433 | Snapdragon 805 | Snapdragon 801 | Snapdragon 600 | Exynos 5420 | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| Native Integer (Int32) Multi-Media (Mpix/s) | 22.48 Neon [+45%] | 15.5 Neon | 12.4 Neon | 10.7 Neon | 15.1 Neon | Krait never seemed to do very well with SIMD (Neon) code and here we see the 5433 being 45% faster than the 805, the largest lead we’ve seen so far. ARM has really improved SIMD performance in modern Cortex cores, with the A15 already handily beating Krait designs. Qualcomm needs to overhaul the SIMD units to remain competitive. |
| Native Long (Int64) Multi-Media (Mpix/s) | 4.3 Neon [+67%] | 2.57 Neon | 2.19 Neon | 1.86 Neon | 2.47 Neon | With a 64-bit Neon workload we see the 5433 pull further ahead, 67% faster than the 805! For integer SIMD multi-media code, Cortex is the core to beat! |
| Native Quad-Int (Int128) Multi-Media (kpix/s) | 932 [+27%] | 730 | 681 | 520 | 577 | With normal (non-Neon) int64 code, the 5433 still leads but that lead falls to 27%. It would naturally do better if it were running a 64-bit OS. |
| Native Float/FP32 Multi-Media (Mpix/s) | 20.2 Neon [+25%] | 16.2 Neon | 14.13 Neon | 10.45 Neon | 13.57 Neon | Switching to floating-point Neon SIMD code, the 5433 is still 25% faster than the 805. |
| Native Double/FP64 Multi-Media (Mpix/s) | 7.59 [+31%] | 5.78 | 4.6 | 3.89 | 4.16 | Switching to FP64 VFP code (Neon supports FP64 only in ARMv8), the 5433 is still 31% faster than the 805. |
| Native Quad-Float/FP128 Multi-Media (kpix/s) | 301 [=] | 295 | 257 | 184 | 190 | In this heavy algorithm, which uses FP64 to mantissa-extend to FP128, the 5433 finally slows down and just matches the 805. Cortex’s power is realised with SIMD code. |

With highly-optimised Neon SIMD code, the Cortex A57 that powers the 5433 makes mincemeat of the 805’s Krait, being 25-67% faster despite the much lower clock speed. Qualcomm really needs to improve those SIMD units or risk being badly left behind. Naturally, if the 5433 were running in ARMv8 64-bit mode the difference would be much higher.
Exynos 5433 CPU Crypto
| CPU Crypto Benchmark | Exynos 5433 | Snapdragon 805 | Snapdragon 801 | Snapdragon 600 | Exynos 5420 | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| Crypto SHA2-512 (MB/s) | 118 Neon [+2.26x] | 52 Neon | 45 Neon | 32 Neon | 67 Neon | Starting with this tough 64-bit Neon SIMD accelerated hashing algorithm, the 5433 again flexes its SIMD muscles, beating the 805 by over 2x (2.26x faster). It shows just how much better modern Cortex cores are at executing SIMD code. |
| Crypto AES-256 (MB/s) | 136 | 147 [+8%] | 130 | 90 | 146 | In this non-SIMD workload, the 805 manages to be 8% faster, a surprising result. While the 5433 does support AES HWA, that is only in ARMv8 mode. |
| Crypto SHA2-256 (MB/s) | 332 Neon [+46%] | 227 Neon | 225 Neon | 148 Neon | 186 Neon | Switching to 32-bit Neon SIMD, the 5433 is on top again, beating the 805 by 46%. Again, Cortex A5x does support SHA HWA but only in ARMv8 mode. |
| Crypto AES-128 (MB/s) | 216 [+30%] | 166 | 145 | 109 | 165 | Fewer rounds do seem to make a bit of a difference, with the 5433 now winning by 30% over the 805. |
| Crypto SHA1 (MB/s) | 362 Neon [+14%] | 315 Neon | 250 Neon | 213 Neon | 297 Neon | SHA1 is the “lightest” compute workload and here the 5433 is only 14% faster. |

Again in SIMD Neon code the 5433 shows its power, beating the 805 by 14-126%, similar to what we saw in the Mandelbrot tests. Naturally the 5433 also supports both AES and SHA HWA, but only in ARMv8 mode, which needs a 64-bit OS. Here x86 does better as all instruction sets are available in both x86 and x64, unlike ARM, which conveniently seems to forget about the 32-bit world.
Exynos 5433 CPU Financial
| CPU Finance Benchmark | Exynos 5433 | Snapdragon 805 | Snapdragon 801 | Snapdragon 600 | Exynos 5420 | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| Black-Scholes float/FP32 (MOPT/s) | 11.79 [+42%] | 8.29 | 5.36 | 5.4 | 6.12 | Although this algorithm does not use SIMD, the 5433 still manages to handily beat the 805 by 42%. |
| Black-Scholes double/FP64 (MOPT/s) | 6.11 [+47%] | 4.14 | 2.83 | 3.24 | 3.28 | Switching over to FP64 code, the 5433 still manages to be 47% faster; the 805 just cannot get a break! |
| Binomial float/FP32 (kOPT/s) | 1.26 | 2.03 [+61%] | 1.76 | 1.29 | 1.98 | Binomial uses thread-shared data and thus stresses the cache and memory system; here we finally see the 805 pull ahead by 61%, a big win considering past results. |
| Binomial double/FP64 (kOPT/s) | 1.26 | 1.85 [+46%] | 1.71 | 1.53 | 2.41 | Switching to FP64 code, the 805 still wins but by just 46%. It seems this is the kind of algorithm it prefers. |
| Monte-Carlo float/FP32 (kOPT/s) | 2.51 [+2x] | 1.26 | 1.42 | 1.04 | 1.14 | Monte-Carlo also uses thread-shared data, but read-only, thus reducing modify pressure on the caches; the fortunes are reversed again as the 5433 is now 2x (twice) as fast as the 805. |
| Monte-Carlo double/FP64 (kOPT/s) | 1.87 [+2.08x] | 0.897 | 0.711 | 0.883 | 1.23 | And finally, FP64 code does not make any difference; again the 5433 is 2x as fast. |

The financial tests generally favour the 5433, which is 40-100% faster than the 805, except in the “tough” binomial test where the 805 is 40-60% faster. Even in VFP code the Cortex A5X is the core to beat!
Exynos 5433 CPU Science
| CPU Science Benchmark | Exynos 5433 | Snapdragon 805 | Snapdragon 801 | Snapdragon 600 | Exynos 5420 | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| SGEMM (MFLOPS) float/FP32 | 3906 Neon [+9%] | 3579 Neon | 3644 Neon | 2626 Neon | 4889 Neon | In this complex Neon SIMD workload we would expect the 5433 to lead, and it does, but only by 9%. It seems again that memory accesses slow it down and some of the 8 threads may be starving. |
| DGEMM (MFLOPS) double/FP64 | 1454 [+2.05x] | 707 | 797 | 547 | 531 | Neon does not support FP64 here, thus all CPUs use VFP code; the 5433 shows its power, being over 2x (twice) as fast as the 805. |
| SFFT (GFLOPS) float/FP32 | 720 Neon | 989 Neon [+37%] | 919 Neon | 620 Neon | 708 Neon | FFT also uses SIMD and thus Neon, but stresses the memory sub-system more: as we saw in Binomial, the 805 is in the lead, by 37%. |
| DFFT (GFLOPS) double/FP64 | 457 | 586 [+28%] | 550 | 401 | 399 | With FP64 VFP code, the 805 still leads by 28%. It seems the memory sub-system of the 5433 lets it down. |
| SNBODY (GFLOPS) float/FP32 | 758 Neon [+80%] | 420 Neon | 331 Neon | 342 Neon | 465 Neon | N-Body simulation is SIMD heavy and has many memory accesses to shared data, but these are read-only, which allows the 5433 to win again by 80%. It seems read-only data is not a problem, but read/modify/write is. |
| DNBODY (GFLOPS) double/FP64 | 339 [+46%] | 232 | 183 | 145 | 199 | With FP64 VFP code we see the 5433 still winning, but by just 46%. |

The results mirror what we saw in the Financial tests: whenever thread-shared memory is read/modified/written, the 5433 slows down; no doubt the extra 4 threads do not help matters and likely make things worse.
Exynos 5433 CPU Multi-Core
| CPU Multi-Core Benchmark | Exynos 5433 | Snapdragon 805 | Snapdragon 801 | Snapdragon 600 | Exynos 5420 | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| Inter-Core Bandwidth (MB/s) | 3994 [+10%] (but ~500/core) | 3599 (but ~899/core) | 2950 (but ~737/core) | 2133 (but ~533/core) | 1349 (but ~337/core) | One thing that Qualcomm does very well is memory performance, both CPU and GPU-wise. But here the 5433 has 4 more cores, which helps it muscle out its rival with 10% more bandwidth. While it technically wins, its bandwidth per core is just ~500MB/s while the 805 has ~899MB/s, almost 2x the bandwidth. We see how all these caches perform in the Snapdragon 805 Cache and Memory performance article. |
| Inter-Core Latency (ns) | 287 | 121 [-57%] | 118 | 162 | 128 | The 5433’s latency, however, is much higher; in other words the 805’s latency is 57% lower. It will be interesting to see whether this is due to transfers between different core types (e.g. big-2-LITTLE) or also occurs between cores of the same type (big-2-big / LITTLE-2-LITTLE). |
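
The per-core bandwidth and the latency delta quoted above are simple arithmetic on the measured totals. A minimal Python sketch follows; the function names are our own for illustration, and the thread counts (8 for the 5433, 4 for the 805) are taken from the specification table.

```python
# (total inter-core bandwidth in MB/s, threads run simultaneously, latency in ns)
RESULTS = {
    "Exynos 5433": (3994, 8, 287),
    "Snapdragon 805": (3599, 4, 121),
}


def per_thread_bandwidth(total_mb_s: float, threads: int) -> float:
    """Average bandwidth available to each thread/core, in MB/s."""
    return total_mb_s / threads


def latency_reduction(slower_ns: float, faster_ns: float) -> float:
    """Percentage by which the faster latency is lower than the slower one."""
    return (1.0 - faster_ns / slower_ns) * 100.0


for soc, (bandwidth, threads, latency) in RESULTS.items():
    share = per_thread_bandwidth(bandwidth, threads)
    print(f"{soc}: {share:.0f} MB/s per core, {latency} ns latency")

exynos_latency = RESULTS["Exynos 5433"][2]
snapdragon_latency = RESULTS["Snapdragon 805"][2]
print(f"805 latency is lower by {latency_reduction(exynos_latency, snapdragon_latency):.1f}%")
```

Dividing the total evenly across threads is itself a simplification: it assumes every core gets an equal share, which need not hold when big and LITTLE cores are mixed.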

The 5433, with its modern Cortex A5X design and 8 threads, walks all over the 805 despite being clocked much lower; in SIMD (Neon) tests especially it is up to 2x (twice) as fast. Only in algorithms that make extensive use of thread-shared data and read/modify/write it does the 805 catch a break and come out faster.
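
To put "clocked much lower" into numbers, the scores can be normalised by maximum clock. This is a rough Python sketch under two assumptions that the tests above do not verify: that each SoC sustains its maximum clock from the specification table during the run, and that a whole-chip score divided by clock is a fair comparison even though the 5433 runs 8 threads against 4. The names `MAX_CLOCK_GHZ`, `SCORES` and `per_ghz` are ours for illustration.

```python
# Maximum clocks from the specification table, in GHz.
MAX_CLOCK_GHZ = {"Exynos 5433": 1.9, "Snapdragon 805": 2.65}

# Native scores quoted in this article.
SCORES = {
    "Dhrystone (GIPS)": {"Exynos 5433": 17.07, "Snapdragon 805": 17.24},
    "Int32 Multi-Media (Mpix/s)": {"Exynos 5433": 22.48, "Snapdragon 805": 15.5},
}


def per_ghz(score: float, clock_ghz: float) -> float:
    """Whole-chip score per GHz of maximum clock (not a per-core figure)."""
    return score / clock_ghz


for test, by_soc in SCORES.items():
    for soc, score in by_soc.items():
        print(f"{test} / {soc}: {per_ghz(score, MAX_CLOCK_GHZ[soc]):.2f} per GHz")
```

Even in Dhrystone, where the raw scores are level, the 5433 delivers clearly more per GHz, which is consistent with the 805 needing its roughly 40% clock advantage just to keep up.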

It will be interesting to see whether the extra 4 threads (aka LITTLE cores) just get in the way in these tests and put too much strain on the memory system; effectively it may be better to use just 4 threads (aka big cores). We will investigate this in a future article.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java, which is what Android and its apps run. While key compute code will naturally be native, the rest of the code will run on the JVM.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

JVM Benchmarks (result columns, left to right): Samsung Exynos 5433 / Cortex A57+A53; Qualcomm Snapdragon 805 / Krait 450; Qualcomm Snapdragon 801 / Krait 400; Qualcomm Snapdragon 600 / Krait 300; Samsung Exynos 5420 / Cortex A15+A7; then Comments.
Exynos 5433 Java Arithmetic
| Java Arithmetic | Exynos 5433 | Snapdragon 805 | Snapdragon 801 | Snapdragon 600 | Exynos 5420 | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| Java Dhrystone (GIPS) | 13.4 [+56%] | 8.55 | 3.85 | 4.44 | 3.03 | Unlike native Dhrystone (where we saw a minor delta), the Java version is undoubtedly faster on the 5433, by over 50%, in a clear win. We expected the Krait to do better in integer workloads. |
| Java Whetstone double/FP64 (GFLOPS) | 88 [+76%] | 50 | 37 | 30 | 22 | For FP64 JVM code, the 5433 is now 76% faster. While there may not be many FP64 Java workloads, the performance is there if you need it. |
| Java Whetstone float/FP32 (GFLOPS) | 106 [+2.2x] | 48 | 45 | 41 | 28 | Switching to single-precision floating-point code, the 5433 is even faster, at over 2x the 805. |

The 5433’s advantage increases with every test, be it integer or floating-point, crushing the 805.
Exynos 5433 Java Vectorised
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 4668 [+48%] 3147 3803 2201 2675 While vectorised code would normally be native, there may be apps using normal Java code, and here the 5433 is almost 50% faster than the 805, which somehow ends up slower than its older 801 brother. We put this down to JVM/Android differences (5.0.1 vs. 5.0.2).
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 2803 [+46%] 1917 1141 1168 With a 64-bit integer vectorised workload, we see a similar delta of 46% in the 5433’s favour.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 6053 [+88%] 3224 2587 1806 Switching to single-precision (FP32) floating-point code, the delta increases again to 88%: the 5433 is almost 2x as fast!
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 6003 [+86%] 3219 2626 1636 We see the same thing here, with the 5433 enjoying an 86% advantage.
Vectorised Java code performs similarly to non-vectorised Java in the previous test.

While native code showed some surprises, here the 5433 is the undisputed champion – beating the 805 in all tests by a wide margin of 50% to over 100% (2x as fast). For pure Java apps the 5433 should feel a lot faster.

Final Thoughts / Conclusions

It is not really a surprise that the latest ARMv8 64-bit 8-core Cortex A57+A53 (albeit running in 32-bit ARMv7 mode) in Exynos 5433 would dominate the ageing Krait 400-series core in Snapdragon 805 – but the latter’s 40% higher clock could have thrown a few “wobblies”.

Unlike earlier big.LITTLE designs, all 8 cores can be used simultaneously – but this may actually present a problem when using static work allocators, as the “big” cores may wait for the “LITTLE” cores to finish, in effect behaving like 8 LITTLE cores. We will be exploring the differences in performance when using just the “big” cores, just the “LITTLE” cores or all of them in a future article.
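
The static-allocation problem can be illustrated with a toy model. The Python sketch below is not how the benchmarks above allocate work; the core speeds (a LITTLE core at half the speed of a big core) and the item count are hypothetical values chosen for illustration, and `static_makespan` / `dynamic_makespan` are our own helper names.

```python
import heapq


def static_makespan(n_items: int, speeds: list[float]) -> float:
    """Split the items equally between cores; the slowest core sets the finish time."""
    share = n_items / len(speeds)
    return max(share / speed for speed in speeds)


def dynamic_makespan(n_items: int, speeds: list[float]) -> float:
    """Work queue: each item is taken by whichever core becomes free first."""
    free_at = [(0.0, core) for core in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_items):
        start, core = heapq.heappop(free_at)
        done = start + 1.0 / speeds[core]  # time for this core to process one item
        finish = max(finish, done)
        heapq.heappush(free_at, (done, core))
    return finish


# Hypothetical speeds in items per time unit: 4 big cores and 4 LITTLE cores.
BIG, LITTLE = 1.0, 0.5
mixed = [BIG] * 4 + [LITTLE] * 4

print(static_makespan(1200, mixed))         # 300.0
print(static_makespan(1200, [BIG] * 4))     # 300.0
print(static_makespan(1200, [LITTLE] * 8))  # 300.0
print(dynamic_makespan(1200, mixed))        # 200.0
```

Under these assumed speeds, an equal static split across all 8 cores finishes no sooner than the 4 big cores alone, or than 8 LITTLE cores would, while a dynamic work queue finishes in two thirds of the time because the big cores pick up the slack.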

It is naturally a pity that the 5433 does not use a 64-bit Android version and thus benefit from all the ARMv8 improvements, not to mention new instruction sets like AES HWA, SHA HWA, FP64 Neon and so on. It seems that Samsung (like other vendors, we may add) may never actually release a 64-bit OS/ROM for it – and thus the 5433, like other Cortex A5x SoCs, is destined to run 32-bit for its whole life… Without 64-bit binary drivers there may not be a way for 3rd-party developers (modders?) to make a 64-bit OS either…

However, even under these circumstances the Exynos-powered Note 4 is the most powerful Note (CPU-wise) – though the roles seem to be reversed when comparing the GPUs, as we saw in the previous article Exynos 5433 (Mali) GPGPU performance. Thus the decision as to which Note 4 to choose is more difficult – do you want CPU or GPU power? As lots of compute tasks are moving to GPGPU (even on tablets/phones) – we would lean towards GPU prowess… Don’t forget to consider memory performance, which we’re investigating in the next article Exynos 5433 Cache and Memory performance.
