Julia program performance tuning

2019.11.15 杜岳華

Introduction

Writing a performant program is not easy

Many core computer science fields are involved...

  • Hardware part: CPU architecture
  • Operating systems
  • Software part: compiler

Here we introduce some basic, commonly used tricks for getting performance.

Statistics - correlation coefficient

$$ \rho = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$

Correlation

In [1]:
x = [2, 3, 4, 5, 6, 2, 3, 4, 5]
y = [32, 32, 5, 42, 6, 17, 19, 20, 24]
Out[1]:
9-element Array{Int64,1}:
 32
 32
  5
 42
  6
 17
 19
 20
 24

Correlation

$$ \rho = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$
In [2]:
n = length(x)
x̄ = sum(x) / n
ȳ = sum(y) / n
Out[2]:
21.88888888888889
In [3]:
x̂ = x .- x̄
ŷ = y .- ȳ
Out[3]:
9-element Array{Float64,1}:
  10.11111111111111  
  10.11111111111111  
 -16.88888888888889  
  20.11111111111111  
 -15.88888888888889  
  -4.888888888888889 
  -2.8888888888888893
  -1.8888888888888893
   2.1111111111111107

Correlation

$$ \rho = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$
In [4]:
ρ = sum(x̂ .* ŷ) / (sqrt(sum(x̂.^2))*sqrt(sum(ŷ.^2)))
Out[4]:
-0.20034374130204088
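As a sanity check (a supplementary example, not from the original slides), the standard-library `Statistics` package computes the same Pearson correlation:

```julia
using Statistics  # standard library

x = [2, 3, 4, 5, 6, 2, 3, 4, 5]
y = [32, 32, 5, 42, 6, 17, 19, 20, 24]

cor(x, y)  # ≈ -0.20034374130204088
```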

Abstraction with function

In [5]:
function correlation()
    n = length(x)
    @assert length(y) == n "Not matched sample size"
    x̄ = sum(x) / n
    ȳ = sum(y) / n
    x̂ = x .- x̄
    ŷ = y .- ȳ
    ρ = sum(x̂ .* ŷ) / (sqrt(sum(x̂.^2))*sqrt(sum(ŷ.^2)))
    return ρ
end
Out[5]:
correlation (generic function with 1 method)

Benchmark

In [6]:
using BenchmarkTools
In [7]:
correlation()
Out[7]:
-0.20034374130204088
In [8]:
@btime correlation()
  3.127 μs (28 allocations: 1.19 KiB)
Out[8]:
-0.20034374130204088

Benchmark

In [9]:
@benchmark correlation()
Out[9]:
BenchmarkTools.Trial: 
  memory estimate:  1.19 KiB
  allocs estimate:  28
  --------------
  minimum time:     3.155 μs (0.00% GC)
  median time:      3.286 μs (0.00% GC)
  mean time:        4.314 μs (18.39% GC)
  maximum time:     6.021 ms (99.88% GC)
  --------------
  samples:          10000
  evals/sample:     8

Avoid global variables

In [10]:
function correlation(x, y)
    n = length(x)
    @assert length(y) == n "Not matched sample size"
    x̄ = sum(x) / n
    ȳ = sum(y) / n
    x̂ = x .- x̄
    ŷ = y .- ȳ
    ρ = sum(x̂ .* ŷ) / (sqrt(sum(x̂.^2))*sqrt(sum(ŷ.^2)))
    return ρ
end
Out[10]:
correlation (generic function with 2 methods)
In [11]:
correlation(x, y)
Out[11]:
-0.20034374130204088

Avoid global variables

In [12]:
@btime correlation(x, y)
  244.085 ns (6 allocations: 816 bytes)
Out[12]:
-0.20034374130204088
In [13]:
@benchmark correlation(x, y)
Out[13]:
BenchmarkTools.Trial: 
  memory estimate:  816 bytes
  allocs estimate:  6
  --------------
  minimum time:     242.258 ns (0.00% GC)
  median time:      258.422 ns (0.00% GC)
  mean time:        309.224 ns (12.24% GC)
  maximum time:     113.919 μs (99.70% GC)
  --------------
  samples:          10000
  evals/sample:     434
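A related BenchmarkTools tip (a sketch, not from the original slides): if you do benchmark an expression that references a non-`const` global, interpolate it with `$` so the timing excludes the cost of the untyped global lookup:

```julia
using BenchmarkTools

x = rand(1000)

@btime sum(x)    # `x` is an untyped global: timing includes dynamic dispatch
@btime sum($x)   # interpolated: measures only `sum` itself
```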

Avoid copy things

Copying forces the program to allocate new memory.

In [14]:
x[2:8]
Out[14]:
7-element Array{Int64,1}:
 3
 4
 5
 6
 2
 3
 4
In [15]:
@btime x[2:8]
  56.149 ns (1 allocation: 144 bytes)
Out[15]:
7-element Array{Int64,1}:
 3
 4
 5
 6
 2
 3
 4

Avoid copy things

In [16]:
x_ = 5x .+ 1
Out[16]:
9-element Array{Int64,1}:
 11
 16
 21
 26
 31
 11
 16
 21
 26
In [17]:
@btime x_ = 5x .+ 1
  382.456 ns (4 allocations: 368 bytes)
Out[17]:
9-element Array{Int64,1}:
 11
 16
 21
 26
 31
 11
 16
 21
 26

View, copy and deepcopy

In [18]:
@btime x[2:8];
  54.950 ns (1 allocation: 144 bytes)
In [19]:
@btime view(x, 2:8);
  26.640 ns (1 allocation: 48 bytes)

View, copy and deepcopy

In [20]:
largeX = rand(100_000)
@btime largeX[10:10_010];  # copies 10,001 elements
  5.209 μs (2 allocations: 78.27 KiB)
In [21]:
@btime view(largeX, 10:10_010);
  25.960 ns (1 allocation: 48 bytes)
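The `@views` macro (part of Base, shown here as a supplementary sketch) converts every slice in an expression into a view, giving the no-copy behavior without calling `view` explicitly:

```julia
A = rand(1000)

s1 = sum(A[1:500])          # the slice allocates a 500-element copy
s2 = @views sum(A[1:500])   # the same slice becomes a view: no copy
s1 ≈ s2  # true
```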

View, copy and deepcopy

In [22]:
X1 = [[1, 2, 3], [4, 5]]
Out[22]:
2-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 [4, 5]   
In [23]:
X2 = copy(X1)  # shallow copy
Out[23]:
2-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 [4, 5]   
In [24]:
X2[1] === X1[1]
Out[24]:
true

View, copy and deepcopy

In [25]:
X3 = deepcopy(X1)
Out[25]:
2-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 [4, 5]   
In [26]:
X3[1] === X1[1]
Out[26]:
false

View, copy and deepcopy

In [27]:
@btime X2 = copy(X1);
  44.176 ns (1 allocation: 96 bytes)
In [28]:
@btime X3 = deepcopy(X1);
  474.327 ns (5 allocations: 672 bytes)

Avoid copy things

In [29]:
function correlation(x, y)
    n = length(x)
    @assert length(y) == n "Not matched sample size"
    x̄ = sum(x) / n
    ȳ = sum(y) / n
    x̂ŷ = (x .- x̄).*(y .- ȳ)
    x̂ = (x .- x̄).^2
    ŷ = (y .- ȳ).^2
    ρ = sum(x̂ŷ) / (sqrt(sum(x̂))*sqrt(sum(ŷ)))
    return ρ
end
Out[29]:
correlation (generic function with 2 methods)
In [30]:
correlation(x, y)
@benchmark correlation(x, y)
Out[30]:
BenchmarkTools.Trial: 
  memory estimate:  496 bytes
  allocs estimate:  4
  --------------
  minimum time:     178.035 ns (0.00% GC)
  median time:      184.145 ns (0.00% GC)
  mean time:        225.122 ns (11.98% GC)
  maximum time:     77.747 μs (99.59% GC)
  --------------
  samples:          10000
  evals/sample:     690
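Pushing this further, a hand-written loop (a sketch named `correlation_loop`, not in the original slides) avoids every intermediate array and therefore allocates nothing at all:

```julia
function correlation_loop(x, y)
    n = length(x)
    @assert length(y) == n "Not matched sample size"
    x̄ = sum(x) / n
    ȳ = sum(y) / n
    sxy = sxx = syy = 0.0
    for i in 1:n
        dx = x[i] - x̄   # deviations computed on the fly, no temporaries
        dy = y[i] - ȳ
        sxy += dx * dy
        sxx += dx * dx
        syy += dy * dy
    end
    return sxy / (sqrt(sxx) * sqrt(syy))
end

x = [2, 3, 4, 5, 6, 2, 3, 4, 5]
y = [32, 32, 5, 42, 6, 17, 19, 20, 24]
correlation_loop(x, y)  # ≈ -0.20034374130204088
```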

Memory pre-allocation

In [31]:
function prealloc(X)
    Y = similar(X, 10, 10)
    # Y = Array{SOMETYPE}(undef, 10, 10)
    # Y = fill(π, (2, 3, 4))
    
    # do something
    Y
end
Out[31]:
prealloc (generic function with 1 method)
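For instance, a function following the `!` naming convention can write into a buffer the caller allocated once (`scale!` here is a hypothetical helper):

```julia
# Hypothetical helper: writes a*X into the preallocated Y
function scale!(Y, X, a)
    @assert size(Y) == size(X)
    for i in eachindex(X)
        Y[i] = a * X[i]
    end
    return Y
end

X = rand(10, 10)
Y = similar(X)      # allocate once ...
scale!(Y, X, 2.0)   # ... then reuse Y across many calls, no new allocations
```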

Memory storage order of arrays is column-major in Julia

In [32]:
function row_major(X)
    s = 0.
    for i = 1:size(X, 1)
        for j = 1:size(X, 2)
            s += X[i, j]
        end
    end
    s
end
Out[32]:
row_major (generic function with 1 method)
In [33]:
X = rand(10_000, 10_000)
row_major(X)
@btime row_major(X)
  1.822 s (1 allocation: 16 bytes)
Out[33]:
4.9999165264804e7

Memory storage order of arrays is column-major in Julia

In [34]:
function col_major(X)
    s = 0.
    for j = 1:size(X, 2)
        for i = 1:size(X, 1)
            s += X[i, j]
        end
    end
    s
end
Out[34]:
col_major (generic function with 1 method)
In [35]:
col_major(X)
@btime col_major(X)
  99.203 ms (1 allocation: 16 bytes)
Out[35]:
4.999916526480132e7
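If you would rather not remember which index is innermost, `eachindex` iterates in the array's native (column-major) memory order; a small sketch:

```julia
function sum_all(X)
    s = 0.
    for i in eachindex(X)  # visits elements in memory order
        s += X[i]
    end
    s
end

X = rand(100, 100)
sum_all(X) ≈ sum(X)  # true
```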

Type stability

In [36]:
function type_instable(x)
    if x > 1
        return 5
    else
        return 5.0
    end
end
Out[36]:
type_instable (generic function with 1 method)
In [37]:
@code_warntype type_instable(4)
Variables
  #self#::Core.Compiler.Const(type_instable, false)
  x::Int64

Body::Union{Float64, Int64}
1 ─ %1 = (x > 1)::Bool
└──      goto #3 if not %1
2 ─      return 5
3 ─      return 5.0
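Making both branches return the same type removes the `Union`; a minimal fix (sketch):

```julia
function type_stable(x)
    if x > 1
        return 5.0   # was `5` (Int): now both branches return Float64
    else
        return 5.0
    end
end

# @code_warntype type_stable(4) now reports Body::Float64
```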

Union splitting

ret1 = function1(args...)
ret2 = function2(ret1, ...)
ret3 = function3(ret1, ret2, ...)

When inference fails, each return value is typed `Any` or `Union{...}`, so every subsequent call must be dynamically dispatched.

Union splitting

ret1 = function1(args...)    # ret1 isa Union{A,B}
if ret1 isa A
    ret2 = function2_specialized_for_A(ret1, ...)
    ret3 = function3_specialized_for_A(ret1, ret2, ...)
    ...
else
    ret2 = function2_specialized_for_B(ret1, ...)
    ret3 = function3_specialized_for_B(ret1, ret2, ...)
    ...
end

Type stability

In [38]:
@code_warntype correlation(x, y)
Variables
  #self#::Core.Compiler.Const(correlation, false)
  x::Array{Int64,1}
  y::Array{Int64,1}
  n::Int64
  x̄::Float64
  ȳ::Float64
  x̂ŷ::Array{Float64,1}
  x̂::Array{Float64,1}
  ŷ::Array{Float64,1}
  ρ::Float64

Body::Float64
1 ─       Core.NewvarNode(:(x̄))
       Core.NewvarNode(:(ȳ))
       Core.NewvarNode(:(x̂ŷ))
       Core.NewvarNode(:(x̂))
       Core.NewvarNode(:(ŷ))
       Core.NewvarNode(:(ρ))
       (n = Main.length(x))
 %8  = Main.length(y)::Int64
 %9  = (%8 == n)::Bool
└──       goto #3 if not %9
2 ─       goto #4
3 ─ %12 = Base.AssertionError("Not matched sample size")::AssertionError
└──       Base.throw(%12)
4 ┄ %14 = Main.sum(x)::Int64
       (x̄ = %14 / n)
 %16 = Main.sum(y)::Int64
       (ȳ = %16 / n)
 %18 = Base.broadcasted(Main.:-, x, x̄)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}}
 %19 = Base.broadcasted(Main.:-, y, ȳ)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}}
 %20 = Base.broadcasted(Main.:*, %18, %19)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}}}}
       (x̂ŷ = Base.materialize(%20))
 %22 = Base.broadcasted(Main.:-, x, x̄)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}}
 %23 = Core.apply_type(Base.Val, 2)::Core.Compiler.Const(Val{2}, false)
 %24 = (%23)()::Core.Compiler.Const(Val{2}(), false)
 %25 = Base.broadcasted(Base.literal_pow, Main.:^, %22, %24)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(Base.literal_pow),Tuple{Base.RefValue{typeof(^)},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}},Base.RefValue{Val{2}}}}
       (x̂ = Base.materialize(%25))
 %27 = Base.broadcasted(Main.:-, y, ȳ)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}}
 %28 = Core.apply_type(Base.Val, 2)::Core.Compiler.Const(Val{2}, false)
 %29 = (%28)()::Core.Compiler.Const(Val{2}(), false)
 %30 = Base.broadcasted(Base.literal_pow, Main.:^, %27, %29)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(Base.literal_pow),Tuple{Base.RefValue{typeof(^)},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}},Base.RefValue{Val{2}}}}
       (ŷ = Base.materialize(%30))
 %32 = Main.sum(x̂ŷ)::Float64
 %33 = Main.sum(x̂)::Float64
 %34 = Main.sqrt(%33)::Float64
 %35 = Main.sum(ŷ)::Float64
 %36 = Main.sqrt(%35)::Float64
 %37 = (%34 * %36)::Float64
       (ρ = %32 / %37)
└──       return ρ

Achieving type stability

There are some convenient functions to get type information.

In [39]:
X = rand(3, 4)
eltype(X)
Out[39]:
Float64

From parametric methods...

In [40]:
function foo(A::Array{T}) where {T}
    return zero(T)
end
Out[40]:
foo (generic function with 1 method)
In [41]:
foo(rand(3,4))
Out[41]:
0.0

Achieving type stability

There are some ways to generate objects with known type information.

In [42]:
zero(Float64)
Out[42]:
0.0
In [43]:
one(Int64)
Out[43]:
1
In [44]:
zero(Bool)
Out[44]:
false
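These come in handy for initializing accumulators: `zero(eltype(X))` keeps a reduction type-stable for any element type (a sketch named `mysum`):

```julia
function mysum(X)
    s = zero(eltype(X))  # accumulator type follows the element type
    for x in X
        s += x
    end
    return s
end

mysum([1, 2, 3])        # Int result: 6
mysum([1.0, 2.0, 3.0])  # Float64 result: 6.0
```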

Bounds checking

In [45]:
@btime col_major(X)
  33.016 ns (1 allocation: 16 bytes)
Out[45]:
6.069340471885451
In [46]:
function col_major2(X)
    s = 0.
    for j = 1:size(X, 2)
        @inbounds for i = 1:size(X, 1)
            s += X[i, j]
        end
    end
    s
end
Out[46]:
col_major2 (generic function with 1 method)
In [47]:
col_major2(X)
@btime col_major2(X)
  27.140 ns (1 allocation: 16 bytes)
Out[47]:
6.069340471885451
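On top of `@inbounds`, the `@simd` macro additionally permits the compiler to vectorize the inner loop (it may reorder floating-point additions); a supplementary sketch:

```julia
function col_major3(X)
    s = 0.
    for j = 1:size(X, 2)
        @inbounds @simd for i = 1:size(X, 1)
            s += X[i, j]
        end
    end
    s
end

X = rand(100, 100)
col_major3(X) ≈ sum(X)  # true, up to floating-point reassociation
```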

Parallelism and concurrency

CPU-bound jobs

  • Threading
  • Multiprocess

I/O-bound jobs

  • Threading
  • Multitasking
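A minimal sketch of the threading option for a CPU-bound loop (start Julia with e.g. `julia --threads=4`; `threaded_square!` is a hypothetical example):

```julia
function threaded_square!(out, x)
    Threads.@threads for i in eachindex(x)  # iterations split across threads
        out[i] = x[i]^2
    end
    return out
end

x = collect(1:8)
threaded_square!(similar(x), x)  # [1, 4, 9, 16, 25, 36, 49, 64]
```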

Design principles

Write API (functions) as generic as possible.

Design types as specific and dynamic as possible.

If you are not sure how specific a type will be, let it be general.

Write API (functions) as generic as possible

In [48]:
function matrix_multiply(X, Y)  # not "this" generic
    n = size(X, 2)
    @assert size(Y, 1) == n "Dimension mismatched!"
    Z = similar(X, size(X, 1), size(Y, 2))
    for i = 1:size(Z, 1), j = 1:size(Z, 2)
        Z[i, j] = sum(X[i, :] .* Y[:, j])
    end
    return Z
end
Out[48]:
matrix_multiply (generic function with 1 method)
In [49]:
function matrix_multiply(X::Array, Y::Array)  # good? generic?
    n = size(X, 2)
    @assert size(Y, 1) == n "Dimension mismatched!"
    Z = similar(X, size(X, 1), size(Y, 2))
    for i = 1:size(Z, 1), j = 1:size(Z, 2)
        Z[i, j] = sum(X[i, :] .* Y[:, j])
    end
    return Z
end
Out[49]:
matrix_multiply (generic function with 2 methods)

Write API (functions) as generic as possible

In [50]:
function matrix_multiply(X::AbstractArray{T,2}, Y::AbstractArray{T,2}) where {T}  # generic
    n = size(X, 2)
    @assert size(Y, 1) == n "Dimension mismatched!"
    Z = similar(X, size(X, 1), size(Y, 2))
    for i = 1:size(Z, 1), j = 1:size(Z, 2)
        Z[i, j] = sum(X[i, :] .* Y[:, j])
    end
    return Z
end
Out[50]:
matrix_multiply (generic function with 3 methods)
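The payoff of the `AbstractArray` signature is that one method accepts views, transposes, and other array types, not just `Array`. A self-contained sketch (`matmul`, with the row-times-column product written as an explicit `sum`):

```julia
function matmul(X::AbstractMatrix{T}, Y::AbstractMatrix{T}) where {T}
    @assert size(X, 2) == size(Y, 1) "Dimension mismatched!"
    Z = similar(X, size(X, 1), size(Y, 2))
    for i = 1:size(Z, 1), j = 1:size(Z, 2)
        Z[i, j] = sum(X[i, :] .* Y[:, j])
    end
    return Z
end

A = rand(3, 4); B = rand(4, 2)
matmul(A, B) ≈ A * B                        # plain Arrays
matmul(view(A, 1:2, :), B) ≈ A[1:2, :] * B  # a view works with the same method
```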

Design types as "specific" and dynamic as possible

In [51]:
struct Point
    x
    y
end
In [ ]:
struct Point
    x::Number
    y::Number
end

Design types as "specific" and dynamic as possible

In [53]:
x = Point(2, 3.4)
Out[53]:
Point(2, 3.4)
In [54]:
typeof(x)
Out[54]:
Point
In [55]:
typeof(x).uid
Out[55]:
50365
In [56]:
y = Point(2., 3.4)
Out[56]:
Point(2.0, 3.4)
In [57]:
typeof(y).uid
Out[57]:
50365

Design types as "specific" and dynamic as possible

In [1]:
struct Point{T<:Number}
    x::T
    y::T
end
In [2]:
x = Point(2, 3)
Out[2]:
Point{Int64}(2, 3)
In [3]:
y = Point(2., 3.)
Out[3]:
Point{Float64}(2.0, 3.0)

Design types as "specific" and dynamic as possible

In [4]:
typeof(x)
Out[4]:
Point{Int64}
In [5]:
typeof(x).uid
Out[5]:
47142
In [6]:
typeof(y)
Out[6]:
Point{Float64}
In [7]:
typeof(y).uid
Out[7]:
47289
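The distinct `uid`s confirm that each parameterization is a separate, concrete type, which is what lets the compiler specialize code for it; this can be checked directly (a supplementary sketch):

```julia
struct Point{T<:Number}
    x::T
    y::T
end

isconcretetype(Point)          # false: still abstract over T
isconcretetype(Point{Int64})   # true: fields have a known layout
isbitstype(Point{Int64})       # true: can live on the stack / in registers
```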

Advanced in hardware part

Memory hierarchy

Register and cache capacity is limited.

Cache miss

  • L1 cache: ~32 KB
  • L2 cache: 256 KB
  • L3 cache: 3072 KB

Advanced in software part

Overview of a compiler

Compile time

Runtime

Compiler in details

Julia compiler

Julia compiler

Julia runtime

Julia JIT compiler

Thank you for your attention