Julia program performance tuning

2019.11.15 杜岳華

Introduction

Writing a performant program is not easy

Many core computer science fields are involved...

  • Hardware part: CPU architecture
  • Operating systems
  • Software part: compiler

Here we introduce some basic, commonly used tricks for getting performance.

Statistics - correlation coefficient

$$ \rho = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$

Correlation

In [1]:
x = [2, 3, 4, 5, 6, 2, 3, 4, 5]
y = [32, 32, 5, 42, 6, 17, 19, 20, 24]
Out[1]:
9-element Array{Int64,1}:
 32
 32
  5
 42
  6
 17
 19
 20
 24

Correlation

$$ \rho = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$
In [2]:
n = length(x)
x̄ = sum(x) / n
ȳ = sum(y) / n
Out[2]:
21.88888888888889
In [3]:
x̂ = x .- x̄
ŷ = y .- ȳ
Out[3]:
9-element Array{Float64,1}:
  10.11111111111111  
  10.11111111111111  
 -16.88888888888889  
  20.11111111111111  
 -15.88888888888889  
  -4.888888888888889 
  -2.8888888888888893
  -1.8888888888888893
   2.1111111111111107

Correlation

$$ \rho = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$
In [4]:
ρ = sum(x̂ .* ŷ) / (sqrt(sum(x̂.^2))*sqrt(sum(ŷ.^2)))
Out[4]:
-0.20034374130204088
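As a sanity check (a supplementary example, not from the original slides), the standard-library `Statistics` package computes the same Pearson correlation:

```julia
using Statistics  # standard library

x = [2, 3, 4, 5, 6, 2, 3, 4, 5]
y = [32, 32, 5, 42, 6, 17, 19, 20, 24]

cor(x, y)  # ≈ -0.20034374130204088
```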

Abstraction with function

In [5]:
function correlation()
    n = length(x)
    @assert length(y) == n "Not matched sample size"
    x̄ = sum(x) / n
    ȳ = sum(y) / n
    x̂ = x .- x̄
    ŷ = y .- ȳ
    ρ = sum(x̂ .* ŷ) / (sqrt(sum(x̂.^2))*sqrt(sum(ŷ.^2)))
    return ρ
end
Out[5]:
correlation (generic function with 1 method)

Benchmark

In [6]:
using BenchmarkTools
In [7]:
correlation()
Out[7]:
-0.20034374130204088
In [8]:
@btime correlation()
  3.127 μs (28 allocations: 1.19 KiB)
Out[8]:
-0.20034374130204088

Benchmark

In [9]:
@benchmark correlation()
Out[9]:
BenchmarkTools.Trial: 
  memory estimate:  1.19 KiB
  allocs estimate:  28
  --------------
  minimum time:     3.155 μs (0.00% GC)
  median time:      3.286 μs (0.00% GC)
  mean time:        4.314 μs (18.39% GC)
  maximum time:     6.021 ms (99.88% GC)
  --------------
  samples:          10000
  evals/sample:     8

Avoid global variables

In [10]:
function correlation(x, y)
    n = length(x)
    @assert length(y) == n "Not matched sample size"
    x̄ = sum(x) / n
    ȳ = sum(y) / n
    x̂ = x .- x̄
    ŷ = y .- ȳ
    ρ = sum(x̂ .* ŷ) / (sqrt(sum(x̂.^2))*sqrt(sum(ŷ.^2)))
    return ρ
end
Out[10]:
correlation (generic function with 2 methods)
In [11]:
correlation(x, y)
Out[11]:
-0.20034374130204088

Avoid global variables

In [12]:
@btime correlation(x, y)
  244.085 ns (6 allocations: 816 bytes)
Out[12]:
-0.20034374130204088
In [13]:
@benchmark correlation(x, y)
Out[13]:
BenchmarkTools.Trial: 
  memory estimate:  816 bytes
  allocs estimate:  6
  --------------
  minimum time:     242.258 ns (0.00% GC)
  median time:      258.422 ns (0.00% GC)
  mean time:        309.224 ns (12.24% GC)
  maximum time:     113.919 μs (99.70% GC)
  --------------
  samples:          10000
  evals/sample:     434
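A related BenchmarkTools tip (a sketch, not from the original slides): if you do benchmark an expression that references a non-`const` global, interpolate it with `$` so the timing excludes the cost of the untyped global lookup:

```julia
using BenchmarkTools

x = rand(1000)

@btime sum(x)    # `x` is an untyped global: timing includes dynamic dispatch
@btime sum($x)   # interpolated: measures only `sum` itself
```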

Avoid copy things

Copying forces the program to allocate new memory.

In [14]:
x[2:8]
Out[14]:
7-element Array{Int64,1}:
 3
 4
 5
 6
 2
 3
 4
In [15]:
@btime x[2:8]
  56.149 ns (1 allocation: 144 bytes)
Out[15]:
7-element Array{Int64,1}:
 3
 4
 5
 6
 2
 3
 4

Avoid copy things

In [16]:
x_ = 5x .+ 1
Out[16]:
9-element Array{Int64,1}:
 11
 16
 21
 26
 31
 11
 16
 21
 26
In [17]:
@btime x_ = 5x .+ 1
  382.456 ns (4 allocations: 368 bytes)
Out[17]:
9-element Array{Int64,1}:
 11
 16
 21
 26
 31
 11
 16
 21
 26

View, copy and deepcopy

In [18]:
@btime x[2:8];
  54.950 ns (1 allocation: 144 bytes)
In [19]:
@btime view(x, 2:8);
  26.640 ns (1 allocation: 48 bytes)

View, copy and deepcopy

In [20]:
largeX = rand(100_000)
@btime largeX[10:10_010];  # copies 10,001 elements
  5.209 μs (2 allocations: 78.27 KiB)
In [21]:
@btime view(largeX, 10:10_010);
  25.960 ns (1 allocation: 48 bytes)
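The `@views` macro (part of Base, shown here as a supplementary sketch) converts every slice in an expression into a view, giving the no-copy behavior without calling `view` explicitly:

```julia
A = rand(1000)

s1 = sum(A[1:500])          # the slice allocates a 500-element copy
s2 = @views sum(A[1:500])   # the same slice becomes a view: no copy
s1 ≈ s2  # true
```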

View, copy and deepcopy

In [22]:
X1 = [[1, 2, 3], [4, 5]]
Out[22]:
2-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 [4, 5]   
In [23]:
X2 = copy(X1)  # shallow copy
Out[23]:
2-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 [4, 5]   
In [24]:
X2[1] === X1[1]
Out[24]:
true

View, copy and deepcopy

In [25]:
X3 = deepcopy(X1)
Out[25]:
2-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 [4, 5]   
In [26]:
X3[1] === X1[1]
Out[26]:
false

View, copy and deepcopy

In [27]:
@btime X2 = copy(X1);
  44.176 ns (1 allocation: 96 bytes)
In [28]:
@btime X3 = deepcopy(X1);
  474.327 ns (5 allocations: 672 bytes)

Avoid copy things

In [29]:
function correlation(x, y)
    n = length(x)
    @assert length(y) == n "Not matched sample size"
    x̄ = sum(x) / n
    ȳ = sum(y) / n
    x̂ŷ = (x .- x̄).*(y .- ȳ)
    x̂ = (x .- x̄).^2
    ŷ = (y .- ȳ).^2
    ρ = sum(x̂ŷ) / (sqrt(sum(x̂))*sqrt(sum(ŷ)))
    return ρ
end
Out[29]:
correlation (generic function with 2 methods)
In [30]:
correlation(x, y)
@benchmark correlation(x, y)
Out[30]:
BenchmarkTools.Trial: 
  memory estimate:  496 bytes
  allocs estimate:  4
  --------------
  minimum time:     178.035 ns (0.00% GC)
  median time:      184.145 ns (0.00% GC)
  mean time:        225.122 ns (11.98% GC)
  maximum time:     77.747 μs (99.59% GC)
  --------------
  samples:          10000
  evals/sample:     690
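Pushing this further, a hand-written loop (a sketch named `correlation_loop`, not in the original slides) avoids every intermediate array and therefore allocates nothing at all:

```julia
function correlation_loop(x, y)
    n = length(x)
    @assert length(y) == n "Not matched sample size"
    x̄ = sum(x) / n
    ȳ = sum(y) / n
    sxy = sxx = syy = 0.0
    for i in 1:n
        dx = x[i] - x̄   # deviations computed on the fly, no temporaries
        dy = y[i] - ȳ
        sxy += dx * dy
        sxx += dx * dx
        syy += dy * dy
    end
    return sxy / (sqrt(sxx) * sqrt(syy))
end

x = [2, 3, 4, 5, 6, 2, 3, 4, 5]
y = [32, 32, 5, 42, 6, 17, 19, 20, 24]
correlation_loop(x, y)  # ≈ -0.20034374130204088
```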

Memory pre-allocation

In [31]:
function prealloc(X)
    Y = similar(X, 10, 10)
    # Y = Array{SOMETYPE}(undef, 10, 10)
    # Y = fill(π, (2, 3, 4))
    
    # do something
    Y
end
Out[31]:
prealloc (generic function with 1 method)
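For instance, a function following the `!` naming convention can write into a buffer the caller allocated once (`scale!` here is a hypothetical helper):

```julia
# Hypothetical helper: writes a*X into the preallocated Y
function scale!(Y, X, a)
    @assert size(Y) == size(X)
    for i in eachindex(X)
        Y[i] = a * X[i]
    end
    return Y
end

X = rand(10, 10)
Y = similar(X)      # allocate once ...
scale!(Y, X, 2.0)   # ... then reuse Y across many calls, no new allocations
```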

Memory storage order of arrays is column-major in Julia

In [32]:
function row_major(X)
    s = 0.
    for i = 1:size(X, 1)
        for j = 1:size(X, 2)
            s += X[i, j]
        end
    end
    s
end
Out[32]:
row_major (generic function with 1 method)
In [33]:
X = rand(10_000, 10_000)
row_major(X)
@btime row_major(X)
  1.822 s (1 allocation: 16 bytes)
Out[33]:
4.9999165264804e7

Memory storage order of arrays is column-major in Julia

In [34]:
function col_major(X)
    s = 0.
    for j = 1:size(X, 2)
        for i = 1:size(X, 1)
            s += X[i, j]
        end
    end
    s
end
Out[34]:
col_major (generic function with 1 method)
In [35]:
col_major(X)
@btime col_major(X)
  99.203 ms (1 allocation: 16 bytes)
Out[35]:
4.999916526480132e7
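If you would rather not remember which index is innermost, `eachindex` iterates in the array's native (column-major) memory order; a small sketch:

```julia
function sum_all(X)
    s = 0.
    for i in eachindex(X)  # visits elements in memory order
        s += X[i]
    end
    s
end

X = rand(100, 100)
sum_all(X) ≈ sum(X)  # true
```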

Type stability

In [36]:
function type_instable(x)
    if x > 1
        return 5
    else
        return 5.0
    end
end
Out[36]:
type_instable (generic function with 1 method)
In [37]:
@code_warntype type_instable(4)
Variables
  #self#::Core.Compiler.Const(type_instable, false)
  x::Int64

Body::Union{Float64, Int64}
1 ─ %1 = (x > 1)::Bool
└──      goto #3 if not %1
2 ─      return 5
3 ─      return 5.0
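Making both branches return the same type removes the `Union`; a minimal fix (sketch):

```julia
function type_stable(x)
    if x > 1
        return 5.0   # was `5` (Int): now both branches return Float64
    else
        return 5.0
    end
end

# @code_warntype type_stable(4) now reports Body::Float64
```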

Union splitting

ret1 = function1(args...)
ret2 = function2(ret1, ...)
ret3 = function3(ret1, ret2, ...)

When inference fails, each return value is typed `Any` or `Union{...}`, so every subsequent call must be dynamically dispatched.

Union splitting

ret1 = function1(args...)    # ret1 isa Union{A,B}
if ret1 isa A
    ret2 = function2_specialized_for_A(ret1, ...)
    ret3 = function3_specialized_for_A(ret1, ret2, ...)
    ...
else
    ret2 = function2_specialized_for_B(ret1, ...)
    ret3 = function3_specialized_for_B(ret1, ret2, ...)
    ...
end

Type stability

In [38]:
@code_warntype correlation(x, y)
Variables
  #self#::Core.Compiler.Const(correlation, false)
  x::Array{Int64,1}
  y::Array{Int64,1}
  n::Int64
  x̄::Float64
  ȳ::Float64
  x̂ŷ::Array{Float64,1}
  x̂::Array{Float64,1}
  ŷ::Array{Float64,1}
  ρ::Float64

Body::Float64
1 ─       Core.NewvarNode(:(x̄))
       Core.NewvarNode(:(ȳ))
       Core.NewvarNode(:(x̂ŷ))
       Core.NewvarNode(:(x̂))
       Core.NewvarNode(:(ŷ))
       Core.NewvarNode(:(ρ))
       (n = Main.length(x))
 %8  = Main.length(y)::Int64
 %9  = (%8 == n)::Bool
└──       goto #3 if not %9
2 ─       goto #4
3 ─ %12 = Base.AssertionError("Not matched sample size")::AssertionError
└──       Base.throw(%12)
4 ┄ %14 = Main.sum(x)::Int64
       (x̄ = %14 / n)
 %16 = Main.sum(y)::Int64
       (ȳ = %16 / n)
 %18 = Base.broadcasted(Main.:-, x, x̄)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}}
 %19 = Base.broadcasted(Main.:-, y, ȳ)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}}
 %20 = Base.broadcasted(Main.:*, %18, %19)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}}}}
       (x̂ŷ = Base.materialize(%20))
 %22 = Base.broadcasted(Main.:-, x, x̄)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}}
 %23 = Core.apply_type(Base.Val, 2)::Core.Compiler.Const(Val{2}, false)
 %24 = (%23)()::Core.Compiler.Const(Val{2}(), false)
 %25 = Base.broadcasted(Base.literal_pow, Main.:^, %22, %24)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(Base.literal_pow),Tuple{Base.RefValue{typeof(^)},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}},Base.RefValue{Val{2}}}}
       (x̂ = Base.materialize(%25))
 %27 = Base.broadcasted(Main.:-, y, ȳ)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}}
 %28 = Core.apply_type(Base.Val, 2)::Core.Compiler.Const(Val{2}, false)
 %29 = (%28)()::Core.Compiler.Const(Val{2}(), false)
 %30 = Base.broadcasted(Base.literal_pow, Main.:^, %27, %29)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(Base.literal_pow),Tuple{Base.RefValue{typeof(^)},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(-),Tuple{Array{Int64,1},Float64}},Base.RefValue{Val{2}}}}
       (ŷ = Base.materialize(%30))
 %32 = Main.sum(x̂ŷ)::Float64
 %33 = Main.sum(x̂)::Float64
 %34 = Main.sqrt(%33)::Float64
 %35 = Main.sum(ŷ)::Float64
 %36 = Main.sqrt(%35)::Float64
 %37 = (%34 * %36)::Float64
       (ρ = %32 / %37)
└──       return ρ

Achieving type stability

There are some convenient functions to get type information.

In [39]:
X = rand(3, 4)
eltype(X)
Out[39]:
Float64

From parametric methods...

In [40]:
function foo(A::Array{T}) where {T}
    return zero(T)
end
Out[40]:
foo (generic function with 1 method)
In [41]:
foo(rand(3,4))
Out[41]:
0.0

Achieving type stability

There are some ways to generate objects with known type information.

In [42]:
zero(Float64)
Out[42]:
0.0
In [43]:
one(Int64)
Out[43]:
1
In [44]:
zero(Bool)
Out[44]:
false
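These come in handy for initializing accumulators: `zero(eltype(X))` keeps a reduction type-stable for any element type (a sketch named `mysum`):

```julia
function mysum(X)
    s = zero(eltype(X))  # accumulator type follows the element type
    for x in X
        s += x
    end
    return s
end

mysum([1, 2, 3])        # Int result: 6
mysum([1.0, 2.0, 3.0])  # Float64 result: 6.0
```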

Bounds checking

In [45]:
@btime col_major(X)
  33.016 ns (1 allocation: 16 bytes)
Out[45]:
6.069340471885451
In [46]:
function col_major2(X)
    s = 0.
    for j = 1:size(X, 2)
        @inbounds for i = 1:size(X, 1)
            s += X[i, j]
        end
    end
    s
end
Out[46]:
col_major2 (generic function with 1 method)
In [47]:
col_major2(X)
@btime col_major2(X)
  27.140 ns (1 allocation: 16 bytes)
Out[47]:
6.069340471885451
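On top of `@inbounds`, the `@simd` macro additionally permits the compiler to vectorize the inner loop (it may reorder floating-point additions); a supplementary sketch:

```julia
function col_major3(X)
    s = 0.
    for j = 1:size(X, 2)
        @inbounds @simd for i = 1:size(X, 1)
            s += X[i, j]
        end
    end
    s
end

X = rand(100, 100)
col_major3(X) ≈ sum(X)  # true, up to floating-point reassociation
```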

Parallelism and concurrency

CPU-bound jobs

  • Threading
  • Multiprocess

I/O-bound jobs

  • Threading
  • Multitasking
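A minimal sketch of the threading option for a CPU-bound loop (start Julia with e.g. `julia --threads=4`; `threaded_square!` is a hypothetical example):

```julia
function threaded_square!(out, x)
    Threads.@threads for i in eachindex(x)  # iterations split across threads
        out[i] = x[i]^2
    end
    return out
end

x = collect(1:8)
threaded_square!(similar(x), x)  # [1, 4, 9, 16, 25, 36, 49, 64]
```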

Design principles

Write API (functions) as generic as possible.

Design types as specific and dynamic as possible.

If you are not sure how specific a type will be, let it be general.

Write API (functions) as generic as possible

In [48]:
function matrix_multiply(X, Y)  # not "this" generic
    n = size(X, 2)
    @assert size(Y, 1) == n "Dimension mismatched!"
    Z = similar(X, size(X, 1), size(Y, 2))
    for i = 1:size(Z, 1), j = 1:size(Z, 2)
        Z[i, j] = sum(X[i, :] .* Y[:, j])
    end
    return Z
end
Out[48]:
matrix_multiply (generic function with 1 method)
In [49]:
function matrix_multiply(X::Array, Y::Array)  # good? generic?
    n = size(X, 2)
    @assert size(Y, 1) == n "Dimension mismatched!"
    Z = similar(X, size(X, 1), size(Y, 2))
    for i = 1:size(Z, 1), j = 1:size(Z, 2)
        Z[i, j] = sum(X[i, :] .* Y[:, j])
    end
    return Z
end
Out[49]:
matrix_multiply (generic function with 2 methods)

Write API (functions) as generic as possible

In [50]:
function matrix_multiply(X::AbstractArray{T,2}, Y::AbstractArray{T,2}) where {T}  # generic
    n = size(X, 2)
    @assert size(Y, 1) == n "Dimension mismatched!"
    Z = similar(X, size(X, 1), size(Y, 2))
    for i = 1:size(Z, 1), j = 1:size(Z, 2)
        Z[i, j] = sum(X[i, :] .* Y[:, j])
    end
    return Z
end
Out[50]:
matrix_multiply (generic function with 3 methods)
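The payoff of the `AbstractArray` signature is that one method accepts views, transposes, and other array types, not just `Array`. A self-contained sketch (`matmul`, with the row-times-column product written as an explicit `sum`):

```julia
function matmul(X::AbstractMatrix{T}, Y::AbstractMatrix{T}) where {T}
    @assert size(X, 2) == size(Y, 1) "Dimension mismatched!"
    Z = similar(X, size(X, 1), size(Y, 2))
    for i = 1:size(Z, 1), j = 1:size(Z, 2)
        Z[i, j] = sum(X[i, :] .* Y[:, j])
    end
    return Z
end

A = rand(3, 4); B = rand(4, 2)
matmul(A, B) ≈ A * B                        # plain Arrays
matmul(view(A, 1:2, :), B) ≈ A[1:2, :] * B  # a view works with the same method
```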

Design types as "specific" and dynamic as possible

In [51]:
struct Point
    x
    y
end
In [ ]:
struct Point
    x::Number
    y::Number
end

Design types as "specific" and dynamic as possible

In [53]:
x = Point(2, 3.4)
Out[53]:
Point(2, 3.4)
In [54]:
typeof(x)
Out[54]:
Point
In [55]:
typeof(x).uid
Out[55]:
50365
In [56]:
y = Point(2., 3.4)
Out[56]:
Point(2.0, 3.4)
In [57]:
typeof(y).uid
Out[57]:
50365

Design types as "specific" and dynamic as possible

In [1]:
struct Point{T<:Number}
    x::T
    y::T
end
In [2]:
x = Point(2, 3)
Out[2]:
Point{Int64}(2, 3)
In [3]:
y = Point(2., 3.)
Out[3]:
Point{Float64}(2.0, 3.0)

Design types as "specific" and dynamic as possible

In [4]:
typeof(x)
Out[4]:
Point{Int64}
In [5]:
typeof(x).uid
Out[5]:
47142
In [6]:
typeof(y)
Out[6]:
Point{Float64}
In [7]:
typeof(y).uid
Out[7]:
47289
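The distinct `uid`s confirm that each parameterization is a separate, concrete type, which is what lets the compiler specialize code for it; this can be checked directly (a supplementary sketch):

```julia
struct Point{T<:Number}
    x::T
    y::T
end

isconcretetype(Point)          # false: still abstract over T
isconcretetype(Point{Int64})   # true: fields have a known layout
isbitstype(Point{Int64})       # true: can live on the stack / in registers
```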

Advanced in hardware part

Memory hierarchy

Register and cache capacity is limited.

Cache miss

  • L1 cache: ~32 KB
  • L2 cache: 256 KB
  • L3 cache: 3072 KB

Advanced in software part

Overview of a compiler

Compile time

Runtime

Compiler in details

Julia compiler

Julia compiler

Julia runtime

Julia JIT compiler

Thank you for your attention