Data Science in Julia

Data manipulation

by Yueh-Hua Tu

Outline

  • Missings
  • DataFrames
  • Query

DataFrames

In [1]:
using Statistics
using DataFrames, CSV, Query
using RDatasets
In [2]:
using Pkg
Pkg.status(["DataFrames", "CSV", "Query", "RDatasets"])
Status `~/.julia/environments/v1.4/Project.toml`
  [336ed68f] CSV v0.6.2
  [a93c6f00] DataFrames v0.21.1
  [1a8c2f83] Query v0.12.2
  [ce6b1742] RDatasets v0.6.8

Missing

  • missing is used to represent a missing value in Julia, and it is a built-in object.
  • missing is the singleton of Missing type.
  • Whenever operations including missing are undetermined will return missing.
In [3]:
true && missing
Out[3]:
missing
In [4]:
1 + missing
Out[4]:
missing

Short-cut logic

In [5]:
false && missing
Out[5]:
false
In [6]:
true || missing
Out[6]:
true
In [7]:
ismissing(missing)
Out[7]:
true

DataFrame

  • DataFrame is a table-like data structure containing a series of vectors.
  • Each column corresponds to a variable and has the same length.
In [8]:
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
Out[8]:

4 rows × 2 columns

AB
Int64String
11M
22F
33F
44M
In [9]:
df = DataFrame()  # 先初始化一個空的,再放入資料
df[!, :A] = 1:8
df[!, :B] = ["M", "F", "F", "M", "F", "M", "M", "F"]
df
Out[9]:

8 rows × 2 columns

AB
Int64String
11M
22F
33F
44M
55F
66M
77M
88F

Initialize by rows

In [10]:
df = DataFrame(A=Int64[], B=String[])
Out[10]:

0 rows × 2 columns

AB
Int64String
In [11]:
push!(df, (1, "M"))
Out[11]:

1 rows × 2 columns

AB
Int64String
11M
In [12]:
push!(df, [2, "F"])
Out[12]:

2 rows × 2 columns

AB
Int64String
11M
22F
In [13]:
push!(df, Dict(:B => "M", :A => 4))
Out[13]:

3 rows × 2 columns

AB
Int64String
11M
22F
34M

Initialize by matrix

In [14]:
mat = [1 "M"; 2 "F"; 3 "F"]
DataFrame(mat, [:A, :B])
Out[14]:

3 rows × 2 columns

AB
AnyAny
11M
22F
33F

Initialize by dictionary

In [15]:
d = Dict("A" => [1, 2, 3], "B" => ["M", "F", "F"])
DataFrame(d)
Out[15]:

3 rows × 2 columns

AB
Int64String
11M
22F
33F

Get column

In [16]:
df[!, :A]
Out[16]:
3-element Array{Int64,1}:
 1
 2
 4
In [17]:
df[!, 1]
Out[17]:
3-element Array{Int64,1}:
 1
 2
 4
In [18]:
df.A
Out[18]:
3-element Array{Int64,1}:
 1
 2
 4

Indexing by integer

In [19]:
df[1, 1]
Out[19]:
1

Indexing by column name

In [20]:
df[1, :A]
Out[20]:
1

Dimension information

In [21]:
size(df)
Out[21]:
(3, 2)
In [22]:
nrow(df)
Out[22]:
3
In [23]:
ncol(df)
Out[23]:
2

Get column names

In [24]:
names(df)
Out[24]:
2-element Array{String,1}:
 "A"
 "B"
In [25]:
df[1:3, :]
Out[25]:

3 rows × 2 columns

AB
Int64String
11M
22F
34M

Get slice

In [26]:
df[1:3, [:B, :A]]
Out[26]:

3 rows × 2 columns

BA
StringInt64
1M1
2F2
3M4

Assign column

In [27]:
df[!, :C] = ['a', 'b', 'c']
df
Out[27]:

3 rows × 3 columns

ABC
Int64StringChar
11M'a'
22F'b'
34M'c'
In [28]:
df[!, :B] = ['α', 'β', 'γ']
df
Out[28]:

3 rows × 3 columns

ABC
Int64CharChar
11'α''a'
22'β''b'
34'γ''c'

Conditioning

In [29]:
df[df[!,:A] .% 2 .== 0, :]  # 取符合條件的列,跟所有欄位
Out[29]:

2 rows × 3 columns

ABC
Int64CharChar
12'β''b'
24'γ''c'

Calculation

In [30]:
df[!, :D] = df[!, :B] .* df[!, :C]
df
Out[30]:

3 rows × 4 columns

ABCD
Int64CharCharString
11'α''a'αa
22'β''b'βb
34'γ''c'γc

Aggregation

In [31]:
mean(df[!,1])  # 對第1欄取平均
Out[31]:
2.3333333333333335
In [32]:
median(df[!,:A])  # 對第一個欄位取中位數
Out[32]:
2.0
In [33]:
df = DataFrame(A = 1:4, B = randn(4))
mapcols(cumsum, df)
Out[33]:

4 rows × 2 columns

AB
Int64Float64
110.238524
231.09133
360.384994
4100.286134

Storage

In [34]:
CSV.write("test.csv", df)
Out[34]:
"test.csv"
In [35]:
df = CSV.read("test.csv")
Out[35]:

4 rows × 2 columns

AB
Int64Float64
110.238524
220.852803
33-0.706332
44-0.0988603

Play with data

In [36]:
first(RDatasets.datasets(), 6)  # 可以選你要的資料集
Out[36]:

6 rows × 5 columns

PackageDatasetTitleRowsColumns
StringStringStringInt64Int64
1COUNTaffairsaffairs60118
2COUNTazdrg112azdrg11217984
3COUNTazproazpro35896
4COUNTbadhealthbadhealth11273
5COUNTfasttrakgfasttrakg159
6COUNTlbwlbw18910
In [37]:
iris = dataset("datasets", "iris")
first(iris, 6)
Out[37]:

6 rows × 5 columns

SepalLengthSepalWidthPetalLengthPetalWidthSpecies
Float64Float64Float64Float64Cat…
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
65.43.91.70.4setosa
In [38]:
size(iris)  # 看一下他有幾列幾行
Out[38]:
(150, 5)

Concatenate DataFrames

In [39]:
size(vcat(iris, iris))
Out[39]:
(300, 5)
In [40]:
size(hcat(iris, iris, makeunique=true))
Out[40]:
(150, 10)

Check if missing exists in each row

In [41]:
completecases(iris)  # true means no missing
Out[41]:
150-element BitArray{1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1

Select complete cases

In [42]:
first(iris[completecases(iris), :], 10)  # equivalent to complete_cases!(iris)
Out[42]:

10 rows × 5 columns

SepalLengthSepalWidthPetalLengthPetalWidthSpecies
Float64Float64Float64Float64Cat…
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
65.43.91.70.4setosa
74.63.41.40.3setosa
85.03.41.50.2setosa
94.42.91.40.2setosa
104.93.11.50.1setosa

Get unique combinations

In [43]:
first(unique(iris), 10)
Out[43]:

10 rows × 5 columns

SepalLengthSepalWidthPetalLengthPetalWidthSpecies
Float64Float64Float64Float64Cat…
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
65.43.91.70.4setosa
74.63.41.40.3setosa
85.03.41.50.2setosa
94.42.91.40.2setosa
104.93.11.50.1setosa
In [44]:
iris[!,:PetalArea] = iris[!,:PetalLength] .* iris[!,:PetalWidth]
first(iris)
Out[44]:

DataFrameRow (6 columns)

SepalLengthSepalWidthPetalLengthPetalWidthSpeciesPetalArea
Float64Float64Float64Float64Cat…Float64
15.13.51.40.2setosa0.28

Sorting

In [45]:
first(sort!(iris, :SepalLength), 10)
Out[45]:

10 rows × 6 columns

SepalLengthSepalWidthPetalLengthPetalWidthSpeciesPetalArea
Float64Float64Float64Float64Cat…Float64
14.33.01.10.1setosa0.11
24.42.91.40.2setosa0.28
34.43.01.30.2setosa0.26
44.43.21.30.2setosa0.26
54.52.31.30.3setosa0.39
64.63.11.50.2setosa0.3
74.63.41.40.3setosa0.42
84.63.61.00.2setosa0.2
94.63.21.40.2setosa0.28
104.73.21.30.2setosa0.26

Sorting multiple columns and specify the order

In [46]:
first(sort!(iris, [:Species, :SepalLength, :SepalWidth], rev=[true, false, false]), 10)
Out[46]:

10 rows × 6 columns

SepalLengthSepalWidthPetalLengthPetalWidthSpeciesPetalArea
Float64Float64Float64Float64Cat…Float64
14.92.54.51.7virginica7.65
25.62.84.92.0virginica9.8
35.72.55.02.0virginica10.0
45.82.75.11.9virginica9.69
55.82.75.11.9virginica9.69
65.82.85.12.4virginica12.24
75.93.05.11.8virginica9.18
86.02.25.01.5virginica7.5
96.03.04.81.8virginica8.64
106.12.65.61.4virginica7.84

Join

In [47]:
employees = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Andy Doe"])
Out[47]:

3 rows × 2 columns

IDName
Int64String
11John Doe
22Jane Doe
33Andy Doe
In [48]:
jobs = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Chief"])
Out[48]:

3 rows × 2 columns

IDJob
Int64String
11Lawyer
22Doctor
34Chief

Inner join

In [49]:
full = innerjoin(employees, jobs, on=:ID)
Out[49]:

2 rows × 3 columns

IDNameJob
Int64StringString
11John DoeLawyer
22Jane DoeDoctor

Left join

In [50]:
left_join = leftjoin(employees, jobs, on=:ID)
Out[50]:

3 rows × 3 columns

IDNameJob
Int64StringString?
11John DoeLawyer
22Jane DoeDoctor
33Andy Doemissing

Right join

In [51]:
right_join = rightjoin(employees, jobs, on=:ID)
Out[51]:

3 rows × 3 columns

IDNameJob
Int64String?String
11John DoeLawyer
22Jane DoeDoctor
34missingChief

Outer join

In [52]:
outer_join = outerjoin(employees, jobs, on=:ID)
Out[52]:

4 rows × 3 columns

IDNameJob
Int64String?String?
11John DoeLawyer
22Jane DoeDoctor
33Andy Doemissing
44missingChief

Other join

  • Semi join: similar to inner join,but only outputs left table
  • Anti join
  • Cross join: Cartesian product
In [53]:
cross_join = crossjoin(employees, jobs, makeunique=true)  # 不需要key
Out[53]:

9 rows × 4 columns

IDNameID_1Job
Int64StringInt64String
11John Doe1Lawyer
21John Doe2Doctor
31John Doe4Chief
42Jane Doe1Lawyer
52Jane Doe2Doctor
62Jane Doe4Chief
73Andy Doe1Lawyer
83Andy Doe2Doctor
93Andy Doe4Chief

Split-Apply-Combine Strategy

Split by :Species, calculate size()

arguments

  1. DataFrame
  2. The column which you like to split
  3. The function or expression to apply to each group of data
In [54]:
by(iris, :Species, size)
┌ Warning: `by(d::AbstractDataFrame, cols::Any, f::Base.Callable; sort::Bool = false, skipmissing::Bool = false)` is deprecated, use `combine(f, groupby(d, cols, sort = sort, skipmissing = skipmissing))` instead.
│   caller = top-level scope at In[54]:1
└ @ Core In[54]:1
Out[54]:

3 rows × 2 columns

Speciesx1
Cat…Tuple…
1setosa(50, 6)
2versicolor(50, 6)
3virginica(50, 6)

More complex situation

In [55]:
by(iris, :Species) do df
    DataFrame(μ = mean(df[!,:PetalLength]), σ = var(df[!,:PetalLength]))
end
┌ Warning: `by(f::Base.Callable, d::AbstractDataFrame, cols::Any; sort::Bool = false, skipmissing::Bool = false)` is deprecated, use `combine(f, groupby(d, cols, sort = sort, skipmissing = skipmissing))` instead.
│   caller = top-level scope at In[55]:1
└ @ Core In[55]:1
Out[55]:

3 rows × 3 columns

Speciesμσ
Cat…Float64Float64
1setosa1.4620.0301592
2versicolor4.260.220816
3virginica5.5520.304588

Group by :Species, calculate sum and average

arguments

  1. DataFrame
  2. The column which you like to split
  3. The function or expression to apply to each group of data
In [56]:
aggregate(iris, :Species, [sum, mean])
┌ Warning: `aggregate(d, cols, fs, sort=false, skipmissing=false)`  is deprecated. Instead use combine(groupby(d, cols, sort=false, skipmissing=false), [names(d, Not(cols)) .=> f for f in fs]...)` if functions in `fs` have unique names.
│   caller = top-level scope at In[56]:1
└ @ Core In[56]:1
Out[56]:

3 rows × 11 columns (omitted printing of 5 columns)

SpeciesSepalLength_sumSepalWidth_sumPetalLength_sumPetalWidth_sumPetalArea_sum
Cat…Float64Float64Float64Float64Float64
1setosa250.3171.473.112.318.28
2versicolor296.8138.5213.066.3286.02
3virginica329.4148.7277.6101.3564.81

Group by

In [57]:
for subdf in groupby(iris, :Species)
    println(size(subdf, 1))
end
50
50
50

Group by and combine

In [58]:
gdf = groupby(iris, :Species)
Out[58]:

GroupedDataFrame with 3 groups based on key: Species

First Group (50 rows): Species = "setosa"

SepalLengthSepalWidthPetalLengthPetalWidthSpeciesPetalArea
Float64Float64Float64Float64Cat…Float64
14.33.01.10.1setosa0.11
24.42.91.40.2setosa0.28
34.43.01.30.2setosa0.26
44.43.21.30.2setosa0.26
54.52.31.30.3setosa0.39
64.63.11.50.2setosa0.3
74.63.21.40.2setosa0.28
84.63.41.40.3setosa0.42
94.63.61.00.2setosa0.2
104.73.21.30.2setosa0.26
114.73.21.60.2setosa0.32
124.83.01.40.1setosa0.14
134.83.01.40.3setosa0.42
144.83.11.60.2setosa0.32
154.83.41.60.2setosa0.32
164.83.41.90.2setosa0.38
174.93.01.40.2setosa0.28
184.93.11.50.1setosa0.15
194.93.11.50.2setosa0.3
204.93.61.40.1setosa0.14
215.03.01.60.2setosa0.32
225.03.21.20.2setosa0.24
235.03.31.40.2setosa0.28
245.03.41.50.2setosa0.3
255.03.41.60.4setosa0.64
265.03.51.30.3setosa0.39
275.03.51.60.6setosa0.96
285.03.61.40.2setosa0.28
295.13.31.70.5setosa0.85
305.13.41.50.2setosa0.3

Last Group (50 rows): Species = "virginica"

SepalLengthSepalWidthPetalLengthPetalWidthSpeciesPetalArea
Float64Float64Float64Float64Cat…Float64
14.92.54.51.7virginica7.65
25.62.84.92.0virginica9.8
35.72.55.02.0virginica10.0
45.82.75.11.9virginica9.69
55.82.75.11.9virginica9.69
65.82.85.12.4virginica12.24
75.93.05.11.8virginica9.18
86.02.25.01.5virginica7.5
96.03.04.81.8virginica8.64
106.12.65.61.4virginica7.84
116.13.04.91.8virginica8.82
126.22.84.81.8virginica8.64
136.23.45.42.3virginica12.42
146.32.55.01.9virginica9.5
156.32.74.91.8virginica8.82
166.32.85.11.5virginica7.65
176.32.95.61.8virginica10.08
186.33.36.02.5virginica15.0
196.33.45.62.4virginica13.44
206.42.75.31.9virginica10.07
216.42.85.62.1virginica11.76
226.42.85.62.2virginica12.32
236.43.15.51.8virginica9.9
246.43.25.32.3virginica12.19
256.53.05.82.2virginica12.76
266.53.05.51.8virginica9.9
276.53.05.22.0virginica10.4
286.53.25.12.0virginica10.2
296.72.55.81.8virginica10.44
306.73.05.22.3virginica11.96
In [59]:
combine(gdf, :PetalLength => mean)
Out[59]:

3 rows × 2 columns

SpeciesPetalLength_mean
Cat…Float64
1setosa1.462
2versicolor4.26
3virginica5.552

Reshape

Original data format is wide format

In [60]:
iris[!,:id] = 1:size(iris, 1)
first(iris, 10)
Out[60]:

10 rows × 7 columns

SepalLengthSepalWidthPetalLengthPetalWidthSpeciesPetalAreaid
Float64Float64Float64Float64Cat…Float64Int64
14.92.54.51.7virginica7.651
25.62.84.92.0virginica9.82
35.72.55.02.0virginica10.03
45.82.75.11.9virginica9.694
55.82.75.11.9virginica9.695
65.82.85.12.4virginica12.246
75.93.05.11.8virginica9.187
86.02.25.01.5virginica7.58
96.03.04.81.8virginica8.649
106.12.65.61.4virginica7.8410

Stack specified column into data and transform into long format

In [61]:
d = stack(iris, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth])
first(d, 10)
Out[61]:

10 rows × 5 columns

SpeciesPetalAreaidvariablevalue
Cat…Float64Int64Cat…Float64
1virginica7.651SepalLength4.9
2virginica9.82SepalLength5.6
3virginica10.03SepalLength5.7
4virginica9.694SepalLength5.8
5virginica9.695SepalLength5.8
6virginica12.246SepalLength5.8
7virginica9.187SepalLength5.9
8virginica7.58SepalLength6.0
9virginica8.649SepalLength6.0
10virginica7.8410SepalLength6.1

Stack other columns

In [62]:
d = stack(iris, [:SepalLength, :SepalWidth], :Species)
first(d, 10)
Out[62]:

10 rows × 3 columns

Speciesvariablevalue
Cat…Cat…Float64
1virginicaSepalLength4.9
2virginicaSepalLength5.6
3virginicaSepalLength5.7
4virginicaSepalLength5.8
5virginicaSepalLength5.8
6virginicaSepalLength5.8
7virginicaSepalLength5.9
8virginicaSepalLength6.0
9virginicaSepalLength6.0
10virginicaSepalLength6.1

Unstack

In [63]:
d = stack(iris, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth])
first(unstack(d, :id, :variable, :value), 10)
Out[63]:

10 rows × 5 columns

idSepalLengthSepalWidthPetalLengthPetalWidth
Int64Float64?Float64?Float64?Float64?
114.92.54.51.7
225.62.84.92.0
335.72.55.02.0
445.82.75.11.9
555.82.75.11.9
665.82.85.12.4
775.93.05.11.8
886.02.25.01.5
996.03.04.81.8
10106.12.65.61.4

Not specify identifier

In [64]:
first(unstack(d, :variable, :value), 10)
Out[64]:

10 rows × 7 columns

SpeciesPetalAreaidSepalLengthSepalWidthPetalLengthPetalWidth
Cat…Float64Int64Float64?Float64?Float64?Float64?
1setosa0.111014.33.01.10.1
2setosa0.141124.83.01.40.1
3setosa0.141204.93.61.40.1
4setosa0.151184.93.11.50.1
5setosa0.151395.24.11.50.1
6setosa0.21094.63.61.00.2
7setosa0.241225.03.21.20.2
8setosa0.241505.84.01.20.2
9setosa0.261034.43.01.30.2
10setosa0.261044.43.21.30.2

Application

Calculate the average of every feature grouped by :Species

In [65]:
d = stack(iris)
x = by(d, [:variable, :Species], df -> DataFrame(vsum = mean(df[!,:value])))
unstack(x, :Species, :vsum)
┌ Warning: `by(d::AbstractDataFrame, cols::Any, f::Base.Callable; sort::Bool = false, skipmissing::Bool = false)` is deprecated, use `combine(f, groupby(d, cols, sort = sort, skipmissing = skipmissing))` instead.
│   caller = top-level scope at In[65]:2
└ @ Core In[65]:2
Out[65]:

5 rows × 4 columns

variablesetosaversicolorvirginica
Cat…Float64?Float64?Float64?
1SepalLength5.0065.9366.588
2SepalWidth3.4282.772.974
3PetalLength1.4624.265.552
4PetalWidth0.2461.3262.026
5PetalArea0.36565.720411.2962

Query

In [66]:
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
Out[66]:

3 rows × 3 columns

nameagechildren
StringFloat64Int64
1John23.03
2Sally42.05
3Kirk59.02

select-from-where

In [67]:
x = @from i in df begin
    @where i.age>50
    @select {i.name, i.children}
    @collect DataFrame
end
Out[67]:

1 rows × 2 columns

namechildren
StringInt64
1Kirk2

Sorting

In [68]:
df = DataFrame(a=[2,1,1,2,1,3],b=[2,2,1,1,3,2])

x = @from i in df begin
    @orderby descending(i.a), i.b
    @select i
    @collect DataFrame
end
Out[68]:

6 rows × 2 columns

ab
Int64Int64
132
221
322
411
512
613

Filtering

In [69]:
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

x = @from i in df begin
    @where i.age > 30. && i.children > 2
    @select i
    @collect DataFrame
end
Out[69]:

1 rows × 3 columns

nameagechildren
StringFloat64Int64
1Sally42.05

Projecting

In [70]:
data = [1,2,3]

x = @from i in data begin
    @select i^2
    @collect
end
Out[70]:
3-element Array{Int64,1}:
 1
 4
 9

Flattening

In [71]:
source = Dict(:a=>[1,2,3], :b=>[4,5])

q = @from i in source begin
    @from j in i.second
    @select {Key=i.first,Value=j}
    @collect DataFrame
end
Out[71]:

5 rows × 2 columns

KeyValue
SymbolInt64
1a1
2a2
3a3
4b4
5b5

Grouping

In [72]:
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,2,2])

x = @from i in df begin
    @group i.name by i.children
    @collect
end
Out[72]:
2-element Array{Grouping{Int64,String},1}:
 ["John"]
 ["Sally", "Kirk"]

Q & A