Analyzing LendingClub data using JuliaDB

If years of research on streamlining data analytics have shown us anything, it is that working with big data is no cakewalk. However one looks at it, creating, maintaining, and building models on large datasets is time consuming and computationally intensive.

Introducing JuliaDB

In Julia v0.6, we aim to take another step toward solving this problem with our new package, JuliaDB.

JuliaDB is a high performance, distributed, column-oriented data store providing functionality for both in-memory and out-of-core calculations. Being fully implemented in Julia, JuliaDB easily integrates with data loading, analytics, and visualization packages throughout the Julia ecosystem. Such seamless integration allows for rapid development of data and compute intensive applications.

This example will use datasets provided by LendingClub, the world's largest online marketplace for connecting borrowers and investors. On their website, they provide publicly available, detailed datasets containing anonymized data on all loans issued through their system, including the current loan status and latest payment information.

The analysis conducted below is similar to that performed on the same datasets in this post by the Microsoft R Server Tiger Team for the Azure Data Science VM.

The first step in conducting this analysis is to download the following files from the website: LoanStats2016_Q1.csv, LoanStats2016_Q2.csv, LoanStats2016_Q3.csv, LoanStats2016_Q4.csv, LoanStats2017_Q1.csv, LoanStats3b.csv, LoanStats3c.csv and LoanStats3d.csv. A basic clean-up of the data files is performed by deleting the first and last lines of descriptive text from each CSV file.

Writing the Julia code

Once the file clean-up is done, install the following packages: JuliaDB, TextParse, IndexedTables, NullableArrays, DecisionTree, CoupledFields, ROC, Gadfly, Cairo, Fontconfig, Dagger, and Compose, then load the ones required below.

# Packages that need to be installed with Julia 0.6
Pkg.add("Cairo") # Needed for PNG creation with Gadfly
Pkg.add("Fontconfig") # Needed for PNG creation with Gadfly
using Dagger, Compose
using ROC, Gadfly
using DecisionTree, JuliaDB, TextParse, NullableArrays
import TextParse: Numeric, NAToken, CustomParser, tryparsenext, eatwhitespaces, Quoted, Percentage

Now define a variable that contains the path to the directory containing the data files, and a dictionary that maps each column name in the dataset to the parser for that column.

dir = "/home/venkat/LendingClubDemo/files"
const floatparser = Numeric(Float64)
const intparser = Numeric(Int)

t  = Dict("id"                 => Quoted(Int),
          "member_id"          => Quoted(Int),
          "loan_amnt"          => Quoted(Nullable{Float64}),
          "funded_amnt"        => Quoted(Nullable{Float64}),
          "funded_amnt_inv"    => Quoted(Nullable{Float64}),
          "term"               => Quoted(TextParse.StrRange),
          "int_rate"           => Quoted(NAToken(Percentage())),
          "delinq_2yrs"        => Quoted(Nullable{Int}),
          "earliest_cr_line"   => Quoted(TextParse.StrRange),
          "inq_last_6mths"     => Quoted(Nullable{Int}),
...and so on

Calling the `loadfiles` function from the JuliaDB package parses the data files and constructs the corresponding table. Providing the dictionary above as column parsers guides the parsing, although `loadfiles` can also infer column types on its own. Since we specify no index columns, JuliaDB creates its own implicit index column, numbering each row with a unique integer starting at 1.

LS = loadfiles(glob("*.csv", dir), indexcols=[], colparsers=t)

Once done, we classify each loan as bad or good based upon whether it is late, in default, or has been charged off. We then split the table into two, one for good loans and one for bad.

bad_status = ("Late (16-30 days)", "Late (31-120 days)",
              "Default", "Charged Off")
# Determine which loans are bad loans
  is_bad = map(status->(status in bad_status),
               getdatacol(LS, :loan_status)) |> collect |> Vector{Bool}
# Split the table into two based on the loan classification
  LStrue = filter(x->x.loan_status in bad_status, LS)
  LSfalse = filter(x->!(x.loan_status in bad_status), LS)
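The same membership test can be checked in isolation on plain vectors, which mirrors what the distributed `map` over the `loan_status` column does. The sample statuses below are made up for illustration:

```julia
bad_status = ("Late (16-30 days)", "Late (31-120 days)",
              "Default", "Charged Off")

# Made-up sample of loan_status values
statuses = ["Current", "Default", "Fully Paid", "Charged Off"]

# Same membership test as used on the distributed column
is_bad = map(s -> s in bad_status, statuses)
```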

Constructing a relevant model requires identifying which factors best distinguish good loans from bad ones. The feature selection method we use here is a graphical comparison of how each numerical column's values are distributed across the good and bad loan categories: for every numerical column, we construct two density plots, one for good loans and one for bad. This first requires determining which columns are numerical, which we do with the following set of `isnumeric` methods.

# Define a function for determining if a value is numeric, whether or
# not the value is a Nullable.
  isnumeric(::Number) = true
  isnumeric{T<:Number}(::Nullable{T}) = true
  isnumeric(::Any) = false
  isnumeric{T<:Number}(x::Quoted{T}) = true
  isnumeric{T<:Nullable}(x::Quoted{T}) = eltype(T) <: Number

We then map our isnumeric function over each column of the JuliaDB table, construct Gadfly layers for each density plot for the good and bad loans, and then display that collection for feature selection.

# Density plots of the numeric columns based on loan classification
  varnames = map(Symbol, collect(keys(filter((k,v)->(k != "id" && k!="member_id" && isnumeric(v)), t))))
  layers = Vector{Gadfly.Layer}[]

  for s in varnames
      nt = dropnull(collect(getdatacol(LStrue, s)))
      nf = dropnull(collect(getdatacol(LSfalse, s)))
      push!(layers, layer(x = nt, Geom.density, Theme(default_color=colorant"blue")))
      push!(layers, layer(x = nf, Geom.density, Theme(default_color=colorant"red")))
  end

  # Lay out the individual plots on a 2D grid
  N = length(varnames)
  M = round(Int, ceil(sqrt(N)))
  cs = Array{Compose.Context}(M, M)
  for i = 1:N
      cs[i] = render(Gadfly.plot(layers[2i-1], layers[2i]))
  end
  for i = N+1:M^2
      cs[i] = render(Gadfly.plot(x=[0], y=[0]))
  end
  draw(PNG("featureplot.png", 24inch, 24inch), gridstack(cs))

The Gadfly plots would typically look like this:

In order to keep our analysis as close as possible to the one conducted by Microsoft, we select the same set of predictor variables that they did:

revol_util, int_rate, mths_since_last_record, annual_inc_joint,
dti_joint, total_rec_prncp, all_util

Creating the predictive model

Our predictive model is built with the random forest implementation in the DecisionTree.jl package. There are two steps here: first we use a large portion of the data to train the model, then a smaller portion to test it. So we randomly split the data into two parts, one containing 75% of the rows to be used for training, and the other containing the remaining 25% to be used for testing.

# Split the data into 75% training / 25% test
  n = length(LS)
  p = randperm(n)
  m = round(Int,n*3/4)
  a = sort(p[1:m])
  b = sort(p[m+1:end])
  LStrain = LS[a]
  LStest  = LS[b]
  labels_train = is_bad[a]
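The split logic can be verified in isolation with an arbitrary `n` (in the real pipeline `n` is `length(LS)`); the two sorted index vectors partition `1:n` with no overlap:

```julia
using Random  # provides randperm on Julia ≥ 0.7; built in on 0.6

n = 100                     # arbitrary stand-in for length(LS)
p = randperm(n)             # random permutation of row indices
m = round(Int, n * 3 / 4)   # size of the 75% training portion
a = sort(p[1:m])            # training indices
b = sort(p[m+1:end])        # test indices
```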

The random forest model requires two inputs: a vector of labels and the corresponding feature matrix. For the label vector, we reuse the index vector from the split above (the same indices that extract the training subset of the table also extract the corresponding subset of the is_bad label vector). To construct the feature matrix, we extract the columns for our selected features from the distributed JuliaDB table, gather those columns to the master process, and concatenate the resulting vectors into the feature matrix.

features_train = [revol_util_train int_rate_train mths_since_last_record_train annual_inc_joint_train dti_joint_train total_rec_prncp_train all_util_train]
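The concatenation step can be sketched with plain vectors. The values below are made-up stand-ins for the gathered columns; in the real pipeline each `*_train` vector would come from collecting a column of the distributed table to the master process:

```julia
# Hypothetical stand-ins for two gathered feature columns
revol_util_train = [0.45, 0.30, 0.72]
int_rate_train   = [0.11, 0.09, 0.15]

# Horizontal concatenation yields a feature matrix with one row
# per loan and one column per feature
features_train = [revol_util_train int_rate_train]
```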

Having done this, we can now call the “build_forest” function from the DecisionTree.jl package.

model = build_forest(labels_train, features_train,
                     3,    # features sampled at each split
                     10,   # number of trees
                     0.8,  # fraction of samples used per tree
                     6)    # maximum tree depth

Should we want to save our model to reuse at a later time, we can store it to our disk.

f = open("loanmodel.jls", "w")
serialize(f, model)
close(f)
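Reloading the model later is just a matter of deserializing the same file. A minimal, self-contained sketch of the round trip, using a placeholder object in place of the trained forest (`Serialization` is a standard library on Julia 0.7+; `serialize` is built in on 0.6):

```julia
using Serialization  # standard library on Julia ≥ 0.7; built in on 0.6

model = Dict("n_trees" => 10)    # placeholder for the trained forest
open("loanmodel.jls", "w") do f  # write the model to disk
    serialize(f, model)
end
restored = open(deserialize, "loanmodel.jls")  # read it back
```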

We can now test our model on the rest of the data. To do this, we will generate predictions in parallel across all workers by mapping the “apply_forest” function onto every row of the JuliaDB dataset.

predictions = collect(map(row->DecisionTree.apply_forest(model,
    [row.revol_util.value; row.int_rate.value;
     row.mths_since_last_record.value; row.annual_inc_joint.value;
     row.dti_joint.value; row.total_rec_prncp.value;
     row.all_util.value]), LStest)).data

With our set of predictions, we construct a ROC curve using the ROC.jl package and calculate the area under the curve to find a single measure of how predictive our trained model is on the dataset.

# Receiver Operating Characteristics curve
curve = roc(Vector{Float64}(predictions), BitArray{1}(is_bad[b]))

# An ROC plot in Gadfly with data calculated using ROC.jl
Gadfly.plot(layer(x = curve.FPR, y = curve.TPR, Geom.line),
    layer(x = linspace(0.0,1.0,101), y = linspace(0.0,1.0,101),
          Geom.point, Theme(default_color=colorant"red")),
    Guide.xlabel("False Positive Rate"),
    Guide.ylabel("True Positive Rate"))

The ROC would look like this.

The area under the curve gives that single measure:

# Area Under Curve
AUC(curve)
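What this number reports can be sketched independently: the area under the piecewise-linear ROC curve is a trapezoidal sum over its (FPR, TPR) points. The points below are made up for illustration:

```julia
# Illustrative (FPR, TPR) points for a piecewise-linear ROC curve
fpr = [0.0, 0.2, 0.5, 1.0]
tpr = [0.0, 0.6, 0.8, 1.0]

# Trapezoidal rule: sum the areas of trapezoids between consecutive points
auc = sum((fpr[i+1] - fpr[i]) * (tpr[i+1] + tpr[i]) / 2 for i in 1:length(fpr)-1)
```

An area of 0.5 (the red diagonal in the plot above) corresponds to random guessing; values closer to 1.0 indicate a more predictive model.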

We have shown how to use JuliaDB to create a model predicting the quality of a loan, combining data access and sophisticated analysis in a single environment.

Source: Julia Computing
