In today’s world, data is the new currency of the finance industry. However, it is becoming harder and harder to process ever-expanding volumes of financial data with traditional statistical methods. Machine learning provides a set of techniques and modules that enable the creation of programs that predict financial quantities such as company value or stock prices, or even detect cases of financial fraud.
This article describes the creation of a decision tree that detects fraudulent credit card transactions. The report covers the decision tree implementation using the Java programming language. The article starts by covering the main requirements of the machine learning algorithm. Next, it describes the dataset and its main components. Then, the decision tree is implemented using the Weka library. Last, GridDB read, write, and store methods are implemented to store our data for deployment and testing purposes.
Requirements
In this section, we will cover the main libraries, modules, and database environment used to demonstrate the usage of machine learning to detect fraudulent credit card transactions.
Weka 3.9: Download and place the weka.jar file in the following path: /usr/share/java/.
GridDB 4.6: Make sure to activate the GridDB cluster after installation.
It is critical to place the Weka jar file on the CLASSPATH of your environment. The same applies to the GridStore jar file so that the Java program can connect to the running GridDB database.
To perform this task, please use the following command lines:
$ export CLASSPATH=${CLASSPATH}:/usr/share/java/weka.jar
$ export CLASSPATH=$CLASSPATH:/usr/share/java/gridstore.jar
The Dataset
For demonstration purposes, we have chosen a sample dataset that contains credit card transactions. The dataset is available at the following link: here. It is composed of 9999 instances with 31 attributes, detailed as follows:
- Time: The number of seconds that elapsed between credit card transactions.
- V1 to V28: Anonymized attributes of the credit card transaction, obscured for privacy reasons.
- Amount: The transaction amount in dollars.
- Class: The transaction type; this attribute takes two possible values: 0 (not fraudulent) and 1 (fraudulent).
Here is an extract of the dataset:
"Time","V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount","Class"
0,-1.3598071336738,-0.0727811733098497,2.53634673796914,1.37815522427443,-0.338320769942518,0.462387777762292,0.239598554061257,0.0986979012610507,0.363786969611213,0.0907941719789316,-0.551599533260813,-0.617800855762348,-0.991389847235408,-0.311169353699879,1.46817697209427,-0.470400525259478,0.207971241929242,0.0257905801985591,0.403992960255733,0.251412098239705,-0.018306777944153,0.277837575558899,-0.110473910188767,0.0669280749146731,0.128539358273528,-0.189114843888824,0.133558376740387,-0.0210530534538215,149.62,"0"
The dataset used in this article is a comma-separated values file stored on the project resource page. Using Java, the data processing code starts by opening the file, then uses a Scanner to read the data line by line and store it in primary memory. The following Java snippet implements this step:
// Handling Dataset and storage to GridDB
File data = new File("/home/ubuntu/griddb/gsSample/creditcard.csv");
Scanner sc = new Scanner(data);
sc.useDelimiter("\n");
if (sc.hasNext()) sc.next(); // skip the CSV header row
int i = 0; // row counter, also used as the row key (column 0)
while (sc.hasNext()) {
Row row = collection.createRow();
String line = sc.next();
String columns[] = line.split(",");
int Time = Integer.parseInt(columns[0]);
float V1 = Float.parseFloat(columns[1]);
float V2 = Float.parseFloat(columns[2]);
float V3 = Float.parseFloat(columns[3]);
float V4 = Float.parseFloat(columns[4]);
float V5 = Float.parseFloat(columns[5]);
float V6 = Float.parseFloat(columns[6]);
float V7 = Float.parseFloat(columns[7]);
float V8 = Float.parseFloat(columns[8]);
float V9 = Float.parseFloat(columns[9]);
float V10 = Float.parseFloat(columns[10]);
float V11 = Float.parseFloat(columns[11]);
float V12 = Float.parseFloat(columns[12]);
float V13 = Float.parseFloat(columns[13]);
float V14 = Float.parseFloat(columns[14]);
float V15 = Float.parseFloat(columns[15]);
float V16 = Float.parseFloat(columns[16]);
float V17 = Float.parseFloat(columns[17]);
float V18 = Float.parseFloat(columns[18]);
float V19 = Float.parseFloat(columns[19]);
float V20 = Float.parseFloat(columns[20]);
float V21 = Float.parseFloat(columns[21]);
float V22 = Float.parseFloat(columns[22]);
float V23 = Float.parseFloat(columns[23]);
float V24 = Float.parseFloat(columns[24]);
float V25 = Float.parseFloat(columns[25]);
float V26 = Float.parseFloat(columns[26]);
float V27 = Float.parseFloat(columns[27]);
float V28 = Float.parseFloat(columns[28]);
float Amount = Float.parseFloat(columns[29]);
int Class = Integer.parseInt(columns[30].replace("\"", "").trim()); // strip the quotes around the class value
row.setInteger(0,i);
row.setInteger(1, Time);
row.setFloat(2, V1);
row.setFloat(3, V2);
row.setFloat(4, V3);
row.setFloat(5, V4);
row.setFloat(6, V5);
row.setFloat(7, V6);
row.setFloat(8, V7);
row.setFloat(9, V8);
row.setFloat(10, V9);
row.setFloat(11, V10);
row.setFloat(12, V11);
row.setFloat(13, V12);
row.setFloat(14, V13);
row.setFloat(15, V14);
row.setFloat(16, V15);
row.setFloat(17, V16);
row.setFloat(18, V17);
row.setFloat(19, V18);
row.setFloat(20, V19);
row.setFloat(21, V20);
row.setFloat(22, V21);
row.setFloat(23, V22);
row.setFloat(24, V23);
row.setFloat(25, V24);
row.setFloat(26, V25);
row.setFloat(27, V26);
row.setFloat(28, V27);
row.setFloat(29, V28);
row.setFloat(30, Amount);
row.setInteger(31, Class);
rowList.add(row);
i++;
}
Once the data has been retrieved from the dataset file, it is important to close the scanner used to process the data with the following Java statement:
sc.close();
The decision tree is implemented to predict which credit card transactions are fraudulent. In the subsequent section, we will implement the decision tree algorithm in Java.
Implementing a Decision Tree Algorithm in Java
We will use the J48 algorithm in the Weka package to implement our decision tree. The decision tree generated is known as a C4.5 decision tree, which can either be pruned or unpruned. To start the implementation process, we will begin by importing all the needed libraries.
The following is the list of libraries used to implement our decision tree:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.trees.J48;
import weka.classifiers.Evaluation;
Our decision tree can be configured using an options array. Here, -C sets the confidence factor used for pruning, and -M sets the minimum number of instances allowed in a leaf, which keeps the tree manageable given the large number of instances. The following Java code is used to set these options:
String[] options = new String[4];
options[0] = "-C";
options[1] = "0.25";
options[2] = "-M";
options[3] = "30";
Import Packages
We need a set of libraries to ensure our dataset is processed correctly in Java. These libraries provide the in-memory list structures that hold the rows of our program before they are written to GridDB. The initial libraries to be imported are the ArrayList and List modules.
import java.util.ArrayList;
import java.util.List;
Additionally, we will need to import two extra libraries: Properties, used to configure the GridDB connection, and Random, used to seed the cross-validation of our model.
import java.util.Properties;
import java.util.Random;
Next are the libraries needed to implement the GridDB interface.
import com.toshiba.mwcloud.gs.Collection;
import com.toshiba.mwcloud.gs.ColumnInfo;
import com.toshiba.mwcloud.gs.Container;
import com.toshiba.mwcloud.gs.ContainerInfo;
import com.toshiba.mwcloud.gs.GSType;
import com.toshiba.mwcloud.gs.GridStore;
import com.toshiba.mwcloud.gs.GridStoreFactory;
import com.toshiba.mwcloud.gs.Query;
import com.toshiba.mwcloud.gs.Row;
import com.toshiba.mwcloud.gs.RowSet;
Finally, we import packages to interact with our CreditCard CSV file.
import java.io.IOException;
import java.util.Scanner;
import java.io.File;
import java.io.BufferedReader;
import java.io.FileReader;
Write Data into GridDB
To store our credit card data in our GridDB database, we start by defining the connection properties of a predefined cluster. We then obtain a GridStore object through the GridStoreFactory.getInstance().getGridStore() method.
// Manage connection to GridDB
Properties prop = new Properties();
prop.setProperty("notificationAddress", "239.0.0.1");
prop.setProperty("notificationPort", "31999");
prop.setProperty("clusterName", "cluster");
prop.setProperty("database", "public");
prop.setProperty("user", "admin");
prop.setProperty("password", "admin");
// Get the store and define the container name
GridStore store = GridStoreFactory.getInstance().getGridStore(prop);
String containerName = "mContainer";
Once our store object has been created, we define the container schema by adding one ColumnInfo entry per dataset column. This can be done using the add() method.
// Define container schema and columns
ContainerInfo containerInfo = new ContainerInfo();
List<ColumnInfo> columnList = new ArrayList<ColumnInfo>();
columnList.add(new ColumnInfo("key", GSType.INTEGER));
columnList.add(new ColumnInfo("Time", GSType.INTEGER));
columnList.add(new ColumnInfo("V1", GSType.FLOAT));
columnList.add(new ColumnInfo("V2", GSType.FLOAT));
columnList.add(new ColumnInfo("V3", GSType.FLOAT));
columnList.add(new ColumnInfo("V4", GSType.FLOAT));
columnList.add(new ColumnInfo("V5", GSType.FLOAT));
columnList.add(new ColumnInfo("V6", GSType.FLOAT));
columnList.add(new ColumnInfo("V7", GSType.FLOAT));
columnList.add(new ColumnInfo("V8", GSType.FLOAT));
columnList.add(new ColumnInfo("V9", GSType.FLOAT));
columnList.add(new ColumnInfo("V10", GSType.FLOAT));
columnList.add(new ColumnInfo("V11", GSType.FLOAT));
columnList.add(new ColumnInfo("V12", GSType.FLOAT));
columnList.add(new ColumnInfo("V13", GSType.FLOAT));
columnList.add(new ColumnInfo("V14", GSType.FLOAT));
columnList.add(new ColumnInfo("V15", GSType.FLOAT));
columnList.add(new ColumnInfo("V16", GSType.FLOAT));
columnList.add(new ColumnInfo("V17", GSType.FLOAT));
columnList.add(new ColumnInfo("V18", GSType.FLOAT));
columnList.add(new ColumnInfo("V19", GSType.FLOAT));
columnList.add(new ColumnInfo("V20", GSType.FLOAT));
columnList.add(new ColumnInfo("V21", GSType.FLOAT));
columnList.add(new ColumnInfo("V22", GSType.FLOAT));
columnList.add(new ColumnInfo("V23", GSType.FLOAT));
columnList.add(new ColumnInfo("V24", GSType.FLOAT));
columnList.add(new ColumnInfo("V25", GSType.FLOAT));
columnList.add(new ColumnInfo("V26", GSType.FLOAT));
columnList.add(new ColumnInfo("V27", GSType.FLOAT));
columnList.add(new ColumnInfo("V28", GSType.FLOAT));
columnList.add(new ColumnInfo("Amount", GSType.FLOAT));
columnList.add(new ColumnInfo("Class", GSType.INTEGER));
containerInfo.setColumnInfoList(columnList);
containerInfo.setRowKeyAssigned(true);
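The row-writing snippets refer to a collection object and a rowList buffer that are not shown being created. One possible way to obtain them, sketched here under the assumption that the schema above is registered as a collection container (this requires the additional import com.toshiba.mwcloud.gs.ContainerType), is:
containerInfo.setType(ContainerType.COLLECTION); // store the rows in a collection container
Container<Object, Row> collection = store.putContainer(containerName, containerInfo, false); // create or open the container
List<Row> rowList = new ArrayList<Row>(); // buffer that accumulates the parsed rows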
Store the Data in GridDB
Once we have processed the dataset in Java, we move to the next step: storing our dataset for long-term use in our database. This is where GridDB becomes very handy, as each parsed value is written into its corresponding column of a container row.
The following Java code snippet is used to conduct that task:
row.setInteger(0,i);
row.setInteger(1, Time);
row.setFloat(2, V1);
row.setFloat(3, V2);
row.setFloat(4, V3);
row.setFloat(5, V4);
row.setFloat(6, V5);
row.setFloat(7, V6);
row.setFloat(8, V7);
row.setFloat(9, V8);
row.setFloat(10, V9);
row.setFloat(11, V10);
row.setFloat(12, V11);
row.setFloat(13, V12);
row.setFloat(14, V13);
row.setFloat(15, V14);
row.setFloat(16, V15);
row.setFloat(17, V16);
row.setFloat(18, V17);
row.setFloat(19, V18);
row.setFloat(20, V19);
row.setFloat(21, V20);
row.setFloat(22, V21);
row.setFloat(23, V22);
row.setFloat(24, V23);
row.setFloat(25, V24);
row.setFloat(26, V25);
row.setFloat(27, V26);
row.setFloat(28, V27);
row.setFloat(29, V28);
row.setFloat(30, Amount);
row.setInteger(31, Class);
rowList.add(row);
To commit our changes to the GridDB database, we can use the put() method, which takes the list of rows as an argument.
The following code is used to perform this task:
collection.put(rowList);
Retrieve the Data from GridDB
To retrieve the stored instances from our database, we use a SELECT query to fetch the needed rows and columns from our GridDB container.
The following code snippet performs the task explained above:
Query<Row> query = collection.query("SELECT *");
RowSet<Row> rs = query.fetch();
We can use System.out.println() to view our data. This task can be performed as follows:
// Print GridDB data
while (rs.hasNext()) {
Row row = rs.next();
System.out.println(" Row=" + row);
}
Build the Decision Tree
After ensuring our data is safely stored in our GridDB database, we can proceed to run our decision tree algorithm and use Java to determine our model’s accuracy. In our Java program, we use Weka's Instances class to convert our data into a form that can thereafter be used to create our decision tree.
The following code can be used to perform that task:
BufferedReader bufferedReader = new BufferedReader(new FileReader(res));
Instances datasetInstances = new Instances(bufferedReader);
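Note that the Instances(Reader) constructor expects data in Weka's ARFF format. Since our dataset is a CSV file, an alternative sketch (using the DataSource class imported earlier and assuming the same file path used when writing to GridDB) loads the CSV directly and marks the last column, Class, as the target attribute:
DataSource source = new DataSource("/home/ubuntu/griddb/gsSample/creditcard.csv");
Instances datasetInstances = source.getDataSet();
datasetInstances.setClassIndex(datasetInstances.numAttributes() - 1); // the "Class" column is the attribute to predict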
Once our data has been converted into ready-to-use instances, we use the buildClassifier() method to create our machine learning model. The following line builds our decision tree:
mytree.buildClassifier(datasetInstances);
After building our machine learning model, it is critical to ensure that it works correctly by determining its accuracy. This can be accomplished through cross-validation.
Evaluation eval = new Evaluation(datasetInstances);
eval.crossValidateModel(mytree, datasetInstances, 10, new Random(1));
In the last line of code, we perform a ten-fold cross-validation on our data instances. This process splits the dataset in different ways to obtain unbiased results, which is especially important since we are working with a limited dataset.
Finally, we print the summary of our model as follows:
System.out.println(eval.toSummaryString("\n === Summary === \n", true));
Compile and Run the Code
To compile and run the code, we start by locating the FraudulentTransactions.java file found in the gsSample/ path. Once the folder is located, execute the following commands to compile your Java code and run it:
~/griddb$ javac gsSample/FraudulentTransactions.java
~/griddb$ java gsSample/FraudulentTransactions
Conclusion
The following is a summary of the results produced by our decision tree in the Java program:
=== Summary ===
Correlation coefficient 0.7445
Mean absolute error 0.0009
Root mean squared error 0.0296
Relative absolute error 25.4503 %
Root relative squared error 71.3443 %
As the summary shows, the decision tree reaches a correlation coefficient of about 0.74 and a root relative squared error of 71.34% when classifying fraudulent credit card transactions. To improve these results, it is advisable to increase the size of the dataset so that the model can learn from a wider variety of fraudulent transactions.
Make sure to close the query, the container, and the GridDB database:
query.close();
collection.close();
store.close();
If you have any questions about the blog, please create a Stack Overflow post here https://stackoverflow.com/questions/ask?tags=griddb .
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.