# The machine learning commands “Train”, “Predict” and “Cluster” explained based on a practical example

**Part 1 – “Train” and “Predict” in “ACL™ Robotics”**

In this blog post we show two examples of how the analysis software “ACL™ Robotics” (previously known as “ACL™ Analytics”) from the software manufacturer Galvanize makes it possible to implement machine learning. For expert users: both supervised and unsupervised learning approaches are supported.

“ACL™ Robotics” is a software solution that has supported the manual and automated analysis of large amounts of data for many years. In addition to a variety of interfaces, e.g. to SAP (via “SAP Connector”), Salesforce, Google Hive, Amazon Redshift, Outlook, PDF imports or any ODBC data source, a scripting language helps to automate the sequence of analytic steps. Galvanize assigns this to the field of RPA (Robotic Process Automation). Individual analytic steps are implemented by commands such as sorting, summarizing, joining and relating, to name only a few.

With Version 14, these analytic commands were extended by three machine learning commands named “Train”, “Predict” and “Cluster”. In this blog post we familiarise you with these three commands using specific examples from the world of business, such as forecasting return values and clustering customers in combination with payment due dates. Existing ACL users can also download ACL projects containing the examples and try out each method step by step.

This blog post is, due to its length, divided into two parts:

- Part 1 deals with the “Train” and “Predict” commands
- Part 2 deals with the “Cluster” command, which is based on the k-means algorithm

The example for Part 1 is taken from the sales process: customers order goods of varying values and may return some of them. Let us take, for example, the following question:

**What amount does the return value come to in the case of an order value of € 2,000, € 4,200.50 or € 65,000?**

We will answer this question using the “Train” and “Predict” commands in ACL (see the menu item “Machine Learning”). The calculations are based on a fictitious dataset. Figure 1 shows its 967 orders and their return values. Each point in the graph represents an order: the order value is plotted on the x-axis, the associated return value on the y-axis. The angle bisector in blue marks the maximum possible return value, because the return value is always less than or equal to the order value. Orders plotted on the angle bisector have a return value equal to the order value, i.e. they were returned in full.
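If you would like to check this constraint on your own data before training, a minimal Python sketch (run outside ACL) could look as follows. The field names `order` and `return` mirror the example tables, while the function name and the sample rows are hypothetical and only serve as an illustration.

```python
def find_invalid_returns(rows):
    """Return all rows whose return value exceeds the order value."""
    return [row for row in rows if row["return"] > row["order"]]


# Made-up sample rows for illustration; not taken from the project's dataset.
sample = [
    {"order": 1200.0, "return": 450.0},  # partial return: below the bisector
    {"order": 800.0, "return": 800.0},   # full return: exactly on the bisector
    {"order": 500.0, "return": 650.0},   # invalid: return exceeds order value
]

invalid = find_invalid_returns(sample)
print(invalid)
```

Rows reported by such a check would lie above the angle bisector in a plot like Figure 1 and should be corrected or excluded before training.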

**Workshop:**

1. Download data: Open ACL and the project “estimate_returns.acl” (download it free here). This contains two tables:
   - “orders_and_returns”. This dataset is the so-called training data and contains the data from Figure 1.
   - “unseen_orders”. This dataset contains the order values for which we would like to estimate the return values.

2. Model calculation: In this step the model is trained. Using “orders_and_returns”, several models are trained, i.e. various statistical models attempt to determine the relation between the order values and the return values. The model which best explains the relation (the Winning Model) is then used to estimate the return values for the given order values of € 2,000, € 4,200.50 and € 65,000. ACL carries out this procedure automatically. Proceed as follows: Click on “orders_and_returns” in the side bar. The associated table now appears in the Basic View. Under “Machine Learning” -> “Train” you will find the dialog shown in Figure 2.

Next, change the parameters and settings in the input mask, and then start the search for the Winning Model by pressing “OK” (cf. Figure 3).

- “Time to search for an optimal model (minutes)” determines how long the search for the Winning Model lasts in total.
- “Maximum time per model evaluation (minutes)” determines the maximum time for which each individual model is fitted to the data. ACL advises that the total search time should be at least 10 times the evaluation time per model. In general, the longer these times are, the better the individual models and thus also the Winning Model. The times chosen in this example are generally far too short for larger datasets. ACL advises allowing 45 minutes per 100 MB of data for “Maximum time per model evaluation (minutes)”.
- In our example, we wish to carry out a regression, because a numerical value, the return value, is to be estimated. For this reason, you need to activate the “Regression” check box.
- As already mentioned, the “Train” command considers various models and subsequently selects the best one. To make this selection possible, each model is assigned a performance figure. The smaller this number, the better the associated model. The performance figure can be calculated in various ways; you select one in the “Model Scorer” dropdown menu. It is recommended to choose the best-suited type based on the results.
- Under “Train on…”, select the key or feature variables, i.e. the variables with which you wish to estimate the return value. In our case, there is only one feature variable, so only “order” is marked. Under “Target Field”, select the variable that is to be estimated. In our example, this is “return”.
- Under “Model Name...”, enter the name for the model. The model will be listed in the side bar under this name.
- Under “To...”, specify the name of the table which, after the calculation, contains information on the Winning Model. This table will likewise be listed in your side bar once the “Train” command has been executed.
- Under “If...”, you can optionally exclude entries from the training data. Under the tab “More”, you can adjust one or two expert settings. This is not necessary for our example.
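The two time recommendations above can be turned into a small rule-of-thumb calculation. The following Python sketch takes both recommendations literally as stated in this post; the function name is hypothetical and not part of ACL.

```python
def suggest_training_times(data_size_mb):
    """Suggest (max. time per model evaluation, total search time) in minutes.

    Assumes 45 minutes per 100 MB for a single model evaluation and a total
    search time of 10 times that value, as described in this post.
    """
    max_eval_minutes = 45.0 * data_size_mb / 100.0
    search_minutes = 10.0 * max_eval_minutes
    return max_eval_minutes, search_minutes


max_eval, search = suggest_training_times(20)
print(f"Maximum time per model evaluation: {max_eval} minutes")
print(f"Time to search for an optimal model: at least {search} minutes")
```

For a small workshop dataset such as ours, much shorter times are sufficient; the calculation is mainly useful as an orientation for larger datasets.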

Once ACL has calculated the Winning Model, it appears in your side bar as the newly created “model_returns” file.
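To give an impression of what such a model search does conceptually, here is a heavily simplified Python sketch. It is not ACL's actual implementation: the three candidate models, the hold-out split, the two scorers and the synthetic data are all assumptions made for illustration. The idea is only that each candidate is fitted, assigned a performance figure, and the candidate with the smallest figure wins.

```python
from statistics import mean


def fit_constant(xs, ys):
    """Candidate 1: always predict the mean return value."""
    avg = mean(ys)
    return lambda x: avg


def fit_proportional(xs, ys):
    """Candidate 2: return = slope * order (least squares through the origin)."""
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: slope * x


def fit_linear(xs, ys):
    """Candidate 3: return = intercept + slope * order (ordinary least squares)."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared error: smaller is better."""
    return mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


def mae(model, xs, ys):
    """Mean absolute error: smaller is better."""
    return mean([abs(model(x) - y) for x, y in zip(xs, ys)])


def search_winning_model(train, valid, scorer):
    """Fit every candidate on `train`, score it on `valid`, pick the smallest score."""
    candidates = {
        "constant": fit_constant,
        "proportional": fit_proportional,
        "linear": fit_linear,
    }
    train_x, train_y = zip(*train)
    valid_x, valid_y = zip(*valid)
    scores = {name: scorer(fit(train_x, train_y), valid_x, valid_y)
              for name, fit in candidates.items()}
    winner = min(scores, key=scores.get)
    return winner, scores


# Synthetic (order, return) pairs with a deterministic "noise" term.
rows = [(500.0 * i, 200.0 * i + ((i * 37) % 11 - 5) * 10.0) for i in range(1, 41)]
valid = rows[::4]                                          # every 4th row for scoring
train = [r for k, r in enumerate(rows) if k % 4 != 0]      # remaining rows for fitting

winner, scores = search_winning_model(train, valid, mse)
print("Winning Model:", winner)
print("Scores:", scores)
```

Swapping `mse` for `mae` corresponds roughly to choosing a different entry in the “Model Scorer” dropdown: the ranking of the candidates may change with the scorer.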

3. Forecast return values: The initial question was: What amount does the return value come to in the case of an order value of € 2,000, € 4,200.50 or € 65,000? Exactly these three order values are stored in the table “unseen_orders” in your ACL project. “Unseen”, in this case, means that the model has not been trained with these order values. Open this table in the Basic View, and select “Machine Learning” -> “Predict”. Next, specify the model calculated (in our example, this would be “model_returns.model”), and then a name for the table which is to contain the estimated return values. Under “If…” and “More…” you can optionally exclude entries from the forecast (cf. Figure 4).

The dataset with which you look for the Winning Model and the dataset for which you then estimate the values of the target field need to have exactly the same feature variables. The “estimated_returns” table in your side bar now contains the forecasts for the order values € 2,000, € 4,200.50 and € 65,000 based on the Winning Model and the training data. The rounded forecasts are: € 796, € 1,672 and € 25,873. Figure 5 shows the training data, as well as the three order values under review, € 2,000, € 4,200.50 and € 65,000; and their associated forecasts, in red.
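The requirement that both datasets share the same feature variables can be checked up front. The following Python sketch is a hypothetical helper outside ACL; the function name and the mistyped column name `order_value` are made up for illustration.

```python
def missing_features(model_features, table_columns):
    """Return the feature variables the model expects but the table lacks."""
    return [name for name in model_features if name not in table_columns]


model_features = ["order"]  # feature variable used for training in this example

print(missing_features(model_features, ["order"]))        # table is usable
print(missing_features(model_features, ["order_value"]))  # feature is missing
```

An empty result means that the table contains every feature variable the model was trained on.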

With the aid of the “Train” and “Predict” commands, you have now obtained a forecast of the return value for particular order values. To give more insight into ACL's automated procedure, Figure 6 shows predictions for order values of € 1,000, € 2,000, € 3,000, …, € 80,000, again using the Winning Model described above. The Winning Model is clearly a linear model, because the forecasts all lie on a straight line (cf. Figure 6). If the “orders_and_returns” training data is changed, the Winning Model and its parameters will generally change as well, and the Winning Model may then also be non-linear. As the training data exclusively contains orders that have been returned, the model found by ACL always estimates the expected return value given that an order is returned. The model does not estimate whether an order will be returned!
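The linearity can also be checked with simple arithmetic on the three rounded forecasts quoted earlier (€ 796, € 1,672 and € 25,873). In the following Python sketch only the numbers come from this post; the variable names are our own.

```python
# (order value, rounded forecast) pairs as quoted in this post
forecasts = [(2000.0, 796.0), (4200.5, 1672.0), (65000.0, 25873.0)]

# Ratio of forecast to order value for each of the three orders
ratios = [ret / order for order, ret in forecasts]

# Slopes between neighbouring points; on a straight line they must agree
(x1, y1), (x2, y2), (x3, y3) = forecasts
slope_a = (y2 - y1) / (x2 - x1)
slope_b = (y3 - y2) / (x3 - x2)

print([round(r, 4) for r in ratios])
print(round(slope_a, 4), round(slope_b, 4))
```

Both slopes and all three ratios come out at roughly 0.398. Allowing for rounding, this is consistent with a straight line that passes close to the origin, i.e. the model estimates a return value of roughly 40 % of the order value.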

The procedure described above belongs to Supervised Learning. The term “Supervised” refers, in this case, to the fact that the training data contains known values of the numerical target field, which the model learns to estimate with the aid of the key variables. The sequence presented also works with multiple key variables, which simply increases the number of variables included in the model. With Unsupervised Learning, no “target field” exists. The second part of this blog post presents an example of an Unsupervised Learning approach.

In the first part of this blog post, we showed you how you can calculate forecasts based on a dataset using a combination of the “Train” and “Predict” commands. If you would like to use your own data, the necessary steps you need to perform can be summarised as follows:

1. Create a training dataset. This always includes the key variables and the target field.
2. Create a dataset for which forecasts are to be calculated. This contains the same key variables as the training dataset.
3. Calculate the Winning Model with a suitable choice of parameters.
4. Calculate forecasts using the Winning Model for the dataset from Step 2.

We hope that you enjoyed this interactive blog post. Have fun trying out the commands explained above. If you have any questions, please do not hesitate to contact us at any time.

In the second part, which will be published soon, we will be looking at the “Cluster” command with an example from customer master data and payment due dates.