What Data to Include in MES-Related Machine Learning

Selecting the raw data used to train a model is one of the most challenging steps in implementing effective machine learning, and MES-related machine learning is no different.

There has been a lot of hype around IIoT and big data, much of it focused on simply collecting and storing values from numerous sensors and production systems. Any of this data can be used, and collecting it is the first step toward obtaining insights that make your production facilities more efficient.

The next step is to come up with a mathematical algorithm that relates the raw data to the output values you want to predict. This step is called training, and it requires training data: historical records that include the value you want to predict alongside the values it may depend on. For example, training data for predicting the production rate might look like the following, which has been kept small for this discussion:

Product Code   Operator Name   Target Rate   Actual Rate (Value to Predict)
Product A      Jim Thompson    100           98
Product A      Jim Thompson    100           99
Product A      Sue Smith       100           115
Product B      Jim Thompson    120           116
Product B      Sue Smith       120           120
Product A      Jim Thompson    100           100
Product C      Jim Thompson    100           97
Product C      Sue Smith       100           98
Product D      Jim Thompson    120           119
Product D      Sue Smith       120           118


The idea is to include columns in the training data that have an impact on the Actual Rate. Which columns those are will vary depending on your production process. For example, on a packaging line, the container vendor may affect the production rate: containers from one vendor may cause less downtime than those from another. For other processes, the humidity of the room may affect the production rate.

Once training is complete and an algorithm, known as the model, has been created, the Actual Rate can be predicted from the values in the remaining columns. In the example above, those are the Product Code, Operator Name, and Target Rate.
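To make the train-then-predict cycle concrete, here is a minimal Python sketch using the rows from the table above. The "model" is a deliberately simple stand-in for a real learning algorithm: it averages the Actual Rate for each product and operator pair, and falls back to the product average or the Target Rate for combinations it has not seen. The function names train and predict, and the fallback order, are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Training rows copied from the table above:
# (product_code, operator_name, target_rate, actual_rate)
TRAINING_ROWS = [
    ("Product A", "Jim Thompson", 100, 98),
    ("Product A", "Jim Thompson", 100, 99),
    ("Product A", "Sue Smith", 100, 115),
    ("Product B", "Jim Thompson", 120, 116),
    ("Product B", "Sue Smith", 120, 120),
    ("Product A", "Jim Thompson", 100, 100),
    ("Product C", "Jim Thompson", 100, 97),
    ("Product C", "Sue Smith", 100, 98),
    ("Product D", "Jim Thompson", 120, 119),
    ("Product D", "Sue Smith", 120, 118),
]


def train(rows):
    """Build a lookup-style model: mean Actual Rate per group (illustrative only)."""
    by_pair = defaultdict(list)
    by_product = defaultdict(list)
    for product, operator, _target, actual in rows:
        by_pair[(product, operator)].append(actual)
        by_product[product].append(actual)
    return {
        "pair": {key: mean(values) for key, values in by_pair.items()},
        "product": {key: mean(values) for key, values in by_product.items()},
    }


def predict(model, product, operator, target_rate):
    """Predict the Actual Rate from the remaining columns."""
    if (product, operator) in model["pair"]:
        return model["pair"][(product, operator)]
    if product in model["product"]:
        # Known product, unseen operator: use the product average.
        return model["product"][product]
    # Unseen product: the Target Rate is the only information left.
    return target_rate


model = train(TRAINING_ROWS)
print(predict(model, "Product A", "Jim Thompson", 100))
print(predict(model, "Product A", "Sue Smith", 100))
```

The first call averages Jim Thompson's three Product A runs (98, 99, and 100) and predicts 99. The second has only one matching row, so it predicts 115. A real project would normally use a proper learning library and far more rows; ten rows are enough to show the mechanics but not to trust the predictions.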

It is important to keep in mind that it is not practical to collect values from thousands of sensors and production systems and use them all in machine learning. Many of those values may have no direct impact on the predicted output value, and including them can even make the prediction less accurate. For this reason, only values that have a direct impact should be included. The other issue is that every additional column of training data increases the computing power required, eventually to the point where the training time is no longer practical.
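One rough way to judge whether a column has an impact is to measure how much of the variation in the Actual Rate is accounted for by grouping the rows on that column. The Python sketch below computes this share (often called eta squared, or the correlation ratio) for each column of the example table. This is one simple screening heuristic chosen for illustration, not a method prescribed here, and the function name variance_explained is hypothetical.

```python
from collections import defaultdict
from statistics import mean

COLUMNS = ("product_code", "operator_name", "target_rate", "actual_rate")

# Rows copied from the example table; the last column is the value to predict.
ROWS = [
    ("Product A", "Jim Thompson", 100, 98),
    ("Product A", "Jim Thompson", 100, 99),
    ("Product A", "Sue Smith", 100, 115),
    ("Product B", "Jim Thompson", 120, 116),
    ("Product B", "Sue Smith", 120, 120),
    ("Product A", "Jim Thompson", 100, 100),
    ("Product C", "Jim Thompson", 100, 97),
    ("Product C", "Sue Smith", 100, 98),
    ("Product D", "Jim Thompson", 120, 119),
    ("Product D", "Sue Smith", 120, 118),
]


def variance_explained(rows, column_index, label_index=3):
    """Share of label variance explained by grouping rows on one column."""
    labels = [row[label_index] for row in rows]
    overall = mean(labels)
    total_ss = sum((y - overall) ** 2 for y in labels)
    if total_ss == 0:
        return 0.0  # label never varies, so no column can explain anything
    groups = defaultdict(list)
    for row in rows:
        groups[row[column_index]].append(row[label_index])
    between_ss = sum(
        len(values) * (mean(values) - overall) ** 2 for values in groups.values()
    )
    return between_ss / total_ss


# Score every candidate column (everything except the label itself).
scores = {
    name: variance_explained(ROWS, index)
    for index, name in enumerate(COLUMNS[:3])
}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

On these ten rows the scores come out at roughly 0.78 for the product code, 0.74 for the target rate, and 0.16 for the operator name. Treat such numbers with caution: on a small sample, a column with many distinct values will score high almost by construction, so a screen like this is a starting point for discussion with process experts rather than a final answer.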