Machine Learning has been something of a buzzword over the past decade, but with the advent of the modern computing age it has skyrocketed in both popularity and accessibility. Successively faster processors and larger memory capacities have combined with recent theoretical and technical discoveries across the ML field to enable techniques that were previously the preserve of supercomputers. Neural Networks, for example, have been around since the 1940s; largely abandoned in the 1970s for want of more powerful computers, they now enjoy widespread use in just about every field. Indeed, it is widely said that we have entered a “golden age” of machine learning.
While advances in computing resources have facilitated the modern uptake of complex machine learning techniques, what has really driven that uptake is an explosion in the amount of data available to train machine learning models.
As with any type of learning, very little comprehension can take place if there is nothing from which to learn. Machine learning techniques take existing data, extract patterns and learn models from that data, and then use that learned information to make decisions, predictions, classifications, and a whole assortment of other actions. Data typically comes in two main forms:
Labelled data allows for Supervised Learning (or Semi-Supervised Learning if only some labels are available), where models can learn a direct mapping from input samples to a set of labels that describe those samples. The aim of supervised learning is often to build a model which can then predict the label of new data, e.g., for classification, where a discrete label is predicted based on some new input, or regression, where a continuous value is predicted based on some new input.
Unlabelled data means the learning algorithm must perform Unsupervised Learning. Since no target outputs are given to the algorithm, it must uncover structure in the data on its own. Clustering (i.e., grouping similar data points) is the typical unsupervised learning task; Forecasting (e.g., predicting trends or future values) from raw time series is often grouped with it, since the series needs no manually assigned labels. Another technique that works without labelled examples is Reinforcement Learning, where a reward or punishment is given to the learning algorithm based on its actions. Search and optimisation algorithms such as Genetic Algorithms are guided by a comparable feedback signal, a fitness score, although they are usually classed as evolutionary computation rather than reinforcement learning.
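To make the distinction between the two forms of data concrete, here is a minimal sketch in Python using only the standard library. The function names (fit_line, two_means) and the sample values are invented for illustration. The first function learns a mapping from labelled pairs (supervised regression); the second groups unlabelled values without being told what the groups are (a one-dimensional k-means with two clusters, assuming the data contains at least two distinct values).

```python
from statistics import mean

def fit_line(xs, ys):
    """Supervised: least-squares fit of y = slope * x + intercept from labelled pairs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def two_means(values, iterations=10):
    """Unsupervised: split unlabelled 1-D values into two clusters (k-means, k=2)."""
    low, high = min(values), max(values)  # initial centroids: the two extremes
    for _ in range(iterations):
        # Assign each value to its nearest centroid, then move the centroids.
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(near_low), mean(near_high)
    return near_low, near_high

# Labelled data: every input x comes with a known target y (illustrative values).
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope * 5 + intercept)  # prediction for the unseen input x = 5 -> 11.0

# Unlabelled data: only the values themselves are available (illustrative values).
print(two_means([1.0, 1.2, 0.8, 9.0, 9.5, 8.7]))
```

In the supervised case the targets tell the model exactly what to reproduce; in the unsupervised case the grouping emerges only from how the values sit relative to one another.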
With the advent of the Internet of Things, data generation has accelerated: it is estimated that more than 90% of all the data generated over the course of human history was created in the past two years, a figure that has remained more or less constant since 2013. Data generation is growing exponentially, and storage capacity is growing rapidly to meet demand. The upshot is that far more data now exists on any single topic than ever before. And more data means a richer vein from which to learn.
An oft-used adage in Machine Learning and Data Science circles is:
Good data in equals good data out.
The quality of outputs of any learning system can be directly linked with the quality of the inputs to that system. Sparse, noisy, messy data will typically prove difficult to learn from, necessitating complex and often fragile models. On the other hand, clean, organised, and well-maintained data will allow for even simple models to produce excellent results.
Corvil acquires and enriches vast streams of high-quality network data at scale and applies machine learning algorithms to provide real-time, intelligent, and actionable analysis.
Corvil Intelligence Hub is a data analytics and intelligence solution for digital business, security, compliance, and IT operations teams. It ingests, processes, analyzes, and reports on electronic transactions and events for improved customer experience, operations, transparency, security, and compliance.
Among its many use cases, Intelligence Hub uses labelled data to predict Order-to-Trade Ratios with linear regression, and unlabelled time-series data both for forecasting and anomaly detection with statistical methods and for anomaly detection with Neural Networks.
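As a rough illustration of what statistical anomaly detection on unlabelled time-series data can look like, the sketch below flags points that sit far from the mean of the readings just before them (a rolling z-score). This is a generic textbook technique and not a description of how Intelligence Hub is implemented; the function name, the window and threshold parameters, and the latency values are all assumptions made up for the example.

```python
from statistics import mean, stdev

def zscore_anomalies(series, window=5, threshold=3.0):
    """Return the indices of points lying more than `threshold` standard
    deviations from the mean of the `window` points that precede them."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip perfectly flat history, where the z-score is undefined.
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

# Hypothetical latency readings with a single obvious spike at index 6.
latencies = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 25.0, 10.0, 9.9, 10.1]
print(zscore_anomalies(latencies))  # -> [6]
```

No labels are needed: the series itself defines what counts as normal, and the threshold controls how unusual a point must be before it is reported.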