As machine learning models grow more computationally complex, the need for high-performance computing infrastructure grows with them. The newly opened Esports lounge offers an opportunity to use 30+ machines with top-spec hardware. Cresco, in conjunction with ClearML, is being used to train cancer-predicting models for tissue samples.

Each of these machines houses an RTX 2080 SUPER GPU; used together as a cluster, they provide a massive amount of computational power. With the cluster working together, the time taken to train machine learning models and run other intensive programs will be greatly reduced.

As mentioned above, two applications are being used to accomplish this. The first is Cresco, an open-source distributed computing solution that keeps track of, sends data to, and runs workloads on each of these machines. Cresco has many plugins built to accomplish tasks (such as running commands, transferring files, or viewing system statistics) without requiring interaction with each individual machine. This matters because running a task on a computer that is in use would give its user slow response times, which is not acceptable in this environment.

Fig. 1 – A Typical Cresco Deployment

ClearML is the other part of this project. ClearML is also open-source and provides a way to execute and analyze machine learning tasks processed by the cluster of machines. It can also organize multiple machine learning jobs and assign specific jobs to certain machines within the cluster. This is very helpful because, for example, half of the machines could train a natural language processing model while the other half trains a colon cancer detection model. The cluster could be divided further to add more training/testing workloads.

Fig. 2 – A ClearML Network
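
The half-and-half split described above can be sketched in a few lines of Python. This is only an illustration, not our actual configuration: the host names (lounge-01 through lounge-30) and the queue names (nlp, colon_cancer) are made up for the example, and the helper functions are hypothetical. The only ClearML-specific piece is the clearml-agent daemon --queue command, which is how a machine is told which job queue to pull work from.

```python
def assign_queues(hosts, queues):
    """Distribute hosts over queue names round-robin.

    Returns a dict mapping each queue name to the list of hosts serving it.
    """
    if not queues:
        raise ValueError("at least one queue name is required")
    assignment = {queue: [] for queue in queues}
    for index, host in enumerate(hosts):
        assignment[queues[index % len(queues)]].append(host)
    return assignment


def agent_command(queue):
    """Build the command a machine would run to serve one ClearML queue."""
    return f"clearml-agent daemon --queue {queue}"


# Hypothetical host names for a 30-machine lounge (assumption, not real names).
hosts = [f"lounge-{n:02d}" for n in range(1, 31)]

# Hypothetical queue names: one per workload mentioned in the text.
plan = assign_queues(hosts, ["nlp", "colon_cancer"])

for queue, members in plan.items():
    print(f"{queue}: {len(members)} machines, each running '{agent_command(queue)}'")
```

Passing a longer list of queue names to assign_queues divides the cluster further, which mirrors adding more training/testing workloads.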

UPDATE 12/14/2021: We are not yet at full capacity as we are still testing and finalizing agent controllers and finding the best configurations for ClearML. As more machines come online and we finalize our agent controllers, metrics for machine learning tasks will be provided here.