If you are new to machine learning, you are probably wondering, “What hardware specifications does my computer or server need to run machine learning?” We combined the results of the tests run by Puget Systems and came up with this short answer: “With a lot of GPUs and VRAM.” The longer, more detailed answer comes a bit later in this article. To be useful to as many people as possible, we will start with some basic information about machine learning.
An oversimplified explanation of Machine Learning
Machine Learning (ML) is a branch of computer science that uses algorithms to imitate the way humans learn. To do that, ML needs access to vast amounts of data. For example, suppose you want your algorithm to decide “What type of animal is in the picture?” It starts with one picture and asks, “Is this a cat?” If the answer is “No”, it disregards the picture. If the answer is “Yes”, it analyzes the image, looks for features, and assigns them weights. It does the same for the rest of the images, constantly adjusting these weights until it can recognize cats without human confirmation, or at least get as close to that as possible.
For instance, in the first image with a confirmed cat, the algorithm notices that the cat has 4 legs, so it assumes that 100% of the images showing an animal with 4 legs are images of cats. But the following picture is of a crocodile. It has 4 legs as well, but it is not a cat, so the algorithm compares the 2 images and sees that cats have fur while crocodiles have scales. From now on, it looks for animals with 4 legs and fur. The next image is of a giraffe. It has 4 legs and fur, but it is not a cat, so the algorithm compares them again and finds that giraffes are taller than cats. The next one is of a dog. It passes all of the requirements so far, but it is still not a cat. The algorithm analyzes the difference and sees that the cat has vertical pupils.
The next image is of a Sphynx cat. The algorithm rejects it because it has no fur, but then the engineers intervene and mark it as a cat, so the algorithm reduces the weight of the “has fur” requirement by 50%. The next image is of a tiger. The algorithm rejects it because tigers have round pupils. The engineers intervene again and correct the label to cat, so the algorithm adjusts the weights to “has fur”: 66% and “has vertical pupils”: 66%.
This process continues: the more images of cats the algorithm sees, the more requirements it adds and the more it adjusts the weights of the existing ones. Eventually, it has enough information to set a threshold that gives it a good enough success rate at identifying cats. For each new image, it no longer has to compare against all known images of cats. It just adds up the numerical values of the requirements that are met, and if the total is above the threshold, it says the image shows a cat. This process has to be repeated for every animal, product, or subject you want your algorithm to recognize.
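The weight-adjusting loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: it uses the classic perceptron update rule as a stand-in for "adjust the weights when the guess was wrong", and the four yes/no features, the tiny training set, and the names `TRAINING_DATA`, `predict`, and `train` are all made up for this example.

```python
# Each animal is a list of yes/no answers (1 = yes, 0 = no) to four
# illustrative questions: [four_legs, fur, vertical_pupils, small_body].
# The features and labels are invented for this sketch (label 1 = cat).
TRAINING_DATA = [
    ("cat",       [1, 1, 1, 1], 1),
    ("crocodile", [1, 0, 0, 0], 0),
    ("giraffe",   [1, 1, 0, 0], 0),
    ("dog",       [1, 1, 0, 1], 0),
    ("sphynx",    [1, 0, 1, 1], 1),
]


def predict(weights, bias, features):
    """Add up the weights of the features that are present and compare to 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train(data, epochs=50):
    """Perceptron rule: nudge the weights only when the guess was wrong."""
    weights = [0.0] * len(data[0][1])
    bias = 0.0  # plays the role of the (negated) threshold
    for _ in range(epochs):
        mistakes = 0
        for _name, features, label in data:
            error = label - predict(weights, bias, features)  # -1, 0 or +1
            if error != 0:
                mistakes += 1
                weights = [w + error * x for w, x in zip(weights, features)]
                bias += error
        if mistakes == 0:  # every training example is now classified correctly
            break
    return weights, bias


weights, bias = train(TRAINING_DATA)
for name, features, _label in TRAINING_DATA:
    verdict = "cat" if predict(weights, bias, features) else "not a cat"
    print(f"{name}: {verdict}")
```

Real image models learn millions of weights from raw pixels rather than from four hand-picked questions, but the idea of summing weighted evidence and comparing it to a threshold is the same.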
As you can see from the explanation above, you eventually end up with a long list of Yes and No questions that need answers:
– Does it have fur? – Yes
– Does it have 4 legs? – Yes
And so on and so forth.
The more complex your algorithm is, the more questions will be on that list. You need the answers to all of them, but the good news is that they are not linked: you don’t have to wait for one to be answered before you ask the next. You can answer them all at the same time and add up the corresponding values to reach a conclusion. So, the more items on your checklist you can answer at the same time (in parallel), the faster you will get your answer to the big question. This is what determines the hardware specifications of your server.
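To make the "not linked" point concrete, here is a small sketch that answers the same checklist once in order and once concurrently. The checklist, its weights, and the function names are hypothetical. Python threads are used only to show that the questions do not depend on each other; they will not make this toy example faster, whereas a GPU spreads this kind of work across thousands of cores.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical checklist: question name -> (check function, weight).
CHECKLIST = {
    "has_four_legs": (lambda animal: animal["legs"] == 4, 1.0),
    "has_fur": (lambda animal: animal["fur"], 0.66),
    "has_vertical_pupils": (lambda animal: animal["pupils"] == "vertical", 0.66),
}


def score_sequential(animal):
    """Answer the questions one after another and add up the weights."""
    total = 0.0
    for check, weight in CHECKLIST.values():
        if check(animal):
            total += weight
    return total


def score_parallel(animal):
    """Answer all questions concurrently; no question waits for another."""
    checks = list(CHECKLIST.values())
    with ThreadPoolExecutor() as pool:
        # pool.map returns the answers in the same order as `checks`.
        answers = list(pool.map(lambda item: item[0](animal), checks))
    return sum(weight for answer, (_check, weight) in zip(answers, checks) if answer)


house_cat = {"legs": 4, "fur": True, "pupils": "vertical"}
crocodile = {"legs": 4, "fur": False, "pupils": "vertical"}
print(score_sequential(house_cat), score_parallel(house_cat))
print(score_sequential(crocodile), score_parallel(crocodile))
```

Both functions return the same score because the order in which the answers arrive does not matter, only their sum.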
To calculate the score that is compared against the threshold, you can use either the central processing unit (CPU) or the graphics processing unit (GPU). The CPU has to be able to solve all kinds of problems, so it is not optimized for anything specific. CPUs usually have fewer cores, but those cores run at higher clock speeds, because most general tasks have to be executed in order. GPUs, on the other hand, have lots of small cores, because they need to finish many small jobs quickly and in parallel, so they are better suited to our purposes.
For the most part, the CPU won’t matter that much, unless your algorithm is optimized for the CPU. Most people will not use their machine learning servers for anything else, so you can save some money by picking a less expensive CPU. Keep in mind, though, that the CPU dictates the number of PCIe lanes and their generation, so don’t go too cheap on it. Decide which GPUs you will put in your server, then pick a CPU with the required number of PCIe lanes of the proper generation, so you don’t bottleneck the system when it transfers data from your storage to the VRAM of your GPUs. Other than that, the CPU will not be used much.
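A quick way to check the lane requirement is to add it up. The sketch below assumes each GPU gets a full x16 link and each NVMe drive an x4 link, which is typical but not universal, and it uses the commonly quoted approximate per-lane throughput figures for PCIe 3.0, 4.0, and 5.0. The function names are made up for this example.

```python
# Approximate usable bandwidth per PCIe lane, in GB/s, in one direction.
PCIE_GB_PER_S_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}


def lanes_needed(num_gpus, num_nvme=0, lanes_per_gpu=16, lanes_per_nvme=4):
    """Total PCIe lanes the devices would use at their full link width."""
    return num_gpus * lanes_per_gpu + num_nvme * lanes_per_nvme


def link_bandwidth_gb_per_s(generation, lanes):
    """Approximate one-way bandwidth of a single PCIe link."""
    return PCIE_GB_PER_S_PER_LANE[generation] * lanes


# Example: 4 GPUs and 2 NVMe drives, compared with a CPU's lane count.
required = lanes_needed(num_gpus=4, num_nvme=2)
cpu_lanes = 64  # assumed value; check the data sheet of the CPU you consider
print(f"Lanes required: {required}, CPU provides: {cpu_lanes}")
print(f"Fits at full width: {required <= cpu_lanes}")
print(f"Gen4 x16 link: about {link_bandwidth_gb_per_s(4, 16):.1f} GB/s")
print(f"Gen3 x16 link: about {link_bandwidth_gb_per_s(3, 16):.1f} GB/s")
```

If the devices need more lanes than the CPU provides, some of them will run at a reduced link width, which is where the bottleneck mentioned above comes from.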
The GPUs are of great importance because they do the actual work. You need as many cores as possible, at the best speed, with the most VRAM you can afford. Why? The more cores you have, the more questions you can answer at the same time, and the faster they run, the more cycles you can complete. But why do you need more VRAM? To do its work, the algorithm needs something to compare the image to, and getting that image from storage into the GPU for comparison takes time. Once an image is in VRAM, it can stay there for later use. The more VRAM you have, the more images you can keep there, which reduces the time needed to compare them to something else. In other words, the bigger the VRAM, the bigger the dataset you can work with.
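As a rough back-of-the-envelope illustration, you can estimate how many uncompressed images fit in a given amount of VRAM. The image size, the 4 bytes per value (32-bit floats), and the share of VRAM reserved for the model itself are all assumptions chosen for this sketch, and `images_that_fit` is a hypothetical helper, not part of any library.

```python
def image_bytes(width, height, channels=3, bytes_per_value=4):
    """Size of one uncompressed image stored as 32-bit floats by default."""
    return width * height * channels * bytes_per_value


def images_that_fit(vram_gib, width, height, reserved_fraction=0.2):
    """Estimate how many images fit in VRAM.

    reserved_fraction is an assumed share of VRAM kept free for the
    model's weights and intermediate results; the real share varies a lot.
    """
    usable_bytes = int(vram_gib * 1024**3 * (1 - reserved_fraction))
    return usable_bytes // image_bytes(width, height)


for vram in (8, 24, 80):
    count = images_that_fit(vram, width=224, height=224)
    print(f"{vram} GiB of VRAM: roughly {count} images of 224x224")
```

The exact numbers matter less than the trend: the images that do not fit have to be fetched from storage again.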
For storage, you need the fastest drives you can get, preferably NVMe. Why? First, storage has to feed all of the data to the VRAM, and if you don’t have enough VRAM, this process may have to be repeated more than once. Sometimes there are also random reads from storage, and they need to be as fast as possible so they don’t slow down the rest of the process. Just imagine that your GPUs can process 500 MB/s, but you still use HDDs that can sustain only 100 MB/s: you have wasted 4/5 of your performance.
The choice of motherboard comes down to how many PCIe lanes it has and of what generation, and it is largely determined by the CPU you choose. One thing we can add is that we offer custom PCIe extension kits, so if your motherboard supports bifurcation, you can use one of them to split one PCIe x16 slot into four x4 or two x8 slots, which lets you add more GPUs or storage devices.
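The arithmetic behind that example is simple enough to write down. In this sketch, `gpu_utilization` is a hypothetical helper that assumes the GPU can only work as fast as storage delivers data and ignores caching in VRAM or system memory.

```python
def gpu_utilization(gpu_mb_per_s, storage_mb_per_s):
    """Fraction of the GPU's processing rate that storage can keep fed."""
    return min(1.0, storage_mb_per_s / gpu_mb_per_s)


def wasted_fraction(gpu_mb_per_s, storage_mb_per_s):
    """Fraction of GPU capacity left idle while waiting for data."""
    return 1.0 - gpu_utilization(gpu_mb_per_s, storage_mb_per_s)


# The example from the text: GPUs that process 500 MB/s fed by a 100 MB/s HDD.
print(f"Utilization: {gpu_utilization(500, 100):.0%}")
print(f"Wasted: {wasted_fraction(500, 100):.0%}")
```

With storage that is at least as fast as the GPUs, utilization is capped at 100% and nothing is wasted.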
Power Supply Unit (PSU)
The power supply depends on the parts you put in, but it is a good idea to have redundant power supplies, so you don’t lose any progress if something happens to one of them.
We could list specific parts here, but the list would get old very fast, and it would contain just the halo products of Nvidia, AMD, and Intel, because they are always the best, at least statistically. That is not helpful for people on a budget. Instead, we opted to explain what each part contributes to the process and let you decide what is best for your budget. We can also offer you an alternative to buying your own computer: renting one or more of our GPU Servers for Machine Learning. We offer a wide range of options for beginners and advanced users, and you can rent them for a week to test whether they work for you.