Yeah, so there are two pieces to this, two parts about the data. Usually, we have very little data. What comes out of our lab and out of the bioinformatics software is just a list of mutations. It says, “Oh, for this person, she has a mutation on chromosome three, this specific mutation. She should have an A here; she has a G.” That’s it. And we have a whole list for each person, saying every mutation that they have. But those don’t mean much, because in this list of mutations there is the color of your hair, the color of your eyes – all the kinds of things that make you who you are, that make you special. And in the middle of all of it there are also things that cause diseases, so what we wanna do is separate those mutations from the ones that don’t cause any problem.
The first part of how we do it, without any machine learning, is that we have a lot of databases and bioinformatics tools; most of those we didn’t build ourselves – they’re resources that can be used by anyone working in bioinformatics. They give us more information about each mutation, or about each position in the genome. They say, “Oh, this position is very conserved”, meaning that in pretty much all of the species whose DNA sequence we have, that position is the same, and we never see anything different. So this becomes a number, created by a specific research project, and that number is later used by our software to help decide whether a mutation causes disease or not.
So we add a lot of information to each one of these variants, and each position in this list of mutations becomes a lot of data. We know which proteins were affected… Everything that can be calculated using knowledge from biology and from genetics goes in, and this becomes a huge list. There are thousands of small data points like this for each mutation. So that’s the point where we are before we start talking about machine learning.
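The annotation step described above – attaching conservation scores and other biological knowledge to each position – effectively turns every mutation into a vector of numbers. A toy sketch of the idea, assuming made-up lookup tables (a real pipeline attaches thousands of values per variant, not two):

```go
package main

import "fmt"

// annotate is a toy stand-in for the annotation step: given one variant
// position, it looks up extra information from (hypothetical) precomputed
// databases and returns a numeric feature vector for that mutation.
func annotate(pos int, conservation map[int]float64, inCodingRegion map[int]bool) []float64 {
	features := make([]float64, 0, 2)

	// Feature 1: conservation score - how identical this position is
	// across sequenced species (close to 1.0 = almost never differs).
	features = append(features, conservation[pos])

	// Feature 2: whether the position falls inside a protein-coding region.
	coding := 0.0
	if inCodingRegion[pos] {
		coding = 1.0
	}
	features = append(features, coding)

	return features
}

func main() {
	// Made-up databases keyed by genomic position.
	conservation := map[int]float64{101: 0.99, 202: 0.12}
	coding := map[int]bool{101: true}

	fmt.Println(annotate(101, conservation, coding)) // highly conserved, coding
	fmt.Println(annotate(202, conservation, coding)) // weakly conserved, non-coding
}
```

Stacking one such vector per mutation is what produces the big matrix mentioned next.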
After that, we have a huge matrix; you can think of that list where each of these features becomes a column, so it’s quite a big matrix, and we built a machine learning model using an algorithm called random forest.
[00:08:02.24] It’s not very popular these days; you’re probably thinking a lot about deep learning and TensorFlow and these kinds of things, but three or four years ago we weren’t talking that much about it, and there are some studies that say that this algorithm (random forest) works well for genomics, for genetic data.
We started working with a library built in Go that creates these kinds of models. The library is called CloudForest; it’s pretty much just an implementation of this algorithm, and we started passing our data, with all of those extra columns, into this software for it to build a big model trying to predict, for new mutations, whether they cause disease or not – whether they’re pathogenic or not. So we did a lot of rounds of cleaning up our data and trying to understand how each feature of this software works, because I’m not a specialist in machine learning… I didn’t know much about it, so I had to learn it while I was doing it.
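For readers unfamiliar with the algorithm: a random forest trains many decision trees, each on a random subset of the rows and columns, and classifies by majority vote. The following is a deliberately tiny illustration of that voting idea in Go – three hard-coded one-level “stumps” instead of trained trees – and is not CloudForest’s actual API or the real model:

```go
package main

import "fmt"

// stump is a one-level decision tree: it votes "pathogenic" (true) when a
// single feature exceeds a threshold. A real random forest grows many full
// trees from the training data; these stumps are hard-coded toys.
type stump struct {
	feature   int     // which column of the feature vector to inspect
	threshold float64 // vote pathogenic when the value exceeds this
}

func (s stump) predict(row []float64) bool {
	return row[s.feature] > s.threshold
}

// forestPredict takes the majority vote of all stumps in the ensemble.
func forestPredict(forest []stump, row []float64) bool {
	votes := 0
	for _, s := range forest {
		if s.predict(row) {
			votes++
		}
	}
	return votes*2 > len(forest)
}

func main() {
	// Hypothetical feature layout per variant:
	// [conservation score, in coding region, allele rarity]
	forest := []stump{
		{feature: 0, threshold: 0.9},  // very conserved positions are suspicious
		{feature: 1, threshold: 0.5},  // coding-region changes matter more
		{feature: 2, threshold: 0.95}, // extremely rare alleles are suspicious
	}

	fmt.Println(forestPredict(forest, []float64{0.99, 1.0, 0.99})) // true
	fmt.Println(forestPredict(forest, []float64{0.12, 0.0, 0.30})) // false
}
```

The averaging over many randomized trees is what makes the forest more robust than any single tree, which is one reason it has worked well on tabular genomic features.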
In the end, we created several models and started working with them; one of those was better, so we put it into the real software and started passing new data into it, trying to see whether, for new patients coming in, we could find their mutation earlier – or at least filter out a lot of the things that don’t matter before we put it into a web page for our doctors to work with. Does that make more sense now?
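That last filtering step – dropping the variants the model scores as unlikely to matter before they reach the doctors’ web page – could look something like this sketch, where the variant labels, score map, and 0.8 cutoff are all made-up illustration values:

```go
package main

import "fmt"

// filterLikelyPathogenic keeps only the variants whose model score clears
// a cutoff, so doctors review a short list instead of every mutation a
// person carries. Scores and cutoff here are invented for illustration.
func filterLikelyPathogenic(variants []string, score map[string]float64, cutoff float64) []string {
	kept := []string{}
	for _, v := range variants {
		if score[v] >= cutoff {
			kept = append(kept, v)
		}
	}
	return kept
}

func main() {
	variants := []string{"chr3:101 A>G", "chr7:500 C>T", "chrX:42 G>A"}
	score := map[string]float64{
		"chr3:101 A>G": 0.93, // model rates this one as likely pathogenic
		"chr7:500 C>T": 0.05,
		"chrX:42 G>A":  0.11,
	}
	fmt.Println(filterLikelyPathogenic(variants, score, 0.8))
	// prints: [chr3:101 A>G]
}
```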