DeepSeek's Cost-Efficient Model Training ($5M vs hundreds of millions for competitors)
DeepSeek: Data Hurdles
===
Chris Detzel: [00:00:00] Hello, Data Enthusiasts. This is Chris Detzel and I'm Michael Burke. Welcome to Data Hurdles. We are your gateway into the intricate world of data, where AI, machine learning, big data, and social justice intersect. Expect thought-provoking discussions, captivating stories, and insights from experts all across the industries as we explore the unexpected ways data impacts our lives.
So get ready to be informed, inspired, and excited about the future of data. Let's conquer these data hurdles together.
Welcome to another Data Hurdles. I'm Chris Detzel and
Mike: Michael Burke. How you doing, Chris? Pretty good, man. How about you? I'm doing really well today. I didn't get a lot of sleep.
My daughter had a cold the last day, so it's been a little rough, but I'm doing well. I'm excited about a lot of the things I've been learning at Databricks. I'm diving into a bunch of training, and my head's all over the place lately with less people and more textbook.
Chris Detzel: Got to learn somehow.
And I think you were telling me Databricks does a really good job of the learning experience. So you do a lot of learning for the first few months and then you get to dive in to talk to customers at some point.
Mike: If you like learning, there is essentially unlimited learning in their academy. I've probably done a couple hundred hours at this point, [00:01:00] and I would say that in total there are thousands of hours available to learn outside of what you do in your day to day. It's wild.
Chris Detzel: I'm a "hey, let's just go and get it done" kind of person, and then I learn as I go.
Mike: You have to be. There's just so much, you'll never cover everything, right?
Chris Detzel: Hey, today we wanted to talk a little bit about some of these models, specifically around DeepSeek. What do you think?
Mike: Yeah, let's do it. DeepSeek has been in the news a lot lately. I have tried it out a few times and had mixed reviews. When it first came out, you could query it and ask, who made this, or what was this model trained on, and it would spit out things like OpenAI all the time.
But in the latest news, we're hearing a lot more about things like cost, performance, and optimizations that are really interesting, and I think we should definitely talk about that.
Chris Detzel: I agree, and we will do that here shortly, but I also think we need to think about security.
When we talk about Chinese companies coming into the U.S., things like TikTok, and we won't [00:02:00] necessarily talk about TikTok, but there are all these regulations and things going on with the government wanting to shut it down, and I'm like, oh, who knows? But what's the difference? This is another Chinese company coming in and getting all of this data from us. We can talk about all of that and just get our takes on it.
That's fair.
Mike: Yeah, absolutely. Look, it is something that we all need to be concerned about, and not just with American versus Chinese models. In particular, you should be thinking about where your data is going, data residency, and what your data is being used for.
More particularly, what are you putting into a large language model? We talk about this all the time. You're on ChatGPT and Claude all the time, and you're giving them so much information about your life, probably about certain aspects of your job, definitely about your podcasting.
So they're starting to know and learn more about you than you're probably aware of. And I think this is something that most people don't think about when they give an application information, [00:03:00] because with the way that ChatGPT and these other large language models can respond back to you,
you tend to provide much more granular information than you ever would a search engine. And that allows
Chris Detzel: True, because it talks to us.
Mike: Exactly. So that gives the model a certain degree of context about what you're trying to do or what you need. That's great for answering questions, but it could also be used for a lot of other things if it's stored and sold and distributed the wrong way.
Chris Detzel: Just being sold and stored and distributed, period. Exactly. Wrong way or right way, it's still taking our stuff, our data. So in particular with DeepSeek, what are you seeing, and why is this such a big deal?
Mike: Yeah. So DeepSeek was really interesting. A couple of weeks ago, they made some really amazing claims, primarily that the models they had built were trained with significantly limited hardware compared to your standard ChatGPT or your Claude.
And then also that [00:04:00] those models were much more efficient, as far as how much money they cost to train and how many resources, including electricity, were needed to train them. Now, I've gone through some of the papers in detail, but I'm certainly not an expert here. The big thing to really understand is that DeepSeek claimed that their latest model cost about $5 million to train, right?
And when you think about ChatGPT, the cost of those is just blowing up; hundreds of millions, even billions of dollars are being spent. This is an area where any kind of efficiency like that is going to be a huge disruptor, not only in how much they can charge for a model, but in how quickly they can iterate on it.
So when you think about NVIDIA, whose stock has skyrocketed because everybody has been training on these GPUs, they're doing some sort of consumption forecasting, saying everybody is going to continue using large language models, and therefore this supply [00:05:00] of GPUs is going to be consumed at a certain rate.
Now, if a company like DeepSeek can come in and train a model on a fraction of the consumption, that business model falls apart, right? It is all contingent on large language models.
Chris Detzel: That's why that stock went way down when that was announced. It's probably back up now, but
Mike: NVIDIA's, yeah. Absolutely.
And I think the other really interesting thing to talk about with DeepSeek is that there are these new kinds of strategies coming out on how we deploy large language models and how they operate and make decisions. So when you think about a ChatGPT, we call it a model, right? As a singular thing. But really, what's happening under the hood now is that you don't just have one large model, because that
would really only be okay at doing most things, right? Instead, there's this new kind of framework that has been rolled out, and most large companies are using it now; it's [00:06:00] called mixture of experts. What that means is that there are many models trained in specific areas and sub-areas of information, and they are actually experts that can respond to a question better than a generalized model ever could.
So DeepSeek has been doing that. And one of the big differentiators they released, which was also in the news, though not as big as the NVIDIA drop and the price to train, was that their entire mixture of experts, or most of it, was completely open sourced, and no other company has done that to date.
What that enables is an entire group of people that didn't have access to that kind of technology to understand how it works. And if the model really can be trained for $5 million, that unlocks a whole bunch of new use cases for places like [00:07:00] universities or large enterprises to start investing in these models and build specific use cases for their own business needs.
Chris Detzel: So let me get this right. Instead of having one large language model that takes all that data, you have these sub-expert language models that are just smaller and focused on certain areas. How do they funnel it all? Is it like, hey, there's this big model that's taking all the data and then funneling it into these different expert language models?
How does that work?
Mike: Yeah, it's a great question. And I'm totally oversimplifying a lot of this, but
Chris Detzel: I just made that up, so
Mike: No, it's great. It's great. What really happens, and again, this is also an oversimplification, is that there are other models that are used to learn how to properly route requests to the right expert.
One common way we do that is through reinforcement learning, where you're [00:08:00] essentially training something on feedback that's provided to it, whether it got the question right or wrong, whether it routed that question, or that request, or that set of tokens, to the right model to give a response.
Reinforcement learning, which we've talked about a little bit in past episodes, is an incredibly powerful tool. And when you start mixing reinforcement learning with large language models and other kinds of classification models, you start to build this capability, this mixture of experts, that is just unmatched, not only in terms of accuracy and performance, but also cost.
A mixture of experts, where you have smaller models that need to be traversed instead of one huge, gigantic model, reduces the overall cost of your large language platform. And a lot of people don't know this, but when you think about the total cost of running a company like ChatGPT, the training is exceptionally high, right?
It's hundreds of millions of dollars, but that's actually not their biggest cost. Their biggest cost is inference, which [00:09:00] is serving the model to you. So when you use the model, they're spending significantly more to run it for you than they even spent to train it. And again, they're making money back on that cost.
But there is a huge amount of cost associated with inference. And if these companies can come up with solutions like mixture of experts that reduce that cost, by creating less overhead, you don't have to search through a massive model, you just have to search through the little one that you get routed to.
That improves their margins. And so these kinds of strategies are going to make large language models and companies like ChatGPT much more profitable over time as these efficiencies get better and better.
Chris Detzel: What are the drawbacks of having these expert-like models? Or is it actually not a drawback in comparison to having one big giant model?
Mike: Think of a giant model. And I'll see if we can do this through a podcast; this might be something where we eventually need a visual, and there's a lot of great material that explains this. But when you put a question into a ChatGPT, it essentially [00:10:00] has to loop through this whole neural network many times to get you a response and answer your question.
It's linking these words one after another, and each time it generates a token, it's traversing this whole neural network. So the bigger that neural network is, the more complicated and costly it is to find the information you need, and probably the less accurate it is. And if you think about how even our brains work, in the way that we file information or compartmentalize it, we have specific notes for a specific subject, right?
We wouldn't go through every book in our library looking for information on a specific topic. We would have some sort of indexing system, like the Dewey Decimal system if you used the library back in the day before search was available. You had to go look something up, and then within that there would be a section of books related to the topic you needed to research or answer a question on.
It's the same way with mixture of experts. That's how it [00:11:00] works: the model has enough of an understanding of the question you're asking, or the subject of the question, to route you to the right section of knowledge that will answer it. And that's a lot more cost efficient.
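To make that routing idea concrete, here is a minimal mixture-of-experts sketch in PyTorch. It's illustrative only, not DeepSeek's actual architecture: a small learned router scores each token, and only the top-k expert networks actually run, which is where the compute savings Mike describes come from.

```python
# Minimal mixture-of-experts layer (illustrative sketch, not DeepSeek's code).
# A learned router scores each token; only the top-k experts actually run,
# so most of the layer's parameters stay idle for any given token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # learns where to send tokens
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                           # x: (num_tokens, dim)
        probs = F.softmax(self.router(x), dim=-1)   # routing probabilities
        weights, idx = probs.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):   # run only the chosen experts
            for k in range(self.top_k):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)  # torch.Size([10, 64])
```

In a real deployment the experts live on different devices and a load-balancing loss keeps the router from sending everything to one expert, but the top-k gating above is the core mechanic.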
Chris Detzel: So there are just more benefits to doing it the expert way now. What I'm getting out of this conversation is that DeepSeek was doing the expert stuff and saving a ton of money, and the other models, like Anthropic's Claude or OpenAI's ChatGPT, were not. They were just using one big model, going through that loop of things you just mentioned.
And so what they're probably asking now is, how do we build these expert models and route things that way? That's probably what they're running around doing now, don't you think?
Mike: No, absolutely. And you brought up an interesting point about one of the other ways these companies are saving a ton of money and resources,
and that's how [00:12:00] they're training these models. So you mentioned this earlier: oh, are they using ChatGPT?
Chris Detzel: I didn't say it here. I said it offline, but
Mike: Yeah, exactly. But early on with DeepSeek, you could actually ask it, where did this information come from?
And it would spit back ChatGPT, which is wild to me. But this points to another way companies can save money when they're building models, and that's by taking one of these giant model platforms like ChatGPT and using question-and-answer pairs derived from that model to train a smaller model.
This process is called distillation. You're essentially taking a huge universe of information and distilling out the questions and answers you need to train a smaller, more specific expert model. And what's really interesting is that up until recently we weren't sure this was going to work, but examples are coming out that are proving [00:13:00] you can obtain the same kind of performance by going through distillation that you could by training a model off raw data.
And you can do it at significantly less cost. Why use hundreds of billions of parameters when you only need 8 billion, and you can create a condensed model that runs locally on your computer? It doesn't need to be running on a mega GPU farm from NVIDIA. So if we think out to what this is going to look like in 5 to 10 years, I think we're going to see models get
more specific, more condensed, and able to run on smaller and smaller devices. Right now, a lot of stuff is still reaching back out to the cloud, to these giant GPU farms. But like everything, as we get more optimizations and come up with better strategies, I think you're going to see a lot more of that switch to being on-device.
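Here's a toy version of that distillation idea in PyTorch, just to show the mechanics. It's a sketch under the assumption that you have a big frozen "teacher" whose outputs supervise a much smaller "student"; real LLM distillation trains on question-and-answer pairs or teacher logits over text, but the loss has the same shape.

```python
# Toy distillation loop (illustrative only): a large frozen "teacher" provides
# soft targets, and a much smaller "student" learns to match its distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab = 100
teacher = nn.Sequential(nn.Linear(16, 512), nn.ReLU(), nn.Linear(512, vocab)).eval()
student = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, vocab))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # temperature: softens the teacher's distribution so more signal transfers

for step in range(200):
    x = torch.randn(64, 16)                      # stand-in for a batch of prompts
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x) / T, dim=-1)
    loss = F.kl_div(F.log_softmax(student(x) / T, dim=-1),
                    soft_targets, reduction="batchmean") * T * T
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final distillation loss: {loss.item():.4f}")
```

The student here has a fraction of the teacher's parameters, which is exactly the "why use hundreds of billions of parameters when 8 billion will do" trade Mike is describing.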
Chris Detzel: I feel like we were talking about this a year ago, plus. When we [00:14:00] first started talking about ChatGPT and things, you talked about having some of these smaller models on your computer that you could already use and that were already doing some cool things. It sounds like that's still the way they're going to go; they just haven't completely figured out how to do it.
Now there have been some breakthroughs through DeepSeek, it sounds like, and they're changing the industry in a way, in how they're giving you that information through these mixture-of-experts kinds of models or whatever. So that is interesting.
And it's more interesting to me that the ChatGPTs, the OpenAIs and Googles or whatever, are spending so much money on the NVIDIA chips, while DeepSeek, since we don't give them access to those chips, just figured out a way to do it and scale it in a lot cheaper way, which was crazy.
What kind of chips do they use? Do we know this or,
Mike: So they're not using the, what is the latest chip called? I'm trying to remember here, but the flagship GPU from NVIDIA. And that was the big thing: we [00:15:00] have been optimizing and optimizing on these massively. Just one of these GPUs, I think, is going for $30,000 to $50,000 right now.
It's crazy. And yeah, your average GPU a few years ago was like $3,000. The demand has skyrocketed, the supply has tightened, and people are paying outrageous premiums for this hardware to continue to serve models to the public and grow their businesses. So anything that can reduce that is going to significantly disrupt the market.
Chris Detzel: So a few years ago it was only three grand, and now it's 50 grand for one? If that's for the same thing, that sounds ridiculous.
Mike: It's different. It's significantly better. But the price increase is generated not necessarily by the technology advancements, but by the demand for the GPU.
Chris Detzel: That's crazy. Demand sucks sometimes.
Mike: One other thing that is really cool that I wanted to bring up about the newest DeepSeek models is chain of thought. I'm not sure if we've talked about this before, what [00:16:00] chain of thought is, but if you've noticed, even on ChatGPT, when you ask a complicated question now,
it breaks it down into steps. So if you ask it to take a list of people that you know from X, Y, and Z, and help me write a brief bio on each, and then maybe do something else with that information, put it in a table, it's going to start thinking.
Then it's going to write out a series of steps, and then it's going to execute on those steps. That's what chain of thought is. The reason we use chain of thought is that you tend to get much better answers on complicated questions.
If you've been using ChatGPT since the early days, you know that if you asked it a complicated question, a lot of times it would fail to answer. This was particularly true when you asked it math questions, right? Or things like that, or even that anti-spam question with apples that I posted online.
Now ChatGPT can answer that question. It still won't say it's a human, so it still works. This chain of thought [00:17:00] is something that DeepSeek has also completely open sourced in their models, and nobody else has done this as well. The amount of open-sourced material that they've provided to the industry is on par with what Llama from Facebook has done previously. These advancements are only going to get better if we continue to open source, and a lot of the big players, the ChatGPTs of the world, are not open sourcing that information, and that's stunting the growth.
It's stunting what universities, individuals, and businesses are able to do. I think there are pros and cons to all of that, but to me, educating the public, universities, and researchers on some of these more complicated processes is what advances the entire industry and moves everything forward.
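At the prompting level, chain of thought is mostly about asking for the steps before the answer. A hedged sketch using the OpenAI Python client; the model name and the system-prompt wording here are placeholders, and any chat-style model would do.

```python
# Chain-of-thought prompting (sketch): ask the model to write out its steps
# before committing to a final answer. Model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

question = "I have 23 apples, give away 7, then buy 3 bags of 6. How many apples now?"
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any chat-style LLM works
    messages=[
        {"role": "system",
         "content": "Think step by step. Write your reasoning as numbered steps, "
                    "then give the final answer on its own line."},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
# Expected shape: 23 - 7 = 16, 3 * 6 = 18, 16 + 18 = 34, then "Final answer: 34".
```

Reasoning models like DeepSeek's bake this stepwise behavior into training rather than relying on the prompt, but the visible "thinking then answering" pattern is the same.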
Chris Detzel: To go off track a little bit, I'm starting to see businesses, product companies, really truly embed AI into their products. Like, for example, [00:18:00] I might've said this before, but the company I work for today, ZoomInfo: when you think of a sales AE selling to a B2B company, they're looking for not just names and contacts, but what's going on within a company. Who's doing podcasts, who's looking at my website? It puts it all in one place, all the research that you used to do.
It's all in one place, telling you about a company or a person, instead of you going online and spending two days trying to find it. And now it will also write an email to them specifically around their earnings or their initiatives and things like that. It's doing these things already.
I'm not sure which, but they're using an OpenAI or a Claude or even DeepSeek; they're using these companies to run the back end of the AI stuff. It's pretty cool. It's changing the way everybody is doing business. [00:19:00] It's insane.
Mike: No, it is, and it's getting easier and easier.
I just went through a training on how Databricks does this, right? And it is amazing. Previously, to do this you needed to know a lot about engineering. You needed to know a lot about ops. You needed to understand how to deploy these things, what kind of databases, what vector databases you needed, what would scale.
Now, with platforms like Databricks, and there are others out there too, but I think they do a really good job at this, you can create one of these large language models, fine-tuned to your needs, in your own kind of localized RAG setup, in a matter of an hour. When could we ever do that before? And sure, there are a lot of pieces to any model, and getting performance out of it still requires some skills.
I don't think we're fully out of the weeds where you could just click and go, but it is doable. And it's something that can really bring data to becoming a product in your technology stack. That's something we were starting to see more and more of, and the integrations are becoming much deeper.
[00:20:00] One thing I thought was really interesting that you said is that ZoomInfo, and I watched your video on this, I thought it was great, not only provides contact information, but it has the relative context of the business to write something that is tailored. Maybe not even just the business, maybe your communications over email as well. It can actually form something that is meaningful to the context of your business relationship.
And I think that's a real differentiator. That kind of technology is what's going to differentiate us across the board.
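For a sense of what that "localized RAG" loop Mike mentions looks like at its simplest, here's a generic sketch, not Databricks' actual workflow: embed your documents, embed the question, retrieve the closest chunk, and stuff it into the prompt. The `embed()` function below is a random-vector stand-in, so swap in a real embedding model for meaningful retrieval.

```python
# Bare-bones RAG retrieval sketch (generic, not Databricks' workflow).
# embed() is a deterministic random stand-in -- replace it with a real
# embedding model (OpenAI, sentence-transformers, etc.) for real retrieval.
import numpy as np

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)          # unit length, so dot product = cosine

docs = [
    "Our refund policy lasts 30 days.",
    "Support hours are 9-5 Central, Monday through Friday.",
    "Enterprise plans include a dedicated account manager.",
]
doc_vecs = np.stack([embed(d) for d in docs])

question = "When can I call support?"
scores = doc_vecs @ embed(question)       # cosine similarity against each chunk
context = docs[int(np.argmax(scores))]

prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this augmented prompt is what you'd send to your served LLM
```

Managed platforms wrap the same pipeline (a vector database for `doc_vecs`, a serving endpoint for the LLM) behind tooling, which is why it can now come together in an hour instead of a quarter.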
Chris Detzel: It's no longer the case that you have to be super technical, right? Being technical helps, but I just need to know a little bit about how to ask the model to code and what kind of code. I'll give you an example; I've not used this exact example, but something similar.
I was trying to build, and I'm not done with this yet, a way for users within the Dallas-Fort Worth area to find races, 5Ks, 10Ks, half marathons, because that's a big thing on [00:21:00] my website. So I built the whole infrastructure, right? And then it was like, you need an API from this company right here.
It told me that I needed an API to get the data, and all I have to do is go to that company to get access to the API, and there may be a fee. And it told me how to install this API into a website, onto a GoDaddy website, right? This is what Claude 3.5 told me.
And it will build the infrastructure, the HTML, the JavaScript, and the CSS of what that page could look like, and it can connect the API directly to it. If I just had the API information from the company to put into GoDaddy, it would ask for that information, I'd give it that information, and then, voila, it's going to be there.
I can do that. I just have to be somewhat smart about it, and it might take me a few hours to figure it out rather than [00:22:00] months. I can do that through Claude or whatever system you want to use; it doesn't matter.
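For flavor, here's roughly the kind of glue code an assistant walks you through for something like Chris's race finder. Everything here is hypothetical: the endpoint, the parameters, the key, and the response fields all depend on whichever race-listing provider you actually sign up with.

```python
# Hypothetical race-finder glue code -- endpoint, params, key, and response
# fields are all placeholders for whatever provider actually issues the API.
import requests

API_KEY = "your-key-from-the-provider"    # the vendor issues this, often for a fee
resp = requests.get(
    "https://api.example-races.com/v1/events",            # hypothetical endpoint
    params={"region": "Dallas-Fort Worth", "distance": "5K,10K,half"},
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
resp.raise_for_status()

rows = "".join(
    f"<tr><td>{e['name']}</td><td>{e['date']}</td><td>{e['distance']}</td></tr>"
    for e in resp.json()["events"]                         # hypothetical shape
)
html = f"<table><tr><th>Race</th><th>Date</th><th>Distance</th></tr>{rows}</table>"
print(html)  # drop this snippet into a site-builder page, e.g. on GoDaddy
```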
Mike: Yeah. And it's amazing. There are so many tasks like that where the average individual, who has some technical skill but isn't a programmer or engineer, can get so much further now. And even engineers, even people like myself who are doing programming,
Chris Detzel: If I knew what you knew,
Mike: You'd be dangerous.
Chris Detzel: I'd be so dangerous.
Mike: You could learn new things so quickly.
I can't tell you how many hours I've spent on sites like Reddit and Stack Overflow.
Chris Detzel: Yeah.
Mike: Combing threads to try to find an answer to one question that would get me one inch closer to solving my goals. Now I can move at probably 10x the speed on learning something new. And sure, it doesn't get things right every time, there are still a lot of problems, and the more niche you go, the less well crafted the answers are, I would say,
Chris Detzel: but directionally,
Mike: it gets you there, [00:23:00] it gets you there quick.
Chris Detzel: It's going to be
Mike: wild.
Chris Detzel: Yeah, I was just saying, sometimes I run out of questions, because what it does is it says, oh, you've reached your limit, wait until one o'clock to do a scan. I'm like, you
Mike: And you pay for this product. That's how much you've been using it.
Chris Detzel: So, 20 bucks.
Mike: So one thing I would recommend for everyone to do, and you also, is when you go into ChatGPT next, or Claude, there are a couple of prompts you can look up online, but I know one of them just off the top of my head that is really fun to do. Ask it:
What do you know about me? And watch the report it writes. You can ask it specific questions about the things it knows about you, and it will go deeper. It is wild when you start seeing how much that model has profiled you and how much it understands about you. So I think everyone who's listening should definitely go through this, try it out, and just see for yourself how much information these models are collecting.
[00:24:00] And then it's pretty easy to envision what could be done with that kind of information over time.
Chris Detzel: Mike, this has been good, man. I learned a lot, so I appreciate your wisdom and your expertise in this area. Any other parting thoughts?
Mike: No, I'd just say, if anyone has anything else they want to bring to the table on large language models, we're all trying to keep up right now.
Feel free to reach out to us. We'd love to have you on the show, and we'd love to do some more of these kinds of deeper dives on the technical aspects of large language models.
Chris Detzel: All right. Thanks, everyone, for tuning in to another Data Hurdles. I'm Chris Detzel, and please rate and review us. We need all the ratings and reviews we can get.
Mike: And I'm Michael Burke. Thanks for tuning in.
Chris Detzel: Thanks everyone.
