Financial Data Science Guest Lecture at University of Virginia

by Chris Conlan


Chris Conlan: Professor Martinet was one of my favorite teachers when I was in the statistics department. I tried to take all the classes I could with her. So, this has always been a dream of mine to come back and guest lecture at one of her classes. My name is Chris Conlan. I'm sitting here in Bethesda, Maryland, a few hours away from any of you that are still in Charlottesville, and I'm in my office here at Conlan Scientific, where we are a small shop of experts that do financial data science. I know you've already heard a little of this from Gretchen, but I was at U.V.A. in the statistics department. I'm a career small business owner, but in the day to day, I am a programmer. I still write a lot of code, mainly in Python, sometimes in R; I'm very good with both languages. And I write a lot and I try to talk and lecture a lot and get out and educate people. I've done dozens and dozens of lectures and speaking engagements about these sorts of things throughout my career, just because I like interacting with you guys and with engineers.

Here are sort of my top five academic disciplines I care about. If you think about what we do at my business as the intersection of some bundle of subjects, these are the five: mathematical statistics I learned from Gretchen. Machine learning I learned from Jeff Holt (if any of you have taken classes with him). The other things are sort of tangential, but we'll get into more about what those mean. As for the types of roles we have at my company, I'm sure you're interested in those because you'll be applying to jobs like these in the future. I break it out into four distinct roles. Data science is sort of general: the boss is a data scientist that plans the solutions, develops the programs and scopes the projects that we're going to do for our clients. The other three roles are more specialized: people responsible for getting the data, people responsible for making machine learning models, and people responsible for production, optimizing the code and making it fast. It might seem crazy at this point in your education that someone would need to be a data engineer, that someone would just be responsible for collecting and preparing data, and that could be a full-time job. But it might seem more realistic that someone could be a machine learning engineer, where all they do is develop models, tweak models and get them working for people. And then I really emphasize computer science. I think one of the things that differentiates us from other data scientists and other data science businesses out there is that we really care about computer science. I've written and talked a lot about computer science, and it's all about making the code fast.
I cannot tell you how many tips and tricks and bits of knowledge I've picked up that can take certain programs from hundreds of seconds down to one second, or from hours down to minutes, and we try to collect all of that knowledge and spread it across all of our projects, because it can make the difference between something being feasible and infeasible, or overpowered and not powerful enough. So, I'll keep emphasizing computer science throughout.

This is to give you an idea of the other companies that we compete with in the industry we've sort of rolled into. We've been grouped under the artificial intelligence umbrella, and we recently got ranked as the number one artificial intelligence developer in the Washington DC area, which we're really proud of. So, this is just a screenshot of the page where we were ranked number one on this list, but if you check out other companies on this list, you can get an idea of the other types of businesses that do this sort of work. And I think it's interesting to note that we got to the top of the list by doing only finance. We focus completely on finance, but we still got to the top of the list for artificial intelligence developers in general. To give you an idea of what that means, here are our clients. I put them into three groups, and these groups are sort of monikers, not literal: hedge funds, banks, and fintech. What that really means is not just hedge funds, but anyone that invests money; not just banks, but anyone that lends money. And then fintech is for everything else: anyone that's using software in the finance world can be lumped into the fintech category, and that's generally where we do the rest of our work.

So here's a little more detail on that list. It's really investors, lenders and financial services in general. Those are the three categories of clients we'll typically have, and the end goal of a lot of our work is really simple. An investment management firm will come to us and say, help us pick stocks. A lender will come to us (that could be any type of lender; banks lend for a lot of different products: cars, houses, both, anything) and they say, help us pick loans, because lots of people apply for loans but not all of them get one. And then fintech is more just us applying our general experience and knowledge to other financial services problems that people have out there. So that's all the introduction. That's all the background about me.

This presentation is going to be all about what makes financial data different. Why do I think it deserves its own category, its own delineation, even its own profession? Why do the people that work for me carry the title financial data scientist instead of just data scientist? We're going to get into why that is, and keep this picture in your mind the whole time. This is a picture of the price of a basket of stocks: you buy 11 shares of each of a few different stocks. I think these are all stocks from the video game industry. This is what holding a certain basket of video game stocks looks like over the past 10 years; that would have been your portfolio value. And the only thing that's important to remember as we talk about why financial data is different is that this line is wiggly and it's a time series, and we're going to figure out why that is and what we're trying to predict. Here are the main reasons. I'm not going to dive into all of these reasons right now; these are sort of section headers, the six sections of the presentation, and they are all reasons why financial data is different. These reasons will build on each other, and eventually we'll get to an example. First: low signal-to-noise ratio. This is something that makes financial data different. You might have fit a lot of models in your education so far. In your projects, you might have fit things like the Iris dataset or the Wine dataset or the Golf dataset, all the toy datasets from that big repository that we all use when we're in school. You can get really high accuracy on a lot of those datasets, especially the ones that represent naturalistic or physical relationships. In financial data science, your models are going to have low accuracy. They're going to come back with really low r-squared values.
If you're picking out loans for a lender, you might predict the repayment rate on those loans with an accuracy, or r-squared, of 11%, and that actually could represent a massive financial gain for your lending client, because before they had the 11%, they had zero. And maybe before that, their lenders were so out of sync on what they actually wanted to lend that they had a negative r-squared, because they were predicting which loans to give out worse than the average, worse than just using the mean. So it's just a fact of it: you're going to have a lower r-squared. If you're trying to pick stocks, you're going to decide a stock goes up when it actually goes down over the next few weeks. It's very famously said that anyone that can call stocks 51% of the time will make a lot of money forever, because all you need to do is have more wins than losses. It's not quite that simple, but in a lot of ways that saying is true, because you will design trading strategies that call stocks with an accuracy of 53% or 54%, and sometimes those are incredibly profitable. So just accept that there's so much noise (and by noise, I mean useless information) that your models are going to get low accuracy, and you'll have to creatively deal with that problem. At the end of the day, it's fun to make accurate models, but we really care about making money for our clients, and you don't always need super accurate models to do that.
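To make the arithmetic behind that 51% saying concrete, here is a minimal sketch. The symmetric payoff of plus or minus one unit per trade is a simplifying assumption for illustration; real strategies have asymmetric payoffs:

```python
# Hypothetical payoff structure: every win gains 1 unit, every loss loses 1 unit.
# With symmetric payoffs, any hit rate above 50% has a positive expected value.
def expected_value_per_trade(p_win, win_amount=1.0, loss_amount=1.0):
    """Expected profit per trade for a given hit rate and payoff sizes."""
    return p_win * win_amount - (1 - p_win) * loss_amount

edge_51 = expected_value_per_trade(0.51)  # small positive edge per trade
edge_53 = expected_value_per_trade(0.53)  # a noticeably bigger edge
```

Because real payoffs are asymmetric, the hit rate alone does not determine profitability, which is exactly why it's not quite that simple.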

Here's the mathematical argument for what I just explained. This is the famous bias-variance decomposition, except it's tweaked a little bit, because the term on the right is different. The term on the right is the fitted function, the estimate of Y from X, the model of X against Y. When you do the bias-variance decomposition this way, it tells you a very interesting fact about noise, and it gives you a way of thinking about modeling in the presence of a lot of noise. At the very bottom, the error of your estimate is the bias plus the variance plus the irreducible error. So noise in this case is a synonym for irreducible error, because what that is, is information in your dataset that means absolutely nothing. And you can never make it mean something, because it presents no causal, naturalistic or meaningful relationship to Y. So you're going to have a high irreducible error. Which, going back to the last slide, is why you have a low r-squared: because this irreducible error term is big.
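For reference, the decomposition being described is usually written as follows. This is the standard textbook statement; the exact notation on the slide may differ:

```latex
\mathbb{E}\!\left[\left(Y - \hat{f}(X)\right)^2\right]
  = \underbrace{\mathrm{Bias}\!\left[\hat{f}(X)\right]^2}_{\text{bias}}
  + \underbrace{\mathrm{Var}\!\left[\hat{f}(X)\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Better models can shrink the bias and variance terms, but no model choice touches the \(\sigma^2\) term, which is why a large irreducible error caps the achievable r-squared.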

Gretchen Martinet: So Chris, there's a question in the chat: if it has no relation to the Y, why keep it in the model?

Chris Conlan: We will get to that, and that's a really important question. The short answer is that no feature is going to be useful all of the time. You have to think about that one feature as one column of your dataset, and that column is useful sometimes, so you have to keep it in there. We'll build a mental model for thinking about where and how that useless data appears, and it will make sense why it has to be in the model. So here's another reason that financial data is different: all of the useful features are derived features. Say you have a machine learning model and a bunch of financial data. You're trying to call the stock market; you have information about companies, information about prices, information about anything you can get your hands on, because you're going to use anything you can get your hands on that's structured. Well, you can't just throw the revenue into the machine learning model. You can't just give the revenue, the net income, the assets and liabilities, all those numbers, to the machine learning model and let it work its magic. It doesn't work like that with financial data. All of the useful features are going to be derived, which means all of them need to be transformed from their source number, their raw number, into something more useful. Because what is a model going to do if you tell it to just model the revenue? Is it going to say more revenue is more good? Is it going to pick out a specific revenue number and say that 10 million is the best revenue number? You actually could get a model to do that if you fit it thoughtlessly and build your training data thoughtlessly. So we're going to talk about why the features have to be derived.

Here's a fun thought experiment, and it's something that I bring up all the time to people that think this problem is easy. Say you have a bunch of companies, A through G. They represent stocks, and they have these numbers for net income: Q1 net income is one thing, Q2 net income is another. You're going to try to come up with a mathematical formula that ranks companies A through G from worst to best. Give each a score and tell me, according to that score from your mathematical formula, which one has the best net income growth. The naive answer, from the people I start this with, is to divide Q2 by Q1 and call that growth. Well, if you start to look through that table, your answers start to get a little ridiculous. It works for company A, but it really doesn't work for company D, and it really doesn't work for company E. And then you'd have to ask questions like, which is better, company B or company E? Yeah, "it's hard" is the answer. I show people this problem, I ask for solutions, people toy around with the problem, and then I say, okay, it's actually hard. You need more than just two numbers; you need many numbers in a time series to get a good answer. And that's why we have Python scripts and files in our code that have hundreds of lines in them, and all they're doing is finding fancy ways to figure out the net income growth or the revenue growth, because you have to account for all of these edge cases. If you don't, you fill your model with junk data and it doesn't figure out anything valuable. No feature is useful all the time.
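A minimal sketch of why the naive score breaks down, using made-up net income numbers (the actual figures from the slide are not reproduced here):

```python
# Hypothetical Q1/Q2 net incomes (in millions). The naive "growth" score
# is Q2 / Q1, which behaves sensibly only when both numbers are positive.
companies = {
    "A": (10, 20),     # income doubled: score 2.0, reasonable
    "D": (-10, 5),     # loss became a profit: score -0.5, ranked worst!
    "E": (-10, -20),   # loss doubled: score 2.0, ties the genuine grower
}

def naive_growth(q1, q2):
    # The first answer everyone gives: divide Q2 by Q1.
    return q2 / q1

scores = {name: naive_growth(q1, q2) for name, (q1, q2) in companies.items()}
```

Company E's deepening losses tie company A's genuine growth, and company D's turnaround scores worst of all: exactly the kind of edge-case explosion that makes hundreds of lines of feature-derivation code necessary.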

So, an attendee asked the question: if X has no relation to Y, then why would you even keep that column in your data? And the reason is, it's useful some of the time, but it's not useful all of the time. Anyone that's familiar with the stock market would agree with the statements on this slide. You can't just invest in the company with the highest revenue. High revenue tends to be a good thing, but it doesn't always mean that the stock is going to go up, or keep going up, or go up more than other stocks. You can use technical indicators. They work sometimes. They definitely don't work all the time. Anyone that's tried to trade or day trade or follow technical indicators to their grave has had this experience: they don't work all the time, and a lot of people say they work all the time, but they don't. And I'm sure everyone is familiar with the complex macroeconomic scenario in the country and the world. You would think that when unemployment goes up, the stock market falls. That should tend to happen, because unemployment going up should be indicative of something wrong with the economy, which should translate into the market getting a little tighter, a little less risky, a little more conservative, but it does not happen all the time. Anyone that's lived through the past two years knows that it gets complicated. So the reason that we don't remove that column (that feature) from the model is because fundamental information like revenue, technical information like Bollinger Bands, and macro information like unemployment are all useful pieces of information, but if they're useful at T1, they might not be useful at T2, and so on. They fall in and out of usefulness over time as the economy moves.

Here's another reason financial data is different: all of the good data is expensive, and there are two reasons why. One of them is that the data is public, but it is a mess. If you're into finance you've probably heard of the SEC EDGAR portal. Publicly listed companies are required to file their 10-Qs and their 10-Ks, which are their quarterly and annual financial statements: your balance sheets, your income statements, your earnings reports. They dump all these documents into SEC EDGAR as PDF files, and the PDF files are filled with a lot of useless information and some extremely useful numbers of extremely great value, for example, the revenue. You'll go to Apple's 10-Q, and you'll have this big 20-page document, and maybe you only care about five numbers in it, like the net income, the revenue, the cash flow, the assets, the liabilities. Someone has to do all the work to turn that into a workable dataset, a data frame, for example, or a relational database. So this is an example of publicly available information that you're going to have to pay someone else to organize for you, because you don't want to organize those millions of documents before you get started on your financial machine learning project. That's a whole business in itself. And there's so much differentiation in how people do that; there's no right answer in terms of how to turn SEC EDGAR into a relational database. You'll have different approaches to survivorship bias, and different approaches to renamed and overlapping tickers. Sometimes a ticker was used in the 90s and then used again in the 2010s, and it's the same ticker, so you have to reconcile how that data is organized and how you notate the fact that one company existed before, another company exists now, and a company might or might not be in business today. There's a lot of standardization and cleaning to do to get it there.
And there are a lot of data providers that attempt to do this.

There are dozens of data providers that I've seen do this, and there are only one or two that I really like. Whoever you get it from has to agree with your philosophy and your goals, and if you're particular, like I'm particular, it's probably going to be expensive. So that's data that's expensive because it's hard to organize, and that's interesting; the whole data engineering problem of turning a bunch of PDFs into a relational database is interesting. But other data is expensive just because of monopoly pricing power. That's not as interesting. That's just expensive for the sake of being expensive, because only one party has it and that one party wants to sell it for a lot of money. My main culprits there, the ones I'm complaining about, are the NYSE and the NASDAQ. Those data feeds are expensive and have very complicated pricing sheets and pricing schedules. It's really expensive to get real-time data, but it's really inexpensive to get data that's lagged by 20 minutes, which is why you see so much data in the financial news that's lagged by 15 or 20 minutes.

I want to emphasize that good data is expensive. If you go on the internet and search for stock price APIs or foreign exchange data APIs, you'll find a lot of cheap options that are both cheap and bad. And it's important to understand the nuances here, because when you're sourcing your data, you don't want to build your software around data that is not of high quality. At the very end of the project, when you realize the data was low quality, you'll say, "I should have just bought the good one first, because while the data is expensive, you really can't get back your time." So the next reason that financial data is different (and this is one of the reasons that really, really deeply affects the way we practice financial data science and the tools that we're able to use) is that the data is correlated across time. And I'll add that it's correlated across assets, asset classes and groups of assets.

Gretchen Martinet: Is data engineering something that tends to get automated, or does it still involve a lot of manual effort?

Chris Conlan: That's a great question on the topic of good data being expensive. You have to do a lot of research in sourcing the data. You have to figure out who your vendor is, who you like, who agrees with your philosophy. Once you do that, there's still a lot of data engineering to do, because there's no way that the data vendor has the data in the format you need, or that they access and organize it the same way you do. So while we do all this work to be really careful about purchasing and sourcing data and building our software around it, we still need full-time data engineers to work with it, because the data engineer has to bridge the gap between what the vendor provides and the problem we're trying to solve.

Moving on to why data is correlated across time, and how that affects the tools and methods we use and what we can do in the financial data world: you are all familiar with market crashes. The three that stick out to me as confusing and difficult for financial data science to deal with are these. Some events will move the market all in the same direction, all at the same time. We know that there were huge crashes with the financial crisis and the coronavirus, and huge recoveries that followed both of those. Then you have things like the Trump-China trade war, where moving in the same direction means moving sideways all in unison. So it's not always as simple as talking about black swans and crashes. Sometimes you'll have elongated periods like 2018, where there's so much hesitancy in the market that it goes sideways all year and you have no idea what to do with it. What this does is create opportunities for data leakage, because most models assume IID data. IID, if you're not familiar with the acronym, means "independent and identically distributed." In the machine learning world, in a practical sense, it means we assume that one row of our matrix of training data X is independent of and identically distributed with every other row. And you absolutely cannot have that in finance. It's just impossible to consider that you have that. So we have to think about ways of mitigating it, and while we do, we run into a lot of scenarios where we can't use the tools that we've been taught, or the tools that we're used to using. The biggest one that I run into all the time, that trips so many people up when they get into this, is cross validation. Cross validation just breaks when you're dealing with financial data.
It is the easiest way to produce huge quantities of data leakage and give yourself the impression that you have a really fancy, really good model with really good results, when in fact you do not. The reason is that you have to think about what a row of your training data represents. We're building a machine learning model, and a model has to solve a well-designed scientific experiment that you pose to it.

So what does a row of your data represent? It's probably not just going to be the return of Apple tomorrow; that's too small. It could be the return of Apple over the next year. It could be a trade on Apple, where you decide: if something happens on Apple, I'm going to jump in this trade, I'm going to stick with it for a number of days until a certain condition isn't met, and then I'm going to jump out. That's a trade with an unfixed time duration; you don't know when you start it, and you don't know when it's ending. It could be a basket of things: you could have a week of performance on a sector index, which would be a basket of many stocks. And you can also make up stocks. If you think there's some behavior in, for example, the chip market, and you think there's an interesting relationship to model in the difference in price between NVIDIA and AMD (NVIDIA's price divided by AMD's price), you can actually come up with a good model around that, because maybe it represents some underlying relationship in the chip manufacturing market that you want to exploit. So here's why cross validation breaks machine learning models: they don't know anything about time. All they take in is a soup of numbers. You have a huge matrix (big X), which is just a soup of numbers, and you have a vector (Y), which gives a label for every one of the rows of X. You're going to leak data from your in-sample data into your out-of-sample data if you train your model using basic, out-of-the-box cross validation techniques. Any off-the-shelf model that has a cross validation method built into it will typically random-sample from the rows to build a bunch of in-sample and out-of-sample cross validation sets.
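As one illustration of a variable-duration trade becoming one row of training data, here is a minimal sketch. The trailing-stop exit rule and all the numbers are hypothetical, chosen for illustration, not the firm's actual method:

```python
def trade_return(prices, entry_idx, stop_frac=0.05, max_hold=20):
    """Enter at prices[entry_idx]; exit when price drops stop_frac below the
    running peak, or after max_hold bars. The duration is unfixed by design:
    you know when you enter, but not when you will exit."""
    entry = prices[entry_idx]
    peak = entry
    last = min(entry_idx + max_hold, len(prices) - 1)
    for i in range(entry_idx + 1, last + 1):
        peak = max(peak, prices[i])
        if prices[i] < peak * (1 - stop_frac):
            return prices[i] / entry - 1.0  # stop condition hit: exit here
    return prices[last] / entry - 1.0       # time limit hit: exit at the end

# One label for one row of training data: the realized return of one trade.
label = trade_return([100.0, 105.0, 110.0, 104.0, 103.0], entry_idx=0)
```

The label of the row is the realized return of the whole trade, however long it lasted, which is one reason rows of financial training data are not interchangeable IID samples.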
That's really bad, because when you do that, you're probably sampling a little bit of data from each year. The machine learning model doesn't know about these years, but we do, and we know there's data from each year, 2017 through 2020, in your in-sample dataset, and that it's then going to test on more data from 2017 through 2020 in your out-of-sample test. You can think about this as just one in-sample and one out-of-sample test. If, on the chart on the left, all of your training data covers all the years in your test set, you will create data leakage and you will get a really good result--an unrealistically good result. Because the information is correlated across assets and across time, your model is not really going to learn anything general about the data in the test set; it's going to learn about what happened in each year. It's going to learn what happened in 2017, what happened in 2018, and what happened in 2019, and it's going to apply that to all of those correlated stocks within those same years, and you're going to get a result that's too good, because in this case we would just be teaching the model what happened over that span of time rather than something that generalizes to the stocks in our test set. That's not good; it's not realistic. So for something that's time-aware, something that's different in a cross validation sense, you would do four tests, one for each year, and on each test you would hold an entire year out of your training set. If you think of one column of this table as the training and testing sets, that's your in-sample and your out-of-sample. You want to hold an entire year out-of-sample. Your model doesn't know about that year, but you do, and you keep that data outside of your training set in order to make sure that your results are valid. Then you do that four times over, and it gives you a much more accurate idea of what your actual model accuracy is.
Think about it: if you're going to train a model and use it in the real stock market, I don't know what 2022 is yet, and I don't know what the market behaviors of 2022 are yet. So I'm going to prepare a model that's able to learn general market dynamics regardless of the span of time that I'm in, or some event like 2008, the coronavirus crash, or the 2018 Trump-China trade war. I'm not going to teach it about the Trump-China trade war; I'm going to teach it to be resilient to anything, because that's all I can do. It's not quite as simple as just holding out data from different time periods, so to speak, but this is how you have to start thinking about it, and this is the schematic for proving that off-the-shelf algorithms will fail you.
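The four-tests-one-per-year schematic can be sketched as a leave-one-year-out splitter. The row format and the 2017-2020 year range are assumptions for illustration:

```python
def leave_one_year_out(rows, years=(2017, 2018, 2019, 2020)):
    """Yield one (held_out_year, train, test) split per year. The entire
    held-out year is excluded from training, so nothing the model learns
    about that span of time can leak into its own evaluation."""
    for held_out in years:
        train = [r for r in rows if r["year"] != held_out]
        test = [r for r in rows if r["year"] == held_out]
        yield held_out, train, test

# Hypothetical rows: one observation per (ticker, year).
rows = [{"ticker": t, "year": y} for t in ("AAPL", "MSFT", "NVDA")
        for y in (2017, 2018, 2019, 2020)]
splits = list(leave_one_year_out(rows))
```

Contrast this with random k-fold, which would scatter every year across both sides of the split, so a model tested on 2018 rows would already have seen correlated 2018 rows in training.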

Essentially, what I've been saying is that the training data is just a soup of numbers that doesn't know anything about dates or times. You know about the dates and times, and you know that there's serial correlation between those periods of time in history and between assets within those periods, so you have to do the work to hold out that data. Here are some other things that cause data leakage: other things we commonly use through machine learning packages or libraries that just fail because of the time-correlation issue I've described. I know I'm being big and scary; I'm telling you to forget all the tools that you've learned. But that's the honest truth of it, and you will have to sort of re-learn the basics and learn how the engine of the car works in order to apply some of this knowledge to the financial world. Because up to this point, you can get through an entire undergraduate machine learning education and not deal with time series, and that's all we're doing here. Here's the rule of thumb: you import the machine learning package, pick your favorite (for me, it's Python), and you throw out everything except the fit function and the predict function. All you can keep is the thing that maps X to Y using an algorithm. All of those quality-of-life tools, all those extra little features they throw in there for imputation and missing data and cross validation, throw those out, and then you just have to use common sense. I'm fearmongering, I get it; I'm telling you to throw out everything that you know. But I'm going to show you a way to think about that data that will make it seem much less scary and package the problem up into a much smaller problem.

I've been writing a book called the Financial Data Playbook that attempts to explain a lot of these topics without code. So I've been thinking very mathematically about what these problems mean and how to condense them into something analytical, something I can put on paper to explain how they work. This is a preview of an argument from that book. Is there any familiarity with the XOR problem here? A classic machine learning problem. Maybe you've seen it, maybe you haven't. You have a matrix of features X and a vector of labels Y, and you've got to use X to predict Y. This is the entire dataset; it's not a truncation, it's the whole thing. What this dataset does, if you give it to a machine learning model, is tell you whether that model is capable of learning nonlinear relationships, because this is the most nonlinear thing ever. If a row sums up to one (one column is a one and the other is a zero), then Y is one: a very nonlinear relationship. A linear regression or a multiple linear regression cannot learn this relationship; it would need nonlinear interaction terms to learn it. And then the classic suite of other machine learning algorithms, all of the nonlinear stuff like neural nets, decision trees, SVMs, could all learn this, because they see it in a more spatial dimension and because they are nonlinear by definition. So, if you can solve this, you're a nonlinear machine learning algorithm. Here's a step towards what financial data looks like, which is essentially generalizing this problem. In the generalized version of the XOR problem, you can set some different parameters. You have an arbitrary number of columns, and, say, if there are ten 1s in a row, Y equals 1. You can parameterize it in a bunch of different ways; this is just a toy problem.
But for the sake of this example, if a row has ten 1s, Y is 1. This can be solved by the same set of algorithms that can solve the XOR problem, all the nonlinear ones. The reason it becomes relevant to financial data is because it starts to look a little like financial data, and then you really start to understand the number of parameters required for a model to solve it. The number of parameters required is related to the size of the data required, which is related to which models are actually feasible for solving the problem. So, for example, a neural net would need a lot of parameters, in the hundreds of thousands, to solve something like this. A decision tree would need a few hundred parameters. A generalized linear regression with interaction terms would need dozens of parameters. And that's a lot, because the original XOR problem takes very few parameters to solve, because it's so small. When you start to blow up the number of columns, the number of parameters required by the models might increase exponentially, cubically, on the order of 2^N, or by an N-choose-K binomial pattern (I go into more detail about this in the book), and this becomes impractical for a lot of algorithms, including neural nets, once the number of columns starts to exceed five. Here's the financial data version, and this takes it all the way back to the question from earlier: if X has no relation to Y, why do you keep that one column in X? Here's the answer. This is what financial data looks like. It's the same exact problem, except you're going to fill the white spaces with noise, and when you fill the white spaces with noise, it represents useless data, a feature that is not useful right now. At the end of the day, all the features are useful, but a white space represents a feature that is not useful within that row.
And the reason I formulated the problem this way is that when you have around 20 columns, it really looks like a financial data set, and it starts to exhibit a lot of the behaviors that we observe within financial data sets. One of the great reasons I bring up this data set is that it helps us figure out which models are feasible or infeasible for this type of data. So immediately, linear regression with interaction terms doesn't work, because it has an N choose K relationship to this data; it becomes way too big. And then immediately neural nets stop working here too. There's a lot of detail about this in the book and I'm happy to talk about it with you guys, but immediately neural nets stop working on this type of data, and it can be proven mathematically, just on paper, based on how neural networks scale. What does start to work, and what becomes very helpful in this scenario, are ensembles of decision trees: your XGBoosts, your random forests, your AdaBoosts; any ensemble of decision trees starts to solve this problem much more quickly with far fewer parameters. That reflects the real world really well, because those are the types of models we tend to use in financial machine learning. Theoretically, all of the nonlinear algorithms we've talked about are able to solve this problem, but practically they're not, because at 20 columns, for example, they all require a lot of parameters, and a neural network will require something like 10 million times more parameters than a random forest to solve this problem.
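As a rough sketch of the kind of data being described, here is a toy construction of my own (the exact rule from the lecture slide isn't specified, so the details are assumptions): in each row, one randomly chosen column carries the signal that determines the label, and every other column is filled with noise, i.e. useless within that row even though every column is useful across the whole data set.

```python
import random

def make_noisy_rows(n_rows, n_cols, seed=0):
    """Toy 'noisy X' data set: per row, one randomly chosen column
    holds a strong signal whose sign decides y; all other columns
    are Gaussian noise (the 'white spaces' filled with noise)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_rows):
        row = [rng.gauss(0, 1) for _ in range(n_cols)]  # noise everywhere
        j = rng.randrange(n_cols)           # the informative column for this row
        signal = rng.choice([-2.0, 2.0])    # well-separated signal value
        row[j] = signal
        X.append(row)
        y.append(1 if signal > 0 else 0)
    return X, y

# 100 rows, 20 columns: roughly the shape where the problem starts
# to behave like a financial data set.
X, y = make_noisy_rows(n_rows=100, n_cols=20)
```

A tree ensemble can split on whichever column happens to be informative in each region of the data, while a linear model would need interaction terms covering every combination of columns, which is the feasibility gap described above.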

All right. I'll get into some resources, things you can look at if you're interested in this type of problem or in getting into financial data science. This is my latest book on algorithmic trading. I recommend it a lot because it has a data source that I completely own that no one can mess with; it will always be there, and you can create a financial machine learning pipeline with it. I also recommend it because there is an open source piece of code that goes with this book, and whenever we get a new client, a lot of times we'll just take the open source code for this book, throw it in a new GitHub repository, and start there. It's the boilerplate and a set of tools that you can use to start building out really big financial data projects. I have to bring this one up because I know you guys are using R in this class. This book's pretty outdated in the sense that the data source no longer works, so it's kind of difficult to replace the data source, and there's not really any mention of machine learning in it because of how much further in the past it was written. But I just wanted to share it with you guys. When I talk about computer science, I make everyone that works for me read this one. It's about writing fast Python code. It essentially fills all the gaps in computer science knowledge that you might have as a data scientist so that you can make your code hundreds of times faster. And I know that sounds like a strong sell, but in computer science, that's what you're dealing with: using the wrong data structure for a problem can result in something taking hundreds or thousands of times longer than necessary to compute. Really short book, really valuable, highly recommended. I'm about to publish the Financial Data Playbook, which presents the arguments you just saw, for example about the noisy X problem, and how to think about financial data in a machine learning context. Coming out soon.
And then here are some books written by other people that I think are important. This is a very important book by a very esteemed professor and career quant. I don't recommend the code examples, but there are a lot of really important ideas in it. Despite it being a big and frustrating book, I still reference it a lot. Then this one's similar to the previous one, but it's more specific to stocks. 10 Cues in 10 Cases of Investigating Fundamentals: there's a huge survey of academic literature in here. Not all of it's useful, and the book is extremely pessimistic about a lot of it, but it is a good quality book, and it's written in R, so it might be more relevant to this class.