Goodbye Power BI

by Chris Conlan

Transcription:

Chris Conlan: My name is Chris Conlan. I am the president of a data science company here in Bethesda called Conlan Scientific. I will get my bragging and self-promotion out of the way. We do financial data, and we were recently ranked the number one artificial intelligence company in the Washington DC area by Clutch, which we're pretty proud of. They put us on this fancy chart here and then they put us at the top of the list. So, we've been really proud of that. But that's not what I'm here to talk about today.

My presentation is called Goodbye Power BI because I recently had the opportunity to build a data visualization tool for the government agencies HHS and the CDC. It's about the HIV epidemic. I know, it's not about the coronavirus, which is what seems to be on everyone's mind when they think of the CDC lately, but they're still handling the HIV epidemic, which is ongoing. We had an opportunity, and a really good budget, to build a data visualization dashboard for those agencies. And when I surveyed the landscape of data viz tools out there, I saw a lot of what I thought were clunky dashboard projects on a lot of these federal and state agency sites. I'm going to show you some of the worst examples of them now. I'm intent on making some enemies tonight, because I will probably be criticizing Power BI and Tableau and ArcGIS a lot. It's not because of anything specifically wrong with the products themselves. There's a common design paradigm in the web-based versions of those products, where people host these dashboards, that produces a lot of unnecessary lag, unnecessary data manipulation, and unnecessary clunkiness that they really could just do away with. We built something from scratch using just vanilla JavaScript, vanilla HTML, and vanilla SVG that did away with all that clunkiness and created an incredibly snappy experience. I'm going to show you how I did that and hopefully inspire some people in the data visualization world to maybe not use some of those off-the-shelf tools, and instead use some of the more bedrock technologies of the web, like HTML and SVG and vanilla JavaScript, to make really sharp, snappy experiences, especially when the data is relatively small.

We'll start by looking at some of those sites and some of the performance issues on those dashboard sites. I have not discriminated here. I've just googled Maryland data dashboards and clicked through some of the first results, because if I do that, I can find some of these slow and clunky dashboards. This is the Maryland COVID-19 data dashboard. I'm going to go ahead and open up Chrome DevTools and open the network tab. I'm going to look at this little set of numbers on the bottom left. It shows the number of requests, the amount of data transferred, and how long the page took to load. I'm going to go ahead and hard refresh this page, load everything from scratch, and we're going to look at how many HTTP requests it makes and how long it takes to initialize the page. So, you see the spinning thing, you see the Power BI spinning thing, you see that network tab absolutely filling up. And we finished in 5.3 seconds and it's still making more requests, stuff is still populating. Power BI, calm down. Let me catch up. And we're still making requests. So it's now taken 20 full seconds to render the page fully, so that someone can start interacting with it, and it's made 305 HTTP requests. Now, that's a big clunky website, but I don't have a huge problem with that, because now that that 20 seconds is up, I can actually start exploring the site. Here's where it really starts to bother me. Every single thing I do in this dashboard generates more HTTP requests. I'm just scrolling in and out of the map, and it generates more requests. I'll click on anything, and it generates more requests. That's why it feels clunky, because at the end of the day, when I'm doing something that requires me to make an HTTP request from my computer here in Bethesda to wherever this data center is, Northern Virginia, Frederick, or wherever, that takes a fixed amount of time. And there's no way to speed that up, no matter how fast your system is.

So I'll show you some more. Like I said, I don't know anything about the programs behind these dashboards. I'm indiscriminately picking from the top of Google, and I'm showing you how every single one of them, even though they all operate on relatively small data, insists on making hundreds and hundreds of continuous requests and providing a really clunky end-user experience. So I'm just clicking on different counties. It's making about 12 requests every time I click on a county, and it's taking about half a second to populate about four floating point numbers, which is bonkers. That's the only reasonable word to describe what's happening here. Another argument I have about this is that you've got to think about: where is the work happening? Where is the data processing happening that's allowing me to view this information? It's happening on the server side. Every time I click anything here, it dispatches a request to the server, a query with a long query string; the server filters through the data and sends me back some information. My problem with that is that JavaScript on a single CPU core, using this data set, which is probably 2 MB, could do that faster than the HTTP request. Way faster; 100 times faster. So, you're starting to get the idea of how we designed our site, and in a minute you'll see how we did it.
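To put that claim in concrete terms, here is a minimal sketch, not the actual AHEAD code, of what filtering looks like once the data has already been fetched into the browser; the file path and field names are invented for illustration.

```javascript
// Hypothetical sketch: filter a ~2 MB dataset that has already been fetched once.
// Scanning ~20,000 records in memory takes on the order of a millisecond,
// while a single HTTP round trip is typically tens to hundreds of milliseconds.
async function loadData() {
  const response = await fetch('/data/statedata.json'); // fetched once, up front
  return response.json();
}

function filterRows(rows, indicator, county) {
  // Plain Array.prototype.filter; no server involved.
  return rows.filter((r) => r.indicator === indicator && r.county === county);
}

loadData().then((rows) => {
  console.time('client-side filter');
  const subset = filterRows(rows, 'new_diagnoses', 'Montgomery County');
  console.timeEnd('client-side filter'); // typically around a millisecond
  console.log(subset.length, 'matching rows');
});
```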

I'll just, you know, scroll through some more atrocious examples to really hammer home my point. Other people suffer while waiting for these things to load all the time, so I figure we should at least suffer a little bit to experience their pain. Wow. This one's taking a really long time. Still going… Great, 16 seconds. 270 requests. It finally stopped. You guys know what tooltips are, right? You hover over a data point and you see its value. Why on earth is this thing making a request to show me a tooltip? The data is already there. I click anywhere on this line and it makes four requests. Computers are fast. Networks are still slow. They'll always be slow, because they have a lot of physical space to traverse. This is our widget; I'll give you some context. This is a new initiative called the AHEAD initiative, America's HIV Epidemic Analysis Dashboard. I didn't build this website. I just built the little widget. My colleague and I built this widget, just everything above this green blob and below this blue blob, and we very carefully engineered it to be one of those interactive, have-it-your-way, high-transparency government data analysis dashboards. Except we very carefully engineered it to be really fast, and it avoids all the problems that I just complained about for the past five minutes.

We'll open the network tab and I'll show you the proof; I will load this from scratch. 131 requests. That's a lot. At least it only took 1.9 seconds. But that's not even my fault; I'm not the web developer, I'm just the chart developer. Here's the kicker. When I click around anywhere, there is no lag. I'm going to select a bunch of counties, I'm going to select a data stream, and there will be no lag and there will be no web requests. You see how that request counter is not going up. You see how that 4.2 MB is not going up, no matter what I click. I'll select a bunch of completely different information. I'll select different indicators, and I know you guys might be interested in what story this data is telling. There's a lot of interesting stuff to dig into here about HIV and about the initiative to end it. That's the real point, but I'm going to be a nerd over here and just freak out over the architecture that allows it to be fast. I focus on engineering it to make it fast and snappy, and then epidemiologists and public health officials can use it to solve the problem. That's what we're bragging about today, so you'll see that request counter never went up. I switch between a ton of different data streams, and it happens in a snap; there's no waiting. The key to doing that is that there is a 2 MB payload of JSON data that is loaded into the browser when you open the page, and you load more than you need. It's government data, it's public anyway, so we load every inch of that data. Once we've loaded that data, if you want to filter it or query it or dissect it or view it in a different format, view it as a chart, view it as a table, download it as a PNG, and so on, we can do all of that. No matter how you want to slice it, we're not going to make additional requests on you, because the data is already there, so we don't need to ask the server for it. We're never making the server do any work; we're just doing it in your web browser.
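As a rough illustration of that architecture, and not the widget's real source, the pattern looks roughly like this: fetch the full JSON payload once when the page loads, keep it in memory, and have every control re-render from that in-memory object instead of firing new requests. The element IDs, payload shape, and drawChart stub below are all hypothetical.

```javascript
// Hypothetical sketch of the fetch-once, render-locally pattern.
let dataset = null; // the full ~2 MB payload lives here for the life of the page

async function init() {
  const response = await fetch('/data/ahead-payload.json'); // the only data request
  dataset = await response.json();
  render(); // initial draw
}

function render() {
  // Read the current UI selections, slice the in-memory data, redraw the chart.
  const indicator = document.getElementById('indicator-select').value;
  const county = document.getElementById('county-select').value;
  const series = (dataset[indicator] && dataset[indicator][county]) || [];
  drawChart(series);
}

function drawChart(series) {
  // Placeholder: a real implementation would build <svg> elements here.
  console.log('drawing', series.length, 'points');
}

// Every interaction re-renders from memory; the request counter never moves.
document.getElementById('indicator-select').addEventListener('change', render);
document.getElementById('county-select').addEventListener('change', render);

init();
```

The design choice is simply that interaction handlers never touch the network; the only fetch happens in init, and everything after that is a pure function of the in-memory payload and the current UI state.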

Theoretically, we're being a little mean, because we're offloading the work to your web browser rather than our server. But your web browser can handle it, because the one CPU core running it at 2.53 GHz can easily filter through this 2 MB data set in a snap and a click. That is the meat of the presentation, but I would also like to show you what that data set looks like. This is all public data. Here's the JSON, one of three JSON datasets that are generated for the charts. It's a bunch of time series. It might seem big; it's about 20,000 rows if you open it, and it looks kind of big because of that. But ultimately this payload is just 500 kilobytes, half a megabyte. I can also show you some of the CSVs if I can guess their names; working within Zoom is a little difficult. It just keeps downloading it on me. Let's try statedata.csv… Well, Chrome is not letting me display it in my Zoom window, but trust me, it's a 500-kilobyte CSV file, and it gets mangled and translated into this JSON. So you might ask yourself: why doesn't everyone do this? Is it easy? Why aren't all these clunky dashboards just doing the same thing? You've all worked with Power BI or Tableau to some degree. These Power BI and Tableau platforms are drag-and-drop to an extent; they're no-code to an extent, so they can be deployed that way. The other reason is the data wrangling and the mangling.
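To give a flavor of that wrangling step, here is a hedged sketch, under assumed column names and an assumed grouping, of how a flat CSV like statedata.csv might be pre-computed into a nested JSON payload with Node.js; this is not the project's actual build script.

```javascript
// Hypothetical Node.js sketch of the "wrangling and mangling" step:
// read a flat CSV and pre-compute the nested JSON payload the charts read from.
// Column names and grouping keys are invented; the split(',') parser assumes
// no quoted commas, which a real build script would have to handle.
const fs = require('fs');

const lines = fs.readFileSync('statedata.csv', 'utf8').trim().split('\n');
const header = lines[0].split(',');

const payload = {};
for (const line of lines.slice(1)) {
  const row = Object.fromEntries(line.split(',').map((v, i) => [header[i], v]));
  // Group by indicator, then by geography, so the browser can look values up by key.
  payload[row.indicator] = payload[row.indicator] || {};
  payload[row.indicator][row.geography] = payload[row.indicator][row.geography] || [];
  payload[row.indicator][row.geography].push({ year: row.year, value: Number(row.value) });
}

fs.writeFileSync('statedata.json', JSON.stringify(payload));
```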

We do limit you somewhat in how you query and filter this data, because of the structure of the JSON that we compute ahead of time. In other words, we can send you the data payload, but what we can't do is send a relational database to your browser so you can query it on the fly and do that manipulation; I don't know if the technology will ever be there. So, we do somewhat limit you in what you're allowed to query, and we've laid that out very thoughtfully based on the way the JSON is structured. It's structured so that the JSON object, when it's in memory, is ultimately a hash table. The JSON data is structured in a very specific way to allow you to query the specific things that we thought would be useful to people. You have six data streams, which are these indicators; you have them across time, and there are really only six distinct time units here that you can use; and then you have them filtered across things like demographics, states, and counties. One drawback is that if this data gets too big, it might not be snappy. If you're trying to pull out every single number above 50, that might not be very snappy, because that requires us to iterate over every single number in this 20,000-line file in order to pull those numbers out for you. That might not be fast. It still might be faster than HTTP requests, but it wouldn't scale to 10 megabytes. My core argument here is that this is the way to do things if the total size of all the data that you're going to send to the client's computer is less than 10 MB, because that's what browsers can handle and that's what a single CPU core can effectively filter through, if you thoughtfully structure the data that you send to the end user. A sketch of what that keyed structure buys you is below. I'll turn it over to the audience for questions.
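To make that tradeoff concrete, here is a hypothetical sketch of the difference between the queries the pre-computed structure supports cheaply and the kind it does not; the key names are invented and do not reflect the actual payload schema.

```javascript
// Hypothetical payload shape: indicator -> geography -> array of { year, value }.
// Queries that follow the pre-computed keys are just hash-table lookups:
function getSeries(payload, indicator, geography) {
  return (payload[indicator] && payload[indicator][geography]) || []; // effectively constant time
}

// An arbitrary query that cuts across the structure, such as "every value above 50",
// has to walk all ~20,000 entries. That is still fast today, but it is the part
// that stops being snappy as the payload grows toward 10 MB and beyond.
function valuesAbove(payload, threshold) {
  const hits = [];
  for (const indicator of Object.keys(payload)) {
    for (const geography of Object.keys(payload[indicator])) {
      for (const point of payload[indicator][geography]) {
        if (point.value > threshold) hits.push({ indicator, geography, ...point });
      }
    }
  }
  return hits;
}
```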

Q: So one question was, do you deal with live data that's updated continuously or does your platform deal with completely static data?

A: It's completely static, in the sense that we receive quarterly updates from the CDC, and we, as the developers, are solely responsible for integrating that new data and preparing essentially this new JSON payload. So, it doesn't happen at the speed of real time. It happens at the speed of agile and the speed of DevOps, because we essentially push this JSON payload up to a Git repository, and that's what causes it to be updated on the site.

Q: How can you convince somebody to use your platform, or care about the difference between your platform and other platforms that are drag-and-drop or are heavier on the back end?

A: I admitted early on in my talk that this was a well-funded project and that we had the appropriate time available to do this. I'd like to believe that my company has the specific experience necessary to accomplish this quicker than other people, so hopefully we can convince people that it's a good idea. But I would like to see this design pattern done more. So, I would encourage people to just acquaint themselves with the bedrock technologies of the web, like SVG, because they can meet the needs of the vast majority of data visualization projects.