Analyzing Analytical Databases

Jess Archer

Jess demonstrates the problems you can run into when you have a lot of data in an OLTP database like MySQL and how OLAP databases like ClickHouse can save the day.

Transcript

00:00:00 I'm Jess Archer, and today I'm here to talk to you about analytical databases. These are a bit different from the type of databases we would normally use when building a web application. I'm going to demonstrate the types of query performance problems that they're designed to solve, a little bit about how they work, and how they're different from databases like MySQL. My goal is to show you what is possible so that you'll know what to look for if you come across the types of problems that they're designed to solve.

00:00:30 It's a pretty niche topic, so half the battle is just knowing what even exists out there and what magical search terms you've got to enter to find them. Now, I'm going to be running database queries live on my laptop today. I've got two different database servers: MySQL and an analytical one we'll talk about later. I'm going to have a bunch of SQL queries, but don't worry about the syntax specifics, just come along for the ride. So first, let's talk about the data set we're going to be using: Stack

00:01:01 Overflow publishes a data export of all of their posts online, free for anyone to download. So I've downloaded that data, up to the end of March this year; it's 21.4 GB compressed with 7-Zip. As I said, I've loaded that into two different databases, both running on my laptop, so same hardware, and we're going to use that to test our queries. So first let's talk about MySQL, and we'll just explore the schema a little bit so that we know what we're working with. We're going to run

00:01:34 SHOW CREATE TABLE posts, and we get the results. Now, there are a lot of columns here; this data set's got a lot of different columns, so we're just going to focus on a couple of them. We're going to look at the Id column, the primary key, and we're going to look at the PostTypeId. In this data export they've got all the questions, all the answers, wiki posts, everything, all in this one export. I don't know if this is how they're running it in production or if this is just how they have it in the

00:02:00 export. A quick note here: I wouldn't normally advocate for an enum. They're a little bit annoying to change; you've got to run a migration if you want to change the values in them. But if you've got a really, really big database, then this can actually be a performance improvement, because you're not storing these big long strings over and over and over again. We've got the creation date, the date that the post (the question or the answer and so on) was created. We've got the view count; they've generously

00:02:26 given us the count of views each post has had, accurate up until when the data was exported. We've got the full body of the post, a big MEDIUMTEXT column, and we've also got the title. Then we've got a couple of indexes down the bottom here that I've added to try and give MySQL the best helping hand possible. We've got a primary key on our Id column, we've got an index on our PostTypeId (that's defining whether it's a question, an answer, and so on), we've got an index on the view count, an

00:02:56 index on the creation date, and a compound index on the creation date and the view count. The engine here is InnoDB; that's kind of the default one. So first off, let's try and find out how many rows we've got in this table. I'm going to get this query running now. You'll see I've got a timeout of 10 seconds; this query takes anywhere between two and a half minutes and four minutes to run, and we don't have all day, so I've set a timeout on that one, and we'll find that we cannot find this out very quickly. So how else can we find out how

00:03:26 many rows are in this export? We can use the MySQL information_schema tables. If we run this query, we'll see that there are 66 million rows, we've got 134 GB of data, and the indexes I've created are about 6 GB. This row count, though, is completely wrong; it's just an estimation. You can ask MySQL to recompute it, but for this data set it just flip-flopped between two values that were both completely wrong.
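For reference, a minimal sketch of the kind of estimate query being described here, assuming the table is named posts and lives in a schema called stackoverflow (both names are assumptions):

```sql
-- Hypothetical names: adjust table_schema/table_name to match your setup.
-- table_rows is only an estimate for InnoDB tables, which is why it disagrees
-- with the real count.
SELECT
    table_rows,
    ROUND(data_length  / 1024 / 1024 / 1024, 1) AS data_gb,
    ROUND(index_length / 1024 / 1024 / 1024, 1) AS index_gb
FROM information_schema.tables
WHERE table_schema = 'stackoverflow'
  AND table_name   = 'posts';
```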

00:04:00 I've also got this data set loaded up into another database that we'll look at more later, and I'm going to run that same SELECT COUNT(*), and in 2.1 milliseconds we can see that there are actually 59 million rows in here, so nearly 60 million rows. First, I want to introduce you to a term that was pretty new to me: OLTP. It stands for online transaction processing, and the key word here is 'transaction'. Although it was a new term to me, it describes something that I think most of us are familiar with, which is databases like MySQL, Postgres, SQLite, etc. These databases are pretty general purpose; they're the ones

00:04:37 we kind of reach for by default, but they've been optimized for short, atomic changes of state, the kind of create/read/update/delete sort of workflows. The word 'transaction' in OLTP doesn't refer to BEGIN TRANSACTION / END TRANSACTION in SQL queries; it's more like a bank transaction or something, where you're generally operating on one row at a time. So let's run some OLTP, or transactional-style, queries; these are the queries MySQL is really, really good at. First up, let's have a look at

00:05:05 the very first post in this data set. We're just going to select a couple of columns, order by our primary key, and LIMIT 1, and if we run this query, in 1.2 milliseconds, with 60 million rows, we find out that the first post is 'How to convert Decimal to Double in C#'. Now, the Id is 4, so presumably there were maybe some test posts beforehand; I don't know a whole lot about the history of this data set, but we can make some assumptions. We can also see the view count, and the creation date was July 2008.
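A sketch of that first-post lookup, assuming the column names from the Stack Overflow dump (Id, Title, ViewCount, CreationDate); the exact columns selected in the talk may differ:

```sql
-- Walks the primary key index, so it only touches one row.
SELECT Id, Title, ViewCount, CreationDate
FROM posts
ORDER BY Id ASC
LIMIT 1;
```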

00:05:35 So that's the very beginning of this data set. I can probably run this query again and it might even get under a millisecond if we're lucky; not today, but still very, very fast, again with 60 million rows. Now let's have a look at the top five most viewed posts of all time on Stack Overflow. We're going to select the same columns, but this time we're going to order by the view count, and we had an index on that view count, so we're going to order by that descending and just grab the top five. 1.1 milliseconds, and we can see

00:06:05 that the most popular, most viewed post on all of Stack Overflow is 'How do I undo the most recent local commits in Git?' I think this is one we have probably all been through, so this view count of 13 million, nearly 14 million, I imagine nearly every one of us in this audience has contributed to that number, those of us on stage included. We've also got 'How can I remove a specific item from an array in JavaScript?' (I've been there), 'How do I delete a Git branch locally and remotely?' (okay, yep), 'Find all files containing specific text on Linux', and 'How do I

00:06:35 revert a Git repository to a previous commit?' So, a lot of Git questions; Git gets pretty hard, especially when you make a mistake. These are all kind of 'how do I undo something that I broke'. But the main thing here is that we got this result back very, very quickly with MySQL. How about looking up by primary key? This is again another thing MySQL really excels at. We're going to select all of the columns for that most popular post, and again we can run this one really quickly. I'll try and run it a few more times; this one can run in under

00:07:04 a millisecond as well, but not today. How about we try and figure out the extent of this data set? Let's find the minimum and the maximum creation date. We're doing some aggregation-style queries here, but we have an index on that column, so we can run this one super fast. This one did run in under a millisecond, and we can see that the date range for our posts is from the 31st of July 2008 up until the 31st of March this year. How about the highest Id, doing a MAX on the Id? Again, MySQL is really,

00:07:37 really fast at this because we have indexes. We can see our max Id is over 78 million, so my assumption is that a lot of posts have been deleted, because we only have about 60 million rows in this table. How about doing an average on the view counts? We have an index on that one. I'm going to get this one running; it can take four and a half seconds to about 10 seconds depending on whether my laptop's plugged into power. So, 5.9 seconds to calculate the average view count, even when we have an index. Now,

00:08:07 depending on why you need this value, maybe this is fine. If you have your boss asking you for this value, that amount of time is absolutely fine, but if you're building a dashboard or a real-time application, this is going to be far too slow. How about something a little bit more complicated? We're going to do a GROUP BY on the creation year and just find the highest view count for a post in each year. We saw that the view count was very fast to do a MAX on, but not once we do this GROUP BY.
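Hedged sketches of the two aggregation queries being compared here, again assuming the dump's column names; the talk's exact SQL may differ slightly:

```sql
-- Average view count across every post: ~6 seconds even with an index,
-- because every index entry still has to be read.
SELECT AVG(ViewCount) FROM posts;

-- Highest view count per creation year: the GROUP BY pushes this to ~13 seconds.
SELECT YEAR(CreationDate) AS year, MAX(ViewCount) AS max_views
FROM posts
GROUP BY YEAR(CreationDate)
ORDER BY year;
```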

00:08:36 I should have kicked this one off again as well; it can take about 12 seconds, but sometimes over 20 seconds. There's a lot of variation depending on what else is running, whether I'm doing screen sharing, all sorts of different things. But, there we go: 13.7 seconds to run that query. Again, depending on the use case this may be fast enough, but maybe it is not. What else have we got? All right, so I want to introduce you to another new term I learned, which is OLAP. This one stands for online

00:09:10 analytical processing, and the key word here, of course, is 'analytical'. OLAP databases have been optimized for aggregating large amounts of data; they're for use cases like reporting, analytics, business intelligence, logs, all those sorts of things. Some examples of these types of databases are Redshift, Snowflake, BigQuery, and SingleStore. Now, there are a lot of proprietary databases here, generally pretty expensive, and you've got to run them in the cloud. SingleStore is kind of an

00:09:43 interesting one; it's a hybrid OLAP/OLTP, they've kind of figured out how to make both work. But today I want to talk to you about my favorite OLAP database, and this one's called ClickHouse. The name comes from 'clickstream data warehouse'; clickstream is an analytics term, basically where people click on web pages, and you store that in a data warehouse. It was originally built for Yandex Metrica, which is the second largest web analytics service after Google Analytics, but these days it's used by companies

00:10:16 like Cloudflare, eBay, Lyft, Vimeo, Microsoft, etc. In the case of Cloudflare, they're ingesting 11 million rows per second into ClickHouse, so it's designed for these very heavy workloads, and they're using it for analytics. But why do I like ClickHouse? It is free and open source. They've got 36,000 stars on GitHub; that's about half of the Laravel framework, but it's a pretty niche tool, so I think that's very, very respectable. It also has excellent documentation, which we love in the Laravel community, and it's faster and cheaper to

00:10:50 run, according to them, but also in my own tests, than pretty much all of the other databases I've tried, including the ones mentioned earlier; there are some links down there, I don't know if you can see them, that back up those claims. So let's run some OLAP, some analytical, queries, and we're going to run them on ClickHouse. Let's first just do that SELECT of the average view count. I haven't got any indexes set up on this, and if we run this one: 27 milliseconds. As a reminder, this took what, like six seconds, five and a half seconds,

00:11:18 on MySQL; 27 milliseconds on ClickHouse. How about we throw a bunch more aggregations at it? We'll do min, max, and average, we'll get the 95th percentile and the 99th percentile, and we'll get a sum, and we'll run this one: 52 milliseconds to get those results back. We can see our max there, the view count of our highest post, the one about Git; we've got our average; and the P95 and P99 are kind of interesting: we can see that 95% of posts have a view count of less than 3,600, and 99% of posts have a view count of less than

00:11:52 18,000, so the big numbers, these 13 millions, are in the top 1%, and probably the top 0.00001%. Oh, I also wanted to show you the sum of the view counts: 76 billion. That's how many times posts have been viewed on Stack Overflow. How about doing that GROUP BY aggregation? This was another thing MySQL really struggled with. This is a similar query to the one we ran in MySQL, but I've chucked in a few more aggregates here, and we'll run this one: 98 milliseconds to get those results back.
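A sketch of the ClickHouse versions of those aggregations, assuming the same column names; quantile() here is ClickHouse's approximate percentile function, and the exact list of aggregates in the talk may vary:

```sql
-- One pass over the ViewCount column: min, max, avg, p95, p99 and a sum.
SELECT
    min(ViewCount),
    max(ViewCount),
    round(avg(ViewCount)),
    quantile(0.95)(ViewCount) AS p95,
    quantile(0.99)(ViewCount) AS p99,
    sum(ViewCount)
FROM posts;

-- The per-year breakdown that took MySQL ~13 seconds.
SELECT
    toYear(CreationDate) AS year,
    count() AS posts,
    max(ViewCount) AS max_views,
    round(avg(ViewCount)) AS avg_views
FROM posts
GROUP BY year
ORDER BY year;
```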

00:12:27 There are a few interesting things in here, so we can scroll down (where's my cursor?). We can see the count of posts for the first year; that number doesn't really count because we don't have data from the start of that year. But we can see that over time the counts have gone up and up and up, but from about 2018 onwards they've started to go down, and dropped pretty dramatically. 2024 doesn't really count because we don't have a full year, but it is kind of interesting that they're getting fewer posts over time, and I

00:12:58 wonder whether that's something to do with AI now answering a lot more questions; I'm not sure, we can draw our own conclusions from these results. How about finding out how many unique users there are in this data set? We're going to do a COUNT DISTINCT; again, this is a really tricky one for MySQL when you've got a lot of data. If we run this one: 260 milliseconds, and it tells us that there are 6.2 million unique users in this data set.
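A minimal sketch of that distinct count, assuming the dump's OwnerUserId column; ClickHouse also offers uniq() for a faster approximate count:

```sql
-- Exact distinct count of post authors.
SELECT uniqExact(OwnerUserId) AS unique_users
FROM posts;

-- Approximate, but typically faster and close enough for dashboards.
SELECT uniq(OwnerUserId) FROM posts;
```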

00:13:28 An interesting thing with ClickHouse, by the way, is that it's telling us how many rows it read. Every time you run a query with ClickHouse, it gives you extra metadata: it tells us how long the query took, how many rows it read, and how much data it had to read in to do that. So we can see that it's scanning all 60 million rows to do these calculations. If I run this one again, it might get faster; yeah, as it builds up a bit of a cache, it can get faster. How about finding the top five users by how many answers they've posted? We're going to filter to just answers, and we're going to group by the user Id

00:14:00 and get the count, and we're going to just get the top five. This one runs in 185 milliseconds, but what's really impressive here is that the database has to calculate the counts for all 6.2 million users to figure out just the top five, and it's doing that in an incredibly short amount of time. We can see we're reading fewer rows now because we've added a filter; that's one secret to unlocking query performance with ClickHouse: filtering when you can.
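A sketch of the top-answerers query, assuming PostTypeId = 2 marks answers, as in the Stack Overflow dump:

```sql
-- Counts answers per user across ~6.2 million users, keeps only the top five.
SELECT OwnerUserId, count() AS answers
FROM posts
WHERE PostTypeId = 2     -- answers only; filtering cuts the rows scanned
GROUP BY OwnerUserId
ORDER BY answers DESC
LIMIT 5;
```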

00:14:30 How about the most viewed posts that mention Laravel? This time we're going to do a case-insensitive text search on the title column and find the top five posts that mention the string 'laravel' anywhere in the title. 217 milliseconds, again super fast; there are no indexes on that column, but it's able to scan through them and find those. There's nothing particularly exciting in the titles here: 'How can I remove a package from Laravel using Composer?', standard sort of stuff. One thing we can do to improve the performance here: we're doing a case-insensitive search, and if we

00:14:58 change this to just LIKE, so we're doing a case-sensitive search, this query should run a fair bit faster. I can't remember what it ran at previously, but generally this will run faster, because databases and computers see capital 'L' and lowercase 'l' as two different things. So when you do a case-sensitive search, it just looks for the exact string, whereas with a case-insensitive search it has to look for more variation. That can be something you do to improve performance, if case doesn't matter.
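One way to express the two searches in ClickHouse, assuming ILIKE was the case-insensitive variant used; the talk may have used a different function for the case-insensitive match:

```sql
-- Case-insensitive: matches 'Laravel', 'laravel', 'LARAVEL', ...
SELECT Id, Title, ViewCount
FROM posts
WHERE Title ILIKE '%laravel%'
ORDER BY ViewCount DESC
LIMIT 5;

-- Case-sensitive: exact byte match only, usually a bit faster.
SELECT Id, Title, ViewCount
FROM posts
WHERE Title LIKE '%laravel%'
ORDER BY ViewCount DESC
LIMIT 5;
```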

00:15:25 How about the count of how many questions reference Laravel? We're going to just count the number of posts where the title includes the string 'laravel', and we get 173,000 posts, just questions, in 169 milliseconds, very fast. How about if we do the same for Rails? For Rails we get 21,000, so there are a few more questions about Rails, but it has been around longer. I also like to think that maybe Laravel's got better documentation, so people don't need to

00:16:01 ask as many questions, but I'll leave that up to the audience to decide how they want to interpret the results. How about looking up by Id? This is back to more of a transactional-style query; this is something that MySQL was really good at, sometimes under a millisecond. I'm just going to select all of the columns for that one Id, and this one took 49 milliseconds. So this is still really fast, but it is a lot slower than MySQL at just looking up an individual row. There are some things we can do to

00:16:31 improve the performance here, although it is still pretty fast; this is probably the fastest I've actually seen this, so I'll be interested to see if these things improve it. First of all, we're selecting all of the columns, and we'll see why that could be a problem later on. So let's just select the title and the view count. We didn't shave off too much time this time, but often this will save a lot of time. We can also see that we're reading 59 million rows; we're reading the entire database. Something

00:16:57 that's interesting about ClickHouse is that the primary key is not unique, because when you're inserting data in bulk, ClickHouse doesn't have time to figure out 'can I insert this? have I seen this Id before?'; it just wants to dump data straight in there. So it doesn't have a concept of unique constraints, it doesn't know that there's only one post with this Id, and it has to scan everything to find all of the posts that have this Id. So we can put a LIMIT 1 on here, and now we're down to 8.9 milliseconds, and we only

00:17:26 read 4.6 million rows this time. The amount we read is going to depend on where that post lives, but we can add extra filters to help it along: if we know the date that it was posted, we could add that as a filter, and that would help things along. But 8.9 milliseconds is still very fast, just nowhere near as fast as MySQL.
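A sketch of the point lookup and the tweaks described above (fewer columns, LIMIT 1); the Id value here is just a placeholder:

```sql
-- Full-width lookup: reads every column file and scans for the Id.
SELECT * FROM posts WHERE Id = 927358;

-- Narrower and short-circuited: only two column files, and the scan can stop
-- as soon as one match is found, since ClickHouse has no unique constraint
-- telling it there is exactly one row.
SELECT Title, ViewCount
FROM posts
WHERE Id = 927358
LIMIT 1;
```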

00:18:01 So how is ClickHouse so fast at running these queries, and OLAP databases in general? There are all sorts of different things; each database will have its own secret sauce, but the main thing I think they all have in common is how they store the data on disk: whether it's row-oriented or column-oriented. Databases like MySQL store data row-oriented, which means they store all the data for each row together, whereas in a column-oriented database the columns are stored together. Now, to help illustrate this, I've made this kind of pseudo MySQL-style database file. MySQL doesn't really store things in CSV, but it's just to illustrate things. So imagine this is the MySQL database file: we can

00:18:33 see we've got all of the columns for row one (column one, column two, column three), then all of the columns for row two, and so on. So when we want to look up data for a specific row, we can jump straight to it; all that data is together, and we can skip over all the irrelevant stuff. That's what the index helps us do: it helps us find where in this file the rows I care about are. That's how it's so fast at doing that. But when we want to do these aggregation queries with MySQL, we've got to

00:19:05 skip over all these irrelevant bits of data to grab column two, then skip over some more irrelevant stuff, grab column two from that row, and so on and so forth. So we can understand why this might be a fair bit slower with MySQL. If we compare this with a column-oriented database, every column is stored in a separate file, at least that's the case with ClickHouse. So we've got column one, column two, column three in entirely separate files, and each of these files

00:19:31 contains just that column's data for row one, row two, row three. So when we want to aggregate on a specific column, we don't even have to skip over the other data, we don't even read those files, we're just reading this one file. But conversely, when we want to access a single row out of ClickHouse, we've now got to go and read three different files and pull out just the row-two data from each column. Now, ClickHouse was still a lot faster comparatively, and that's because when we asked MySQL to do its

00:20:00 inefficient thing, we were asking it to do 60 million rows, whereas when we asked ClickHouse to do its inefficient thing, we were only asking for one row. So it has to go to all the different column files, but we're only asking for a little bit of data out of each, so it's still going to be pretty fast at that. Now, another benefit of column-store, or column-oriented, databases is compression. If I run this query, we can see that in ClickHouse this database is 69.2 GB uncompressed, but the data is stored

00:20:30 compressed, which is only 34.8 GB. When you store data in columns, you've got data that's very similar all living side by side, and that really helps compression algorithms compress all of that down. Now, the compression ratio here is pretty good. This data set, though, has a lot of really big strings, which are sometimes harder to compress, so I've generally seen better compression rates with other types of data sets I've looked at, but this is still very, very respectable.
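One common way to get those compressed and uncompressed figures is from ClickHouse's system.parts table; a sketch, assuming the table is called posts (the talk may have used a different query):

```sql
-- Compressed vs. uncompressed size of the active parts of the posts table.
SELECT
    formatReadableSize(sum(data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed
FROM system.parts
WHERE active AND table = 'posts';
```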

00:20:58 And it's really good when you're using the cloud version of ClickHouse, because you only pay for the compressed amount of data that you're storing. So what's the catch? Why wouldn't we want to use a database like ClickHouse for everything? There are a few things. The first one is around ordering. When we define our table structure, we have to tell it how we want the data to be sorted in these column files, and you can only store the data in one order, so we have to optimize for our main query patterns. In this

00:21:30 particular data set we've ordered by the post type Id and then the creation date. So if we're filtering on the post type Id, that's really good, because we can basically jump to just the part of the file that contains all of the questions, and then if we were also filtering by the creation date, we could narrow down even faster. But if we were using other types of filters, it wouldn't be able to leverage that as easily, and we'd have to scan a lot more.
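A trimmed-down sketch of how that sort order would be declared; the real table has many more columns, and the exact types here are assumptions:

```sql
CREATE TABLE posts
(
    Id           UInt32,
    PostTypeId   UInt8,
    CreationDate DateTime,
    ViewCount    UInt32,
    Title        String,
    Body         String
)
ENGINE = MergeTree
-- Rows are physically sorted by these columns, so filters on PostTypeId
-- (and then CreationDate) can skip most of the data.
ORDER BY (PostTypeId, CreationDate);
```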

00:21:58 Now, in ClickHouse you can define an order key and a primary key, but you only have to define one. We don't have time to get into all the specifics of how indexes work in ClickHouse, but this should hopefully give you a little bit of a taste of what is possible. Another limitation, I guess, with analytical databases is that they're generally not really good at inserting rows one at a time; they prefer bulk inserts. That's because every time you do an insert into ClickHouse, it creates a brand new folder on the hard drive with column

00:22:29 files for each of the columns you've inserted. So if you have 10,000 rows to insert and you insert them one query at a time, you're going to end up with 10,000 of what they call parts, folders on the hard drive with those column files, whereas if you inserted them in a single insert, it's going to create one part. However, ClickHouse merges these parts together on a schedule, and that's where the main table engine in ClickHouse gets its name: it's called the MergeTree engine. So even though you might insert those 10,000 rows one at a time,

00:22:56 they will eventually get compacted into a single part, and then the compression gets better. So when you're inserting, you generally want to try and insert in bulk if you can. Now, they do have a little feature that can help you with this: the async insert feature. With this feature you can insert data and it will buffer it until the data has reached a certain size or a certain amount of time has passed, so they've got some little helpers here to work around some of these limitations.
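A sketch of that async-insert escape hatch: per-query settings that tell the server to buffer small inserts and flush them in batches. The table and column names here are placeholders:

```sql
-- Each statement returns once the row is buffered server-side; ClickHouse
-- flushes the buffer when it is big enough or old enough, forming one part.
INSERT INTO events (user_id, event, created_at)
SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES (1, 'page_view', now());
```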

00:23:23 Now, updates with databases like ClickHouse are, I wouldn't say frowned upon, but they're not ideal. When you run an update, it has to rewrite those part files effectively from scratch, and so what they've done to signal that to the developer is that the update query is actually an ALTER TABLE query. If you run a regular UPDATE it's not going to work; it's an ALTER TABLE statement, kind of indicating that you shouldn't do this unless you need to, and it shouldn't be part of your regular application.
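What that looks like in practice: a mutation rather than a regular UPDATE (the WHERE condition here is just an example):

```sql
-- Rewrites the affected parts in the background; deliberately awkward syntax.
ALTER TABLE posts UPDATE Title = 'Edited title' WHERE Id = 927358;

-- The familiar form is, per the talk, not the normal path in ClickHouse:
-- UPDATE posts SET Title = 'Edited title' WHERE Id = 927358;
```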

00:23:53 Now, deletes are a little bit different from updates. When you're storing, let's say, analytical data, oftentimes you only want to keep it for a certain amount of time, so they have optimized for deleting data, especially based on time. There are two ways you can delete data via queries. You can do a lightweight delete using a standard DELETE query; it just marks the rows as having been deleted, but it doesn't actually delete

00:24:19 them from disk; it does that during the merge process. Now, by default it will only remove the data once it can delete an entire part in one go; it doesn't want to delete portions of a part, so you can define your own partitioning strategy based on your needs here. There's also a heavyweight delete, which is again signified with an ALTER TABLE command, where it will go and delete the data right then and there. You might need this if you've got regulatory requirements or you've got to remove data, but again, it's a pretty heavyweight operation.
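The two flavours of delete described here, sketched against an assumed posts table:

```sql
-- Lightweight delete: rows are masked immediately but physically removed
-- later, during merges.
DELETE FROM posts WHERE CreationDate < '2010-01-01';

-- Heavyweight delete: a mutation that rewrites the affected parts right away.
ALTER TABLE posts DELETE WHERE CreationDate < '2010-01-01';
```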

00:24:47 Another cool feature is that deletes can be automated. I mentioned that maybe we only want to keep data for a certain amount of time, so we can define a TTL, or time to live, in our table structure. In this case I've said to delete the data after 31 days, and to help with being able to delete partitions one at a time, I've specifically told it here to partition by the date.
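A sketch of a table definition along those lines: daily partitions plus a 31-day TTL. The table and column names are placeholders, not the ones from the talk's slide:

```sql
CREATE TABLE page_views
(
    url       String,
    viewed_at DateTime
)
ENGINE = MergeTree
PARTITION BY toDate(viewed_at)      -- one partition per day
ORDER BY (viewed_at)
TTL viewed_at + INTERVAL 31 DAY;    -- rows expire a month after insertion
```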

00:25:17 So basically, every day we'll have a partition. When the data is inserted there will be lots of small parts, but when ClickHouse merges them, they'll end up as no more than one day's worth of data per partition. So what else can ClickHouse do? It has a ton of other cool features beyond just regular SQL queries. One of the ones I want to talk to you about is the ReplacingMergeTree. This builds on the regular MergeTree engine, and it's kind of their answer to running update queries in a more efficient way. In this case we're just going to create a table called my

00:25:45 replacing table, with a key column and a value column. One cool thing here is we've defined the key as a LowCardinality string. This is an optimization ClickHouse offers that in many cases is better than an enum, but it works similarly: it will create a dictionary of all the unique values and then just store the IDs in this table, which again can help when you've got a lot of data. They recommend that if the cardinality, the number of unique items you have, is less than 10,000, then this is a performance increase, but over

00:26:14 that, maybe not so much. So let's go ahead and create this one; it's ordered by the key. The thing to know with this table is that for rows with the same key, only the newest row should survive, the latest one with that key. So we'll create this table, and then we're going to insert some data into it. We're going to insert a key of 'Laracon' and a value of 'initial value', and then we'll insert the same key, 'Laracon', but this time with the value 'updated value', and we've got those two. So now let's just

00:26:45 select star from that table, and we'll see we get two results back, which is different from what I was saying, right? We expected to only have one. Now, this is again to do with the way that ClickHouse inserts data: both rows were inserted separately, so they both exist on disk until the merge process happens, and then it will collapse those into a single row. So when we're querying this, we have to plan for the fact that it might not have been merged yet. Now, they have a little helper here: if we

00:27:14 add the keyword FINAL on the end, it will go ahead and resolve that for us and only give us the latest value. But we can see that it still read two rows, because until the merge happens, those two rows still exist. So it's the kind of thing that will get faster over time as the merge process happens.
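A sketch of the ReplacingMergeTree demo as described, with the key spelled 'Laracon' (an assumption about what was typed on stage):

```sql
CREATE TABLE my_replacing_table
(
    key   LowCardinality(String),
    value String
)
ENGINE = ReplacingMergeTree
ORDER BY key;

INSERT INTO my_replacing_table VALUES ('Laracon', 'initial value');
INSERT INTO my_replacing_table VALUES ('Laracon', 'updated value');

-- Both rows are still on disk until a background merge collapses them.
SELECT * FROM my_replacing_table;

-- FINAL resolves the duplicates at query time, but still reads both rows.
SELECT * FROM my_replacing_table FINAL;
```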

00:27:43 Another cool feature is the AggregatingMergeTree, which we often use in combination with materialized views. We're going to create a table called temperatures, storing the location (again using a LowCardinality string), the timestamp, and then the temperature that we're reading. This one is just a MergeTree table; this is our source data, our raw data. So we'll go ahead and create this table, and then we'll create another table that's going to have our pre-aggregated data: our daily temperatures. Again the location, this time just the date, and we're going to store the max and the average temperature for each day, and

00:28:07 we've got our ORDER BY on the location and the date, so that's defining what should be unique in this data set. One of the reasons you'd want to do this is that ClickHouse is really fast, and with 60 million rows this is a pretty small data set by ClickHouse standards, but you do get to a point where querying the raw data can still eventually get slow, so these pre-aggregation tables can help you when you get to even bigger amounts of data. So we go ahead and create this table, and we now have two

00:28:34 tables. The last thing we need to add is a materialized view. Now, a materialized view in ClickHouse works pretty differently from most databases; in ClickHouse it's effectively just an insert trigger. So when data is inserted into our temperatures table, it's going to run this query to insert data into our daily temperatures table. There's a little bit of extra work we need to do here to tell it how we want to aggregate, all that sort of stuff, but don't worry too much about the specifics. So we'll go ahead and create that materialized view.
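A sketch of the three pieces just described: the raw table, the pre-aggregated table, and the materialized view that acts as an insert trigger. The column types, and the SimpleAggregateFunction/AggregateFunction split, are assumptions about how the talk's schema might look:

```sql
-- Raw readings.
CREATE TABLE temperatures
(
    location    LowCardinality(String),
    timestamp   DateTime,
    temperature Float32
)
ENGINE = MergeTree
ORDER BY (location, timestamp);

-- One (eventual) row per location per day.
CREATE TABLE daily_temperatures
(
    location        LowCardinality(String),
    date            Date,
    max_temperature SimpleAggregateFunction(max, Float32),
    avg_temperature AggregateFunction(avg, Float32)
)
ENGINE = AggregatingMergeTree
ORDER BY (location, date);

-- Fires on every insert into temperatures and writes partial aggregates.
CREATE MATERIALIZED VIEW temperatures_mv TO daily_temperatures AS
SELECT
    location,
    toDate(timestamp)     AS date,
    max(temperature)      AS max_temperature,
    avgState(temperature) AS avg_temperature
FROM temperatures
GROUP BY location, date;
```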

00:29:00 Now we're going to insert data into our temperatures table, the raw data table. We're going to insert the location of Dallas, the time is now, and a temperature of 100°; I'm not too familiar with Fahrenheit, but I think that's a pretty hot number. Let's also now go and insert another temperature, so now the temperature's gone up to 110°. I think that's actually a pretty significant jump, but again, I'm not that familiar with Fahrenheit. So we'll go and insert that one, and now

00:29:32 let's select star from our temperatures table. This is again our raw data table, and we can see we've got the two rows we expect to have; those ones aren't going to get merged together. But now let's query our daily temperatures table, our aggregates table. We have to write a query that is effectively a pretty standard aggregation query; there are a few little interesting things like avgMerge, but again, don't worry too much about that. If we run this query, we'll see we get the results we expect: for today

00:29:58 we've got a max of 110 and an average of 105. But we can see that we still read two rows, so that tells us there are still two rows in this pre-aggregated table, and that's again because the data hasn't been merged yet; it hasn't gone ahead and collapsed those into one row yet. Now, we can tell it to perform an unscheduled merge: we can do OPTIMIZE TABLE with the FINAL keyword on the end. Again, this is not something you want to do out of habit, but for demonstration purposes we can tell it to go and do that.
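A sketch of how the pre-aggregated table gets queried and force-merged, continuing the assumed schema above:

```sql
-- max() re-aggregates the simple state; avgMerge() combines the avg states.
SELECT
    location,
    date,
    max(max_temperature)      AS max_temp,
    avgMerge(avg_temperature) AS avg_temp
FROM daily_temperatures
GROUP BY location, date;

-- Force an unscheduled merge (demo only; normally left to the background).
OPTIMIZE TABLE daily_temperatures FINAL;
```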

00:30:24 So if we run the same query again, we get the same results, but now we've only read one row from disk. As new data comes in throughout the day, that table's probably going to be pretty big towards the current day, but then as the merge happens it's going to get smaller and smaller until there's only one entry per location per day. Another cool feature of ClickHouse is that it can fill gaps in time-series data. Let's say we want to plot these temperatures on a graph; we want to plot

00:30:57 the temperatures for the current week. So I'm going to run this query, and it's going to give us just one result, the max and the average for today, but we want to have results for the whole week. Normally, if you knew you had missing data, you'd have to go and pad in all the missing days, maybe in PHP, so that you can pass it to your charting library. But with ClickHouse we can say WITH FILL and define that we want it from the start of the week until the end of the week with

00:31:23 an interval of one day. If we run this query, we now get results for the full seven days, with zeros for all the days where we don't have data, and then our results down there. You can also tell it how you want to interpolate missing values: by default it'll just give you zeros, but you can use the previous day's value and add things to that, all sorts of interesting things.
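A sketch of the gap-filling query, again on the assumed temperatures schema; the exact FROM/TO bounds used in the talk may differ:

```sql
-- One row per day for the current week, with zeros where no readings exist.
SELECT
    toDate(timestamp) AS date,
    max(temperature)  AS max_temp,
    avg(temperature)  AS avg_temp
FROM temperatures
WHERE timestamp >= toStartOfWeek(now())
GROUP BY date
ORDER BY date ASC
WITH FILL
    FROM toStartOfWeek(today())
    TO toStartOfWeek(today()) + INTERVAL 7 DAY
    STEP INTERVAL 1 DAY;
```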

00:31:54 So, a few performance considerations, and these apply to OLTP (transactional) databases and analytical databases alike. The first thing is that everything is fast with limited data and a fast laptop. If you've got a lot of data in production, you need to test locally with a lot of data, so always test with a production-like data set and a production-like environment. If your laptop's really fast but your database in production is a t2.micro or something like that, you're probably going to want to test with that if performance is a concern and you have a lot of data, regardless of which database you're using. The other thing with performance

00:32:21 is that the best way to get the most out of any tool is to understand its internals. ClickHouse has some interesting ways that it does stuff, but in order to fully leverage them and not walk into any foot-guns, you need to understand how it works, how its indexing strategy works, and how it stores the data. Their documentation goes into a lot of depth about how the data is stored, how the indexes work, all that sort of stuff. It's the kind of thing that if you don't have a lot of data, it

00:32:47 doesn't really matter how fast it is, but when you start to get a lot of data, millions of rows, billions of rows, then you really need to figure these things out, because things that are fast with a small number of rows, once multiplied across a lot of rows, add up to a lot. Now, the story for working with ClickHouse in Laravel is pretty good. There are a couple of interesting packages: there's a PHP-specific package, there's a query builder package, and then there's

00:33:13 a PHP ClickHouse Laravel package that wraps those first two packages and lets us use ClickHouse with Laravel's query builder, migrations, and Eloquent models, all that sort of stuff. So, in summary: OLTP databases are really good for transactional workloads, where you've got row-level operations, where you've got mutable data, data that needs to change, and when you need to enforce constraints like uniqueness: unique keys, unique email addresses, those sorts of things. They're also great

00:33:45 at light aggregation, so don't necessarily go and reach for an analytical database if MySQL is fast enough for you. OLAP databases, on the other hand, are really good for large data sets, for things like real-time analytics, business intelligence, logs, events and traces, and even machine learning and generative AI; ClickHouse has some specific engines for doing vector searches. I haven't played with any of that stuff, but apparently it's pretty good. The key thing here, though, is that

00:34:12 it's not really a case of OLTP versus OLAP. OLAP databases are not a silver bullet that's just going to replace MySQL or Postgres for you; they're a tool you bring in for specific scenarios. I think every application is still going to need an OLTP database; you just bring in an OLAP database for certain things. Now, you might be wondering why I am so interested in analytical databases and ClickHouse. Well, I will tell you all about that at Laracon Australia in November. Thank you very much. [Host] Nicely done. Am I to

00:34:57 understand that in those slides you were actually running those queries live? [Jess] Live queries, two databases, both running on my laptop, same hardware, and they ran live on slide click. [Host] And how did you do that? [Jess] I used a tool called Slidev. It's basically a Markdown slide generator that uses Vite to do all the rendering, and they did most of the hard work, but I do have a whole lot of hacks in my node_modules directory, so if I ran npm update, my whole talk would be ruined. [Host] Glad you did not do that before we came

00:35:25 on. Last question: the last time you and I talked, you were deep in the internals of Redis, and now you're deep in the internals of ClickHouse. Is this a part of the stack that you prefer, or are you just the type of person that likes to go all the way to the core of it? [Jess] I like to go anywhere. Generally I'm trying to solve a problem, and these are ways to solve the problem, so they're kind of means to an end, but they're very fascinating in their own right, and I like to go

00:35:54 deep and learn everything I can about them, so that I can get the best out of them, especially when I'm dealing with things like a lot of data that needs to be fast.


Highlights

🔍 Analytical Databases: Focus on aggregating large datasets for reporting and analytics.
🚀 Performance Comparison: Analytical databases like ClickHouse significantly outperform MySQL in query speed.
📊 Dataset Example: Using StackOverflow data to illustrate differences in query efficiency.
🛠️ OLTP vs. OLAP: OLTP databases are optimized for transaction processing; OLAP databases are tailored for analytics.
💾 Data Storage: ClickHouse uses column-oriented storage, enhancing aggregation performance and compression.
📈 Real-Time Analytics: Analytical databases excel in real-time data analysis, crucial for business intelligence.
🌐 Use Cases: Each database type serves unique purposes; understanding their strengths is essential for application design.

Key Insights

📈 Understanding Database Types: Knowing when to use OLTP versus OLAP is crucial for optimal performance in data-heavy applications. OLTP excels in transactional tasks, while OLAP shines in analytics.
⚡ Impact of Query Design: The design of your queries and data structure can drastically affect performance. Analytical databases benefit from well-structured data for faster aggregations.
📊 ClickHouse Advantages: ClickHouse offers significant speed improvements for analytical queries, making it a preferable choice for large datasets and real-time analytics.
💡 Importance of Live Testing: Running live queries on actual datasets helps in understanding performance differences in real-world scenarios, aiding in better database selection.
🗄️ Data Compression Benefits: Column-oriented databases like ClickHouse achieve better compression rates, reducing storage costs and improving data retrieval speeds.
🔄 Bulk Inserts vs. Single Inserts: Analytical databases perform better with bulk data inserts, highlighting the need for efficient data handling strategies during large data operations.
🚀 Scalability Challenges: While analytical databases offer speed, they require careful management of data organization and query types to maintain performance as data scales.