DATABASE ARCHITECTS
The model allows implementing concurrent code such as low-level synchronization primitives or lock-free data structures in a portable fashion. To use the memory model, programmers need to do two things: First, they have to use the std::atomic type for concurrently-accessed memory locations. Second, each atomic operation requires a memory order …
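As a hedged illustration of the two requirements the excerpt names (std::atomic for shared locations, plus an explicit memory order per operation), here is a minimal sketch of a spinlock built on std::atomic_flag; the specific memory orders are a common pattern, not something the excerpt prescribes.

```cpp
#include <atomic>

// Minimal spinlock sketch: the concurrently-accessed state lives in an
// atomic type, and every operation names an explicit memory order.
class SpinLock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // acquire: reads/writes after the lock may not move before it
        while (locked.test_and_set(std::memory_order_acquire)) {
            // spin; a real implementation would likely back off here
        }
    }
    void unlock() {
        // release: reads/writes before the unlock may not move after it
        locked.clear(std::memory_order_release);
    }
};
```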
DATABASE ARCHITECTS: ALL HASH TABLE SIZES YOU WILL EVER NEED When picking a hash table size we usually have two choices: Either we pick a prime number or a power of 2. Powers of 2 are easy to use, as a modulo by a power of 2 is just a bit-wise AND, but 1) they waste quite a bit of space, as we have to round up to the next power of 2, and 2) they require "good" hash functions, where looking at just a subset of bits is ok.
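A small sketch of the trade-off this excerpt describes, illustrative only: a power-of-2 table reduces a hash value to a slot with a single bit-wise AND (but uses only the low bits), while a prime-sized table needs a generic modulo (but mixes in all bits).

```cpp
#include <cstdint>

// Power-of-2 table: slot index is a single AND, but only the low bits
// of the hash are used, so the hash function must be good.
uint64_t slot_pow2(uint64_t hash, uint64_t table_size /* power of 2 */) {
    return hash & (table_size - 1);
}

// Prime-sized table: a generic modulo looks at all bits of the hash,
// at the cost of an integer division.
uint64_t slot_prime(uint64_t hash, uint64_t table_size /* prime */) {
    return hash % table_size;
}
```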
DATABASE ARCHITECTS: 2019 However, we can nevertheless alternate between positions because the following holds: i1 = i2 xor hash(signature(x)), and we have the signature stored in the table. Thus, we can just use the self-inverse xor hash(signature(x)) to switch between i1 and i2, regardless of whether we are currently at i1 or i2. Which is a neat little trick (sketched below).

DATABASE ARCHITECTS: C++ CONCURRENCY MODEL ON X86 FOR DUMMIES Summary. The full C++ memory model is notoriously hard to understand. x86, on the other hand, has a fairly strong memory model (x86-TSO) that is quite intuitive: basically everything is in-order, except for writes, which are delayed by the write buffer. Exploiting x86's memory model, I presented a simplified subset of the C++ memory model that …

DATABASE ARCHITECTS: WHY USE LEARNING WHEN YOU CAN FIT? 1) we sort all data and keep it in an array, just like with learned indexes. 2) we build the CDF. 3) we fit a linear spline to the CDF minimizing the Chebyshev norm. 4) we fit a polynomial function to the spline nodes. 5) now we can look up a value by evaluating first the polynomial function, then the spline, and then retrieving the values from …
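The alternate-position trick from the 2019 excerpt, sketched under the assumption of a power-of-2 table (the helper names are mine, not the post's): because i1 = i2 xor hash(signature(x)), xor-ing the current slot with the hashed signature yields the other slot, from either side, and masking distributes over xor, so the mapping stays self-inverse.

```cpp
#include <cstdint>

// Hypothetical helpers: a small per-entry signature and a hash over it.
uint8_t  signature(uint64_t key);
uint64_t hash_sig(uint8_t sig);

// Given either candidate slot of an entry and its stored signature,
// return the other candidate slot. Since xor with a fixed value is
// self-inverse, the same function maps i1 -> i2 and i2 -> i1.
uint64_t other_slot(uint64_t slot, uint8_t sig, uint64_t table_mask) {
    return (slot ^ hash_sig(sig)) & table_mask;
}
```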
DATABASE ARCHITECTS: TRYING OUT HYPER At TUM we have built a very fast main-memory database system named HyPer. It offers fairly complete SQL92 support plus some SQL99 features, and is much faster than "traditional" database systems. The easiest way to play with it is the online demo. It provides you with an easy-to-use interface for entering queries, running them, and inspecting the execution plan.

DATABASE ARCHITECTS: 2017 Within these constraints the database is free to choose between execution alternatives. This has some interesting consequences: Consider for example the query select x=a and x=b. Clearly, it is equivalent to the following query select x=a and a=b. After all, x=a, and thus we can substitute a with x in the second term.

DATABASE ARCHITECTS: THE CASE FOR B-TREE INDEX STRUCTURES Recently a very interesting paper made a Case for Learned Index Structures. It argued that we could, and perhaps should, replace traditional index structures with machine learning, using the following reasoning: If we consider the leaf pages of an index as a sorted array, the inner pages of the index point towards a (bucketized) position within that array.

DATABASE ARCHITECTS: THE PRICE OF CORRECTNESS I know, VectorWise for example keeps track of the domains of individual attributes and avoids overflow checks if the result cannot overflow (a sketch of such a check follows below). But there are two problems with that approach: 1) maintaining the domain is problematic in the presence of high update rates (admittedly not a use case for VectorWise), and 2) this can lead you to use larger data types than necessary.

DATABASE ARCHITECTS: 2016 The collate statement can either be given explicitly at any place in the query, as shown above, or added in a create table statement, giving a default collation for a column. So basically every value that is ordered or compared within a SQL statement can have an associated collation. Initially I naively thought that this would just affect the …
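The PRICE OF CORRECTNESS excerpt is about overflow checking in query arithmetic. As a hedged sketch (not VectorWise's actual code), this is what a checked addition looks like with the overflow builtins gcc and clang provide:

```cpp
#include <cstdint>
#include <stdexcept>

// Checked addition: __builtin_add_overflow returns true on overflow.
// This is the per-operation cost an engine pays for correctness unless
// it can prove from the value domains that overflow is impossible.
int64_t checked_add(int64_t a, int64_t b) {
    int64_t result;
    if (__builtin_add_overflow(a, b, &result))
        throw std::overflow_error("integer overflow in addition");
    return result;
}
```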
DATABASE ARCHITECTS: VORTEX: VECTORWISE GOES HADOOP Like Thomas in his first blog, where he announced his super-cool research system HyPer being available for download, I will also start my first blog post with a systems announcement. For this one, there is no download just yet, but by the end of next month, Actian will have available a new product that allows using the Actian Vector system in MPP mode on Hadoop clusters.

DATABASE ARCHITECTS: JUNE 2015 After investigating the plans with explain we saw that apparently MemSQL always uses either index-nested-loop joins (INL) or nested-loop joins (NL), which is very expensive in large, TPC-H-style join queries. The INL is ok, although still somewhat expensive, as seen in Q4, where only INL is used, but if the system is forced to use NL, performance is very poor.

DATABASE ARCHITECTS: MAY 2014 The easiest way to play with it is the online demo. It provides you with an easy-to-use interface for entering queries, running them, and inspecting the execution plan. All queries are evaluated against a SF1 TPC-H database which contains roughly 1 GB of data. [Figure: HyPer web interface]
DATABASE ARCHITECTS: MAIN-MEMORY VS. DISK BASED We can illustrate that nicely in the HyPer system, which is a pure main-memory database system. Due to the way it was developed it is nearly 100% source compatible with an older project that implemented a System R-style disk-based OLAP engine aiming at very large (and thus cold-cache) OLAP workloads. The disk-based engine includes a regular buffer manager, locking/latching, a column store …
DATABASE ARCHITECTS: TRYING TO SPEED UP BINARY SEARCH It is well known that binary search is not particularly fast. For point queries hash tables are much faster, ideally accessing in O(1), and even when we need range queries, n-ary search structures like B-Trees are much faster than binary search or binary search trees. Still, there is a certain charm to binary search (see the sketch below).

DATABASE ARCHITECTS: 2014 A blog by and for database architects. Furthermore this behavior interacts badly with the review process. Of course the reviewers know that they are shown only the good cases, therefore these cases have to be really good.
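As a sketch of the kind of micro-optimization that post explores (my illustration, not necessarily the post's code): a branch-free binary search replaces the hard-to-predict comparison branch with a conditional move, which often helps on sorted arrays.

```cpp
#include <cstddef>

// Branch-free lower_bound sketch: the loop halves the search window and
// advances the base pointer with a conditional expression, which the
// compiler can typically compile to a cmov instead of a branch.
const int* lower_bound_branchfree(const int* base, size_t n, int key) {
    while (n > 1) {
        size_t half = n / 2;
        base = (base[half - 1] < key) ? base + half : base;
        n -= half;
    }
    // n is 1 here (or 0 for an empty input); resolve the last element.
    return (n == 1 && *base < key) ? base + 1 : base;
}
```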
DATABASE ARCHITECTS: APRIL 2020 [Table residue: measured compile times of 0.04 s, 0.19 s, 34.99 s, and over one hour.] The compile time is dramatically super-linear; gcc is basically unable to compile the function if it contains 10,000 ifs or more. In this simple example clang fares better when using -O0, but with -O1 it shows super-linear compile times, too. This is disastrous when processing generated code, where we cannot easily … (a reproduction sketch follows below).

DATABASE ARCHITECTS: FUN WITH CHAR The CHAR(n) data type is one of the more bizarre features of SQL. It is supposed to represent a fixed-length string, but as we will see in a moment, it behaves oddly in all kinds of ways. IMHO it should never be used for anything. There might be use cases for CHAR(1), i.e., for storing a single character, but as a string data type its semantics …
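A minimal way to reproduce the compile-time observation (my sketch with a hypothetical generator; the post's own benchmark code is not shown): emit a function consisting of N sequential ifs and time the compiler on the output for growing N.

```cpp
#include <cstdio>

// Hypothetical generator: writes a C++ function containing n sequential
// ifs to stdout. Feeding the output to `g++ -O1 -c` for growing n is one
// way to observe super-linear compile times like those described above.
int main() {
    const int n = 10000;  // number of ifs to generate
    std::puts("int f(int x) {");
    std::puts("    int r = 0;");
    for (int i = 0; i < n; ++i)
        std::printf("    if (x == %d) r = %d;\n", i, i + 1);
    std::puts("    return r;");
    std::puts("}");
}
```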
DATABASE ARCHITECTS: COMPARING JOIN IMPLEMENTATIONS We see that, overall, NOP, i.e., a non-partitioning join, is the fastest for this query. Still, the authors argue that NOP is the slowest of all four alternatives, because its colored bar is the highest. I tend to believe that this is an artifact of the measurement of the colored bars: they were computed by running the joins on pre-filtered input, without the subsequent aggregation.
DATABASE ARCHITECTS: JULY 2015 As a prerequisite we need the clang compiler in the search path, more precisely a binary called clang-3.5. Then, we can start the server daemon with a debug flag like that: bin/hyperd mydatabase -xcompilestatic. Now every statement that generates code (e.g., queries, or create table statements) writes the code to the local directory and then calls …
DATABASE ARCHITECTS: 2016 Originally HyPer had a very simple model for strings: We made sure that all strings are valid UTF-8, but otherwise did not really care about the intricacies of Unicode. And that is actually quite a sane model for most applications. Usually we do not care about the precise string structure anyway, and in the few places that we do (e.g., strpos and substr), we add some extra logic to handle UTF-8 …
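To make that "extra logic" concrete, here is an illustrative sketch (mine, not HyPer's code) of how strpos/substr-style operations can step over UTF-8 code points instead of bytes: continuation bytes have the bit pattern 10xxxxxx and are simply skipped when counting characters.

```cpp
#include <cstddef>
#include <string_view>

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes the input is valid UTF-8, matching the model described above:
// validity is enforced on input, so no re-validation is needed here.
size_t utf8_length(std::string_view s) {
    size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)  // not a continuation byte
            ++count;
    return count;
}
```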
The model allows implementing concurrent code such as low-level synchronization primitives or lock-free data structures in a portable fashion. To use the memory model, programmers need to do two things: First, they have to use the std::atomic type for concurrently-accessed memory locations. Second, each atomic operation requires a memoryorder
DATABASE ARCHITECTS: ALL HASH TABLE SIZES YOU WILL EVER NEED When picking a hash table size we usually have two choices: Either, we pick a prime number or a power of 2. Powers of 2 are easy to use, as a modulo by a power of 2 is just a bit-wise and, but 1) they waste quite a bit of space, as we have to round up to the next power of 2, and 2) they require "good" hash functions, where looking at just a subset ofbits is ok.
DATABASE ARCHITECTS: C++ CONCURRENCY MODEL ON X86 FOR DUMMIES Summary. The full C++ memory model is notoriously hard to understand. x86, on the other hand, has a fairly strong memory model ( x86-TSO) that is quite intuitive: basically everything is in-order, except for writes, which are delayed by the write buffer. Exploiting x86's memory model, I presented a simplified subset of the C++ memory model that DATABASE ARCHITECTS: MAIN-MEMORY VS. DISK BASED We can illustrate that nicely in the HyPer system, which is a pure main-memory database system. Due to the way it was developed it is nearly 100% source compatible with an older project that implemented a System R-style disk-based OLAP engine aiming at very large (and thus cold cache) OLAP workloads.The disk-based engine includes a regular buffer manager, locking/latching, a column store DATABASE ARCHITECTS: TRYING OUT HYPER At TUM we have built a very fast main-memory database system named HyPer.It offers fairly complete SQL92 support plus some SQL99 features, and is much faster than "traditional" database systems. The easiest way to play with it is the online demo.It provides you with an easy to use interface for entering queries, running them, and inspecting the execution plan. DATABASE ARCHITECTS: WHY USE LEARNING WHEN YOU CAN FIT? 1) we sort all data and keep it in an array, just like with learned indexes. 2) we build the CDF. 3) we fit a linear spline to the CDF minimizing the Chebyshev norm. 4) we fit a polynomial function to the spline nodes. 5) now we can lookup a value by evaluating first the polynomial function, then the spline, and then retrieving the valuesfrom
DATABASE ARCHITECTS: 2017 Within these constraints the database is free to choose between execution alternatives. This has some interesting consequences: Consider for example the query. select x=a and x=b. clearly, it is equivalent to the following query. select x=a and a=b. after all, x=a, and thus we can substitute a with x in the second term. DATABASE ARCHITECTS: THE CASE FOR B-TREE INDEX STRUCTURES Recently a very interesting paper made a Case for Learned Index Structures.It argued that we could, and perhaps should, replace traditional index structures with machine learning, using the following reasoning: If we consider the leaf pages of an index as a sorted array, the inner pages of the index point towards a (bucketized) position within that array. DATABASE ARCHITECTS: THE PRICE OF CORRECTNESS I know, VectorWise for example keeps tracks of the domains of individual attributes and avoids overflow checks if the result cannot overflow. But there are two problems with that approach: 1) maintaining the domain is problematic in the presence of high-update rates (admittedly not a use case for VectorWise), and 2) this can lead you to use larger data types than necessary. DATABASE ARCHITECTS: 2016 The collate statement can either be given explicitly at any place in the query, as shown above, or added in a create table statement, giving a default collation for a column. So basically every value that is ordered or compared within a SQL statement can have an associated collation. Initially I naively though that this would just affect the DATABASE ARCHITECTS: 2020 The compile time is dramatically super linear, gcc is basically unable to compile the function if it contains 10,000 ifs or more. In this simple example clang fares better when using -O0, but with -O1 it shows super-linear compile times, too. DATABASE ARCHITECTS: 2019 However, we can nevertheless alternate between positions because the following holds. i1 = i2 xor hash (signature (x)) and we have the signature stored in the table. Thus, we can just use the self-inverse xor hash (signature (x)) to switch between i1 and i2, regardless of whether are currently at i1 or i2. Which is a neat little trick. DATABASE ARCHITECTS: 2016 Originally HyPer had a very simple model for strings: We made sure that all strings are valid UTF-8, but otherwise did not really care about the intrinsics of Unicode.And that is actually a quite sane model for most applications.Usually we do not care about the precise string structure anyway, and in the few places that we do (e.g., strpos and substr), we add some extra logic to handle UTF-8 DATABASE ARCHITECTS: THE PRICE OF CORRECTNESS I know, VectorWise for example keeps tracks of the domains of individual attributes and avoids overflow checks if the result cannot overflow. But there are two problems with that approach: 1) maintaining the domain is problematic in the presence of high-update rates (admittedly not a use case for VectorWise), and 2) this can lead you to use larger data types than necessary. DATABASE ARCHITECTS: FUN WITH CHAR Fun with CHAR. The CHAR (n) data type is one of the more bizarre features of SQL. It is supposed to represent a fixed-length string, but as we will see in a moment, it behaves odd in all kinds of ways. IMHO it should never be used for anything. There might be use cases for CHAR (1), i.e., for storing a single character, but as a stringdata
DATABASE ARCHITECTS: JANUARY 2015 The CHAR (n) data type is one of the more bizarre features of SQL. It is supposed to represent a fixed-length string, but as we will see in a moment, it behaves odd in all kinds of ways. IMHO it should never be used for anything. There might be use cases for CHAR (1), i.e., for storing a single character, but as a string data type its semantics DATABASE ARCHITECTS: 2015 A blog by and for database architects. It is well known that binary search is not particular fast. For point queries hash tables are much faster, ideally accessing in O(1), And even when we need range queries n-ary search structures like B-Trees are much DATABASE ARCHITECTS: VORTEX: VECTORWISE GOES HADOOP Like Thomas in his first blog, where he announced his super-cool research system Hyper being available for download, I will also start my first blog post with a systems announcement.For this one, there is no download just yet, but by the end of next month, Actian will have available a new product that allows to use the Actian Vector system in MPP mode on Hadoop clusters. DATABASE ARCHITECTS: JUNE 2015 After investigating the plans with explain we saw that apparently MemSQL always uses either index-nested-loop-joins (INL) or nested-loop-joins (NL), which is very expensive in large, TPC-H-style join queries. The INL is ok, although still somewhat expensive, as seen in Q4, where only INL is used, but if the system is forced to use NL, performance is very poor. DATABASE ARCHITECTS: MAY 2014 The easiest way to play with it is the online demo. It provides you with an easy to use interface for entering queries, running them, and inspecting the execution plan. All queries are evaluated against a SF1 TPC-H database which contains roughly 1GB of data. HyPer webinterface.
DATABASE ARCHITECTS
The compile time is dramatically super linear, gcc is basically unable to compile the function if it contains 10,000 ifs or more. In this simple example clang fares better when using -O0, but with -O1 it shows super-linear compile times, too. DATABASE ARCHITECTS: ALL HASH TABLE SIZES YOU WILL EVER NEED When picking a hash table size we usually have two choices: Either, we pick a prime number or a power of 2. Powers of 2 are easy to use, as a modulo by a power of 2 is just a bit-wise and, but 1) they waste quite a bit of space, as we have to round up to the next power of 2, and 2) they require "good" hash functions, where looking at just a subset ofbits is ok.
DATABASE ARCHITECTS: WHY USE LEARNING WHEN YOU CAN FIT? We recently had a talk by Tim Kraska in our group, and he spoke among other things about learned indexes.As I had mentioned before, I am more in favor of using suitably implemented b-trees, for reasons like update friendliness and distribution independence.But nevertheless, the talk made me curious: The model they are learning is in the endvery primitive.
DATABASE ARCHITECTS: TRYING OUT HYPER At TUM we have built a very fast main-memory database system named HyPer.It offers fairly complete SQL92 support plus some SQL99 features, and is much faster than "traditional" database systems. The easiest way to play with it is the online demo.It provides you with an easy to use interface for entering queries, running them, and inspecting the execution plan. DATABASE ARCHITECTS: 2017 The inner pages consist of separator values, and offsets to the next lower level. Well, we can interpret that as a spline. Instead of just going down to the next level and then doing binary search in the next node, we can interpret our search key as position between the two separators, and then interpolate the position of our search key onethe next level.
DATABASE ARCHITECTS: THE CASE FOR B-TREE INDEX STRUCTURES Recently a very interesting paper made a Case for Learned Index Structures.It argued that we could, and perhaps should, replace traditional index structures with machine learning, using the following reasoning: If we consider the leaf pages of an index as a sorted array, the inner pages of the index point towards a (bucketized) position within that array. DATABASE ARCHITECTS: THE PRICE OF CORRECTNESS I know, VectorWise for example keeps tracks of the domains of individual attributes and avoids overflow checks if the result cannot overflow. But there are two problems with that approach: 1) maintaining the domain is problematic in the presence of high-update rates (admittedly not a use case for VectorWise), and 2) this can lead you to use larger data types than necessary. DATABASE ARCHITECTS: 2016 Originally HyPer had a very simple model for strings: We made sure that all strings are valid UTF-8, but otherwise did not really care about the intrinsics of Unicode.And that is actually a quite sane model for most applications.Usually we do not care about the precise string structure anyway, and in the few places that we do (e.g., strpos and substr), we add some extra logic to handle UTF-8 DATABASE ARCHITECTS: FUN WITH CHAR I have just re-read the SQL standard (Subclause 8.2). And it says:"" If the length in characters of X is not equal to the length in characters of Y, then the shorter string is effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to the length of the longer string by concatenation on the right of one or more pad characters, where the DATABASE ARCHITECTS: VORTEX: VECTORWISE GOES HADOOP Like Thomas in his first blog, where he announced his super-cool research system Hyper being available for download, I will also start my first blog post with a systems announcement.For this one, there is no download just yet, but by the end of next month, Actian will have available a new product that allows to use the Actian Vector system in MPP mode on Hadoop clusters.DATABASE ARCHITECTS
The compile time is dramatically super linear, gcc is basically unable to compile the function if it contains 10,000 ifs or more. In this simple example clang fares better when using -O0, but with -O1 it shows super-linear compile times, too. DATABASE ARCHITECTS: ALL HASH TABLE SIZES YOU WILL EVER NEED When picking a hash table size we usually have two choices: Either, we pick a prime number or a power of 2. Powers of 2 are easy to use, as a modulo by a power of 2 is just a bit-wise and, but 1) they waste quite a bit of space, as we have to round up to the next power of 2, and 2) they require "good" hash functions, where looking at just a subset ofbits is ok.
DATABASE ARCHITECTS: WHY USE LEARNING WHEN YOU CAN FIT? We recently had a talk by Tim Kraska in our group, and he spoke among other things about learned indexes.As I had mentioned before, I am more in favor of using suitably implemented b-trees, for reasons like update friendliness and distribution independence.But nevertheless, the talk made me curious: The model they are learning is in the endvery primitive.
DATABASE ARCHITECTS: TRYING OUT HYPER At TUM we have built a very fast main-memory database system named HyPer.It offers fairly complete SQL92 support plus some SQL99 features, and is much faster than "traditional" database systems. The easiest way to play with it is the online demo.It provides you with an easy to use interface for entering queries, running them, and inspecting the execution plan. DATABASE ARCHITECTS: 2017 The inner pages consist of separator values, and offsets to the next lower level. Well, we can interpret that as a spline. Instead of just going down to the next level and then doing binary search in the next node, we can interpret our search key as position between the two separators, and then interpolate the position of our search key onethe next level.
DATABASE ARCHITECTS: THE CASE FOR B-TREE INDEX STRUCTURES Recently a very interesting paper made a Case for Learned Index Structures.It argued that we could, and perhaps should, replace traditional index structures with machine learning, using the following reasoning: If we consider the leaf pages of an index as a sorted array, the inner pages of the index point towards a (bucketized) position within that array. DATABASE ARCHITECTS: THE PRICE OF CORRECTNESS I know, VectorWise for example keeps tracks of the domains of individual attributes and avoids overflow checks if the result cannot overflow. But there are two problems with that approach: 1) maintaining the domain is problematic in the presence of high-update rates (admittedly not a use case for VectorWise), and 2) this can lead you to use larger data types than necessary. DATABASE ARCHITECTS: 2016 Originally HyPer had a very simple model for strings: We made sure that all strings are valid UTF-8, but otherwise did not really care about the intrinsics of Unicode.And that is actually a quite sane model for most applications.Usually we do not care about the precise string structure anyway, and in the few places that we do (e.g., strpos and substr), we add some extra logic to handle UTF-8 DATABASE ARCHITECTS: FUN WITH CHAR I have just re-read the SQL standard (Subclause 8.2). And it says:"" If the length in characters of X is not equal to the length in characters of Y, then the shorter string is effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to the length of the longer string by concatenation on the right of one or more pad characters, where the DATABASE ARCHITECTS: VORTEX: VECTORWISE GOES HADOOP Like Thomas in his first blog, where he announced his super-cool research system Hyper being available for download, I will also start my first blog post with a systems announcement.For this one, there is no download just yet, but by the end of next month, Actian will have available a new product that allows to use the Actian Vector system in MPP mode on Hadoop clusters. DATABASE ARCHITECTS: 2019 However all this hold only because the original cuckoo filters use power-of-two hash tables. If our hash table size is not a power of 2, the xor can place the alternative position beyond the size of the hash table, which breaks the filter. DATABASE ARCHITECTS: OCTOBER 2020 Since C++11, multi-threaded C++ code has been governed by a rigorous memory model. The model allows implementing concurrent code such as low-level synchronization primitives or lock-free data structures in aportable fashion.
DATABASE ARCHITECTS: FUN WITH CHAR I have just re-read the SQL standard (Subclause 8.2). And it says:"" If the length in characters of X is not equal to the length in characters of Y, then the shorter string is effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to the length of the longer string by concatenation on the right of one or more pad characters, where the DATABASE ARCHITECTS: MAY 2019 A blog by and for database architects. We recently had a talk by Tim Kraska in our group, and he spoke among other things about learned indexes.As I had mentioned before, I am more in favor of using suitably implemented b-trees, for reasons like update friendliness and distribution independence.But nevertheless, the talk made me curious: The model they are learning is in the end DATABASE ARCHITECTS: CUCKOO FILTERS WITH ARBITRARILY SIZED However all this hold only because the original cuckoo filters use power-of-two hash tables. If our hash table size is not a power of 2, the xor can place the alternative position beyond the size of the hash table, which breaks the filter. DATABASE ARCHITECTS: DECEMBER 2017 The inner pages consist of separator values, and offsets to the next lower level. Well, we can interpret that as a spline. Instead of just going down to the next level and then doing binary search in the next node, we can interpret our search key as position between the two separators, and then interpolate the position of our search key onethe next level.
DATABASE ARCHITECTS: VORTEX: VECTORWISE GOES HADOOP Like Thomas in his first blog, where he announced his super-cool research system Hyper being available for download, I will also start my first blog post with a systems announcement.For this one, there is no download just yet, but by the end of next month, Actian will have available a new product that allows to use the Actian Vector system in MPP mode on Hadoop clusters. DATABASE ARCHITECTS: 2014 A blog by and for database architects. Furthermore this behavior interacts badly with the review process. Of course the reviewers know that are shown only the good cases, therefore these cases have to bereally good.
DATABASE ARCHITECTS: JUNE 2015 After investigating the plans with explain we saw that apparently MemSQL always uses either index-nested-loop-joins (INL) or nested-loop-joins (NL), which is very expensive in large, TPC-H-style join queries. The INL is ok, although still somewhat expensive, as seen in Q4, where only INL is used, but if the system is forced to use NL, performance is very poor. DATABASE ARCHITECTS: MAY 2014 Like Thomas in his first blog, where he announced his super-cool research system Hyper being available for download, I will also start my first blog post with a systems announcement.For this one, there is no download just yet, but by the end of next month, Actian will have available a new product that allows to use the Actian Vector system in MPP mode on Hadoop clusters.DATABASE ARCHITECTS
DATABASE ARCHITECTS
A blog by and for database architects.

WEDNESDAY, JULY 24, 2019

CUCKOO FILTERS WITH ARBITRARILY SIZED TABLES

Cuckoo Filters
are an interesting alternative to Bloom filters. Instead of maintaining a filter bitmap, they maintain a small (cuckoo-)hash table of key signatures, which has several good properties. For example, it stores just the signature of a key instead of the key itself, but is nevertheless able to move an element to a different position in the case of conflicts.
This conflict resolution mechanism is quite interesting: Just like in regular cuckoo hash tables, each element has two potential positions where it can be placed, a primary position i1 and a secondary position i2. These can be computed as follows:

i1 = hash(x)
i2 = i1 xor hash(signature(x))

Remember that the cuckoo filter stores only the (small) signature(x), not x itself. Thus, when we encounter a value, we cannot know if it is at its position i1 or its position i2. However, we can nevertheless alternate between the positions because the following holds:

i1 = i2 xor hash(signature(x))

and we have the signature stored in the table. Thus, we can just use the self-inverse xor hash(signature(x)) to switch between i1 and i2, regardless of whether we are currently at i1 or i2. Which is a neat little trick, and it is used in the cuckoo filter conflict resolution logic.
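To make the trick concrete, here is a minimal C++ sketch (the hash functions and names are made up for illustration, not taken from any particular implementation):

#include <cstdint>

// Sketch only: simple multiplicative mixers stand in for real hash functions.
static uint64_t hashKey(uint64_t x) { return x * 0x9E3779B97F4A7C15ull; }
static uint64_t hashSig(uint64_t sig) { return sig * 0xFF51AFD7ED558CCDull; }

// Candidate positions in a table of size 2^p, with mask = 2^p - 1.
static uint64_t primaryPos(uint64_t x, uint64_t mask) {
   return hashKey(x) & mask;
}

// Self-inverse switch: passing i1 yields i2, passing i2 yields i1 again,
// because xor-ing twice with the same (masked) value is the identity.
static uint64_t otherPos(uint64_t i, uint64_t sig, uint64_t mask) {
   return (i ^ hashSig(sig)) & mask;
}

The mask keeps both positions within the table, and applying otherPos twice returns the original position.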
However, all this holds only because the original cuckoo filters use power-of-two hash tables. If our hash table size is not a power of 2, the xor can place the alternative position beyond the size of the hash table, which breaks the filter. Thus cuckoo filter tables always had to be powers of two, even if that wasted a lot of memory.

In more recent work Lang et al. proposed using cuckoo filters with size C, where C does not have to be a power of two, offering much better space utilization. They achieved this by using a different self-inverse function:

i1 = hash(x) mod C
i2 = -(i1 + hash(signature(x))) mod C

Note that the modulo computation can be made reasonably efficient by using magic numbers, which can be precomputed when allocating the filter. A slightly different way to formulate this is to introduce a switch function f, which switches between positions:

f(i,sig,C) = -(i + hash(sig)) mod C
i1 = hash(x) mod C
i2 = f(i1, signature(x), C)
i1 = f(i2, signature(x), C)

All this works because f is _self-inverse_, i.e.,

i = f(f(i, signature(x), C), signature(x), C)

for all C>0, i between 0 and C-1, and signature(x)>0. The only problem is: Is this true? In a purely mathematical sense it is, as you can convince yourself by expanding the formula, but cuckoo filters are not executed on abstract machines but on real CPUs. And there, something unpleasant happens: We can get numerical overflows of our integer registers, which implicitly introduces a modulo 2^32 into our computation. This breaks the self-inverseness of f in some cases, except when C is a power of two itself. Andreas Kipf noticed this problem when using the cuckoo filters with real-world data. Which teaches us not to trust in formulas without additional extensive empirical validation... Fortunately, we can repair the function f by using proper modular arithmetic. In pseudo-code this looks like this:
f(i,sig,C)
   x = C - (hash(sig) mod C)
   if (x >= i)
      return (x - i);
   // The natural formula would be C-(i-x), but we prefer this one...
   return C + (x - i);

This computes the correct wrap-around modulo C, at the cost of one additional if.
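As a quick empirical check (a sketch, not from the original post), we can brute-force verify the self-inverseness of the repaired f for many non-power-of-two sizes:

#include <cassert>
#include <cstdint>

static uint64_t hashSig(uint64_t sig) { return sig * 0xFF51AFD7ED558CCDull; }

// The repaired switch function, using proper modular arithmetic.
static uint64_t f(uint64_t i, uint64_t sig, uint64_t C) {
   uint64_t x = C - (hashSig(sig) % C);
   if (x >= i) return x - i;
   return C + (x - i);  // the unsigned wrap-around of (x - i) cancels out here
}

int main() {
   // Brute-force check of i == f(f(i, sig, C), sig, C) for many table sizes.
   for (uint64_t C = 1; C < 200; ++C)
      for (uint64_t i = 0; i < C; ++i)
         for (uint64_t sig = 1; sig < 50; ++sig)
            assert(f(f(i, sig, C), sig, C) == i);
   return 0;
}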
We can avoid the if by using predication, as shown below:
f(i,sig,C)
   x = C - (hash(sig) mod C)
   m = (x >= i) - 1
   return (x - i) + (C & m);

Here m is all ones if x < i and zero otherwise, i.e., we add C exactly when the subtraction would wrap around. This gives us cuckoo filters with arbitrarily sized tables.

Posted by Thomas Neumann at 3:40 PM
THURSDAY, MAY 16, 2019

WHY USE LEARNING WHEN YOU CAN FIT?

We recently had a talk by Tim Kraska in our group, and he spoke among other things about learned indexes. As I had mentioned before, I am more in favor of using suitably implemented b-trees,
for reasons like update friendliness and distribution independence. But nevertheless, the talk made me curious: The model they are learning is in the end very primitive. It is a two-level linear model, i.e., they are using a linear function to select another linear function. But if that is enough, why do we need machine learning? A simple function fit should work just as well. Thus, I tried the following:

1) we sort all data and keep it in an array, just like with learned indexes
2) we build the CDF
3) we fit a linear spline to the CDF minimizing the Chebyshev norm
4) we fit a polynomial function to the spline nodes
5) now we can look up a value by evaluating first the polynomial function, then the spline, and then retrieving the values from the array. The previous step is always the seed to a local search in the next step, as the sketch below illustrates.
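The following is a minimal sketch of that lookup pipeline, assuming the fitting steps above have already produced the spline, the polynomial, and their Chebyshev error bounds (all names and the representation are illustrative, this is not the experiment code):

#include <algorithm>
#include <cstddef>
#include <vector>

struct Knot { double key, pos; };   // spline knot mapping key -> CDF position

struct FittedIndex {
   std::vector<double> data;    // sorted keys
   std::vector<Knot> spline;    // linear spline over the CDF (>= 2 knots)
   std::vector<double> poly;    // polynomial coefficients, highest degree first
   std::size_t polyErr;         // Chebyshev bound of the polynomial, in knots
   std::size_t splineErr;       // Chebyshev bound of the spline, in tuples

   // Assumes the key lies within the spline's key domain.
   std::size_t lookup(double key) const {
      // Step 1: the polynomial predicts the spline segment (Horner evaluation).
      double t = 0;
      for (double c : poly) t = t * key + c;
      std::size_t guess =
          t < 0 ? 0 : std::min<std::size_t>((std::size_t)t, spline.size() - 2);
      // Step 2: local search for the true segment within the polynomial bound.
      std::size_t lo = guess > polyErr ? guess - polyErr : 0;
      std::size_t hi = std::min(guess + polyErr + 1, spline.size() - 1);
      auto it = std::upper_bound(spline.begin() + lo, spline.begin() + hi, key,
                                 [](double k, const Knot& n) { return k < n.key; });
      std::size_t seg = it == spline.begin() ? 0 : (it - spline.begin()) - 1;
      // Step 3: interpolate between the two surrounding knots.
      const Knot& a = spline[seg];
      const Knot& b = spline[seg + 1];
      double est = a.pos + (key - a.key) / (b.key - a.key) * (b.pos - a.pos);
      // Step 4: local search in the data within the spline bound.
      double clamped = std::max(0.0, std::min(est, (double)(data.size() - 1)));
      std::size_t p = (std::size_t)clamped;
      std::size_t dlo = p > splineErr ? p - splineErr : 0;
      std::size_t dhi = std::min(p + splineErr + 1, data.size());
      return std::lower_bound(data.begin() + dlo, data.begin() + dhi, key)
             - data.begin();
   }
};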
As we bound the Chebyshev norm in each step, the lookup is in O(1), without any need for machine learning or other magic boxes. Now admittedly there was some weasel wording in the previous paragraph: The lookup is in O(1), but the "constant" here is the Chebyshev norm of the fit, which means this only works well if we can find a good fit. But just the same is true for the learned indexes, of course.
Now how do we find a good fit? In theory we know how to construct the optimal fit in O(n log n), but that paper is beyond me. I am not aware of any implementation, and the paper is much too vague to allow for one. But constructing a good (rather than optimal) fit is much easier, and can also be done in O(n log n).
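As an illustration of the flavor of such algorithms, here is a simplified greedy fit (a sketch, not the algorithm from the linked paper; the quadratic re-checking keeps the sketch simple, while real implementations maintain an error corridor to avoid re-scanning):

#include <cmath>
#include <cstddef>
#include <vector>

struct Knot { double key, pos; };

// Greedy one-pass fit over the CDF of a non-empty array of distinct, sorted
// keys: extend the current segment while every point stays within maxErr of
// the interpolated line; otherwise close the segment and place a knot.
std::vector<Knot> greedyFit(const std::vector<double>& keys, double maxErr) {
   std::vector<Knot> spline{{keys.front(), 0}};
   std::size_t start = 0;
   for (std::size_t i = 2; i < keys.size(); ++i) {
      // Tentative segment from the last knot to point i.
      double dx = keys[i] - keys[start];
      double dy = double(i - start);
      bool ok = true;
      for (std::size_t j = start + 1; j < i && ok; ++j) {
         double est = double(start) + dy * (keys[j] - keys[start]) / dx;
         ok = std::fabs(est - double(j)) <= maxErr;
      }
      if (!ok) {  // error bound violated: close the segment at i-1
         spline.push_back({keys[i - 1], double(i - 1)});
         start = i - 1;
      }
   }
   if (spline.back().key != keys.back())
      spline.push_back({keys.back(), double(keys.size() - 1)});
   return spline;
}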
Using that algorithm, we can construct a linear spline that bounds the maximum error efficiently, and we know what the maximum error is over the whole domain. Thus, we can probe the spline to get an estimate for the real value position, and we can then perform an efficient local search on a small, known window of the data. The only problem is evaluating the spline itself. Evaluating a linear spline is pretty cheap, but we have to find the appropriate knot points to evaluate. Traditionally, we find these with binary search again. Note that the spline is much smaller than the original data, but we still want to avoid the binary search. Thus, we construct a polynomial function to predict the spline knot, again minimizing the Chebyshev norm, which allows us to consider only a small subset of spline nodes, leading to the aforementioned O(1) bound. How well does this work in practice? On the map data set from the learned indexes paper and a log normal data set we get the following. (The learned indexes numbers are from the paper, the b-tree numbers are from here, and the spline numbers from these experiments. I still do not really know what the averages mean for the learned indexes, but they are probably the average errors averaged over all models).
Map data                    size (MB)   avg error
Learned Index (10,000)      0.15        8 ± 45
Learned Index (100,000)     1.53        2 ± 36
B-tree (10,000)             0.15        225
B-tree (100,000)            1.53        22
Spline (10,000)             0.15        193
Spline (100,000)            1.53        22
Log normal data             size (MB)   avg error
Learned Index (10,000)      0.15        17,060 ± 61,072
Learned Index (100,000)     1.53        17,005 ± 60,959
B-tree (10,000)             0.15        1,330
B-tree (100,000)            1.53        3
Spline (10,000)             0.15        153
Spline (100,000)            1.53        1
Somewhat surprisingly, the accuracy of the spline is nearly identical to the interpolating b-tree for the real-world map data, which suggests that the separators span the domain reasonably well there. For the log normal data the spline is significantly better, and leads to nearly perfect predictions. Note that the original data sets contain many millions of data points in both cases, thus the prediction accuracy is really high. For practical applications I still recommend the B-tree, of course, even though the polynomial+spline solution is in "O(1)" while the B-tree is in O(log n). I can update a B-tree just fine, including concurrency, recovery, etc., while I do not know how to do that with the polynomial+spline solution. But if one wants to go the read-only/read-mostly route, the fitted functions could be an attractive alternative to machine learning. The advantage of using fits is that the algorithms are reasonably fast, we understand how they work, and they give strong guarantees for the results.
Posted by Thomas Neumann at 4:44 PM
FRIDAY, FEBRUARY 1, 2019

HONEST ASYMPTOTIC COMPLEXITY FOR SEARCH TREES AND TRIES

Fun fact that was pointed out to me by Viktor: All traditional books on algorithms and data structures that we could find gave the lookup costs of balanced search trees as _O(log n)_ (i.e., the depth of the search tree), and the lookup costs of tries as _O(k)_ (i.e., the length of the key). At first glance this is a logarithmic time lookup against a linear time lookup, which makes people nervous when thinking about long keys.
But that argument is very unfair: Either we consider a key comparison an _O(1)_ operation, then a tree lookup is indeed in _O(log n)_, but then a trie lookup is in _O(1)_! As the length of the key has to be bounded by a constant to get _O(1)_ comparisons, the depth of the trie is bounded, too. Or the length of the key matters, then a trie lookup is indeed in _O(k)_, but then a tree lookup is in _O(k log n)_. We have to compare with the key on every level, and if we are unlucky we have to look at the whole key, which gives the factor _k_. Which of course makes tries much more attractive asymptotically. Note that we ignore wall clock times here, which are much more difficult to predict, but in many if not most cases tries are indeed much faster than search trees.
I believe the reason why text books get away with this unfair comparison is that they all present balanced search trees with integer keys:
[Figure: a balanced search tree over the integer keys 1, 4, 6, 8, and 10]

While tries are traditionally introduced with string examples. If they had used string keys for balanced search trees instead, it would have been clear that the key length matters:

[Figure: the same search tree over the string keys ABCD1, ABCD4, ABCD6, ABCD8, and ABCD10]

The trie examines every key byte at most once, while the search tree can examine every key byte _log n_ times. Thus, the asymptotic complexity of tries is actually better than that of balanced search trees.
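To see the hidden factor _k_ in action, here is a small counting sketch (illustrative, not from the post): it tallies how many key bytes a binary search over strings actually inspects.

#include <cstddef>
#include <string>
#include <vector>

// Binary search over sorted strings that counts inspected key bytes:
// each of the log n comparisons may touch the common prefix plus one byte.
std::size_t binarySearchBytes(const std::vector<std::string>& sorted,
                              const std::string& key, std::size_t& bytes) {
   std::size_t lo = 0, hi = sorted.size();
   while (lo < hi) {
      std::size_t mid = (lo + hi) / 2;
      const std::string& s = sorted[mid];
      std::size_t i = 0;
      while (i < s.size() && i < key.size() && s[i] == key[i]) ++i;
      bytes += i + 1;  // bytes the comparison actually had to look at
      if (s < key) lo = mid + 1; else hi = mid;
   }
   return lo;  // position of the first element >= key
}

With keys that share a long common prefix, such as the ABCD keys above, every one of the log n comparisons re-reads the prefix, so the byte count grows like k log n, while a trie reads each byte at most once.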
Posted by Thomas Neumann at 9:47 AM
FRIDAY, JUNE 8, 2018

PROPAGATION OF MISTAKES IN PAPERS

While reading papers on cardinality estimation I noticed something odd: The seminal paper by Flajolet and Martin on probabilistic counting gives a bias correction constant as 0.77351, while a more recent (and very useful) paper by Scheuermann and Mauve gives the constant as 0.775351. Was this a mistake? Or did they correct a mistake in the original paper? I started searching, and there is a large number of papers that use the value 0.775351, but there is also a number of papers that use the value 0.77351. Judging by the number of Google hits for "Flajolet 0.77351" vs. "Flajolet 0.775351" the 0.77351 group seems to be somewhat larger, but both camps have a significant number of publications. Interestingly, not a single paper mentions both constants, and thus no paper explains what the correct constant should be.
In the end I repeated the constant computation as explained by Flajolet, and the correct value is 0.77351. We can even derive one digit more when using double arithmetic (i.e., 0.773516), but that makes no difference in practice. Thus, the original paper was correct. But why do so many papers use the incorrect value 0.775351 then? My guess is that at some point somebody made a typo while writing a paper, introducing the superfluous digit 5, and that all other authors copied the constant from that paper without re-checking its value. I am not 100% sure what the origin of the mistake is. The incorrect value seems to appear first in the year 2007, showing up in multiple publications from that year. Judging by publication date the source seems to be this paper (at least it did not cite any other papers with the incorrect value, as far as I know). And everybody else just copied the constant from somewhere else, propagating it from paper to paper. If you find this web page because you are searching for the correct Flajolet/Martin bias correction constant, I can assure you that the original paper was correct, and that the value is 0.77351. But you do not have to trust me on this, you can just repeat the computation yourself.
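If you prefer an empirical check over redoing the analysis, a quick Monte Carlo sketch of the original probabilistic counting scheme (random numbers standing in for hashed keys; an illustration, not the analytical computation, and the estimator has a few percent spread) lands near 0.77:

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// rho(v): position of the least significant 1-bit, as in the original paper.
static int rho(uint64_t v) {
   if (!v) return 63;
   int r = 0;
   while (!(v & 1)) { v >>= 1; ++r; }
   return r;
}

int main() {
   const uint64_t n = 1 << 20;  // distinct elements per trial
   const int trials = 100;
   std::mt19937_64 rng(42);     // random values stand in for hashed keys
   double sumR = 0;
   for (int t = 0; t < trials; ++t) {
      std::vector<bool> bitmap(65);
      for (uint64_t i = 0; i < n; ++i) bitmap[rho(rng())] = true;
      int R = 0;                 // index of the first zero bit
      while (bitmap[R]) ++R;
      sumR += R;
   }
   // E[R] is approximately log2(phi * n), hence phi ~ 2^avg(R) / n.
   std::printf("phi estimate: %f\n", std::exp2(sumR / trials) / n);
   return 0;
}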
Posted by Thomas Neumann at 9:06 AM
MONDAY, APRIL 23, 2018

ADDENDUM TO THE MILP OPTIMIZATION TIMES

In our upcoming SIGMOD paper on Adaptive Optimization of Very Large Join Queries we compared result quality and optimization times of over a dozen approaches for join ordering, over a wide range of query sizes. Which is quite a challenging problem, as the different algorithms often work under different assumptions, usually no reference implementation is available, and we had to unify them all into one framework that can handle joins from 10 to 5,000 relations. One of the approaches that we included was Solving the Join Ordering Problem via Mixed Integer Linear Programming by Immanuel Trummer and Christoph Koch. Our implementation tries to follow the original paper faithfully, implementing the mapping from query graph to MILP problem just as described in the original paper. For some benchmarks like TPC-DS (up to 18 relations in a join, with a median of 3) that implementation worked fine. But for some other benchmarks like the Join Order Benchmark (up to 17 relations, median 8) and the SQLite join set (up to 64 relations, median 34) we saw significant optimization times on our Xeon E7-4870 system: In total 290s for the JOB queries, and 5,100s for the SQLite queries. (Note that the JOB times do not include queries with non-inner join edges, as these currently cannot be handled by the MILP approach).
Immanuel pointed out to me that we can improve the optimization time quite a bit by initializing the start position for the Gurobi solver to a solution constructed by a greedy heuristic. In particular for the SQLite queries that helps a lot, as the greedy solution works very well there and thus the start position is already very good. The optimization times for his implementation (on weaker hardware, a 2.2 GHz Intel Core i7 laptop) are 52s for the Join Order Benchmark and 44s for the SQLite queries.
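For reference, setting such a MIP start is straightforward in Gurobi's C++ API; the sketch below assumes hypothetical join-ordering variables and a greedy solution vector:

#include "gurobi_c++.h"
#include <cstddef>
#include <vector>

// Sketch: seed the MILP with a greedy join order before optimizing.
// 'vars' and 'greedySolution' are placeholders for the actual encoding.
void solveWithWarmStart(GRBModel& model, std::vector<GRBVar>& vars,
                        const std::vector<double>& greedySolution) {
   // The Start attribute gives the solver a first incumbent to improve upon.
   for (std::size_t i = 0; i < vars.size(); ++i)
      vars[i].set(GRB_DoubleAttr_Start, greedySolution[i]);
   model.optimize();
}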
I am glad for the hint, and thus amend the numbers here. As far as I can see, this initialization trick was not mentioned in the original SIGMOD 2017 paper. But of course a carefully tuned implementation will often have tricks that are unfortunately not described in detail in the corresponding publication. If there are any more comments about any of the approaches we measured I am happy to hear them.

Posted by Thomas Neumann at 2:11 PM
SATURDAY, DECEMBER 23, 2017

THE CASE FOR B-TREE INDEX STRUCTURES

Recently a very interesting paper made a Case for Learned Index Structures. It argued that we could, and perhaps should, replace traditional index structures with machine learning, using the following reasoning: If we consider the leaf pages of an index as a sorted array, the inner pages of the index point towards a (bucketized) position within that array. Which means that the index essentially describes the cumulative distribution function (CDF), mapping from keys to array positions. And the argument of that paper was that using machine learning we can do that mapping much better, because a) the learned model (in this case a neural network) is much smaller than a traditional b-tree, and b) the learned model can predict the CDF value much more accurately than a simple b-tree, which improves performance. Now I am all in favor of trying out new ideas, and adapting to the data distribution is clearly a good idea, but do we really need a neural network for that? Because, after all, the neural network is just an approximation of the CDF function. There are many other ways to approximate a function, for example spline interpolation: We define a few
knots of the spline, and then interpolate between the knots. For example: [Figure: spline interpolation between knots (picture by D.J. Graham)]

Thus, what we need for a spline is a sequence of knots we can interpolate between, i.e., a sequence of (x,y) values. Now, if we think back to traditional index structures, in particular B-trees, we see that they have something similar: The inner pages consist of separator values and offsets to the next lower level. Well, we can interpret that as a spline. Instead of just going down to the next level and then doing binary search in the next node, we can interpret our search key as a position between the two separators, and then _interpolate_ the position of our search key on the next level. This estimate will be slightly off, of course, but the same is true for the machine learning approach, and we can use the same binary search strategy starting from our estimated position. We can use that interpolation strategy on all levels, both when navigating the inner pages and when going down to the leaf nodes. (A code sketch of this interpolation step follows at the end of this post.)

How well does that work in practice? The learned indexes paper gives accuracy results and performance results for different sizes of neural network models. In the paper the b-trees are depicted as being very large, but in reality that is a parameter, of course. We can get arbitrarily sized b-trees by modifying the page size of the b-tree. For comparison we chose the b-trees to have the same size (in KB) as the neural networks reported in the paper. The source code of the learned indexes approach is not available, thus we only report the original numbers for the neural networks. Our own proof of concept code is available upon request. As data sets we used the map set and the lognormal set mentioned in the paper, as we could not obtain the other data sets. If we just look at the accuracy of the prediction of the final tuple, we get as average error the numbers shown below. For the b-trees we report the distance between the estimated position and the real tuple position, averaged over all elements in the data set. For the neural networks the wording in the paper is a bit unclear; we think the numbers are the average of the average errors of the second-level models, which might be slightly different.
Map data                    size (MB)   avg error
Learned Index (10,000)      0.15        8 ± 45
Learned Index (100,000)     1.53        2 ± 36
Complex Learned Index       1.53        2 ± 30
B-tree (10,000)             0.15        225
B-tree (100,000)            1.53        22

Log normal data             size (MB)   avg error
Learned Index (10,000)      0.15        17,060 ± 61,072
Learned Index (100,000)     1.53        17,005 ± 60,959
Complex Learned Index       1.53        8 ± 33
B-tree (10,000)             0.15        1,330
B-tree (100,000)            1.53        3
If we look at the numbers, the interpolating b-tree doesn't perform that badly. For the map data the learned index is a bit more accurate, but the difference is small. For the log normal data the interpolating b-tree is in fact much more accurate than the learned index, being able to predict the final position very accurately. What does that mean for index performance? That is a complicated topic, as we do not have the source code of the learned index and we do not even know precisely on which hardware the experiments were run. We thus only give some indicative numbers, being fully aware that we might be comparing apples with oranges due to various differences in hardware and implementation. If we compare the reported numbers from the paper for lognormal with our proof of concept implementation (running on an i7-5820K @ 3.30GHz, searching for every element in the data set in shuffled order) we get:
Log normal data             Total (ns)   Model (ns)   Search (ns)
Learned Index (10,000)      178          26           152
Learned Index (100,000)     152          36           127
Complex Learned Index       178          110          67
B-tree (10,000)             156          101          54
B-tree (100,000)            171          159          12
Again, the b-tree does not perform that badly, being virtually identical to the reported learned index performance (remember the caveat about hardware differences!). And the b-tree is a very well understood data structure, well tested, with efficient update support etc., while the machine learning model will have great difficulties if the data is updated later on. Thus, I would argue that traditional index structures, in particular b-trees, are still the method of choice, and will probably remain so in the foreseeable future. Does this mean we should not consider machine learning for indexing? No, we should consider everything that helps. It is just that "everything that helps" does not just include fashionable trends like machine learning, but also efficient implementations of well known data structures.
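As promised above, a minimal sketch of the interpolation step within an inner node (node layout and names are illustrative, not the proof of concept code):

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative inner node: sorted separator values (child offsets omitted).
struct InnerNode {
   std::vector<uint64_t> seps;
};

// Returns the index of the last separator <= key, using interpolation
// between the node boundaries as the starting estimate.
std::size_t interpolateSlot(const InnerNode& n, uint64_t key) {
   uint64_t lo = n.seps.front(), hi = n.seps.back();
   if (key <= lo) return 0;
   if (key >= hi) return n.seps.size() - 1;
   std::size_t est = (std::size_t)((double)(key - lo) / (double)(hi - lo) *
                                   (double)(n.seps.size() - 1));
   // Repair the estimate with a local search; a real implementation would
   // widen the search window exponentially instead of scanning linearly.
   while (est > 0 && n.seps[est] > key) --est;
   while (est + 1 < n.seps.size() && n.seps[est + 1] <= key) ++est;
   return est;
}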
Posted by Thomas Neumann at 12:19 PM
FRIDAY, FEBRUARY 17, 2017

REASONING IN THE PRESENCE OF NULLS

One of the defining characteristics of SQL is that it is a declarative query language. That is, the user does not specify how the result should be computed, but instead specifies what conditions the result should satisfy. Within these constraints the database is free to choose between execution alternatives. This has some interesting consequences: Consider for example the query

select x=a and x=b

Clearly, it is equivalent to the following query

select x=a and a=b

After all, x=a, and thus we can substitute x with a in the second term.
And of course the same is true if we replace a and b with constants:

select x=1 and x=2

is equivalent to

select x=1 and 1=2

Now the really interesting question is: Is a database system allowed to infer that this is equivalent to

select false

? After all, x cannot be equal to 1 and equal to 2 simultaneously, and the second formulation is x=1 and false, which is clearly false. But what looks intuitive is not necessarily true, as we can see in the result of the following two queries (with PostgreSQL results):
postgres=> select x,x=1 and x=2 from (values(1),(null)) s(x);
  x   | ?column?
------+----------
    1 | f
 NULL | NULL
(2 rows)

postgres=> select x,x=1 and 1=2 from (values(1),(null)) s(x);
  x   | ?column?
------+----------
    1 | f
 NULL | f
(2 rows)
The reason for that difference is the tricky behavior of NULL: 1) a comparison of a value with NULL is NULL, 2) NULL AND NULL is NULL, but 3) NULL AND FALSE is FALSE.
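These three rules are just Kleene's three-valued logic; a tiny sketch (using std::optional<bool> to play the role of SQL's NULL) makes the AND behavior explicit:

#include <optional>

// Three-valued AND, with std::nullopt playing the role of SQL NULL.
std::optional<bool> sqlAnd(std::optional<bool> a, std::optional<bool> b) {
   if (a && !*a) return false;         // FALSE AND anything is FALSE
   if (b && !*b) return false;         // anything AND FALSE is FALSE
   if (!a || !b) return std::nullopt;  // otherwise NULL is contagious
   return true;                        // TRUE AND TRUE
}

With this, for x = NULL the condition x=1 AND 1=2 evaluates as sqlAnd(NULL, FALSE) = FALSE, while x=1 AND x=2 evaluates as sqlAnd(NULL, NULL) = NULL, which is exactly the PostgreSQL behavior above.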
Both 1) and 3) make sense, 1) because we cannot say anything about the result of the comparison, and 3) because it doesn't matter which value NULL really stands for (either true or false), the AND with FALSE will always produce a FALSE value. But in combination, they lead to very surprising behavior, namely that some databases return NULL and other databases return FALSE for the same query, depending on whether they noticed that the query contains a contradiction or not. And this kind of reasoning occurs in many other circumstances, too, for example in this query

select cast(x as integer)=1.5

Obviously, this condition can never be true, regardless of the value of x, as 1.5 is not an integer number. But what about NULL? Are we allowed to statically infer that the result is false, even if x might be a NULL value?
I tried to find an answer to that question in the SQL:2011 standard, but unfortunately it is not specified clearly. But after looking at the definition of AND, I have convinced myself that returning false here is ok. After all, the argument for NULL AND FALSE being FALSE is that NULL is an unknown value from within the domain, and any value AND FALSE is FALSE. If we use the same argument here, we can statically decide that the result is false, regardless of the value of x. But even though the argument makes sense (and it is good for the query optimizer, which exploits that fact), it is still unfortunate that the query result changes depending on how smart the database system is. A smart system can infer that the result is false, no matter what, while a simpler system might return NULL. But perhaps that is just the consequence of having a declarative system: we cannot (and should not!) control in which order expressions are evaluated.

Posted by Thomas Neumann at 3:22 PM