How to choose the right db_block_size

Posted on 24 January 2008 by Praveen

What block size should I use when creating a database? 

To begin with, the answer to that question depends entirely on the answer to another: are you using buffered I/O or not?

If you are buffering your I/O (that is, your file system has its own buffering mechanism, as most do by default), then your Oracle block size should match your file system’s buffer size exactly. For ext3, for example, that would mean 4K blocks by default, whilst for Tru64, it would mean 8K blocks -because those are the buffer sizes for the respective file systems. NTFS, by contrast, is a little unusual for mainstream file systems in that it does not have its own buffer, so reads from and writes to a volume formatted with NTFS are done directly -and at that point you can pretty much pick any Oracle block size you like.

If you are using buffered I/O, then all is not lost, because you usually have control over how big the filesystem’s buffers will be at the time of creating the new file system (that is, formatting the volume). In Linux, for example, when you format a new drive with ext2 or ext3, you can issue the command

mke2fs -b 2048 /dev/hdc1

…and the -b switch there sets up the “block size” for the new partition, and hence the file system’s buffer size is now 2K, not the standard 4K. At that point, 2K Oracle blocks become feasible, too

Partitions which don’t have a file system at all -that is, they are raw (the Unix name) or unformatted logical partitions (the Windows one)- will obviously not have a file system buffer to worry about, so again you can pretty much pick whatever Oracle block size you like.

What if I am using something like Oracle 10g’s ASM or the Oracle Cluster File System?

Well, ASM usually wraps itself around raw devices, so if you’re using ASM, you’re actually using raw under the hood, and you therefore have direct I/O at your disposal, and thus have complete freedom over which Oracle block size to use.

The Oracle Cluster File System (OCFS) is also a direct I/O technology, so once again any Oracle block size is acceptable. Just make sure you don’t get your block sizes confused, though! When you format a volume with OCFS, you do so with a command like this:

mkfs.ocfs -b 128 -L CFS_VOL

…and, once again, the -b switch there establishes the internal file system block size used to organise things on disk. But because OCFS is not a buffered file system, this does not (unlike with ext2 or ext3, for example) also set up a file system buffer size that you would then have to match when setting DB_BLOCK_SIZE. So don’t go thinking that having formatted the OCFS volume in this way that you’d then need to create a database with 128K for its DB_BLOCK_SIZE!! Apart from anything else, of course, DB_BLOCK_SIZE can only be a maximum of 32K on some platforms, and on common platforms (like Linux and Windows) can’t actually exceed 16K.

When you say I have complete freedom over the Oracle block size if I am using direct I/O technologies, do you really mean complete freedom?

 

In theory, yes. In practice, no. If you are using direct I/O, then you certainly have the freedom to set DB_BLOCK_SIZE to anything between 2K and 32K, though there are operating system limitations to take into account (Windows and Linux not permitting anything bigger than 16K, as I’ve already mentioned, for example). So certainly: in theory, you can set your DB_BLOCK_SIZE to anything in that range, and not have to worry about whether the file system is managing to buffer the reads and writes optimally.

But in practice, if you stuff lots of data into a block because it is big, you make the block an attractive target for lots of users performing DML, who will all collide on the same block, and thus contend for access to the block’s transaction slots. Conversely, if you make your Oracle blocks tiny, so that they each contain very few rows, then performing a full table scan is going to involve reading (perhaps) thousands of blocks, rather than mere tens or hundreds. 

In short, big blocks increase the risk of contention, and small blocks result in lots of I/O to retrieve small data sets. So, practically, when deciding on what Oracle block size to adopt, assuming you have the freedom to choose because you’re using direct I/O, you need to know  whether your database is mostly going to be reading large data sets (in which case, big blocks would be a good idea), or whether it is mostly going to be doing lots of DML (in which case, contention for a block’s transaction slots is a worry, and smaller blocks might help alleviate that).

This is the origin of the old myth (which is still unfortunately peddled without qualification) that data warehouses should use big blocks and OLTP databases should use small ones.{mospagebreak}

Now you’re confusing me. Is this a myth or not?! You’ve already agreed with its basic premise, it seems to me!

 

 I’ve agreed with the physics of the case - that big blocks are likely to cause contention, and small blocks won’t. And I’ve also agreed with the commonsense proposition that small blocks will have fewer rows in them than big blocks, and hence more I/O will be required to retrieve a given number of rows if they are stored in small blocks than in big ones. These are facts, after all, and fairly obvious ones at that.

The trouble starts when you then turn those commonsense observations into simplistic doctrines or rules of thumb such as ‘warehouses should use big blocks and OLTP should use small blocks’.

In the first place, such a rule of thumb ignores what this article has gone to some lengths to clear up in the first place: if you are using buffered I/O, the simplistic doctrine does not, cannot and must not apply to you. Data warehouse or OLTP shop, it makes no difference: if you’re using buffered I/O, you must use an Oracle block size that matches the file system’s buffer, or else suffer the inefficiencies in your I/O subsystem that arise when Oracle’s reads and writes are done in chunks which don’t neatly fit what the file system uses internally.

Secondly, the simplistic doctrine ignores the fact that there are other ways of dealing with contention than simply using smaller blocks. If you increase PCTFREE, for example, to (say) 50 rather than its default of 10, you have certainly “wasted” half your block -but in the process, an 8K block -which might have been subject to DML contention- has effectively been turned into a 4K block, which might not. The point is, you can set PCTFREE to different values for different tables and indexes, and can thus adapt it depending on what segments are actually suffering from contention -but DB_BLOCK_SIZE is set for the entire database, with no scope for subtlety on a case-by-case basis.

Similarly, if you have the money, you can buy the partitioning option for the Enterprise Edition -and hash partitioning, for one, can scatter rows which might otherwise be physically adjacent with each other, into widely-separated locations. Once rows are no longer co-located, they cannot be the subject of contention. Obviously, this isn’t a cheap option, and it’s not a panacea without consequences of its own, either. My point is simply that there are more ways to deal with contention than just settling for a one-size-fits-all small block size. 

But in 9i and later, I can use the multiple block size feature to achive the sort of case-by-case subtelty you’re talking about, can’t I?

 

You can certainly try! Obviously, if you’re using buffered I/O, then multiple block sizes, whilst physically do-able, is a really bad idea. But if you’re using direct I/O (raw, ASM or OCFS, for example), then it’s certainly a feasible approach.

But do bear in mind that in Oracle 10g, the parameters which set up the multiple block size feature (DB_nK_CACHE_SIZE) cannot then be subject to automatic tuning. Only the standard buffer cache (DB_CACHE_SIZE), using the standard database block size (DB_BLOCK_SIZE) can be the subject of automatic tuning. 

So, if you merrily adopt the multiple block size feature thinking that it’s a whiz-bang way of dealing with contention issues, do please bear in mind that you are actually making a rod for your own performance tuning back.

Conversely, if you plan on taking full advantage of the automated features in 10g, then you should stick to using DB_BLOCK_SIZE alone, and that means dealing with contention issues by tweaking PCTFREE on a segment-by-segment basis.{mospagebreak}

Well, now I’m getting lost. Assume I’m using direct I/O, in a 10g OLTP database, and I want automatic tuning to work properly. What DB_BLOCK_SIZE should I use?!

 

OK, taking that apart, bit by bit. The use of 10g means you could use multiple block sizes, but the desire to use automatic tuning means that you shouldn’t. The use of direct I/O means you could pick any block size you like, but the fact that you are an OLTP database, with lots of potential for contention, means that you should probably go smaller rather than bigger… except that OLTP databases will use lots of indexes, and if there’s one thing a big block size will do very nicely is ‘flatten’ your indexes: with fewer leaf nodes, there will be fewer branch nodes, and the height of your indexes can be reduced accordingly.

So we are at an impasse: the non-contending DML requirement suggests something like 2K or 4K, but the need for flat, efficient indexes suggests something more like 8K or 16K.

Well, my answer would be to probably go for 8K or 16K, and where contention actually arises (hopefully on just a handful of key tables and indexes), I’d be looking to use mechanisms such as increased PCTFREE, hash partitioning, reverse key indexes or hash clusters to deal with the problem. If you forced me to choose between 8K and 16K, then given that this is an OLTP database, with an ever-present risk of contention issues arising, I’d opt for 8K.

The “probably” there means my advice would change if it so happened that 90% of your tables and indexes were subject to such high levels of concurrent access that most of your database was going to be in need of a fix for contention. In that case, it would be more sensible to opt for a database-wide default of smaller block sizes.   

And if I were a 10g data warehouse with direct I/O and automatic tuning?

 

I’d still suggest 8K or 16K, though this time I’d be more likely to insist on 16K rather than 8K. The large block size will be good for full table scans as well as flattening and nicely compacting any indexes that are in use, and in a read-mostly environment like a warehouse, DML-induced contention issues are unlikely to be much of a problem.

Sounds like you’re saying 8K-ish in either case, and deal with contention as and when it arises?

 

Pretty much. The general rule is that big blocks are better for you than smaller ones. That’s because big blocks can store big rows without causing them to split (or chain, as it’s more accurately known); a table can be stored in fewer big blocks, and that means less I/O; indexes are more efficient when stored in big blocks; and the Oracle storage overhead is proportionately much smaller for a big block than for a small one.

So go big -that is, at least 8K, but 16K or better if possible.

Except that big blocks can suffer from more DML contention than small ones, so if contention is a concern, smaller it has to be. And that means thinking 8K and down. Whether you go 8K and use tricks like increased PCTFREE to deal with contention on a segment-by-segment basis, or whether you go 2K or 4K for the entire database; well, that would depend entirely on how widespread your contention risks are -but the advantages of big blocks are so extensive that I’d really be wanting to go generally large with specific exceptions if that was at all possible.

So there are no really simple answers?

 

Well, that depends! If you’re using buffered I/O, use your file system’s buffer size as your block size. That’s a fairly simple answer!

If you’re using direct I/O, use 8K or bigger, unless you are likely to have very widespread contention issues as a result of highly-concurrent massive DML. That’s a fairly simple answer too!

As is its corollary: if you are going to have widespread contention issues, affecting a large proportion of the segments in your database, then go for a smaller blocksize. 

So that’s three pretty straightforward statements of principle, it seems to me. But no, you won’t find me suggesting simplistic answers like “All indexes should be built in 32K blocks”, because that sort of answer ignores the buffered I/O factor, says nothing about the risks of DML-induced contention, and ignores the fact that using multiple block sizes means kissing goodbye to 10g’s automatic tuning features. Simple I am happy to do whenever I can; simplistic I won’t do, though

Leave a Reply