In order to find out the size of the entire database or, for example, of a single table, there is the 'sp_spaceused' command. But going further, a question arises: how do you get statistics on the fields? How do you find out which field (column) in a table holds the heaviest data, and which one the lightest?

The question is not so much theoretical as practical: I have a huge database that needs to be analyzed (I am gradually getting into data mining).

    2 answers

    Use the DATALENGTH() function

    The following example returns the length in bytes of the ProductName column for each row of the MyOrderTable table.

     SELECT DATALENGTH(ProductName) as [SizeInBytes] FROM MyOrderTable 

    For reference: DATALENGTH
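
    To compare columns with one another, the same function can be aggregated per column. A minimal sketch (here [Comment] stands in for a hypothetical second column):

     SELECT SUM(DATALENGTH(ProductName)) AS [ProductNameBytes],
            SUM(DATALENGTH(Comment)) AS [CommentBytes] -- Comment is a hypothetical column
     FROM MyOrderTable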

      How to find out which field (column) in the table holds the heaviest data, and which one the lightest?

      Are you interested in the number of bytes a given column occupies on disk (apparently so, since you mention sp_spaceused)?

      I am not sure it can be determined exactly. What can be seen exactly is how many pages (the 8 KB blocks in which SQL Server stores data) the data of the whole table (or index) occupies.
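
      For example, the exact per-index page counts can be read from the sys.dm_db_partition_stats view; a minimal sketch ('TableName' is a placeholder):

       select i.name as index_name,
              sum(ps.used_page_count) as used_pages,
              sum(ps.used_page_count) * 8 as used_kb -- pages are 8 KB each
       from sys.dm_db_partition_stats ps
       join sys.indexes i
         on i.object_id = ps.object_id and i.index_id = ps.index_id
       where ps.object_id = object_id('TableName')
       group by i.name;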

      The disk space allocated for storing a specific column can, it seems, only be estimated (and the estimate will not always be adequate). Even with datalength it is not entirely simple. More on this below.


      Take the query

       /*1*/ select sum(datalength([Column])) from [Table]; 

      as the basis for the estimate.

      First. Besides the actual column data, there is always additional service information that can be attributed to the columns, but whose volume cannot logically be divided among them in proportion to their number in the table (for example, the row header data). That is, size estimate /*1*/ should be taken as 'no less than'.

      The narrower the column and the shorter the row, the greater the relative overhead of the service data, and the further estimate /*1*/ is from reality. Thus, for a table with a single short column, the total size of the data (including service information) can significantly exceed the 'logical' size of the column's data. Compare, for example, for the table

       create table tiny_nums (value tinyint);
       insert into tiny_nums (value)
       select top (1048576) 1
       from master.sys.all_columns a cross join master.sys.all_columns b;

      the result returned by query /*1*/ with what sp_spaceused shows.
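
      The comparison itself could look like this:

       select sum(datalength(value)) from tiny_nums; -- about 1 MB of "logical" data
       exec sp_spaceused 'tiny_nums'; -- the actually allocated space is several times larger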

      Second. The value returned by datalength does not always correspond to reality. In particular, if datalength([Column]) returns NULL, the physically occupied space may well be non-zero.

      This is because column types are divided into fixed-length (e.g. int, char(20), datetime2(0), uniqueidentifier, etc.) and variable-length (e.g. varbinary(64), nvarchar(30), etc.). And while for variable-length columns estimate /*1*/ is approximately valid, for fixed-length columns the space to store a value is reserved even when the value itself is NULL.

      That is, for fixed-length columns, estimate /*1*/ should be adjusted by substituting (where possible) some non-empty value of the column's type for NULL (for example, 0 for int):

       /*2*/ select sum(datalength(isnull([IntColumn], 0))) from [Table]; 

      You should also take into account that for bit columns datalength returns 1. However, if there are several bit columns in the table (or index), SQL Server packs them 8 to a byte.
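
      A small sketch of how summing datalength overstates the storage of bit columns (the bit_flags table is made up for the demonstration):

       create table bit_flags (b1 bit, b2 bit, b3 bit, b4 bit, b5 bit, b6 bit, b7 bit, b8 bit);
       insert into bit_flags values (1, 0, 1, 0, 1, 0, 1, 0);
       -- reports 8 bytes for the row, although all 8 flags share a single byte on the page
       select sum(datalength(b1) + datalength(b2) + datalength(b3) + datalength(b4)
                + datalength(b5) + datalength(b6) + datalength(b7) + datalength(b8))
       from bit_flags;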

      Also, columns can be sparse, which means 0 bytes to store NULL (even for fixed-length types), but 4 extra bytes on top of each value that is not NULL:

       /*3*/ select sum(datalength([SparseColumn]) + 4) from [TableName]; 

      Third. If a column is not merely present in the table but also participates in indexes, it is 'weighted' by a multiple of the number of indexes it participates in. If the column is a key of the clustered index, you must additionally add the estimate multiplied by the number of all nonclustered indexes (since the leaf-level pages of nonclustered indexes contain the clustered index key values). So in the table

       create table SomeTable (
           PK_ID bigint primary key,
           FK_1_ID int,
           FK_2_ID int,
           ...
           FK_10_ID int,
           UID uniqueidentifier
       );
       create index IX_1 on SomeTable (FK_1_ID);
       ...
       create index IX_10 on SomeTable (FK_10_ID);

      the 'heaviest' column is likely to be not UID at all but PK_ID, since (besides participating in the clustered primary key) PK_ID values will also be present in 10 more nonclustered indexes.
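
      To see in how many indexes each column is explicitly declared, the catalog views can be queried; a minimal sketch (note that the implicit presence of the clustered key in nonclustered indexes is not reflected here and has to be counted separately):

       select c.name as column_name,
              count(distinct ic.index_id) as index_count
       from sys.index_columns ic
       join sys.columns c
         on c.object_id = ic.object_id and c.column_id = ic.column_id
       where ic.object_id = object_id('SomeTable')
       group by c.name;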

      It should also be noted that if a nonclustered index is a filtered index, the corresponding estimate (/*1*/, /*2*/ or /*3*/) should be computed not over the whole table but only over the rows matching that index's filter.
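
      For example, for a hypothetical filtered index created with a 'where Status = 1' predicate, a column's contribution to that index would be estimated as:

       select sum(datalength([Column])) from [Table] where Status = 1; -- the filter mirrors the (hypothetical) index definition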

      Fourth (applies to the Enterprise edition). If row or page compression is enabled for a table

       alter table [TableName] rebuild partition = all with (data_compression = row); 

      or for an index

       alter index [IX_Name] on [TableName] rebuild with (data_compression = page); 

      then estimates based on datalength cease to be adequate, and even the 'no less than' rule stops working.

      Compare, for the tables

       create table strings (value char(2000));
       insert into strings (value)
       select top (10000) replicate('a', 2000)
       from master.sys.all_columns a cross join master.sys.all_columns b;

      and

       create table strings_page (value char(2000)) with (data_compression = page);
       insert into strings_page with (tablock) (value)
       select top (10000) replicate('a', 2000)
       from master.sys.all_columns a cross join master.sys.all_columns b;

      the datalength-based estimates of the column size with what sp_spaceused shows. For the first table, the 'readings' of datalength and sp_spaceused will be close (since the table row is 'wide' and the share of service information is small), while for the second they will diverge dramatically.
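
      The comparison itself (a sketch):

       select sum(datalength(value)) from strings; -- 20,000,000 bytes
       exec sp_spaceused 'strings'; -- close to the estimate
       select sum(datalength(value)) from strings_page; -- the same 20,000,000 bytes
       exec sp_spaceused 'strings_page'; -- much less: the pages are compressed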

      Fifth. Everything said up to this point applies to SQL Server 2008. Later versions introduced COLUMNSTORE indexes, which, due to the way they are organized internally, can store data in a substantially compressed form. For them, too, estimating a column's size via datalength can give inadequate results. If for the table

       create table strings_col (value char(2000));
       insert into strings_col (value)
       select top (10000) replicate('a', 2000)
       from master.sys.all_columns a cross join master.sys.all_columns b;
       create clustered columnstore index ix_clcs on strings_col;

      you compare sp_spaceused with datalength, you will again see a large discrepancy.
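
      The same kind of check (a sketch):

       select sum(datalength(value)) from strings_col; -- still 20,000,000 "logical" bytes
       exec sp_spaceused 'strings_col'; -- far smaller after columnstore compression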

      I believe this list of factors to consider when estimating the space occupied by a particular column is not exhaustive.