MS SQL Server. There is a huge table (> 15,000,000 records). Each record contains a unique record identifier (the table key), a non-unique string identifier (the software that writes to the database), a timestamp, and other, less interesting data. In a stored procedure (CLR SP) I first want to collect all the non-unique identifiers and then, iterating over them, collect the data of interest:

    // Pseudocode
    list_of_serials = SELECT [SerialNo] FROM [Table] GROUP BY [SerialNo];
    foreach (serial in list_of_serials)
    {
        rowset = SELECT * FROM [Table]
                 WHERE [SerialNo] = @serial
                   AND [Timestamp] BETWEEN @startDate AND @endDate
                 ORDER BY [Timestamp];
        // Process the results
    }

The trouble is that each query takes at least 10 minutes even against a local database. I have been poring over indexes, but so far without any tangible results. How can I deal with this problem? It is already too late for additional meta-tables and triggers =) I would appreciate any ideas.

Along the way, another .NET question: does it make sense to first collect the data into an IEnumerable<...> and then process it with LINQ, or to query the database each time? (Remember, we are inside a CLR SP.)

  • And does a single query, SELECT * FROM [Table] WHERE [Timestamp] BETWEEN @startDate AND @endDate GROUP BY [SerialNo];, with the result written to a file, also take 10 minutes? (A valid single-pass form of this idea is sketched after this comment thread.) - avp
  • I forgot to add the ORDER BY condition. @avp Yes, even the first query alone takes more than 5-7 minutes. - free_ze
  • @Free_ze, and how many records does it return (across all SerialNo for one BETWEEN @startDate AND @endDate range)? Maybe there are simply a lot of them (say, 1,000,000 in your case); then no speed-up will help anyway. - avp
  • @avp SELECT COUNT(*) FROM [Table] WHERE [Timestamp] BETWEEN '20130101' AND '20140101' GROUP BY [SerialNumber] tops out at 450,000, and the query took 18 minutes =) So even the correct indexes cannot save me with such a database architecture? - free_ze
  • Today I started reading the book Refactoring SQL Applications by Stéphane Faroult and Pascal L'Hermite; it describes approaches to exactly this kind of problem. Have a look at your leisure, I think you will find it useful. - NMD
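
For reference: SELECT * cannot be combined with GROUP BY, so the single-pass idea avp suggests in the first comment needs a valid form. A minimal sketch, illustrative only and not code from the thread:

    -- Hypothetical single-pass variant: one scan over the date range,
    -- ordered so that all rows for each serial arrive together.
    SELECT [SerialNo], [Timestamp], [Variable2]
    FROM [Table]
    WHERE [Timestamp] BETWEEN @startDate AND @endDate
    ORDER BY [SerialNo], [Timestamp];

Reading the rows in this order lets the client process each serial's group in one streaming pass instead of issuing a separate query per serial.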

1 answer

I solved the problem myself. The "Display Estimated Execution Plan" tool (an item in the context menu of the query editor in SQL Server Management Studio) helped: it graphically shows the planned stages of query processing (including the indexes used) and the percentage of time spent on each. With a table as large as mine, you have to get away from a full scan (over the key, i.e. the clustered index). The structure is as follows:

    [ID]        [uniqueidentifier] NOT NULL,
    [SerialNo]  [varchar](60)      NOT NULL,
    [Timestamp] [datetime]         NOT NULL,
    [Variable1] [varchar](100)     NOT NULL,
    [Variable2] [varchar](250)     NOT NULL,
    .....

First I need to select all the values of [SerialNo], and then, for each of them, find the values of [Variable2] where [Timestamp] falls between @startDate and @endDate.

    -- Step #1
    SELECT [SerialNo]
    FROM [Table]
    WHERE [Variable1] = 'needed_value'
      AND [Timestamp] BETWEEN @startDate AND @endDate
    GROUP BY [SerialNo]

    -- Step #2
    SELECT [Timestamp], [Variable2]
    FROM [Table]
    WHERE [SerialNo] = @serial
      AND [Variable1] = 'needed_value'
      AND [Timestamp] BETWEEN @startDate AND @endDate
    ORDER BY [Timestamp]

I need my own index. It cannot be clustered (there can be only one clustered index, and that one is the table key). Without hesitation, I packed the following columns into it (in this order):

 [SerialNo], [Timestamp], [Variable1], [Variable2] 
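
For concreteness, a minimal sketch of how such an index could be declared; the index name (and the table name [Table]) are placeholders, not taken from the original post:

    -- Non-clustered index with the columns in the order listed above.
    -- IX_Table_SerialNo is a hypothetical name.
    CREATE NONCLUSTERED INDEX [IX_Table_SerialNo]
        ON [Table] ([SerialNo], [Timestamp], [Variable1], [Variable2]);

Since [Variable2] is only ever selected, never filtered on, an alternative would be to keep it out of the key and add it via INCLUDE ([Variable2]): the index would still cover both queries while keeping the key entries smaller.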

As a result, the list of serials (the first query) is retrieved in 2 seconds, and the information on each of them (the second query) in less than a second.
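
As a side note (this is not from the original answer), such timings and the index usage are easy to confirm with standard session statistics in SSMS, alongside the execution plan:

    -- Report CPU/elapsed time and logical reads per query, to verify
    -- that the full clustered-index scan is gone.
    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;

    SELECT [SerialNo]
    FROM [Table]
    WHERE [Variable1] = 'needed_value'
      AND [Timestamp] BETWEEN @startDate AND @endDate
    GROUP BY [SerialNo];

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;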