blob: 6793f25b3630adb53d0707600d99ec5913e3537f [file] [log] [blame]
drha059ad02001-04-17 20:09:11 +00001/*
drh9e572e62004-04-23 23:43:10 +00002** 2004 April 6
drha059ad02001-04-17 20:09:11 +00003**
drhb19a2bc2001-09-16 00:13:26 +00004** The author disclaims copyright to this source code. In place of
5** a legal notice, here is a blessing:
drha059ad02001-04-17 20:09:11 +00006**
drhb19a2bc2001-09-16 00:13:26 +00007** May you do good and not evil.
8** May you find forgiveness for yourself and forgive others.
9** May you share freely, never taking more than you give.
drha059ad02001-04-17 20:09:11 +000010**
11*************************************************************************
drh9e572e62004-04-23 23:43:10 +000012** $Id: btree.c,v 1.104 2004/04/23 23:43:10 drh Exp $
drh8b2f49b2001-06-08 00:21:52 +000013**
14** This file implements a external (disk-based) database using BTrees.
15** For a detailed discussion of BTrees, refer to
16**
17** Donald E. Knuth, THE ART OF COMPUTER PROGRAMMING, Volume 3:
18** "Sorting And Searching", pages 473-480. Addison-Wesley
19** Publishing Company, Reading, Massachusetts.
20**
21** The basic idea is that each page of the file contains N database
22** entries and N+1 pointers to subpages.
23**
24** ----------------------------------------------------------------
25** | Ptr(0) | Key(0) | Ptr(1) | Key(1) | ... | Key(N) | Ptr(N+1) |
26** ----------------------------------------------------------------
27**
28** All of the keys on the page that Ptr(0) points to have values less
29** than Key(0). All of the keys on page Ptr(1) and its subpages have
30** values greater than Key(0) and less than Key(1). All of the keys
31** on Ptr(N+1) and its subpages have values greater than Key(N). And
32** so forth.
33**
drh5e00f6c2001-09-13 13:46:56 +000034** Finding a particular key requires reading O(log(M)) pages from the
35** disk where M is the number of entries in the tree.
drh8b2f49b2001-06-08 00:21:52 +000036**
37** In this implementation, a single file can hold one or more separate
38** BTrees. Each BTree is identified by the index of its root page. The
drh9e572e62004-04-23 23:43:10 +000039** key and data for any entry are combined to form the "payload". A
40** fixed amount of payload can be carried directly on the database
41** page. If the payload is larger than the preset amount then surplus
42** bytes are stored on overflow pages. The payload for an entry
43** and the preceding pointer are combined to form a "Cell". Each
44** page has a small header which contains the Ptr(N+1) pointer and other
45** information such as the size of key and data.
drh8b2f49b2001-06-08 00:21:52 +000046**
drh9e572e62004-04-23 23:43:10 +000047** FORMAT DETAILS
48**
49** The file is divided into pages. The first page is called page 1,
50** the second is page 2, and so forth. A page number of zero indicates
51** "no such page". The page size can be anything between 512 and 65536.
52** Each page can be either a btree page, a freelist page or an overflow
53** page.
54**
55** The first page is always a btree page. The first 100 bytes of the first
56** page contain a special header that describes the file. The format
57** of that header is as follows:
58**
59** OFFSET SIZE DESCRIPTION
60** 0 16 Header string: "SQLite version 3"
61** 16 2 Page size in bytes.
62** 18 1 File format write version
63** 19 1 File format read version
64** 20 2 Bytes of unused space at the end of each page
65** 22 2 Maximum allowed local payload per entry
66** 24 8 File change counter
67** 32 4 First freelist page
68** 36 4 Number of freelist pages in the file
69** 40 60 15 4-byte meta values passed to higher layers
70**
71** All of the integer values are big-endian (most significant byte first).
72** The file change counter is incremented every time the database is changed.
73** This allows other processes to know when the file has changed and thus
74** when they need to flush their cache.
75**
76** Each btree page begins with a header described below. Note that the
77** header for page one begins at byte 100. For all other btree pages, the
78** header begins on byte zero.
79**
80** OFFSET SIZE DESCRIPTION
81** 0 1 Flags. 01: leaf, 02: zerodata, 04: intkey, F8: type
82** 1 2 byte offset to the first freeblock
83** 3 2 byte offset to the first cell
84** 5 1 number of fragmented free bytes
85** 6 4 Right child (the Ptr(N+1) value). Omitted if leaf
86**
87** The flags define the format of this btree page. The leaf flag means that
88** this page has no children. The zerodata flag means that this page carries
89** only keys and no data. The intkey flag means that the key is a single
90** variable length integer at the beginning of the payload.
91**
92** A variable-length integer is 1 to 9 bytes where the lower 7 bits of each
93** byte are used. The integer consists of all bytes that have bit 8 set and
94** the first byte with bit 8 clear. Unlike fixed-length values, variable-
95** length integers are little-endian. Examples:
96**
97** 0x00 becomes 0x00000000
98** 0x1b becomes 0x0000001b
99** 0x9b 0x4a becomes 0x00000dca
100** 0x80 0x1b becomes 0x0000001b
101** 0xf8 0xac 0xb1 0x91 0x01 becomes 0x12345678
102** 0x81 0x81 0x81 0x81 0x01 becomes 0x10204081
103**
104** Variable length integers are used for rowids and to hold the number of
105** bytes of key and data in a btree cell.
106**
107** Unused space within a btree page is collected into a linked list of
108** freeblocks. Each freeblock is at least 4 bytes in size. The byte offset
109** to the first freeblock is given in the header. Freeblocks occur in
110** increasing order. Because a freeblock is 4 bytes in size, the minimum
111** size allocation on a btree page is 4 bytes. Because a freeblock must be
112** at least 4 bytes in size, any group of 3 or fewer unused bytes cannot
113** exist on the freeblock chain. The total number of such fragmented bytes
114** is recorded in the page header at offset 5.
115**
116** SIZE DESCRIPTION
117** 2 Byte offset of the next freeblock
118** 2 Bytes in this freeblock
119**
120** Cells are of variable length. The first cell begins on the byte defined
121** in the page header. Cells do not necessarily occur in order - they can
122** skip around on the page.
123**
124** SIZE DESCRIPTION
125** 2 Byte offset of the next cell. 0 if this is the last cell
126** 4 Page number of the left child. Omitted if leaf flag is set.
127** var Number of bytes of data. Omitted if the zerodata flag is set.
128** var Number of bytes of key. Omitted if the intkey flag is set.
129** * Payload
130** 4 First page of the overflow chain. Omitted if no overflow
131**
132** Overflow pages form a linked list. Each page except the last is completely
133** filled with data (pagesize - 4 bytes). The last page can have as little
134** as 1 byte of data.
135**
136** SIZE DESCRIPTION
137** 4 Page number of next overflow page
138** * Data
139**
140** Freelist pages come in two subtypes: trunk pages and leaf pages. The
141** file header points to first in a linked list of trunk page. Each trunk
142** page points to multiple leaf pages. The content of a leaf page is
143** unspecified. A trunk page looks like this:
144**
145** SIZE DESCRIPTION
146** 4 Page number of next trunk page
147** 4 Number of leaf pointers on this page
148** * zero or more pages numbers of leaves
drha059ad02001-04-17 20:09:11 +0000149*/
150#include "sqliteInt.h"
151#include "pager.h"
152#include "btree.h"
153#include <assert.h>
154
paulb95a8862003-04-01 21:16:41 +0000155/* Forward declarations */
156static BtOps sqliteBtreeOps;
157static BtCursorOps sqliteBtreeCursorOps;
158
drh8c42ca92001-06-22 19:15:00 +0000159/*
drhbd03cae2001-06-02 02:40:57 +0000160** This is a magic string that appears at the beginning of every
drh8c42ca92001-06-22 19:15:00 +0000161** SQLite database in order to identify the file as a real database.
drh9e572e62004-04-23 23:43:10 +0000162** 0123456789 123456 */
163static const char zMagicHeader[] = "SQLite version 3";
drh08ed44e2001-04-29 23:32:55 +0000164
165/*
drh9e572e62004-04-23 23:43:10 +0000166** Page type flags
drh8c42ca92001-06-22 19:15:00 +0000167*/
drh9e572e62004-04-23 23:43:10 +0000168#define PTF_LEAF 0x01
169#define PTF_ZERODATA 0x02
170#define PTF_INTKEY 0x04
drh8c42ca92001-06-22 19:15:00 +0000171
172/*
drh9e572e62004-04-23 23:43:10 +0000173** As each page of the file is loaded into memory, an instance of the following
174** structure is appended and initialized to zero. This structure stores
175** information about the page that is decoded from the raw file page.
drh18b81e52001-11-01 13:52:52 +0000176** The extra two entries in apCell[] are an allowance for this situation.
drh14acc042001-06-10 19:56:58 +0000177**
drh72f82862001-05-24 21:06:34 +0000178** The pParent field points back to the parent page. This allows us to
179** walk up the BTree from any leaf to the root. Care must be taken to
180** unref() the parent page pointer when this page is no longer referenced.
drhbd03cae2001-06-02 02:40:57 +0000181** The pageDestructor() routine handles that chore.
drh7e3b0a02001-04-28 16:52:40 +0000182*/
183struct MemPage {
drh9e572e62004-04-23 23:43:10 +0000184 struct BTree *pBt; /* Pointer back to BTree structure */
185 unsigned char *aData; /* Pointer back to the start of the page */
186 u8 idxShift; /* True if Cell indices have changed */
187 u8 isOverfull; /* Some aCell[] points outside u.aDisk[] */
188 u8 intKey; /* True if intkey flag is set */
189 u8 leaf; /* True if leaf flag is set */
190 u8 zeroData; /* True if zero data flag is set */
191 u8 hdrOffset; /* 100 for page 1. 0 otherwise */
192 Pgno pgno; /* Page number for this page */
drh72f82862001-05-24 21:06:34 +0000193 MemPage *pParent; /* The parent of this page. NULL for root */
drh428ae8c2003-01-04 16:48:09 +0000194 int idxParent; /* Index in pParent->apCell[] of this node */
drh9e572e62004-04-23 23:43:10 +0000195 int nFree; /* Number of free bytes on the page */
drh306dc212001-05-21 13:45:10 +0000196 int nCell; /* Number of entries on this page */
drh9e572e62004-04-23 23:43:10 +0000197 unsigned char **aCell; /* Pointer to start of each cell */
drh8c42ca92001-06-22 19:15:00 +0000198};
drh7e3b0a02001-04-28 16:52:40 +0000199
200/*
drh3b7511c2001-05-26 13:15:44 +0000201** The in-memory image of a disk page has the auxiliary information appended
202** to the end. EXTRA_SIZE is the number of bytes of space needed to hold
203** that extra information.
204*/
drh9e572e62004-04-23 23:43:10 +0000205#define EXTRA_SIZE sizeof(Mempage)
drh3b7511c2001-05-26 13:15:44 +0000206
207/*
drha059ad02001-04-17 20:09:11 +0000208** Everything we need to know about an open database
209*/
210struct Btree {
paulb95a8862003-04-01 21:16:41 +0000211 BtOps *pOps; /* Function table */
drha059ad02001-04-17 20:09:11 +0000212 Pager *pPager; /* The page cache */
drh306dc212001-05-21 13:45:10 +0000213 BtCursor *pCursor; /* A list of all open cursors */
drh9e572e62004-04-23 23:43:10 +0000214 MemPage *page1; /* First page of the database */
drh663fc632002-02-02 18:49:19 +0000215 u8 inTrans; /* True if a transaction is in progress */
216 u8 inCkpt; /* True if there is a checkpoint on the transaction */
drh5df72a52002-06-06 23:16:05 +0000217 u8 readOnly; /* True if the underlying file is readonly */
drh9e572e62004-04-23 23:43:10 +0000218 int pageSize; /* Number of usable bytes on each page */
219 int maxLocal; /* Maximum local payload */
drha059ad02001-04-17 20:09:11 +0000220};
221typedef Btree Bt;
222
drh365d68f2001-05-11 11:02:46 +0000223/*
224** A cursor is a pointer to a particular entry in the BTree.
225** The entry is identified by its MemPage and the index in
drh5e2f8b92001-05-28 00:41:15 +0000226** MemPage.apCell[] of the entry.
drh365d68f2001-05-11 11:02:46 +0000227*/
drh72f82862001-05-24 21:06:34 +0000228struct BtCursor {
paulb95a8862003-04-01 21:16:41 +0000229 BtCursorOps *pOps; /* Function table */
drh5e2f8b92001-05-28 00:41:15 +0000230 Btree *pBt; /* The Btree to which this cursor belongs */
drh14acc042001-06-10 19:56:58 +0000231 BtCursor *pNext, *pPrev; /* Forms a linked list of all cursors */
drhf74b8d92002-09-01 23:20:45 +0000232 BtCursor *pShared; /* Loop of cursors with the same root page */
drh8b2f49b2001-06-08 00:21:52 +0000233 Pgno pgnoRoot; /* The root page of this tree */
drh5e2f8b92001-05-28 00:41:15 +0000234 MemPage *pPage; /* Page that contains the entry */
drh8c42ca92001-06-22 19:15:00 +0000235 int idx; /* Index of the entry in pPage->apCell[] */
drhecdc7532001-09-23 02:35:53 +0000236 u8 wrFlag; /* True if writable */
drh2dcc9aa2002-12-04 13:40:25 +0000237 u8 eSkip; /* Determines if next step operation is a no-op */
drh5e2f8b92001-05-28 00:41:15 +0000238 u8 iMatch; /* compare result from last sqliteBtreeMoveto() */
drh365d68f2001-05-11 11:02:46 +0000239};
drh7e3b0a02001-04-28 16:52:40 +0000240
drha059ad02001-04-17 20:09:11 +0000241/*
drh2dcc9aa2002-12-04 13:40:25 +0000242** Legal values for BtCursor.eSkip.
243*/
244#define SKIP_NONE 0 /* Always step the cursor */
245#define SKIP_NEXT 1 /* The next sqliteBtreeNext() is a no-op */
246#define SKIP_PREV 2 /* The next sqliteBtreePrevious() is a no-op */
247#define SKIP_INVALID 3 /* Calls to Next() and Previous() are invalid */
248
paulb95a8862003-04-01 21:16:41 +0000249/* Forward declarations */
drh144f9ea2003-04-16 01:28:16 +0000250static int fileBtreeCloseCursor(BtCursor *pCur);
paulb95a8862003-04-01 21:16:41 +0000251
drh2dcc9aa2002-12-04 13:40:25 +0000252/*
drh9e572e62004-04-23 23:43:10 +0000253** Read or write a two-, four-, and eight-byte integer values
drh0d316a42002-08-11 20:10:47 +0000254*/
drh9e572e62004-04-23 23:43:10 +0000255static u32 get2byte(unsigned char *p){
256 return (p[0]<<8) | p[1];
drh0d316a42002-08-11 20:10:47 +0000257}
drh9e572e62004-04-23 23:43:10 +0000258static u32 get4byte(unsigned char *p){
259 return (p[0]<<24) | (p[1]<<16) | (p[2]<<8) | p[3];
260}
261static u64 get4byte(unsigned char *p){
262 u64 v = get4byte(p);
263 return (v<<32) | get4byte(&p[4]);
264}
265static void put2byte(unsigned char *p, u32 v){
266 p[0] = v>>8;
267 p[1] = v;
268}
269static void put4byte(unsigned char *p, u32 v){
270 p[0] = v>>24;
271 p[1] = v>>16;
272 p[2] = v>>8;
273 p[3] = v;
274}
275static void put8byte(unsigned char *p, u64 v){
276 put4byte(&p[4], v>>32);
277 put4byte(p, v);
278}
279
280/*
281** Read a variable-length integer. Store the result in *pResult.
282** Return the number of bytes in the integer.
283*/
284static unsigned int getVarint(unsigned char *p, u64 *pResult){
285 u64 x = p[0] & 0x7f;
286 int n = 0;
287 while( (p[n++]&0x80)!=0 ){
288 x |= (p[n]&0x7f)<<(n*7);
289 }
290 *pResult = x;
291 return n;
292}
293
294/*
295** Write a variable length integer with value v into p[]. Return
296** the number of bytes written.
297*/
298static unsigned int putVarint(unsigned char *p, u64 v){
299 int i = 0;
300 do{
301 p[i++] = v & 0x7f;
302 v >>= 7;
303 }while( v!=0 );
304 p[i-1] |= 0x80;
305 return i;
drh0d316a42002-08-11 20:10:47 +0000306}
307
308/*
drh3b7511c2001-05-26 13:15:44 +0000309** Compute the total number of bytes that a Cell needs on the main
drh5e2f8b92001-05-28 00:41:15 +0000310** database page. The number returned includes the Cell header,
311** local payload storage, and the pointer to overflow pages (if
drh8c42ca92001-06-22 19:15:00 +0000312** applicable). Additional space allocated on overflow pages
drhbd03cae2001-06-02 02:40:57 +0000313** is NOT included in the value returned from this routine.
drh3b7511c2001-05-26 13:15:44 +0000314*/
drh9e572e62004-04-23 23:43:10 +0000315static int cellSize(MemPage *pPage, unsigned char *pCell){
316 int n, nPayload;
317 u64 nData, nKey;
318 int maxPayload;
319 if( pPage->leaf ){
320 n = 2;
drh3b7511c2001-05-26 13:15:44 +0000321 }else{
drh9e572e62004-04-23 23:43:10 +0000322 n = 6;
drh3b7511c2001-05-26 13:15:44 +0000323 }
drh9e572e62004-04-23 23:43:10 +0000324 if( pPage->zeroData ){
325 nData = 0;
326 }else{
327 n += getVarint(&pCell[n], &nData);
328 }
329 if( pPage->intKey ){
330 u64 dummy;
331 nKey = getVarint(&pCell[n], &dummy);
332 }else{
333 n += getVarint(pCell, &nKey);
334 }
335 nPayload = nKey + nData;
336 maxPayload = pPage->pBt->maxPayload;
337 if( nPayload>maxPayload ){
338 nPayload = maxPayload + 4;
339 }
340 return n + nPayload;
drh3b7511c2001-05-26 13:15:44 +0000341}
342
343/*
drh72f82862001-05-24 21:06:34 +0000344** Defragment the page given. All Cells are moved to the
345** beginning of the page and all free space is collected
346** into one big FreeBlk at the end of the page.
drh365d68f2001-05-11 11:02:46 +0000347*/
drh9e572e62004-04-23 23:43:10 +0000348static void defragmentPage(MemPage *pPage){
drh14acc042001-06-10 19:56:58 +0000349 int pc, i, n;
drh9e572e62004-04-23 23:43:10 +0000350 int start, hdr;
351 int leftover;
352 unsigned char *oldPage;
353 unsigned char newPage[SQLITE_PAGE_SIZE];
drh2af926b2001-05-15 00:39:25 +0000354
drh9e572e62004-04-23 23:43:10 +0000355 assert( sqlitepager_iswriteable(pPage->aData) );
356 assert( pPage->pBt!=0 );
357 assert( pPage->pageSize <= SQLITE_PAGE_SIZE );
358 oldPage = pPage->aData;
359 hdr = pPage->hdrOffset;
360 ptr = 3+hdr;
361 n = 6+hdr;
362 if( !pPage->leaf ){
363 n += 4;
drh2af926b2001-05-15 00:39:25 +0000364 }
drh9e572e62004-04-23 23:43:10 +0000365 start = n;
366 pc = get2byte(&oldPage[ptr]);
367 i = 0;
368 while( pc>0 ){
369 assert( n<pPage->pageSize );
370 size = cellSize(pPage, &oldPage[pc]);
371 memcpy(&newPage[n], &oldPage[pc], size);
372 put2byte(&newPage[ptr],n);
373 pPage->aCell[i] = &oldPage[n];
374 n += size;
375 ptr = pc;
376 pc = get2byte(&oldPage[pc]);
drh2aa679f2001-06-25 02:11:07 +0000377 }
drh9e572e62004-04-23 23:43:10 +0000378 leftover = pPage->pageSize - n;
379 assert( leftover>=0 );
380 assert( pPage->nFree==leftover );
381 if( leftover<4 ){
382 oldPage[hdr+5] = leftover;
383 leftover = 0;
384 n = pPage->pageSize;
385 }
386 memcpy(&oldPage[start], &newPage[start], n-start);
387 if( leftover==0 ){
388 put2byte(&oldPage[hdr+3], 0);
389 }else if( leftover>=4 ){
390 put2byte(&oldPage[hdr+3], n);
391 put2byte(&oldPage[n], 0);
392 put2byte(&oldPage[n+2], leftover);
393 memset(&oldPage[n+4], 0, leftover-4);
394 }
drh365d68f2001-05-11 11:02:46 +0000395}
396
drha059ad02001-04-17 20:09:11 +0000397/*
drh9e572e62004-04-23 23:43:10 +0000398** Allocate nByte bytes of space on a page. If nByte is less than
399** 4 it is rounded up to 4.
drhbd03cae2001-06-02 02:40:57 +0000400**
drh9e572e62004-04-23 23:43:10 +0000401** Return the index into pPage->aData[] of the first byte of
drhbd03cae2001-06-02 02:40:57 +0000402** the new allocation. Or return 0 if there is not enough free
403** space on the page to satisfy the allocation request.
drh2af926b2001-05-15 00:39:25 +0000404**
drh72f82862001-05-24 21:06:34 +0000405** If the page contains nBytes of free space but does not contain
drh8b2f49b2001-06-08 00:21:52 +0000406** nBytes of contiguous free space, then this routine automatically
407** calls defragementPage() to consolidate all free space before
408** allocating the new chunk.
drh9e572e62004-04-23 23:43:10 +0000409**
410** Algorithm: Carve a piece off of the first freeblock that is
411** nByte in size or that larger.
drh7e3b0a02001-04-28 16:52:40 +0000412*/
drh9e572e62004-04-23 23:43:10 +0000413static int allocateSpace(MemPage *pPage, int nByte){
414 int ptr, pc, hdr;
415 int size;
416 unsigned char *data;
drh44ce7e22003-06-17 02:57:17 +0000417#ifndef NDEBUG
418 int cnt = 0;
419#endif
drh72f82862001-05-24 21:06:34 +0000420
drh9e572e62004-04-23 23:43:10 +0000421 data = pPage->aData;
422 assert( sqlitepager_iswriteable(data) );
423 assert( pPage->pBt );
424 if( nByte<4 ) nByte = 4;
drh14acc042001-06-10 19:56:58 +0000425 if( pPage->nFree<nByte || pPage->isOverfull ) return 0;
drh9e572e62004-04-23 23:43:10 +0000426 hdr = pPage->hdrOffset;
427 if( data[hdr+5]>=252 ){
428 defragmentPage(pPage);
drh2af926b2001-05-15 00:39:25 +0000429 }
drh9e572e62004-04-23 23:43:10 +0000430 ptr = hdr+1;
431 pc = get2byte(&data[ptr]);
432 assert( ptr<pc );
433 assert( pc<=pPage->pageSize-4 );
434 while( (size = get2byte(&data[pc+2])<nByte ){
435 ptr = pc;
436 pc = get2byte(&data[ptr]);
437 assert( pc<=pPage->pageSize-4 );
438 assert( pc>=ptr+size+4 || pc==0 );
439 if( pc==0 ){
440 assert( (cnt++)==0 );
441 defragmentPage(pPage);
442 assert( data[hdr+5]==0 );
443 ptr = pPage->hdrOffset+1;
444 pc = get2byte(&data[ptr]);
445 }
446 }
447 assert( pc>0 && size>=nByte );
448 assert( pc+size<=pPage->pageSize );
449 if( size>nByte+4 ){
450 put2byte(&data[ptr], pc+nByte);
451 put2byte(&data[pc+size], get2byte(&data[pc]));
452 put2byte(&data[pc+size+2], size-nByte);
drh2af926b2001-05-15 00:39:25 +0000453 }else{
drh9e572e62004-04-23 23:43:10 +0000454 put2byte(&data[ptr], get2byte(&data[pc]));
455 data[hdr+5] += size-nByte;
drh2af926b2001-05-15 00:39:25 +0000456 }
457 pPage->nFree -= nByte;
drh9e572e62004-04-23 23:43:10 +0000458 assert( pPage->nFree>=0 );
459 return pc;
drh7e3b0a02001-04-28 16:52:40 +0000460}
461
462/*
drh9e572e62004-04-23 23:43:10 +0000463** Return a section of the pPage->aData to the freelist.
464** The first byte of the new free block is pPage->aDisk[start]
465** and the size of the block is "size" bytes.
drh306dc212001-05-21 13:45:10 +0000466**
467** Most of the effort here is involved in coalesing adjacent
468** free blocks into a single big free block.
drh7e3b0a02001-04-28 16:52:40 +0000469*/
drh9e572e62004-04-23 23:43:10 +0000470static void freeSpace(MemPage *pPage, int start, int size){
471 int end = start + size; /* End of the segment being freed */
472 int ptr, pbegin, pend;
473#ifndef NDEBUG
474 int tsize = 0; /* Total size of all freeblocks */
475#endif
476 unsigned char *data = pPage->aData;
drh2af926b2001-05-15 00:39:25 +0000477
drh9e572e62004-04-23 23:43:10 +0000478 assert( pPage->pBt!=0 );
479 assert( sqlitepager_iswriteable(data) );
480 assert( start>=pPage->hdrOffset+6+(pPage->leaf?0:4) );
481 assert( end<=pPage->pBt->pageSize );
482 if( size<4 ) size = 4;
483
484 /* Add the space back into the linked list of freeblocks */
485 ptr = pPage->hdrOffset + 1;
486 while( (pbegin = get2byte(&data[ptr]))<start && pbegin>0 ){
487 assert( pbegin<=pPage->pBt->pageSize-4 );
488 assert( pbegin>ptr );
489 ptr = pbegin;
drh2af926b2001-05-15 00:39:25 +0000490 }
drh9e572e62004-04-23 23:43:10 +0000491 assert( pbegin<=pPage->pBt->pageSize-4 );
492 assert( pbegin>ptr || pbegin==0 );
493 put2bytes(&data[ptr], start);
494 put2bytes(&data[start], pbegin);
495 put2bytes(&data[start+2], size);
drh2af926b2001-05-15 00:39:25 +0000496 pPage->nFree += size;
drh9e572e62004-04-23 23:43:10 +0000497
498 /* Coalesce adjacent free blocks */
499 ptr = pPage->hdrOffset + 1;
500 while( (pbegin = get2byte(&data[ptr]))>0 ){
501 int pnext, psize;
502 assert( pbegin>ptr );
503 assert( pbegin<pPage->pBt->pageSize-4 );
504 pnext = get2byte(&data[pbegin]);
505 psize = get2byte(&data[pbegin+2]);
506 if( pbegin + psize + 3 >= pnext && pnext>0 ){
507 int frag = pnext - (pbegin+psize);
508 assert( frag<=data[pPage->hdrOffset+5] );
509 data[pPage->hdrOffset+5] -= frag;
510 put2byte(&data[pbegin], get2byte(&data[pnext]));
511 put2byte(&data[pbegin+2], pnext+get2byte(&data[pnext+2])-pbegin);
512 }else{
513 assert( (tsize += psize)>0 );
514 ptr = pbegin;
515 }
516 }
517 assert( tsize+data[pPage->hdrOffset+5]==pPage->nFree );
drh7e3b0a02001-04-28 16:52:40 +0000518}
519
520/*
drh9e572e62004-04-23 23:43:10 +0000521** The following is the default comparison function for (non-integer)
522** keys in the btrees. This function returns negative, zero, or
523** positive if the first key is less than, equal to, or greater than
524** the second.
525**
526*/
527static int keyComp(
528 void *userData,
529 int nKey1, const unsigned char *aKey1,
530 int nKey2, const unsigned char *aKey2,
531){
532 KeyClass *pKeyClass = (KeyClass*)userData;
533 i1 = i2 = 0;
534 for(i1=i2=0; pKeyClass!=0; pKeyClass=pKeyClass->pNext){
535 if( varint32(aKey1, &i1, nKey1, &n1) ) goto bad_key;
536 if( varint32(aKey2, &i2, nKey2, &n2) ) goto bad_key;
537 if( n1==0 ){
538 if( n2>0 ) return -1;
539 /* both values are NULL. consider them equal for sorting purposes. */
540 }else if( n2==0 ){
541 /* right value is NULL but the left value is not. right comes first */
542 return +1;
543 }else if( n1<=5 ){
544 if( n2>5 ) return -1;
545 /* both values are numbers. sort them numerically */
546 ...
547 }else if( n2<=5 ){
548 /* right value is numeric and left is TEXT or BLOB. right comes first */
549 return +1;
550 }else if( n1<12 || n2<12 ){
551 /* bad coding for either the left or the right value */
552 goto bad_key;
553 }else if( (n1&0x01)==0 ){
554 if( n2&0x01)!=0 ) return -1;
555 /* both values are BLOB. use memcmp() */
556 n1 = (n1-12)/2;
557 n2 = (n2-12)/2;
558 if( i1+n1>nKey1 || i2+n2>nKey2 ) goto bad_key;
559 c = memcmp(&aKey1[i1], &aKey2[i2], n1<n2 ? n1 : n2);
560 if( c!=0 ){
561 return c | 1;
562 }
563 if( n1!=n2 ){
564 return (n1-n2) | 1;
565 }
566 i1 += n1;
567 i2 += n2;
568 }else if( n2&0x01)!=0 ){
569 /* right value if BLOB and left is TEXT. BLOB comes first */
570 return +1;
571 }else{
572 /* both values are TEXT. use the supplied comparison function */
573 n1 = (n1-13)/2;
574 n2 = (n2-13)/2;
575 if( i1+n1>nKey1 || i2+n2>nKey2 ) goto bad_key;
576 c = pKeyClass->xCompare(pKeyClass->pUser, n1, &aKey1[i1], n2, &aKey2[i2]);
577 if( c!=0 ){
578 return c | 1;
579 }
580 i1 += n1;
581 i2 += n2;
582 }
583 }
584 return 0;
585
586bad_key:
587 return 1;
588}
589
590
591/*
drh7e3b0a02001-04-28 16:52:40 +0000592** Initialize the auxiliary information for a disk block.
drh72f82862001-05-24 21:06:34 +0000593**
drhbd03cae2001-06-02 02:40:57 +0000594** The pParent parameter must be a pointer to the MemPage which
drh9e572e62004-04-23 23:43:10 +0000595** is the parent of the page being initialized. The root of a
596** BTree has no parent and so for that page, pParent==NULL.
drh5e2f8b92001-05-28 00:41:15 +0000597**
drh72f82862001-05-24 21:06:34 +0000598** Return SQLITE_OK on success. If we see that the page does
drhda47d772002-12-02 04:25:19 +0000599** not contain a well-formed database page, then return
drh72f82862001-05-24 21:06:34 +0000600** SQLITE_CORRUPT. Note that a return of SQLITE_OK does not
601** guarantee that the page is well-formed. It only shows that
602** we failed to detect any corruption.
drh7e3b0a02001-04-28 16:52:40 +0000603*/
drh9e572e62004-04-23 23:43:10 +0000604static int initPage(
605 Bt *pBt, /* The Btree */
606 unsigned char *data, /* Start of the data for the page */
607 Pgno pgnoThis, /* The page number */
608 MemPage *pParent /* The parent. Might be NULL */
609){
610 MemPage *pPage;
611 int c, pc, i;
612 int sumCell = 0; /* Total size of all cells */
613 unsigned char *data;
drh2af926b2001-05-15 00:39:25 +0000614
drh9e572e62004-04-23 23:43:10 +0000615 pPage = (MemPage*)&aData[pBt->pageSize];
drh5e2f8b92001-05-28 00:41:15 +0000616 if( pPage->pParent ){
617 assert( pPage->pParent==pParent );
618 return SQLITE_OK;
619 }
620 if( pParent ){
621 pPage->pParent = pParent;
drh9e572e62004-04-23 23:43:10 +0000622 sqlitepager_ref(pParent->aData);
drh5e2f8b92001-05-28 00:41:15 +0000623 }
drh9e572e62004-04-23 23:43:10 +0000624 if( pPage->pBt!=0 ) return SQLITE_OK;
625 pPage->pBt = pBt;
drh7e3b0a02001-04-28 16:52:40 +0000626 pPage->nCell = 0;
drh9e572e62004-04-23 23:43:10 +0000627 pPage->pgno = pgnoThis;
628 pPage->hdrOffset = hdr = pgnoThis==1 ? 100 : 0;
629 c = data[pPage->hdrOffset];
630 pPage->intKey = (c & PTF_INTKEY)!=0;
631 pPage->zeroData = (c & PTF_ZERODATA)!=0;
632 pPage->leaf = (c & PTF_INTKEY)!=0;
drh2af926b2001-05-15 00:39:25 +0000633
drh9e572e62004-04-23 23:43:10 +0000634 /* Initialize the cell count and cell pointers */
635 pc = get2byte(&data[hdr+3]);
636 while( pc>0 ){
637 if( pc>=pBt->pageSize ) return SQLITE_CORRUPT;
638 if( pPage->nCell>pBt->pageSize ) return SQLITE_CORRUPT;
639 pPage->nCell++;
640 pc = get2byte(&data[pc]);
641 }
642 pPage->aCell = sqlite_malloc( sizeof(pPage->aCell[0])*pPage->nCell );
643 if( pPage->aCell==0 ){
644 return SQLITE_NOMEM;
645 }
646 pc = get2byte(&data[hdr+3]);
647 for(i=0; pc>0; i++){
648 pPage->aCell[i] = &data[pc];
649 pc = get2byte(&data[pc]);
650 sumCell += cellSize(pPage, &data[pc]);
651 }
652
653 /* Compute the total free space on the page */
654 pPage->nFree = data[hdr+5];
655 pc = get2byte(&data[hdr+1]);
656 while( pc>0 ){
657 int next, size;
658 if( pc>=pBt->pageSize ) return SQLITE_CORRUPT;
659 next = get2byte(&data[pc]);
660 size = get2byte(&data[pc+2]);
661 if( next>0 && next<=pc+size+3 ) return SQLITE_CURRUPT;
662 pPage->nFree += size;
663 pc = next;
664 }
665 if( pPage->nFree>=pBt->pageSize ) return SQLITE_CORRUPT;
666
667 /* Sanity check: Cells and freespace and header must sum to the size
668 ** a page. */
669 if( sumCell+pPage->nFree+hdr+10-pPage->leaf*4 != pBt->pageSize ){
670 return CORRUPT;
671 }
672
673 return SQLITE_OK;
drh7e3b0a02001-04-28 16:52:40 +0000674}
675
676/*
drh8b2f49b2001-06-08 00:21:52 +0000677** Set up a raw page so that it looks like a database page holding
678** no entries.
drhbd03cae2001-06-02 02:40:57 +0000679*/
drh9e572e62004-04-23 23:43:10 +0000680static void zeroPage(MemPage *pPage, int flags){
681 unsigned char *data = pPage->aData;
682 Btree *pBt = pPage->pBt;
683 int hdr = pPage->pgno==1 ? 100 : 0;
684 int first;
685
686 assert( sqlitepager_iswriteable(data) );
687 memset(&data[hdr], 0, pBt->pageSize - hdr);
688 data[hdr] = flags;
689 first = hdr + 6 + 4*((flags&0x01)!=0);
690 put2byte(&data[hdr+1], first);
691 put2byte(&data[first+2], pBt->pageSize - first);
692 sqliteFree(pPage->aCell);
693 pPage->aCell = 0;
drh8c42ca92001-06-22 19:15:00 +0000694 pPage->nCell = 0;
drh9e572e62004-04-23 23:43:10 +0000695 pPage->nFree = pBt->pageSize - first;
696 pPage->intKey = (flags & PTF_INTKEY)!=0;
697 pPage->leaf = (flags & PTF_LEAF)!=0;
698 pPage->zeroData = (flags & PTF_ZERODATA)!=0;
699 pPage->hdrOffset = hdr;
drhbd03cae2001-06-02 02:40:57 +0000700}
701
702/*
drh72f82862001-05-24 21:06:34 +0000703** This routine is called when the reference count for a page
704** reaches zero. We need to unref the pParent pointer when that
705** happens.
706*/
707static void pageDestructor(void *pData){
drh9e572e62004-04-23 23:43:10 +0000708 MemPage *pPage = (MemPage*)&((char*)pData)[SQLITE_PAGE_SIZE];
drh72f82862001-05-24 21:06:34 +0000709 if( pPage->pParent ){
710 MemPage *pParent = pPage->pParent;
711 pPage->pParent = 0;
drh9e572e62004-04-23 23:43:10 +0000712 sqlitepager_unref(pParent->aData);
drh72f82862001-05-24 21:06:34 +0000713 }
drh9e572e62004-04-23 23:43:10 +0000714 sqliteFree(pPage->aCell);
715 pPage->aCell = 0;
drh72f82862001-05-24 21:06:34 +0000716}
717
718/*
drh306dc212001-05-21 13:45:10 +0000719** Open a new database.
720**
721** Actually, this routine just sets up the internal data structures
drh72f82862001-05-24 21:06:34 +0000722** for accessing the database. We do not open the database file
723** until the first page is loaded.
drh382c0242001-10-06 16:33:02 +0000724**
725** zFilename is the name of the database file. If zFilename is NULL
drh1bee3d72001-10-15 00:44:35 +0000726** a new database with a random name is created. This randomly named
727** database file will be deleted when sqliteBtreeClose() is called.
drha059ad02001-04-17 20:09:11 +0000728*/
drh6019e162001-07-02 17:51:45 +0000729int sqliteBtreeOpen(
730 const char *zFilename, /* Name of the file containing the BTree database */
drhda47d772002-12-02 04:25:19 +0000731 int omitJournal, /* if TRUE then do not journal this file */
drh6019e162001-07-02 17:51:45 +0000732 int nCache, /* How many pages in the page cache */
733 Btree **ppBtree /* Pointer to new Btree object written here */
734){
drha059ad02001-04-17 20:09:11 +0000735 Btree *pBt;
drh8c42ca92001-06-22 19:15:00 +0000736 int rc;
drha059ad02001-04-17 20:09:11 +0000737
drhd62d3d02003-01-24 12:14:20 +0000738 /*
739 ** The following asserts make sure that structures used by the btree are
740 ** the right size. This is to guard against size changes that result
741 ** when compiling on a different architecture.
742 */
drh9e572e62004-04-23 23:43:10 +0000743 assert( sizeof(u64)==8 );
drhd62d3d02003-01-24 12:14:20 +0000744 assert( sizeof(u32)==4 );
745 assert( sizeof(u16)==2 );
746 assert( sizeof(Pgno)==4 );
drhd62d3d02003-01-24 12:14:20 +0000747 assert( sizeof(ptr)==sizeof(char*) );
748 assert( sizeof(uptr)==sizeof(ptr) );
749
drha059ad02001-04-17 20:09:11 +0000750 pBt = sqliteMalloc( sizeof(*pBt) );
751 if( pBt==0 ){
drh8c42ca92001-06-22 19:15:00 +0000752 *ppBtree = 0;
drha059ad02001-04-17 20:09:11 +0000753 return SQLITE_NOMEM;
754 }
drh6019e162001-07-02 17:51:45 +0000755 if( nCache<10 ) nCache = 10;
drhda47d772002-12-02 04:25:19 +0000756 rc = sqlitepager_open(&pBt->pPager, zFilename, nCache, EXTRA_SIZE,
757 !omitJournal);
drha059ad02001-04-17 20:09:11 +0000758 if( rc!=SQLITE_OK ){
759 if( pBt->pPager ) sqlitepager_close(pBt->pPager);
760 sqliteFree(pBt);
761 *ppBtree = 0;
762 return rc;
763 }
drh72f82862001-05-24 21:06:34 +0000764 sqlitepager_set_destructor(pBt->pPager, pageDestructor);
drha059ad02001-04-17 20:09:11 +0000765 pBt->pCursor = 0;
766 pBt->page1 = 0;
drh5df72a52002-06-06 23:16:05 +0000767 pBt->readOnly = sqlitepager_isreadonly(pBt->pPager);
paulb95a8862003-04-01 21:16:41 +0000768 pBt->pOps = &sqliteBtreeOps;
drh9e572e62004-04-23 23:43:10 +0000769 pBt->pageSize = SQLITE_PAGE_SIZE;
770 pBt->maxLocal = (SQLITE_PAGE_SIZE-10)/4-12;
drha059ad02001-04-17 20:09:11 +0000771 *ppBtree = pBt;
772 return SQLITE_OK;
773}
774
775/*
776** Close an open database and invalidate all cursors.
777*/
drh144f9ea2003-04-16 01:28:16 +0000778static int fileBtreeClose(Btree *pBt){
drha059ad02001-04-17 20:09:11 +0000779 while( pBt->pCursor ){
drh144f9ea2003-04-16 01:28:16 +0000780 fileBtreeCloseCursor(pBt->pCursor);
drha059ad02001-04-17 20:09:11 +0000781 }
782 sqlitepager_close(pBt->pPager);
783 sqliteFree(pBt);
784 return SQLITE_OK;
785}
786
787/*
drhda47d772002-12-02 04:25:19 +0000788** Change the limit on the number of pages allowed in the cache.
drhcd61c282002-03-06 22:01:34 +0000789**
790** The maximum number of cache pages is set to the absolute
791** value of mxPage. If mxPage is negative, the pager will
792** operate asynchronously - it will not stop to do fsync()s
793** to insure data is written to the disk surface before
794** continuing. Transactions still work if synchronous is off,
795** and the database cannot be corrupted if this program
796** crashes. But if the operating system crashes or there is
797** an abrupt power failure when synchronous is off, the database
798** could be left in an inconsistent and unrecoverable state.
799** Synchronous is on by default so database corruption is not
800** normally a worry.
drhf57b14a2001-09-14 18:54:08 +0000801*/
drh144f9ea2003-04-16 01:28:16 +0000802static int fileBtreeSetCacheSize(Btree *pBt, int mxPage){
drhf57b14a2001-09-14 18:54:08 +0000803 sqlitepager_set_cachesize(pBt->pPager, mxPage);
804 return SQLITE_OK;
805}
806
807/*
drh973b6e32003-02-12 14:09:42 +0000808** Change the way data is synced to disk in order to increase or decrease
809** how well the database resists damage due to OS crashes and power
810** failures. Level 1 is the same as asynchronous (no syncs() occur and
811** there is a high probability of damage) Level 2 is the default. There
812** is a very low but non-zero probability of damage. Level 3 reduces the
813** probability of damage to near zero but with a write performance reduction.
814*/
drh144f9ea2003-04-16 01:28:16 +0000815static int fileBtreeSetSafetyLevel(Btree *pBt, int level){
drh973b6e32003-02-12 14:09:42 +0000816 sqlitepager_set_safety_level(pBt->pPager, level);
817 return SQLITE_OK;
818}
819
820/*
drh306dc212001-05-21 13:45:10 +0000821** Get a reference to page1 of the database file. This will
822** also acquire a readlock on that file.
823**
824** SQLITE_OK is returned on success. If the file is not a
825** well-formed database file, then SQLITE_CORRUPT is returned.
826** SQLITE_BUSY is returned if the database is locked. SQLITE_NOMEM
827** is returned if we run out of memory. SQLITE_PROTOCOL is returned
828** if there is a locking protocol violation.
829*/
830static int lockBtree(Btree *pBt){
831 int rc;
drh9e572e62004-04-23 23:43:10 +0000832 unsigned char *data;
drh306dc212001-05-21 13:45:10 +0000833 if( pBt->page1 ) return SQLITE_OK;
drh9e572e62004-04-23 23:43:10 +0000834 rc = sqlitepager_get(pBt->pPager, 1, (void**)&data);
drh306dc212001-05-21 13:45:10 +0000835 if( rc!=SQLITE_OK ) return rc;
drh306dc212001-05-21 13:45:10 +0000836
837 /* Do some checking to help insure the file we opened really is
838 ** a valid database file.
839 */
840 if( sqlitepager_pagecount(pBt->pPager)>0 ){
drh9e572e62004-04-23 23:43:10 +0000841 if( memcmp(data, zMagicHeader, 16)!=0 ){
drhc602f9a2004-02-12 19:01:04 +0000842 rc = SQLITE_NOTADB;
drh72f82862001-05-24 21:06:34 +0000843 goto page1_init_failed;
drh306dc212001-05-21 13:45:10 +0000844 }
drh9e572e62004-04-23 23:43:10 +0000845 /*** TBD: Other header checks such as page size ****/
drh306dc212001-05-21 13:45:10 +0000846 }
drh9e572e62004-04-23 23:43:10 +0000847 pBt->page1 = (MemPage*)&data[SQLITE_PAGE_SIZE];
drh306dc212001-05-21 13:45:10 +0000848 return rc;
849
drh72f82862001-05-24 21:06:34 +0000850page1_init_failed:
drh9e572e62004-04-23 23:43:10 +0000851 sqlitepager_unref(pBt->data);
drh306dc212001-05-21 13:45:10 +0000852 pBt->page1 = 0;
drh72f82862001-05-24 21:06:34 +0000853 return rc;
drh306dc212001-05-21 13:45:10 +0000854}
855
856/*
drhb8ca3072001-12-05 00:21:20 +0000857** If there are no outstanding cursors and we are not in the middle
858** of a transaction but there is a read lock on the database, then
859** this routine unrefs the first page of the database file which
860** has the effect of releasing the read lock.
861**
862** If there are any outstanding cursors, this routine is a no-op.
863**
864** If there is a transaction in progress, this routine is a no-op.
865*/
866static void unlockBtreeIfUnused(Btree *pBt){
867 if( pBt->inTrans==0 && pBt->pCursor==0 && pBt->page1!=0 ){
drh9e572e62004-04-23 23:43:10 +0000868 sqlitepager_unref(pBt->page1->aData);
drhb8ca3072001-12-05 00:21:20 +0000869 pBt->page1 = 0;
870 pBt->inTrans = 0;
drh663fc632002-02-02 18:49:19 +0000871 pBt->inCkpt = 0;
drhb8ca3072001-12-05 00:21:20 +0000872 }
873}
874
875/*
drh9e572e62004-04-23 23:43:10 +0000876** Create a new database by initializing the first page of the
drh8c42ca92001-06-22 19:15:00 +0000877** file.
drh8b2f49b2001-06-08 00:21:52 +0000878*/
879static int newDatabase(Btree *pBt){
drh9e572e62004-04-23 23:43:10 +0000880 MemPage *pP1;
881 unsigned char *data;
drh8c42ca92001-06-22 19:15:00 +0000882 int rc;
drh7c717f72001-06-24 20:39:41 +0000883 if( sqlitepager_pagecount(pBt->pPager)>1 ) return SQLITE_OK;
drh8b2f49b2001-06-08 00:21:52 +0000884 pP1 = pBt->page1;
drh9e572e62004-04-23 23:43:10 +0000885 assert( pP1!=0 );
886 data = pP1->aData;
887 rc = sqlitepager_write(data);
drh8b2f49b2001-06-08 00:21:52 +0000888 if( rc ) return rc;
drh9e572e62004-04-23 23:43:10 +0000889 memcpy(data, zMagicHeader, sizeof(zMagicHeader));
890 assert( sizeof(zMagicHeader)==16 );
891 put2byte(&data[16], SQLITE_PAGE_SIZE);
892 data[18] = 1;
893 data[19] = 1;
894 put2byte(&data[22], (SQLITE_PAGE_SIZE-10)/4-12);
895 zeroPage(pP1, PTF_INTKEY|PTF_LEAF);
drh8b2f49b2001-06-08 00:21:52 +0000896 return SQLITE_OK;
897}
898
899/*
drh72f82862001-05-24 21:06:34 +0000900** Attempt to start a new transaction.
drh8b2f49b2001-06-08 00:21:52 +0000901**
902** A transaction must be started before attempting any changes
903** to the database. None of the following routines will work
904** unless a transaction is started first:
905**
906** sqliteBtreeCreateTable()
drhc6b52df2002-01-04 03:09:29 +0000907** sqliteBtreeCreateIndex()
drh8b2f49b2001-06-08 00:21:52 +0000908** sqliteBtreeClearTable()
909** sqliteBtreeDropTable()
910** sqliteBtreeInsert()
911** sqliteBtreeDelete()
912** sqliteBtreeUpdateMeta()
drha059ad02001-04-17 20:09:11 +0000913*/
drh144f9ea2003-04-16 01:28:16 +0000914static int fileBtreeBeginTrans(Btree *pBt){
drha059ad02001-04-17 20:09:11 +0000915 int rc;
916 if( pBt->inTrans ) return SQLITE_ERROR;
drhf74b8d92002-09-01 23:20:45 +0000917 if( pBt->readOnly ) return SQLITE_READONLY;
drha059ad02001-04-17 20:09:11 +0000918 if( pBt->page1==0 ){
drh7e3b0a02001-04-28 16:52:40 +0000919 rc = lockBtree(pBt);
drh8c42ca92001-06-22 19:15:00 +0000920 if( rc!=SQLITE_OK ){
921 return rc;
922 }
drha059ad02001-04-17 20:09:11 +0000923 }
drhf74b8d92002-09-01 23:20:45 +0000924 rc = sqlitepager_begin(pBt->page1);
925 if( rc==SQLITE_OK ){
926 rc = newDatabase(pBt);
drha059ad02001-04-17 20:09:11 +0000927 }
drhb8ca3072001-12-05 00:21:20 +0000928 if( rc==SQLITE_OK ){
929 pBt->inTrans = 1;
drh663fc632002-02-02 18:49:19 +0000930 pBt->inCkpt = 0;
drhb8ca3072001-12-05 00:21:20 +0000931 }else{
932 unlockBtreeIfUnused(pBt);
drha059ad02001-04-17 20:09:11 +0000933 }
drhb8ca3072001-12-05 00:21:20 +0000934 return rc;
drha059ad02001-04-17 20:09:11 +0000935}
936
937/*
drh2aa679f2001-06-25 02:11:07 +0000938** Commit the transaction currently in progress.
drh5e00f6c2001-09-13 13:46:56 +0000939**
940** This will release the write lock on the database file. If there
941** are no active cursors, it also releases the read lock.
drha059ad02001-04-17 20:09:11 +0000942*/
drh144f9ea2003-04-16 01:28:16 +0000943static int fileBtreeCommit(Btree *pBt){
drha059ad02001-04-17 20:09:11 +0000944 int rc;
drh5df72a52002-06-06 23:16:05 +0000945 rc = pBt->readOnly ? SQLITE_OK : sqlitepager_commit(pBt->pPager);
drh7c717f72001-06-24 20:39:41 +0000946 pBt->inTrans = 0;
drh663fc632002-02-02 18:49:19 +0000947 pBt->inCkpt = 0;
drh5e00f6c2001-09-13 13:46:56 +0000948 unlockBtreeIfUnused(pBt);
drha059ad02001-04-17 20:09:11 +0000949 return rc;
950}
951
952/*
drhecdc7532001-09-23 02:35:53 +0000953** Rollback the transaction in progress. All cursors will be
954** invalided by this operation. Any attempt to use a cursor
955** that was open at the beginning of this operation will result
956** in an error.
drh5e00f6c2001-09-13 13:46:56 +0000957**
958** This will release the write lock on the database file. If there
959** are no active cursors, it also releases the read lock.
drha059ad02001-04-17 20:09:11 +0000960*/
drh144f9ea2003-04-16 01:28:16 +0000961static int fileBtreeRollback(Btree *pBt){
drha059ad02001-04-17 20:09:11 +0000962 int rc;
drhecdc7532001-09-23 02:35:53 +0000963 BtCursor *pCur;
drh7c717f72001-06-24 20:39:41 +0000964 if( pBt->inTrans==0 ) return SQLITE_OK;
965 pBt->inTrans = 0;
drh663fc632002-02-02 18:49:19 +0000966 pBt->inCkpt = 0;
drh3a840692003-01-29 22:58:26 +0000967 rc = pBt->readOnly ? SQLITE_OK : sqlitepager_rollback(pBt->pPager);
drhecdc7532001-09-23 02:35:53 +0000968 for(pCur=pBt->pCursor; pCur; pCur=pCur->pNext){
drh9e572e62004-04-23 23:43:10 +0000969 if( pCur->pPage && pCur->pPage->pBt==0 ){
drhecdc7532001-09-23 02:35:53 +0000970 sqlitepager_unref(pCur->pPage);
971 pCur->pPage = 0;
972 }
973 }
drh5e00f6c2001-09-13 13:46:56 +0000974 unlockBtreeIfUnused(pBt);
drha059ad02001-04-17 20:09:11 +0000975 return rc;
976}
977
978/*
drh663fc632002-02-02 18:49:19 +0000979** Set the checkpoint for the current transaction. The checkpoint serves
980** as a sub-transaction that can be rolled back independently of the
981** main transaction. You must start a transaction before starting a
982** checkpoint. The checkpoint is ended automatically if the transaction
983** commits or rolls back.
984**
985** Only one checkpoint may be active at a time. It is an error to try
986** to start a new checkpoint if another checkpoint is already active.
987*/
drh144f9ea2003-04-16 01:28:16 +0000988static int fileBtreeBeginCkpt(Btree *pBt){
drh663fc632002-02-02 18:49:19 +0000989 int rc;
drh0d65dc02002-02-03 00:56:09 +0000990 if( !pBt->inTrans || pBt->inCkpt ){
drhf74b8d92002-09-01 23:20:45 +0000991 return pBt->readOnly ? SQLITE_READONLY : SQLITE_ERROR;
drh0d65dc02002-02-03 00:56:09 +0000992 }
drh5df72a52002-06-06 23:16:05 +0000993 rc = pBt->readOnly ? SQLITE_OK : sqlitepager_ckpt_begin(pBt->pPager);
drh663fc632002-02-02 18:49:19 +0000994 pBt->inCkpt = 1;
995 return rc;
996}
997
998
999/*
1000** Commit a checkpoint to transaction currently in progress. If no
1001** checkpoint is active, this is a no-op.
1002*/
drh144f9ea2003-04-16 01:28:16 +00001003static int fileBtreeCommitCkpt(Btree *pBt){
drh663fc632002-02-02 18:49:19 +00001004 int rc;
drh5df72a52002-06-06 23:16:05 +00001005 if( pBt->inCkpt && !pBt->readOnly ){
drh663fc632002-02-02 18:49:19 +00001006 rc = sqlitepager_ckpt_commit(pBt->pPager);
1007 }else{
1008 rc = SQLITE_OK;
1009 }
drh0d65dc02002-02-03 00:56:09 +00001010 pBt->inCkpt = 0;
drh663fc632002-02-02 18:49:19 +00001011 return rc;
1012}
1013
1014/*
1015** Rollback the checkpoint to the current transaction. If there
1016** is no active checkpoint or transaction, this routine is a no-op.
1017**
1018** All cursors will be invalided by this operation. Any attempt
1019** to use a cursor that was open at the beginning of this operation
1020** will result in an error.
1021*/
drh144f9ea2003-04-16 01:28:16 +00001022static int fileBtreeRollbackCkpt(Btree *pBt){
drh663fc632002-02-02 18:49:19 +00001023 int rc;
1024 BtCursor *pCur;
drh5df72a52002-06-06 23:16:05 +00001025 if( pBt->inCkpt==0 || pBt->readOnly ) return SQLITE_OK;
drh3a840692003-01-29 22:58:26 +00001026 rc = sqlitepager_ckpt_rollback(pBt->pPager);
drh663fc632002-02-02 18:49:19 +00001027 for(pCur=pBt->pCursor; pCur; pCur=pCur->pNext){
drh3a840692003-01-29 22:58:26 +00001028 if( pCur->pPage && pCur->pPage->isInit==0 ){
drh663fc632002-02-02 18:49:19 +00001029 sqlitepager_unref(pCur->pPage);
1030 pCur->pPage = 0;
1031 }
1032 }
drh0d65dc02002-02-03 00:56:09 +00001033 pBt->inCkpt = 0;
drh663fc632002-02-02 18:49:19 +00001034 return rc;
1035}
1036
1037/*
drh8b2f49b2001-06-08 00:21:52 +00001038** Create a new cursor for the BTree whose root is on the page
1039** iTable. The act of acquiring a cursor gets a read lock on
1040** the database file.
drh1bee3d72001-10-15 00:44:35 +00001041**
1042** If wrFlag==0, then the cursor can only be used for reading.
drhf74b8d92002-09-01 23:20:45 +00001043** If wrFlag==1, then the cursor can be used for reading or for
1044** writing if other conditions for writing are also met. These
1045** are the conditions that must be met in order for writing to
1046** be allowed:
drh6446c4d2001-12-15 14:22:18 +00001047**
drhf74b8d92002-09-01 23:20:45 +00001048** 1: The cursor must have been opened with wrFlag==1
1049**
1050** 2: No other cursors may be open with wrFlag==0 on the same table
1051**
1052** 3: The database must be writable (not on read-only media)
1053**
1054** 4: There must be an active transaction.
1055**
1056** Condition 2 warrants further discussion. If any cursor is opened
1057** on a table with wrFlag==0, that prevents all other cursors from
1058** writing to that table. This is a kind of "read-lock". When a cursor
1059** is opened with wrFlag==0 it is guaranteed that the table will not
1060** change as long as the cursor is open. This allows the cursor to
1061** do a sequential scan of the table without having to worry about
1062** entries being inserted or deleted during the scan. Cursors should
1063** be opened with wrFlag==0 only if this read-lock property is needed.
1064** That is to say, cursors should be opened with wrFlag==0 only if they
1065** intend to use the sqliteBtreeNext() system call. All other cursors
1066** should be opened with wrFlag==1 even if they never really intend
1067** to write.
1068**
drh6446c4d2001-12-15 14:22:18 +00001069** No checking is done to make sure that page iTable really is the
1070** root page of a b-tree. If it is not, then the cursor acquired
1071** will not work correctly.
drha059ad02001-04-17 20:09:11 +00001072*/
drha0c9a112004-03-10 13:42:37 +00001073static
1074int fileBtreeCursor(Btree *pBt, int iTable, int wrFlag, BtCursor **ppCur){
drha059ad02001-04-17 20:09:11 +00001075 int rc;
drhf74b8d92002-09-01 23:20:45 +00001076 BtCursor *pCur, *pRing;
drhecdc7532001-09-23 02:35:53 +00001077
drha0c9a112004-03-10 13:42:37 +00001078 if( pBt->readOnly && wrFlag ){
1079 *ppCur = 0;
1080 return SQLITE_READONLY;
1081 }
drha059ad02001-04-17 20:09:11 +00001082 if( pBt->page1==0 ){
1083 rc = lockBtree(pBt);
1084 if( rc!=SQLITE_OK ){
1085 *ppCur = 0;
1086 return rc;
1087 }
1088 }
1089 pCur = sqliteMalloc( sizeof(*pCur) );
1090 if( pCur==0 ){
drhbd03cae2001-06-02 02:40:57 +00001091 rc = SQLITE_NOMEM;
1092 goto create_cursor_exception;
1093 }
drh8b2f49b2001-06-08 00:21:52 +00001094 pCur->pgnoRoot = (Pgno)iTable;
drh8c42ca92001-06-22 19:15:00 +00001095 rc = sqlitepager_get(pBt->pPager, pCur->pgnoRoot, (void**)&pCur->pPage);
drhbd03cae2001-06-02 02:40:57 +00001096 if( rc!=SQLITE_OK ){
1097 goto create_cursor_exception;
1098 }
drh0d316a42002-08-11 20:10:47 +00001099 rc = initPage(pBt, pCur->pPage, pCur->pgnoRoot, 0);
drhbd03cae2001-06-02 02:40:57 +00001100 if( rc!=SQLITE_OK ){
1101 goto create_cursor_exception;
drha059ad02001-04-17 20:09:11 +00001102 }
paulb95a8862003-04-01 21:16:41 +00001103 pCur->pOps = &sqliteBtreeCursorOps;
drh14acc042001-06-10 19:56:58 +00001104 pCur->pBt = pBt;
drhecdc7532001-09-23 02:35:53 +00001105 pCur->wrFlag = wrFlag;
drh14acc042001-06-10 19:56:58 +00001106 pCur->idx = 0;
drh2dcc9aa2002-12-04 13:40:25 +00001107 pCur->eSkip = SKIP_INVALID;
drha059ad02001-04-17 20:09:11 +00001108 pCur->pNext = pBt->pCursor;
1109 if( pCur->pNext ){
1110 pCur->pNext->pPrev = pCur;
1111 }
drh14acc042001-06-10 19:56:58 +00001112 pCur->pPrev = 0;
drhf74b8d92002-09-01 23:20:45 +00001113 pRing = pBt->pCursor;
1114 while( pRing && pRing->pgnoRoot!=pCur->pgnoRoot ){ pRing = pRing->pNext; }
1115 if( pRing ){
1116 pCur->pShared = pRing->pShared;
1117 pRing->pShared = pCur;
1118 }else{
1119 pCur->pShared = pCur;
1120 }
drha059ad02001-04-17 20:09:11 +00001121 pBt->pCursor = pCur;
drh2af926b2001-05-15 00:39:25 +00001122 *ppCur = pCur;
1123 return SQLITE_OK;
drhbd03cae2001-06-02 02:40:57 +00001124
1125create_cursor_exception:
1126 *ppCur = 0;
1127 if( pCur ){
1128 if( pCur->pPage ) sqlitepager_unref(pCur->pPage);
1129 sqliteFree(pCur);
1130 }
drh5e00f6c2001-09-13 13:46:56 +00001131 unlockBtreeIfUnused(pBt);
drhbd03cae2001-06-02 02:40:57 +00001132 return rc;
drha059ad02001-04-17 20:09:11 +00001133}
1134
1135/*
drh5e00f6c2001-09-13 13:46:56 +00001136** Close a cursor. The read lock on the database file is released
drhbd03cae2001-06-02 02:40:57 +00001137** when the last cursor is closed.
drha059ad02001-04-17 20:09:11 +00001138*/
drh144f9ea2003-04-16 01:28:16 +00001139static int fileBtreeCloseCursor(BtCursor *pCur){
drha059ad02001-04-17 20:09:11 +00001140 Btree *pBt = pCur->pBt;
drha059ad02001-04-17 20:09:11 +00001141 if( pCur->pPrev ){
1142 pCur->pPrev->pNext = pCur->pNext;
1143 }else{
1144 pBt->pCursor = pCur->pNext;
1145 }
1146 if( pCur->pNext ){
1147 pCur->pNext->pPrev = pCur->pPrev;
1148 }
drhecdc7532001-09-23 02:35:53 +00001149 if( pCur->pPage ){
1150 sqlitepager_unref(pCur->pPage);
1151 }
drhf74b8d92002-09-01 23:20:45 +00001152 if( pCur->pShared!=pCur ){
1153 BtCursor *pRing = pCur->pShared;
1154 while( pRing->pShared!=pCur ){ pRing = pRing->pShared; }
1155 pRing->pShared = pCur->pShared;
1156 }
drh5e00f6c2001-09-13 13:46:56 +00001157 unlockBtreeIfUnused(pBt);
drha059ad02001-04-17 20:09:11 +00001158 sqliteFree(pCur);
drh8c42ca92001-06-22 19:15:00 +00001159 return SQLITE_OK;
drha059ad02001-04-17 20:09:11 +00001160}
1161
drh7e3b0a02001-04-28 16:52:40 +00001162/*
drh5e2f8b92001-05-28 00:41:15 +00001163** Make a temporary cursor by filling in the fields of pTempCur.
1164** The temporary cursor is not on the cursor list for the Btree.
1165*/
drh14acc042001-06-10 19:56:58 +00001166static void getTempCursor(BtCursor *pCur, BtCursor *pTempCur){
drh5e2f8b92001-05-28 00:41:15 +00001167 memcpy(pTempCur, pCur, sizeof(*pCur));
1168 pTempCur->pNext = 0;
1169 pTempCur->pPrev = 0;
drhecdc7532001-09-23 02:35:53 +00001170 if( pTempCur->pPage ){
1171 sqlitepager_ref(pTempCur->pPage);
1172 }
drh5e2f8b92001-05-28 00:41:15 +00001173}
1174
1175/*
drhbd03cae2001-06-02 02:40:57 +00001176** Delete a temporary cursor such as was made by the CreateTemporaryCursor()
drh5e2f8b92001-05-28 00:41:15 +00001177** function above.
1178*/
drh14acc042001-06-10 19:56:58 +00001179static void releaseTempCursor(BtCursor *pCur){
drhecdc7532001-09-23 02:35:53 +00001180 if( pCur->pPage ){
1181 sqlitepager_unref(pCur->pPage);
1182 }
drh5e2f8b92001-05-28 00:41:15 +00001183}
1184
1185/*
drhbd03cae2001-06-02 02:40:57 +00001186** Set *pSize to the number of bytes of key in the entry the
1187** cursor currently points to. Always return SQLITE_OK.
1188** Failure is not possible. If the cursor is not currently
1189** pointing to an entry (which can happen, for example, if
1190** the database is empty) then *pSize is set to 0.
drh7e3b0a02001-04-28 16:52:40 +00001191*/
drh144f9ea2003-04-16 01:28:16 +00001192static int fileBtreeKeySize(BtCursor *pCur, int *pSize){
drh2af926b2001-05-15 00:39:25 +00001193 Cell *pCell;
1194 MemPage *pPage;
1195
1196 pPage = pCur->pPage;
drh8c1238a2003-01-02 14:43:55 +00001197 assert( pPage!=0 );
1198 if( pCur->idx >= pPage->nCell ){
drh72f82862001-05-24 21:06:34 +00001199 *pSize = 0;
1200 }else{
drh5e2f8b92001-05-28 00:41:15 +00001201 pCell = pPage->apCell[pCur->idx];
drh0d316a42002-08-11 20:10:47 +00001202 *pSize = NKEY(pCur->pBt, pCell->h);
drh72f82862001-05-24 21:06:34 +00001203 }
1204 return SQLITE_OK;
drha059ad02001-04-17 20:09:11 +00001205}
drh2af926b2001-05-15 00:39:25 +00001206
drh72f82862001-05-24 21:06:34 +00001207/*
1208** Read payload information from the entry that the pCur cursor is
1209** pointing to. Begin reading the payload at "offset" and read
1210** a total of "amt" bytes. Put the result in zBuf.
1211**
1212** This routine does not make a distinction between key and data.
1213** It just reads bytes from the payload area.
1214*/
drh2af926b2001-05-15 00:39:25 +00001215static int getPayload(BtCursor *pCur, int offset, int amt, char *zBuf){
drh5e2f8b92001-05-28 00:41:15 +00001216 char *aPayload;
drh2af926b2001-05-15 00:39:25 +00001217 Pgno nextPage;
drh8c42ca92001-06-22 19:15:00 +00001218 int rc;
drh0d316a42002-08-11 20:10:47 +00001219 Btree *pBt = pCur->pBt;
drh72f82862001-05-24 21:06:34 +00001220 assert( pCur!=0 && pCur->pPage!=0 );
drh8c42ca92001-06-22 19:15:00 +00001221 assert( pCur->idx>=0 && pCur->idx<pCur->pPage->nCell );
1222 aPayload = pCur->pPage->apCell[pCur->idx]->aPayload;
drh2af926b2001-05-15 00:39:25 +00001223 if( offset<MX_LOCAL_PAYLOAD ){
1224 int a = amt;
1225 if( a+offset>MX_LOCAL_PAYLOAD ){
1226 a = MX_LOCAL_PAYLOAD - offset;
1227 }
drh5e2f8b92001-05-28 00:41:15 +00001228 memcpy(zBuf, &aPayload[offset], a);
drh2af926b2001-05-15 00:39:25 +00001229 if( a==amt ){
1230 return SQLITE_OK;
1231 }
drh2aa679f2001-06-25 02:11:07 +00001232 offset = 0;
drh2af926b2001-05-15 00:39:25 +00001233 zBuf += a;
1234 amt -= a;
drhdd793422001-06-28 01:54:48 +00001235 }else{
1236 offset -= MX_LOCAL_PAYLOAD;
drhbd03cae2001-06-02 02:40:57 +00001237 }
1238 if( amt>0 ){
drh0d316a42002-08-11 20:10:47 +00001239 nextPage = SWAB32(pBt, pCur->pPage->apCell[pCur->idx]->ovfl);
drh2af926b2001-05-15 00:39:25 +00001240 }
1241 while( amt>0 && nextPage ){
1242 OverflowPage *pOvfl;
drh0d316a42002-08-11 20:10:47 +00001243 rc = sqlitepager_get(pBt->pPager, nextPage, (void**)&pOvfl);
drh2af926b2001-05-15 00:39:25 +00001244 if( rc!=0 ){
1245 return rc;
1246 }
drh0d316a42002-08-11 20:10:47 +00001247 nextPage = SWAB32(pBt, pOvfl->iNext);
drh2af926b2001-05-15 00:39:25 +00001248 if( offset<OVERFLOW_SIZE ){
1249 int a = amt;
1250 if( a + offset > OVERFLOW_SIZE ){
1251 a = OVERFLOW_SIZE - offset;
1252 }
drh5e2f8b92001-05-28 00:41:15 +00001253 memcpy(zBuf, &pOvfl->aPayload[offset], a);
drh2aa679f2001-06-25 02:11:07 +00001254 offset = 0;
drh2af926b2001-05-15 00:39:25 +00001255 amt -= a;
1256 zBuf += a;
drh2aa679f2001-06-25 02:11:07 +00001257 }else{
1258 offset -= OVERFLOW_SIZE;
drh2af926b2001-05-15 00:39:25 +00001259 }
1260 sqlitepager_unref(pOvfl);
1261 }
drha7fcb052001-12-14 15:09:55 +00001262 if( amt>0 ){
1263 return SQLITE_CORRUPT;
1264 }
1265 return SQLITE_OK;
drh2af926b2001-05-15 00:39:25 +00001266}
1267
drh72f82862001-05-24 21:06:34 +00001268/*
drh5e00f6c2001-09-13 13:46:56 +00001269** Read part of the key associated with cursor pCur. A maximum
drh72f82862001-05-24 21:06:34 +00001270** of "amt" bytes will be transfered into zBuf[]. The transfer
drh5e00f6c2001-09-13 13:46:56 +00001271** begins at "offset". The number of bytes actually read is
drh8c1238a2003-01-02 14:43:55 +00001272** returned.
1273**
1274** Change: It used to be that the amount returned will be smaller
1275** than the amount requested if there are not enough bytes in the key
1276** to satisfy the request. But now, it must be the case that there
1277** is enough data available to satisfy the request. If not, an exception
1278** is raised. The change was made in an effort to boost performance
1279** by eliminating unneeded tests.
drh72f82862001-05-24 21:06:34 +00001280*/
drh144f9ea2003-04-16 01:28:16 +00001281static int fileBtreeKey(BtCursor *pCur, int offset, int amt, char *zBuf){
drh72f82862001-05-24 21:06:34 +00001282 MemPage *pPage;
drha059ad02001-04-17 20:09:11 +00001283
drh8c1238a2003-01-02 14:43:55 +00001284 assert( amt>=0 );
1285 assert( offset>=0 );
1286 assert( pCur->pPage!=0 );
drh72f82862001-05-24 21:06:34 +00001287 pPage = pCur->pPage;
drh72f82862001-05-24 21:06:34 +00001288 if( pCur->idx >= pPage->nCell ){
drh5e00f6c2001-09-13 13:46:56 +00001289 return 0;
drh72f82862001-05-24 21:06:34 +00001290 }
drh8c1238a2003-01-02 14:43:55 +00001291 assert( amt+offset <= NKEY(pCur->pBt, pPage->apCell[pCur->idx]->h) );
drh5e00f6c2001-09-13 13:46:56 +00001292 getPayload(pCur, offset, amt, zBuf);
1293 return amt;
drh72f82862001-05-24 21:06:34 +00001294}
1295
1296/*
drhbd03cae2001-06-02 02:40:57 +00001297** Set *pSize to the number of bytes of data in the entry the
1298** cursor currently points to. Always return SQLITE_OK.
1299** Failure is not possible. If the cursor is not currently
1300** pointing to an entry (which can happen, for example, if
1301** the database is empty) then *pSize is set to 0.
drh72f82862001-05-24 21:06:34 +00001302*/
drh144f9ea2003-04-16 01:28:16 +00001303static int fileBtreeDataSize(BtCursor *pCur, int *pSize){
drh72f82862001-05-24 21:06:34 +00001304 Cell *pCell;
1305 MemPage *pPage;
1306
1307 pPage = pCur->pPage;
drh8c1238a2003-01-02 14:43:55 +00001308 assert( pPage!=0 );
1309 if( pCur->idx >= pPage->nCell ){
drh72f82862001-05-24 21:06:34 +00001310 *pSize = 0;
1311 }else{
drh5e2f8b92001-05-28 00:41:15 +00001312 pCell = pPage->apCell[pCur->idx];
drh0d316a42002-08-11 20:10:47 +00001313 *pSize = NDATA(pCur->pBt, pCell->h);
drh72f82862001-05-24 21:06:34 +00001314 }
1315 return SQLITE_OK;
1316}
1317
1318/*
drh5e00f6c2001-09-13 13:46:56 +00001319** Read part of the data associated with cursor pCur. A maximum
drh72f82862001-05-24 21:06:34 +00001320** of "amt" bytes will be transfered into zBuf[]. The transfer
drh5e00f6c2001-09-13 13:46:56 +00001321** begins at "offset". The number of bytes actually read is
1322** returned. The amount returned will be smaller than the
1323** amount requested if there are not enough bytes in the data
1324** to satisfy the request.
drh72f82862001-05-24 21:06:34 +00001325*/
drh144f9ea2003-04-16 01:28:16 +00001326static int fileBtreeData(BtCursor *pCur, int offset, int amt, char *zBuf){
drh72f82862001-05-24 21:06:34 +00001327 Cell *pCell;
1328 MemPage *pPage;
1329
drh8c1238a2003-01-02 14:43:55 +00001330 assert( amt>=0 );
1331 assert( offset>=0 );
1332 assert( pCur->pPage!=0 );
drh72f82862001-05-24 21:06:34 +00001333 pPage = pCur->pPage;
drh8c1238a2003-01-02 14:43:55 +00001334 if( pCur->idx >= pPage->nCell ){
drh5e00f6c2001-09-13 13:46:56 +00001335 return 0;
drh72f82862001-05-24 21:06:34 +00001336 }
drh5e2f8b92001-05-28 00:41:15 +00001337 pCell = pPage->apCell[pCur->idx];
drh8c1238a2003-01-02 14:43:55 +00001338 assert( amt+offset <= NDATA(pCur->pBt, pCell->h) );
drh0d316a42002-08-11 20:10:47 +00001339 getPayload(pCur, offset + NKEY(pCur->pBt, pCell->h), amt, zBuf);
drh5e00f6c2001-09-13 13:46:56 +00001340 return amt;
drh72f82862001-05-24 21:06:34 +00001341}
drha059ad02001-04-17 20:09:11 +00001342
drh2af926b2001-05-15 00:39:25 +00001343/*
drh8721ce42001-11-07 14:22:00 +00001344** Compare an external key against the key on the entry that pCur points to.
1345**
1346** The external key is pKey and is nKey bytes long. The last nIgnore bytes
1347** of the key associated with pCur are ignored, as if they do not exist.
1348** (The normal case is for nIgnore to be zero in which case the entire
1349** internal key is used in the comparison.)
1350**
1351** The comparison result is written to *pRes as follows:
drh2af926b2001-05-15 00:39:25 +00001352**
drh717e6402001-09-27 03:22:32 +00001353** *pRes<0 This means pCur<pKey
1354**
1355** *pRes==0 This means pCur==pKey for all nKey bytes
1356**
1357** *pRes>0 This means pCur>pKey
1358**
drh8721ce42001-11-07 14:22:00 +00001359** When one key is an exact prefix of the other, the shorter key is
1360** considered less than the longer one. In order to be equal the
1361** keys must be exactly the same length. (The length of the pCur key
1362** is the actual key length minus nIgnore bytes.)
drh2af926b2001-05-15 00:39:25 +00001363*/
drh144f9ea2003-04-16 01:28:16 +00001364static int fileBtreeKeyCompare(
drh8721ce42001-11-07 14:22:00 +00001365 BtCursor *pCur, /* Pointer to entry to compare against */
1366 const void *pKey, /* Key to compare against entry that pCur points to */
1367 int nKey, /* Number of bytes in pKey */
1368 int nIgnore, /* Ignore this many bytes at the end of pCur */
1369 int *pResult /* Write the result here */
drh5c4d9702001-08-20 00:33:58 +00001370){
drh2af926b2001-05-15 00:39:25 +00001371 Pgno nextPage;
drh8721ce42001-11-07 14:22:00 +00001372 int n, c, rc, nLocal;
drh2af926b2001-05-15 00:39:25 +00001373 Cell *pCell;
drh0d316a42002-08-11 20:10:47 +00001374 Btree *pBt = pCur->pBt;
drh717e6402001-09-27 03:22:32 +00001375 const char *zKey = (const char*)pKey;
drh2af926b2001-05-15 00:39:25 +00001376
1377 assert( pCur->pPage );
1378 assert( pCur->idx>=0 && pCur->idx<pCur->pPage->nCell );
drhbd03cae2001-06-02 02:40:57 +00001379 pCell = pCur->pPage->apCell[pCur->idx];
drh0d316a42002-08-11 20:10:47 +00001380 nLocal = NKEY(pBt, pCell->h) - nIgnore;
drh8721ce42001-11-07 14:22:00 +00001381 if( nLocal<0 ) nLocal = 0;
1382 n = nKey<nLocal ? nKey : nLocal;
drh2af926b2001-05-15 00:39:25 +00001383 if( n>MX_LOCAL_PAYLOAD ){
1384 n = MX_LOCAL_PAYLOAD;
1385 }
drh717e6402001-09-27 03:22:32 +00001386 c = memcmp(pCell->aPayload, zKey, n);
drh2af926b2001-05-15 00:39:25 +00001387 if( c!=0 ){
1388 *pResult = c;
1389 return SQLITE_OK;
1390 }
drh717e6402001-09-27 03:22:32 +00001391 zKey += n;
drh2af926b2001-05-15 00:39:25 +00001392 nKey -= n;
drh8721ce42001-11-07 14:22:00 +00001393 nLocal -= n;
drh0d316a42002-08-11 20:10:47 +00001394 nextPage = SWAB32(pBt, pCell->ovfl);
drh8721ce42001-11-07 14:22:00 +00001395 while( nKey>0 && nLocal>0 ){
drh2af926b2001-05-15 00:39:25 +00001396 OverflowPage *pOvfl;
1397 if( nextPage==0 ){
1398 return SQLITE_CORRUPT;
1399 }
drh0d316a42002-08-11 20:10:47 +00001400 rc = sqlitepager_get(pBt->pPager, nextPage, (void**)&pOvfl);
drh72f82862001-05-24 21:06:34 +00001401 if( rc ){
drh2af926b2001-05-15 00:39:25 +00001402 return rc;
1403 }
drh0d316a42002-08-11 20:10:47 +00001404 nextPage = SWAB32(pBt, pOvfl->iNext);
drh8721ce42001-11-07 14:22:00 +00001405 n = nKey<nLocal ? nKey : nLocal;
drh2af926b2001-05-15 00:39:25 +00001406 if( n>OVERFLOW_SIZE ){
1407 n = OVERFLOW_SIZE;
1408 }
drh717e6402001-09-27 03:22:32 +00001409 c = memcmp(pOvfl->aPayload, zKey, n);
drh2af926b2001-05-15 00:39:25 +00001410 sqlitepager_unref(pOvfl);
1411 if( c!=0 ){
1412 *pResult = c;
1413 return SQLITE_OK;
1414 }
1415 nKey -= n;
drh8721ce42001-11-07 14:22:00 +00001416 nLocal -= n;
drh717e6402001-09-27 03:22:32 +00001417 zKey += n;
drh2af926b2001-05-15 00:39:25 +00001418 }
drh717e6402001-09-27 03:22:32 +00001419 if( c==0 ){
drh8721ce42001-11-07 14:22:00 +00001420 c = nLocal - nKey;
drh717e6402001-09-27 03:22:32 +00001421 }
drh2af926b2001-05-15 00:39:25 +00001422 *pResult = c;
1423 return SQLITE_OK;
1424}
1425
drh72f82862001-05-24 21:06:34 +00001426/*
drh8178a752003-01-05 21:41:40 +00001427** Move the cursor down to a new child page. The newPgno argument is the
1428** page number of the child page in the byte order of the disk image.
drh72f82862001-05-24 21:06:34 +00001429*/
drh5e2f8b92001-05-28 00:41:15 +00001430static int moveToChild(BtCursor *pCur, int newPgno){
drh72f82862001-05-24 21:06:34 +00001431 int rc;
1432 MemPage *pNewPage;
drh0d316a42002-08-11 20:10:47 +00001433 Btree *pBt = pCur->pBt;
drh72f82862001-05-24 21:06:34 +00001434
drh8178a752003-01-05 21:41:40 +00001435 newPgno = SWAB32(pBt, newPgno);
drh0d316a42002-08-11 20:10:47 +00001436 rc = sqlitepager_get(pBt->pPager, newPgno, (void**)&pNewPage);
drh6019e162001-07-02 17:51:45 +00001437 if( rc ) return rc;
drh0d316a42002-08-11 20:10:47 +00001438 rc = initPage(pBt, pNewPage, newPgno, pCur->pPage);
drh6019e162001-07-02 17:51:45 +00001439 if( rc ) return rc;
drh428ae8c2003-01-04 16:48:09 +00001440 assert( pCur->idx>=pCur->pPage->nCell
1441 || pCur->pPage->apCell[pCur->idx]->h.leftChild==SWAB32(pBt,newPgno) );
1442 assert( pCur->idx<pCur->pPage->nCell
1443 || pCur->pPage->u.hdr.rightChild==SWAB32(pBt,newPgno) );
1444 pNewPage->idxParent = pCur->idx;
1445 pCur->pPage->idxShift = 0;
drh72f82862001-05-24 21:06:34 +00001446 sqlitepager_unref(pCur->pPage);
1447 pCur->pPage = pNewPage;
1448 pCur->idx = 0;
drh4be295b2003-12-16 03:44:47 +00001449 if( pNewPage->nCell<1 ){
1450 return SQLITE_CORRUPT;
1451 }
drh72f82862001-05-24 21:06:34 +00001452 return SQLITE_OK;
1453}
1454
1455/*
drh5e2f8b92001-05-28 00:41:15 +00001456** Move the cursor up to the parent page.
1457**
1458** pCur->idx is set to the cell index that contains the pointer
1459** to the page we are coming from. If we are coming from the
1460** right-most child page then pCur->idx is set to one more than
drhbd03cae2001-06-02 02:40:57 +00001461** the largest cell index.
drh72f82862001-05-24 21:06:34 +00001462*/
drh8178a752003-01-05 21:41:40 +00001463static void moveToParent(BtCursor *pCur){
drh72f82862001-05-24 21:06:34 +00001464 Pgno oldPgno;
1465 MemPage *pParent;
drh8178a752003-01-05 21:41:40 +00001466 MemPage *pPage;
drh428ae8c2003-01-04 16:48:09 +00001467 int idxParent;
drh8178a752003-01-05 21:41:40 +00001468 pPage = pCur->pPage;
1469 assert( pPage!=0 );
1470 pParent = pPage->pParent;
1471 assert( pParent!=0 );
1472 idxParent = pPage->idxParent;
drh72f82862001-05-24 21:06:34 +00001473 sqlitepager_ref(pParent);
drh8178a752003-01-05 21:41:40 +00001474 sqlitepager_unref(pPage);
drh72f82862001-05-24 21:06:34 +00001475 pCur->pPage = pParent;
drh428ae8c2003-01-04 16:48:09 +00001476 assert( pParent->idxShift==0 );
1477 if( pParent->idxShift==0 ){
1478 pCur->idx = idxParent;
1479#ifndef NDEBUG
1480 /* Verify that pCur->idx is the correct index to point back to the child
1481 ** page we just came from
1482 */
drh8178a752003-01-05 21:41:40 +00001483 oldPgno = SWAB32(pCur->pBt, sqlitepager_pagenumber(pPage));
drh428ae8c2003-01-04 16:48:09 +00001484 if( pCur->idx<pParent->nCell ){
1485 assert( pParent->apCell[idxParent]->h.leftChild==oldPgno );
1486 }else{
1487 assert( pParent->u.hdr.rightChild==oldPgno );
1488 }
1489#endif
1490 }else{
1491 /* The MemPage.idxShift flag indicates that cell indices might have
1492 ** changed since idxParent was set and hence idxParent might be out
1493 ** of date. So recompute the parent cell index by scanning all cells
1494 ** and locating the one that points to the child we just came from.
1495 */
1496 int i;
1497 pCur->idx = pParent->nCell;
drh8178a752003-01-05 21:41:40 +00001498 oldPgno = SWAB32(pCur->pBt, sqlitepager_pagenumber(pPage));
drh428ae8c2003-01-04 16:48:09 +00001499 for(i=0; i<pParent->nCell; i++){
1500 if( pParent->apCell[i]->h.leftChild==oldPgno ){
1501 pCur->idx = i;
1502 break;
1503 }
drh72f82862001-05-24 21:06:34 +00001504 }
1505 }
1506}
1507
1508/*
1509** Move the cursor to the root page
1510*/
drh5e2f8b92001-05-28 00:41:15 +00001511static int moveToRoot(BtCursor *pCur){
drh72f82862001-05-24 21:06:34 +00001512 MemPage *pNew;
drhbd03cae2001-06-02 02:40:57 +00001513 int rc;
drh0d316a42002-08-11 20:10:47 +00001514 Btree *pBt = pCur->pBt;
drhbd03cae2001-06-02 02:40:57 +00001515
drh0d316a42002-08-11 20:10:47 +00001516 rc = sqlitepager_get(pBt->pPager, pCur->pgnoRoot, (void**)&pNew);
drhbd03cae2001-06-02 02:40:57 +00001517 if( rc ) return rc;
drh0d316a42002-08-11 20:10:47 +00001518 rc = initPage(pBt, pNew, pCur->pgnoRoot, 0);
drh6019e162001-07-02 17:51:45 +00001519 if( rc ) return rc;
drh72f82862001-05-24 21:06:34 +00001520 sqlitepager_unref(pCur->pPage);
1521 pCur->pPage = pNew;
1522 pCur->idx = 0;
1523 return SQLITE_OK;
1524}
drh2af926b2001-05-15 00:39:25 +00001525
drh5e2f8b92001-05-28 00:41:15 +00001526/*
1527** Move the cursor down to the left-most leaf entry beneath the
1528** entry to which it is currently pointing.
1529*/
1530static int moveToLeftmost(BtCursor *pCur){
1531 Pgno pgno;
1532 int rc;
1533
1534 while( (pgno = pCur->pPage->apCell[pCur->idx]->h.leftChild)!=0 ){
drh8178a752003-01-05 21:41:40 +00001535 rc = moveToChild(pCur, pgno);
drh5e2f8b92001-05-28 00:41:15 +00001536 if( rc ) return rc;
1537 }
1538 return SQLITE_OK;
1539}
1540
drh2dcc9aa2002-12-04 13:40:25 +00001541/*
1542** Move the cursor down to the right-most leaf entry beneath the
1543** page to which it is currently pointing. Notice the difference
1544** between moveToLeftmost() and moveToRightmost(). moveToLeftmost()
1545** finds the left-most entry beneath the *entry* whereas moveToRightmost()
1546** finds the right-most entry beneath the *page*.
1547*/
1548static int moveToRightmost(BtCursor *pCur){
1549 Pgno pgno;
1550 int rc;
1551
1552 while( (pgno = pCur->pPage->u.hdr.rightChild)!=0 ){
drh428ae8c2003-01-04 16:48:09 +00001553 pCur->idx = pCur->pPage->nCell;
drh8178a752003-01-05 21:41:40 +00001554 rc = moveToChild(pCur, pgno);
drh2dcc9aa2002-12-04 13:40:25 +00001555 if( rc ) return rc;
1556 }
1557 pCur->idx = pCur->pPage->nCell - 1;
1558 return SQLITE_OK;
1559}
1560
drh5e00f6c2001-09-13 13:46:56 +00001561/* Move the cursor to the first entry in the table. Return SQLITE_OK
1562** on success. Set *pRes to 0 if the cursor actually points to something
drh77c679c2002-02-19 22:43:58 +00001563** or set *pRes to 1 if the table is empty.
drh5e00f6c2001-09-13 13:46:56 +00001564*/
drh144f9ea2003-04-16 01:28:16 +00001565static int fileBtreeFirst(BtCursor *pCur, int *pRes){
drh5e00f6c2001-09-13 13:46:56 +00001566 int rc;
drhecdc7532001-09-23 02:35:53 +00001567 if( pCur->pPage==0 ) return SQLITE_ABORT;
drh5e00f6c2001-09-13 13:46:56 +00001568 rc = moveToRoot(pCur);
1569 if( rc ) return rc;
1570 if( pCur->pPage->nCell==0 ){
1571 *pRes = 1;
1572 return SQLITE_OK;
1573 }
1574 *pRes = 0;
1575 rc = moveToLeftmost(pCur);
drh2dcc9aa2002-12-04 13:40:25 +00001576 pCur->eSkip = SKIP_NONE;
drh5e00f6c2001-09-13 13:46:56 +00001577 return rc;
1578}
drh5e2f8b92001-05-28 00:41:15 +00001579
drh9562b552002-02-19 15:00:07 +00001580/* Move the cursor to the last entry in the table. Return SQLITE_OK
1581** on success. Set *pRes to 0 if the cursor actually points to something
drh77c679c2002-02-19 22:43:58 +00001582** or set *pRes to 1 if the table is empty.
drh9562b552002-02-19 15:00:07 +00001583*/
drh144f9ea2003-04-16 01:28:16 +00001584static int fileBtreeLast(BtCursor *pCur, int *pRes){
drh9562b552002-02-19 15:00:07 +00001585 int rc;
drh9562b552002-02-19 15:00:07 +00001586 if( pCur->pPage==0 ) return SQLITE_ABORT;
1587 rc = moveToRoot(pCur);
1588 if( rc ) return rc;
drh7aa128d2002-06-21 13:09:16 +00001589 assert( pCur->pPage->isInit );
drh9562b552002-02-19 15:00:07 +00001590 if( pCur->pPage->nCell==0 ){
1591 *pRes = 1;
1592 return SQLITE_OK;
1593 }
1594 *pRes = 0;
drh2dcc9aa2002-12-04 13:40:25 +00001595 rc = moveToRightmost(pCur);
1596 pCur->eSkip = SKIP_NONE;
drh9562b552002-02-19 15:00:07 +00001597 return rc;
1598}
1599
drha059ad02001-04-17 20:09:11 +00001600/* Move the cursor so that it points to an entry near pKey.
drh72f82862001-05-24 21:06:34 +00001601** Return a success code.
1602**
drh5e2f8b92001-05-28 00:41:15 +00001603** If an exact match is not found, then the cursor is always
drhbd03cae2001-06-02 02:40:57 +00001604** left pointing at a leaf page which would hold the entry if it
drh5e2f8b92001-05-28 00:41:15 +00001605** were present. The cursor might point to an entry that comes
1606** before or after the key.
1607**
drhbd03cae2001-06-02 02:40:57 +00001608** The result of comparing the key with the entry to which the
1609** cursor is left pointing is stored in pCur->iMatch. The same
1610** value is also written to *pRes if pRes!=NULL. The meaning of
1611** this value is as follows:
1612**
1613** *pRes<0 The cursor is left pointing at an entry that
drh1a844c32002-12-04 22:29:28 +00001614** is smaller than pKey or if the table is empty
1615** and the cursor is therefore left point to nothing.
drhbd03cae2001-06-02 02:40:57 +00001616**
1617** *pRes==0 The cursor is left pointing at an entry that
1618** exactly matches pKey.
1619**
1620** *pRes>0 The cursor is left pointing at an entry that
drh7c717f72001-06-24 20:39:41 +00001621** is larger than pKey.
drha059ad02001-04-17 20:09:11 +00001622*/
drh144f9ea2003-04-16 01:28:16 +00001623static
1624int fileBtreeMoveto(BtCursor *pCur, const void *pKey, int nKey, int *pRes){
drh72f82862001-05-24 21:06:34 +00001625 int rc;
drhecdc7532001-09-23 02:35:53 +00001626 if( pCur->pPage==0 ) return SQLITE_ABORT;
drh2dcc9aa2002-12-04 13:40:25 +00001627 pCur->eSkip = SKIP_NONE;
drh5e2f8b92001-05-28 00:41:15 +00001628 rc = moveToRoot(pCur);
drh72f82862001-05-24 21:06:34 +00001629 if( rc ) return rc;
1630 for(;;){
1631 int lwr, upr;
1632 Pgno chldPg;
1633 MemPage *pPage = pCur->pPage;
drh1a844c32002-12-04 22:29:28 +00001634 int c = -1; /* pRes return if table is empty must be -1 */
drh72f82862001-05-24 21:06:34 +00001635 lwr = 0;
1636 upr = pPage->nCell-1;
1637 while( lwr<=upr ){
drh72f82862001-05-24 21:06:34 +00001638 pCur->idx = (lwr+upr)/2;
drh144f9ea2003-04-16 01:28:16 +00001639 rc = fileBtreeKeyCompare(pCur, pKey, nKey, 0, &c);
drh72f82862001-05-24 21:06:34 +00001640 if( rc ) return rc;
1641 if( c==0 ){
drh5e2f8b92001-05-28 00:41:15 +00001642 pCur->iMatch = c;
drh72f82862001-05-24 21:06:34 +00001643 if( pRes ) *pRes = 0;
1644 return SQLITE_OK;
1645 }
1646 if( c<0 ){
1647 lwr = pCur->idx+1;
1648 }else{
1649 upr = pCur->idx-1;
1650 }
1651 }
1652 assert( lwr==upr+1 );
drh7aa128d2002-06-21 13:09:16 +00001653 assert( pPage->isInit );
drh72f82862001-05-24 21:06:34 +00001654 if( lwr>=pPage->nCell ){
drh14acc042001-06-10 19:56:58 +00001655 chldPg = pPage->u.hdr.rightChild;
drh72f82862001-05-24 21:06:34 +00001656 }else{
drh5e2f8b92001-05-28 00:41:15 +00001657 chldPg = pPage->apCell[lwr]->h.leftChild;
drh72f82862001-05-24 21:06:34 +00001658 }
1659 if( chldPg==0 ){
drh5e2f8b92001-05-28 00:41:15 +00001660 pCur->iMatch = c;
drh72f82862001-05-24 21:06:34 +00001661 if( pRes ) *pRes = c;
1662 return SQLITE_OK;
1663 }
drh428ae8c2003-01-04 16:48:09 +00001664 pCur->idx = lwr;
drh8178a752003-01-05 21:41:40 +00001665 rc = moveToChild(pCur, chldPg);
drh72f82862001-05-24 21:06:34 +00001666 if( rc ) return rc;
1667 }
drhbd03cae2001-06-02 02:40:57 +00001668 /* NOT REACHED */
drh72f82862001-05-24 21:06:34 +00001669}
1670
1671/*
drhbd03cae2001-06-02 02:40:57 +00001672** Advance the cursor to the next entry in the database. If
drh8c1238a2003-01-02 14:43:55 +00001673** successful then set *pRes=0. If the cursor
drhbd03cae2001-06-02 02:40:57 +00001674** was already pointing to the last entry in the database before
drh8c1238a2003-01-02 14:43:55 +00001675** this routine was called, then set *pRes=1.
drh72f82862001-05-24 21:06:34 +00001676*/
drh144f9ea2003-04-16 01:28:16 +00001677static int fileBtreeNext(BtCursor *pCur, int *pRes){
drh72f82862001-05-24 21:06:34 +00001678 int rc;
drh8178a752003-01-05 21:41:40 +00001679 MemPage *pPage = pCur->pPage;
drh8c1238a2003-01-02 14:43:55 +00001680 assert( pRes!=0 );
drh8178a752003-01-05 21:41:40 +00001681 if( pPage==0 ){
drh8c1238a2003-01-02 14:43:55 +00001682 *pRes = 1;
drhecdc7532001-09-23 02:35:53 +00001683 return SQLITE_ABORT;
1684 }
drh8178a752003-01-05 21:41:40 +00001685 assert( pPage->isInit );
drh2dcc9aa2002-12-04 13:40:25 +00001686 assert( pCur->eSkip!=SKIP_INVALID );
drh8178a752003-01-05 21:41:40 +00001687 if( pPage->nCell==0 ){
drh8c1238a2003-01-02 14:43:55 +00001688 *pRes = 1;
drh2dcc9aa2002-12-04 13:40:25 +00001689 return SQLITE_OK;
1690 }
drh8178a752003-01-05 21:41:40 +00001691 assert( pCur->idx<pPage->nCell );
drh2dcc9aa2002-12-04 13:40:25 +00001692 if( pCur->eSkip==SKIP_NEXT ){
1693 pCur->eSkip = SKIP_NONE;
drh8c1238a2003-01-02 14:43:55 +00001694 *pRes = 0;
drh72f82862001-05-24 21:06:34 +00001695 return SQLITE_OK;
1696 }
drh2dcc9aa2002-12-04 13:40:25 +00001697 pCur->eSkip = SKIP_NONE;
drh72f82862001-05-24 21:06:34 +00001698 pCur->idx++;
drh8178a752003-01-05 21:41:40 +00001699 if( pCur->idx>=pPage->nCell ){
1700 if( pPage->u.hdr.rightChild ){
1701 rc = moveToChild(pCur, pPage->u.hdr.rightChild);
drh5e2f8b92001-05-28 00:41:15 +00001702 if( rc ) return rc;
1703 rc = moveToLeftmost(pCur);
drh8c1238a2003-01-02 14:43:55 +00001704 *pRes = 0;
1705 return rc;
drh72f82862001-05-24 21:06:34 +00001706 }
drh5e2f8b92001-05-28 00:41:15 +00001707 do{
drh8178a752003-01-05 21:41:40 +00001708 if( pPage->pParent==0 ){
drh8c1238a2003-01-02 14:43:55 +00001709 *pRes = 1;
drh5e2f8b92001-05-28 00:41:15 +00001710 return SQLITE_OK;
1711 }
drh8178a752003-01-05 21:41:40 +00001712 moveToParent(pCur);
1713 pPage = pCur->pPage;
1714 }while( pCur->idx>=pPage->nCell );
drh8c1238a2003-01-02 14:43:55 +00001715 *pRes = 0;
drh8178a752003-01-05 21:41:40 +00001716 return SQLITE_OK;
1717 }
1718 *pRes = 0;
1719 if( pPage->u.hdr.rightChild==0 ){
1720 return SQLITE_OK;
drh72f82862001-05-24 21:06:34 +00001721 }
drh5e2f8b92001-05-28 00:41:15 +00001722 rc = moveToLeftmost(pCur);
drh8c1238a2003-01-02 14:43:55 +00001723 return rc;
drh72f82862001-05-24 21:06:34 +00001724}
1725
drh3b7511c2001-05-26 13:15:44 +00001726/*
drh2dcc9aa2002-12-04 13:40:25 +00001727** Step the cursor to the back to the previous entry in the database. If
drh8178a752003-01-05 21:41:40 +00001728** successful then set *pRes=0. If the cursor
drh2dcc9aa2002-12-04 13:40:25 +00001729** was already pointing to the first entry in the database before
drh8178a752003-01-05 21:41:40 +00001730** this routine was called, then set *pRes=1.
drh2dcc9aa2002-12-04 13:40:25 +00001731*/
drh144f9ea2003-04-16 01:28:16 +00001732static int fileBtreePrevious(BtCursor *pCur, int *pRes){
drh2dcc9aa2002-12-04 13:40:25 +00001733 int rc;
1734 Pgno pgno;
drh8178a752003-01-05 21:41:40 +00001735 MemPage *pPage;
1736 pPage = pCur->pPage;
1737 if( pPage==0 ){
1738 *pRes = 1;
drh2dcc9aa2002-12-04 13:40:25 +00001739 return SQLITE_ABORT;
1740 }
drh8178a752003-01-05 21:41:40 +00001741 assert( pPage->isInit );
drh2dcc9aa2002-12-04 13:40:25 +00001742 assert( pCur->eSkip!=SKIP_INVALID );
drh8178a752003-01-05 21:41:40 +00001743 if( pPage->nCell==0 ){
1744 *pRes = 1;
drh2dcc9aa2002-12-04 13:40:25 +00001745 return SQLITE_OK;
1746 }
1747 if( pCur->eSkip==SKIP_PREV ){
1748 pCur->eSkip = SKIP_NONE;
drh8178a752003-01-05 21:41:40 +00001749 *pRes = 0;
drh2dcc9aa2002-12-04 13:40:25 +00001750 return SQLITE_OK;
1751 }
1752 pCur->eSkip = SKIP_NONE;
1753 assert( pCur->idx>=0 );
drh8178a752003-01-05 21:41:40 +00001754 if( (pgno = pPage->apCell[pCur->idx]->h.leftChild)!=0 ){
1755 rc = moveToChild(pCur, pgno);
drh2dcc9aa2002-12-04 13:40:25 +00001756 if( rc ) return rc;
1757 rc = moveToRightmost(pCur);
1758 }else{
1759 while( pCur->idx==0 ){
drh8178a752003-01-05 21:41:40 +00001760 if( pPage->pParent==0 ){
drh2dcc9aa2002-12-04 13:40:25 +00001761 if( pRes ) *pRes = 1;
1762 return SQLITE_OK;
1763 }
drh8178a752003-01-05 21:41:40 +00001764 moveToParent(pCur);
1765 pPage = pCur->pPage;
drh2dcc9aa2002-12-04 13:40:25 +00001766 }
1767 pCur->idx--;
1768 rc = SQLITE_OK;
1769 }
drh8178a752003-01-05 21:41:40 +00001770 *pRes = 0;
drh2dcc9aa2002-12-04 13:40:25 +00001771 return rc;
1772}
1773
1774/*
drh3b7511c2001-05-26 13:15:44 +00001775** Allocate a new page from the database file.
1776**
1777** The new page is marked as dirty. (In other words, sqlitepager_write()
1778** has already been called on the new page.) The new page has also
1779** been referenced and the calling routine is responsible for calling
1780** sqlitepager_unref() on the new page when it is done.
1781**
1782** SQLITE_OK is returned on success. Any other return value indicates
1783** an error. *ppPage and *pPgno are undefined in the event of an error.
1784** Do not invoke sqlitepager_unref() on *ppPage if an error is returned.
drhbea00b92002-07-08 10:59:50 +00001785**
drh199e3cf2002-07-18 11:01:47 +00001786** If the "nearby" parameter is not 0, then a (feeble) effort is made to
1787** locate a page close to the page number "nearby". This can be used in an
drhbea00b92002-07-08 10:59:50 +00001788** attempt to keep related pages close to each other in the database file,
1789** which in turn can make database access faster.
drh3b7511c2001-05-26 13:15:44 +00001790*/
drh199e3cf2002-07-18 11:01:47 +00001791static int allocatePage(Btree *pBt, MemPage **ppPage, Pgno *pPgno, Pgno nearby){
drhbd03cae2001-06-02 02:40:57 +00001792 PageOne *pPage1 = pBt->page1;
drh8c42ca92001-06-22 19:15:00 +00001793 int rc;
drh3b7511c2001-05-26 13:15:44 +00001794 if( pPage1->freeList ){
1795 OverflowPage *pOvfl;
drh30e58752002-03-02 20:41:57 +00001796 FreelistInfo *pInfo;
1797
drh3b7511c2001-05-26 13:15:44 +00001798 rc = sqlitepager_write(pPage1);
1799 if( rc ) return rc;
drh0d316a42002-08-11 20:10:47 +00001800 SWAB_ADD(pBt, pPage1->nFree, -1);
1801 rc = sqlitepager_get(pBt->pPager, SWAB32(pBt, pPage1->freeList),
1802 (void**)&pOvfl);
drh3b7511c2001-05-26 13:15:44 +00001803 if( rc ) return rc;
1804 rc = sqlitepager_write(pOvfl);
1805 if( rc ){
1806 sqlitepager_unref(pOvfl);
1807 return rc;
1808 }
drh30e58752002-03-02 20:41:57 +00001809 pInfo = (FreelistInfo*)pOvfl->aPayload;
1810 if( pInfo->nFree==0 ){
drh0d316a42002-08-11 20:10:47 +00001811 *pPgno = SWAB32(pBt, pPage1->freeList);
drh30e58752002-03-02 20:41:57 +00001812 pPage1->freeList = pOvfl->iNext;
1813 *ppPage = (MemPage*)pOvfl;
1814 }else{
drh0d316a42002-08-11 20:10:47 +00001815 int closest, n;
1816 n = SWAB32(pBt, pInfo->nFree);
1817 if( n>1 && nearby>0 ){
drhbea00b92002-07-08 10:59:50 +00001818 int i, dist;
1819 closest = 0;
drh0d316a42002-08-11 20:10:47 +00001820 dist = SWAB32(pBt, pInfo->aFree[0]) - nearby;
drhbea00b92002-07-08 10:59:50 +00001821 if( dist<0 ) dist = -dist;
drh0d316a42002-08-11 20:10:47 +00001822 for(i=1; i<n; i++){
1823 int d2 = SWAB32(pBt, pInfo->aFree[i]) - nearby;
drhbea00b92002-07-08 10:59:50 +00001824 if( d2<0 ) d2 = -d2;
1825 if( d2<dist ) closest = i;
1826 }
1827 }else{
1828 closest = 0;
1829 }
drh0d316a42002-08-11 20:10:47 +00001830 SWAB_ADD(pBt, pInfo->nFree, -1);
1831 *pPgno = SWAB32(pBt, pInfo->aFree[closest]);
1832 pInfo->aFree[closest] = pInfo->aFree[n-1];
drh30e58752002-03-02 20:41:57 +00001833 rc = sqlitepager_get(pBt->pPager, *pPgno, (void**)ppPage);
1834 sqlitepager_unref(pOvfl);
1835 if( rc==SQLITE_OK ){
1836 sqlitepager_dont_rollback(*ppPage);
1837 rc = sqlitepager_write(*ppPage);
1838 }
1839 }
drh3b7511c2001-05-26 13:15:44 +00001840 }else{
drh2aa679f2001-06-25 02:11:07 +00001841 *pPgno = sqlitepager_pagecount(pBt->pPager) + 1;
drh8c42ca92001-06-22 19:15:00 +00001842 rc = sqlitepager_get(pBt->pPager, *pPgno, (void**)ppPage);
drh3b7511c2001-05-26 13:15:44 +00001843 if( rc ) return rc;
1844 rc = sqlitepager_write(*ppPage);
1845 }
1846 return rc;
1847}
1848
1849/*
1850** Add a page of the database file to the freelist. Either pgno or
1851** pPage but not both may be 0.
drh5e2f8b92001-05-28 00:41:15 +00001852**
drhdd793422001-06-28 01:54:48 +00001853** sqlitepager_unref() is NOT called for pPage.
drh3b7511c2001-05-26 13:15:44 +00001854*/
1855static int freePage(Btree *pBt, void *pPage, Pgno pgno){
drhbd03cae2001-06-02 02:40:57 +00001856 PageOne *pPage1 = pBt->page1;
drh3b7511c2001-05-26 13:15:44 +00001857 OverflowPage *pOvfl = (OverflowPage*)pPage;
1858 int rc;
drhdd793422001-06-28 01:54:48 +00001859 int needUnref = 0;
1860 MemPage *pMemPage;
drh8b2f49b2001-06-08 00:21:52 +00001861
drh3b7511c2001-05-26 13:15:44 +00001862 if( pgno==0 ){
1863 assert( pOvfl!=0 );
1864 pgno = sqlitepager_pagenumber(pOvfl);
1865 }
drh2aa679f2001-06-25 02:11:07 +00001866 assert( pgno>2 );
drhda47d772002-12-02 04:25:19 +00001867 assert( sqlitepager_pagenumber(pOvfl)==pgno );
drh193a6b42002-07-07 16:52:46 +00001868 pMemPage = (MemPage*)pPage;
1869 pMemPage->isInit = 0;
1870 if( pMemPage->pParent ){
1871 sqlitepager_unref(pMemPage->pParent);
1872 pMemPage->pParent = 0;
1873 }
drh3b7511c2001-05-26 13:15:44 +00001874 rc = sqlitepager_write(pPage1);
1875 if( rc ){
1876 return rc;
1877 }
drh0d316a42002-08-11 20:10:47 +00001878 SWAB_ADD(pBt, pPage1->nFree, 1);
1879 if( pPage1->nFree!=0 && pPage1->freeList!=0 ){
drh30e58752002-03-02 20:41:57 +00001880 OverflowPage *pFreeIdx;
drh0d316a42002-08-11 20:10:47 +00001881 rc = sqlitepager_get(pBt->pPager, SWAB32(pBt, pPage1->freeList),
1882 (void**)&pFreeIdx);
drh30e58752002-03-02 20:41:57 +00001883 if( rc==SQLITE_OK ){
1884 FreelistInfo *pInfo = (FreelistInfo*)pFreeIdx->aPayload;
drh0d316a42002-08-11 20:10:47 +00001885 int n = SWAB32(pBt, pInfo->nFree);
1886 if( n<(sizeof(pInfo->aFree)/sizeof(pInfo->aFree[0])) ){
drh30e58752002-03-02 20:41:57 +00001887 rc = sqlitepager_write(pFreeIdx);
1888 if( rc==SQLITE_OK ){
drh0d316a42002-08-11 20:10:47 +00001889 pInfo->aFree[n] = SWAB32(pBt, pgno);
1890 SWAB_ADD(pBt, pInfo->nFree, 1);
drh30e58752002-03-02 20:41:57 +00001891 sqlitepager_unref(pFreeIdx);
1892 sqlitepager_dont_write(pBt->pPager, pgno);
1893 return rc;
1894 }
1895 }
1896 sqlitepager_unref(pFreeIdx);
1897 }
1898 }
drh3b7511c2001-05-26 13:15:44 +00001899 if( pOvfl==0 ){
1900 assert( pgno>0 );
drh8c42ca92001-06-22 19:15:00 +00001901 rc = sqlitepager_get(pBt->pPager, pgno, (void**)&pOvfl);
drh3b7511c2001-05-26 13:15:44 +00001902 if( rc ) return rc;
drhdd793422001-06-28 01:54:48 +00001903 needUnref = 1;
drh3b7511c2001-05-26 13:15:44 +00001904 }
1905 rc = sqlitepager_write(pOvfl);
1906 if( rc ){
drhdd793422001-06-28 01:54:48 +00001907 if( needUnref ) sqlitepager_unref(pOvfl);
drh3b7511c2001-05-26 13:15:44 +00001908 return rc;
1909 }
drh14acc042001-06-10 19:56:58 +00001910 pOvfl->iNext = pPage1->freeList;
drh0d316a42002-08-11 20:10:47 +00001911 pPage1->freeList = SWAB32(pBt, pgno);
drh5e2f8b92001-05-28 00:41:15 +00001912 memset(pOvfl->aPayload, 0, OVERFLOW_SIZE);
drhdd793422001-06-28 01:54:48 +00001913 if( needUnref ) rc = sqlitepager_unref(pOvfl);
drh3b7511c2001-05-26 13:15:44 +00001914 return rc;
1915}
1916
1917/*
1918** Erase all the data out of a cell. This involves returning overflow
1919** pages back the freelist.
1920*/
1921static int clearCell(Btree *pBt, Cell *pCell){
1922 Pager *pPager = pBt->pPager;
1923 OverflowPage *pOvfl;
drh3b7511c2001-05-26 13:15:44 +00001924 Pgno ovfl, nextOvfl;
1925 int rc;
1926
drh0d316a42002-08-11 20:10:47 +00001927 if( NKEY(pBt, pCell->h) + NDATA(pBt, pCell->h) <= MX_LOCAL_PAYLOAD ){
drh5e2f8b92001-05-28 00:41:15 +00001928 return SQLITE_OK;
1929 }
drh0d316a42002-08-11 20:10:47 +00001930 ovfl = SWAB32(pBt, pCell->ovfl);
drh3b7511c2001-05-26 13:15:44 +00001931 pCell->ovfl = 0;
1932 while( ovfl ){
drh8c42ca92001-06-22 19:15:00 +00001933 rc = sqlitepager_get(pPager, ovfl, (void**)&pOvfl);
drh3b7511c2001-05-26 13:15:44 +00001934 if( rc ) return rc;
drh0d316a42002-08-11 20:10:47 +00001935 nextOvfl = SWAB32(pBt, pOvfl->iNext);
drhbd03cae2001-06-02 02:40:57 +00001936 rc = freePage(pBt, pOvfl, ovfl);
1937 if( rc ) return rc;
drhdd793422001-06-28 01:54:48 +00001938 sqlitepager_unref(pOvfl);
drh3b7511c2001-05-26 13:15:44 +00001939 ovfl = nextOvfl;
drh3b7511c2001-05-26 13:15:44 +00001940 }
drh5e2f8b92001-05-28 00:41:15 +00001941 return SQLITE_OK;
drh3b7511c2001-05-26 13:15:44 +00001942}
1943
1944/*
1945** Create a new cell from key and data. Overflow pages are allocated as
1946** necessary and linked to this cell.
1947*/
1948static int fillInCell(
1949 Btree *pBt, /* The whole Btree. Needed to allocate pages */
1950 Cell *pCell, /* Populate this Cell structure */
drh5c4d9702001-08-20 00:33:58 +00001951 const void *pKey, int nKey, /* The key */
1952 const void *pData,int nData /* The data */
drh3b7511c2001-05-26 13:15:44 +00001953){
drhdd793422001-06-28 01:54:48 +00001954 OverflowPage *pOvfl, *pPrior;
drh3b7511c2001-05-26 13:15:44 +00001955 Pgno *pNext;
1956 int spaceLeft;
drh8c42ca92001-06-22 19:15:00 +00001957 int n, rc;
drh3b7511c2001-05-26 13:15:44 +00001958 int nPayload;
drh5c4d9702001-08-20 00:33:58 +00001959 const char *pPayload;
drh3b7511c2001-05-26 13:15:44 +00001960 char *pSpace;
drh199e3cf2002-07-18 11:01:47 +00001961 Pgno nearby = 0;
drh3b7511c2001-05-26 13:15:44 +00001962
drh5e2f8b92001-05-28 00:41:15 +00001963 pCell->h.leftChild = 0;
drh0d316a42002-08-11 20:10:47 +00001964 pCell->h.nKey = SWAB16(pBt, nKey & 0xffff);
drh80ff32f2001-11-04 18:32:46 +00001965 pCell->h.nKeyHi = nKey >> 16;
drh0d316a42002-08-11 20:10:47 +00001966 pCell->h.nData = SWAB16(pBt, nData & 0xffff);
drh80ff32f2001-11-04 18:32:46 +00001967 pCell->h.nDataHi = nData >> 16;
drh3b7511c2001-05-26 13:15:44 +00001968 pCell->h.iNext = 0;
1969
1970 pNext = &pCell->ovfl;
drh5e2f8b92001-05-28 00:41:15 +00001971 pSpace = pCell->aPayload;
drh3b7511c2001-05-26 13:15:44 +00001972 spaceLeft = MX_LOCAL_PAYLOAD;
1973 pPayload = pKey;
1974 pKey = 0;
1975 nPayload = nKey;
drhdd793422001-06-28 01:54:48 +00001976 pPrior = 0;
drh3b7511c2001-05-26 13:15:44 +00001977 while( nPayload>0 ){
1978 if( spaceLeft==0 ){
drh199e3cf2002-07-18 11:01:47 +00001979 rc = allocatePage(pBt, (MemPage**)&pOvfl, pNext, nearby);
drh3b7511c2001-05-26 13:15:44 +00001980 if( rc ){
1981 *pNext = 0;
drhbea00b92002-07-08 10:59:50 +00001982 }else{
drh199e3cf2002-07-18 11:01:47 +00001983 nearby = *pNext;
drhdd793422001-06-28 01:54:48 +00001984 }
1985 if( pPrior ) sqlitepager_unref(pPrior);
1986 if( rc ){
drh5e2f8b92001-05-28 00:41:15 +00001987 clearCell(pBt, pCell);
drh3b7511c2001-05-26 13:15:44 +00001988 return rc;
1989 }
drh0d316a42002-08-11 20:10:47 +00001990 if( pBt->needSwab ) *pNext = swab32(*pNext);
drhdd793422001-06-28 01:54:48 +00001991 pPrior = pOvfl;
drh3b7511c2001-05-26 13:15:44 +00001992 spaceLeft = OVERFLOW_SIZE;
drh5e2f8b92001-05-28 00:41:15 +00001993 pSpace = pOvfl->aPayload;
drh8c42ca92001-06-22 19:15:00 +00001994 pNext = &pOvfl->iNext;
drh3b7511c2001-05-26 13:15:44 +00001995 }
1996 n = nPayload;
1997 if( n>spaceLeft ) n = spaceLeft;
1998 memcpy(pSpace, pPayload, n);
1999 nPayload -= n;
2000 if( nPayload==0 && pData ){
2001 pPayload = pData;
2002 nPayload = nData;
2003 pData = 0;
2004 }else{
2005 pPayload += n;
2006 }
2007 spaceLeft -= n;
2008 pSpace += n;
2009 }
drhdd793422001-06-28 01:54:48 +00002010 *pNext = 0;
2011 if( pPrior ){
2012 sqlitepager_unref(pPrior);
2013 }
drh3b7511c2001-05-26 13:15:44 +00002014 return SQLITE_OK;
2015}
2016
2017/*
drhbd03cae2001-06-02 02:40:57 +00002018** Change the MemPage.pParent pointer on the page whose number is
drh8b2f49b2001-06-08 00:21:52 +00002019** given in the second argument so that MemPage.pParent holds the
drhbd03cae2001-06-02 02:40:57 +00002020** pointer in the third argument.
2021*/
drh428ae8c2003-01-04 16:48:09 +00002022static void reparentPage(Pager *pPager, Pgno pgno, MemPage *pNewParent,int idx){
drhbd03cae2001-06-02 02:40:57 +00002023 MemPage *pThis;
2024
drhdd793422001-06-28 01:54:48 +00002025 if( pgno==0 ) return;
2026 assert( pPager!=0 );
drhbd03cae2001-06-02 02:40:57 +00002027 pThis = sqlitepager_lookup(pPager, pgno);
drh6019e162001-07-02 17:51:45 +00002028 if( pThis && pThis->isInit ){
drhdd793422001-06-28 01:54:48 +00002029 if( pThis->pParent!=pNewParent ){
2030 if( pThis->pParent ) sqlitepager_unref(pThis->pParent);
2031 pThis->pParent = pNewParent;
2032 if( pNewParent ) sqlitepager_ref(pNewParent);
2033 }
drh428ae8c2003-01-04 16:48:09 +00002034 pThis->idxParent = idx;
drhdd793422001-06-28 01:54:48 +00002035 sqlitepager_unref(pThis);
drhbd03cae2001-06-02 02:40:57 +00002036 }
2037}
2038
2039/*
2040** Reparent all children of the given page to be the given page.
2041** In other words, for every child of pPage, invoke reparentPage()
drh5e00f6c2001-09-13 13:46:56 +00002042** to make sure that each child knows that pPage is its parent.
drhbd03cae2001-06-02 02:40:57 +00002043**
2044** This routine gets called after you memcpy() one page into
2045** another.
2046*/
drh0d316a42002-08-11 20:10:47 +00002047static void reparentChildPages(Btree *pBt, MemPage *pPage){
drhbd03cae2001-06-02 02:40:57 +00002048 int i;
drh0d316a42002-08-11 20:10:47 +00002049 Pager *pPager = pBt->pPager;
drhbd03cae2001-06-02 02:40:57 +00002050 for(i=0; i<pPage->nCell; i++){
drh428ae8c2003-01-04 16:48:09 +00002051 reparentPage(pPager, SWAB32(pBt, pPage->apCell[i]->h.leftChild), pPage, i);
drhbd03cae2001-06-02 02:40:57 +00002052 }
drh428ae8c2003-01-04 16:48:09 +00002053 reparentPage(pPager, SWAB32(pBt, pPage->u.hdr.rightChild), pPage, i);
2054 pPage->idxShift = 0;
drh14acc042001-06-10 19:56:58 +00002055}
2056
2057/*
2058** Remove the i-th cell from pPage. This routine effects pPage only.
2059** The cell content is not freed or deallocated. It is assumed that
2060** the cell content has been copied someplace else. This routine just
2061** removes the reference to the cell from pPage.
2062**
2063** "sz" must be the number of bytes in the cell.
2064**
2065** Do not bother maintaining the integrity of the linked list of Cells.
drh8c42ca92001-06-22 19:15:00 +00002066** Only the pPage->apCell[] array is important. The relinkCellList()
2067** routine will be called soon after this routine in order to rebuild
2068** the linked list.
drh14acc042001-06-10 19:56:58 +00002069*/
drh0d316a42002-08-11 20:10:47 +00002070static void dropCell(Btree *pBt, MemPage *pPage, int idx, int sz){
drh14acc042001-06-10 19:56:58 +00002071 int j;
drh8c42ca92001-06-22 19:15:00 +00002072 assert( idx>=0 && idx<pPage->nCell );
drh0d316a42002-08-11 20:10:47 +00002073 assert( sz==cellSize(pBt, pPage->apCell[idx]) );
drh6019e162001-07-02 17:51:45 +00002074 assert( sqlitepager_iswriteable(pPage) );
drh0d316a42002-08-11 20:10:47 +00002075 freeSpace(pBt, pPage, Addr(pPage->apCell[idx]) - Addr(pPage), sz);
drh7c717f72001-06-24 20:39:41 +00002076 for(j=idx; j<pPage->nCell-1; j++){
drh14acc042001-06-10 19:56:58 +00002077 pPage->apCell[j] = pPage->apCell[j+1];
2078 }
2079 pPage->nCell--;
drh428ae8c2003-01-04 16:48:09 +00002080 pPage->idxShift = 1;
drh14acc042001-06-10 19:56:58 +00002081}
2082
2083/*
2084** Insert a new cell on pPage at cell index "i". pCell points to the
2085** content of the cell.
2086**
2087** If the cell content will fit on the page, then put it there. If it
2088** will not fit, then just make pPage->apCell[i] point to the content
2089** and set pPage->isOverfull.
2090**
2091** Do not bother maintaining the integrity of the linked list of Cells.
drh8c42ca92001-06-22 19:15:00 +00002092** Only the pPage->apCell[] array is important. The relinkCellList()
2093** routine will be called soon after this routine in order to rebuild
2094** the linked list.
drh14acc042001-06-10 19:56:58 +00002095*/
drh0d316a42002-08-11 20:10:47 +00002096static void insertCell(Btree *pBt, MemPage *pPage, int i, Cell *pCell, int sz){
drh14acc042001-06-10 19:56:58 +00002097 int idx, j;
2098 assert( i>=0 && i<=pPage->nCell );
drh0d316a42002-08-11 20:10:47 +00002099 assert( sz==cellSize(pBt, pCell) );
drh6019e162001-07-02 17:51:45 +00002100 assert( sqlitepager_iswriteable(pPage) );
drh0d316a42002-08-11 20:10:47 +00002101 idx = allocateSpace(pBt, pPage, sz);
drh14acc042001-06-10 19:56:58 +00002102 for(j=pPage->nCell; j>i; j--){
2103 pPage->apCell[j] = pPage->apCell[j-1];
2104 }
2105 pPage->nCell++;
drh14acc042001-06-10 19:56:58 +00002106 if( idx<=0 ){
2107 pPage->isOverfull = 1;
2108 pPage->apCell[i] = pCell;
2109 }else{
2110 memcpy(&pPage->u.aDisk[idx], pCell, sz);
drh8c42ca92001-06-22 19:15:00 +00002111 pPage->apCell[i] = (Cell*)&pPage->u.aDisk[idx];
drh14acc042001-06-10 19:56:58 +00002112 }
drh428ae8c2003-01-04 16:48:09 +00002113 pPage->idxShift = 1;
drh14acc042001-06-10 19:56:58 +00002114}
2115
2116/*
2117** Rebuild the linked list of cells on a page so that the cells
drh8c42ca92001-06-22 19:15:00 +00002118** occur in the order specified by the pPage->apCell[] array.
2119** Invoke this routine once to repair damage after one or more
2120** invocations of either insertCell() or dropCell().
drh14acc042001-06-10 19:56:58 +00002121*/
drh0d316a42002-08-11 20:10:47 +00002122static void relinkCellList(Btree *pBt, MemPage *pPage){
drh14acc042001-06-10 19:56:58 +00002123 int i;
2124 u16 *pIdx;
drh6019e162001-07-02 17:51:45 +00002125 assert( sqlitepager_iswriteable(pPage) );
drh14acc042001-06-10 19:56:58 +00002126 pIdx = &pPage->u.hdr.firstCell;
2127 for(i=0; i<pPage->nCell; i++){
drh7c717f72001-06-24 20:39:41 +00002128 int idx = Addr(pPage->apCell[i]) - Addr(pPage);
drhd0ba1932004-02-10 01:54:28 +00002129 assert( idx>0 && idx<SQLITE_USABLE_SIZE );
drh0d316a42002-08-11 20:10:47 +00002130 *pIdx = SWAB16(pBt, idx);
drh14acc042001-06-10 19:56:58 +00002131 pIdx = &pPage->apCell[i]->h.iNext;
2132 }
2133 *pIdx = 0;
2134}
2135
2136/*
2137** Make a copy of the contents of pFrom into pTo. The pFrom->apCell[]
drh5e00f6c2001-09-13 13:46:56 +00002138** pointers that point into pFrom->u.aDisk[] must be adjusted to point
drhdd793422001-06-28 01:54:48 +00002139** into pTo->u.aDisk[] instead. But some pFrom->apCell[] entries might
drh14acc042001-06-10 19:56:58 +00002140** not point to pFrom->u.aDisk[]. Those are unchanged.
2141*/
2142static void copyPage(MemPage *pTo, MemPage *pFrom){
2143 uptr from, to;
2144 int i;
drhd0ba1932004-02-10 01:54:28 +00002145 memcpy(pTo->u.aDisk, pFrom->u.aDisk, SQLITE_USABLE_SIZE);
drhdd793422001-06-28 01:54:48 +00002146 pTo->pParent = 0;
drh14acc042001-06-10 19:56:58 +00002147 pTo->isInit = 1;
2148 pTo->nCell = pFrom->nCell;
2149 pTo->nFree = pFrom->nFree;
2150 pTo->isOverfull = pFrom->isOverfull;
drh7c717f72001-06-24 20:39:41 +00002151 to = Addr(pTo);
2152 from = Addr(pFrom);
drh14acc042001-06-10 19:56:58 +00002153 for(i=0; i<pTo->nCell; i++){
drh7c717f72001-06-24 20:39:41 +00002154 uptr x = Addr(pFrom->apCell[i]);
drhd0ba1932004-02-10 01:54:28 +00002155 if( x>from && x<from+SQLITE_USABLE_SIZE ){
drh8c42ca92001-06-22 19:15:00 +00002156 *((uptr*)&pTo->apCell[i]) = x + to - from;
drhdd793422001-06-28 01:54:48 +00002157 }else{
2158 pTo->apCell[i] = pFrom->apCell[i];
drh14acc042001-06-10 19:56:58 +00002159 }
2160 }
drhbd03cae2001-06-02 02:40:57 +00002161}
2162
2163/*
drhc3b70572003-01-04 19:44:07 +00002164** The following parameters determine how many adjacent pages get involved
2165** in a balancing operation. NN is the number of neighbors on either side
2166** of the page that participate in the balancing operation. NB is the
2167** total number of pages that participate, including the target page and
2168** NN neighbors on either side.
2169**
2170** The minimum value of NN is 1 (of course). Increasing NN above 1
2171** (to 2 or 3) gives a modest improvement in SELECT and DELETE performance
2172** in exchange for a larger degradation in INSERT and UPDATE performance.
2173** The value of NN appears to give the best results overall.
2174*/
2175#define NN 1 /* Number of neighbors on either side of pPage */
2176#define NB (NN*2+1) /* Total pages involved in the balance */
2177
2178/*
drh8b2f49b2001-06-08 00:21:52 +00002179** This routine redistributes Cells on pPage and up to two siblings
2180** of pPage so that all pages have about the same amount of free space.
drh14acc042001-06-10 19:56:58 +00002181** Usually one sibling on either side of pPage is used in the balancing,
drh8b2f49b2001-06-08 00:21:52 +00002182** though both siblings might come from one side if pPage is the first
2183** or last child of its parent. If pPage has fewer than two siblings
2184** (something which can only happen if pPage is the root page or a
drh14acc042001-06-10 19:56:58 +00002185** child of root) then all available siblings participate in the balancing.
drh8b2f49b2001-06-08 00:21:52 +00002186**
2187** The number of siblings of pPage might be increased or decreased by
drh8c42ca92001-06-22 19:15:00 +00002188** one in an effort to keep pages between 66% and 100% full. The root page
2189** is special and is allowed to be less than 66% full. If pPage is
2190** the root page, then the depth of the tree might be increased
drh8b2f49b2001-06-08 00:21:52 +00002191** or decreased by one, as necessary, to keep the root page from being
2192** overfull or empty.
2193**
drh14acc042001-06-10 19:56:58 +00002194** This routine calls relinkCellList() on its input page regardless of
2195** whether or not it does any real balancing. Client routines will typically
2196** invoke insertCell() or dropCell() before calling this routine, so we
2197** need to call relinkCellList() to clean up the mess that those other
2198** routines left behind.
2199**
2200** pCur is left pointing to the same cell as when this routine was called
drh8c42ca92001-06-22 19:15:00 +00002201** even if that cell gets moved to a different page. pCur may be NULL.
2202** Set the pCur parameter to NULL if you do not care about keeping track
2203** of a cell as that will save this routine the work of keeping track of it.
drh14acc042001-06-10 19:56:58 +00002204**
drh8b2f49b2001-06-08 00:21:52 +00002205** Note that when this routine is called, some of the Cells on pPage
drh14acc042001-06-10 19:56:58 +00002206** might not actually be stored in pPage->u.aDisk[]. This can happen
drh8b2f49b2001-06-08 00:21:52 +00002207** if the page is overfull. Part of the job of this routine is to
drh14acc042001-06-10 19:56:58 +00002208** make sure all Cells for pPage once again fit in pPage->u.aDisk[].
2209**
drh8c42ca92001-06-22 19:15:00 +00002210** In the course of balancing the siblings of pPage, the parent of pPage
2211** might become overfull or underfull. If that happens, then this routine
2212** is called recursively on the parent.
2213**
drh5e00f6c2001-09-13 13:46:56 +00002214** If this routine fails for any reason, it might leave the database
2215** in a corrupted state. So if this routine fails, the database should
2216** be rolled back.
drh8b2f49b2001-06-08 00:21:52 +00002217*/
drh14acc042001-06-10 19:56:58 +00002218static int balance(Btree *pBt, MemPage *pPage, BtCursor *pCur){
drh8b2f49b2001-06-08 00:21:52 +00002219 MemPage *pParent; /* The parent of pPage */
drh8b2f49b2001-06-08 00:21:52 +00002220 int nCell; /* Number of cells in apCell[] */
2221 int nOld; /* Number of pages in apOld[] */
2222 int nNew; /* Number of pages in apNew[] */
drh8b2f49b2001-06-08 00:21:52 +00002223 int nDiv; /* Number of cells in apDiv[] */
drh14acc042001-06-10 19:56:58 +00002224 int i, j, k; /* Loop counters */
2225 int idx; /* Index of pPage in pParent->apCell[] */
2226 int nxDiv; /* Next divider slot in pParent->apCell[] */
2227 int rc; /* The return code */
2228 int iCur; /* apCell[iCur] is the cell of the cursor */
drh5edc3122001-09-13 21:53:09 +00002229 MemPage *pOldCurPage; /* The cursor originally points to this page */
drh6019e162001-07-02 17:51:45 +00002230 int subtotal; /* Subtotal of bytes in cells on one page */
drh9ca7d3b2001-06-28 11:50:21 +00002231 MemPage *extraUnref = 0; /* A page that needs to be unref-ed */
drhc3b70572003-01-04 19:44:07 +00002232 MemPage *apOld[NB]; /* pPage and up to two siblings */
2233 Pgno pgnoOld[NB]; /* Page numbers for each page in apOld[] */
2234 MemPage *apNew[NB+1]; /* pPage and up to NB siblings after balancing */
2235 Pgno pgnoNew[NB+1]; /* Page numbers for each page in apNew[] */
2236 int idxDiv[NB]; /* Indices of divider cells in pParent */
2237 Cell *apDiv[NB]; /* Divider cells in pParent */
2238 Cell aTemp[NB]; /* Temporary holding area for apDiv[] */
2239 int cntNew[NB+1]; /* Index in apCell[] of cell after i-th page */
2240 int szNew[NB+1]; /* Combined size of cells place on i-th page */
2241 MemPage aOld[NB]; /* Temporary copies of pPage and its siblings */
2242 Cell *apCell[(MX_CELL+2)*NB]; /* All cells from pages being balanced */
2243 int szCell[(MX_CELL+2)*NB]; /* Local size of all cells */
drh8b2f49b2001-06-08 00:21:52 +00002244
drh14acc042001-06-10 19:56:58 +00002245 /*
2246 ** Return without doing any work if pPage is neither overfull nor
2247 ** underfull.
drh8b2f49b2001-06-08 00:21:52 +00002248 */
drh6019e162001-07-02 17:51:45 +00002249 assert( sqlitepager_iswriteable(pPage) );
drhd0ba1932004-02-10 01:54:28 +00002250 if( !pPage->isOverfull && pPage->nFree<SQLITE_USABLE_SIZE/2
drha1b351a2001-09-14 16:42:12 +00002251 && pPage->nCell>=2){
drh0d316a42002-08-11 20:10:47 +00002252 relinkCellList(pBt, pPage);
drh8b2f49b2001-06-08 00:21:52 +00002253 return SQLITE_OK;
2254 }
2255
2256 /*
drh14acc042001-06-10 19:56:58 +00002257 ** Find the parent of the page to be balanceed.
2258 ** If there is no parent, it means this page is the root page and
drh8b2f49b2001-06-08 00:21:52 +00002259 ** special rules apply.
2260 */
drh14acc042001-06-10 19:56:58 +00002261 pParent = pPage->pParent;
drh8b2f49b2001-06-08 00:21:52 +00002262 if( pParent==0 ){
2263 Pgno pgnoChild;
drh8c42ca92001-06-22 19:15:00 +00002264 MemPage *pChild;
drh7aa128d2002-06-21 13:09:16 +00002265 assert( pPage->isInit );
drh8b2f49b2001-06-08 00:21:52 +00002266 if( pPage->nCell==0 ){
drh14acc042001-06-10 19:56:58 +00002267 if( pPage->u.hdr.rightChild ){
2268 /*
2269 ** The root page is empty. Copy the one child page
drh8b2f49b2001-06-08 00:21:52 +00002270 ** into the root page and return. This reduces the depth
2271 ** of the BTree by one.
2272 */
drh0d316a42002-08-11 20:10:47 +00002273 pgnoChild = SWAB32(pBt, pPage->u.hdr.rightChild);
drh8c42ca92001-06-22 19:15:00 +00002274 rc = sqlitepager_get(pBt->pPager, pgnoChild, (void**)&pChild);
drh8b2f49b2001-06-08 00:21:52 +00002275 if( rc ) return rc;
drhd0ba1932004-02-10 01:54:28 +00002276 memcpy(pPage, pChild, SQLITE_USABLE_SIZE);
drh8b2f49b2001-06-08 00:21:52 +00002277 pPage->isInit = 0;
drh0d316a42002-08-11 20:10:47 +00002278 rc = initPage(pBt, pPage, sqlitepager_pagenumber(pPage), 0);
drh6019e162001-07-02 17:51:45 +00002279 assert( rc==SQLITE_OK );
drh0d316a42002-08-11 20:10:47 +00002280 reparentChildPages(pBt, pPage);
drh5edc3122001-09-13 21:53:09 +00002281 if( pCur && pCur->pPage==pChild ){
2282 sqlitepager_unref(pChild);
2283 pCur->pPage = pPage;
2284 sqlitepager_ref(pPage);
2285 }
drh8b2f49b2001-06-08 00:21:52 +00002286 freePage(pBt, pChild, pgnoChild);
2287 sqlitepager_unref(pChild);
drhefc251d2001-07-01 22:12:01 +00002288 }else{
drh0d316a42002-08-11 20:10:47 +00002289 relinkCellList(pBt, pPage);
drh8b2f49b2001-06-08 00:21:52 +00002290 }
2291 return SQLITE_OK;
2292 }
drh14acc042001-06-10 19:56:58 +00002293 if( !pPage->isOverfull ){
drh8b2f49b2001-06-08 00:21:52 +00002294 /* It is OK for the root page to be less than half full.
2295 */
drh0d316a42002-08-11 20:10:47 +00002296 relinkCellList(pBt, pPage);
drh8b2f49b2001-06-08 00:21:52 +00002297 return SQLITE_OK;
2298 }
drh14acc042001-06-10 19:56:58 +00002299 /*
2300 ** If we get to here, it means the root page is overfull.
drh8b2f49b2001-06-08 00:21:52 +00002301 ** When this happens, Create a new child page and copy the
2302 ** contents of the root into the child. Then make the root
drh14acc042001-06-10 19:56:58 +00002303 ** page an empty page with rightChild pointing to the new
drh8b2f49b2001-06-08 00:21:52 +00002304 ** child. Then fall thru to the code below which will cause
2305 ** the overfull child page to be split.
2306 */
drh14acc042001-06-10 19:56:58 +00002307 rc = sqlitepager_write(pPage);
2308 if( rc ) return rc;
drhbea00b92002-07-08 10:59:50 +00002309 rc = allocatePage(pBt, &pChild, &pgnoChild, sqlitepager_pagenumber(pPage));
drh8b2f49b2001-06-08 00:21:52 +00002310 if( rc ) return rc;
drh6019e162001-07-02 17:51:45 +00002311 assert( sqlitepager_iswriteable(pChild) );
drh14acc042001-06-10 19:56:58 +00002312 copyPage(pChild, pPage);
2313 pChild->pParent = pPage;
drhbb49aba2003-01-04 18:53:27 +00002314 pChild->idxParent = 0;
drhdd793422001-06-28 01:54:48 +00002315 sqlitepager_ref(pPage);
drh14acc042001-06-10 19:56:58 +00002316 pChild->isOverfull = 1;
drh5edc3122001-09-13 21:53:09 +00002317 if( pCur && pCur->pPage==pPage ){
2318 sqlitepager_unref(pPage);
drh14acc042001-06-10 19:56:58 +00002319 pCur->pPage = pChild;
drh9ca7d3b2001-06-28 11:50:21 +00002320 }else{
2321 extraUnref = pChild;
drh8b2f49b2001-06-08 00:21:52 +00002322 }
drh0d316a42002-08-11 20:10:47 +00002323 zeroPage(pBt, pPage);
2324 pPage->u.hdr.rightChild = SWAB32(pBt, pgnoChild);
drh8b2f49b2001-06-08 00:21:52 +00002325 pParent = pPage;
2326 pPage = pChild;
drh8b2f49b2001-06-08 00:21:52 +00002327 }
drh6019e162001-07-02 17:51:45 +00002328 rc = sqlitepager_write(pParent);
2329 if( rc ) return rc;
drh7aa128d2002-06-21 13:09:16 +00002330 assert( pParent->isInit );
drh14acc042001-06-10 19:56:58 +00002331
drh8b2f49b2001-06-08 00:21:52 +00002332 /*
drh14acc042001-06-10 19:56:58 +00002333 ** Find the Cell in the parent page whose h.leftChild points back
2334 ** to pPage. The "idx" variable is the index of that cell. If pPage
2335 ** is the rightmost child of pParent then set idx to pParent->nCell
drh8b2f49b2001-06-08 00:21:52 +00002336 */
drhbb49aba2003-01-04 18:53:27 +00002337 if( pParent->idxShift ){
2338 Pgno pgno, swabPgno;
2339 pgno = sqlitepager_pagenumber(pPage);
2340 swabPgno = SWAB32(pBt, pgno);
2341 for(idx=0; idx<pParent->nCell; idx++){
2342 if( pParent->apCell[idx]->h.leftChild==swabPgno ){
2343 break;
2344 }
drh8b2f49b2001-06-08 00:21:52 +00002345 }
drhbb49aba2003-01-04 18:53:27 +00002346 assert( idx<pParent->nCell || pParent->u.hdr.rightChild==swabPgno );
2347 }else{
2348 idx = pPage->idxParent;
drh8b2f49b2001-06-08 00:21:52 +00002349 }
drh8b2f49b2001-06-08 00:21:52 +00002350
2351 /*
drh14acc042001-06-10 19:56:58 +00002352 ** Initialize variables so that it will be safe to jump
drh5edc3122001-09-13 21:53:09 +00002353 ** directly to balance_cleanup at any moment.
drh8b2f49b2001-06-08 00:21:52 +00002354 */
drh14acc042001-06-10 19:56:58 +00002355 nOld = nNew = 0;
2356 sqlitepager_ref(pParent);
2357
2358 /*
2359 ** Find sibling pages to pPage and the Cells in pParent that divide
drhc3b70572003-01-04 19:44:07 +00002360 ** the siblings. An attempt is made to find NN siblings on either
2361 ** side of pPage. More siblings are taken from one side, however, if
2362 ** pPage there are fewer than NN siblings on the other side. If pParent
2363 ** has NB or fewer children then all children of pParent are taken.
drh14acc042001-06-10 19:56:58 +00002364 */
drhc3b70572003-01-04 19:44:07 +00002365 nxDiv = idx - NN;
2366 if( nxDiv + NB > pParent->nCell ){
2367 nxDiv = pParent->nCell - NB + 1;
drh8b2f49b2001-06-08 00:21:52 +00002368 }
drhc3b70572003-01-04 19:44:07 +00002369 if( nxDiv<0 ){
2370 nxDiv = 0;
2371 }
drh8b2f49b2001-06-08 00:21:52 +00002372 nDiv = 0;
drhc3b70572003-01-04 19:44:07 +00002373 for(i=0, k=nxDiv; i<NB; i++, k++){
drh14acc042001-06-10 19:56:58 +00002374 if( k<pParent->nCell ){
2375 idxDiv[i] = k;
2376 apDiv[i] = pParent->apCell[k];
drh8b2f49b2001-06-08 00:21:52 +00002377 nDiv++;
drh0d316a42002-08-11 20:10:47 +00002378 pgnoOld[i] = SWAB32(pBt, apDiv[i]->h.leftChild);
drh14acc042001-06-10 19:56:58 +00002379 }else if( k==pParent->nCell ){
drh0d316a42002-08-11 20:10:47 +00002380 pgnoOld[i] = SWAB32(pBt, pParent->u.hdr.rightChild);
drh14acc042001-06-10 19:56:58 +00002381 }else{
2382 break;
drh8b2f49b2001-06-08 00:21:52 +00002383 }
drh8c42ca92001-06-22 19:15:00 +00002384 rc = sqlitepager_get(pBt->pPager, pgnoOld[i], (void**)&apOld[i]);
drh14acc042001-06-10 19:56:58 +00002385 if( rc ) goto balance_cleanup;
drh0d316a42002-08-11 20:10:47 +00002386 rc = initPage(pBt, apOld[i], pgnoOld[i], pParent);
drh6019e162001-07-02 17:51:45 +00002387 if( rc ) goto balance_cleanup;
drh428ae8c2003-01-04 16:48:09 +00002388 apOld[i]->idxParent = k;
drh14acc042001-06-10 19:56:58 +00002389 nOld++;
drh8b2f49b2001-06-08 00:21:52 +00002390 }
2391
2392 /*
drh14acc042001-06-10 19:56:58 +00002393 ** Set iCur to be the index in apCell[] of the cell that the cursor
2394 ** is pointing to. We will need this later on in order to keep the
drh5edc3122001-09-13 21:53:09 +00002395 ** cursor pointing at the same cell. If pCur points to a page that
2396 ** has no involvement with this rebalancing, then set iCur to a large
2397 ** number so that the iCur==j tests always fail in the main cell
2398 ** distribution loop below.
drh14acc042001-06-10 19:56:58 +00002399 */
2400 if( pCur ){
drh5edc3122001-09-13 21:53:09 +00002401 iCur = 0;
2402 for(i=0; i<nOld; i++){
2403 if( pCur->pPage==apOld[i] ){
2404 iCur += pCur->idx;
2405 break;
2406 }
2407 iCur += apOld[i]->nCell;
2408 if( i<nOld-1 && pCur->pPage==pParent && pCur->idx==idxDiv[i] ){
2409 break;
2410 }
2411 iCur++;
drh14acc042001-06-10 19:56:58 +00002412 }
drh5edc3122001-09-13 21:53:09 +00002413 pOldCurPage = pCur->pPage;
drh14acc042001-06-10 19:56:58 +00002414 }
2415
2416 /*
2417 ** Make copies of the content of pPage and its siblings into aOld[].
2418 ** The rest of this function will use data from the copies rather
2419 ** that the original pages since the original pages will be in the
2420 ** process of being overwritten.
2421 */
2422 for(i=0; i<nOld; i++){
2423 copyPage(&aOld[i], apOld[i]);
drh14acc042001-06-10 19:56:58 +00002424 }
2425
2426 /*
2427 ** Load pointers to all cells on sibling pages and the divider cells
2428 ** into the local apCell[] array. Make copies of the divider cells
2429 ** into aTemp[] and remove the the divider Cells from pParent.
drh8b2f49b2001-06-08 00:21:52 +00002430 */
2431 nCell = 0;
2432 for(i=0; i<nOld; i++){
drh6b308672002-07-08 02:16:37 +00002433 MemPage *pOld = &aOld[i];
drh8b2f49b2001-06-08 00:21:52 +00002434 for(j=0; j<pOld->nCell; j++){
drh14acc042001-06-10 19:56:58 +00002435 apCell[nCell] = pOld->apCell[j];
drh0d316a42002-08-11 20:10:47 +00002436 szCell[nCell] = cellSize(pBt, apCell[nCell]);
drh14acc042001-06-10 19:56:58 +00002437 nCell++;
drh8b2f49b2001-06-08 00:21:52 +00002438 }
2439 if( i<nOld-1 ){
drh0d316a42002-08-11 20:10:47 +00002440 szCell[nCell] = cellSize(pBt, apDiv[i]);
drh8c42ca92001-06-22 19:15:00 +00002441 memcpy(&aTemp[i], apDiv[i], szCell[nCell]);
drh14acc042001-06-10 19:56:58 +00002442 apCell[nCell] = &aTemp[i];
drh0d316a42002-08-11 20:10:47 +00002443 dropCell(pBt, pParent, nxDiv, szCell[nCell]);
2444 assert( SWAB32(pBt, apCell[nCell]->h.leftChild)==pgnoOld[i] );
drh14acc042001-06-10 19:56:58 +00002445 apCell[nCell]->h.leftChild = pOld->u.hdr.rightChild;
2446 nCell++;
drh8b2f49b2001-06-08 00:21:52 +00002447 }
2448 }
2449
2450 /*
drh6019e162001-07-02 17:51:45 +00002451 ** Figure out the number of pages needed to hold all nCell cells.
2452 ** Store this number in "k". Also compute szNew[] which is the total
2453 ** size of all cells on the i-th page and cntNew[] which is the index
2454 ** in apCell[] of the cell that divides path i from path i+1.
2455 ** cntNew[k] should equal nCell.
2456 **
2457 ** This little patch of code is critical for keeping the tree
2458 ** balanced.
drh8b2f49b2001-06-08 00:21:52 +00002459 */
drh6019e162001-07-02 17:51:45 +00002460 for(subtotal=k=i=0; i<nCell; i++){
2461 subtotal += szCell[i];
2462 if( subtotal > USABLE_SPACE ){
2463 szNew[k] = subtotal - szCell[i];
2464 cntNew[k] = i;
2465 subtotal = 0;
2466 k++;
2467 }
2468 }
2469 szNew[k] = subtotal;
2470 cntNew[k] = nCell;
2471 k++;
2472 for(i=k-1; i>0; i--){
2473 while( szNew[i]<USABLE_SPACE/2 ){
2474 cntNew[i-1]--;
2475 assert( cntNew[i-1]>0 );
2476 szNew[i] += szCell[cntNew[i-1]];
2477 szNew[i-1] -= szCell[cntNew[i-1]-1];
2478 }
2479 }
2480 assert( cntNew[0]>0 );
drh8b2f49b2001-06-08 00:21:52 +00002481
2482 /*
drh6b308672002-07-08 02:16:37 +00002483 ** Allocate k new pages. Reuse old pages where possible.
drh8b2f49b2001-06-08 00:21:52 +00002484 */
drh14acc042001-06-10 19:56:58 +00002485 for(i=0; i<k; i++){
drh6b308672002-07-08 02:16:37 +00002486 if( i<nOld ){
2487 apNew[i] = apOld[i];
2488 pgnoNew[i] = pgnoOld[i];
2489 apOld[i] = 0;
2490 sqlitepager_write(apNew[i]);
2491 }else{
drhbea00b92002-07-08 10:59:50 +00002492 rc = allocatePage(pBt, &apNew[i], &pgnoNew[i], pgnoNew[i-1]);
drh6b308672002-07-08 02:16:37 +00002493 if( rc ) goto balance_cleanup;
2494 }
drh14acc042001-06-10 19:56:58 +00002495 nNew++;
drh0d316a42002-08-11 20:10:47 +00002496 zeroPage(pBt, apNew[i]);
drh6019e162001-07-02 17:51:45 +00002497 apNew[i]->isInit = 1;
drh8b2f49b2001-06-08 00:21:52 +00002498 }
2499
drh6b308672002-07-08 02:16:37 +00002500 /* Free any old pages that were not reused as new pages.
2501 */
2502 while( i<nOld ){
2503 rc = freePage(pBt, apOld[i], pgnoOld[i]);
2504 if( rc ) goto balance_cleanup;
2505 sqlitepager_unref(apOld[i]);
2506 apOld[i] = 0;
2507 i++;
2508 }
2509
drh8b2f49b2001-06-08 00:21:52 +00002510 /*
drhf9ffac92002-03-02 19:00:31 +00002511 ** Put the new pages in accending order. This helps to
2512 ** keep entries in the disk file in order so that a scan
2513 ** of the table is a linear scan through the file. That
2514 ** in turn helps the operating system to deliver pages
2515 ** from the disk more rapidly.
2516 **
2517 ** An O(n^2) insertion sort algorithm is used, but since
drhc3b70572003-01-04 19:44:07 +00002518 ** n is never more than NB (a small constant), that should
2519 ** not be a problem.
drhf9ffac92002-03-02 19:00:31 +00002520 **
drhc3b70572003-01-04 19:44:07 +00002521 ** When NB==3, this one optimization makes the database
2522 ** about 25% faster for large insertions and deletions.
drhf9ffac92002-03-02 19:00:31 +00002523 */
2524 for(i=0; i<k-1; i++){
2525 int minV = pgnoNew[i];
2526 int minI = i;
2527 for(j=i+1; j<k; j++){
drh7d02cb72003-06-04 16:24:39 +00002528 if( pgnoNew[j]<(unsigned)minV ){
drhf9ffac92002-03-02 19:00:31 +00002529 minI = j;
2530 minV = pgnoNew[j];
2531 }
2532 }
2533 if( minI>i ){
2534 int t;
2535 MemPage *pT;
2536 t = pgnoNew[i];
2537 pT = apNew[i];
2538 pgnoNew[i] = pgnoNew[minI];
2539 apNew[i] = apNew[minI];
2540 pgnoNew[minI] = t;
2541 apNew[minI] = pT;
2542 }
2543 }
2544
2545 /*
drh14acc042001-06-10 19:56:58 +00002546 ** Evenly distribute the data in apCell[] across the new pages.
2547 ** Insert divider cells into pParent as necessary.
2548 */
2549 j = 0;
2550 for(i=0; i<nNew; i++){
2551 MemPage *pNew = apNew[i];
drh6019e162001-07-02 17:51:45 +00002552 while( j<cntNew[i] ){
2553 assert( pNew->nFree>=szCell[j] );
drh14acc042001-06-10 19:56:58 +00002554 if( pCur && iCur==j ){ pCur->pPage = pNew; pCur->idx = pNew->nCell; }
drh0d316a42002-08-11 20:10:47 +00002555 insertCell(pBt, pNew, pNew->nCell, apCell[j], szCell[j]);
drh14acc042001-06-10 19:56:58 +00002556 j++;
2557 }
drh6019e162001-07-02 17:51:45 +00002558 assert( pNew->nCell>0 );
drh14acc042001-06-10 19:56:58 +00002559 assert( !pNew->isOverfull );
drh0d316a42002-08-11 20:10:47 +00002560 relinkCellList(pBt, pNew);
drh14acc042001-06-10 19:56:58 +00002561 if( i<nNew-1 && j<nCell ){
2562 pNew->u.hdr.rightChild = apCell[j]->h.leftChild;
drh0d316a42002-08-11 20:10:47 +00002563 apCell[j]->h.leftChild = SWAB32(pBt, pgnoNew[i]);
drh14acc042001-06-10 19:56:58 +00002564 if( pCur && iCur==j ){ pCur->pPage = pParent; pCur->idx = nxDiv; }
drh0d316a42002-08-11 20:10:47 +00002565 insertCell(pBt, pParent, nxDiv, apCell[j], szCell[j]);
drh14acc042001-06-10 19:56:58 +00002566 j++;
2567 nxDiv++;
2568 }
2569 }
drh6019e162001-07-02 17:51:45 +00002570 assert( j==nCell );
drh6b308672002-07-08 02:16:37 +00002571 apNew[nNew-1]->u.hdr.rightChild = aOld[nOld-1].u.hdr.rightChild;
drh14acc042001-06-10 19:56:58 +00002572 if( nxDiv==pParent->nCell ){
drh0d316a42002-08-11 20:10:47 +00002573 pParent->u.hdr.rightChild = SWAB32(pBt, pgnoNew[nNew-1]);
drh14acc042001-06-10 19:56:58 +00002574 }else{
drh0d316a42002-08-11 20:10:47 +00002575 pParent->apCell[nxDiv]->h.leftChild = SWAB32(pBt, pgnoNew[nNew-1]);
drh14acc042001-06-10 19:56:58 +00002576 }
2577 if( pCur ){
drh3fc190c2001-09-14 03:24:23 +00002578 if( j<=iCur && pCur->pPage==pParent && pCur->idx>idxDiv[nOld-1] ){
2579 assert( pCur->pPage==pOldCurPage );
2580 pCur->idx += nNew - nOld;
2581 }else{
2582 assert( pOldCurPage!=0 );
2583 sqlitepager_ref(pCur->pPage);
2584 sqlitepager_unref(pOldCurPage);
2585 }
drh14acc042001-06-10 19:56:58 +00002586 }
2587
2588 /*
2589 ** Reparent children of all cells.
drh8b2f49b2001-06-08 00:21:52 +00002590 */
2591 for(i=0; i<nNew; i++){
drh0d316a42002-08-11 20:10:47 +00002592 reparentChildPages(pBt, apNew[i]);
drh8b2f49b2001-06-08 00:21:52 +00002593 }
drh0d316a42002-08-11 20:10:47 +00002594 reparentChildPages(pBt, pParent);
drh8b2f49b2001-06-08 00:21:52 +00002595
2596 /*
drh14acc042001-06-10 19:56:58 +00002597 ** balance the parent page.
drh8b2f49b2001-06-08 00:21:52 +00002598 */
drh5edc3122001-09-13 21:53:09 +00002599 rc = balance(pBt, pParent, pCur);
drh8b2f49b2001-06-08 00:21:52 +00002600
2601 /*
drh14acc042001-06-10 19:56:58 +00002602 ** Cleanup before returning.
drh8b2f49b2001-06-08 00:21:52 +00002603 */
drh14acc042001-06-10 19:56:58 +00002604balance_cleanup:
drh9ca7d3b2001-06-28 11:50:21 +00002605 if( extraUnref ){
2606 sqlitepager_unref(extraUnref);
2607 }
drh8b2f49b2001-06-08 00:21:52 +00002608 for(i=0; i<nOld; i++){
drh6b308672002-07-08 02:16:37 +00002609 if( apOld[i]!=0 && apOld[i]!=&aOld[i] ) sqlitepager_unref(apOld[i]);
drh8b2f49b2001-06-08 00:21:52 +00002610 }
drh14acc042001-06-10 19:56:58 +00002611 for(i=0; i<nNew; i++){
2612 sqlitepager_unref(apNew[i]);
drh8b2f49b2001-06-08 00:21:52 +00002613 }
drh14acc042001-06-10 19:56:58 +00002614 if( pCur && pCur->pPage==0 ){
2615 pCur->pPage = pParent;
2616 pCur->idx = 0;
2617 }else{
2618 sqlitepager_unref(pParent);
drh8b2f49b2001-06-08 00:21:52 +00002619 }
2620 return rc;
2621}
2622
2623/*
drhf74b8d92002-09-01 23:20:45 +00002624** This routine checks all cursors that point to the same table
2625** as pCur points to. If any of those cursors were opened with
2626** wrFlag==0 then this routine returns SQLITE_LOCKED. If all
2627** cursors point to the same table were opened with wrFlag==1
2628** then this routine returns SQLITE_OK.
2629**
2630** In addition to checking for read-locks (where a read-lock
2631** means a cursor opened with wrFlag==0) this routine also moves
2632** all cursors other than pCur so that they are pointing to the
2633** first Cell on root page. This is necessary because an insert
2634** or delete might change the number of cells on a page or delete
2635** a page entirely and we do not want to leave any cursors
2636** pointing to non-existant pages or cells.
2637*/
2638static int checkReadLocks(BtCursor *pCur){
2639 BtCursor *p;
2640 assert( pCur->wrFlag );
2641 for(p=pCur->pShared; p!=pCur; p=p->pShared){
2642 assert( p );
2643 assert( p->pgnoRoot==pCur->pgnoRoot );
2644 if( p->wrFlag==0 ) return SQLITE_LOCKED;
2645 if( sqlitepager_pagenumber(p->pPage)!=p->pgnoRoot ){
2646 moveToRoot(p);
2647 }
2648 }
2649 return SQLITE_OK;
2650}
2651
2652/*
drh3b7511c2001-05-26 13:15:44 +00002653** Insert a new record into the BTree. The key is given by (pKey,nKey)
2654** and the data is given by (pData,nData). The cursor is used only to
2655** define what database the record should be inserted into. The cursor
drh14acc042001-06-10 19:56:58 +00002656** is left pointing at the new record.
drh3b7511c2001-05-26 13:15:44 +00002657*/
drh144f9ea2003-04-16 01:28:16 +00002658static int fileBtreeInsert(
drh5c4d9702001-08-20 00:33:58 +00002659 BtCursor *pCur, /* Insert data into the table of this cursor */
drhbe0072d2001-09-13 14:46:09 +00002660 const void *pKey, int nKey, /* The key of the new record */
drh5c4d9702001-08-20 00:33:58 +00002661 const void *pData, int nData /* The data of the new record */
drh3b7511c2001-05-26 13:15:44 +00002662){
2663 Cell newCell;
2664 int rc;
2665 int loc;
drh14acc042001-06-10 19:56:58 +00002666 int szNew;
drh3b7511c2001-05-26 13:15:44 +00002667 MemPage *pPage;
2668 Btree *pBt = pCur->pBt;
2669
drhecdc7532001-09-23 02:35:53 +00002670 if( pCur->pPage==0 ){
2671 return SQLITE_ABORT; /* A rollback destroyed this cursor */
2672 }
drhf74b8d92002-09-01 23:20:45 +00002673 if( !pBt->inTrans || nKey+nData==0 ){
2674 /* Must start a transaction before doing an insert */
2675 return pBt->readOnly ? SQLITE_READONLY : SQLITE_ERROR;
drh8b2f49b2001-06-08 00:21:52 +00002676 }
drhf74b8d92002-09-01 23:20:45 +00002677 assert( !pBt->readOnly );
drhecdc7532001-09-23 02:35:53 +00002678 if( !pCur->wrFlag ){
2679 return SQLITE_PERM; /* Cursor not open for writing */
2680 }
drhf74b8d92002-09-01 23:20:45 +00002681 if( checkReadLocks(pCur) ){
2682 return SQLITE_LOCKED; /* The table pCur points to has a read lock */
2683 }
drh144f9ea2003-04-16 01:28:16 +00002684 rc = fileBtreeMoveto(pCur, pKey, nKey, &loc);
drh3b7511c2001-05-26 13:15:44 +00002685 if( rc ) return rc;
drh14acc042001-06-10 19:56:58 +00002686 pPage = pCur->pPage;
drh7aa128d2002-06-21 13:09:16 +00002687 assert( pPage->isInit );
drh14acc042001-06-10 19:56:58 +00002688 rc = sqlitepager_write(pPage);
drhbd03cae2001-06-02 02:40:57 +00002689 if( rc ) return rc;
drh3b7511c2001-05-26 13:15:44 +00002690 rc = fillInCell(pBt, &newCell, pKey, nKey, pData, nData);
2691 if( rc ) return rc;
drh0d316a42002-08-11 20:10:47 +00002692 szNew = cellSize(pBt, &newCell);
drh3b7511c2001-05-26 13:15:44 +00002693 if( loc==0 ){
drh14acc042001-06-10 19:56:58 +00002694 newCell.h.leftChild = pPage->apCell[pCur->idx]->h.leftChild;
2695 rc = clearCell(pBt, pPage->apCell[pCur->idx]);
drh5e2f8b92001-05-28 00:41:15 +00002696 if( rc ) return rc;
drh0d316a42002-08-11 20:10:47 +00002697 dropCell(pBt, pPage, pCur->idx, cellSize(pBt, pPage->apCell[pCur->idx]));
drh7c717f72001-06-24 20:39:41 +00002698 }else if( loc<0 && pPage->nCell>0 ){
drh14acc042001-06-10 19:56:58 +00002699 assert( pPage->u.hdr.rightChild==0 ); /* Must be a leaf page */
2700 pCur->idx++;
2701 }else{
2702 assert( pPage->u.hdr.rightChild==0 ); /* Must be a leaf page */
drh3b7511c2001-05-26 13:15:44 +00002703 }
drh0d316a42002-08-11 20:10:47 +00002704 insertCell(pBt, pPage, pCur->idx, &newCell, szNew);
drh14acc042001-06-10 19:56:58 +00002705 rc = balance(pCur->pBt, pPage, pCur);
drh3fc190c2001-09-14 03:24:23 +00002706 /* sqliteBtreePageDump(pCur->pBt, pCur->pgnoRoot, 1); */
2707 /* fflush(stdout); */
drh2dcc9aa2002-12-04 13:40:25 +00002708 pCur->eSkip = SKIP_INVALID;
drh5e2f8b92001-05-28 00:41:15 +00002709 return rc;
2710}
2711
2712/*
drhbd03cae2001-06-02 02:40:57 +00002713** Delete the entry that the cursor is pointing to.
drh5e2f8b92001-05-28 00:41:15 +00002714**
drhbd03cae2001-06-02 02:40:57 +00002715** The cursor is left pointing at either the next or the previous
2716** entry. If the cursor is left pointing to the next entry, then
drh2dcc9aa2002-12-04 13:40:25 +00002717** the pCur->eSkip flag is set to SKIP_NEXT which forces the next call to
drhbd03cae2001-06-02 02:40:57 +00002718** sqliteBtreeNext() to be a no-op. That way, you can always call
2719** sqliteBtreeNext() after a delete and the cursor will be left
drh2dcc9aa2002-12-04 13:40:25 +00002720** pointing to the first entry after the deleted entry. Similarly,
2721** pCur->eSkip is set to SKIP_PREV is the cursor is left pointing to
2722** the entry prior to the deleted entry so that a subsequent call to
2723** sqliteBtreePrevious() will always leave the cursor pointing at the
2724** entry immediately before the one that was deleted.
drh3b7511c2001-05-26 13:15:44 +00002725*/
drh144f9ea2003-04-16 01:28:16 +00002726static int fileBtreeDelete(BtCursor *pCur){
drh5e2f8b92001-05-28 00:41:15 +00002727 MemPage *pPage = pCur->pPage;
2728 Cell *pCell;
2729 int rc;
drh8c42ca92001-06-22 19:15:00 +00002730 Pgno pgnoChild;
drh0d316a42002-08-11 20:10:47 +00002731 Btree *pBt = pCur->pBt;
drh8b2f49b2001-06-08 00:21:52 +00002732
drh7aa128d2002-06-21 13:09:16 +00002733 assert( pPage->isInit );
drhecdc7532001-09-23 02:35:53 +00002734 if( pCur->pPage==0 ){
2735 return SQLITE_ABORT; /* A rollback destroyed this cursor */
2736 }
drhf74b8d92002-09-01 23:20:45 +00002737 if( !pBt->inTrans ){
2738 /* Must start a transaction before doing a delete */
2739 return pBt->readOnly ? SQLITE_READONLY : SQLITE_ERROR;
drh8b2f49b2001-06-08 00:21:52 +00002740 }
drhf74b8d92002-09-01 23:20:45 +00002741 assert( !pBt->readOnly );
drhbd03cae2001-06-02 02:40:57 +00002742 if( pCur->idx >= pPage->nCell ){
2743 return SQLITE_ERROR; /* The cursor is not pointing to anything */
2744 }
drhecdc7532001-09-23 02:35:53 +00002745 if( !pCur->wrFlag ){
2746 return SQLITE_PERM; /* Did not open this cursor for writing */
2747 }
drhf74b8d92002-09-01 23:20:45 +00002748 if( checkReadLocks(pCur) ){
2749 return SQLITE_LOCKED; /* The table pCur points to has a read lock */
2750 }
drhbd03cae2001-06-02 02:40:57 +00002751 rc = sqlitepager_write(pPage);
2752 if( rc ) return rc;
drh5e2f8b92001-05-28 00:41:15 +00002753 pCell = pPage->apCell[pCur->idx];
drh0d316a42002-08-11 20:10:47 +00002754 pgnoChild = SWAB32(pBt, pCell->h.leftChild);
2755 clearCell(pBt, pCell);
drh14acc042001-06-10 19:56:58 +00002756 if( pgnoChild ){
2757 /*
drh5e00f6c2001-09-13 13:46:56 +00002758 ** The entry we are about to delete is not a leaf so if we do not
drh9ca7d3b2001-06-28 11:50:21 +00002759 ** do something we will leave a hole on an internal page.
2760 ** We have to fill the hole by moving in a cell from a leaf. The
2761 ** next Cell after the one to be deleted is guaranteed to exist and
2762 ** to be a leaf so we can use it.
drh5e2f8b92001-05-28 00:41:15 +00002763 */
drh14acc042001-06-10 19:56:58 +00002764 BtCursor leafCur;
2765 Cell *pNext;
2766 int szNext;
drh8c1238a2003-01-02 14:43:55 +00002767 int notUsed;
drh14acc042001-06-10 19:56:58 +00002768 getTempCursor(pCur, &leafCur);
drh144f9ea2003-04-16 01:28:16 +00002769 rc = fileBtreeNext(&leafCur, &notUsed);
drh14acc042001-06-10 19:56:58 +00002770 if( rc!=SQLITE_OK ){
drh8a6ac0a2004-02-14 17:35:07 +00002771 if( rc!=SQLITE_NOMEM ) rc = SQLITE_CORRUPT;
2772 return rc;
drh5e2f8b92001-05-28 00:41:15 +00002773 }
drh6019e162001-07-02 17:51:45 +00002774 rc = sqlitepager_write(leafCur.pPage);
2775 if( rc ) return rc;
drh0d316a42002-08-11 20:10:47 +00002776 dropCell(pBt, pPage, pCur->idx, cellSize(pBt, pCell));
drh8c42ca92001-06-22 19:15:00 +00002777 pNext = leafCur.pPage->apCell[leafCur.idx];
drh0d316a42002-08-11 20:10:47 +00002778 szNext = cellSize(pBt, pNext);
2779 pNext->h.leftChild = SWAB32(pBt, pgnoChild);
2780 insertCell(pBt, pPage, pCur->idx, pNext, szNext);
2781 rc = balance(pBt, pPage, pCur);
drh5e2f8b92001-05-28 00:41:15 +00002782 if( rc ) return rc;
drh2dcc9aa2002-12-04 13:40:25 +00002783 pCur->eSkip = SKIP_NEXT;
drh0d316a42002-08-11 20:10:47 +00002784 dropCell(pBt, leafCur.pPage, leafCur.idx, szNext);
2785 rc = balance(pBt, leafCur.pPage, pCur);
drh8c42ca92001-06-22 19:15:00 +00002786 releaseTempCursor(&leafCur);
drh5e2f8b92001-05-28 00:41:15 +00002787 }else{
drh0d316a42002-08-11 20:10:47 +00002788 dropCell(pBt, pPage, pCur->idx, cellSize(pBt, pCell));
drh5edc3122001-09-13 21:53:09 +00002789 if( pCur->idx>=pPage->nCell ){
2790 pCur->idx = pPage->nCell-1;
drhf5bf0a72001-11-23 00:24:12 +00002791 if( pCur->idx<0 ){
2792 pCur->idx = 0;
drh2dcc9aa2002-12-04 13:40:25 +00002793 pCur->eSkip = SKIP_NEXT;
drhf5bf0a72001-11-23 00:24:12 +00002794 }else{
drh2dcc9aa2002-12-04 13:40:25 +00002795 pCur->eSkip = SKIP_PREV;
drhf5bf0a72001-11-23 00:24:12 +00002796 }
drh6019e162001-07-02 17:51:45 +00002797 }else{
drh2dcc9aa2002-12-04 13:40:25 +00002798 pCur->eSkip = SKIP_NEXT;
drh6019e162001-07-02 17:51:45 +00002799 }
drh0d316a42002-08-11 20:10:47 +00002800 rc = balance(pBt, pPage, pCur);
drh5e2f8b92001-05-28 00:41:15 +00002801 }
drh5e2f8b92001-05-28 00:41:15 +00002802 return rc;
drh3b7511c2001-05-26 13:15:44 +00002803}
drh8b2f49b2001-06-08 00:21:52 +00002804
2805/*
drhc6b52df2002-01-04 03:09:29 +00002806** Create a new BTree table. Write into *piTable the page
2807** number for the root page of the new table.
2808**
2809** In the current implementation, BTree tables and BTree indices are the
drh144f9ea2003-04-16 01:28:16 +00002810** the same. In the future, we may change this so that BTree tables
drhc6b52df2002-01-04 03:09:29 +00002811** are restricted to having a 4-byte integer key and arbitrary data and
2812** BTree indices are restricted to having an arbitrary key and no data.
drh144f9ea2003-04-16 01:28:16 +00002813** But for now, this routine also serves to create indices.
drh8b2f49b2001-06-08 00:21:52 +00002814*/
drh144f9ea2003-04-16 01:28:16 +00002815static int fileBtreeCreateTable(Btree *pBt, int *piTable){
drh8b2f49b2001-06-08 00:21:52 +00002816 MemPage *pRoot;
2817 Pgno pgnoRoot;
2818 int rc;
2819 if( !pBt->inTrans ){
drhf74b8d92002-09-01 23:20:45 +00002820 /* Must start a transaction first */
2821 return pBt->readOnly ? SQLITE_READONLY : SQLITE_ERROR;
drh8b2f49b2001-06-08 00:21:52 +00002822 }
drh5df72a52002-06-06 23:16:05 +00002823 if( pBt->readOnly ){
2824 return SQLITE_READONLY;
2825 }
drhbea00b92002-07-08 10:59:50 +00002826 rc = allocatePage(pBt, &pRoot, &pgnoRoot, 0);
drh8b2f49b2001-06-08 00:21:52 +00002827 if( rc ) return rc;
drh6019e162001-07-02 17:51:45 +00002828 assert( sqlitepager_iswriteable(pRoot) );
drh0d316a42002-08-11 20:10:47 +00002829 zeroPage(pBt, pRoot);
drh8b2f49b2001-06-08 00:21:52 +00002830 sqlitepager_unref(pRoot);
2831 *piTable = (int)pgnoRoot;
2832 return SQLITE_OK;
2833}
2834
2835/*
2836** Erase the given database page and all its children. Return
2837** the page to the freelist.
2838*/
drh2aa679f2001-06-25 02:11:07 +00002839static int clearDatabasePage(Btree *pBt, Pgno pgno, int freePageFlag){
drh8b2f49b2001-06-08 00:21:52 +00002840 MemPage *pPage;
2841 int rc;
drh8b2f49b2001-06-08 00:21:52 +00002842 Cell *pCell;
2843 int idx;
2844
drh8c42ca92001-06-22 19:15:00 +00002845 rc = sqlitepager_get(pBt->pPager, pgno, (void**)&pPage);
drh8b2f49b2001-06-08 00:21:52 +00002846 if( rc ) return rc;
drh6019e162001-07-02 17:51:45 +00002847 rc = sqlitepager_write(pPage);
2848 if( rc ) return rc;
drh0d316a42002-08-11 20:10:47 +00002849 rc = initPage(pBt, pPage, pgno, 0);
drh7aa128d2002-06-21 13:09:16 +00002850 if( rc ) return rc;
drh0d316a42002-08-11 20:10:47 +00002851 idx = SWAB16(pBt, pPage->u.hdr.firstCell);
drh8b2f49b2001-06-08 00:21:52 +00002852 while( idx>0 ){
drh14acc042001-06-10 19:56:58 +00002853 pCell = (Cell*)&pPage->u.aDisk[idx];
drh0d316a42002-08-11 20:10:47 +00002854 idx = SWAB16(pBt, pCell->h.iNext);
drh8b2f49b2001-06-08 00:21:52 +00002855 if( pCell->h.leftChild ){
drh0d316a42002-08-11 20:10:47 +00002856 rc = clearDatabasePage(pBt, SWAB32(pBt, pCell->h.leftChild), 1);
drh8b2f49b2001-06-08 00:21:52 +00002857 if( rc ) return rc;
2858 }
drh8c42ca92001-06-22 19:15:00 +00002859 rc = clearCell(pBt, pCell);
drh8b2f49b2001-06-08 00:21:52 +00002860 if( rc ) return rc;
2861 }
drh2aa679f2001-06-25 02:11:07 +00002862 if( pPage->u.hdr.rightChild ){
drh0d316a42002-08-11 20:10:47 +00002863 rc = clearDatabasePage(pBt, SWAB32(pBt, pPage->u.hdr.rightChild), 1);
drh2aa679f2001-06-25 02:11:07 +00002864 if( rc ) return rc;
2865 }
2866 if( freePageFlag ){
2867 rc = freePage(pBt, pPage, pgno);
2868 }else{
drh0d316a42002-08-11 20:10:47 +00002869 zeroPage(pBt, pPage);
drh2aa679f2001-06-25 02:11:07 +00002870 }
drhdd793422001-06-28 01:54:48 +00002871 sqlitepager_unref(pPage);
drh2aa679f2001-06-25 02:11:07 +00002872 return rc;
drh8b2f49b2001-06-08 00:21:52 +00002873}
2874
2875/*
2876** Delete all information from a single table in the database.
2877*/
drh144f9ea2003-04-16 01:28:16 +00002878static int fileBtreeClearTable(Btree *pBt, int iTable){
drh8b2f49b2001-06-08 00:21:52 +00002879 int rc;
drhf74b8d92002-09-01 23:20:45 +00002880 BtCursor *pCur;
drh8b2f49b2001-06-08 00:21:52 +00002881 if( !pBt->inTrans ){
drhf74b8d92002-09-01 23:20:45 +00002882 return pBt->readOnly ? SQLITE_READONLY : SQLITE_ERROR;
drh8b2f49b2001-06-08 00:21:52 +00002883 }
drhf74b8d92002-09-01 23:20:45 +00002884 for(pCur=pBt->pCursor; pCur; pCur=pCur->pNext){
2885 if( pCur->pgnoRoot==(Pgno)iTable ){
2886 if( pCur->wrFlag==0 ) return SQLITE_LOCKED;
2887 moveToRoot(pCur);
2888 }
drhecdc7532001-09-23 02:35:53 +00002889 }
drh2aa679f2001-06-25 02:11:07 +00002890 rc = clearDatabasePage(pBt, (Pgno)iTable, 0);
drh8b2f49b2001-06-08 00:21:52 +00002891 if( rc ){
drh144f9ea2003-04-16 01:28:16 +00002892 fileBtreeRollback(pBt);
drh8b2f49b2001-06-08 00:21:52 +00002893 }
drh8c42ca92001-06-22 19:15:00 +00002894 return rc;
drh8b2f49b2001-06-08 00:21:52 +00002895}
2896
2897/*
2898** Erase all information in a table and add the root of the table to
2899** the freelist. Except, the root of the principle table (the one on
2900** page 2) is never added to the freelist.
2901*/
drh144f9ea2003-04-16 01:28:16 +00002902static int fileBtreeDropTable(Btree *pBt, int iTable){
drh8b2f49b2001-06-08 00:21:52 +00002903 int rc;
2904 MemPage *pPage;
drhf74b8d92002-09-01 23:20:45 +00002905 BtCursor *pCur;
drh8b2f49b2001-06-08 00:21:52 +00002906 if( !pBt->inTrans ){
drhf74b8d92002-09-01 23:20:45 +00002907 return pBt->readOnly ? SQLITE_READONLY : SQLITE_ERROR;
drh8b2f49b2001-06-08 00:21:52 +00002908 }
drhf74b8d92002-09-01 23:20:45 +00002909 for(pCur=pBt->pCursor; pCur; pCur=pCur->pNext){
2910 if( pCur->pgnoRoot==(Pgno)iTable ){
2911 return SQLITE_LOCKED; /* Cannot drop a table that has a cursor */
2912 }
drh5df72a52002-06-06 23:16:05 +00002913 }
drh8c42ca92001-06-22 19:15:00 +00002914 rc = sqlitepager_get(pBt->pPager, (Pgno)iTable, (void**)&pPage);
drh2aa679f2001-06-25 02:11:07 +00002915 if( rc ) return rc;
drh144f9ea2003-04-16 01:28:16 +00002916 rc = fileBtreeClearTable(pBt, iTable);
drh2aa679f2001-06-25 02:11:07 +00002917 if( rc ) return rc;
2918 if( iTable>2 ){
2919 rc = freePage(pBt, pPage, iTable);
2920 }else{
drh0d316a42002-08-11 20:10:47 +00002921 zeroPage(pBt, pPage);
drh8b2f49b2001-06-08 00:21:52 +00002922 }
drhdd793422001-06-28 01:54:48 +00002923 sqlitepager_unref(pPage);
drh8b2f49b2001-06-08 00:21:52 +00002924 return rc;
2925}
2926
drh001bbcb2003-03-19 03:14:00 +00002927#if 0 /* UNTESTED */
2928/*
2929** Copy all cell data from one database file into another.
2930** pages back the freelist.
2931*/
2932static int copyCell(Btree *pBtFrom, BTree *pBtTo, Cell *pCell){
2933 Pager *pFromPager = pBtFrom->pPager;
2934 OverflowPage *pOvfl;
2935 Pgno ovfl, nextOvfl;
2936 Pgno *pPrev;
2937 int rc = SQLITE_OK;
2938 MemPage *pNew, *pPrevPg;
2939 Pgno new;
2940
2941 if( NKEY(pBtTo, pCell->h) + NDATA(pBtTo, pCell->h) <= MX_LOCAL_PAYLOAD ){
2942 return SQLITE_OK;
2943 }
2944 pPrev = &pCell->ovfl;
2945 pPrevPg = 0;
2946 ovfl = SWAB32(pBtTo, pCell->ovfl);
2947 while( ovfl && rc==SQLITE_OK ){
2948 rc = sqlitepager_get(pFromPager, ovfl, (void**)&pOvfl);
2949 if( rc ) return rc;
2950 nextOvfl = SWAB32(pBtFrom, pOvfl->iNext);
2951 rc = allocatePage(pBtTo, &pNew, &new, 0);
2952 if( rc==SQLITE_OK ){
2953 rc = sqlitepager_write(pNew);
2954 if( rc==SQLITE_OK ){
drhd0ba1932004-02-10 01:54:28 +00002955 memcpy(pNew, pOvfl, SQLITE_USABLE_SIZE);
drh001bbcb2003-03-19 03:14:00 +00002956 *pPrev = SWAB32(pBtTo, new);
2957 if( pPrevPg ){
2958 sqlitepager_unref(pPrevPg);
2959 }
2960 pPrev = &pOvfl->iNext;
2961 pPrevPg = pNew;
2962 }
2963 }
2964 sqlitepager_unref(pOvfl);
2965 ovfl = nextOvfl;
2966 }
2967 if( pPrevPg ){
2968 sqlitepager_unref(pPrevPg);
2969 }
2970 return rc;
2971}
2972#endif
2973
2974
2975#if 0 /* UNTESTED */
2976/*
2977** Copy a page of data from one database over to another.
2978*/
2979static int copyDatabasePage(
2980 Btree *pBtFrom,
2981 Pgno pgnoFrom,
2982 Btree *pBtTo,
2983 Pgno *pTo
2984){
2985 MemPage *pPageFrom, *pPage;
2986 Pgno to;
2987 int rc;
2988 Cell *pCell;
2989 int idx;
2990
2991 rc = sqlitepager_get(pBtFrom->pPager, pgno, (void**)&pPageFrom);
2992 if( rc ) return rc;
2993 rc = allocatePage(pBt, &pPage, pTo, 0);
2994 if( rc==SQLITE_OK ){
2995 rc = sqlitepager_write(pPage);
2996 }
2997 if( rc==SQLITE_OK ){
drhd0ba1932004-02-10 01:54:28 +00002998 memcpy(pPage, pPageFrom, SQLITE_USABLE_SIZE);
drh001bbcb2003-03-19 03:14:00 +00002999 idx = SWAB16(pBt, pPage->u.hdr.firstCell);
3000 while( idx>0 ){
3001 pCell = (Cell*)&pPage->u.aDisk[idx];
3002 idx = SWAB16(pBt, pCell->h.iNext);
3003 if( pCell->h.leftChild ){
3004 Pgno newChld;
3005 rc = copyDatabasePage(pBtFrom, SWAB32(pBtFrom, pCell->h.leftChild),
3006 pBtTo, &newChld);
3007 if( rc ) return rc;
3008 pCell->h.leftChild = SWAB32(pBtFrom, newChld);
3009 }
3010 rc = copyCell(pBtFrom, pBtTo, pCell);
3011 if( rc ) return rc;
3012 }
3013 if( pPage->u.hdr.rightChild ){
3014 Pgno newChld;
3015 rc = copyDatabasePage(pBtFrom, SWAB32(pBtFrom, pPage->u.hdr.rightChild),
3016 pBtTo, &newChld);
3017 if( rc ) return rc;
3018 pPage->u.hdr.rightChild = SWAB32(pBtTo, newChild);
3019 }
3020 }
3021 sqlitepager_unref(pPage);
3022 return rc;
3023}
3024#endif
3025
drh8b2f49b2001-06-08 00:21:52 +00003026/*
3027** Read the meta-information out of a database file.
3028*/
drh144f9ea2003-04-16 01:28:16 +00003029static int fileBtreeGetMeta(Btree *pBt, int *aMeta){
drh8b2f49b2001-06-08 00:21:52 +00003030 PageOne *pP1;
3031 int rc;
drh0d316a42002-08-11 20:10:47 +00003032 int i;
drh8b2f49b2001-06-08 00:21:52 +00003033
drh8c42ca92001-06-22 19:15:00 +00003034 rc = sqlitepager_get(pBt->pPager, 1, (void**)&pP1);
drh8b2f49b2001-06-08 00:21:52 +00003035 if( rc ) return rc;
drh0d316a42002-08-11 20:10:47 +00003036 aMeta[0] = SWAB32(pBt, pP1->nFree);
3037 for(i=0; i<sizeof(pP1->aMeta)/sizeof(pP1->aMeta[0]); i++){
3038 aMeta[i+1] = SWAB32(pBt, pP1->aMeta[i]);
3039 }
drh8b2f49b2001-06-08 00:21:52 +00003040 sqlitepager_unref(pP1);
3041 return SQLITE_OK;
3042}
3043
3044/*
3045** Write meta-information back into the database.
3046*/
drh144f9ea2003-04-16 01:28:16 +00003047static int fileBtreeUpdateMeta(Btree *pBt, int *aMeta){
drh8b2f49b2001-06-08 00:21:52 +00003048 PageOne *pP1;
drh0d316a42002-08-11 20:10:47 +00003049 int rc, i;
drh8b2f49b2001-06-08 00:21:52 +00003050 if( !pBt->inTrans ){
drhf74b8d92002-09-01 23:20:45 +00003051 return pBt->readOnly ? SQLITE_READONLY : SQLITE_ERROR;
drh5df72a52002-06-06 23:16:05 +00003052 }
drh8b2f49b2001-06-08 00:21:52 +00003053 pP1 = pBt->page1;
3054 rc = sqlitepager_write(pP1);
drh9adf9ac2002-05-15 11:44:13 +00003055 if( rc ) return rc;
drh0d316a42002-08-11 20:10:47 +00003056 for(i=0; i<sizeof(pP1->aMeta)/sizeof(pP1->aMeta[0]); i++){
3057 pP1->aMeta[i] = SWAB32(pBt, aMeta[i+1]);
3058 }
drh8b2f49b2001-06-08 00:21:52 +00003059 return SQLITE_OK;
3060}
drh8c42ca92001-06-22 19:15:00 +00003061
drh5eddca62001-06-30 21:53:53 +00003062/******************************************************************************
3063** The complete implementation of the BTree subsystem is above this line.
3064** All the code the follows is for testing and troubleshooting the BTree
3065** subsystem. None of the code that follows is used during normal operation.
drh5eddca62001-06-30 21:53:53 +00003066******************************************************************************/
drh5eddca62001-06-30 21:53:53 +00003067
drh8c42ca92001-06-22 19:15:00 +00003068/*
3069** Print a disassembly of the given page on standard output. This routine
3070** is used for debugging and testing only.
3071*/
drhaaab5722002-02-19 13:39:21 +00003072#ifdef SQLITE_TEST
drh144f9ea2003-04-16 01:28:16 +00003073static int fileBtreePageDump(Btree *pBt, int pgno, int recursive){
drh8c42ca92001-06-22 19:15:00 +00003074 int rc;
3075 MemPage *pPage;
3076 int i, j;
3077 int nFree;
3078 u16 idx;
3079 char range[20];
3080 unsigned char payload[20];
3081 rc = sqlitepager_get(pBt->pPager, (Pgno)pgno, (void**)&pPage);
3082 if( rc ){
3083 return rc;
3084 }
drh6019e162001-07-02 17:51:45 +00003085 if( recursive ) printf("PAGE %d:\n", pgno);
drh8c42ca92001-06-22 19:15:00 +00003086 i = 0;
drh0d316a42002-08-11 20:10:47 +00003087 idx = SWAB16(pBt, pPage->u.hdr.firstCell);
drhd0ba1932004-02-10 01:54:28 +00003088 while( idx>0 && idx<=SQLITE_USABLE_SIZE-MIN_CELL_SIZE ){
drh8c42ca92001-06-22 19:15:00 +00003089 Cell *pCell = (Cell*)&pPage->u.aDisk[idx];
drh0d316a42002-08-11 20:10:47 +00003090 int sz = cellSize(pBt, pCell);
drh8c42ca92001-06-22 19:15:00 +00003091 sprintf(range,"%d..%d", idx, idx+sz-1);
drh0d316a42002-08-11 20:10:47 +00003092 sz = NKEY(pBt, pCell->h) + NDATA(pBt, pCell->h);
drh8c42ca92001-06-22 19:15:00 +00003093 if( sz>sizeof(payload)-1 ) sz = sizeof(payload)-1;
3094 memcpy(payload, pCell->aPayload, sz);
3095 for(j=0; j<sz; j++){
3096 if( payload[j]<0x20 || payload[j]>0x7f ) payload[j] = '.';
3097 }
3098 payload[sz] = 0;
3099 printf(
drh6019e162001-07-02 17:51:45 +00003100 "cell %2d: i=%-10s chld=%-4d nk=%-4d nd=%-4d payload=%s\n",
drh0d316a42002-08-11 20:10:47 +00003101 i, range, (int)pCell->h.leftChild,
3102 NKEY(pBt, pCell->h), NDATA(pBt, pCell->h),
drh2aa679f2001-06-25 02:11:07 +00003103 payload
drh8c42ca92001-06-22 19:15:00 +00003104 );
drh6019e162001-07-02 17:51:45 +00003105 if( pPage->isInit && pPage->apCell[i]!=pCell ){
drh2aa679f2001-06-25 02:11:07 +00003106 printf("**** apCell[%d] does not match on prior entry ****\n", i);
3107 }
drh7c717f72001-06-24 20:39:41 +00003108 i++;
drh0d316a42002-08-11 20:10:47 +00003109 idx = SWAB16(pBt, pCell->h.iNext);
drh8c42ca92001-06-22 19:15:00 +00003110 }
3111 if( idx!=0 ){
3112 printf("ERROR: next cell index out of range: %d\n", idx);
3113 }
drh0d316a42002-08-11 20:10:47 +00003114 printf("right_child: %d\n", SWAB32(pBt, pPage->u.hdr.rightChild));
drh8c42ca92001-06-22 19:15:00 +00003115 nFree = 0;
3116 i = 0;
drh0d316a42002-08-11 20:10:47 +00003117 idx = SWAB16(pBt, pPage->u.hdr.firstFree);
drhd0ba1932004-02-10 01:54:28 +00003118 while( idx>0 && idx<SQLITE_USABLE_SIZE ){
drh8c42ca92001-06-22 19:15:00 +00003119 FreeBlk *p = (FreeBlk*)&pPage->u.aDisk[idx];
3120 sprintf(range,"%d..%d", idx, idx+p->iSize-1);
drh0d316a42002-08-11 20:10:47 +00003121 nFree += SWAB16(pBt, p->iSize);
drh8c42ca92001-06-22 19:15:00 +00003122 printf("freeblock %2d: i=%-10s size=%-4d total=%d\n",
drh0d316a42002-08-11 20:10:47 +00003123 i, range, SWAB16(pBt, p->iSize), nFree);
3124 idx = SWAB16(pBt, p->iNext);
drh2aa679f2001-06-25 02:11:07 +00003125 i++;
drh8c42ca92001-06-22 19:15:00 +00003126 }
3127 if( idx!=0 ){
3128 printf("ERROR: next freeblock index out of range: %d\n", idx);
3129 }
drh6019e162001-07-02 17:51:45 +00003130 if( recursive && pPage->u.hdr.rightChild!=0 ){
drh0d316a42002-08-11 20:10:47 +00003131 idx = SWAB16(pBt, pPage->u.hdr.firstCell);
drhd0ba1932004-02-10 01:54:28 +00003132 while( idx>0 && idx<SQLITE_USABLE_SIZE-MIN_CELL_SIZE ){
drh6019e162001-07-02 17:51:45 +00003133 Cell *pCell = (Cell*)&pPage->u.aDisk[idx];
drh144f9ea2003-04-16 01:28:16 +00003134 fileBtreePageDump(pBt, SWAB32(pBt, pCell->h.leftChild), 1);
drh0d316a42002-08-11 20:10:47 +00003135 idx = SWAB16(pBt, pCell->h.iNext);
drh6019e162001-07-02 17:51:45 +00003136 }
drh144f9ea2003-04-16 01:28:16 +00003137 fileBtreePageDump(pBt, SWAB32(pBt, pPage->u.hdr.rightChild), 1);
drh6019e162001-07-02 17:51:45 +00003138 }
drh8c42ca92001-06-22 19:15:00 +00003139 sqlitepager_unref(pPage);
3140 return SQLITE_OK;
3141}
drhaaab5722002-02-19 13:39:21 +00003142#endif
drh8c42ca92001-06-22 19:15:00 +00003143
drhaaab5722002-02-19 13:39:21 +00003144#ifdef SQLITE_TEST
drh8c42ca92001-06-22 19:15:00 +00003145/*
drh2aa679f2001-06-25 02:11:07 +00003146** Fill aResult[] with information about the entry and page that the
3147** cursor is pointing to.
3148**
3149** aResult[0] = The page number
3150** aResult[1] = The entry number
3151** aResult[2] = Total number of entries on this page
3152** aResult[3] = Size of this entry
3153** aResult[4] = Number of free bytes on this page
3154** aResult[5] = Number of free blocks on the page
3155** aResult[6] = Page number of the left child of this entry
3156** aResult[7] = Page number of the right child for the whole page
drh5eddca62001-06-30 21:53:53 +00003157**
3158** This routine is used for testing and debugging only.
drh8c42ca92001-06-22 19:15:00 +00003159*/
drh144f9ea2003-04-16 01:28:16 +00003160static int fileBtreeCursorDump(BtCursor *pCur, int *aResult){
drh2aa679f2001-06-25 02:11:07 +00003161 int cnt, idx;
3162 MemPage *pPage = pCur->pPage;
drh0d316a42002-08-11 20:10:47 +00003163 Btree *pBt = pCur->pBt;
drh2aa679f2001-06-25 02:11:07 +00003164 aResult[0] = sqlitepager_pagenumber(pPage);
drh8c42ca92001-06-22 19:15:00 +00003165 aResult[1] = pCur->idx;
drh2aa679f2001-06-25 02:11:07 +00003166 aResult[2] = pPage->nCell;
3167 if( pCur->idx>=0 && pCur->idx<pPage->nCell ){
drh0d316a42002-08-11 20:10:47 +00003168 aResult[3] = cellSize(pBt, pPage->apCell[pCur->idx]);
3169 aResult[6] = SWAB32(pBt, pPage->apCell[pCur->idx]->h.leftChild);
drh2aa679f2001-06-25 02:11:07 +00003170 }else{
3171 aResult[3] = 0;
3172 aResult[6] = 0;
3173 }
3174 aResult[4] = pPage->nFree;
3175 cnt = 0;
drh0d316a42002-08-11 20:10:47 +00003176 idx = SWAB16(pBt, pPage->u.hdr.firstFree);
drhd0ba1932004-02-10 01:54:28 +00003177 while( idx>0 && idx<SQLITE_USABLE_SIZE ){
drh2aa679f2001-06-25 02:11:07 +00003178 cnt++;
drh0d316a42002-08-11 20:10:47 +00003179 idx = SWAB16(pBt, ((FreeBlk*)&pPage->u.aDisk[idx])->iNext);
drh2aa679f2001-06-25 02:11:07 +00003180 }
3181 aResult[5] = cnt;
drh0d316a42002-08-11 20:10:47 +00003182 aResult[7] = SWAB32(pBt, pPage->u.hdr.rightChild);
drh8c42ca92001-06-22 19:15:00 +00003183 return SQLITE_OK;
3184}
drhaaab5722002-02-19 13:39:21 +00003185#endif
drhdd793422001-06-28 01:54:48 +00003186
drhdd793422001-06-28 01:54:48 +00003187/*
drh5eddca62001-06-30 21:53:53 +00003188** Return the pager associated with a BTree. This routine is used for
3189** testing and debugging only.
drhdd793422001-06-28 01:54:48 +00003190*/
drh144f9ea2003-04-16 01:28:16 +00003191static Pager *fileBtreePager(Btree *pBt){
drhdd793422001-06-28 01:54:48 +00003192 return pBt->pPager;
3193}
drh5eddca62001-06-30 21:53:53 +00003194
3195/*
3196** This structure is passed around through all the sanity checking routines
3197** in order to keep track of some global state information.
3198*/
drhaaab5722002-02-19 13:39:21 +00003199typedef struct IntegrityCk IntegrityCk;
3200struct IntegrityCk {
drh100569d2001-10-02 13:01:48 +00003201 Btree *pBt; /* The tree being checked out */
3202 Pager *pPager; /* The associated pager. Also accessible by pBt->pPager */
3203 int nPage; /* Number of pages in the database */
3204 int *anRef; /* Number of times each page is referenced */
drh100569d2001-10-02 13:01:48 +00003205 char *zErrMsg; /* An error message. NULL of no errors seen. */
drh5eddca62001-06-30 21:53:53 +00003206};
3207
3208/*
3209** Append a message to the error message string.
3210*/
drhaaab5722002-02-19 13:39:21 +00003211static void checkAppendMsg(IntegrityCk *pCheck, char *zMsg1, char *zMsg2){
drh5eddca62001-06-30 21:53:53 +00003212 if( pCheck->zErrMsg ){
3213 char *zOld = pCheck->zErrMsg;
3214 pCheck->zErrMsg = 0;
drh41743982003-12-06 21:43:55 +00003215 sqliteSetString(&pCheck->zErrMsg, zOld, "\n", zMsg1, zMsg2, (char*)0);
drh5eddca62001-06-30 21:53:53 +00003216 sqliteFree(zOld);
3217 }else{
drh41743982003-12-06 21:43:55 +00003218 sqliteSetString(&pCheck->zErrMsg, zMsg1, zMsg2, (char*)0);
drh5eddca62001-06-30 21:53:53 +00003219 }
3220}
3221
3222/*
3223** Add 1 to the reference count for page iPage. If this is the second
3224** reference to the page, add an error message to pCheck->zErrMsg.
3225** Return 1 if there are 2 ore more references to the page and 0 if
3226** if this is the first reference to the page.
3227**
3228** Also check that the page number is in bounds.
3229*/
drhaaab5722002-02-19 13:39:21 +00003230static int checkRef(IntegrityCk *pCheck, int iPage, char *zContext){
drh5eddca62001-06-30 21:53:53 +00003231 if( iPage==0 ) return 1;
drh0de8c112002-07-06 16:32:14 +00003232 if( iPage>pCheck->nPage || iPage<0 ){
drh5eddca62001-06-30 21:53:53 +00003233 char zBuf[100];
3234 sprintf(zBuf, "invalid page number %d", iPage);
3235 checkAppendMsg(pCheck, zContext, zBuf);
3236 return 1;
3237 }
3238 if( pCheck->anRef[iPage]==1 ){
3239 char zBuf[100];
3240 sprintf(zBuf, "2nd reference to page %d", iPage);
3241 checkAppendMsg(pCheck, zContext, zBuf);
3242 return 1;
3243 }
3244 return (pCheck->anRef[iPage]++)>1;
3245}
3246
3247/*
3248** Check the integrity of the freelist or of an overflow page list.
3249** Verify that the number of pages on the list is N.
3250*/
drh30e58752002-03-02 20:41:57 +00003251static void checkList(
3252 IntegrityCk *pCheck, /* Integrity checking context */
3253 int isFreeList, /* True for a freelist. False for overflow page list */
3254 int iPage, /* Page number for first page in the list */
3255 int N, /* Expected number of pages in the list */
3256 char *zContext /* Context for error messages */
3257){
3258 int i;
drh5eddca62001-06-30 21:53:53 +00003259 char zMsg[100];
drh30e58752002-03-02 20:41:57 +00003260 while( N-- > 0 ){
drh5eddca62001-06-30 21:53:53 +00003261 OverflowPage *pOvfl;
3262 if( iPage<1 ){
3263 sprintf(zMsg, "%d pages missing from overflow list", N+1);
3264 checkAppendMsg(pCheck, zContext, zMsg);
3265 break;
3266 }
3267 if( checkRef(pCheck, iPage, zContext) ) break;
3268 if( sqlitepager_get(pCheck->pPager, (Pgno)iPage, (void**)&pOvfl) ){
3269 sprintf(zMsg, "failed to get page %d", iPage);
3270 checkAppendMsg(pCheck, zContext, zMsg);
3271 break;
3272 }
drh30e58752002-03-02 20:41:57 +00003273 if( isFreeList ){
3274 FreelistInfo *pInfo = (FreelistInfo*)pOvfl->aPayload;
drh0d316a42002-08-11 20:10:47 +00003275 int n = SWAB32(pCheck->pBt, pInfo->nFree);
3276 for(i=0; i<n; i++){
drh8bf8dc92003-05-17 17:35:10 +00003277 checkRef(pCheck, SWAB32(pCheck->pBt, pInfo->aFree[i]), zContext);
drh30e58752002-03-02 20:41:57 +00003278 }
drh0d316a42002-08-11 20:10:47 +00003279 N -= n;
drh30e58752002-03-02 20:41:57 +00003280 }
drh0d316a42002-08-11 20:10:47 +00003281 iPage = SWAB32(pCheck->pBt, pOvfl->iNext);
drh5eddca62001-06-30 21:53:53 +00003282 sqlitepager_unref(pOvfl);
3283 }
3284}
3285
3286/*
drh1bffb9c2002-02-03 17:37:36 +00003287** Return negative if zKey1<zKey2.
3288** Return zero if zKey1==zKey2.
3289** Return positive if zKey1>zKey2.
3290*/
3291static int keyCompare(
3292 const char *zKey1, int nKey1,
3293 const char *zKey2, int nKey2
3294){
3295 int min = nKey1>nKey2 ? nKey2 : nKey1;
3296 int c = memcmp(zKey1, zKey2, min);
3297 if( c==0 ){
3298 c = nKey1 - nKey2;
3299 }
3300 return c;
3301}
3302
3303/*
drh5eddca62001-06-30 21:53:53 +00003304** Do various sanity checks on a single page of a tree. Return
3305** the tree depth. Root pages return 0. Parents of root pages
3306** return 1, and so forth.
3307**
3308** These checks are done:
3309**
3310** 1. Make sure that cells and freeblocks do not overlap
3311** but combine to completely cover the page.
3312** 2. Make sure cell keys are in order.
3313** 3. Make sure no key is less than or equal to zLowerBound.
3314** 4. Make sure no key is greater than or equal to zUpperBound.
3315** 5. Check the integrity of overflow pages.
3316** 6. Recursively call checkTreePage on all children.
3317** 7. Verify that the depth of all children is the same.
drh6019e162001-07-02 17:51:45 +00003318** 8. Make sure this page is at least 33% full or else it is
drh5eddca62001-06-30 21:53:53 +00003319** the root of the tree.
3320*/
3321static int checkTreePage(
drhaaab5722002-02-19 13:39:21 +00003322 IntegrityCk *pCheck, /* Context for the sanity check */
drh5eddca62001-06-30 21:53:53 +00003323 int iPage, /* Page number of the page to check */
3324 MemPage *pParent, /* Parent page */
3325 char *zParentContext, /* Parent context */
3326 char *zLowerBound, /* All keys should be greater than this, if not NULL */
drh1bffb9c2002-02-03 17:37:36 +00003327 int nLower, /* Number of characters in zLowerBound */
3328 char *zUpperBound, /* All keys should be less than this, if not NULL */
3329 int nUpper /* Number of characters in zUpperBound */
drh5eddca62001-06-30 21:53:53 +00003330){
3331 MemPage *pPage;
3332 int i, rc, depth, d2, pgno;
3333 char *zKey1, *zKey2;
drh1bffb9c2002-02-03 17:37:36 +00003334 int nKey1, nKey2;
drh5eddca62001-06-30 21:53:53 +00003335 BtCursor cur;
drh0d316a42002-08-11 20:10:47 +00003336 Btree *pBt;
drh5eddca62001-06-30 21:53:53 +00003337 char zMsg[100];
3338 char zContext[100];
drhd0ba1932004-02-10 01:54:28 +00003339 char hit[SQLITE_USABLE_SIZE];
drh5eddca62001-06-30 21:53:53 +00003340
3341 /* Check that the page exists
3342 */
drh0d316a42002-08-11 20:10:47 +00003343 cur.pBt = pBt = pCheck->pBt;
drh5eddca62001-06-30 21:53:53 +00003344 if( iPage==0 ) return 0;
3345 if( checkRef(pCheck, iPage, zParentContext) ) return 0;
3346 sprintf(zContext, "On tree page %d: ", iPage);
3347 if( (rc = sqlitepager_get(pCheck->pPager, (Pgno)iPage, (void**)&pPage))!=0 ){
3348 sprintf(zMsg, "unable to get the page. error code=%d", rc);
3349 checkAppendMsg(pCheck, zContext, zMsg);
3350 return 0;
3351 }
drh0d316a42002-08-11 20:10:47 +00003352 if( (rc = initPage(pBt, pPage, (Pgno)iPage, pParent))!=0 ){
drh5eddca62001-06-30 21:53:53 +00003353 sprintf(zMsg, "initPage() returns error code %d", rc);
3354 checkAppendMsg(pCheck, zContext, zMsg);
3355 sqlitepager_unref(pPage);
3356 return 0;
3357 }
3358
3359 /* Check out all the cells.
3360 */
3361 depth = 0;
drh1bffb9c2002-02-03 17:37:36 +00003362 if( zLowerBound ){
3363 zKey1 = sqliteMalloc( nLower+1 );
3364 memcpy(zKey1, zLowerBound, nLower);
3365 zKey1[nLower] = 0;
3366 }else{
3367 zKey1 = 0;
3368 }
3369 nKey1 = nLower;
drh5eddca62001-06-30 21:53:53 +00003370 cur.pPage = pPage;
drh5eddca62001-06-30 21:53:53 +00003371 for(i=0; i<pPage->nCell; i++){
3372 Cell *pCell = pPage->apCell[i];
3373 int sz;
3374
3375 /* Check payload overflow pages
3376 */
drh0d316a42002-08-11 20:10:47 +00003377 nKey2 = NKEY(pBt, pCell->h);
3378 sz = nKey2 + NDATA(pBt, pCell->h);
drh5eddca62001-06-30 21:53:53 +00003379 sprintf(zContext, "On page %d cell %d: ", iPage, i);
3380 if( sz>MX_LOCAL_PAYLOAD ){
3381 int nPage = (sz - MX_LOCAL_PAYLOAD + OVERFLOW_SIZE - 1)/OVERFLOW_SIZE;
drh0d316a42002-08-11 20:10:47 +00003382 checkList(pCheck, 0, SWAB32(pBt, pCell->ovfl), nPage, zContext);
drh5eddca62001-06-30 21:53:53 +00003383 }
3384
3385 /* Check that keys are in the right order
3386 */
3387 cur.idx = i;
drh8c1238a2003-01-02 14:43:55 +00003388 zKey2 = sqliteMallocRaw( nKey2+1 );
drh1bffb9c2002-02-03 17:37:36 +00003389 getPayload(&cur, 0, nKey2, zKey2);
3390 if( zKey1 && keyCompare(zKey1, nKey1, zKey2, nKey2)>=0 ){
drh5eddca62001-06-30 21:53:53 +00003391 checkAppendMsg(pCheck, zContext, "Key is out of order");
3392 }
3393
3394 /* Check sanity of left child page.
3395 */
drh0d316a42002-08-11 20:10:47 +00003396 pgno = SWAB32(pBt, pCell->h.leftChild);
drh1bffb9c2002-02-03 17:37:36 +00003397 d2 = checkTreePage(pCheck, pgno, pPage, zContext, zKey1,nKey1,zKey2,nKey2);
drh5eddca62001-06-30 21:53:53 +00003398 if( i>0 && d2!=depth ){
3399 checkAppendMsg(pCheck, zContext, "Child page depth differs");
3400 }
3401 depth = d2;
3402 sqliteFree(zKey1);
3403 zKey1 = zKey2;
drh1bffb9c2002-02-03 17:37:36 +00003404 nKey1 = nKey2;
drh5eddca62001-06-30 21:53:53 +00003405 }
drh0d316a42002-08-11 20:10:47 +00003406 pgno = SWAB32(pBt, pPage->u.hdr.rightChild);
drh5eddca62001-06-30 21:53:53 +00003407 sprintf(zContext, "On page %d at right child: ", iPage);
drh1bffb9c2002-02-03 17:37:36 +00003408 checkTreePage(pCheck, pgno, pPage, zContext, zKey1,nKey1,zUpperBound,nUpper);
drh5eddca62001-06-30 21:53:53 +00003409 sqliteFree(zKey1);
3410
3411 /* Check for complete coverage of the page
3412 */
3413 memset(hit, 0, sizeof(hit));
3414 memset(hit, 1, sizeof(PageHdr));
drhd0ba1932004-02-10 01:54:28 +00003415 for(i=SWAB16(pBt, pPage->u.hdr.firstCell); i>0 && i<SQLITE_USABLE_SIZE; ){
drh5eddca62001-06-30 21:53:53 +00003416 Cell *pCell = (Cell*)&pPage->u.aDisk[i];
3417 int j;
drh0d316a42002-08-11 20:10:47 +00003418 for(j=i+cellSize(pBt, pCell)-1; j>=i; j--) hit[j]++;
3419 i = SWAB16(pBt, pCell->h.iNext);
drh5eddca62001-06-30 21:53:53 +00003420 }
drhd0ba1932004-02-10 01:54:28 +00003421 for(i=SWAB16(pBt,pPage->u.hdr.firstFree); i>0 && i<SQLITE_USABLE_SIZE; ){
drh5eddca62001-06-30 21:53:53 +00003422 FreeBlk *pFBlk = (FreeBlk*)&pPage->u.aDisk[i];
3423 int j;
drh0d316a42002-08-11 20:10:47 +00003424 for(j=i+SWAB16(pBt,pFBlk->iSize)-1; j>=i; j--) hit[j]++;
3425 i = SWAB16(pBt,pFBlk->iNext);
drh5eddca62001-06-30 21:53:53 +00003426 }
drhd0ba1932004-02-10 01:54:28 +00003427 for(i=0; i<SQLITE_USABLE_SIZE; i++){
drh5eddca62001-06-30 21:53:53 +00003428 if( hit[i]==0 ){
3429 sprintf(zMsg, "Unused space at byte %d of page %d", i, iPage);
3430 checkAppendMsg(pCheck, zMsg, 0);
3431 break;
3432 }else if( hit[i]>1 ){
3433 sprintf(zMsg, "Multiple uses for byte %d of page %d", i, iPage);
3434 checkAppendMsg(pCheck, zMsg, 0);
3435 break;
3436 }
3437 }
3438
3439 /* Check that free space is kept to a minimum
3440 */
drh6019e162001-07-02 17:51:45 +00003441#if 0
drhd0ba1932004-02-10 01:54:28 +00003442 if( pParent && pParent->nCell>2 && pPage->nFree>3*SQLITE_USABLE_SIZE/4 ){
drh5eddca62001-06-30 21:53:53 +00003443 sprintf(zMsg, "free space (%d) greater than max (%d)", pPage->nFree,
drhd0ba1932004-02-10 01:54:28 +00003444 SQLITE_USABLE_SIZE/3);
drh5eddca62001-06-30 21:53:53 +00003445 checkAppendMsg(pCheck, zContext, zMsg);
3446 }
drh6019e162001-07-02 17:51:45 +00003447#endif
3448
drh5eddca62001-06-30 21:53:53 +00003449 sqlitepager_unref(pPage);
3450 return depth;
3451}
3452
3453/*
3454** This routine does a complete check of the given BTree file. aRoot[] is
3455** an array of pages numbers were each page number is the root page of
3456** a table. nRoot is the number of entries in aRoot.
3457**
3458** If everything checks out, this routine returns NULL. If something is
3459** amiss, an error message is written into memory obtained from malloc()
3460** and a pointer to that error message is returned. The calling function
3461** is responsible for freeing the error message when it is done.
3462*/
drh144f9ea2003-04-16 01:28:16 +00003463char *fileBtreeIntegrityCheck(Btree *pBt, int *aRoot, int nRoot){
drh5eddca62001-06-30 21:53:53 +00003464 int i;
3465 int nRef;
drhaaab5722002-02-19 13:39:21 +00003466 IntegrityCk sCheck;
drh5eddca62001-06-30 21:53:53 +00003467
3468 nRef = *sqlitepager_stats(pBt->pPager);
drhefc251d2001-07-01 22:12:01 +00003469 if( lockBtree(pBt)!=SQLITE_OK ){
3470 return sqliteStrDup("Unable to acquire a read lock on the database");
3471 }
drh5eddca62001-06-30 21:53:53 +00003472 sCheck.pBt = pBt;
3473 sCheck.pPager = pBt->pPager;
3474 sCheck.nPage = sqlitepager_pagecount(sCheck.pPager);
drh0de8c112002-07-06 16:32:14 +00003475 if( sCheck.nPage==0 ){
3476 unlockBtreeIfUnused(pBt);
3477 return 0;
3478 }
drh8c1238a2003-01-02 14:43:55 +00003479 sCheck.anRef = sqliteMallocRaw( (sCheck.nPage+1)*sizeof(sCheck.anRef[0]) );
drh5eddca62001-06-30 21:53:53 +00003480 sCheck.anRef[1] = 1;
3481 for(i=2; i<=sCheck.nPage; i++){ sCheck.anRef[i] = 0; }
3482 sCheck.zErrMsg = 0;
3483
3484 /* Check the integrity of the freelist
3485 */
drh0d316a42002-08-11 20:10:47 +00003486 checkList(&sCheck, 1, SWAB32(pBt, pBt->page1->freeList),
3487 SWAB32(pBt, pBt->page1->nFree), "Main freelist: ");
drh5eddca62001-06-30 21:53:53 +00003488
3489 /* Check all the tables.
3490 */
3491 for(i=0; i<nRoot; i++){
drh4ff6dfa2002-03-03 23:06:00 +00003492 if( aRoot[i]==0 ) continue;
drh1bffb9c2002-02-03 17:37:36 +00003493 checkTreePage(&sCheck, aRoot[i], 0, "List of tree roots: ", 0,0,0,0);
drh5eddca62001-06-30 21:53:53 +00003494 }
3495
3496 /* Make sure every page in the file is referenced
3497 */
3498 for(i=1; i<=sCheck.nPage; i++){
3499 if( sCheck.anRef[i]==0 ){
3500 char zBuf[100];
3501 sprintf(zBuf, "Page %d is never used", i);
3502 checkAppendMsg(&sCheck, zBuf, 0);
3503 }
3504 }
3505
3506 /* Make sure this analysis did not leave any unref() pages
3507 */
drh5e00f6c2001-09-13 13:46:56 +00003508 unlockBtreeIfUnused(pBt);
drh5eddca62001-06-30 21:53:53 +00003509 if( nRef != *sqlitepager_stats(pBt->pPager) ){
3510 char zBuf[100];
3511 sprintf(zBuf,
3512 "Outstanding page count goes from %d to %d during this analysis",
3513 nRef, *sqlitepager_stats(pBt->pPager)
3514 );
3515 checkAppendMsg(&sCheck, zBuf, 0);
3516 }
3517
3518 /* Clean up and report errors.
3519 */
3520 sqliteFree(sCheck.anRef);
3521 return sCheck.zErrMsg;
3522}
paulb95a8862003-04-01 21:16:41 +00003523
drh73509ee2003-04-06 20:44:45 +00003524/*
3525** Return the full pathname of the underlying database file.
3526*/
drh144f9ea2003-04-16 01:28:16 +00003527static const char *fileBtreeGetFilename(Btree *pBt){
drh73509ee2003-04-06 20:44:45 +00003528 assert( pBt->pPager!=0 );
3529 return sqlitepager_filename(pBt->pPager);
3530}
3531
3532/*
drhf7c57532003-04-25 13:22:51 +00003533** Copy the complete content of pBtFrom into pBtTo. A transaction
3534** must be active for both files.
3535**
3536** The size of file pBtFrom may be reduced by this operation.
3537** If anything goes wrong, the transaction on pBtFrom is rolled back.
drh73509ee2003-04-06 20:44:45 +00003538*/
drhf7c57532003-04-25 13:22:51 +00003539static int fileBtreeCopyFile(Btree *pBtTo, Btree *pBtFrom){
3540 int rc = SQLITE_OK;
drh2e6d11b2003-04-25 15:37:57 +00003541 Pgno i, nPage, nToPage;
drhf7c57532003-04-25 13:22:51 +00003542
3543 if( !pBtTo->inTrans || !pBtFrom->inTrans ) return SQLITE_ERROR;
3544 if( pBtTo->needSwab!=pBtFrom->needSwab ) return SQLITE_ERROR;
3545 if( pBtTo->pCursor ) return SQLITE_BUSY;
drhd0ba1932004-02-10 01:54:28 +00003546 memcpy(pBtTo->page1, pBtFrom->page1, SQLITE_USABLE_SIZE);
drh2e6d11b2003-04-25 15:37:57 +00003547 rc = sqlitepager_overwrite(pBtTo->pPager, 1, pBtFrom->page1);
3548 nToPage = sqlitepager_pagecount(pBtTo->pPager);
drhf7c57532003-04-25 13:22:51 +00003549 nPage = sqlitepager_pagecount(pBtFrom->pPager);
drh2e6d11b2003-04-25 15:37:57 +00003550 for(i=2; rc==SQLITE_OK && i<=nPage; i++){
drhf7c57532003-04-25 13:22:51 +00003551 void *pPage;
3552 rc = sqlitepager_get(pBtFrom->pPager, i, &pPage);
3553 if( rc ) break;
drh2e6d11b2003-04-25 15:37:57 +00003554 rc = sqlitepager_overwrite(pBtTo->pPager, i, pPage);
3555 if( rc ) break;
drhf7c57532003-04-25 13:22:51 +00003556 sqlitepager_unref(pPage);
3557 }
drh2e6d11b2003-04-25 15:37:57 +00003558 for(i=nPage+1; rc==SQLITE_OK && i<=nToPage; i++){
3559 void *pPage;
3560 rc = sqlitepager_get(pBtTo->pPager, i, &pPage);
3561 if( rc ) break;
3562 rc = sqlitepager_write(pPage);
3563 sqlitepager_unref(pPage);
3564 sqlitepager_dont_write(pBtTo->pPager, i);
3565 }
3566 if( !rc && nPage<nToPage ){
3567 rc = sqlitepager_truncate(pBtTo->pPager, nPage);
3568 }
drhf7c57532003-04-25 13:22:51 +00003569 if( rc ){
3570 fileBtreeRollback(pBtTo);
3571 }
3572 return rc;
drh73509ee2003-04-06 20:44:45 +00003573}
3574
3575/*
3576** The following tables contain pointers to all of the interface
3577** routines for this implementation of the B*Tree backend. To
3578** substitute a different implemention of the backend, one has merely
3579** to provide pointers to alternative functions in similar tables.
3580*/
paulb95a8862003-04-01 21:16:41 +00003581static BtOps sqliteBtreeOps = {
drh144f9ea2003-04-16 01:28:16 +00003582 fileBtreeClose,
3583 fileBtreeSetCacheSize,
3584 fileBtreeSetSafetyLevel,
3585 fileBtreeBeginTrans,
3586 fileBtreeCommit,
3587 fileBtreeRollback,
3588 fileBtreeBeginCkpt,
3589 fileBtreeCommitCkpt,
3590 fileBtreeRollbackCkpt,
3591 fileBtreeCreateTable,
3592 fileBtreeCreateTable, /* Really sqliteBtreeCreateIndex() */
3593 fileBtreeDropTable,
3594 fileBtreeClearTable,
3595 fileBtreeCursor,
3596 fileBtreeGetMeta,
3597 fileBtreeUpdateMeta,
3598 fileBtreeIntegrityCheck,
3599 fileBtreeGetFilename,
drhf7c57532003-04-25 13:22:51 +00003600 fileBtreeCopyFile,
drh57ced912004-02-10 02:57:59 +00003601 fileBtreePager,
paulb95a8862003-04-01 21:16:41 +00003602#ifdef SQLITE_TEST
drh144f9ea2003-04-16 01:28:16 +00003603 fileBtreePageDump,
paulb95a8862003-04-01 21:16:41 +00003604#endif
3605};
paulb95a8862003-04-01 21:16:41 +00003606static BtCursorOps sqliteBtreeCursorOps = {
drh144f9ea2003-04-16 01:28:16 +00003607 fileBtreeMoveto,
3608 fileBtreeDelete,
3609 fileBtreeInsert,
3610 fileBtreeFirst,
3611 fileBtreeLast,
3612 fileBtreeNext,
3613 fileBtreePrevious,
3614 fileBtreeKeySize,
3615 fileBtreeKey,
3616 fileBtreeKeyCompare,
3617 fileBtreeDataSize,
3618 fileBtreeData,
3619 fileBtreeCloseCursor,
paulb95a8862003-04-01 21:16:41 +00003620#ifdef SQLITE_TEST
drh144f9ea2003-04-16 01:28:16 +00003621 fileBtreeCursorDump,
paulb95a8862003-04-01 21:16:41 +00003622#endif
3623};