/*
** 2010 February 1
**
** The author disclaims copyright to this source code. In place of
** a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
*************************************************************************
**
** This file contains the implementation of a write-ahead log (WAL) used in
** "journal_mode=WAL" mode.
**
** WRITE-AHEAD LOG (WAL) FILE FORMAT
**
** A WAL file consists of a header followed by zero or more "frames".
** Each frame records the revised content of a single page from the
** database file. All changes to the database are recorded by writing
** frames into the WAL. Transactions commit when a frame is written that
** contains a commit marker. A single WAL can and usually does record
** multiple transactions. Periodically, the content of the WAL is
** transferred back into the database file in an operation called a
** "checkpoint".
**
** A single WAL file can be used multiple times. In other words, the
** WAL can fill up with frames and then be checkpointed and then new
** frames can overwrite the old ones. A WAL always grows from beginning
** toward the end. Checksums and counters attached to each frame are
** used to determine which frames within the WAL are valid and which
** are leftovers from prior checkpoints.
**
** The WAL header is 32 bytes in size and consists of the following eight
** big-endian 32-bit unsigned integer values:
**
**     0: Magic number. 0x377f0682 or 0x377f0683
**     4: File format version. Currently 3007000
**     8: Database page size. Example: 1024
**    12: Checkpoint sequence number
**    16: Salt-1, random integer incremented with each checkpoint
**    20: Salt-2, a different random integer changing with each ckpt
**    24: Checksum-1 (first part of checksum for first 24 bytes of header).
**    28: Checksum-2 (second part of checksum for first 24 bytes of header).
**
** Immediately following the wal-header are zero or more frames. Each
** frame consists of a 24-byte frame-header followed by <page-size> bytes
** of page data. The frame-header is broken into 6 big-endian 32-bit unsigned
** integer values, as follows:
**
**     0: Page number.
**     4: For commit records, the size of the database image in pages
**        after the commit. For all other records, zero.
**     8: Salt-1 (copied from the header)
**    12: Salt-2 (copied from the header)
**    16: Checksum-1.
**    20: Checksum-2.
**
** A frame is considered valid if and only if the following conditions are
** true:
**
**    (1) The salt-1 and salt-2 values in the frame-header match
**        salt values in the wal-header
**
**    (2) The checksum values in the final 8 bytes of the frame-header
**        exactly match the checksum computed consecutively on the
**        WAL header and the first 8 bytes of the frame-header and the
**        page content of all frames up to and including the current frame.
**
** The checksum is computed using 32-bit big-endian integers if the
** magic number in the first 4 bytes of the WAL is 0x377f0683 and it
** is computed using little-endian if the magic number is 0x377f0682.
** The checksum values are always stored in the frame header in a
** big-endian format regardless of which byte order is used to compute
** the checksum. The checksum is computed by interpreting the input as
** an even number of unsigned 32-bit integers: x[0] through x[n-1]. The
** algorithm used for the checksum is as follows:
**
**   for i from 0 to n-1 step 2:
**     s0 += x[i] + s1;
**     s1 += x[i+1] + s0;
**   endfor
**
** On a checkpoint, the WAL is first VFS.xSync-ed, then valid content of the
** WAL is transferred into the database, then the database is VFS.xSync-ed.
** The VFS.xSync operations serve as write barriers - all writes launched
** before the xSync must complete before any write that launches after the
** xSync begins.
**
** After each checkpoint, the salt-1 value is incremented and the salt-2
** value is randomized. This prevents old and new frames in the WAL from
** being considered valid at the same time and being checkpointed together
** following a crash.
**
** READER ALGORITHM
**
** To read a page from the database (call it page number P), a reader
** first checks the WAL to see if it contains page P. If so, then the
** last valid instance of page P that is followed by a commit frame
** or is a commit frame itself becomes the value read. If the WAL
** contains no copies of page P that are valid and which are a commit
** frame or are followed by a commit frame, then page P is read from
** the database file.
**
** To start a read transaction, the reader records the index of the last
** valid frame in the WAL. The reader uses this recorded "mxFrame" value
** for all subsequent read operations. New transactions can be appended
** to the WAL, but as long as the reader uses its original mxFrame value
** and ignores the newly appended content, it will see a consistent snapshot
** of the database from a single point in time. This technique allows
** multiple concurrent readers to view different versions of the database
** content simultaneously.
**
** The reader algorithm in the previous paragraphs works correctly, but
** because frames for page P can appear anywhere within the WAL, the
** reader has to scan the entire WAL looking for page P frames. If the
** WAL is large (multiple megabytes is typical) that scan can be slow,
** and read performance suffers. To overcome this problem, a separate
** data structure called the wal-index is maintained to expedite the
** search for frames of a particular page.
**
** WAL-INDEX FORMAT
**
** Conceptually, the wal-index is shared memory, though VFS implementations
** might choose to implement the wal-index using a mmapped file. Because
** the wal-index is shared memory, SQLite does not support journal_mode=WAL
** on a network filesystem. All users of the database must be able to
** share memory.
**
** The wal-index is transient. After a crash, the wal-index can (and should
** be) reconstructed from the original WAL file. In fact, the VFS is required
** to either truncate or zero the header of the wal-index when the last
** connection to it closes. Because the wal-index is transient, it can
** use an architecture-specific format; it does not have to be cross-platform.
** Hence, unlike the database and WAL file formats which store all values
** as big endian, the wal-index can store multi-byte values in the native
** byte order of the host computer.
**
** The purpose of the wal-index is to answer this question quickly: Given
** a page number P, return the index of the last frame for page P in the WAL,
** or return NULL if there are no frames for page P in the WAL.
**
** The wal-index consists of a header region, followed by one or
** more index blocks.
**
** The wal-index header contains the total number of frames within the WAL
** in the mxFrame field.
**
** Each index block except for the first contains information on
** HASHTABLE_NPAGE frames. The first index block contains information on
** HASHTABLE_NPAGE_ONE frames. The values of HASHTABLE_NPAGE_ONE and
** HASHTABLE_NPAGE are selected so that together the wal-index header and
** first index block are the same size as all other index blocks in the
** wal-index.
**
** Each index block contains two sections, a page-mapping that contains the
** database page number associated with each wal frame, and a hash-table
** that allows readers to query an index block for a specific page number.
** The page-mapping is an array of HASHTABLE_NPAGE (or HASHTABLE_NPAGE_ONE
** for the first index block) 32-bit page numbers. The first entry in the
** first index-block contains the database page number corresponding to the
** first frame in the WAL file. The first entry in the second index block
** in the WAL file corresponds to the (HASHTABLE_NPAGE_ONE+1)th frame in
** the log, and so on.
**
** The last index block in a wal-index usually contains fewer than the full
** complement of HASHTABLE_NPAGE (or HASHTABLE_NPAGE_ONE) page-numbers,
** depending on the contents of the WAL file. This does not change the
** allocated size of the page-mapping array - the page-mapping array merely
** contains unused entries.
**
** Even without using the hash table, the last frame for page P
** can be found by scanning the page-mapping sections of each index block
** starting with the last index block and moving toward the first, and
** within each index block, starting at the end and moving toward the
** beginning. The first entry that equals P corresponds to the frame
** holding the content for that page.
**
** The hash table consists of HASHTABLE_NSLOT 16-bit unsigned integers.
** HASHTABLE_NSLOT = 2*HASHTABLE_NPAGE, and there is one entry in the
** hash table for each page number in the mapping section, so the hash
** table is never more than half full. The expected number of collisions
** prior to finding a match is 1. Each entry of the hash table is a
** 1-based index of an entry in the mapping section of the same
** index block. Let K be the 1-based index of the largest entry in
** the mapping section. (For index blocks other than the last, K will
** always be exactly HASHTABLE_NPAGE (4096) and for the last index block
** K will be (mxFrame%HASHTABLE_NPAGE).) Unused slots of the hash table
** contain a value of 0.
**
** To look for page P in the hash table, first compute a hash iKey on
** P as follows:
**
**      iKey = (P * 383) % HASHTABLE_NSLOT
**
** Then start scanning entries of the hash table, starting with iKey
** (wrapping around to the beginning when the end of the hash table is
** reached) until an unused hash slot is found. Let the first unused slot
** be at index iUnused. (iUnused might be less than iKey if there was
** wrap-around.) Because the hash table is never more than half full,
** the search is guaranteed to eventually hit an unused entry. Let
** iMax be the value between iKey and iUnused, closest to iUnused,
** where the mapping entry referenced by aHash[iMax] equals P. If there
** is no iMax entry (if there exists no hash slot whose mapping entry
** equals P) then page P is not in the current index block. Otherwise
** the aHash[iMax]-th mapping entry of the current index block
** corresponds to the last entry that references page P.
**
** A hash search begins with the last index block and moves toward the
** first index block, looking for entries corresponding to page P. On
** average, only two or three slots in each index block need to be
** examined in order to either find the last entry for page P, or to
** establish that no such entry exists in the block. Each index block
** holds over 4000 entries. So two or three index blocks are sufficient
** to cover a typical 10 megabyte WAL file, assuming 1K pages. 8 or 10
** comparisons (on average) suffice to either locate a frame in the
** WAL or to establish that the frame does not exist in the WAL. This
** is much faster than scanning the entire 10MB WAL.
**
** Note that entries are added in order of increasing K. Hence, one
** reader might be using some value K0 and a second reader that started
** at a later time (after additional transactions were added to the WAL
** and to the wal-index) might be using a different value K1, where K1>K0.
** Both readers can use the same hash table and mapping section to get
** the correct result. There may be entries in the hash table with
** K>K0 but to the first reader, those entries will appear to be unused
** slots in the hash table and so the first reader will get an answer as
** if no values greater than K0 had ever been inserted into the hash table
** in the first place - which is what reader one wants. Meanwhile, the
** second reader using K1 will see additional values that were inserted
** later, which is exactly what reader two wants.
**
** When a rollback occurs, the value of K is decreased. Hash table entries
** that correspond to frames greater than the new K value are removed
** from the hash table at this point.
*/
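
/*
** ILLUSTRATIVE SKETCH - not part of the SQLite sources.
**
** The hash lookup described in the header comment can be modelled for a
** single index block roughly as follows. This is a minimal stand-alone
** sketch under stated assumptions, not the implementation used by this
** file: every Sketch/sketch_ name is hypothetical, aPgno[] is 0-based so
** the 1-based hash value k refers to aPgno[k-1], and the reader's snapshot
** limit K is passed in as mxEntry. Entries with k>mxEntry are skipped,
** which gives the same answer as treating them as unused.
*/

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NPAGE  4096               /* mirrors HASHTABLE_NPAGE */
#define SKETCH_NSLOT  (SKETCH_NPAGE*2)   /* mirrors HASHTABLE_NSLOT */
#define SKETCH_HASH_1 383                /* mirrors HASHTABLE_HASH_1 */

typedef struct SketchBlock {
  uint32_t aPgno[SKETCH_NPAGE];  /* page-mapping: page number of each entry */
  uint16_t aHash[SKETCH_NSLOT];  /* 0 = unused, else 1-based index into aPgno */
  uint32_t nEntry;               /* number of mapping entries in use */
} SketchBlock;

static void sketch_init(SketchBlock *p){
  memset(p, 0, sizeof(*p));
}

/* iKey = (P * 383) % HASHTABLE_NSLOT */
static uint32_t sketch_hash(uint32_t pgno){
  return (pgno*SKETCH_HASH_1) % SKETCH_NSLOT;
}

/* Append one mapping entry and record it in the hash table.
** Returns 0 on success or -1 if the block is already full. */
static int sketch_append(SketchBlock *p, uint32_t pgno){
  uint32_t i;
  if( p->nEntry>=SKETCH_NPAGE ) return -1;
  p->aPgno[p->nEntry++] = pgno;
  i = sketch_hash(pgno);
  while( p->aHash[i] ) i = (i+1) % SKETCH_NSLOT;   /* linear probing */
  p->aHash[i] = (uint16_t)p->nEntry;               /* 1-based index */
  return 0;
}

/* Return the 1-based index of the last mapping entry for pgno that is
** no greater than mxEntry, or 0 if there is none. The table is at most
** half full, so the probe loop always reaches an unused slot. */
static uint32_t sketch_lookup(const SketchBlock *p, uint32_t pgno,
                              uint32_t mxEntry){
  uint32_t best = 0;
  uint32_t i = sketch_hash(pgno);
  while( p->aHash[i] ){
    uint32_t k = p->aHash[i];
    if( k<=mxEntry && p->aPgno[k-1]==pgno && k>best ) best = k;
    i = (i+1) % SKETCH_NSLOT;
  }
  return best;
}
```

/*
** With entries for pages 7, 9, 7 appended in that order, a reader whose
** snapshot covers all three entries finds page 7 at entry 3, while an
** older reader limited to the first two entries finds it at entry 1.
*/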
#ifndef SQLITE_OMIT_WAL

#include "wal.h"

/*
** Trace output macros
*/
#if defined(SQLITE_TEST) && defined(SQLITE_DEBUG)
int sqlite3WalTrace = 0;
# define WALTRACE(X)  if(sqlite3WalTrace) sqlite3DebugPrintf X
#else
# define WALTRACE(X)
#endif

/*
** The maximum (and only) versions of the wal and wal-index formats
** that may be interpreted by this version of SQLite.
**
** If a client begins recovering a WAL file and finds that (a) the checksum
** values in the wal-header are correct and (b) the version field is not
** WAL_MAX_VERSION, recovery fails and SQLite returns SQLITE_CANTOPEN.
**
** Similarly, if a client successfully reads a wal-index header (i.e. the
** checksum test is successful) and finds that the version field is not
** WALINDEX_MAX_VERSION, then no read-transaction is opened and SQLite
** returns SQLITE_CANTOPEN.
*/
#define WAL_MAX_VERSION      3007000
#define WALINDEX_MAX_VERSION 3007000

/*
** Indices of various locking bytes. WAL_NREADER is the number
** of available reader locks and should be at least 3.
*/
#define WAL_WRITE_LOCK         0
#define WAL_ALL_BUT_WRITE      1
#define WAL_CKPT_LOCK          1
#define WAL_RECOVER_LOCK       2
#define WAL_READ_LOCK(I)       (3+(I))
#define WAL_NREADER            (SQLITE_SHM_NLOCK-3)


/* Object declarations */
typedef struct WalIndexHdr WalIndexHdr;
typedef struct WalIterator WalIterator;
typedef struct WalCkptInfo WalCkptInfo;


/*
** The following object holds a copy of the wal-index header content.
**
** The actual header in the wal-index consists of two copies of this
** object.
*/
struct WalIndexHdr {
  u32 iVersion;                   /* Wal-index version */
  u32 unused;                     /* Unused (padding) field */
  u32 iChange;                    /* Counter incremented each transaction */
  u8 isInit;                      /* 1 when initialized */
  u8 bigEndCksum;                 /* True if checksums in WAL are big-endian */
  u16 szPage;                     /* Database page size in bytes */
  u32 mxFrame;                    /* Index of last valid frame in the WAL */
  u32 nPage;                      /* Size of database in pages */
  u32 aFrameCksum[2];             /* Checksum of last frame in log */
  u32 aSalt[2];                   /* Two salt values copied from WAL header */
  u32 aCksum[2];                  /* Checksum over all prior fields */
};

/*
** A copy of the following object occurs in the wal-index immediately
** following the second copy of the WalIndexHdr. This object stores
** information used by checkpoint.
**
** nBackfill is the number of frames in the WAL that have been written
** back into the database. (We call the act of moving content from WAL to
** database "backfilling".) The nBackfill number is never greater than
** WalIndexHdr.mxFrame. nBackfill can only be increased by threads
** holding the WAL_CKPT_LOCK lock (which includes a recovery thread).
** However, a WAL_WRITE_LOCK thread can move the value of nBackfill from
** mxFrame back to zero when the WAL is reset.
**
** There is one entry in aReadMark[] for each reader lock. If a reader
** holds read-lock K, then the value in aReadMark[K] is no greater than
** the mxFrame for that reader. The value READMARK_NOT_USED (0xffffffff)
** for any aReadMark[] means that entry is unused. aReadMark[0] is
** a special case; its value is never used and it exists as a place-holder
** to avoid having to offset aReadMark[] indexes by one. Readers holding
** WAL_READ_LOCK(0) always ignore the entire WAL and read all content
** directly from the database.
**
** The value of aReadMark[K] may only be changed by a thread that
** is holding an exclusive lock on WAL_READ_LOCK(K). Thus, the value of
** aReadMark[K] cannot change while a reader is using that mark
** since the reader will be holding a shared lock on WAL_READ_LOCK(K).
**
** The checkpointer may only transfer frames from WAL to database where
** the frame numbers are less than or equal to every aReadMark[] that is
** in use (that is, every aReadMark[j] for which there is a corresponding
** WAL_READ_LOCK(j)). New readers (usually) pick the aReadMark[] with the
** largest value and will increase an unused aReadMark[] to mxFrame if there
** is not already an aReadMark[] equal to mxFrame. The exception to the
** previous sentence is when nBackfill equals mxFrame (meaning that everything
** in the WAL has been backfilled into the database) then new readers
** will choose aReadMark[0] which has value 0 and hence such readers will
** get all their content directly from the database file and ignore
** the WAL.
**
** Writers normally append new frames to the end of the WAL. However,
** if nBackfill equals mxFrame (meaning that all WAL content has been
** written back into the database) and if no readers are using the WAL
** (in other words, if there are no WAL_READ_LOCK(i) where i>0) then
** the writer will first "reset" the WAL back to the beginning and start
** writing new content beginning at frame 1.
**
** We assume that 32-bit loads are atomic and so no locks are needed in
** order to read from any aReadMark[] entries.
*/
struct WalCkptInfo {
  u32 nBackfill;                  /* Number of WAL frames backfilled into DB */
  u32 aReadMark[WAL_NREADER];     /* Reader marks */
};
#define READMARK_NOT_USED  0xffffffff


/* A block of WALINDEX_LOCK_RESERVED bytes beginning at
** WALINDEX_LOCK_OFFSET is reserved for locks. Since some systems
** only support mandatory file-locks, we do not read or write data
** from the region of the file on which locks are applied.
*/
#define WALINDEX_LOCK_OFFSET   (sizeof(WalIndexHdr)*2 + sizeof(WalCkptInfo))
#define WALINDEX_LOCK_RESERVED 16
#define WALINDEX_HDR_SIZE      (WALINDEX_LOCK_OFFSET+WALINDEX_LOCK_RESERVED)

/* Size of header before each frame in wal */
#define WAL_FRAME_HDRSIZE 24

/* Size of write ahead log header, including checksum. */
/* #define WAL_HDRSIZE 24 */
#define WAL_HDRSIZE 32

/* WAL magic value. Either this value, or the same value with the least
** significant bit also set (WAL_MAGIC | 0x00000001) is stored in 32-bit
** big-endian format in the first 4 bytes of a WAL file.
**
** If the LSB is set, then the checksums for each frame within the WAL
** file are calculated by treating all data as an array of 32-bit
** big-endian words. Otherwise, they are calculated by interpreting
** all data as 32-bit little-endian words.
*/
#define WAL_MAGIC 0x377f0682
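
/*
** ILLUSTRATIVE SKETCH - not part of the SQLite sources.
**
** The way the magic number selects the checksum byte order can be shown
** with a small stand-alone decoder. The sketch_ names below are
** hypothetical helpers introduced only for this example; the function
** reads the first 4 bytes of a WAL header as a big-endian integer and
** reports which byte order the frame checksums use.
*/

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_WAL_MAGIC 0x377f0682u   /* same value as WAL_MAGIC */

/* Decode a 32-bit big-endian integer from 4 bytes. */
static uint32_t sketch_get4byte(const unsigned char *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  |  (uint32_t)p[3];
}

/* Return 1 if the checksums are computed over big-endian words, 0 if
** they are computed over little-endian words, or -1 if the first four
** bytes are not a WAL magic number at all. */
static int sketch_wal_cksum_order(const unsigned char *aHdr){
  uint32_t magic = sketch_get4byte(aHdr);
  if( (magic & 0xFFFFFFFEu)!=SKETCH_WAL_MAGIC ) return -1;
  return (int)(magic & 1u);
}
```

/*
** A header beginning 37 7f 06 83 therefore yields 1 (big-endian
** checksums) and one beginning 37 7f 06 82 yields 0.
*/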

/*
** Return the offset of frame iFrame in the write-ahead log file,
** assuming a database page size of szPage bytes. The offset returned
** is to the start of the write-ahead log frame-header.
*/
#define walFrameOffset(iFrame, szPage) (                               \
  WAL_HDRSIZE + ((iFrame)-1)*((szPage)+WAL_FRAME_HDRSIZE)              \
)
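
/*
** ILLUSTRATIVE SKETCH - not part of the SQLite sources.
**
** As a worked example of the offset arithmetic, the function below
** restates the walFrameOffset() formula with the 32-byte WAL header and
** 24-byte frame header. The sketch_ name and the use of 64-bit
** arithmetic are choices made for this example only; frames are numbered
** from 1, as in the macro.
*/

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_WAL_HDRSIZE       32   /* same value as WAL_HDRSIZE */
#define SKETCH_WAL_FRAME_HDRSIZE 24   /* same value as WAL_FRAME_HDRSIZE */

/* Byte offset of the frame-header of frame iFrame (1-based) in a WAL
** whose pages are szPage bytes each. */
static int64_t sketch_frame_offset(uint32_t iFrame, uint32_t szPage){
  assert( iFrame>=1 );
  return SKETCH_WAL_HDRSIZE
       + (int64_t)(iFrame-1)*((int64_t)szPage + SKETCH_WAL_FRAME_HDRSIZE);
}
```

/*
** With 1024-byte pages each frame occupies 1048 bytes, so frame 1 starts
** at offset 32, frame 2 at offset 1080 and frame 3 at offset 2128.
*/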

/*
** An open write-ahead log file is represented by an instance of the
** following object.
*/
struct Wal {
  sqlite3_vfs *pVfs;         /* The VFS used to create pDbFd */
  sqlite3_file *pDbFd;       /* File handle for the database file */
  sqlite3_file *pWalFd;      /* File handle for WAL file */
  u32 iCallback;             /* Value to pass to log callback (or 0) */
  int nWiData;               /* Size of array apWiData */
  volatile u32 **apWiData;   /* Pointer to wal-index content in memory */
  u16 szPage;                /* Database page size */
  i16 readLock;              /* Which read lock is being held. -1 for none */
  u8 exclusiveMode;          /* Non-zero if connection is in exclusive mode */
  u8 isWIndexOpen;           /* True if ShmOpen() called on pDbFd */
  u8 writeLock;              /* True if in a write transaction */
  u8 ckptLock;               /* True if holding a checkpoint lock */
  WalIndexHdr hdr;           /* Wal-index header for current transaction */
  char *zWalName;            /* Name of WAL file */
  u32 nCkpt;                 /* Checkpoint sequence counter in the wal-header */
#ifdef SQLITE_DEBUG
  u8 lockError;              /* True if a locking error has occurred */
#endif
};

/*
** Each page of the wal-index mapping contains a hash-table made up of
** an array of HASHTABLE_NSLOT elements of the following type.
*/
typedef u16 ht_slot;

/*
** This structure is used to implement an iterator that loops through
** all frames in the WAL in database page order. Where two or more frames
** correspond to the same database page, the iterator visits only the
** frame most recently written to the WAL (in other words, the frame with
** the largest index).
**
** The internals of this structure are only accessed by:
**
**   walIteratorInit() - Create a new iterator,
**   walIteratorNext() - Step an iterator,
**   walIteratorFree() - Free an iterator.
**
** This functionality is used by the checkpoint code (see walCheckpoint()).
*/
struct WalIterator {
  int iPrior;                     /* Last result returned from the iterator */
  int nSegment;                   /* Size of the aSegment[] array */
  struct WalSegment {
    int iNext;                    /* Next slot in aIndex[] not yet returned */
    ht_slot *aIndex;              /* i0, i1, i2... such that aPgno[iN] ascend */
    u32 *aPgno;                   /* Array of page numbers. */
    int nEntry;                   /* Max size of aPgno[] and aIndex[] arrays */
    int iZero;                    /* Frame number associated with aPgno[0] */
  } aSegment[1];                  /* One for every 32KB page in the wal-index */
};

/*
** Define the parameters of the hash tables in the wal-index file. There
** is a hash-table following every HASHTABLE_NPAGE page numbers in the
** wal-index.
**
** Changing any of these constants will alter the wal-index format and
** create incompatibilities.
*/
#define HASHTABLE_NPAGE      4096                 /* Must be power of 2 */
#define HASHTABLE_HASH_1     383                  /* Should be prime */
#define HASHTABLE_NSLOT      (HASHTABLE_NPAGE*2)  /* Must be a power of 2 */

/*
** The block of page numbers associated with the first hash-table in a
** wal-index is smaller than usual. This is so that there is a complete
** hash-table on each aligned 32KB page of the wal-index.
*/
#define HASHTABLE_NPAGE_ONE  (HASHTABLE_NPAGE - (WALINDEX_HDR_SIZE/sizeof(u32)))

/* The wal-index is divided into pages of WALINDEX_PGSZ bytes each. */
#define WALINDEX_PGSZ   (                                         \
    sizeof(ht_slot)*HASHTABLE_NSLOT + HASHTABLE_NPAGE*sizeof(u32) \
)
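
/*
** ILLUSTRATIVE SKETCH - not part of the SQLite sources.
**
** The size arithmetic behind these constants can be checked with the
** stand-alone helpers below. All sketch_ names are hypothetical. The
** wal-index header size is passed in as a parameter because it depends
** on SQLITE_SHM_NLOCK and on structure layout; the value 136 used in the
** note after the code is an ASSUMPTION (SQLITE_SHM_NLOCK==8 and no
** structure padding), not something this file states. The frame-to-block
** mapping follows from the rule that the first block holds
** HASHTABLE_NPAGE_ONE entries and every later block holds HASHTABLE_NPAGE.
*/

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NPAGE 4096               /* mirrors HASHTABLE_NPAGE */
#define SKETCH_NSLOT (SKETCH_NPAGE*2)   /* mirrors HASHTABLE_NSLOT */

/* Size in bytes of one wal-index page: a hash table of 16-bit slots
** plus a page-mapping of 32-bit page numbers. */
static size_t sketch_walindex_pgsz(void){
  return sizeof(uint16_t)*SKETCH_NSLOT + SKETCH_NPAGE*sizeof(uint32_t);
}

/* Number of page-mapping entries in the first index block, given the
** size in bytes of the wal-index header that shares its page. */
static size_t sketch_npage_one(size_t hdrSize){
  return SKETCH_NPAGE - hdrSize/sizeof(uint32_t);
}

/* 0-based index block that holds the entry for frame iFrame (1-based),
** where nPageOne is the capacity of the first index block. */
static uint32_t sketch_block_of_frame(uint32_t iFrame, uint32_t nPageOne){
  return (iFrame + SKETCH_NPAGE - nPageOne - 1) / SKETCH_NPAGE;
}
```

/*
** Each wal-index page is 2*8192 + 4*4096 = 32768 bytes. Under the
** assumed 136-byte header the first block holds 4096-34 = 4062 entries,
** so frame 4062 is the last frame of block 0 and frame 4063 is the first
** frame of block 1.
*/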

/*
** Obtain a pointer to the iPage'th page of the wal-index. The wal-index
** is broken into pages of WALINDEX_PGSZ bytes. Wal-index pages are
** numbered from zero.
**
** If this call is successful, *ppPage is set to point to the wal-index
** page and SQLITE_OK is returned. If an error (an OOM or VFS error) occurs,
** then an SQLite error code is returned and *ppPage is set to 0.
*/
static int walIndexPage(Wal *pWal, int iPage, volatile u32 **ppPage){
  int rc = SQLITE_OK;

  /* Enlarge the pWal->apWiData[] array if required */
  if( pWal->nWiData<=iPage ){
    int nByte = sizeof(u32 *)*(iPage+1);
    volatile u32 **apNew;
    apNew = (volatile u32 **)sqlite3_realloc(pWal->apWiData, nByte);
    if( !apNew ){
      *ppPage = 0;
      return SQLITE_NOMEM;
    }
    memset(&apNew[pWal->nWiData], 0, sizeof(u32 *)*(iPage+1-pWal->nWiData));
    pWal->apWiData = apNew;
    pWal->nWiData = iPage+1;
  }

  /* Request a pointer to the required page from the VFS */
  if( pWal->apWiData[iPage]==0 ){
    rc = sqlite3OsShmMap(pWal->pDbFd, iPage, WALINDEX_PGSZ,
        pWal->writeLock, (void volatile **)&pWal->apWiData[iPage]
    );
  }

  *ppPage = pWal->apWiData[iPage];
  assert( iPage==0 || *ppPage || rc!=SQLITE_OK );
  return rc;
}

/*
** Return a pointer to the WalCkptInfo structure in the wal-index.
*/
static volatile WalCkptInfo *walCkptInfo(Wal *pWal){
  assert( pWal->nWiData>0 && pWal->apWiData[0] );
  return (volatile WalCkptInfo*)&(pWal->apWiData[0][sizeof(WalIndexHdr)/2]);
}

/*
** Return a pointer to the WalIndexHdr structure in the wal-index.
*/
static volatile WalIndexHdr *walIndexHdr(Wal *pWal){
  assert( pWal->nWiData>0 && pWal->apWiData[0] );
  return (volatile WalIndexHdr*)pWal->apWiData[0];
}

/*
** The argument to this macro must be of type u32. On a little-endian
** architecture, it returns the u32 value that results from interpreting
** the 4 bytes as a big-endian value. On a big-endian architecture, it
** returns the value that would be produced by interpreting the 4 bytes
** of the input value as a little-endian integer.
*/
#define BYTESWAP32(x) ( \
    (((x)&0x000000FF)<<24) + (((x)&0x0000FF00)<<8)  \
  + (((x)&0x00FF0000)>>8)  + (((x)&0xFF000000)>>24) \
)

/*
** Generate or extend an 8 byte checksum based on the data in
** array a[] and the initial values of aIn[0] and aIn[1] (or
** initial values of 0 and 0 if aIn==NULL).
**
** The checksum is written back into aOut[] before returning.
**
** nByte must be a positive multiple of 8.
*/
static void walChecksumBytes(
  int nativeCksum, /* True for native byte-order, false for non-native */
  u8 *a,           /* Content to be checksummed */
  int nByte,       /* Bytes of content in a[]. Must be a multiple of 8. */
  const u32 *aIn,  /* Initial checksum value input */
  u32 *aOut        /* OUT: Final checksum value output */
){
  u32 s1, s2;
  u32 *aData = (u32 *)a;
  u32 *aEnd = (u32 *)&a[nByte];

  if( aIn ){
    s1 = aIn[0];
    s2 = aIn[1];
  }else{
    s1 = s2 = 0;
  }

  assert( nByte>=8 );
  assert( (nByte&0x00000007)==0 );

  if( nativeCksum ){
    do {
      s1 += *aData++ + s2;
      s2 += *aData++ + s1;
    }while( aData<aEnd );
  }else{
    do {
      s1 += BYTESWAP32(aData[0]) + s2;
      s2 += BYTESWAP32(aData[1]) + s1;
      aData += 2;
    }while( aData<aEnd );
  }

  aOut[0] = s1;
  aOut[1] = s2;
}
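
/*
** ILLUSTRATIVE SKETCH - not part of the SQLite sources.
**
** The same running checksum can be written without relying on the host
** byte order by decoding each 32-bit word from bytes. This is a sketch of
** the algorithm described in the header comment, not a replacement for
** walChecksumBytes(): the sketch_ names are hypothetical, and the byte
** order is chosen by an explicit bigEndian flag rather than by the
** native/non-native flag used above.
*/

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one 32-bit word from 4 bytes in the requested byte order. */
static uint32_t sketch_load32(const unsigned char *p, int bigEndian){
  if( bigEndian ){
    return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
         | ((uint32_t)p[2]<<8)  |  (uint32_t)p[3];
  }
  return ((uint32_t)p[3]<<24) | ((uint32_t)p[2]<<16)
       | ((uint32_t)p[1]<<8)  |  (uint32_t)p[0];
}

/* Generate or extend the two-word checksum over nByte bytes of a[].
** aIn may be NULL to start from (0,0). nByte must be a positive
** multiple of 8. The result is written to aOut[0] and aOut[1]. */
static void sketch_wal_checksum(
  int bigEndian,
  const unsigned char *a,
  size_t nByte,
  const uint32_t *aIn,
  uint32_t *aOut
){
  uint32_t s1 = aIn ? aIn[0] : 0;
  uint32_t s2 = aIn ? aIn[1] : 0;
  size_t i;
  assert( nByte>=8 && (nByte%8)==0 );
  for(i=0; i<nByte; i+=8){
    s1 += sketch_load32(&a[i], bigEndian) + s2;
    s2 += sketch_load32(&a[i+4], bigEndian) + s1;
  }
  aOut[0] = s1;
  aOut[1] = s2;
}
```

/*
** Because the output of one call can be fed back in through aIn, the
** checksum of a 16-byte buffer equals the checksum of its second 8 bytes
** seeded with the checksum of its first 8 bytes. This chaining is what
** lets each frame checksum cover everything that precedes it.
*/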

/*
** Write the header information in pWal->hdr into the wal-index.
**
** The checksum on pWal->hdr is updated before it is written.
*/
static void walIndexWriteHdr(Wal *pWal){
  volatile WalIndexHdr *aHdr = walIndexHdr(pWal);
  const int nCksum = offsetof(WalIndexHdr, aCksum);

  assert( pWal->writeLock );
  pWal->hdr.isInit = 1;
  pWal->hdr.iVersion = WALINDEX_MAX_VERSION;
  walChecksumBytes(1, (u8*)&pWal->hdr, nCksum, 0, pWal->hdr.aCksum);
  memcpy((void *)&aHdr[1], (void *)&pWal->hdr, sizeof(WalIndexHdr));
  sqlite3OsShmBarrier(pWal->pDbFd);
  memcpy((void *)&aHdr[0], (void *)&pWal->hdr, sizeof(WalIndexHdr));
}

/*
** This function encodes a single frame header and writes it to a buffer
** supplied by the caller. A frame-header is made up of a series of
** 4-byte big-endian integers, as follows:
**
**     0: Page number.
**     4: For commit records, the size of the database image in pages
**        after the commit. For all other records, zero.
**     8: Salt-1 (copied from the wal-header)
**    12: Salt-2 (copied from the wal-header)
**    16: Checksum-1.
**    20: Checksum-2.
*/
static void walEncodeFrame(
  Wal *pWal,                      /* The write-ahead log */
  u32 iPage,                      /* Database page number for frame */
  u32 nTruncate,                  /* New db size (or 0 for non-commit frames) */
  u8 *aData,                      /* Pointer to page data */
  u8 *aFrame                      /* OUT: Write encoded frame here */
){
  int nativeCksum;                /* True for native byte-order checksums */
  u32 *aCksum = pWal->hdr.aFrameCksum;
  assert( WAL_FRAME_HDRSIZE==24 );
  sqlite3Put4byte(&aFrame[0], iPage);
  sqlite3Put4byte(&aFrame[4], nTruncate);
  memcpy(&aFrame[8], pWal->hdr.aSalt, 8);

  nativeCksum = (pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN);
  walChecksumBytes(nativeCksum, aFrame, 8, aCksum, aCksum);
  walChecksumBytes(nativeCksum, aData, pWal->szPage, aCksum, aCksum);

  sqlite3Put4byte(&aFrame[16], aCksum[0]);
  sqlite3Put4byte(&aFrame[20], aCksum[1]);
}

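/*
** Illustrative sketch, not part of the original file: the 24-byte frame
** header layout described above, written as standalone C.  The demo_*
** helpers are hypothetical stand-ins for sqlite3Put4byte() and
** sqlite3Get4byte(), and the checksum is passed in by the caller rather
** than computed here.
*/

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { DEMO_FRAME_HDRSIZE = 24 };

/* Store v at p[0..3] as a big-endian integer. */
static void demo_put4(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>24);
  p[1] = (unsigned char)(v>>16);
  p[2] = (unsigned char)(v>>8);
  p[3] = (unsigned char)v;
}

/* Read a big-endian integer from p[0..3]. */
static uint32_t demo_get4(const unsigned char *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  | (uint32_t)p[3];
}

/* Fill in all six fields of a frame header. */
static void demo_encode_hdr(
  unsigned char *aFrame,          /* OUT: DEMO_FRAME_HDRSIZE bytes */
  uint32_t pgno,                  /* Page number.  Must be greater than 0 */
  uint32_t nTruncate,             /* Db size in pages for commits, else 0 */
  const unsigned char aSalt[8],   /* Salt-1 and salt-2, copied verbatim */
  const uint32_t aCksum[2]        /* Checksum-1 and checksum-2 */
){
  assert( pgno>0 );
  demo_put4(&aFrame[0], pgno);
  demo_put4(&aFrame[4], nTruncate);
  memcpy(&aFrame[8], aSalt, 8);
  demo_put4(&aFrame[16], aCksum[0]);
  demo_put4(&aFrame[20], aCksum[1]);
}
```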
645/*
drh7e263722010-05-20 21:21:09 +0000646** Check to see if the frame with header in aFrame[] and content
647** in aData[] is valid. If it is a valid frame, fill *piPage and
648** *pnTruncate and return true. Return if the frame is not valid.
dan7c246102010-04-12 19:00:29 +0000649*/
drh7ed91f22010-04-29 22:34:07 +0000650static int walDecodeFrame(
drh23ea97b2010-05-20 16:45:58 +0000651 Wal *pWal, /* The write-ahead log */
dan7c246102010-04-12 19:00:29 +0000652 u32 *piPage, /* OUT: Database page number for frame */
653 u32 *pnTruncate, /* OUT: New db size (or 0 if not commit) */
dan7c246102010-04-12 19:00:29 +0000654 u8 *aData, /* Pointer to page data (for checksum) */
655 u8 *aFrame /* Frame data */
656){
danb8fd6c22010-05-24 10:39:36 +0000657 int nativeCksum; /* True for native byte-order checksums */
dan71d89912010-05-24 13:57:42 +0000658 u32 *aCksum = pWal->hdr.aFrameCksum;
drhc8179152010-05-24 13:28:36 +0000659 u32 pgno; /* Page number of the frame */
drh23ea97b2010-05-20 16:45:58 +0000660 assert( WAL_FRAME_HDRSIZE==24 );
661
drh7e263722010-05-20 21:21:09 +0000662 /* A frame is only valid if the salt values in the frame-header
663 ** match the salt values in the wal-header.
664 */
665 if( memcmp(&pWal->hdr.aSalt, &aFrame[8], 8)!=0 ){
drh23ea97b2010-05-20 16:45:58 +0000666 return 0;
667 }
dan4a4b01d2010-04-16 11:30:18 +0000668
drhc8179152010-05-24 13:28:36 +0000669 /* A frame is only valid if the page number is creater than zero.
670 */
671 pgno = sqlite3Get4byte(&aFrame[0]);
672 if( pgno==0 ){
673 return 0;
674 }
675
drh7e263722010-05-20 21:21:09 +0000676 /* A frame is only valid if a checksum of the first 16 bytes
677 ** of the frame-header, and the frame-data matches
678 ** the checksum in the last 8 bytes of the frame-header.
679 */
danb8fd6c22010-05-24 10:39:36 +0000680 nativeCksum = (pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN);
dan71d89912010-05-24 13:57:42 +0000681 walChecksumBytes(nativeCksum, aFrame, 8, aCksum, aCksum);
danb8fd6c22010-05-24 10:39:36 +0000682 walChecksumBytes(nativeCksum, aData, pWal->szPage, aCksum, aCksum);
drh23ea97b2010-05-20 16:45:58 +0000683 if( aCksum[0]!=sqlite3Get4byte(&aFrame[16])
684 || aCksum[1]!=sqlite3Get4byte(&aFrame[20])
dan7c246102010-04-12 19:00:29 +0000685 ){
686 /* Checksum failed. */
687 return 0;
688 }
689
drh7e263722010-05-20 21:21:09 +0000690 /* If we reach this point, the frame is valid. Return the page number
691 ** and the new database size.
692 */
drhc8179152010-05-24 13:28:36 +0000693 *piPage = pgno;
dan97a31352010-04-16 13:59:31 +0000694 *pnTruncate = sqlite3Get4byte(&aFrame[4]);
dan7c246102010-04-12 19:00:29 +0000695 return 1;
696}
697
dan7c246102010-04-12 19:00:29 +0000698
#if defined(SQLITE_TEST) && defined(SQLITE_DEBUG)
/*
** Names of locks.  This routine is used to provide debugging output and is not
** a part of an ordinary build.
*/
static const char *walLockName(int lockIdx){
  if( lockIdx==WAL_WRITE_LOCK ){
    return "WRITE-LOCK";
  }else if( lockIdx==WAL_CKPT_LOCK ){
    return "CKPT-LOCK";
  }else if( lockIdx==WAL_RECOVER_LOCK ){
    return "RECOVER-LOCK";
  }else{
    static char zName[15];
    sqlite3_snprintf(sizeof(zName), zName, "READ-LOCK[%d]",
                     lockIdx-WAL_READ_LOCK(0));
    return zName;
  }
}
#endif /* defined(SQLITE_TEST) && defined(SQLITE_DEBUG) */


/*
** Set or release locks on the WAL.  Locks are either shared or exclusive.
** A lock cannot be moved directly between shared and exclusive - it must go
** through the unlocked state first.
**
** In locking_mode=EXCLUSIVE, all of these routines become no-ops.
*/
static int walLockShared(Wal *pWal, int lockIdx){
  int rc;
  if( pWal->exclusiveMode ) return SQLITE_OK;
  rc = sqlite3OsShmLock(pWal->pDbFd, lockIdx, 1,
                        SQLITE_SHM_LOCK | SQLITE_SHM_SHARED);
  WALTRACE(("WAL%p: acquire SHARED-%s %s\n", pWal,
            walLockName(lockIdx), rc ? "failed" : "ok"));
  VVA_ONLY( pWal->lockError = (rc!=SQLITE_OK && rc!=SQLITE_BUSY); )
  return rc;
}
static void walUnlockShared(Wal *pWal, int lockIdx){
  if( pWal->exclusiveMode ) return;
  (void)sqlite3OsShmLock(pWal->pDbFd, lockIdx, 1,
                         SQLITE_SHM_UNLOCK | SQLITE_SHM_SHARED);
  WALTRACE(("WAL%p: release SHARED-%s\n", pWal, walLockName(lockIdx)));
}
static int walLockExclusive(Wal *pWal, int lockIdx, int n){
  int rc;
  if( pWal->exclusiveMode ) return SQLITE_OK;
  rc = sqlite3OsShmLock(pWal->pDbFd, lockIdx, n,
                        SQLITE_SHM_LOCK | SQLITE_SHM_EXCLUSIVE);
  WALTRACE(("WAL%p: acquire EXCLUSIVE-%s cnt=%d %s\n", pWal,
            walLockName(lockIdx), n, rc ? "failed" : "ok"));
  VVA_ONLY( pWal->lockError = (rc!=SQLITE_OK && rc!=SQLITE_BUSY); )
  return rc;
}
static void walUnlockExclusive(Wal *pWal, int lockIdx, int n){
  if( pWal->exclusiveMode ) return;
  (void)sqlite3OsShmLock(pWal->pDbFd, lockIdx, n,
                         SQLITE_SHM_UNLOCK | SQLITE_SHM_EXCLUSIVE);
  WALTRACE(("WAL%p: release EXCLUSIVE-%s cnt=%d\n", pWal,
            walLockName(lockIdx), n));
}

/*
** Compute a hash on a page number.  The resulting hash value must land
** between 0 and (HASHTABLE_NSLOT-1).  The walNextHash() function advances
** the hash to the next value in the event of a collision.
*/
static int walHash(u32 iPage){
  assert( iPage>0 );
  assert( (HASHTABLE_NSLOT & (HASHTABLE_NSLOT-1))==0 );
  return (iPage*HASHTABLE_HASH_1) & (HASHTABLE_NSLOT-1);
}
static int walNextHash(int iPriorHash){
  return (iPriorHash+1)&(HASHTABLE_NSLOT-1);
}

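/*
** Illustrative sketch, not part of the original file: how a hash function
** and a "next hash" function of this shape combine into open addressing
** with linear probing.  The table size and multiplier are assumptions
** chosen to keep the example small; they are not taken from this file.
** All demo_* names are hypothetical.  As in the wal-index, a slot value of
** 0 means "empty" and a non-zero slot is a 1-based index into aPgno[].
*/

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSLOT   16     /* Assumed slot count.  Must be a power of two */
#define DEMO_HASH_1  383    /* Assumed multiplier, for illustration only */

static int demo_hash(uint32_t pgno){
  assert( pgno>0 );
  assert( (DEMO_NSLOT & (DEMO_NSLOT-1))==0 );
  return (int)((pgno*DEMO_HASH_1) & (DEMO_NSLOT-1));
}
static int demo_next_hash(int iPrior){
  return (iPrior+1) & (DEMO_NSLOT-1);
}

/* Record that entry idx (1-based) holds page pgno. */
static void demo_insert(
  uint16_t *aHash,          /* Hash table of DEMO_NSLOT slots */
  uint32_t *aPgno,          /* 1-based array of page numbers */
  int idx,                  /* Entry to write.  At most DEMO_NSLOT/2 */
  uint32_t pgno             /* Page number stored in entry idx */
){
  int k;
  assert( idx>=1 && idx<=DEMO_NSLOT/2 );
  /* Probe forward until an empty slot is found.  The table is never more
  ** than half full, so the loop always terminates. */
  for(k=demo_hash(pgno); aHash[k]; k=demo_next_hash(k)){}
  aPgno[idx] = pgno;
  aHash[k] = (uint16_t)idx;
}

/* Return the largest entry index holding pgno, or 0 if there is none. */
static int demo_lookup(
  const uint16_t *aHash,
  const uint32_t *aPgno,
  uint32_t pgno
){
  int k;
  int best = 0;
  for(k=demo_hash(pgno); aHash[k]; k=demo_next_hash(k)){
    if( aPgno[aHash[k]]==pgno && aHash[k]>best ) best = aHash[k];
  }
  return best;
}
```

/*
** The lookup keeps probing until it reaches an empty slot so that, when
** the same page appears more than once, the most recently added entry is
** the one returned.
*/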
/*
** Return pointers to the hash table and page number array stored on
** page iHash of the wal-index. The wal-index is broken into 32KB pages
** numbered starting from 0.
**
** Set output variable *paHash to point to the start of the hash table
** in the wal-index file. Set *piZero to one less than the frame
** number of the first frame indexed by this hash table. If a
** slot in the hash table is set to N, it refers to frame number
** (*piZero+N) in the log.
**
** Finally, set *paPgno so that (*paPgno)[1] is the page number of the
** first frame indexed by the hash table, frame (*piZero+1).
*/
static int walHashGet(
  Wal *pWal,                      /* WAL handle */
  int iHash,                      /* Find the iHash'th table */
  volatile ht_slot **paHash,      /* OUT: Pointer to hash index */
  volatile u32 **paPgno,          /* OUT: Pointer to page number array */
  u32 *piZero                     /* OUT: Frame associated with (*paPgno)[0] */
){
  int rc;                         /* Return code */
  volatile u32 *aPgno;

  rc = walIndexPage(pWal, iHash, &aPgno);
  assert( rc==SQLITE_OK || iHash>0 );

  if( rc==SQLITE_OK ){
    u32 iZero;
    volatile ht_slot *aHash;

    aHash = (volatile ht_slot *)&aPgno[HASHTABLE_NPAGE];
    if( iHash==0 ){
      aPgno = &aPgno[WALINDEX_HDR_SIZE/sizeof(u32)];
      iZero = 0;
    }else{
      iZero = HASHTABLE_NPAGE_ONE + (iHash-1)*HASHTABLE_NPAGE;
    }

    *paPgno = &aPgno[-1];
    *paHash = aHash;
    *piZero = iZero;
  }
  return rc;
}

/*
** Return the number of the wal-index page that contains the hash-table
** and page-number array that contain entries corresponding to WAL frame
** iFrame. The wal-index is broken up into 32KB pages. Wal-index pages
** are numbered starting from 0.
*/
static int walFramePage(u32 iFrame){
  int iHash = (iFrame+HASHTABLE_NPAGE-HASHTABLE_NPAGE_ONE-1) / HASHTABLE_NPAGE;
  assert( (iHash==0 || iFrame>HASHTABLE_NPAGE_ONE)
       && (iHash>=1 || iFrame<=HASHTABLE_NPAGE_ONE)
       && (iHash<=1 || iFrame>(HASHTABLE_NPAGE_ONE+HASHTABLE_NPAGE))
       && (iHash>=2 || iFrame<=HASHTABLE_NPAGE_ONE+HASHTABLE_NPAGE)
       && (iHash<=2 || iFrame>(HASHTABLE_NPAGE_ONE+2*HASHTABLE_NPAGE))
  );
  return iHash;
}

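/*
** Illustrative sketch, not part of the original file: the arithmetic that
** maps a frame number to a wal-index page and to a position within that
** page.  The first page holds fewer entries than the others because it
** also carries the wal-index header.  The constant values below are
** assumptions picked for the example, not values taken from this file,
** and the demo_* names are hypothetical.
*/

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NPAGE      4096   /* Assumed entries on pages 1, 2, 3, ... */
#define DEMO_HDR_WORDS  34     /* Assumed header size in 32-bit words */
#define DEMO_NPAGE_ONE  (DEMO_NPAGE - DEMO_HDR_WORDS)  /* Entries on page 0 */

/* Return the wal-index page that indexes frame iFrame (iFrame>=1). */
static int demo_frame_page(uint32_t iFrame){
  assert( iFrame>0 );
  return (int)((iFrame + DEMO_NPAGE - DEMO_NPAGE_ONE - 1) / DEMO_NPAGE);
}

/* Return one less than the first frame number indexed by page iPage. */
static uint32_t demo_izero(int iPage){
  assert( iPage>=0 );
  if( iPage==0 ) return 0;
  return DEMO_NPAGE_ONE + (uint32_t)(iPage-1)*DEMO_NPAGE;
}

/* Return the 1-based position of frame iFrame within its own page. */
static uint32_t demo_local_index(uint32_t iFrame){
  return iFrame - demo_izero(demo_frame_page(iFrame));
}
```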
/*
** Return the page number associated with frame iFrame in this WAL.
*/
static u32 walFramePgno(Wal *pWal, u32 iFrame){
  int iHash = walFramePage(iFrame);
  if( iHash==0 ){
    return pWal->apWiData[0][WALINDEX_HDR_SIZE/sizeof(u32) + iFrame - 1];
  }
  return pWal->apWiData[iHash][(iFrame-1-HASHTABLE_NPAGE_ONE)%HASHTABLE_NPAGE];
}

/*
** Remove entries from the hash table that point to WAL slots greater
** than pWal->hdr.mxFrame.
**
** This function is called whenever pWal->hdr.mxFrame is decreased due
** to a rollback or savepoint.
**
** At most, only the hash table containing pWal->hdr.mxFrame needs to be
** updated.  Any later hash tables will be automatically cleared when
** pWal->hdr.mxFrame advances to the point where those hash tables are
** actually needed.
*/
static void walCleanupHash(Wal *pWal){
  volatile ht_slot *aHash;        /* Pointer to hash table to clear */
  volatile u32 *aPgno;            /* Page number array for hash table */
  u32 iZero;                      /* frame == (aHash[x]+iZero) */
  int iLimit = 0;                 /* Zero values greater than this */
  int nByte;                      /* Number of bytes to zero in aPgno[] */
  int i;                          /* Used to iterate through aHash[] */

  assert( pWal->writeLock );
  testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE-1 );
  testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE );
  testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE+1 );

  if( pWal->hdr.mxFrame==0 ) return;

  /* Obtain pointers to the hash-table and page-number array containing
  ** the entry that corresponds to frame pWal->hdr.mxFrame. The wal-index
  ** page on which that hash-table and array reside is guaranteed to be
  ** mapped already.
  */
  assert( pWal->nWiData>walFramePage(pWal->hdr.mxFrame) );
  assert( pWal->apWiData[walFramePage(pWal->hdr.mxFrame)] );
  walHashGet(pWal, walFramePage(pWal->hdr.mxFrame), &aHash, &aPgno, &iZero);

  /* Zero all hash-table entries that correspond to frame numbers greater
  ** than pWal->hdr.mxFrame.
  */
  iLimit = pWal->hdr.mxFrame - iZero;
  assert( iLimit>0 );
  for(i=0; i<HASHTABLE_NSLOT; i++){
    if( aHash[i]>iLimit ){
      aHash[i] = 0;
    }
  }

  /* Zero the entries in the aPgno array that correspond to frames with
  ** frame numbers greater than pWal->hdr.mxFrame.
  */
  nByte = ((char *)aHash - (char *)&aPgno[iLimit+1]);
  memset((void *)&aPgno[iLimit+1], 0, nByte);

#ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT
  /* Verify that every entry in the mapping region is still reachable
  ** via the hash table even after the cleanup.
  */
  if( iLimit ){
    int i;           /* Loop counter */
    int iKey;        /* Hash key */
    for(i=1; i<=iLimit; i++){
      for(iKey=walHash(aPgno[i]); aHash[iKey]; iKey=walNextHash(iKey)){
        if( aHash[iKey]==i ) break;
      }
      assert( aHash[iKey]==i );
    }
  }
#endif /* SQLITE_ENABLE_EXPENSIVE_ASSERT */
}


/*
** Set an entry in the wal-index that will map database page number
** iPage into WAL frame iFrame.
*/
static int walIndexAppend(Wal *pWal, u32 iFrame, u32 iPage){
  int rc;                         /* Return code */
  u32 iZero;                      /* One less than frame number of aPgno[1] */
  volatile u32 *aPgno;            /* Page number array */
  volatile ht_slot *aHash;        /* Hash table */

  rc = walHashGet(pWal, walFramePage(iFrame), &aHash, &aPgno, &iZero);

  /* Assuming the wal-index file was successfully mapped, populate the
  ** page number array and hash table entry.
  */
  if( rc==SQLITE_OK ){
    int iKey;                     /* Hash table key */
    int idx;                      /* Value to write to hash-table slot */
    TESTONLY( int nCollide = 0;   /* Number of hash collisions */ )

    idx = iFrame - iZero;
    assert( idx <= HASHTABLE_NSLOT/2 + 1 );

    /* If this is the first entry to be added to this hash-table, zero the
    ** entire hash table and aPgno[] array before proceeding.
    */
    if( idx==1 ){
      int nByte = (u8 *)&aHash[HASHTABLE_NSLOT] - (u8 *)&aPgno[1];
      memset((void*)&aPgno[1], 0, nByte);
    }

    /* If the entry in aPgno[] is already set, then the previous writer
    ** must have exited unexpectedly in the middle of a transaction (after
    ** writing one or more dirty pages to the WAL to free up memory).
    ** Remove the remnants of that writer's uncommitted transaction from
    ** the hash-table before writing any new entries.
    */
    if( aPgno[idx] ){
      walCleanupHash(pWal);
      assert( !aPgno[idx] );
    }

    /* Write the aPgno[] array entry and the hash-table slot. */
    for(iKey=walHash(iPage); aHash[iKey]; iKey=walNextHash(iKey)){
      assert( nCollide++ < idx );
    }
    aPgno[idx] = iPage;
    aHash[iKey] = idx;

#ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT
    /* Verify that the number of entries in the hash table exactly equals
    ** the number of entries in the mapping region.
    */
    {
      int i;           /* Loop counter */
      int nEntry = 0;  /* Number of entries in the hash table */
      for(i=0; i<HASHTABLE_NSLOT; i++){ if( aHash[i] ) nEntry++; }
      assert( nEntry==idx );
    }

    /* Verify that every entry in the mapping region is reachable
    ** via the hash table.  This turns out to be a really, really expensive
    ** thing to check, so only do this occasionally - not on every
    ** iteration.
    */
    if( (idx&0x3ff)==0 ){
      int i;           /* Loop counter */
      for(i=1; i<=idx; i++){
        for(iKey=walHash(aPgno[i]); aHash[iKey]; iKey=walNextHash(iKey)){
          if( aHash[iKey]==i ) break;
        }
        assert( aHash[iKey]==i );
      }
    }
#endif /* SQLITE_ENABLE_EXPENSIVE_ASSERT */
  }


  return rc;
}


/*
** Recover the wal-index by reading the write-ahead log file.
**
** This routine first tries to establish an exclusive lock on the
** wal-index to prevent other threads/processes from doing anything
** with the WAL or wal-index while recovery is running.  The
** WAL_RECOVER_LOCK is also held so that other threads will know
** that this thread is running recovery.  If unable to establish
** the necessary locks, this routine returns SQLITE_BUSY.
*/
static int walIndexRecover(Wal *pWal){
  int rc;                         /* Return Code */
  i64 nSize;                      /* Size of log file */
  u32 aFrameCksum[2] = {0, 0};
  int iLock;                      /* Lock offset to lock for checkpoint */
  int nLock;                      /* Number of locks to hold */

  /* Obtain an exclusive lock on all bytes in the locking range not already
  ** locked by the caller. The caller is guaranteed to have locked the
  ** WAL_WRITE_LOCK byte, and may have also locked the WAL_CKPT_LOCK byte.
  ** If successful, the same bytes that are locked here are unlocked before
  ** this function returns.
  */
  assert( pWal->ckptLock==1 || pWal->ckptLock==0 );
  assert( WAL_ALL_BUT_WRITE==WAL_WRITE_LOCK+1 );
  assert( WAL_CKPT_LOCK==WAL_ALL_BUT_WRITE );
  assert( pWal->writeLock );
  iLock = WAL_ALL_BUT_WRITE + pWal->ckptLock;
  nLock = SQLITE_SHM_NLOCK - iLock;
  rc = walLockExclusive(pWal, iLock, nLock);
  if( rc ){
    return rc;
  }
  WALTRACE(("WAL%p: recovery begin...\n", pWal));

  memset(&pWal->hdr, 0, sizeof(WalIndexHdr));

  rc = sqlite3OsFileSize(pWal->pWalFd, &nSize);
  if( rc!=SQLITE_OK ){
    goto recovery_error;
  }

  if( nSize>WAL_HDRSIZE ){
    u8 aBuf[WAL_HDRSIZE];         /* Buffer to load WAL header into */
    u8 *aFrame = 0;               /* Malloc'd buffer to load entire frame */
    int szFrame;                  /* Number of bytes in buffer aFrame[] */
    u8 *aData;                    /* Pointer to data part of aFrame buffer */
    int iFrame;                   /* Index of last frame read */
    i64 iOffset;                  /* Next offset to read from log file */
    int szPage;                   /* Page size according to the log */
    u32 magic;                    /* Magic value read from WAL header */
    u32 version;                  /* Version number read from WAL header */

    /* Read in the WAL header. */
    rc = sqlite3OsRead(pWal->pWalFd, aBuf, WAL_HDRSIZE, 0);
    if( rc!=SQLITE_OK ){
      goto recovery_error;
    }

    /* If the database page size is not a power of two, is greater than
    ** SQLITE_MAX_PAGE_SIZE, or is less than 512, conclude that the WAL
    ** file contains no valid data. Similarly, if the 'magic' value is
    ** invalid, ignore the whole WAL file.
    */
    magic = sqlite3Get4byte(&aBuf[0]);
    szPage = sqlite3Get4byte(&aBuf[8]);
    if( (magic&0xFFFFFFFE)!=WAL_MAGIC
     || szPage&(szPage-1)
     || szPage>SQLITE_MAX_PAGE_SIZE
     || szPage<512
    ){
      goto finished;
    }
    pWal->hdr.bigEndCksum = (magic&0x00000001);
    pWal->szPage = szPage;
    pWal->nCkpt = sqlite3Get4byte(&aBuf[12]);
    memcpy(&pWal->hdr.aSalt, &aBuf[16], 8);

    /* Verify that the checksum stored in the last 8 bytes of the WAL
    ** header matches the checksum of the bytes that precede it.
    */
    walChecksumBytes(pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN,
        aBuf, WAL_HDRSIZE-2*4, 0, pWal->hdr.aFrameCksum
    );
    if( pWal->hdr.aFrameCksum[0]!=sqlite3Get4byte(&aBuf[24])
     || pWal->hdr.aFrameCksum[1]!=sqlite3Get4byte(&aBuf[28])
    ){
      goto finished;
    }

    /* Verify that the version number on the WAL format is one that
    ** this code understands.
    */
    version = sqlite3Get4byte(&aBuf[4]);
    if( version!=WAL_MAX_VERSION ){
      rc = SQLITE_CANTOPEN_BKPT;
      goto finished;
    }

    /* Malloc a buffer to read frames into. */
    szFrame = szPage + WAL_FRAME_HDRSIZE;
    aFrame = (u8 *)sqlite3_malloc(szFrame);
    if( !aFrame ){
      rc = SQLITE_NOMEM;
      goto recovery_error;
    }
    aData = &aFrame[WAL_FRAME_HDRSIZE];

    /* Read all frames from the log file. */
    iFrame = 0;
    for(iOffset=WAL_HDRSIZE; (iOffset+szFrame)<=nSize; iOffset+=szFrame){
      u32 pgno;                   /* Database page number for frame */
      u32 nTruncate;              /* dbsize field from frame header */
      int isValid;                /* True if this frame is valid */

      /* Read and decode the next log frame. */
      rc = sqlite3OsRead(pWal->pWalFd, aFrame, szFrame, iOffset);
      if( rc!=SQLITE_OK ) break;
      isValid = walDecodeFrame(pWal, &pgno, &nTruncate, aData, aFrame);
      if( !isValid ) break;
      rc = walIndexAppend(pWal, ++iFrame, pgno);
      if( rc!=SQLITE_OK ) break;

      /* If nTruncate is non-zero, this is a commit record. */
      if( nTruncate ){
        pWal->hdr.mxFrame = iFrame;
        pWal->hdr.nPage = nTruncate;
        pWal->hdr.szPage = szPage;
        aFrameCksum[0] = pWal->hdr.aFrameCksum[0];
        aFrameCksum[1] = pWal->hdr.aFrameCksum[1];
      }
    }

    sqlite3_free(aFrame);
  }

finished:
  if( rc==SQLITE_OK ){
    volatile WalCkptInfo *pInfo;
    int i;
    pWal->hdr.aFrameCksum[0] = aFrameCksum[0];
    pWal->hdr.aFrameCksum[1] = aFrameCksum[1];
    walIndexWriteHdr(pWal);

    /* Reset the checkpoint-header. This is safe because this thread is
    ** currently holding locks that exclude all other readers, writers and
    ** checkpointers.
    */
    pInfo = walCkptInfo(pWal);
    pInfo->nBackfill = 0;
    pInfo->aReadMark[0] = 0;
    for(i=1; i<WAL_NREADER; i++) pInfo->aReadMark[i] = READMARK_NOT_USED;
  }

recovery_error:
  WALTRACE(("WAL%p: recovery %s\n", pWal, rc ? "failed" : "ok"));
  walUnlockExclusive(pWal, iLock, nLock);
  return rc;
}

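/*
** Illustrative sketch, not part of the original file: the rule the recovery
** scan follows when deciding how many frames to keep.  Frames are read in
** order, the scan stops at the first invalid frame, and the recovered
** frame count only advances when a commit frame (non-zero nTruncate) is
** seen.  Valid frames that follow the last commit are therefore ignored.
** DemoFrame and demo_recover_mxframe() are hypothetical names, and frame
** validity is supplied as a flag instead of being computed from salts and
** checksums.
*/

```c
#include <assert.h>
#include <stdint.h>

typedef struct DemoFrame {
  int isValid;          /* True if salt, page number and checksum all pass */
  uint32_t nTruncate;   /* Database size in pages for commits, else 0 */
} DemoFrame;

/*
** Scan nFrame frames and return the number of the last committed frame,
** or 0 if no transaction committed.  Write the database size recorded by
** that commit into *pnPage.
*/
static uint32_t demo_recover_mxframe(
  const DemoFrame *aFrame,    /* Frames in the order they appear in the log */
  uint32_t nFrame,            /* Number of entries in aFrame[] */
  uint32_t *pnPage            /* OUT: Database size after the last commit */
){
  uint32_t i;
  uint32_t mxFrame = 0;
  assert( pnPage!=0 );
  *pnPage = 0;
  for(i=0; i<nFrame; i++){
    if( !aFrame[i].isValid ) break;     /* Stop at the first bad frame */
    if( aFrame[i].nTruncate ){          /* Commit frame */
      mxFrame = i+1;                    /* Frame numbers start at 1 */
      *pnPage = aFrame[i].nTruncate;
    }
  }
  return mxFrame;
}
```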
/*
** Close an open wal-index.
*/
static void walIndexClose(Wal *pWal, int isDelete){
  if( pWal->isWIndexOpen ){
    sqlite3OsShmClose(pWal->pDbFd, isDelete);
    pWal->isWIndexOpen = 0;
  }
}

dan7c246102010-04-12 19:00:29 +00001166/*
drh181e0912010-06-01 01:08:08 +00001167** Open a connection to the WAL file associated with database zDbName.
1168** The database file must already be opened on connection pDbFd.
dan3de777f2010-04-17 12:31:37 +00001169**
1170** A SHARED lock should be held on the database file when this function
1171** is called. The purpose of this SHARED lock is to prevent any other
drh181e0912010-06-01 01:08:08 +00001172** client from unlinking the WAL or wal-index file. If another process
dan3de777f2010-04-17 12:31:37 +00001173** were to do this just after this client opened one of these files, the
1174** system would be badly broken.
danef378022010-05-04 11:06:03 +00001175**
1176** If the log file is successfully opened, SQLITE_OK is returned and
1177** *ppWal is set to point to a new WAL handle. If an error occurs,
1178** an SQLite error code is returned and *ppWal is left unmodified.
dan7c246102010-04-12 19:00:29 +00001179*/
drhc438efd2010-04-26 00:19:45 +00001180int sqlite3WalOpen(
drh7ed91f22010-04-29 22:34:07 +00001181 sqlite3_vfs *pVfs, /* vfs module to open wal and wal-index */
drhd9e5c4f2010-05-12 18:01:39 +00001182 sqlite3_file *pDbFd, /* The open database file */
1183 const char *zDbName, /* Name of the database file */
drh7ed91f22010-04-29 22:34:07 +00001184 Wal **ppWal /* OUT: Allocated Wal handle */
dan7c246102010-04-12 19:00:29 +00001185){
danef378022010-05-04 11:06:03 +00001186 int rc; /* Return Code */
drh7ed91f22010-04-29 22:34:07 +00001187 Wal *pRet; /* Object to allocate and return */
dan7c246102010-04-12 19:00:29 +00001188 int flags; /* Flags passed to OsOpen() */
drhd9e5c4f2010-05-12 18:01:39 +00001189 char *zWal; /* Name of write-ahead log file */
dan7c246102010-04-12 19:00:29 +00001190 int nWal; /* Length of zWal in bytes */
1191
drhd9e5c4f2010-05-12 18:01:39 +00001192 assert( zDbName && zDbName[0] );
1193 assert( pDbFd );
dan7c246102010-04-12 19:00:29 +00001194
drh1b78eaf2010-05-25 13:40:03 +00001195 /* In the amalgamation, the os_unix.c and os_win.c source files come before
1196 ** this source file. Verify that the #defines of the locking byte offsets
1197 ** in os_unix.c and os_win.c agree with the WALINDEX_LOCK_OFFSET value.
1198 */
1199#ifdef WIN_SHM_BASE
1200 assert( WIN_SHM_BASE==WALINDEX_LOCK_OFFSET );
1201#endif
1202#ifdef UNIX_SHM_BASE
1203 assert( UNIX_SHM_BASE==WALINDEX_LOCK_OFFSET );
1204#endif
1205
1206
drh7ed91f22010-04-29 22:34:07 +00001207 /* Allocate an instance of struct Wal to return. */
1208 *ppWal = 0;
drh686138f2010-05-12 18:10:52 +00001209 nWal = sqlite3Strlen30(zDbName) + 5;
drhd9e5c4f2010-05-12 18:01:39 +00001210 pRet = (Wal*)sqlite3MallocZero(sizeof(Wal) + pVfs->szOsFile + nWal);
dan76ed3bc2010-05-03 17:18:24 +00001211 if( !pRet ){
1212 return SQLITE_NOMEM;
1213 }
1214
dan7c246102010-04-12 19:00:29 +00001215 pRet->pVfs = pVfs;
drhd9e5c4f2010-05-12 18:01:39 +00001216 pRet->pWalFd = (sqlite3_file *)&pRet[1];
1217 pRet->pDbFd = pDbFd;
drh73b64e42010-05-30 19:55:15 +00001218 pRet->readLock = -1;
drh7e263722010-05-20 21:21:09 +00001219 sqlite3_randomness(8, &pRet->hdr.aSalt);
drhd9e5c4f2010-05-12 18:01:39 +00001220 pRet->zWalName = zWal = pVfs->szOsFile + (char*)pRet->pWalFd;
1221 sqlite3_snprintf(nWal, zWal, "%s-wal", zDbName);
1222 rc = sqlite3OsShmOpen(pDbFd);
dan7c246102010-04-12 19:00:29 +00001223
drh7ed91f22010-04-29 22:34:07 +00001224 /* Open file handle on the write-ahead log file. */
dan76ed3bc2010-05-03 17:18:24 +00001225 if( rc==SQLITE_OK ){
drh73b64e42010-05-30 19:55:15 +00001226 pRet->isWIndexOpen = 1;
dan76ed3bc2010-05-03 17:18:24 +00001227 flags = (SQLITE_OPEN_READWRITE|SQLITE_OPEN_CREATE|SQLITE_OPEN_MAIN_JOURNAL);
drhd9e5c4f2010-05-12 18:01:39 +00001228 rc = sqlite3OsOpen(pVfs, zWal, pRet->pWalFd, flags, &flags);
dan76ed3bc2010-05-03 17:18:24 +00001229 }
dan7c246102010-04-12 19:00:29 +00001230
dan7c246102010-04-12 19:00:29 +00001231 if( rc!=SQLITE_OK ){
dan1018e902010-05-05 15:33:05 +00001232 walIndexClose(pRet, 0);
drhd9e5c4f2010-05-12 18:01:39 +00001233 sqlite3OsClose(pRet->pWalFd);
danef378022010-05-04 11:06:03 +00001234 sqlite3_free(pRet);
1235 }else{
1236 *ppWal = pRet;
drhc74c3332010-05-31 12:15:19 +00001237 WALTRACE(("WAL%p: opened\n", pRet));
dan7c246102010-04-12 19:00:29 +00001238 }
dan7c246102010-04-12 19:00:29 +00001239 return rc;
1240}
1241
drha2a42012010-05-18 18:01:08 +00001242/*
1243** Find the smallest page number out of all pages held in the WAL that
1244** has not been returned by any prior invocation of this method on the
1245** same WalIterator object. Write into *piFrame the frame index where
1246** that page was last written into the WAL. Write into *piPage the page
1247** number.
1248**
1249** Return 0 on success. If there are no pages in the WAL with a page
1250** number larger than *piPage, then return 1.
1251*/
drh7ed91f22010-04-29 22:34:07 +00001252static int walIteratorNext(
1253 WalIterator *p, /* Iterator */
drha2a42012010-05-18 18:01:08 +00001254 u32 *piPage, /* OUT: The page number of the next page */
1255 u32 *piFrame /* OUT: Wal frame index of next page */
dan7c246102010-04-12 19:00:29 +00001256){
drha2a42012010-05-18 18:01:08 +00001257 u32 iMin; /* Result pgno must be greater than iMin */
1258 u32 iRet = 0xFFFFFFFF; /* 0xffffffff is never a valid page number */
1259 int i; /* For looping through segments */
dan7c246102010-04-12 19:00:29 +00001260
drha2a42012010-05-18 18:01:08 +00001261 iMin = p->iPrior;
1262 assert( iMin<0xffffffff );
dan7c246102010-04-12 19:00:29 +00001263 for(i=p->nSegment-1; i>=0; i--){
drh7ed91f22010-04-29 22:34:07 +00001264 struct WalSegment *pSegment = &p->aSegment[i];
dan13a3cb82010-06-11 19:04:21 +00001265 while( pSegment->iNext<pSegment->nEntry ){
drha2a42012010-05-18 18:01:08 +00001266 u32 iPg = pSegment->aPgno[pSegment->aIndex[pSegment->iNext]];
dan7c246102010-04-12 19:00:29 +00001267 if( iPg>iMin ){
1268 if( iPg<iRet ){
1269 iRet = iPg;
dan13a3cb82010-06-11 19:04:21 +00001270 *piFrame = pSegment->iZero + pSegment->aIndex[pSegment->iNext];
dan7c246102010-04-12 19:00:29 +00001271 }
1272 break;
1273 }
1274 pSegment->iNext++;
1275 }
dan7c246102010-04-12 19:00:29 +00001276 }
1277
drha2a42012010-05-18 18:01:08 +00001278 *piPage = p->iPrior = iRet;
dan7c246102010-04-12 19:00:29 +00001279 return (iRet==0xFFFFFFFF);
1280}
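
/*
** Illustrative sketch only, not part of SQLite. It models the selection
** rule used by walIteratorNext() above on simplified data: each segment
** holds page numbers already sorted in ascending order, segments are
** scanned from newest to oldest, and the smallest page number greater
** than the previously returned one wins. On a tie the newer segment is
** kept because the comparison is strict. The names SketchSegment and
** sketchIteratorNext, and the use of a parallel aFrame[] array in place
** of iZero+aIndex[], are assumptions made for this example.
*/

```c
#include <assert.h>

/* Hypothetical stand-in for one WAL segment */
typedef struct SketchSegment {
  const unsigned int *aPgno;   /* Page numbers, strictly ascending */
  const unsigned int *aFrame;  /* Frame index that holds each page */
  int nEntry;                  /* Number of entries in the arrays */
  int iNext;                   /* Next entry not yet consumed */
} SketchSegment;

/* Return 0 and write the next page/frame, or return 1 when exhausted.
** *piPrior holds the last page returned (0 before the first call). */
static int sketchIteratorNext(
  SketchSegment *aSeg,         /* Segments, oldest first */
  int nSeg,                    /* Number of segments */
  unsigned int *piPrior,       /* IN/OUT: previous / next page number */
  unsigned int *piFrame        /* OUT: frame holding the next page */
){
  unsigned int iMin = *piPrior;
  unsigned int iRet = 0xFFFFFFFF;   /* Treated as "no page found" */
  int i;

  assert( iMin<0xFFFFFFFF );
  for(i=nSeg-1; i>=0; i--){         /* Newest segment first */
    SketchSegment *p = &aSeg[i];
    while( p->iNext<p->nEntry ){
      unsigned int iPg = p->aPgno[p->iNext];
      if( iPg>iMin ){
        if( iPg<iRet ){             /* Strict: newer segment wins ties */
          iRet = iPg;
          *piFrame = p->aFrame[p->iNext];
        }
        break;
      }
      p->iNext++;                   /* Skip pages already returned */
    }
  }
  *piPrior = iRet;
  return iRet==0xFFFFFFFF;
}
```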
1281
dan7c246102010-04-12 19:00:29 +00001282
dan13a3cb82010-06-11 19:04:21 +00001283static void walMergesort(
1284 u32 *aContent, /* Pages in wal */
dan067f3162010-06-14 10:30:12 +00001285 ht_slot *aBuffer, /* Buffer of at least *pnList items to use */
1286 ht_slot *aList, /* IN/OUT: List to sort */
drha2a42012010-05-18 18:01:08 +00001287 int *pnList /* IN/OUT: Number of elements in aList[] */
1288){
1289 int nList = *pnList;
1290 if( nList>1 ){
1291 int nLeft = nList / 2; /* Elements in left list */
1292 int nRight = nList - nLeft; /* Elements in right list */
drha2a42012010-05-18 18:01:08 +00001293 int iLeft = 0; /* Current index in aLeft */
1294 int iRight = 0; /* Current index in aRight */
1295 int iOut = 0; /* Current index in output buffer */
dan067f3162010-06-14 10:30:12 +00001296 ht_slot *aLeft = aList; /* Left list */
1297 ht_slot *aRight = aList+nLeft;/* Right list */
drha2a42012010-05-18 18:01:08 +00001298
1299 /* TODO: Change to non-recursive version. */
dan13a3cb82010-06-11 19:04:21 +00001300 walMergesort(aContent, aBuffer, aLeft, &nLeft);
1301 walMergesort(aContent, aBuffer, aRight, &nRight);
drha2a42012010-05-18 18:01:08 +00001302
1303 while( iRight<nRight || iLeft<nLeft ){
dan067f3162010-06-14 10:30:12 +00001304 ht_slot logpage;
drha2a42012010-05-18 18:01:08 +00001305 Pgno dbpage;
1306
1307 if( (iLeft<nLeft)
1308 && (iRight>=nRight || aContent[aLeft[iLeft]]<aContent[aRight[iRight]])
1309 ){
1310 logpage = aLeft[iLeft++];
1311 }else{
1312 logpage = aRight[iRight++];
1313 }
1314 dbpage = aContent[logpage];
1315
1316 aBuffer[iOut++] = logpage;
1317 if( iLeft<nLeft && aContent[aLeft[iLeft]]==dbpage ) iLeft++;
1318
1319 assert( iLeft>=nLeft || aContent[aLeft[iLeft]]>dbpage );
1320 assert( iRight>=nRight || aContent[aRight[iRight]]>dbpage );
1321 }
1322 memcpy(aList, aBuffer, sizeof(aList[0])*iOut);
1323 *pnList = iOut;
1324 }
1325
1326#ifdef SQLITE_DEBUG
1327 {
1328 int i;
1329 for(i=1; i<*pnList; i++){
1330 assert( aContent[aList[i]] > aContent[aList[i-1]] );
1331 }
1332 }
1333#endif
1334}
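
/*
** Illustrative sketch only, not part of SQLite. It reproduces the idea
** behind walMergesort() above on plain arrays: a list of indexes is
** merge-sorted by the key each index refers to, and when the same key
** appears in both halves only the entry from the right half (the later
** index) is kept. The name sketchMergeDedup and the parameter types are
** assumptions made for this example.
*/

```c
#include <assert.h>
#include <string.h>

/* Sort aList[0..*pnList-1] so that aKey[aList[i]] is strictly ascending.
** Duplicate keys are collapsed, keeping the entry from the right half.
** aBuffer must have room for at least *pnList entries. */
static void sketchMergeDedup(
  const unsigned int *aKey,    /* Keys, addressed through aList[] */
  unsigned short *aBuffer,     /* Scratch space */
  unsigned short *aList,       /* IN/OUT: indexes to sort */
  int *pnList                  /* IN/OUT: number of entries in aList[] */
){
  int nList = *pnList;
  if( nList>1 ){
    int nLeft = nList/2;
    int nRight = nList - nLeft;
    int iLeft = 0, iRight = 0, iOut = 0;
    unsigned short *aLeft = aList;
    unsigned short *aRight = aList + nLeft;

    /* Sort (and de-duplicate) each half in place */
    sketchMergeDedup(aKey, aBuffer, aLeft, &nLeft);
    sketchMergeDedup(aKey, aBuffer, aRight, &nRight);

    while( iLeft<nLeft || iRight<nRight ){
      unsigned short pick;
      if( iLeft<nLeft
       && (iRight>=nRight || aKey[aLeft[iLeft]]<aKey[aRight[iRight]])
      ){
        pick = aLeft[iLeft++];
      }else{
        pick = aRight[iRight++];     /* Right half wins on equal keys */
      }
      aBuffer[iOut++] = pick;
      /* Drop the left-hand duplicate of the key just emitted, if any */
      if( iLeft<nLeft && aKey[aLeft[iLeft]]==aKey[pick] ) iLeft++;
    }
    assert( iOut<=nList );
    memcpy(aList, aBuffer, sizeof(aList[0])*iOut);
    *pnList = iOut;
  }
}
```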
1335
dan5d656852010-06-14 07:53:26 +00001336/*
1337** Free an iterator allocated by walIteratorInit().
1338*/
1339static void walIteratorFree(WalIterator *p){
1340 sqlite3_free(p);
1341}
1342
drha2a42012010-05-18 18:01:08 +00001343/*
1344** Map the wal-index into memory owned by this thread, if it is not
1345** mapped already. Then construct a WalIterator object that can be
1346** used to loop over all pages in the WAL in ascending order.
1347**
1348** On success, make *pp point to the newly allocated WalIterator object
1349** and return SQLITE_OK. Otherwise, leave *pp unchanged and return an error
1350** code.
1351**
1352** The calling routine should invoke walIteratorFree() to destroy the
1353** WalIterator object when it has finished with it. The caller must
1354** also unmap the wal-index. But the wal-index must not be unmapped
1355** prior to the WalIterator object being destroyed.
1356*/
1357static int walIteratorInit(Wal *pWal, WalIterator **pp){
dan067f3162010-06-14 10:30:12 +00001358 WalIterator *p; /* Return value */
1359 int nSegment; /* Number of segments to merge */
1360 u32 iLast; /* Last frame in log */
1361 int nByte; /* Number of bytes to allocate */
1362 int i; /* Iterator variable */
1363 ht_slot *aTmp; /* Temp space used by merge-sort */
1364 ht_slot *aSpace; /* Space at the end of the allocation */
drha2a42012010-05-18 18:01:08 +00001365
1366 /* This routine only runs while holding SQLITE_SHM_CHECKPOINT. No other
1367 ** thread is able to write to shared memory while this routine is
1368 ** running (or, indeed, while the WalIterator object exists). Hence,
dan13a3cb82010-06-11 19:04:21 +00001369 ** we can cast off the volatile qualification from shared memory.
drha2a42012010-05-18 18:01:08 +00001370 */
dan1beb9392010-05-31 12:02:30 +00001371 assert( pWal->ckptLock );
dan13a3cb82010-06-11 19:04:21 +00001372 iLast = pWal->hdr.mxFrame;
drha2a42012010-05-18 18:01:08 +00001373
1374 /* Allocate space for the WalIterator object */
dan13a3cb82010-06-11 19:04:21 +00001375 nSegment = walFramePage(iLast) + 1;
1376 nByte = sizeof(WalIterator)
1377 + nSegment*(sizeof(struct WalSegment))
dan067f3162010-06-14 10:30:12 +00001378 + (nSegment+1)*(HASHTABLE_NPAGE * sizeof(ht_slot));
drh7ed91f22010-04-29 22:34:07 +00001379 p = (WalIterator *)sqlite3_malloc(nByte);
dan8f6097c2010-05-06 07:43:58 +00001380 if( !p ){
drha2a42012010-05-18 18:01:08 +00001381 return SQLITE_NOMEM;
1382 }
1383 memset(p, 0, nByte);
dan76ed3bc2010-05-03 17:18:24 +00001384
dan13a3cb82010-06-11 19:04:21 +00001385 /* Populate the WalIterator: sort the frames of each segment by page number */
drha2a42012010-05-18 18:01:08 +00001386 p->nSegment = nSegment;
dan067f3162010-06-14 10:30:12 +00001387 aSpace = (ht_slot *)&p->aSegment[nSegment];
dan13a3cb82010-06-11 19:04:21 +00001388 aTmp = &aSpace[HASHTABLE_NPAGE*nSegment];
drha2a42012010-05-18 18:01:08 +00001389 for(i=0; i<nSegment; i++){
dan067f3162010-06-14 10:30:12 +00001390 volatile ht_slot *aHash;
drha2a42012010-05-18 18:01:08 +00001391 int j;
dan13a3cb82010-06-11 19:04:21 +00001392 u32 iZero;
1393 int nEntry;
1394 volatile u32 *aPgno;
dan4280eb32010-06-12 12:02:35 +00001395 int rc;
dan13a3cb82010-06-11 19:04:21 +00001396
dan4280eb32010-06-12 12:02:35 +00001397 rc = walHashGet(pWal, i, &aHash, &aPgno, &iZero);
1398 if( rc!=SQLITE_OK ){
dan5d656852010-06-14 07:53:26 +00001399 walIteratorFree(p);
dan4280eb32010-06-12 12:02:35 +00001400 return rc;
dan13a3cb82010-06-11 19:04:21 +00001401 }
dand60bf112010-06-14 11:18:50 +00001402 aPgno++;
1403 nEntry = ((i+1)==nSegment)?iLast-iZero:(u32 *)aHash-(u32 *)aPgno;
dan13a3cb82010-06-11 19:04:21 +00001404 iZero++;
dan13a3cb82010-06-11 19:04:21 +00001405
1406 for(j=0; j<nEntry; j++){
drha2a42012010-05-18 18:01:08 +00001407 aSpace[j] = j;
dan76ed3bc2010-05-03 17:18:24 +00001408 }
dan13a3cb82010-06-11 19:04:21 +00001409 walMergesort((u32 *)aPgno, aTmp, aSpace, &nEntry);
1410 p->aSegment[i].iZero = iZero;
1411 p->aSegment[i].nEntry = nEntry;
1412 p->aSegment[i].aIndex = aSpace;
1413 p->aSegment[i].aPgno = (u32 *)aPgno;
1414 aSpace += HASHTABLE_NPAGE;
dan7c246102010-04-12 19:00:29 +00001415 }
dan13a3cb82010-06-11 19:04:21 +00001416 assert( aSpace==aTmp );
dan7c246102010-04-12 19:00:29 +00001417
dan13a3cb82010-06-11 19:04:21 +00001418 /* Return the fully initialized WalIterator object */
dan8f6097c2010-05-06 07:43:58 +00001419 *pp = p;
drha2a42012010-05-18 18:01:08 +00001420 return SQLITE_OK ;
dan7c246102010-04-12 19:00:29 +00001421}
1422
dan7c246102010-04-12 19:00:29 +00001423/*
drh73b64e42010-05-30 19:55:15 +00001424** Copy as much content as we can from the WAL back into the database file
1425** in response to an sqlite3_wal_checkpoint() request or the equivalent.
1426**
1427** The amount of information copied from WAL to database might be limited
1428** by active readers. This routine will never overwrite a database page
1429** that a concurrent reader might be using.
1430**
1431** All I/O barrier operations (a.k.a. fsyncs) occur in this routine when
1432** SQLite is in WAL-mode in synchronous=NORMAL. That means that if
1433** checkpoints are always run by a background thread or background
1434** process, foreground threads will never block on a lengthy fsync call.
1435**
1436** Fsync is called on the WAL before writing content out of the WAL and
1437** into the database. This ensures that the new content is persistent
1438** in the WAL and can be recovered following a power-loss or hard reset.
1439**
1440** Fsync is also called on the database file if (and only if) the entire
1441** WAL content is copied into the database file. This second fsync makes
1442** it safe to delete the WAL since the new content will persist in the
1443** database file.
1444**
1445** This routine uses and updates the nBackfill field of the wal-index header.
1446** This is the only routine that will increase the value of nBackfill.
1447** (A WAL reset or recovery will revert nBackfill to zero, but not increase
1448** its value.)
1449**
1450** The caller must be holding sufficient locks to ensure that no other
1451** checkpoint is running (in any other thread or process) at the same
1452** time.
dan7c246102010-04-12 19:00:29 +00001453*/
drh7ed91f22010-04-29 22:34:07 +00001454static int walCheckpoint(
1455 Wal *pWal, /* Wal connection */
danc5118782010-04-17 17:34:41 +00001456 int sync_flags, /* Flags for OsSync() (or 0) */
danb6e099a2010-05-04 14:47:39 +00001457 int nBuf, /* Size of zBuf in bytes */
dan7c246102010-04-12 19:00:29 +00001458 u8 *zBuf /* Temporary buffer to use */
1459){
1460 int rc; /* Return code */
drh6e810962010-05-19 17:49:50 +00001461 int szPage = pWal->hdr.szPage; /* Database page-size */
drh7ed91f22010-04-29 22:34:07 +00001462 WalIterator *pIter = 0; /* Wal iterator context */
dan7c246102010-04-12 19:00:29 +00001463 u32 iDbpage = 0; /* Next database page to write */
drh7ed91f22010-04-29 22:34:07 +00001464 u32 iFrame = 0; /* Wal frame containing data for iDbpage */
drh73b64e42010-05-30 19:55:15 +00001465 u32 mxSafeFrame; /* Max frame that can be backfilled */
1466 int i; /* Loop counter */
drh73b64e42010-05-30 19:55:15 +00001467 volatile WalCkptInfo *pInfo; /* The checkpoint status information */
dan7c246102010-04-12 19:00:29 +00001468
1469 /* Allocate the iterator */
dan8f6097c2010-05-06 07:43:58 +00001470 rc = walIteratorInit(pWal, &pIter);
drh027a1282010-05-19 01:53:53 +00001471 if( rc!=SQLITE_OK || pWal->hdr.mxFrame==0 ){
dan83f42d12010-06-04 10:37:05 +00001472 goto walcheckpoint_out;
danb6e099a2010-05-04 14:47:39 +00001473 }
1474
drh73b64e42010-05-30 19:55:15 +00001475 /*** TODO: Move this test out to the caller. Make it an assert() here ***/
drh6e810962010-05-19 17:49:50 +00001476 if( pWal->hdr.szPage!=nBuf ){
dan83f42d12010-06-04 10:37:05 +00001477 rc = SQLITE_CORRUPT_BKPT;
1478 goto walcheckpoint_out;
danb6e099a2010-05-04 14:47:39 +00001479 }
1480
drh73b64e42010-05-30 19:55:15 +00001481 /* Compute in mxSafeFrame the index of the last frame of the WAL that is
1482 ** safe to write into the database. Frames beyond mxSafeFrame might
1483 ** overwrite database pages that are in use by active readers and thus
1484 ** cannot be backfilled from the WAL.
1485 */
dand54ff602010-05-31 11:16:30 +00001486 mxSafeFrame = pWal->hdr.mxFrame;
dan13a3cb82010-06-11 19:04:21 +00001487 pInfo = walCkptInfo(pWal);
drh73b64e42010-05-30 19:55:15 +00001488 for(i=1; i<WAL_NREADER; i++){
1489 u32 y = pInfo->aReadMark[i];
drhdb7f6472010-06-09 14:45:12 +00001490 if( mxSafeFrame>=y ){
dan83f42d12010-06-04 10:37:05 +00001491 assert( y<=pWal->hdr.mxFrame );
1492 rc = walLockExclusive(pWal, WAL_READ_LOCK(i), 1);
1493 if( rc==SQLITE_OK ){
drhdb7f6472010-06-09 14:45:12 +00001494 pInfo->aReadMark[i] = READMARK_NOT_USED;
drh73b64e42010-05-30 19:55:15 +00001495 walUnlockExclusive(pWal, WAL_READ_LOCK(i), 1);
drh2d37e1c2010-06-02 20:38:20 +00001496 }else if( rc==SQLITE_BUSY ){
drhdb7f6472010-06-09 14:45:12 +00001497 mxSafeFrame = y;
drh2d37e1c2010-06-02 20:38:20 +00001498 }else{
dan83f42d12010-06-04 10:37:05 +00001499 goto walcheckpoint_out;
drh73b64e42010-05-30 19:55:15 +00001500 }
1501 }
danc5118782010-04-17 17:34:41 +00001502 }
dan7c246102010-04-12 19:00:29 +00001503
drh73b64e42010-05-30 19:55:15 +00001504 if( pInfo->nBackfill<mxSafeFrame
1505 && (rc = walLockExclusive(pWal, WAL_READ_LOCK(0), 1))==SQLITE_OK
1506 ){
1507 u32 nBackfill = pInfo->nBackfill;
1508
1509 /* Sync the WAL to disk */
1510 if( sync_flags ){
1511 rc = sqlite3OsSync(pWal->pWalFd, sync_flags);
1512 }
1513
1514 /* Iterate through the contents of the WAL, copying data to the db file. */
1515 while( rc==SQLITE_OK && 0==walIteratorNext(pIter, &iDbpage, &iFrame) ){
dan13a3cb82010-06-11 19:04:21 +00001516 assert( walFramePgno(pWal, iFrame)==iDbpage );
drh73b64e42010-05-30 19:55:15 +00001517 if( iFrame<=nBackfill || iFrame>mxSafeFrame ) continue;
1518 rc = sqlite3OsRead(pWal->pWalFd, zBuf, szPage,
1519 walFrameOffset(iFrame, szPage) + WAL_FRAME_HDRSIZE
1520 );
1521 if( rc!=SQLITE_OK ) break;
1522 rc = sqlite3OsWrite(pWal->pDbFd, zBuf, szPage, (iDbpage-1)*szPage);
1523 if( rc!=SQLITE_OK ) break;
1524 }
1525
1526 /* If work was actually accomplished... */
dand764c7d2010-06-04 11:56:22 +00001527 if( rc==SQLITE_OK ){
dan4280eb32010-06-12 12:02:35 +00001528 if( mxSafeFrame==walIndexHdr(pWal)->mxFrame ){
drh73b64e42010-05-30 19:55:15 +00001529 rc = sqlite3OsTruncate(pWal->pDbFd, ((i64)pWal->hdr.nPage*(i64)szPage));
1530 if( rc==SQLITE_OK && sync_flags ){
1531 rc = sqlite3OsSync(pWal->pDbFd, sync_flags);
1532 }
1533 }
dand764c7d2010-06-04 11:56:22 +00001534 if( rc==SQLITE_OK ){
1535 pInfo->nBackfill = mxSafeFrame;
1536 }
drh73b64e42010-05-30 19:55:15 +00001537 }
1538
1539 /* Release the reader lock held while backfilling */
1540 walUnlockExclusive(pWal, WAL_READ_LOCK(0), 1);
drh2d37e1c2010-06-02 20:38:20 +00001541 }else if( rc==SQLITE_BUSY ){
drh34116ea2010-05-31 12:30:52 +00001542 /* Reset the return code so as not to report a checkpoint failure
1543 ** just because active readers prevent any backfill.
1544 */
1545 rc = SQLITE_OK;
dan7c246102010-04-12 19:00:29 +00001546 }
1547
dan83f42d12010-06-04 10:37:05 +00001548 walcheckpoint_out:
drh7ed91f22010-04-29 22:34:07 +00001549 walIteratorFree(pIter);
dan7c246102010-04-12 19:00:29 +00001550 return rc;
1551}
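
/*
** Illustrative sketch only, not part of SQLite. It models how
** walCheckpoint() above derives mxSafeFrame from the read marks: a mark
** at or below the current limit either belongs to an idle slot (the
** exclusive lock is obtained and the slot is marked unused) or to an
** active reader (the lock attempt reports busy and the limit drops to
** that mark). Locking is replaced here by a caller-supplied aBusy[]
** array. The names sketchSafeFrame and SKETCH_NOT_USED, and the value
** chosen for the latter, are assumptions made for this example.
*/

```c
#include <assert.h>

#define SKETCH_NOT_USED 0xFFFFFFFFu  /* Hypothetical "slot unused" marker */

/* Return the highest frame that may be copied into the database.
** aBusy[i] is non-zero if reader slot i is held by an active reader. */
static unsigned int sketchSafeFrame(
  unsigned int mxFrame,        /* Last valid frame in the WAL */
  unsigned int *aMark,         /* IN/OUT: read marks, slot 0 ignored */
  const int *aBusy,            /* Which slots are held by readers */
  int nReader                  /* Number of slots */
){
  unsigned int mxSafe = mxFrame;
  int i;
  for(i=1; i<nReader; i++){
    unsigned int y = aMark[i];
    if( mxSafe>=y ){
      assert( y<=mxFrame );
      if( aBusy[i] ){
        mxSafe = y;                  /* A reader still needs frames > y */
      }else{
        aMark[i] = SKETCH_NOT_USED;  /* Idle slot: retire the mark */
      }
    }
  }
  return mxSafe;
}
```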
1552
1553/*
1554** Close a connection to a log file.
1555*/
drhc438efd2010-04-26 00:19:45 +00001556int sqlite3WalClose(
drh7ed91f22010-04-29 22:34:07 +00001557 Wal *pWal, /* Wal to close */
danc5118782010-04-17 17:34:41 +00001558 int sync_flags, /* Flags to pass to OsSync() (or 0) */
danb6e099a2010-05-04 14:47:39 +00001559 int nBuf,
1560 u8 *zBuf /* Buffer of at least nBuf bytes */
dan7c246102010-04-12 19:00:29 +00001561){
1562 int rc = SQLITE_OK;
drh7ed91f22010-04-29 22:34:07 +00001563 if( pWal ){
dan30c86292010-04-30 16:24:46 +00001564 int isDelete = 0; /* True to unlink wal and wal-index files */
1565
1566 /* If an EXCLUSIVE lock can be obtained on the database file (using the
1567 ** ordinary, rollback-mode locking methods), this guarantees that the
1568 ** connection associated with this log file is the only connection to
1569 ** the database. In this case checkpoint the database and unlink both
1570 ** the wal and wal-index files.
1571 **
1572 ** The EXCLUSIVE lock is not released before returning.
1573 */
drhd9e5c4f2010-05-12 18:01:39 +00001574 rc = sqlite3OsLock(pWal->pDbFd, SQLITE_LOCK_EXCLUSIVE);
dan30c86292010-04-30 16:24:46 +00001575 if( rc==SQLITE_OK ){
drh73b64e42010-05-30 19:55:15 +00001576 pWal->exclusiveMode = 1;
dan1beb9392010-05-31 12:02:30 +00001577 rc = sqlite3WalCheckpoint(pWal, sync_flags, nBuf, zBuf);
dan30c86292010-04-30 16:24:46 +00001578 if( rc==SQLITE_OK ){
1579 isDelete = 1;
1580 }
dan30c86292010-04-30 16:24:46 +00001581 }
1582
dan1018e902010-05-05 15:33:05 +00001583 walIndexClose(pWal, isDelete);
drhd9e5c4f2010-05-12 18:01:39 +00001584 sqlite3OsClose(pWal->pWalFd);
dan30c86292010-04-30 16:24:46 +00001585 if( isDelete ){
drhd9e5c4f2010-05-12 18:01:39 +00001586 sqlite3OsDelete(pWal->pVfs, pWal->zWalName, 0);
dan30c86292010-04-30 16:24:46 +00001587 }
drhc74c3332010-05-31 12:15:19 +00001588 WALTRACE(("WAL%p: closed\n", pWal));
dan13a3cb82010-06-11 19:04:21 +00001589 sqlite3_free(pWal->apWiData);
drh7ed91f22010-04-29 22:34:07 +00001590 sqlite3_free(pWal);
dan7c246102010-04-12 19:00:29 +00001591 }
1592 return rc;
1593}
1594
1595/*
drha2a42012010-05-18 18:01:08 +00001596** Try to read the wal-index header. Return 0 on success and 1 if
1597** there is a problem.
1598**
1599** The wal-index is in shared memory. Another thread or process might
1600** be writing the header at the same time this procedure is trying to
1601** read it, which might result in inconsistency. A dirty read is detected
drh73b64e42010-05-30 19:55:15 +00001602** by verifying that both copies of the header are the same and also by
1603** a checksum on the header.
drha2a42012010-05-18 18:01:08 +00001604**
1605** If and only if the read is consistent and the header is different from
1606** pWal->hdr, then pWal->hdr is updated to the content of the new header
1607** and *pChanged is set to 1.
danb9bf16b2010-04-14 11:23:30 +00001608**
dan84670502010-05-07 05:46:23 +00001609** If the checksum cannot be verified return non-zero. If the header
1610** is read successfully and the checksum verified, return zero.
danb9bf16b2010-04-14 11:23:30 +00001611*/
dan84670502010-05-07 05:46:23 +00001612int walIndexTryHdr(Wal *pWal, int *pChanged){
dan4280eb32010-06-12 12:02:35 +00001613 u32 aCksum[2]; /* Checksum on the header content */
1614 WalIndexHdr h1, h2; /* Two copies of the header content */
1615 WalIndexHdr volatile *aHdr; /* Header in shared memory */
danb9bf16b2010-04-14 11:23:30 +00001616
dan4280eb32010-06-12 12:02:35 +00001617 /* The first page of the wal-index must be mapped at this point. */
1618 assert( pWal->nWiData>0 && pWal->apWiData[0] );
drh79e6c782010-04-30 02:13:26 +00001619
drh73b64e42010-05-30 19:55:15 +00001620 /* Read the header. This might happen concurrently with a write to the
1621 ** same area of shared memory on a different CPU in a SMP,
1622 ** meaning it is possible that an inconsistent snapshot is read
dan84670502010-05-07 05:46:23 +00001623 ** from the file. If this happens, return non-zero.
drhf0b20f82010-05-21 13:16:18 +00001624 **
1625 ** There are two copies of the header at the beginning of the wal-index.
1626 ** When reading, read [0] first then [1]. Writes are in the reverse order.
1627 ** Memory barriers are used to prevent the compiler or the hardware from
1628 ** reordering the reads and writes.
danb9bf16b2010-04-14 11:23:30 +00001629 */
dan4280eb32010-06-12 12:02:35 +00001630 aHdr = walIndexHdr(pWal);
1631 memcpy(&h1, (void *)&aHdr[0], sizeof(h1));
drh286a2882010-05-20 23:51:06 +00001632 sqlite3OsShmBarrier(pWal->pDbFd);
dan4280eb32010-06-12 12:02:35 +00001633 memcpy(&h2, (void *)&aHdr[1], sizeof(h2));
drh286a2882010-05-20 23:51:06 +00001634
drhf0b20f82010-05-21 13:16:18 +00001635 if( memcmp(&h1, &h2, sizeof(h1))!=0 ){
1636 return 1; /* Dirty read */
drh286a2882010-05-20 23:51:06 +00001637 }
drh4b82c382010-05-31 18:24:19 +00001638 if( h1.isInit==0 ){
drhf0b20f82010-05-21 13:16:18 +00001639 return 1; /* Malformed header - probably all zeros */
1640 }
danb8fd6c22010-05-24 10:39:36 +00001641 walChecksumBytes(1, (u8*)&h1, sizeof(h1)-sizeof(h1.aCksum), 0, aCksum);
drhf0b20f82010-05-21 13:16:18 +00001642 if( aCksum[0]!=h1.aCksum[0] || aCksum[1]!=h1.aCksum[1] ){
1643 return 1; /* Checksum does not match */
danb9bf16b2010-04-14 11:23:30 +00001644 }
1645
drhf0b20f82010-05-21 13:16:18 +00001646 if( memcmp(&pWal->hdr, &h1, sizeof(WalIndexHdr)) ){
dana8614692010-05-06 14:42:34 +00001647 *pChanged = 1;
drhf0b20f82010-05-21 13:16:18 +00001648 memcpy(&pWal->hdr, &h1, sizeof(WalIndexHdr));
drh7e263722010-05-20 21:21:09 +00001649 pWal->szPage = pWal->hdr.szPage;
danb9bf16b2010-04-14 11:23:30 +00001650 }
dan84670502010-05-07 05:46:23 +00001651
1652 /* The header was successfully read. Return zero. */
1653 return 0;
danb9bf16b2010-04-14 11:23:30 +00001654}
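
/*
** Illustrative sketch only, not part of SQLite. It shows the two-copy
** read described above in isolation: read copy [0], issue a memory
** barrier, read copy [1], and treat any difference as a dirty read.
** The checksum step is omitted. The C11 atomic_thread_fence() call is
** used as a stand-in for sqlite3OsShmBarrier(); SketchHdr and
** sketchTryHdr are names invented for this example.
*/

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical two-field header; the real header has more fields */
typedef struct SketchHdr {
  unsigned int isInit;         /* Non-zero once the header is valid */
  unsigned int mxFrame;        /* Last valid frame */
} SketchHdr;

/* Return 0 and fill *pOut if both copies agree and are initialized.
** Return 1 if the read might be inconsistent. */
static int sketchTryHdr(const SketchHdr *aShared, SketchHdr *pOut){
  SketchHdr h1, h2;
  assert( aShared!=0 && pOut!=0 );
  memcpy(&h1, &aShared[0], sizeof(h1));
  atomic_thread_fence(memory_order_seq_cst);  /* Keep the reads ordered */
  memcpy(&h2, &aShared[1], sizeof(h2));
  if( memcmp(&h1, &h2, sizeof(h1))!=0 ){
    return 1;                  /* Dirty read */
  }
  if( h1.isInit==0 ){
    return 1;                  /* Not initialized, probably all zeros */
  }
  *pOut = h1;
  return 0;
}
```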
1655
1656/*
drha2a42012010-05-18 18:01:08 +00001657** Read the wal-index header from the wal-index and into pWal->hdr.
1658** If the wal-header appears to be corrupt, try to recover the log
1659** before returning.
1660**
1661** Set *pChanged to 1 if the wal-index header value in pWal->hdr is
1662** changed by this operation. If pWal->hdr is unchanged, set *pChanged
1663** to 0.
1664**
1665** This routine also maps the wal-index content into memory and assigns
1666** ownership of that mapping to the current thread. In some implementations,
1667** only one thread at a time can hold a mapping of the wal-index. Hence,
1668** the caller should strive to invoke walIndexUnmap() as soon as possible
1669** after this routine returns.
danb9bf16b2010-04-14 11:23:30 +00001670**
drh7ed91f22010-04-29 22:34:07 +00001671** If the wal-index header is successfully read, return SQLITE_OK.
danb9bf16b2010-04-14 11:23:30 +00001672** Otherwise an SQLite error code.
1673*/
drh7ed91f22010-04-29 22:34:07 +00001674static int walIndexReadHdr(Wal *pWal, int *pChanged){
dan84670502010-05-07 05:46:23 +00001675 int rc; /* Return code */
drh73b64e42010-05-30 19:55:15 +00001676 int badHdr; /* True if a header read failed */
dan4280eb32010-06-12 12:02:35 +00001677 volatile u32 *page0;
danb9bf16b2010-04-14 11:23:30 +00001678
dan4280eb32010-06-12 12:02:35 +00001679 /* Ensure that page 0 of the wal-index (the page that contains the
1680 ** wal-index header) is mapped. Return early if an error occurs here.
1681 */
dana8614692010-05-06 14:42:34 +00001682 assert( pChanged );
dan4280eb32010-06-12 12:02:35 +00001683 rc = walIndexPage(pWal, 0, &page0);
danc7991bd2010-05-05 19:04:59 +00001684 if( rc!=SQLITE_OK ){
1685 return rc;
dan4280eb32010-06-12 12:02:35 +00001686 }
1687 assert( page0 || pWal->writeLock==0 );
drh7ed91f22010-04-29 22:34:07 +00001688
dan4280eb32010-06-12 12:02:35 +00001689 /* If the first page of the wal-index has been mapped, try to read the
1690 ** wal-index header immediately, without holding any lock. This usually
1691 ** works, but may fail if the wal-index header is corrupt or currently
1692 ** being modified by another user.
danb9bf16b2010-04-14 11:23:30 +00001693 */
dan4280eb32010-06-12 12:02:35 +00001694 badHdr = (page0 ? walIndexTryHdr(pWal, pChanged) : 1);
dan10f5a502010-06-23 15:55:43 +00001695 if( badHdr==0 && pWal->hdr.iVersion!=WALINDEX_MAX_VERSION ){
1696 rc = SQLITE_CANTOPEN_BKPT;
1697 }
drhbab7b912010-05-26 17:31:58 +00001698
drh73b64e42010-05-30 19:55:15 +00001699 /* If the first attempt failed, it might have been due to a race
1700 ** with a writer. So get a WRITE lock and try again.
1701 */
dand54ff602010-05-31 11:16:30 +00001702 assert( badHdr==0 || pWal->writeLock==0 );
dan4280eb32010-06-12 12:02:35 +00001703 if( badHdr && SQLITE_OK==(rc = walLockExclusive(pWal, WAL_WRITE_LOCK, 1)) ){
1704 pWal->writeLock = 1;
1705 if( SQLITE_OK==(rc = walIndexPage(pWal, 0, &page0)) ){
drh73b64e42010-05-30 19:55:15 +00001706 badHdr = walIndexTryHdr(pWal, pChanged);
1707 if( badHdr ){
1708 /* If the wal-index header is still malformed even while holding
1709 ** a WRITE lock, it can only mean that the header is corrupted and
1710 ** needs to be reconstructed. So run recovery to do exactly that.
1711 */
drhbab7b912010-05-26 17:31:58 +00001712 rc = walIndexRecover(pWal);
dan3dee6da2010-05-31 16:17:54 +00001713 *pChanged = 1;
drhbab7b912010-05-26 17:31:58 +00001714 }
drhbab7b912010-05-26 17:31:58 +00001715 }
dan4280eb32010-06-12 12:02:35 +00001716 pWal->writeLock = 0;
1717 walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1);
danb9bf16b2010-04-14 11:23:30 +00001718 }
1719
danb9bf16b2010-04-14 11:23:30 +00001720 return rc;
1721}
1722
1723/*
drh73b64e42010-05-30 19:55:15 +00001724** This is the value that walTryBeginRead returns when it needs to
1725** be retried.
dan7c246102010-04-12 19:00:29 +00001726*/
drh73b64e42010-05-30 19:55:15 +00001727#define WAL_RETRY (-1)
dan64d039e2010-04-13 19:27:31 +00001728
drh73b64e42010-05-30 19:55:15 +00001729/*
1730** Attempt to start a read transaction. This might fail due to a race or
1731** other transient condition. When that happens, it returns WAL_RETRY to
1732** indicate to the caller that it is safe to retry immediately.
1733**
1734** On success return SQLITE_OK. On a permanent failure (such as an
1735** I/O error or an SQLITE_BUSY because another process is running
1736** recovery) return a positive error code.
1737**
1738** On success, this routine obtains a read lock on
1739** WAL_READ_LOCK(pWal->readLock). The pWal->readLock integer is
1740** in the range 0 <= pWal->readLock < WAL_NREADER. If pWal->readLock==(-1)
1741** that means the Wal does not hold any read lock. The reader must not
1742** access any database page that is modified by a WAL frame up to and
1743** including frame number aReadMark[pWal->readLock]. The reader will
1744** use WAL frames up to and including pWal->hdr.mxFrame if pWal->readLock>0.
1745** Or if pWal->readLock==0, then the reader will ignore the WAL
1746** completely and get all content directly from the database file.
1747** When the read transaction is completed, the caller must release the
1748** lock on WAL_READ_LOCK(pWal->readLock) and set pWal->readLock to -1.
1749**
1750** This routine uses the nBackfill and aReadMark[] fields of the header
1751** to select a particular WAL_READ_LOCK() that strives to let the
1752** checkpoint process do as much work as possible. This routine might
1753** update values of the aReadMark[] array in the header, but if it does
1754** so it takes care to hold an exclusive lock on the corresponding
1755** WAL_READ_LOCK() while changing values.
1756*/
drhaab4c022010-06-02 14:45:51 +00001757static int walTryBeginRead(Wal *pWal, int *pChanged, int useWal, int cnt){
drh73b64e42010-05-30 19:55:15 +00001758 volatile WalCkptInfo *pInfo; /* Checkpoint information in wal-index */
1759 u32 mxReadMark; /* Largest aReadMark[] value */
1760 int mxI; /* Index of largest aReadMark[] value */
1761 int i; /* Loop counter */
dan13a3cb82010-06-11 19:04:21 +00001762 int rc = SQLITE_OK; /* Return code */
dan64d039e2010-04-13 19:27:31 +00001763
drh61e4ace2010-05-31 20:28:37 +00001764 assert( pWal->readLock<0 ); /* Not currently locked */
drh73b64e42010-05-30 19:55:15 +00001765
drhaab4c022010-06-02 14:45:51 +00001766 /* Take steps to avoid spinning forever if there is a protocol error. */
1767 if( cnt>5 ){
1768 if( cnt>100 ) return SQLITE_PROTOCOL;
1769 sqlite3OsSleep(pWal->pVfs, 1);
1770 }
1771
drh73b64e42010-05-30 19:55:15 +00001772 if( !useWal ){
drh7ed91f22010-04-29 22:34:07 +00001773 rc = walIndexReadHdr(pWal, pChanged);
drh73b64e42010-05-30 19:55:15 +00001774 if( rc==SQLITE_BUSY ){
1775 /* If there is not a recovery running in another thread or process
1776 ** then convert BUSY errors to WAL_RETRY. If recovery is known to
1777 ** be running, convert BUSY to BUSY_RECOVERY. There is a race here
1778 ** which might cause WAL_RETRY to be returned even if BUSY_RECOVERY
1779 ** would be technically correct. But the race is benign since with
1780 ** WAL_RETRY this routine will be called again and will probably be
1781 ** right on the second iteration.
1782 */
1783 rc = walLockShared(pWal, WAL_RECOVER_LOCK);
1784 if( rc==SQLITE_OK ){
1785 walUnlockShared(pWal, WAL_RECOVER_LOCK);
1786 rc = WAL_RETRY;
1787 }else if( rc==SQLITE_BUSY ){
1788 rc = SQLITE_BUSY_RECOVERY;
1789 }
1790 }
drh73b64e42010-05-30 19:55:15 +00001791 }
1792 if( rc!=SQLITE_OK ){
1793 return rc;
1794 }
1795
dan13a3cb82010-06-11 19:04:21 +00001796 pInfo = walCkptInfo(pWal);
drh73b64e42010-05-30 19:55:15 +00001797 if( !useWal && pInfo->nBackfill==pWal->hdr.mxFrame ){
1798 /* The WAL has been completely backfilled (or it is empty)
1799 ** and can be safely ignored.
1800 */
1801 rc = walLockShared(pWal, WAL_READ_LOCK(0));
daneb8cb3a2010-06-05 18:34:26 +00001802 sqlite3OsShmBarrier(pWal->pDbFd);
drh73b64e42010-05-30 19:55:15 +00001803 if( rc==SQLITE_OK ){
dan4280eb32010-06-12 12:02:35 +00001804 if( memcmp((void *)walIndexHdr(pWal), &pWal->hdr, sizeof(WalIndexHdr)) ){
dan493cc592010-06-05 18:12:23 +00001805 /* It is not safe to allow the reader to continue here if frames
1806 ** may have been appended to the log before READ_LOCK(0) was obtained.
1807 ** When holding READ_LOCK(0), the reader ignores the entire log file,
1808 ** which implies that the database file contains a trustworthy
1809 ** snapshot. Since holding READ_LOCK(0) prevents a checkpoint from
1810 ** happening, this is usually correct.
1811 **
1812 ** However, if frames have been appended to the log (or if the log
1813 ** is wrapped and written for that matter) before the READ_LOCK(0)
1814 ** is obtained, that is not necessarily true. A checkpointer may
1815 ** have started to backfill the appended frames but crashed before
1816 ** it finished, leaving a corrupt image in the database file.
1817 */
drh73b64e42010-05-30 19:55:15 +00001818 walUnlockShared(pWal, WAL_READ_LOCK(0));
1819 return WAL_RETRY;
1820 }
1821 pWal->readLock = 0;
1822 return SQLITE_OK;
1823 }else if( rc!=SQLITE_BUSY ){
1824 return rc;
dan64d039e2010-04-13 19:27:31 +00001825 }
dan7c246102010-04-12 19:00:29 +00001826 }
danba515902010-04-30 09:32:06 +00001827
drh73b64e42010-05-30 19:55:15 +00001828 /* If we get this far, it means that the reader will want to use
1829 ** the WAL to get at content from recent commits. The job now is
1830 ** to select one of the aReadMark[] entries that is closest to
1831 ** but not exceeding pWal->hdr.mxFrame and lock that entry.
1832 */
1833 mxReadMark = 0;
1834 mxI = 0;
1835 for(i=1; i<WAL_NREADER; i++){
1836 u32 thisMark = pInfo->aReadMark[i];
drhdb7f6472010-06-09 14:45:12 +00001837 if( mxReadMark<=thisMark && thisMark<=pWal->hdr.mxFrame ){
1838 assert( thisMark!=READMARK_NOT_USED );
drh73b64e42010-05-30 19:55:15 +00001839 mxReadMark = thisMark;
1840 mxI = i;
1841 }
1842 }
  if( mxI==0 ){
    /* If we get here, it means that none of the aReadMark[] entries between
    ** 1 and WAL_NREADER-1 is less than or equal to pWal->hdr.mxFrame. Try
    ** to initialize aReadMark[1] to be mxFrame, then retry.
    */
    rc = walLockExclusive(pWal, WAL_READ_LOCK(1), 1);
    if( rc==SQLITE_OK ){
      pInfo->aReadMark[1] = pWal->hdr.mxFrame;
      walUnlockExclusive(pWal, WAL_READ_LOCK(1), 1);
      rc = WAL_RETRY;
    }else if( rc==SQLITE_BUSY ){
      rc = WAL_RETRY;
    }
    return rc;
  }else{
    if( mxReadMark < pWal->hdr.mxFrame ){
      for(i=1; i<WAL_NREADER; i++){
        rc = walLockExclusive(pWal, WAL_READ_LOCK(i), 1);
        if( rc==SQLITE_OK ){
          mxReadMark = pInfo->aReadMark[i] = pWal->hdr.mxFrame;
          mxI = i;
          walUnlockExclusive(pWal, WAL_READ_LOCK(i), 1);
          break;
        }else if( rc!=SQLITE_BUSY ){
          return rc;
        }
      }
    }

    rc = walLockShared(pWal, WAL_READ_LOCK(mxI));
    if( rc ){
      return rc==SQLITE_BUSY ? WAL_RETRY : rc;
    }
    /* Now that the read-lock has been obtained, check that neither the
    ** value in the aReadMark[] array nor the contents of the wal-index
    ** header have changed.
    **
    ** It is necessary to check that the wal-index header did not change
    ** between the time it was read and when the shared-lock on
    ** WAL_READ_LOCK(mxI) was obtained to account for the possibility
    ** that the log file may have been wrapped by a writer, or that frames
    ** that occur later in the log than pWal->hdr.mxFrame may have been
    ** copied into the database by a checkpointer. If either of these things
    ** happened, then reading the database with the current value of
    ** pWal->hdr.mxFrame risks reading a corrupted snapshot. So, retry
    ** instead.
    **
    ** This does not guarantee that the copy of the wal-index header is up to
    ** date before proceeding. That would not be possible without somehow
    ** blocking writers. It only guarantees that a dangerous checkpoint or
    ** log-wrap (either of which would require an exclusive lock on
    ** WAL_READ_LOCK(mxI)) has not occurred since the snapshot was valid.
    */
    sqlite3OsShmBarrier(pWal->pDbFd);
    if( pInfo->aReadMark[mxI]!=mxReadMark
     || memcmp((void *)walIndexHdr(pWal), &pWal->hdr, sizeof(WalIndexHdr))
    ){
      walUnlockShared(pWal, WAL_READ_LOCK(mxI));
      return WAL_RETRY;
    }else{
      assert( mxReadMark<=pWal->hdr.mxFrame );
      pWal->readLock = mxI;
    }
  }
  return rc;
}
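
/*
** Illustration only, not part of the WAL implementation. The read-mark
** selection loop used by the routine above can be exercised in isolation.
** This is a minimal sketch under stated assumptions: DEMO_NREADER and
** demoPickReadMark() are hypothetical names, a plain array stands in for
** pInfo->aReadMark[], and no locking is performed.
*/

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NREADER 5          /* Hypothetical stand-in for WAL_NREADER */

/*
** Return the index (1..DEMO_NREADER-1) of the read-mark that is closest
** to, but does not exceed, mxFrame. Return 0 if no entry qualifies, which
** corresponds to the "mxI==0" case that triggers a retry.
*/
static int demoPickReadMark(const uint32_t *aReadMark, uint32_t mxFrame){
  uint32_t mxReadMark = 0;      /* Largest qualifying mark seen so far */
  int mxI = 0;                  /* Index of that mark, or 0 if none */
  int i;
  assert( aReadMark!=0 );
  for(i=1; i<DEMO_NREADER; i++){
    uint32_t thisMark = aReadMark[i];
    if( mxReadMark<=thisMark && thisMark<=mxFrame ){
      mxReadMark = thisMark;
      mxI = i;
    }
  }
  return mxI;
}
```

/*
** An unused slot is modelled here by a value larger than any frame
** number, so it never satisfies thisMark<=mxFrame.
*/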

/*
** Begin a read transaction on the database.
**
** This routine used to be called sqlite3OpenSnapshot() and with good reason:
** it takes a snapshot of the state of the WAL and wal-index for the current
** instant in time. The current thread will continue to use this snapshot.
** Other threads might append new content to the WAL and wal-index but
** that extra content is ignored by the current thread.
**
** If the database contents have changed since the previous read
** transaction, then *pChanged is set to 1 before returning. The
** Pager layer will use this to know that its cache is stale and
** needs to be flushed.
*/
int sqlite3WalBeginReadTransaction(Wal *pWal, int *pChanged){
  int rc;                         /* Return code */
  int cnt = 0;                    /* Number of TryBeginRead attempts */

  do{
    rc = walTryBeginRead(pWal, pChanged, 0, ++cnt);
  }while( rc==WAL_RETRY );
  return rc;
}

/*
** Finish with a read transaction. All this does is release the
** read-lock.
*/
void sqlite3WalEndReadTransaction(Wal *pWal){
  if( pWal->readLock>=0 ){
    walUnlockShared(pWal, WAL_READ_LOCK(pWal->readLock));
    pWal->readLock = -1;
  }
}

/*
** Read a page from the WAL, if it is present in the WAL and if the
** current read transaction is configured to use the WAL.
**
** The *pInWal is set to 1 if the requested page is in the WAL and
** has been loaded. Or *pInWal is set to 0 if the page was not in
** the WAL and needs to be read out of the database.
*/
int sqlite3WalRead(
  Wal *pWal,                      /* WAL handle */
  Pgno pgno,                      /* Database page number to read data for */
  int *pInWal,                    /* OUT: True if data is read from WAL */
  int nOut,                       /* Size of buffer pOut in bytes */
  u8 *pOut                        /* Buffer to write page data to */
){
  u32 iRead = 0;                  /* If !=0, WAL frame to return data from */
  u32 iLast = pWal->hdr.mxFrame;  /* Last page in WAL for this reader */
  int iHash;                      /* Used to loop through N hash tables */

  /* This routine is only called from within a read transaction. */
  assert( pWal->readLock>=0 || pWal->lockError );

  /* If the "last page" field of the wal-index header snapshot is 0, then
  ** no data will be read from the wal under any circumstances. Return early
  ** in this case to avoid the walIndexMap/Unmap overhead. Likewise, if
  ** pWal->readLock==0, then the WAL is ignored by the reader so
  ** return early, as if the WAL were empty.
  */
  if( iLast==0 || pWal->readLock==0 ){
    *pInWal = 0;
    return SQLITE_OK;
  }

  /* Search the hash table or tables for an entry matching page number
  ** pgno. Each iteration of the following for() loop searches one
  ** hash table (each hash table indexes up to HASHTABLE_NPAGE frames).
  **
  ** This code may run concurrently with the code in walIndexAppend()
  ** that adds entries to the wal-index (and possibly to this hash
  ** table). This means the value just read from the hash
  ** slot (aHash[iKey]) may have been added before or after the
  ** current read transaction was opened. Values added after the
  ** read transaction was opened may have been written incorrectly -
  ** i.e. these slots may contain garbage data. However, we assume
  ** that any slots written before the current read transaction was
  ** opened remain unmodified.
  **
  ** For the reasons above, the if(...) condition featured in the inner
  ** loop of the following block is more stringent than would be required
  ** if we had exclusive access to the hash-table:
  **
  **   (aPgno[aHash[iKey]]==pgno):
  **     This condition filters out normal hash-table collisions.
  **
  **   (iFrame<=iLast):
  **     This condition filters out entries that were added to the hash
  **     table after the current read-transaction had started.
  */
  for(iHash=walFramePage(iLast); iHash>=0 && iRead==0; iHash--){
    volatile ht_slot *aHash;      /* Pointer to hash table */
    volatile u32 *aPgno;          /* Pointer to array of page numbers */
    u32 iZero;                    /* Frame number corresponding to aPgno[0] */
    int iKey;                     /* Hash slot index */
    int rc;

    rc = walHashGet(pWal, iHash, &aHash, &aPgno, &iZero);
    if( rc!=SQLITE_OK ){
      return rc;
    }
    for(iKey=walHash(pgno); aHash[iKey]; iKey=walNextHash(iKey)){
      u32 iFrame = aHash[iKey] + iZero;
      if( iFrame<=iLast && aPgno[aHash[iKey]]==pgno ){
        assert( iFrame>iRead );
        iRead = iFrame;
      }
    }
  }

#ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT
  /* If expensive assert() statements are available, do a linear search
  ** of the wal-index file content. Make sure the results agree with the
  ** result obtained using the hash indexes above. */
  {
    u32 iRead2 = 0;
    u32 iTest;
    for(iTest=iLast; iTest>0; iTest--){
      if( walFramePgno(pWal, iTest)==pgno ){
        iRead2 = iTest;
        break;
      }
    }
    assert( iRead==iRead2 );
  }
#endif

  /* If iRead is non-zero, then it is the log frame number that contains the
  ** required page. Read and return data from the log file.
  */
  if( iRead ){
    i64 iOffset = walFrameOffset(iRead, pWal->hdr.szPage) + WAL_FRAME_HDRSIZE;
    *pInWal = 1;
    return sqlite3OsRead(pWal->pWalFd, pOut, nOut, iOffset);
  }

  *pInWal = 0;
  return SQLITE_OK;
}
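
/*
** Illustration only, not part of the WAL implementation. The inner lookup
** loop of the routine above can be sketched as a small open-addressed
** hash table with linear probing. Everything named Demo or demo below is
** hypothetical: the slot count, the multiplier 383 in demoHash() and the
** single-table layout are assumptions made for this sketch, not the
** wal-index format. Slot value 0 means "empty"; any other value is a
** 1-based index into aPgno[].
*/

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSLOT 16             /* Hypothetical slot count (power of two) */

typedef struct DemoHash {
  uint16_t aHash[DEMO_NSLOT];     /* 0 if empty, else index into aPgno[] */
  uint32_t aPgno[DEMO_NSLOT/2+1]; /* aPgno[k] is the page in frame iZero+k */
  uint32_t iZero;                 /* Frame number corresponding to aPgno[0] */
  uint32_t nEntry;                /* Number of frames indexed so far */
} DemoHash;

static uint32_t demoHash(uint32_t pgno){
  return (pgno*383u) & (DEMO_NSLOT-1);
}
static uint32_t demoNextHash(uint32_t iKey){
  return (iKey+1) & (DEMO_NSLOT-1);
}

/* Index one more frame. The table is kept at most half full so that a
** probe sequence always reaches an empty slot. */
static void demoAppend(DemoHash *p, uint32_t pgno){
  uint32_t idx = ++p->nEntry;
  uint32_t iKey;
  assert( idx<=DEMO_NSLOT/2 );
  p->aPgno[idx] = pgno;
  for(iKey=demoHash(pgno); p->aHash[iKey]; iKey=demoNextHash(iKey)){}
  p->aHash[iKey] = (uint16_t)idx;
}

/* Return the newest frame <= iLast that holds page pgno, or 0 if none. */
static uint32_t demoFind(const DemoHash *p, uint32_t pgno, uint32_t iLast){
  uint32_t iRead = 0;
  uint32_t iKey;
  for(iKey=demoHash(pgno); p->aHash[iKey]; iKey=demoNextHash(iKey)){
    uint32_t iFrame = p->aHash[iKey] + p->iZero;
    if( iFrame<=iLast && p->aPgno[p->aHash[iKey]]==pgno ){
      iRead = iFrame;             /* Later probes hold later frames */
    }
  }
  return iRead;
}
```

/*
** The iFrame<=iLast test is what lets a reader with an older snapshot
** skip entries appended after its read transaction began.
*/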


/*
** Set *pPgno to the size of the database file (or zero, if unknown).
*/
void sqlite3WalDbsize(Wal *pWal, Pgno *pPgno){
  assert( pWal->readLock>=0 || pWal->lockError );
  *pPgno = pWal->hdr.nPage;
}


/*
** This function starts a write transaction on the WAL.
**
** A read transaction must have already been started by a prior call
** to sqlite3WalBeginReadTransaction().
**
** If another thread or process has written into the database since
** the read transaction was started, then it is not possible for this
** thread to write as doing so would cause a fork. So this routine
** returns SQLITE_BUSY in that case and no write transaction is started.
**
** There can only be a single writer active at a time.
*/
int sqlite3WalBeginWriteTransaction(Wal *pWal){
  int rc;

  /* Cannot start a write transaction without first holding a read
  ** transaction. */
  assert( pWal->readLock>=0 );

  /* Only one writer allowed at a time. Get the write lock. Return
  ** SQLITE_BUSY if unable.
  */
  rc = walLockExclusive(pWal, WAL_WRITE_LOCK, 1);
  if( rc ){
    return rc;
  }
  pWal->writeLock = 1;

  /* If another connection has written to the database file since the
  ** time the read transaction on this connection was started, then
  ** the write is disallowed.
  */
  if( memcmp(&pWal->hdr, (void *)walIndexHdr(pWal), sizeof(WalIndexHdr))!=0 ){
    walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1);
    pWal->writeLock = 0;
    rc = SQLITE_BUSY;
  }

  return rc;
}

/*
** End a write transaction. The commit has already been done. This
** routine merely releases the lock.
*/
int sqlite3WalEndWriteTransaction(Wal *pWal){
  walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1);
  pWal->writeLock = 0;
  return SQLITE_OK;
}

/*
** If any data has been written (but not committed) to the log file, this
** function moves the write-pointer back to the start of the transaction.
**
** Additionally, the callback function is invoked for each frame written
** to the WAL since the start of the transaction. If the callback returns
** other than SQLITE_OK, it is not invoked again and the error code is
** returned to the caller.
**
** Otherwise, if the callback function does not return an error, this
** function returns SQLITE_OK.
*/
int sqlite3WalUndo(Wal *pWal, int (*xUndo)(void *, Pgno), void *pUndoCtx){
  int rc = SQLITE_OK;
  if( pWal->writeLock ){
    Pgno iMax = pWal->hdr.mxFrame;
    Pgno iFrame;

    /* Restore the client's cache of the wal-index header to the state it
    ** was in before the client began writing to the database.
    */
    memcpy(&pWal->hdr, (void *)walIndexHdr(pWal), sizeof(WalIndexHdr));

    for(iFrame=pWal->hdr.mxFrame+1;
        ALWAYS(rc==SQLITE_OK) && iFrame<=iMax;
        iFrame++
    ){
      /* This call cannot fail. Unless the page whose page number is
      ** passed as the second argument is both (a) in the cache and
      ** (b) has an outstanding reference, xUndo is either a no-op
      ** (if (a) is false) or simply expels the page from the cache (if (b)
      ** is false).
      **
      ** If the upper layer is doing a rollback, it is guaranteed that there
      ** are no outstanding references to any page other than page 1. And
      ** page 1 is never written to the log until the transaction is
      ** committed. As a result, the call to xUndo cannot fail.
      */
      assert( walFramePgno(pWal, iFrame)!=1 );
      rc = xUndo(pUndoCtx, walFramePgno(pWal, iFrame));
    }
    walCleanupHash(pWal);
  }
  assert( rc==SQLITE_OK );
  return rc;
}

/*
** Argument aWalData must point to an array of WAL_SAVEPOINT_NDATA u32
** values. This function populates the array with values required to
** "rollback" the write position of the WAL handle back to the current
** point in the event of a savepoint rollback (via WalSavepointUndo()).
*/
void sqlite3WalSavepoint(Wal *pWal, u32 *aWalData){
  assert( pWal->writeLock );
  aWalData[0] = pWal->hdr.mxFrame;
  aWalData[1] = pWal->hdr.aFrameCksum[0];
  aWalData[2] = pWal->hdr.aFrameCksum[1];
  aWalData[3] = pWal->nCkpt;
}

/*
** Move the write position of the WAL back to the point identified by
** the values in the aWalData[] array. aWalData must point to an array
** of WAL_SAVEPOINT_NDATA u32 values that has been previously populated
** by a call to WalSavepoint().
*/
int sqlite3WalSavepointUndo(Wal *pWal, u32 *aWalData){
  int rc = SQLITE_OK;

  assert( pWal->writeLock );
  assert( aWalData[3]!=pWal->nCkpt || aWalData[0]<=pWal->hdr.mxFrame );

  if( aWalData[3]!=pWal->nCkpt ){
    /* This savepoint was opened immediately after the write-transaction
    ** was started. Right after that, the writer decided to wrap around
    ** to the start of the log. Update the savepoint values to match.
    */
    aWalData[0] = 0;
    aWalData[3] = pWal->nCkpt;
  }

  if( aWalData[0]<pWal->hdr.mxFrame ){
    pWal->hdr.mxFrame = aWalData[0];
    pWal->hdr.aFrameCksum[0] = aWalData[1];
    pWal->hdr.aFrameCksum[1] = aWalData[2];
    walCleanupHash(pWal);
  }

  return rc;
}
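
/*
** Illustration only, not part of the WAL implementation. The savepoint
** bookkeeping in the two routines above can be sketched with a stripped
** down handle. DemoWal and the demo functions are hypothetical; the
** sketch keeps only the four saved values and omits the hash-table
** cleanup and the write-lock assertion.
*/

```c
#include <assert.h>
#include <stdint.h>

typedef struct DemoWal {
  uint32_t mxFrame;               /* Index of last frame written */
  uint32_t aFrameCksum[2];        /* Running checksum of the log */
  uint32_t nCkpt;                 /* Incremented each time the log wraps */
} DemoWal;

/* Record the current write position in aData[0..3]. */
static void demoSavepoint(const DemoWal *p, uint32_t *aData){
  aData[0] = p->mxFrame;
  aData[1] = p->aFrameCksum[0];
  aData[2] = p->aFrameCksum[1];
  aData[3] = p->nCkpt;
}

/* Move the write position back to the point recorded in aData[]. */
static void demoSavepointUndo(DemoWal *p, uint32_t *aData){
  assert( aData[3]!=p->nCkpt || aData[0]<=p->mxFrame );
  if( aData[3]!=p->nCkpt ){
    /* The log wrapped after the savepoint was taken, so the saved
    ** frame index now refers to the start of the log. */
    aData[0] = 0;
    aData[3] = p->nCkpt;
  }
  if( aData[0]<p->mxFrame ){
    p->mxFrame = aData[0];
    p->aFrameCksum[0] = aData[1];
    p->aFrameCksum[1] = aData[2];
  }
}
```

/*
** Comparing the saved nCkpt with the current one is how a savepoint
** taken before a log wrap is recognized.
*/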

/*
** This function is called just before writing a set of frames to the log
** file (see sqlite3WalFrames()). It checks to see if, instead of appending
** to the current log file, it is possible to overwrite the start of the
** existing log file with the new frames (i.e. "reset" the log). If so,
** it sets pWal->hdr.mxFrame to 0. Otherwise, pWal->hdr.mxFrame is left
** unchanged.
**
** SQLITE_OK is returned if no error is encountered (regardless of whether
** or not pWal->hdr.mxFrame is modified). An SQLite error code is returned
** if an error occurs.
*/
static int walRestartLog(Wal *pWal){
  int rc = SQLITE_OK;
  int cnt;

  if( pWal->readLock==0 ){
    volatile WalCkptInfo *pInfo = walCkptInfo(pWal);
    assert( pInfo->nBackfill==pWal->hdr.mxFrame );
    if( pInfo->nBackfill>0 ){
      rc = walLockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1);
      if( rc==SQLITE_OK ){
        /* If all readers are using WAL_READ_LOCK(0) (in other words if no
        ** readers are currently using the WAL), then this transaction's
        ** frames will overwrite the start of the existing log. Update the
        ** wal-index header to reflect this.
        **
        ** In theory it would be Ok to update the cache of the header only
        ** at this point. But updating the actual wal-index header is also
        ** safe and means there is no special case for sqlite3WalUndo()
        ** to handle if this transaction is rolled back.
        */
        int i;                    /* Loop counter */
        u32 *aSalt = pWal->hdr.aSalt;       /* Big-endian salt values */
        pWal->nCkpt++;
        pWal->hdr.mxFrame = 0;
        sqlite3Put4byte((u8*)&aSalt[0], 1 + sqlite3Get4byte((u8*)&aSalt[0]));
        sqlite3_randomness(4, &aSalt[1]);
        walIndexWriteHdr(pWal);
        pInfo->nBackfill = 0;
        for(i=1; i<WAL_NREADER; i++) pInfo->aReadMark[i] = READMARK_NOT_USED;
        assert( pInfo->aReadMark[0]==0 );
        walUnlockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1);
      }
    }
    walUnlockShared(pWal, WAL_READ_LOCK(0));
    pWal->readLock = -1;
    cnt = 0;
    do{
      int notUsed;
      rc = walTryBeginRead(pWal, &notUsed, 1, ++cnt);
    }while( rc==WAL_RETRY );
  }
  return rc;
}
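
/*
** Illustration only, not part of the WAL implementation. When the log is
** reset, the routine above increments the first salt value, which is
** stored big-endian regardless of host byte order. The helpers below are
** hypothetical stand-ins written for this sketch; they are not the
** sqlite3Get4byte()/sqlite3Put4byte() routines themselves.
*/

```c
#include <assert.h>
#include <stdint.h>

/* Decode a 4-byte big-endian integer. */
static uint32_t demoGet4byte(const uint8_t *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  |  (uint32_t)p[3];
}

/* Encode v as a 4-byte big-endian integer. */
static void demoPut4byte(uint8_t *p, uint32_t v){
  p[0] = (uint8_t)(v>>24);
  p[1] = (uint8_t)(v>>16);
  p[2] = (uint8_t)(v>>8);
  p[3] = (uint8_t)v;
}

/* Increment a big-endian salt value in place. Unsigned arithmetic means
** the value wraps to zero after 0xffffffff. */
static void demoBumpSalt(uint8_t *aSalt){
  assert( aSalt!=0 );
  demoPut4byte(aSalt, 1 + demoGet4byte(aSalt));
}
```

/*
** Decoding and re-encoding byte by byte keeps the on-disk value the same
** on big-endian and little-endian hosts.
*/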

/*
** Write a set of frames to the log. The caller must hold the write-lock
** on the log file (obtained using sqlite3WalBeginWriteTransaction()).
*/
int sqlite3WalFrames(
  Wal *pWal,                      /* Wal handle to write to */
  int szPage,                     /* Database page-size in bytes */
  PgHdr *pList,                   /* List of dirty pages to write */
  Pgno nTruncate,                 /* Database size after this commit */
  int isCommit,                   /* True if this is a commit */
  int sync_flags                  /* Flags to pass to OsSync() (or 0) */
){
  int rc;                         /* Used to catch return codes */
  u32 iFrame;                     /* Next frame address */
  u8 aFrame[WAL_FRAME_HDRSIZE];   /* Buffer to assemble frame-header in */
  PgHdr *p;                       /* Iterator to run through pList with. */
  PgHdr *pLast = 0;               /* Last frame in list */
  int nLast = 0;                  /* Number of extra copies of last page */

  assert( pList );
  assert( pWal->writeLock );

#if defined(SQLITE_TEST) && defined(SQLITE_DEBUG)
  { int cnt; for(cnt=0, p=pList; p; p=p->pDirty, cnt++){}
    WALTRACE(("WAL%p: frame write begin. %d frames. mxFrame=%d. %s\n",
              pWal, cnt, pWal->hdr.mxFrame, isCommit ? "Commit" : "Spill"));
  }
#endif

  /* See if it is possible to write these frames into the start of the
  ** log file, instead of appending to it at pWal->hdr.mxFrame.
  */
  if( SQLITE_OK!=(rc = walRestartLog(pWal)) ){
    return rc;
  }

  /* If this is the first frame written into the log, write the WAL
  ** header to the start of the WAL file. See comments at the top of
  ** this source file for a description of the WAL header format.
  */
  iFrame = pWal->hdr.mxFrame;
  if( iFrame==0 ){
    u8 aWalHdr[WAL_HDRSIZE];      /* Buffer to assemble wal-header in */
    u32 aCksum[2];                /* Checksum for wal-header */

    sqlite3Put4byte(&aWalHdr[0], (WAL_MAGIC | SQLITE_BIGENDIAN));
    sqlite3Put4byte(&aWalHdr[4], WAL_MAX_VERSION);
    sqlite3Put4byte(&aWalHdr[8], szPage);
    sqlite3Put4byte(&aWalHdr[12], pWal->nCkpt);
    memcpy(&aWalHdr[16], pWal->hdr.aSalt, 8);
    walChecksumBytes(1, aWalHdr, WAL_HDRSIZE-2*4, 0, aCksum);
    sqlite3Put4byte(&aWalHdr[24], aCksum[0]);
    sqlite3Put4byte(&aWalHdr[28], aCksum[1]);

    pWal->szPage = szPage;
    pWal->hdr.bigEndCksum = SQLITE_BIGENDIAN;
    pWal->hdr.aFrameCksum[0] = aCksum[0];
    pWal->hdr.aFrameCksum[1] = aCksum[1];

    rc = sqlite3OsWrite(pWal->pWalFd, aWalHdr, sizeof(aWalHdr), 0);
    WALTRACE(("WAL%p: wal-header write %s\n", pWal, rc ? "failed" : "ok"));
    if( rc!=SQLITE_OK ){
      return rc;
    }
  }
  assert( pWal->szPage==szPage );

  /* Write the log file. */
  for(p=pList; p; p=p->pDirty){
    u32 nDbsize;                  /* Db-size field for frame header */
    i64 iOffset;                  /* Write offset in log file */
    void *pData;

    iOffset = walFrameOffset(++iFrame, szPage);

    /* Populate and write the frame header */
    nDbsize = (isCommit && p->pDirty==0) ? nTruncate : 0;
#if defined(SQLITE_HAS_CODEC)
    if( (pData = sqlite3PagerCodec(p))==0 ) return SQLITE_NOMEM;
#else
    pData = p->pData;
#endif
    walEncodeFrame(pWal, p->pgno, nDbsize, pData, aFrame);
    rc = sqlite3OsWrite(pWal->pWalFd, aFrame, sizeof(aFrame), iOffset);
    if( rc!=SQLITE_OK ){
      return rc;
    }

    /* Write the page data */
    rc = sqlite3OsWrite(pWal->pWalFd, pData, szPage, iOffset+sizeof(aFrame));
    if( rc!=SQLITE_OK ){
      return rc;
    }
    pLast = p;
  }

  /* Sync the log file if sync_flags is non-zero. Before syncing, pad the
  ** log out to a sector boundary with extra copies of the last frame.
  */
  if( sync_flags ){
    i64 iSegment = sqlite3OsSectorSize(pWal->pWalFd);
    i64 iOffset = walFrameOffset(iFrame+1, szPage);

    assert( isCommit );
    assert( iSegment>0 );

    iSegment = (((iOffset+iSegment-1)/iSegment) * iSegment);
    while( iOffset<iSegment ){
      void *pData;
#if defined(SQLITE_HAS_CODEC)
      if( (pData = sqlite3PagerCodec(pLast))==0 ) return SQLITE_NOMEM;
#else
      pData = pLast->pData;
#endif
      walEncodeFrame(pWal, pLast->pgno, nTruncate, pData, aFrame);
      rc = sqlite3OsWrite(pWal->pWalFd, aFrame, sizeof(aFrame), iOffset);
      if( rc!=SQLITE_OK ){
        return rc;
      }
      iOffset += WAL_FRAME_HDRSIZE;
      rc = sqlite3OsWrite(pWal->pWalFd, pData, szPage, iOffset);
      if( rc!=SQLITE_OK ){
        return rc;
      }
      nLast++;
      iOffset += szPage;
    }

    rc = sqlite3OsSync(pWal->pWalFd, sync_flags);
  }

  /* Append data to the wal-index. It is not necessary to lock the
  ** wal-index to do this as the SQLITE_SHM_WRITE lock held on the wal-index
  ** guarantees that there are no other writers, and no data that may
  ** be in use by existing readers is being overwritten.
  */
  iFrame = pWal->hdr.mxFrame;
  for(p=pList; p && rc==SQLITE_OK; p=p->pDirty){
    iFrame++;
    rc = walIndexAppend(pWal, iFrame, p->pgno);
  }
  while( nLast>0 && rc==SQLITE_OK ){
    iFrame++;
    nLast--;
    rc = walIndexAppend(pWal, iFrame, pLast->pgno);
  }

  if( rc==SQLITE_OK ){
    /* Update the private copy of the header. */
    pWal->hdr.szPage = szPage;
    pWal->hdr.mxFrame = iFrame;
    if( isCommit ){
      pWal->hdr.iChange++;
      pWal->hdr.nPage = nTruncate;
    }
    /* If this is a commit, update the wal-index header too. */
    if( isCommit ){
      walIndexWriteHdr(pWal);
      pWal->iCallback = iFrame;
    }
  }

  WALTRACE(("WAL%p: frame write %s\n", pWal, rc ? "failed" : "ok"));
  return rc;
}
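
/*
** Illustration only, not part of the WAL implementation. The sector
** padding arithmetic used before the sync in the routine above can be
** checked on its own. The sketch assumes a 32-byte WAL header and a
** 24-byte frame header, and assumes that frame offsets are computed as
** header size plus (iFrame-1) whole frames. These DEMO_ constants and
** the demo functions are hypothetical and are not taken from the
** definitions of WAL_HDRSIZE, WAL_FRAME_HDRSIZE or walFrameOffset().
*/

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_WAL_HDRSIZE   32     /* Assumed size of the WAL header */
#define DEMO_FRAME_HDRSIZE 24     /* Assumed size of each frame header */

/* Byte offset of frame iFrame (1-based) under the assumed layout. */
static int64_t demoFrameOffset(uint32_t iFrame, int szPage){
  assert( iFrame>=1 );
  return DEMO_WAL_HDRSIZE
       + (int64_t)(iFrame-1)*(szPage+DEMO_FRAME_HDRSIZE);
}

/*
** Return the number of extra copies of the last frame needed so that the
** log ends at or beyond the next sector boundary. iFrame is the index of
** the last frame already written.
*/
static int demoPaddingFrames(uint32_t iFrame, int szPage, int64_t szSector){
  int64_t iOffset = demoFrameOffset(iFrame+1, szPage);
  int64_t iSegment;
  int nLast = 0;
  assert( szSector>0 );
  iSegment = ((iOffset+szSector-1)/szSector) * szSector;
  while( iOffset<iSegment ){
    iOffset += DEMO_FRAME_HDRSIZE + szPage;
    nLast++;
  }
  return nLast;
}
```

/*
** With one 1024-byte page already written, the log ends at byte 1080
** under these assumptions, so a 512-byte sector needs one padding frame
** and a 4096-byte sector needs three.
*/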

/*
** This routine is called to implement sqlite3_wal_checkpoint() and
** related interfaces.
**
** Obtain a CHECKPOINT lock and then backfill as much information as
** we can from WAL into the database.
*/
int sqlite3WalCheckpoint(
  Wal *pWal,                      /* Wal connection */
  int sync_flags,                 /* Flags to sync db file with (or 0) */
  int nBuf,                       /* Size of temporary buffer */
  u8 *zBuf                        /* Temporary buffer to use */
){
  int rc;                         /* Return code */
  int isChanged = 0;              /* True if a new wal-index header is loaded */

  assert( pWal->ckptLock==0 );

  WALTRACE(("WAL%p: checkpoint begins\n", pWal));
  rc = walLockExclusive(pWal, WAL_CKPT_LOCK, 1);
  if( rc ){
    /* Usually this is SQLITE_BUSY meaning that another thread or process
    ** is already running a checkpoint, or maybe a recovery. But it might
    ** also be SQLITE_IOERR. */
    return rc;
  }
  pWal->ckptLock = 1;

  /* Copy data from the log to the database file. */
  rc = walIndexReadHdr(pWal, &isChanged);
  if( rc==SQLITE_OK ){
    rc = walCheckpoint(pWal, sync_flags, nBuf, zBuf);
  }
  if( isChanged ){
    /* If a new wal-index header was loaded before the checkpoint was
    ** performed, then the pager-cache associated with pWal is now
    ** out of date. So zero the cached wal-index header to ensure that
    ** next time the pager opens a snapshot on this database it knows that
    ** the cache needs to be reset.
    */
    memset(&pWal->hdr, 0, sizeof(WalIndexHdr));
  }

  /* Release the locks. */
  walUnlockExclusive(pWal, WAL_CKPT_LOCK, 1);
  pWal->ckptLock = 0;
  WALTRACE(("WAL%p: checkpoint %s\n", pWal, rc ? "failed" : "ok"));
  return rc;
}
2477
/* Return the value to pass to a sqlite3_wal_hook callback: the
** number of frames in the WAL at the point of the most recent commit
** made since sqlite3WalCallback() was last called.  If no commits have
** occurred since the last call, then return 0.
*/
int sqlite3WalCallback(Wal *pWal){
  u32 ret = 0;
  if( pWal ){
    ret = pWal->iCallback;
    pWal->iCallback = 0;
  }
  return (int)ret;
}

/*
** This function is called to change the WAL subsystem into or out
** of locking_mode=EXCLUSIVE.
**
** If op is zero, then attempt to change from locking_mode=EXCLUSIVE
** into locking_mode=NORMAL.  This means that we must acquire a lock
** on the pWal->readLock byte.  If the WAL is already in locking_mode=NORMAL
** or if the acquisition of the lock fails, then return 0.  If the
** transition out of exclusive-mode is successful, return 1.  This
** operation must occur while the pager is still holding the exclusive
** lock on the main database file.
**
** If op is one, then change from locking_mode=NORMAL into
** locking_mode=EXCLUSIVE.  This means that the pWal->readLock must
** be released.  Return 1 if the transition is made and 0 if the
** WAL is already in exclusive-locking mode - meaning that this
** routine is a no-op.  The pager must already hold the exclusive lock
** on the main database file before invoking this operation.
**
** If op is negative, then do a dry-run of the op==1 case but do
** not actually change anything.  The pager uses this to see if it
** should acquire the database exclusive lock prior to invoking
** the op==1 case.
*/
int sqlite3WalExclusiveMode(Wal *pWal, int op){
  int rc;
  assert( pWal->writeLock==0 );

  /* pWal->readLock is usually set, but might be -1 if there was a
  ** prior error while attempting to acquire a read-lock. This cannot
  ** happen if the connection is actually in exclusive mode (as no xShmLock
  ** locks are taken in this case). Nor should the pager attempt to
  ** upgrade to exclusive-mode following such an error.
  */
  assert( pWal->readLock>=0 || pWal->lockError );
  assert( pWal->readLock>=0 || (op<=0 && pWal->exclusiveMode==0) );

  if( op==0 ){
    if( pWal->exclusiveMode ){
      pWal->exclusiveMode = 0;
      if( walLockShared(pWal, WAL_READ_LOCK(pWal->readLock))!=SQLITE_OK ){
        pWal->exclusiveMode = 1;
      }
      rc = pWal->exclusiveMode==0;
    }else{
      /* Already in locking_mode=NORMAL */
      rc = 0;
    }
  }else if( op>0 ){
    assert( pWal->exclusiveMode==0 );
    assert( pWal->readLock>=0 );
    walUnlockShared(pWal, WAL_READ_LOCK(pWal->readLock));
    pWal->exclusiveMode = 1;
    rc = 1;
  }else{
    rc = pWal->exclusiveMode==0;
  }
  return rc;
}

#endif /* #ifndef SQLITE_OMIT_WAL */
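
/*
** ILLUSTRATION ONLY - not part of SQLite.
**
** The op protocol documented for sqlite3WalExclusiveMode() above can be
** sketched as a small stand-alone model.  Everything named here is
** hypothetical: ToyWal, its canLockShared field, and toyWalExclusiveMode()
** do not exist in SQLite.  The shared read-lock is simulated by a flag
** rather than by a real xShmLock call, and the model follows the return
** values described in the header comment (op==1 returns 0 when already
** exclusive), whereas the real routine asserts that this case never
** happens.
*/
```c
#include <assert.h>

/* Hypothetical stand-in for the Wal object.  Only the state needed to
** model the mode switch is kept; this is not the real struct Wal. */
typedef struct ToyWal ToyWal;
struct ToyWal {
  int exclusiveMode;    /* Non-zero if in locking_mode=EXCLUSIVE */
  int canLockShared;    /* Simulation: would the read-lock be granted? */
};

/* Model of the documented op protocol:
**
**   op==0   EXCLUSIVE -> NORMAL.  Return 1 on success, 0 if already
**           NORMAL or if the (simulated) read-lock cannot be obtained.
**   op>0    NORMAL -> EXCLUSIVE.  Return 1 if the transition is made,
**           0 if already EXCLUSIVE.
**   op<0    Dry run of op>0.  Change nothing.
*/
int toyWalExclusiveMode(ToyWal *p, int op){
  int rc;
  if( op==0 ){
    if( p->exclusiveMode && p->canLockShared ){
      p->exclusiveMode = 0;          /* Read-lock obtained: leave exclusive */
      rc = 1;
    }else{
      rc = 0;                        /* Already NORMAL, or lock failed */
    }
  }else if( op>0 ){
    rc = p->exclusiveMode==0;        /* 1 only if a transition happens */
    p->exclusiveMode = 1;
  }else{
    rc = p->exclusiveMode==0;        /* Dry run: report, do not modify */
  }
  return rc;
}
```
/*
** A caller in the role of the pager would first probe with op==-1 and,
** only if that returns 1, take the database exclusive lock and then
** invoke op==1.
*/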