// Copyright 2020 The Wuffs Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//    https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// ----------------

/*
jsonptr is a JSON formatter (pretty-printer) that supports the JSON Pointer
(RFC 6901) query syntax. It reads UTF-8 JSON from stdin and writes
canonicalized, formatted UTF-8 JSON to stdout.

See the "const char* usage" string below for details.

----

JSON Pointer (and this program's implementation) is one of many JSON query
languages and JSON tools, such as jq, jql and JMESPath. This one is relatively
simple and fewer-featured compared to those others.

One benefit of simplicity is that this program's JSON and JSON Pointer
implementations do not dynamically allocate or free memory (yet it does not
require that the entire input fits in memory at once). They are therefore
trivially protected against certain bug classes: memory leaks, double-frees and
use-after-frees.

The core JSON implementation is also written in the Wuffs programming language
(and then transpiled to C/C++), which is memory-safe (e.g. array indexing is
bounds-checked) but also guards against integer arithmetic overflows.

For defense in depth, on Linux, this program also self-imposes a
SECCOMP_MODE_STRICT sandbox before reading (or otherwise processing) its input
or writing its output. Under this sandbox, the only permitted system calls are
read, write, exit and sigreturn.

All together, this program aims to safely handle untrusted JSON files without
fear of security bugs such as remote code execution.

----

As of 2020-02-24, this program passes all 318 "test_parsing" cases from the
JSON test suite (https://github.com/nst/JSONTestSuite), an appendix to the
"Parsing JSON is a Minefield" article (http://seriot.ch/parsing_json.php) that
was first published on 2016-10-26 and updated on 2018-03-30.

After modifying this program, run "build-example.sh example/jsonptr/" and then
"script/run-json-test-suite.sh" to catch correctness regressions.

----

This program uses Wuffs' JSON decoder at a relatively low level, processing the
decoder's token-stream output individually. The core loop, in pseudo-code, is
"for_each_token { handle_token(etc); }", where the handle_token function
changes global state (e.g. the `depth` and `context` variables) and prints
output text based on that state and the token's source text. Notably,
handle_token is not recursive, even though JSON values can nest.

This approach is centered around JSON tokens. Each JSON 'thing' (e.g. number,
string, object) comprises one or more JSON tokens.

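As a rough illustration of that non-recursive, token-centered loop, here is a
minimal sketch. It is not this program's code: the Kind enum, the Token and
State structs and the indent_tokens function are hypothetical names, real Wuffs
tokens carry more detail than this, and the sketch omits commas and colons.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical token kinds. Real Wuffs tokens are richer than this.
enum class Kind { open, close, scalar };

struct Token {
  Kind kind;
  std::string text;
};

// State that persists across calls, standing in for jsonptr's globals.
struct State {
  uint32_t depth = 0;
  std::string out;
};

// handle_token processes exactly one token. It never calls itself: nesting
// is tracked only by the depth counter.
void handle_token(State& s, const Token& t) {
  if ((t.kind == Kind::close) && (s.depth > 0)) {
    s.depth--;
  }
  s.out.append(2 * s.depth, ' ');
  s.out += t.text;
  s.out += '\n';
  if (t.kind == Kind::open) {
    s.depth++;
  }
}

// indent_tokens is the "for_each_token { handle_token(etc); }" loop.
std::string indent_tokens(const std::vector<Token>& tokens) {
  State s;
  for (const Token& t : tokens) {
    handle_token(s, t);
  }
  return s.out;
}
```

Given the six tokens of [1,[2]] (minus the comma), the sketch emits one token
per line and indents the inner 2 by four spaces, using only the depth counter
to remember how deeply nested it is.
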
An alternative, higher-level approach is in the sibling example/jsonfindptrs
program. Neither approach is better or worse per se, but when studying this
program, be aware that there are multiple ways to use Wuffs' JSON decoder.

The two programs, jsonfindptrs and jsonptr, also demonstrate different
trade-offs with regard to JSON object duplicate keys. The JSON spec permits
different implementations to allow or reject duplicate keys. It is not always
clear which approach is safer. Rejecting them is certainly unambiguous, and
security bugs can lurk in ambiguous corners of a file format, if two different
implementations both silently accept a file but differ on how to interpret it.
On the other hand, in the worst case, detecting duplicate keys requires O(N)
memory, where N is the size of the (potentially untrusted) input.

This program (jsonptr) allows duplicate keys and requires only O(1) memory. As
mentioned above, it doesn't dynamically allocate memory at all, and on Linux,
it runs in a SECCOMP_MODE_STRICT sandbox.

----

This example program differs from most other example Wuffs programs in that it
is written in C++, not C. To run it:

$CXX jsonptr.cc && ./a.out < ../../test/data/github-tags.json; rm -f a.out

for a C++ compiler $CXX, such as clang++ or g++.
*/

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Wuffs ships as a "single file C library" or "header file library" as per
// https://github.com/nothings/stb/blob/master/docs/stb_howto.txt
//
// To use that single file as a "foo.c"-like implementation, instead of a
// "foo.h"-like header, #define WUFFS_IMPLEMENTATION before #include'ing or
// compiling it.
#define WUFFS_IMPLEMENTATION

// Defining the WUFFS_CONFIG__MODULE* macros is optional, but it lets users of
// release/c/etc.c whitelist which parts of Wuffs to build. That file contains
// the entire Wuffs standard library, implementing a variety of codecs and file
// formats. Without this macro definition, an optimizing compiler or linker may
// very well discard Wuffs code for unused codecs, but listing the Wuffs
// modules we use makes that process explicit. Preprocessing means that such
// code simply isn't compiled.
#define WUFFS_CONFIG__MODULES
#define WUFFS_CONFIG__MODULE__BASE
#define WUFFS_CONFIG__MODULE__JSON

// If building this program in an environment that doesn't easily accommodate
// relative includes, you can use the script/inline-c-relative-includes.go
// program to generate a stand-alone C++ file.
#include "../../release/c/wuffs-unsupported-snapshot.c"

#if defined(__linux__)
#include <linux/prctl.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#define WUFFS_EXAMPLE_USE_SECCOMP
#endif

#define TRY(error_msg)         \
  do {                         \
    const char* z = error_msg; \
    if (z) {                   \
      return z;                \
    }                          \
  } while (false)

static const char* eod = "main: end of data";

static const char* usage =
    "Usage: jsonptr -flags input.json\n"
    "\n"
    "Flags:\n"
    "    -c      -compact-output\n"
    "    -i=NUM  -indent=NUM\n"
    "    -o=NUM  -max-output-depth=NUM\n"
    "    -q=STR  -query=STR\n"
    "    -s      -strict-json-pointer-syntax\n"
    "    -t      -tabs\n"
    "            -fail-if-unsandboxed\n"
    "\n"
    "The input.json filename is optional. If absent, it reads from stdin.\n"
    "\n"
    "----\n"
    "\n"
    "jsonptr is a JSON formatter (pretty-printer) that supports the JSON\n"
    "Pointer (RFC 6901) query syntax. It reads UTF-8 JSON from stdin and\n"
    "writes canonicalized, formatted UTF-8 JSON to stdout.\n"
    "\n"
    "Canonicalized means that e.g. \"abc\\u000A\\tx\\u0177z\" is re-written\n"
    "as \"abc\\n\\txŷz\". It does not sort object keys, nor does it reject\n"
    "duplicate keys. Canonicalization does not imply Unicode normalization.\n"
    "\n"
    "Formatted means that arrays' and objects' elements are indented, each\n"
    "on its own line. Configure this with the -c / -compact-output, -i=NUM /\n"
    "-indent=NUM (for NUM ranging from 0 to 8) and -t / -tabs flags.\n"
    "\n"
    "----\n"
    "\n"
    "The -q=STR or -query=STR flag gives an optional JSON Pointer query, to\n"
    "print a subset of the input. For example, given RFC 6901 section 5's\n"
    "sample input (https://tools.ietf.org/rfc/rfc6901.txt), this command:\n"
    "    jsonptr -query=/foo/1 rfc-6901-json-pointer.json\n"
    "will print:\n"
    "    \"baz\"\n"
    "\n"
    "An absent query is equivalent to the empty query, which identifies the\n"
    "entire input (the root value). Unlike a file system, the \"/\" query\n"
    "does not identify the root. Instead, \"\" is the root and \"/\" is the\n"
    "child (the value in a key-value pair) of the root whose key is the empty\n"
    "string. Similarly, \"/xyz\" and \"/xyz/\" are two different nodes.\n"
    "\n"
    "If the query found a valid JSON value, this program will return a zero\n"
    "exit code even if the rest of the input isn't valid JSON. If the query\n"
    "did not find a value, or found an invalid one, this program returns a\n"
    "non-zero exit code, but may still print partial output to stdout.\n"
    "\n"
    "The JSON specification (https://json.org/) permits implementations that\n"
    "allow duplicate keys, as this one does. This JSON Pointer implementation\n"
    "is also greedy, following the first match for each fragment without\n"
    "back-tracking. For example, the \"/foo/bar\" query will fail if the root\n"
    "object has multiple \"foo\" children but the first one doesn't have a\n"
    "\"bar\" child, even if later ones do.\n"
    "\n"
    "The -s or -strict-json-pointer-syntax flag restricts the -query=STR\n"
    "string to exactly RFC 6901, with only two escape sequences: \"~0\" and\n"
    "\"~1\" for \"~\" and \"/\". Without this flag, this program also lets\n"
    "\"~n\" and \"~r\" escape the New Line and Carriage Return ASCII control\n"
    "characters, which can work better with line oriented Unix tools that\n"
    "assume exactly one value (i.e. one JSON Pointer string) per line.\n"
    "\n"
    "----\n"
    "\n"
    "The -o=NUM or -max-output-depth=NUM flag gives the maximum (inclusive)\n"
    "output depth. JSON containers ([] arrays and {} objects) can hold other\n"
    "containers. When this flag is set, containers at depth NUM are replaced\n"
    "with \"[…]\" or \"{…}\". A bare -o or -max-output-depth is equivalent to\n"
    "-o=1. The flag's absence is equivalent to an unlimited output depth.\n"
    "\n"
    "The -max-output-depth flag only affects the program's output. It doesn't\n"
    "affect whether or not the input is considered valid JSON. The JSON\n"
    "specification permits implementations to set their own maximum input\n"
    "depth. This JSON implementation sets it to 1024.\n"
    "\n"
    "Depth is measured in terms of nested containers. It is unaffected by the\n"
    "number of spaces or tabs used to indent.\n"
    "\n"
    "When both -max-output-depth and -query are set, the output depth is\n"
    "measured from when the query resolves, not from the input root. The\n"
    "input depth (measured from the root) is still limited to 1024.\n"
    "\n"
    "----\n"
    "\n"
    "The -fail-if-unsandboxed flag causes the program to exit if it does not\n"
    "self-impose a sandbox. On Linux, it self-imposes a SECCOMP_MODE_STRICT\n"
    "sandbox, regardless of whether this flag was set.";

// ----

// Wuffs allows either statically or dynamically allocated work buffers. This
// program exercises static allocation.
#define WORK_BUFFER_ARRAY_SIZE \
  WUFFS_JSON__DECODER_WORKBUF_LEN_MAX_INCL_WORST_CASE
#if WORK_BUFFER_ARRAY_SIZE > 0
uint8_t work_buffer_array[WORK_BUFFER_ARRAY_SIZE];
#else
// Not all C/C++ compilers support 0-length arrays.
uint8_t work_buffer_array[1];
#endif

bool sandboxed = false;

int input_file_descriptor = 0;  // A 0 default means stdin.

#define MAX_INDENT 8
#define INDENT_SPACES_STRING "        "
#define INDENT_TAB_STRING "\t"

#ifndef DST_BUFFER_ARRAY_SIZE
#define DST_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef SRC_BUFFER_ARRAY_SIZE
#define SRC_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef TOKEN_BUFFER_ARRAY_SIZE
#define TOKEN_BUFFER_ARRAY_SIZE (4 * 1024)
#endif

uint8_t dst_array[DST_BUFFER_ARRAY_SIZE];
uint8_t src_array[SRC_BUFFER_ARRAY_SIZE];
wuffs_base__token tok_array[TOKEN_BUFFER_ARRAY_SIZE];

wuffs_base__io_buffer dst;
wuffs_base__io_buffer src;
wuffs_base__token_buffer tok;

// curr_token_end_src_index is the src.data.ptr index of the end of the current
// token. An invariant is that (curr_token_end_src_index <= src.meta.ri).
size_t curr_token_end_src_index;

uint32_t depth;

enum class context {
  none,
  in_list_after_bracket,
  in_list_after_value,
  in_dict_after_brace,
  in_dict_after_key,
  in_dict_after_value,
} ctx;

bool  //
in_dict_before_key() {
  return (ctx == context::in_dict_after_brace) ||
         (ctx == context::in_dict_after_value);
}

uint32_t suppress_write_dst;
bool wrote_to_dst;

wuffs_json__decoder dec;

// ----

// Query is a JSON Pointer query. After initializing with a NUL-terminated C
// string, its multiple fragments are consumed as the program walks the JSON
// data from stdin. For example, letting "$" denote a NUL, suppose that we
// started with a query string of "/apple/banana/12/durian" and are currently
// trying to match the second fragment, "banana", so that Query::m_depth is 2:
//
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//   / a p p l e / b a n a n a / 1 2 / d u r i a n $
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                 ^           ^
//             m_frag_i    m_frag_k
//
// The two pointers m_frag_i and m_frag_k (abbreviated as mfi and mfk) are the
// start (inclusive) and end (exclusive) of the query fragment. They satisfy
// (mfi <= mfk) and may be equal if the fragment is empty (note that "" is a
// valid JSON object key).
//
// The m_frag_j (mfj) pointer moves between these two, or is nullptr. An
// invariant is that (((mfi <= mfj) && (mfj <= mfk)) || (mfj == nullptr)).
//
// Wuffs' JSON tokenizer can portray a single JSON string as multiple Wuffs
// tokens, as backslash-escaped values within that JSON string may each get
// their own token.
//
// At the start of each object key (a JSON string), mfj is set to mfi.
//
// While mfj remains non-nullptr, each token's unescaped contents are then
// compared to that part of the fragment from mfj to mfk. If it is a prefix
// (including the case of an exact match), then mfj is advanced by the
// unescaped length. Otherwise, mfj is set to nullptr.
//
// Comparison accounts for JSON Pointer's escaping notation: "~0" and "~1" in
// the query (not the JSON value) are unescaped to "~" and "/" respectively.
// "~n" and "~r" are also unescaped to "\n" and "\r". The program is
// responsible for calling Query::validate (with a strict_json_pointer_syntax
// argument) before otherwise using this class.
//
// The mfj pointer therefore advances from mfi to mfk, or drops out, as we
// incrementally match the object key with the query fragment. For example, if
// we have already matched the "ban" of "banana", then we would accept any of
// an "ana" token, an "a" token or a "\u0061" token, amongst others. They would
// advance mfj by 3, 1 or 1 bytes respectively.
//
//                      mfj
//                       v
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//   / a p p l e / b a n a n a / 1 2 / d u r i a n $
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                 ^           ^
//                mfi         mfk
//
// At the end of each object key (or equivalently, at the start of each object
// value), if mfj is non-nullptr and equal to (but not less than) mfk then we
// have a fragment match: the query fragment equals the object key. If there is
// a next fragment (in this example, "12") we move the frag_etc pointers to its
// start and end and increment Query::m_depth. Otherwise, we have matched the
// complete query, and the upcoming JSON value is the result of that query.
//
// The discussion above centers on object keys. If the query fragment is
// numeric then it can also match as an array index: the string fragment "12"
// will match an array's 13th element (starting counting from zero). See RFC
// 6901 for its precise definition of an "array index" number.
//
// Array index fragment match is represented by the Query::m_array_index field,
// whose type (wuffs_base__result_u64) is a result type. An error result means
// that the fragment is not an array index. A value result holds the number of
// list elements remaining. When matching a query fragment in an array (instead
// of in an object), each element ticks this number down towards zero. At zero,
// the upcoming JSON value is the one that matches the query fragment.
class Query {
 private:
  uint8_t* m_frag_i;
  uint8_t* m_frag_j;
  uint8_t* m_frag_k;

  uint32_t m_depth;

  wuffs_base__result_u64 m_array_index;

 public:
  void reset(char* query_c_string) {
    m_frag_i = (uint8_t*)query_c_string;
    m_frag_j = (uint8_t*)query_c_string;
    m_frag_k = (uint8_t*)query_c_string;
    m_depth = 0;
    m_array_index.status.repr = "#main: not an array index query fragment";
    m_array_index.value = 0;
  }

  void restart_fragment(bool enable) { m_frag_j = enable ? m_frag_i : nullptr; }

  bool is_at(uint32_t depth) { return m_depth == depth; }

  // tick returns whether the fragment is a valid array index whose value is
  // zero. If valid but non-zero, it decrements it and returns false.
  bool tick() {
    if (m_array_index.status.is_ok()) {
      if (m_array_index.value == 0) {
        return true;
      }
      m_array_index.value--;
    }
    return false;
  }

  // next_fragment moves to the next fragment, returning whether it existed.
  bool next_fragment() {
    uint8_t* k = m_frag_k;
    uint32_t d = m_depth;

    this->reset(nullptr);

    if (!k || (*k != '/')) {
      return false;
    }
    k++;

    bool all_digits = true;
    uint8_t* i = k;
    while ((*k != '\x00') && (*k != '/')) {
      all_digits = all_digits && ('0' <= *k) && (*k <= '9');
      k++;
    }
    m_frag_i = i;
    m_frag_j = i;
    m_frag_k = k;
    m_depth = d + 1;
    if (all_digits) {
      // wuffs_base__parse_number_u64 rejects leading zeroes, e.g. "00", "07".
      m_array_index =
          wuffs_base__parse_number_u64(wuffs_base__make_slice_u8(i, k - i));
    }
    return true;
  }

  bool matched_all() { return m_frag_k == nullptr; }

  bool matched_fragment() { return m_frag_j && (m_frag_j == m_frag_k); }

  void incremental_match_slice(uint8_t* ptr, size_t len) {
    if (!m_frag_j) {
      return;
    }
    uint8_t* j = m_frag_j;
    while (true) {
      if (len == 0) {
        m_frag_j = j;
        return;
      }

      if (*j == '\x00') {
        break;

      } else if (*j == '~') {
        j++;
        if (*j == '0') {
          if (*ptr != '~') {
            break;
          }
        } else if (*j == '1') {
          if (*ptr != '/') {
            break;
          }
        } else if (*j == 'n') {
          if (*ptr != '\n') {
            break;
          }
        } else if (*j == 'r') {
          if (*ptr != '\r') {
            break;
          }
        } else {
          break;
        }

      } else if (*j != *ptr) {
        break;
      }

      j++;
      ptr++;
      len--;
    }
    m_frag_j = nullptr;
  }

  void incremental_match_code_point(uint32_t code_point) {
    if (!m_frag_j) {
      return;
    }
    uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
    size_t n = wuffs_base__utf_8__encode(
        wuffs_base__make_slice_u8(&u[0],
                                  WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
        code_point);
    if (n > 0) {
      this->incremental_match_slice(&u[0], n);
    }
  }

  // validate returns whether the (query_c_string, length) arguments form a
  // valid JSON Pointer. In particular, it must be valid UTF-8, and either be
  // empty or start with a '/'. Any '~' within must immediately be followed by
  // either '0' or '1'. If strict_json_pointer_syntax is false, a '~' may also
  // be followed by either 'n' or 'r'.
  static bool validate(char* query_c_string,
                       size_t length,
                       bool strict_json_pointer_syntax) {
    if (length <= 0) {
      return true;
    }
    if (query_c_string[0] != '/') {
      return false;
    }
    wuffs_base__slice_u8 s =
        wuffs_base__make_slice_u8((uint8_t*)query_c_string, length);
    bool previous_was_tilde = false;
    while (s.len > 0) {
      wuffs_base__utf_8__next__output o = wuffs_base__utf_8__next(s);
      if (!o.is_valid()) {
        return false;
      }

      if (previous_was_tilde) {
        switch (o.code_point) {
          case '0':
          case '1':
            break;
          case 'n':
          case 'r':
            if (strict_json_pointer_syntax) {
              return false;
            }
            break;
          default:
            return false;
        }
      }
      previous_was_tilde = o.code_point == '~';

      s.ptr += o.byte_length;
      s.len -= o.byte_length;
    }
    return !previous_was_tilde;
  }
} query;

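/*
The incremental matching that the Query comments describe can be sketched in
isolation. This is an illustrative simplification, not the class's actual
method: match_prefix is a hypothetical free function, it works on plain char
pointers instead of Query's fields, it handles only the two RFC 6901 escapes
("~0" and "~1", not "~n" or "~r"), and it stops explicitly at the fragment's
terminating '/' or NUL.

```cpp
#include <stddef.h>

// match_prefix compares one token (tok, len) against the query fragment that
// starts at frag. On success it returns the advanced fragment pointer (the
// new "mfj"). On a mismatch it returns nullptr.
//
// This is a hypothetical helper, assuming an already validated query string.
const char* match_prefix(const char* frag, const char* tok, size_t len) {
  while (len > 0) {
    char want = *frag;
    if ((want == '\0') || (want == '/')) {
      return nullptr;  // The token is longer than what's left of the fragment.
    }
    if (want == '~') {
      frag++;
      if (*frag == '0') {
        want = '~';
      } else if (*frag == '1') {
        want = '/';
      } else {
        return nullptr;
      }
    }
    if (want != *tok) {
      return nullptr;
    }
    frag++;
    tok++;
    len--;
  }
  return frag;
}
```

Calling it once per token mirrors how mfj advances: for the fragment "b~1c",
a "b/" token advances past the three query bytes "b~1", and a following "c"
token then reaches the end of the fragment, which is a fragment match.
*/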
// ----

struct {
  int remaining_argc;
  char** remaining_argv;

  bool compact_output;
  bool fail_if_unsandboxed;
  size_t indent;
  uint32_t max_output_depth;
  char* query_c_string;
  bool strict_json_pointer_syntax;
  bool tabs;
} flags = {0};

const char*  //
parse_flags(int argc, char** argv) {
  flags.indent = 4;
  flags.max_output_depth = 0xFFFFFFFF;

  int c = (argc > 0) ? 1 : 0;  // Skip argv[0], the program name.
  for (; c < argc; c++) {
    char* arg = argv[c];
    if (*arg++ != '-') {
      break;
    }

    // A double-dash "--foo" is equivalent to a single-dash "-foo". As special
    // cases, a bare "-" is not a flag (some programs may interpret it as
    // stdin) and a bare "--" means to stop parsing flags.
    if (*arg == '\x00') {
      break;
    } else if (*arg == '-') {
      arg++;
      if (*arg == '\x00') {
        c++;
        break;
      }
    }

    if (!strcmp(arg, "c") || !strcmp(arg, "compact-output")) {
      flags.compact_output = true;
      continue;
    }
    if (!strcmp(arg, "fail-if-unsandboxed")) {
      flags.fail_if_unsandboxed = true;
      continue;
    }
    if (!strncmp(arg, "i=", 2) || !strncmp(arg, "indent=", 7)) {
      while (*arg++ != '=') {
      }
      if (('0' <= arg[0]) && (arg[0] <= '8') && (arg[1] == '\x00')) {
        flags.indent = arg[0] - '0';
        continue;
      }
      return usage;
    }
    if (!strcmp(arg, "o") || !strcmp(arg, "max-output-depth")) {
      flags.max_output_depth = 1;
      continue;
    } else if (!strncmp(arg, "o=", 2) ||
               // "max-output-depth=" is 17 bytes long, including the '='. A
               // shorter length would accept args without an '=', and the
               // loop below would then scan past the end of the string.
               !strncmp(arg, "max-output-depth=", 17)) {
      while (*arg++ != '=') {
      }
      wuffs_base__result_u64 u = wuffs_base__parse_number_u64(
          wuffs_base__make_slice_u8((uint8_t*)arg, strlen(arg)));
      if (wuffs_base__status__is_ok(&u.status) && (u.value <= 0xFFFFFFFF)) {
        flags.max_output_depth = (uint32_t)(u.value);
        continue;
      }
      return usage;
    }
    if (!strncmp(arg, "q=", 2) || !strncmp(arg, "query=", 6)) {
      while (*arg++ != '=') {
      }
      flags.query_c_string = arg;
      continue;
    }
    if (!strcmp(arg, "s") || !strcmp(arg, "strict-json-pointer-syntax")) {
      flags.strict_json_pointer_syntax = true;
      continue;
    }
    if (!strcmp(arg, "t") || !strcmp(arg, "tabs")) {
      flags.tabs = true;
      continue;
    }

    return usage;
  }

  if (flags.query_c_string &&
      !Query::validate(flags.query_c_string, strlen(flags.query_c_string),
                       flags.strict_json_pointer_syntax)) {
    return "main: bad JSON Pointer (RFC 6901) syntax for the -query=STR flag";
  }

  flags.remaining_argc = argc - c;
  flags.remaining_argv = argv + c;
  return nullptr;
}

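/*
parse_flags recognizes "name=value" flags by passing strncmp a hand-counted
prefix length. One possible alternative, sketched here, derives the length
from the flag name itself. The flag_value function is a hypothetical helper
and is not part of this program.

```cpp
#include <stddef.h>
#include <string.h>

// flag_value returns a pointer to the value text if arg has the form
// "name=value". Otherwise, it returns nullptr.
//
// This is a hypothetical helper, shown only to illustrate the idea.
const char* flag_value(const char* arg, const char* name) {
  size_t n = strlen(name);
  if ((strncmp(arg, name, n) == 0) && (arg[n] == '=')) {
    return arg + n + 1;
  }
  return nullptr;
}
```

With such a helper, flag_value(arg, "indent") would yield "4" for the arg
"indent=4" and nullptr for "indentation=4", without any numeric prefix
lengths that have to be kept in sync with the string literals.
*/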
Nigel Tao2cf76db2020-02-27 22:42:01 +1100648const char* //
649initialize_globals(int argc, char** argv) {
Nigel Tao2cf76db2020-02-27 22:42:01 +1100650 dst = wuffs_base__make_io_buffer(
Nigel Taofdac24a2020-03-06 21:53:08 +1100651 wuffs_base__make_slice_u8(dst_array, DST_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100652 wuffs_base__empty_io_buffer_meta());
Nigel Tao1b073492020-02-16 22:11:36 +1100653
Nigel Tao2cf76db2020-02-27 22:42:01 +1100654 src = wuffs_base__make_io_buffer(
Nigel Taofdac24a2020-03-06 21:53:08 +1100655 wuffs_base__make_slice_u8(src_array, SRC_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100656 wuffs_base__empty_io_buffer_meta());
657
658 tok = wuffs_base__make_token_buffer(
Nigel Taofdac24a2020-03-06 21:53:08 +1100659 wuffs_base__make_slice_token(tok_array, TOKEN_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100660 wuffs_base__empty_token_buffer_meta());
661
662 curr_token_end_src_index = 0;
663
Nigel Tao2cf76db2020-02-27 22:42:01 +1100664 depth = 0;
665
666 ctx = context::none;
667
Nigel Tao68920952020-03-03 11:25:18 +1100668 TRY(parse_flags(argc, argv));
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100669 if (flags.fail_if_unsandboxed && !sandboxed) {
670 return "main: unsandboxed";
671 }
Nigel Tao01abc842020-03-06 21:42:33 +1100672 const int stdin_fd = 0;
673 if (flags.remaining_argc > ((input_file_descriptor != stdin_fd) ? 1 : 0)) {
Nigel Tao0cd2f982020-03-03 23:03:02 +1100674 return usage;
  }

  query.reset(flags.query_c_string);

  // If the query is non-empty, suppress writing to stdout until we've
  // completed the query.
  suppress_write_dst = query.next_fragment() ? 1 : 0;
  wrote_to_dst = false;

  TRY(dec.initialize(sizeof__wuffs_json__decoder(), WUFFS_VERSION, 0)
          .message());

  // Consume an optional whitespace trailer. This isn't part of the JSON spec,
  // but it works better with line oriented Unix tools (such as "echo 123 |
  // jsonptr" where it's "echo", not "echo -n") or hand-edited JSON files which
  // can accidentally contain trailing whitespace.
  dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE, true);

  return nullptr;
}

// ----

// ignore_return_value suppresses errors from -Wall -Werror.
static void  //
ignore_return_value(int ignored) {}

const char*  //
read_src() {
  if (src.meta.closed) {
    return "main: internal error: read requested on a closed source";
  }
  src.compact();
  if (src.meta.wi >= src.data.len) {
    return "main: src buffer is full";
  }
  while (true) {
    ssize_t n = read(input_file_descriptor, src.data.ptr + src.meta.wi,
                     src.data.len - src.meta.wi);
    if (n >= 0) {
      src.meta.wi += n;
      src.meta.closed = n == 0;
      break;
    } else if (errno != EINTR) {
      return strerror(errno);
    }
  }
  return nullptr;
}

const char*  //
flush_dst() {
  while (true) {
    size_t n = dst.meta.wi - dst.meta.ri;
    if (n == 0) {
      break;
    }
    const int stdout_fd = 1;
    ssize_t i = write(stdout_fd, dst.data.ptr + dst.meta.ri, n);
    if (i >= 0) {
      dst.meta.ri += i;
    } else if (errno != EINTR) {
      return strerror(errno);
    }
  }
  dst.compact();
  return nullptr;
}

const char*  //
write_dst(const void* s, size_t n) {
  if (suppress_write_dst > 0) {
    return nullptr;
  }
  const uint8_t* p = static_cast<const uint8_t*>(s);
  while (n > 0) {
    size_t i = dst.writer_available();
    if (i == 0) {
      const char* z = flush_dst();
      if (z) {
        return z;
      }
      i = dst.writer_available();
      if (i == 0) {
        return "main: dst buffer is full";
      }
    }

    if (i > n) {
      i = n;
    }
    memcpy(dst.data.ptr + dst.meta.wi, p, i);
    dst.meta.wi += i;
    p += i;
    n -= i;
    wrote_to_dst = true;
  }
  return nullptr;
}

// ----

uint8_t  //
hex_digit(uint8_t nibble) {
  nibble &= 0x0F;
  if (nibble <= 9) {
    return '0' + nibble;
  }
  return ('A' - 10) + nibble;
}

const char*  //
handle_unicode_code_point(uint32_t ucp) {
  if (ucp < 0x0020) {
    switch (ucp) {
      case '\b':
        return write_dst("\\b", 2);
      case '\f':
        return write_dst("\\f", 2);
      case '\n':
        return write_dst("\\n", 2);
      case '\r':
        return write_dst("\\r", 2);
      case '\t':
        return write_dst("\\t", 2);
      default: {
        // Other bytes less than 0x0020 are valid UTF-8 but not valid in a
        // JSON string. They need to remain escaped.
        uint8_t esc6[6];
        esc6[0] = '\\';
        esc6[1] = 'u';
        esc6[2] = '0';
        esc6[3] = '0';
        esc6[4] = hex_digit(ucp >> 4);
        esc6[5] = hex_digit(ucp >> 0);
        return write_dst(&esc6[0], 6);
      }
    }

  } else if (ucp == '\"') {
    return write_dst("\\\"", 2);

  } else if (ucp == '\\') {
    return write_dst("\\\\", 2);

  } else {
    uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
    size_t n = wuffs_base__utf_8__encode(
        wuffs_base__make_slice_u8(&u[0],
                                  WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
        ucp);
    if (n > 0) {
      return write_dst(&u[0], n);
    }
  }

  return "main: internal error: unexpected Unicode code point";
}

const char*  //
handle_token(wuffs_base__token t) {
  do {
    uint64_t vbc = t.value_base_category();
    uint64_t vbd = t.value_base_detail();
    uint64_t len = t.length();

    // Handle ']' or '}'.
    if ((vbc == WUFFS_BASE__TOKEN__VBC__STRUCTURE) &&
        (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__POP)) {
      if (query.is_at(depth)) {
        return "main: no match for query";
      }
      if (depth <= 0) {
        return "main: internal error: inconsistent depth";
      }
      depth--;

      if (query.matched_all() && (depth >= flags.max_output_depth)) {
        suppress_write_dst--;
        // '…' is U+2026 HORIZONTAL ELLIPSIS, which is 3 UTF-8 bytes.
        TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST)
                          ? "\"[…]\""
                          : "\"{…}\"",
                      7));
      } else {
        // Write preceding whitespace.
        if ((ctx != context::in_list_after_bracket) &&
            (ctx != context::in_dict_after_brace) && !flags.compact_output) {
          TRY(write_dst("\n", 1));
          for (uint32_t i = 0; i < depth; i++) {
            TRY(write_dst(flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
                          flags.tabs ? 1 : flags.indent));
          }
        }

        TRY(write_dst(
            (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST) ? "]" : "}",
            1));
      }

      ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                ? context::in_list_after_value
                : context::in_dict_after_key;
      goto after_value;
    }

    // Write preceding whitespace and punctuation, if it wasn't ']', '}' or a
    // continuation of a multi-token chain.
    if (!t.link_prev()) {
      if (ctx == context::in_dict_after_key) {
        TRY(write_dst(": ", flags.compact_output ? 1 : 2));
      } else if (ctx != context::none) {
        if ((ctx != context::in_list_after_bracket) &&
            (ctx != context::in_dict_after_brace)) {
          TRY(write_dst(",", 1));
        }
        if (!flags.compact_output) {
          TRY(write_dst("\n", 1));
          for (size_t i = 0; i < depth; i++) {
            TRY(write_dst(flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
                          flags.tabs ? 1 : flags.indent));
          }
        }
      }

      bool query_matched_fragment = false;
      if (query.is_at(depth)) {
        switch (ctx) {
          case context::in_list_after_bracket:
          case context::in_list_after_value:
            query_matched_fragment = query.tick();
            break;
          case context::in_dict_after_key:
            query_matched_fragment = query.matched_fragment();
            break;
          default:
            break;
        }
      }
      if (!query_matched_fragment) {
        // No-op.
      } else if (!query.next_fragment()) {
        // There is no next fragment. We have matched the complete query, and
        // the upcoming JSON value is the result of that query.
        //
        // Un-suppress writing to stdout and reset the ctx and depth as if we
        // were about to decode a top-level value. This makes any subsequent
        // indentation be relative to this point, and we will return eod after
        // the upcoming JSON value is complete.
        if (suppress_write_dst != 1) {
          return "main: internal error: inconsistent suppress_write_dst";
        }
        suppress_write_dst = 0;
        ctx = context::none;
        depth = 0;
      } else if ((vbc != WUFFS_BASE__TOKEN__VBC__STRUCTURE) ||
                 !(vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__PUSH)) {
        // The query has moved on to the next fragment but the upcoming JSON
        // value is not a container.
        return "main: no match for query";
      }
    }

    // Handle the token itself: either a container ('[' or '{') or a simple
    // value: string (a chain of raw or escaped parts), literal or number.
    switch (vbc) {
      case WUFFS_BASE__TOKEN__VBC__STRUCTURE:
        if (query.matched_all() && (depth >= flags.max_output_depth)) {
          suppress_write_dst++;
        } else {
          TRY(write_dst(
              (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST) ? "[" : "{",
              1));
        }
        depth++;
        ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                  ? context::in_list_after_bracket
                  : context::in_dict_after_brace;
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__STRING:
        if (!t.link_prev()) {
          TRY(write_dst("\"", 1));
          query.restart_fragment(in_dict_before_key() && query.is_at(depth));
        }

        if (vbd & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_0_DST_1_SRC_DROP) {
          // No-op.
        } else if (vbd &
                   WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
          uint8_t* ptr = src.data.ptr + curr_token_end_src_index - len;
          TRY(write_dst(ptr, len));
          query.incremental_match_slice(ptr, len);
        } else {
          return "main: internal error: unexpected string-token conversion";
        }

        if (t.link_next()) {
          return nullptr;
        }
        TRY(write_dst("\"", 1));
        goto after_value;

      case WUFFS_BASE__TOKEN__VBC__UNICODE_CODE_POINT:
        if (!t.link_prev() || !t.link_next()) {
          return "main: internal error: unexpected unlinked token";
        }
        TRY(handle_unicode_code_point(vbd));
        query.incremental_match_code_point(vbd);
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__LITERAL:
      case WUFFS_BASE__TOKEN__VBC__NUMBER:
        TRY(write_dst(src.data.ptr + curr_token_end_src_index - len, len));
        goto after_value;
    }

    // Return an error if we didn't match the (vbc, vbd) pair.
    return "main: internal error: unexpected token";
  } while (0);

  // Book-keeping after completing a value (whether a container value or a
  // simple value). Empty parent containers are no longer empty. If the parent
  // container is a "{...}" object, toggle between keys and values.
after_value:
  if (depth == 0) {
    return eod;
  }
  switch (ctx) {
    case context::in_list_after_bracket:
      ctx = context::in_list_after_value;
      break;
    case context::in_dict_after_brace:
      ctx = context::in_dict_after_key;
      break;
    case context::in_dict_after_key:
      ctx = context::in_dict_after_value;
      break;
    case context::in_dict_after_value:
      ctx = context::in_dict_after_key;
      break;
    default:
      break;
  }
  return nullptr;
}

const char*  //
main1(int argc, char** argv) {
  TRY(initialize_globals(argc, argv));

  while (true) {
    wuffs_base__status status = dec.decode_tokens(
        &tok, &src,
        wuffs_base__make_slice_u8(work_buffer_array, WORK_BUFFER_ARRAY_SIZE));

    while (tok.meta.ri < tok.meta.wi) {
      wuffs_base__token t = tok.data.ptr[tok.meta.ri++];
      uint64_t n = t.length();
      if ((src.meta.ri - curr_token_end_src_index) < n) {
        return "main: internal error: inconsistent src indexes";
      }
      curr_token_end_src_index += n;

      // Skip filler tokens (e.g. whitespace).
      if (t.value() == 0) {
        continue;
      }

      const char* z = handle_token(t);
      if (z == nullptr) {
        continue;
      } else if (z == eod) {
        goto end_of_data;
      }
      return z;
    }

    if (status.repr == nullptr) {
      return "main: internal error: unexpected end of token stream";
    } else if (status.repr == wuffs_base__suspension__short_read) {
      if (curr_token_end_src_index != src.meta.ri) {
        return "main: internal error: inconsistent src indexes";
      }
      TRY(read_src());
      curr_token_end_src_index = src.meta.ri;
    } else if (status.repr == wuffs_base__suspension__short_write) {
      tok.compact();
    } else {
      return status.message();
    }
  }
end_of_data:

  // With a non-empty query, don't try to consume trailing whitespace or
  // confirm that we've processed all the tokens.
  if (flags.query_c_string && *flags.query_c_string) {
    return nullptr;
  }

  // Check that we've exhausted the input.
  if ((src.meta.ri == src.meta.wi) && !src.meta.closed) {
    TRY(read_src());
  }
  if ((src.meta.ri < src.meta.wi) || !src.meta.closed) {
    return "main: valid JSON followed by further (unexpected) data";
  }

  // Check that we've used all of the decoded tokens, other than trailing
  // filler tokens. For example, "true\n" is valid JSON (and fully consumed
  // with WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE enabled) with a trailing
  // filler token for the "\n".
  for (; tok.meta.ri < tok.meta.wi; tok.meta.ri++) {
    if (tok.data.ptr[tok.meta.ri].value_base_category() !=
        WUFFS_BASE__TOKEN__VBC__FILLER) {
      return "main: internal error: decoded OK but unprocessed tokens remain";
    }
  }

  return nullptr;
}

int  //
compute_exit_code(const char* status_msg) {
  if (!status_msg) {
    return 0;
  }
  size_t n;
  if (status_msg == usage) {
    n = strlen(status_msg);
  } else {
    n = strnlen(status_msg, 2047);
    if (n >= 2047) {
      status_msg = "main: internal error: error message is too long";
      n = strnlen(status_msg, 2047);
    }
  }
  const int stderr_fd = 2;
  ignore_return_value(write(stderr_fd, status_msg, n));
  ignore_return_value(write(stderr_fd, "\n", 1));
  // Return an exit code of 1 for regular (foreseen) errors, e.g. badly
  // formatted or unsupported input.
  //
  // Return an exit code of 2 for internal (exceptional) errors, e.g. defensive
  // run-time checks found that an internal invariant did not hold.
  //
  // Automated testing, including badly formatted inputs, can therefore
  // discriminate between expected failure (exit code 1) and unexpected failure
  // (other non-zero exit codes). Specifically, exit code 2 for internal
  // invariant violation, exit code 139 (which is 128 + SIGSEGV on x86_64
  // linux) for a segmentation fault (e.g. null pointer dereference).
  return strstr(status_msg, "internal error:") ? 2 : 1;
}

int  //
main(int argc, char** argv) {
  // Look for an input filename (the first non-flag argument) in argv. If there
  // is one, open it (but do not read from it) before we self-impose a sandbox.
  //
  // Flags start with "-", unless they come after a bare "--" arg.
  {
    bool dash_dash = false;
    int a;
    for (a = 1; a < argc; a++) {
      char* arg = argv[a];
      if ((arg[0] == '-') && !dash_dash) {
        dash_dash = (arg[1] == '-') && (arg[2] == '\x00');
        continue;
      }
      input_file_descriptor = open(arg, O_RDONLY);
      if (input_file_descriptor < 0) {
        fprintf(stderr, "%s: %s\n", arg, strerror(errno));
        return 1;
      }
      break;
    }
  }

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
  sandboxed = true;
#endif

  const char* z = main1(argc, argv);
  if (wrote_to_dst) {
    const char* z1 = write_dst("\n", 1);
    const char* z2 = flush_dst();
    z = z ? z : (z1 ? z1 : z2);
  }
  int exit_code = compute_exit_code(z);

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  // Call SYS_exit explicitly, instead of calling SYS_exit_group implicitly by
  // either calling _exit or returning from main. SECCOMP_MODE_STRICT allows
  // only SYS_exit.
  syscall(SYS_exit, exit_code);
#endif
  return exit_code;
}