// Copyright 2020 The Wuffs Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//    https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// ----------------

/*
jsonptr is a JSON formatter (pretty-printer) that supports the JSON Pointer
(RFC 6901) query syntax. It reads UTF-8 JSON from stdin and writes
canonicalized, formatted UTF-8 JSON to stdout.

See the "const char* usage" string below for details.

----

JSON Pointer (and this program's implementation) is one of many JSON query
languages and JSON tools, such as jq, jql and JMESPath. This one is relatively
simple and fewer-featured compared to those others.

One benefit of simplicity is that this program's JSON and JSON Pointer
implementations do not dynamically allocate or free memory (yet it does not
require that the entire input fits in memory at once). They are therefore
trivially protected against certain bug classes: memory leaks, double-frees and
use-after-frees.

The core JSON implementation is also written in the Wuffs programming language
(and then transpiled to C/C++), which is memory-safe but also guards against
integer arithmetic overflows.

For defense in depth, on Linux, this program also self-imposes a
SECCOMP_MODE_STRICT sandbox before reading (or otherwise processing) its input
or writing its output. Under this sandbox, the only permitted system calls are
read, write, exit and sigreturn.

All together, this program aims to safely handle untrusted JSON files without
fear of security bugs such as remote code execution.

----

As of 2020-02-24, this program passes all 318 "test_parsing" cases from the
JSON test suite (https://github.com/nst/JSONTestSuite), an appendix to the
"Parsing JSON is a Minefield" article (http://seriot.ch/parsing_json.php) that
was first published on 2016-10-26 and updated on 2018-03-30.

After modifying this program, run "build-example.sh example/jsonptr/" and then
"script/run-json-test-suite.sh" to catch correctness regressions.

----

This program uses Wuffs' JSON decoder at a relatively low level, processing the
decoder's token-stream output individually. The core loop, in pseudo-code, is
"for_each_token { handle_token(etc); }", where the handle_token function
changes global state (e.g. the `depth` and `context` variables) and prints
output text based on that state and the token's source text. Notably,
handle_token is not recursive, even though JSON values can nest.

This approach is centered around JSON tokens. Each JSON 'thing' (e.g. number,
string, object) comprises one or more JSON tokens.
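
As a rough sketch of that non-recursive shape (an illustration only: it does
not use Wuffs, and toy_token and format_tokens are hypothetical names invented
for this comment), a single loop over a flat token list can track nesting with
a depth counter instead of the call stack:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// toy_token is a hypothetical stand-in for a decoder token. It is not the
// wuffs_base__token type.
struct toy_token {
  enum kind_t { open_container, close_container, scalar } kind;
  std::string text;
};

// format_tokens visits each token exactly once. Nesting is tracked by the
// depth counter and the first_in_container flag, not by recursion.
std::string format_tokens(const std::vector<toy_token>& tokens) {
  std::string out;
  std::size_t depth = 0;
  bool first_in_container = true;
  for (const toy_token& t : tokens) {
    if (t.kind == toy_token::close_container) {
      if (depth > 0) {
        depth--;
      }
      // An empty container keeps its brackets together, e.g. "[]".
      if (!first_in_container) {
        out += "\n" + std::string(2 * depth, ' ');
      }
      out += t.text;
      first_in_container = false;
      continue;
    }
    // Every value inside a container starts on its own indented line.
    if (depth > 0) {
      out += first_in_container ? "\n" : ",\n";
      out += std::string(2 * depth, ' ');
    }
    out += t.text;
    if (t.kind == toy_token::open_container) {
      depth++;
      first_in_container = true;
    } else {
      first_in_container = false;
    }
  }
  return out;
}
```

For the tokens of [1,[2],[]], format_tokens returns the same value spread over
several lines, indented two spaces per nesting level. The real handle_token
also has to deal with object keys, strings split across multiple tokens and
the JSON Pointer query.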

An alternative, higher-level approach is in the sibling example/jsonfindptrs
program. Neither approach is better or worse per se, but when studying this
program, be aware that there are multiple ways to use Wuffs' JSON decoder.

The two programs, jsonfindptrs and jsonptr, also demonstrate different
trade-offs with regard to JSON object duplicate keys. The JSON spec permits
different implementations to allow or reject duplicate keys. It is not always
clear which approach is safer. Rejecting them is certainly unambiguous, and
security bugs can lurk in ambiguous corners of a file format, if two different
implementations both silently accept a file but differ on how to interpret it.
On the other hand, in the worst case, detecting duplicate keys requires O(N)
memory, where N is the size of the (potentially untrusted) input.

This program (jsonptr) allows duplicate keys and requires only O(1) memory. As
mentioned above, it doesn't dynamically allocate memory at all, and on Linux,
it runs in a SECCOMP_MODE_STRICT sandbox.
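
To make that memory trade-off concrete, here is a minimal sketch (not code from
either program; has_duplicate_key is a hypothetical helper invented for this
comment) of the bookkeeping that rejecting duplicate keys implies for a single
JSON object:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// has_duplicate_key is a hypothetical helper, not something that jsonptr or
// Wuffs provides. It reports whether any key of one JSON object repeats.
bool has_duplicate_key(const std::vector<std::string>& keys) {
  // The set has to remember every distinct key seen so far. That is where
  // the worst-case O(N) memory comes from.
  std::unordered_set<std::string> seen;
  for (const std::string& k : keys) {
    if (!seen.insert(k).second) {
      return true;  // The insert failed, so k was already present.
    }
  }
  return false;
}
```

jsonptr skips this bookkeeping entirely, which is how it stays at O(1) memory
while allowing duplicate keys.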

----

This example program differs from most other example Wuffs programs in that it
is written in C++, not C.

    $CXX jsonptr.cc && ./a.out < ../../test/data/github-tags.json; rm -f a.out

for a C++ compiler $CXX, such as clang++ or g++.
*/

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Wuffs ships as a "single file C library" or "header file library" as per
// https://github.com/nothings/stb/blob/master/docs/stb_howto.txt
//
// To use that single file as a "foo.c"-like implementation, instead of a
// "foo.h"-like header, #define WUFFS_IMPLEMENTATION before #include'ing or
// compiling it.
#define WUFFS_IMPLEMENTATION

// Defining the WUFFS_CONFIG__MODULE* macros is optional, but it lets users of
// release/c/etc.c whitelist which parts of Wuffs to build. That file contains
// the entire Wuffs standard library, implementing a variety of codecs and file
// formats. Without this macro definition, an optimizing compiler or linker may
// very well discard Wuffs code for unused codecs, but listing the Wuffs
// modules we use makes that process explicit. Preprocessing means that such
// code simply isn't compiled.
#define WUFFS_CONFIG__MODULES
#define WUFFS_CONFIG__MODULE__BASE
#define WUFFS_CONFIG__MODULE__JSON

// If building this program in an environment that doesn't easily accommodate
// relative includes, you can use the script/inline-c-relative-includes.go
// program to generate a stand-alone C++ file.
#include "../../release/c/wuffs-unsupported-snapshot.c"

#if defined(__linux__)
#include <linux/prctl.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#define WUFFS_EXAMPLE_USE_SECCOMP
#endif

#define TRY(error_msg)         \
  do {                         \
    const char* z = error_msg; \
    if (z) {                   \
      return z;                \
    }                          \
  } while (false)

static const char* eod = "main: end of data";

static const char* usage =
    "Usage: jsonptr -flags input.json\n"
    "\n"
    "Flags:\n"
    "    -c      -compact-output\n"
    "    -i=NUM  -indent=NUM\n"
    "    -o=NUM  -max-output-depth=NUM\n"
    "    -q=STR  -query=STR\n"
    "    -s      -strict-json-pointer-syntax\n"
    "    -t      -tabs\n"
    "            -fail-if-unsandboxed\n"
    "\n"
    "The input.json filename is optional. If absent, it reads from stdin.\n"
    "\n"
    "----\n"
    "\n"
    "jsonptr is a JSON formatter (pretty-printer) that supports the JSON\n"
    "Pointer (RFC 6901) query syntax. It reads UTF-8 JSON from stdin and\n"
    "writes canonicalized, formatted UTF-8 JSON to stdout.\n"
    "\n"
    "Canonicalized means that e.g. \"abc\\u000A\\tx\\u0177z\" is re-written\n"
    "as \"abc\\n\\txŷz\". It does not sort object keys, nor does it reject\n"
    "duplicate keys. Canonicalization does not imply Unicode normalization.\n"
    "\n"
    "Formatted means that arrays' and objects' elements are indented, each\n"
    "on its own line. Configure this with the -c / -compact-output, -i=NUM /\n"
    "-indent=NUM (for NUM ranging from 0 to 8) and -t / -tabs flags.\n"
    "\n"
    "----\n"
    "\n"
    "The -q=STR or -query=STR flag gives an optional JSON Pointer query, to\n"
    "print a subset of the input. For example, given RFC 6901 section 5's\n"
    "sample input (https://tools.ietf.org/rfc/rfc6901.txt), this command:\n"
    "    jsonptr -query=/foo/1 rfc-6901-json-pointer.json\n"
    "will print:\n"
    "    \"baz\"\n"
    "\n"
    "An absent query is equivalent to the empty query, which identifies the\n"
    "entire input (the root value). Unlike a file system, the \"/\" query\n"
    "does not identify the root. Instead, \"\" is the root and \"/\" is the\n"
    "child (the value in a key-value pair) of the root whose key is the empty\n"
    "string. Similarly, \"/xyz\" and \"/xyz/\" are two different nodes.\n"
    "\n"
    "If the query found a valid JSON value, this program will return a zero\n"
    "exit code even if the rest of the input isn't valid JSON. If the query\n"
    "did not find a value, or found an invalid one, this program returns a\n"
    "non-zero exit code, but may still print partial output to stdout.\n"
    "\n"
    "The JSON specification (https://json.org/) permits implementations that\n"
    "allow duplicate keys, as this one does. This JSON Pointer implementation\n"
    "is also greedy, following the first match for each fragment without\n"
    "back-tracking. For example, the \"/foo/bar\" query will fail if the root\n"
    "object has multiple \"foo\" children but the first one doesn't have a\n"
    "\"bar\" child, even if later ones do.\n"
    "\n"
    "The -s or -strict-json-pointer-syntax flag restricts the -query=STR\n"
    "string to exactly RFC 6901, with only two escape sequences: \"~0\" and\n"
    "\"~1\" for \"~\" and \"/\". Without this flag, this program also lets\n"
    "\"~n\" and \"~r\" escape the New Line and Carriage Return ASCII control\n"
    "characters, which can work better with line oriented Unix tools that\n"
    "assume exactly one value (i.e. one JSON Pointer string) per line.\n"
    "\n"
    "----\n"
    "\n"
    "The -o=NUM or -max-output-depth=NUM flag gives the maximum (inclusive)\n"
    "output depth. JSON containers ([] arrays and {} objects) can hold other\n"
    "containers. When this flag is set, containers at depth NUM are replaced\n"
    "with \"[…]\" or \"{…}\". A bare -o or -max-output-depth is equivalent to\n"
    "-o=1. The flag's absence is equivalent to an unlimited output depth.\n"
    "\n"
    "The -max-output-depth flag only affects the program's output. It doesn't\n"
    "affect whether or not the input is considered valid JSON. The JSON\n"
    "specification permits implementations to set their own maximum input\n"
    "depth. This JSON implementation sets it to 1024.\n"
    "\n"
    "Depth is measured in terms of nested containers. It is unaffected by the\n"
    "number of spaces or tabs used to indent.\n"
    "\n"
    "When both -max-output-depth and -query are set, the output depth is\n"
    "measured from when the query resolves, not from the input root. The\n"
    "input depth (measured from the root) is still limited to 1024.\n"
    "\n"
    "----\n"
    "\n"
    "The -fail-if-unsandboxed flag causes the program to exit if it does not\n"
    "self-impose a sandbox. On Linux, it self-imposes a SECCOMP_MODE_STRICT\n"
    "sandbox, regardless of whether this flag was set.";

// ----

bool sandboxed = false;

int input_file_descriptor = 0;  // A 0 default means stdin.

#define MAX_INDENT 8
#define INDENT_SPACES_STRING "        "
#define INDENT_TAB_STRING "\t"

#ifndef DST_BUFFER_ARRAY_SIZE
#define DST_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef SRC_BUFFER_ARRAY_SIZE
#define SRC_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef TOKEN_BUFFER_ARRAY_SIZE
#define TOKEN_BUFFER_ARRAY_SIZE (4 * 1024)
#endif

uint8_t dst_array[DST_BUFFER_ARRAY_SIZE];
uint8_t src_array[SRC_BUFFER_ARRAY_SIZE];
wuffs_base__token tok_array[TOKEN_BUFFER_ARRAY_SIZE];

wuffs_base__io_buffer dst;
wuffs_base__io_buffer src;
wuffs_base__token_buffer tok;

// curr_token_end_src_index is the src.data.ptr index of the end of the current
// token. An invariant is that (curr_token_end_src_index <= src.meta.ri).
size_t curr_token_end_src_index;

uint32_t depth;

enum class context {
  none,
  in_list_after_bracket,
  in_list_after_value,
  in_dict_after_brace,
  in_dict_after_key,
  in_dict_after_value,
} ctx;

bool  //
in_dict_before_key() {
  return (ctx == context::in_dict_after_brace) ||
         (ctx == context::in_dict_after_value);
}

uint32_t suppress_write_dst;
bool wrote_to_dst;

wuffs_json__decoder dec;

// ----

// Query is a JSON Pointer query. After initializing with a NUL-terminated C
// string, its multiple fragments are consumed as the program walks the JSON
// data from stdin. For example, letting "$" denote a NUL, suppose that we
// started with a query string of "/apple/banana/12/durian" and are currently
// trying to match the second fragment, "banana", so that Query::m_depth is 2:
//
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//  / a p p l e / b a n a n a / 1 2 / d u r i a n $
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                ^           ^
//                m_frag_i    m_frag_k
//
// The two pointers m_frag_i and m_frag_k (abbreviated as mfi and mfk) are the
// start (inclusive) and end (exclusive) of the query fragment. They satisfy
// (mfi <= mfk) and may be equal if the fragment is empty (note that "" is a
// valid JSON object key).
//
// The m_frag_j (mfj) pointer moves between these two, or is nullptr. An
// invariant is that (((mfi <= mfj) && (mfj <= mfk)) || (mfj == nullptr)).
//
// Wuffs' JSON tokenizer can portray a single JSON string as multiple Wuffs
// tokens, as backslash-escaped values within that JSON string may each get
// their own token.
//
// At the start of each object key (a JSON string), mfj is set to mfi.
//
// While mfj remains non-nullptr, each token's unescaped contents are then
// compared to that part of the fragment from mfj to mfk. If it is a prefix
// (including the case of an exact match), then mfj is advanced by the
// unescaped length. Otherwise, mfj is set to nullptr.
//
// Comparison accounts for JSON Pointer's escaping notation: "~0" and "~1" in
// the query (not the JSON value) are unescaped to "~" and "/" respectively.
// "~n" and "~r" are also unescaped to "\n" and "\r". The program is
// responsible for calling Query::validate (with a strict_json_pointer_syntax
// argument) before otherwise using this class.
//
// The mfj pointer therefore advances from mfi to mfk, or drops out, as we
// incrementally match the object key with the query fragment. For example, if
// we have already matched the "ban" of "banana", then we would accept any of
// an "ana" token, an "a" token or a "\u0061" token, amongst others. They would
// advance mfj by 3, 1 or 1 bytes respectively.
//
//                     mfj
//                      v
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//  / a p p l e / b a n a n a / 1 2 / d u r i a n $
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                ^           ^
//               mfi         mfk
//
// At the end of each object key (or equivalently, at the start of each object
// value), if mfj is non-nullptr and equal to (but not less than) mfk then we
// have a fragment match: the query fragment equals the object key. If there is
// a next fragment (in this example, "12") we move the frag_etc pointers to its
// start and end and increment Query::m_depth. Otherwise, we have matched the
// complete query, and the upcoming JSON value is the result of that query.
//
// The discussion above centers on object keys. If the query fragment is
// numeric then it can also match as an array index: the string fragment "12"
// will match an array's 13th element (starting counting from zero). See RFC
// 6901 for its precise definition of an "array index" number.
//
// Array index fragment match is represented by the Query::m_array_index field,
// whose type (wuffs_base__result_u64) is a result type. An error result means
// that the fragment is not an array index. A value result holds the number of
// list elements remaining. When matching a query fragment in an array (instead
// of in an object), each element ticks this number down towards zero. At zero,
// the upcoming JSON value is the one that matches the query fragment.
class Query {
 private:
  uint8_t* m_frag_i;
  uint8_t* m_frag_j;
  uint8_t* m_frag_k;

  uint32_t m_depth;

  wuffs_base__result_u64 m_array_index;

 public:
  void reset(char* query_c_string) {
    m_frag_i = (uint8_t*)query_c_string;
    m_frag_j = (uint8_t*)query_c_string;
    m_frag_k = (uint8_t*)query_c_string;
    m_depth = 0;
    m_array_index.status.repr = "#main: not an array index query fragment";
    m_array_index.value = 0;
  }

  void restart_fragment(bool enable) { m_frag_j = enable ? m_frag_i : nullptr; }

  bool is_at(uint32_t depth) { return m_depth == depth; }

  // tick returns whether the fragment is a valid array index whose value is
  // zero. If valid but non-zero, it decrements it and returns false.
  bool tick() {
    if (m_array_index.status.is_ok()) {
      if (m_array_index.value == 0) {
        return true;
      }
      m_array_index.value--;
    }
    return false;
  }

  // next_fragment moves to the next fragment, returning whether it existed.
  bool next_fragment() {
    uint8_t* k = m_frag_k;
    uint32_t d = m_depth;

    this->reset(nullptr);

    if (!k || (*k != '/')) {
      return false;
    }
    k++;

    bool all_digits = true;
    uint8_t* i = k;
    while ((*k != '\x00') && (*k != '/')) {
      all_digits = all_digits && ('0' <= *k) && (*k <= '9');
      k++;
    }
    m_frag_i = i;
    m_frag_j = i;
    m_frag_k = k;
    m_depth = d + 1;
    if (all_digits) {
      // wuffs_base__parse_number_u64 rejects leading zeroes, e.g. "00", "07".
      m_array_index =
          wuffs_base__parse_number_u64(wuffs_base__make_slice_u8(i, k - i));
    }
    return true;
  }

  bool matched_all() { return m_frag_k == nullptr; }

  bool matched_fragment() { return m_frag_j && (m_frag_j == m_frag_k); }

  void incremental_match_slice(uint8_t* ptr, size_t len) {
    if (!m_frag_j) {
      return;
    }
    uint8_t* j = m_frag_j;
    while (true) {
      if (len == 0) {
        m_frag_j = j;
        return;
      }

      if (*j == '\x00') {
        break;

      } else if (*j == '~') {
        j++;
        if (*j == '0') {
          if (*ptr != '~') {
            break;
          }
        } else if (*j == '1') {
          if (*ptr != '/') {
            break;
          }
        } else if (*j == 'n') {
          if (*ptr != '\n') {
            break;
          }
        } else if (*j == 'r') {
          if (*ptr != '\r') {
            break;
          }
        } else {
          break;
        }

      } else if (*j != *ptr) {
        break;
      }

      j++;
      ptr++;
      len--;
    }
    m_frag_j = nullptr;
  }

  void incremental_match_code_point(uint32_t code_point) {
    if (!m_frag_j) {
      return;
    }
    uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
    size_t n = wuffs_base__utf_8__encode(
        wuffs_base__make_slice_u8(&u[0],
                                  WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
        code_point);
    if (n > 0) {
      this->incremental_match_slice(&u[0], n);
    }
  }

  // validate returns whether the (query_c_string, length) arguments form a
  // valid JSON Pointer. In particular, it must be valid UTF-8, and either be
  // empty or start with a '/'. Any '~' within must immediately be followed by
  // either '0' or '1'. If strict_json_pointer_syntax is false, a '~' may also
  // be followed by either 'n' or 'r'.
  static bool validate(char* query_c_string,
                       size_t length,
                       bool strict_json_pointer_syntax) {
    if (length <= 0) {
      return true;
    }
    if (query_c_string[0] != '/') {
      return false;
    }
    wuffs_base__slice_u8 s =
        wuffs_base__make_slice_u8((uint8_t*)query_c_string, length);
    bool previous_was_tilde = false;
    while (s.len > 0) {
      wuffs_base__utf_8__next__output o = wuffs_base__utf_8__next(s);
      if (!o.is_valid()) {
        return false;
      }

      if (previous_was_tilde) {
        switch (o.code_point) {
          case '0':
          case '1':
            break;
          case 'n':
          case 'r':
            if (strict_json_pointer_syntax) {
              return false;
            }
            break;
          default:
            return false;
        }
      }
      previous_was_tilde = o.code_point == '~';

      s.ptr += o.byte_length;
      s.len -= o.byte_length;
    }
    return !previous_was_tilde;
  }
} query;

// ----

struct {
  int remaining_argc;
  char** remaining_argv;

  bool compact_output;
  bool fail_if_unsandboxed;
  size_t indent;
  uint32_t max_output_depth;
  char* query_c_string;
  bool strict_json_pointer_syntax;
  bool tabs;
} flags = {0};

const char*  //
parse_flags(int argc, char** argv) {
  flags.indent = 4;
  flags.max_output_depth = 0xFFFFFFFF;

  int c = (argc > 0) ? 1 : 0;  // Skip argv[0], the program name.
  for (; c < argc; c++) {
    char* arg = argv[c];
    if (*arg++ != '-') {
      break;
    }

    // A double-dash "--foo" is equivalent to a single-dash "-foo". As special
    // cases, a bare "-" is not a flag (some programs may interpret it as
    // stdin) and a bare "--" means to stop parsing flags.
    if (*arg == '\x00') {
      break;
    } else if (*arg == '-') {
      arg++;
      if (*arg == '\x00') {
        c++;
        break;
      }
    }

    if (!strcmp(arg, "c") || !strcmp(arg, "compact-output")) {
      flags.compact_output = true;
      continue;
    }
    if (!strcmp(arg, "fail-if-unsandboxed")) {
      flags.fail_if_unsandboxed = true;
      continue;
    }
    if (!strncmp(arg, "i=", 2) || !strncmp(arg, "indent=", 7)) {
      while (*arg++ != '=') {
      }
      if (('0' <= arg[0]) && (arg[0] <= '8') && (arg[1] == '\x00')) {
        flags.indent = arg[0] - '0';
        continue;
      }
      return usage;
    }
    if (!strcmp(arg, "o") || !strcmp(arg, "max-output-depth")) {
      flags.max_output_depth = 1;
      continue;
    } else if (!strncmp(arg, "o=", 2) ||
               !strncmp(arg, "max-output-depth=", 17)) {
      while (*arg++ != '=') {
      }
      wuffs_base__result_u64 u = wuffs_base__parse_number_u64(
          wuffs_base__make_slice_u8((uint8_t*)arg, strlen(arg)));
      if (wuffs_base__status__is_ok(&u.status) && (u.value <= 0xFFFFFFFF)) {
        flags.max_output_depth = (uint32_t)(u.value);
        continue;
      }
      return usage;
    }
    if (!strncmp(arg, "q=", 2) || !strncmp(arg, "query=", 6)) {
      while (*arg++ != '=') {
      }
      flags.query_c_string = arg;
      continue;
    }
    if (!strcmp(arg, "s") || !strcmp(arg, "strict-json-pointer-syntax")) {
      flags.strict_json_pointer_syntax = true;
      continue;
    }
    if (!strcmp(arg, "t") || !strcmp(arg, "tabs")) {
      flags.tabs = true;
      continue;
    }

    return usage;
  }

  if (flags.query_c_string &&
      !Query::validate(flags.query_c_string, strlen(flags.query_c_string),
                       flags.strict_json_pointer_syntax)) {
    return "main: bad JSON Pointer (RFC 6901) syntax for the -query=STR flag";
  }

  flags.remaining_argc = argc - c;
  flags.remaining_argv = argv + c;
  return nullptr;
}

const char*  //
initialize_globals(int argc, char** argv) {
  dst = wuffs_base__make_io_buffer(
      wuffs_base__make_slice_u8(dst_array, DST_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_io_buffer_meta());

  src = wuffs_base__make_io_buffer(
      wuffs_base__make_slice_u8(src_array, SRC_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_io_buffer_meta());

  tok = wuffs_base__make_token_buffer(
      wuffs_base__make_slice_token(tok_array, TOKEN_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_token_buffer_meta());

  curr_token_end_src_index = 0;

  depth = 0;

  ctx = context::none;

  TRY(parse_flags(argc, argv));
  if (flags.fail_if_unsandboxed && !sandboxed) {
    return "main: unsandboxed";
  }
  const int stdin_fd = 0;
  if (flags.remaining_argc > ((input_file_descriptor != stdin_fd) ? 1 : 0)) {
    return usage;
  }

  query.reset(flags.query_c_string);

  // If the query is non-empty, suppress writing to stdout until we've
  // completed the query.
  suppress_write_dst = query.next_fragment() ? 1 : 0;
  wrote_to_dst = false;

  return dec.initialize(sizeof__wuffs_json__decoder(), WUFFS_VERSION, 0)
      .message();
}
Nigel Tao1b073492020-02-16 22:11:36 +1100676
677// ----
678
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100679// ignore_return_value suppresses errors from -Wall -Werror.
680static void //
681ignore_return_value(int ignored) {}
682
const char*  //
read_src() {
  if (src.meta.closed) {
    return "main: internal error: read requested on a closed source";
  }
  src.compact();
  if (src.meta.wi >= src.data.len) {
    return "main: src buffer is full";
  }
  while (true) {
    ssize_t n = read(input_file_descriptor, src.data.ptr + src.meta.wi,
                     src.data.len - src.meta.wi);
    if (n >= 0) {
      src.meta.wi += n;
      src.meta.closed = n == 0;
      break;
    } else if (errno != EINTR) {
      return strerror(errno);
    }
  }
  return nullptr;
}

const char*  //
flush_dst() {
  while (true) {
    size_t n = dst.meta.wi - dst.meta.ri;
    if (n == 0) {
      break;
    }
    const int stdout_fd = 1;
    ssize_t i = write(stdout_fd, dst.data.ptr + dst.meta.ri, n);
    if (i >= 0) {
      dst.meta.ri += i;
    } else if (errno != EINTR) {
      return strerror(errno);
    }
  }
  dst.compact();
  return nullptr;
}

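/*
The retry pattern used by read_src and flush_dst above can be sketched in
isolation. This is a minimal illustration, not part of jsonptr: write_all is a
hypothetical helper name, and it works on an arbitrary file descriptor and
byte range instead of the global dst buffer. A short write advances the
pointer and loops again, EINTR retries the same write, and any other errno is
reported via strerror.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unistd.h>

// write_all is a hypothetical helper (an assumption for illustration). It
// returns nullptr on success or a strerror message on failure.
const char* write_all(int fd, const uint8_t* ptr, size_t len) {
  assert((ptr != nullptr) || (len == 0));
  while (len > 0) {
    ssize_t n = write(fd, ptr, len);
    if (n >= 0) {
      // A short write is not an error: advance past what was written.
      ptr += n;
      len -= static_cast<size_t>(n);
    } else if (errno != EINTR) {
      return strerror(errno);
    }
    // On EINTR, loop around and retry the same write.
  }
  return nullptr;
}
```

As in flush_dst, the loop only ends when every byte has been written or a
non-EINTR error occurs.
*/
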
const char*  //
write_dst(const void* s, size_t n) {
  if (suppress_write_dst > 0) {
    return nullptr;
  }
  const uint8_t* p = static_cast<const uint8_t*>(s);
  while (n > 0) {
    size_t i = dst.writer_available();
    if (i == 0) {
      const char* z = flush_dst();
      if (z) {
        return z;
      }
      i = dst.writer_available();
      if (i == 0) {
        return "main: dst buffer is full";
      }
    }

    if (i > n) {
      i = n;
    }
    memcpy(dst.data.ptr + dst.meta.wi, p, i);
    dst.meta.wi += i;
    p += i;
    n -= i;
    wrote_to_dst = true;
  }
  return nullptr;
}

// ----

uint8_t  //
hex_digit(uint8_t nibble) {
  nibble &= 0x0F;
  if (nibble <= 9) {
    return '0' + nibble;
  }
  return ('A' - 10) + nibble;
}

const char*  //
handle_unicode_code_point(uint32_t ucp) {
  if (ucp < 0x0020) {
    switch (ucp) {
      case '\b':
        return write_dst("\\b", 2);
      case '\f':
        return write_dst("\\f", 2);
      case '\n':
        return write_dst("\\n", 2);
      case '\r':
        return write_dst("\\r", 2);
      case '\t':
        return write_dst("\\t", 2);
      default: {
        // Other bytes less than 0x0020 are valid UTF-8 but not valid in a
        // JSON string. They need to remain escaped.
        uint8_t esc6[6];
        esc6[0] = '\\';
        esc6[1] = 'u';
        esc6[2] = '0';
        esc6[3] = '0';
        esc6[4] = hex_digit(ucp >> 4);
        esc6[5] = hex_digit(ucp >> 0);
        return write_dst(&esc6[0], 6);
      }
    }

  } else if (ucp == '\"') {
    return write_dst("\\\"", 2);

  } else if (ucp == '\\') {
    return write_dst("\\\\", 2);

  } else {
    uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
    size_t n = wuffs_base__utf_8__encode(
        wuffs_base__make_slice_u8(&u[0],
                                  WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
        ucp);
    if (n > 0) {
      return write_dst(&u[0], n);
    }
  }

  return "main: internal error: unexpected Unicode code point";
}

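/*
The control-character branch of handle_unicode_code_point can be sketched as a
pure function that returns the escape sequence instead of writing it to dst.
This is an illustration only: escape_control_char is a hypothetical name, and
returning a std::string is an assumption made for readability (jsonptr itself
does not allocate). The short escapes are used where JSON defines them and the
six-byte \u00XX form, with upper-case hex digits, is used otherwise.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// escape_control_char is a hypothetical helper (an assumption for
// illustration). It requires a code point below 0x20.
std::string escape_control_char(uint32_t ucp) {
  assert(ucp < 0x20);
  switch (ucp) {
    case '\b':
      return "\\b";
    case '\f':
      return "\\f";
    case '\n':
      return "\\n";
    case '\r':
      return "\\r";
    case '\t':
      return "\\t";
  }
  // Every other control character keeps the generic \u00XX form.
  static const char hex[] = "0123456789ABCDEF";
  std::string s = "\\u00";
  s += hex[(ucp >> 4) & 0x0F];
  s += hex[(ucp >> 0) & 0x0F];
  return s;
}
```
*/
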
const char*  //
handle_token(wuffs_base__token t) {
  do {
    uint64_t vbc = t.value_base_category();
    uint64_t vbd = t.value_base_detail();
    uint64_t len = t.length();

    // Handle ']' or '}'.
    if ((vbc == WUFFS_BASE__TOKEN__VBC__STRUCTURE) &&
        (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__POP)) {
      if (query.is_at(depth)) {
        return "main: no match for query";
      }
      if (depth <= 0) {
        return "main: internal error: inconsistent depth";
      }
      depth--;

      if (query.matched_all() && (depth >= flags.max_output_depth)) {
        suppress_write_dst--;
        // '…' is U+2026 HORIZONTAL ELLIPSIS, which is 3 UTF-8 bytes.
        TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST)
                          ? "\"[…]\""
                          : "\"{…}\"",
                      7));
      } else {
        // Write preceding whitespace.
        if ((ctx != context::in_list_after_bracket) &&
            (ctx != context::in_dict_after_brace) && !flags.compact_output) {
          TRY(write_dst("\n", 1));
          for (uint32_t i = 0; i < depth; i++) {
            TRY(write_dst(flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
                          flags.tabs ? 1 : flags.indent));
          }
        }

        TRY(write_dst(
            (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST) ? "]" : "}",
            1));
      }

      ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                ? context::in_list_after_value
                : context::in_dict_after_key;
      goto after_value;
    }

    // Write preceding whitespace and punctuation, if it wasn't ']', '}' or a
    // continuation of a multi-token chain.
    if (!t.link_prev()) {
      if (ctx == context::in_dict_after_key) {
        TRY(write_dst(": ", flags.compact_output ? 1 : 2));
      } else if (ctx != context::none) {
        if ((ctx != context::in_list_after_bracket) &&
            (ctx != context::in_dict_after_brace)) {
          TRY(write_dst(",", 1));
        }
        if (!flags.compact_output) {
          TRY(write_dst("\n", 1));
          for (size_t i = 0; i < depth; i++) {
            TRY(write_dst(flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
                          flags.tabs ? 1 : flags.indent));
          }
        }
      }

      bool query_matched_fragment = false;
      if (query.is_at(depth)) {
        switch (ctx) {
          case context::in_list_after_bracket:
          case context::in_list_after_value:
            query_matched_fragment = query.tick();
            break;
          case context::in_dict_after_key:
            query_matched_fragment = query.matched_fragment();
            break;
        }
      }
      if (!query_matched_fragment) {
        // No-op.
      } else if (!query.next_fragment()) {
        // There is no next fragment. We have matched the complete query, and
        // the upcoming JSON value is the result of that query.
        //
        // Un-suppress writing to stdout and reset the ctx and depth as if we
        // were about to decode a top-level value. This makes any subsequent
        // indentation be relative to this point, and we will return eod after
        // the upcoming JSON value is complete.
        if (suppress_write_dst != 1) {
          return "main: internal error: inconsistent suppress_write_dst";
        }
        suppress_write_dst = 0;
        ctx = context::none;
        depth = 0;
      } else if ((vbc != WUFFS_BASE__TOKEN__VBC__STRUCTURE) ||
                 !(vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__PUSH)) {
        // The query has moved on to the next fragment but the upcoming JSON
        // value is not a container.
        return "main: no match for query";
      }
    }

    // Handle the token itself: either a container ('[' or '{') or a simple
    // value: string (a chain of raw or escaped parts), literal or number.
    switch (vbc) {
      case WUFFS_BASE__TOKEN__VBC__STRUCTURE:
        if (query.matched_all() && (depth >= flags.max_output_depth)) {
          suppress_write_dst++;
        } else {
          TRY(write_dst(
              (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST) ? "[" : "{",
              1));
        }
        depth++;
        ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                  ? context::in_list_after_bracket
                  : context::in_dict_after_brace;
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__STRING:
        if (!t.link_prev()) {
          TRY(write_dst("\"", 1));
          query.restart_fragment(in_dict_before_key() && query.is_at(depth));
        }

        if (vbd & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_0_DST_1_SRC_DROP) {
          // No-op.
        } else if (vbd &
                   WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
          uint8_t* ptr = src.data.ptr + curr_token_end_src_index - len;
          TRY(write_dst(ptr, len));
          query.incremental_match_slice(ptr, len);
        } else {
          return "main: internal error: unexpected string-token conversion";
        }

        if (t.link_next()) {
          return nullptr;
        }
        TRY(write_dst("\"", 1));
        goto after_value;

      case WUFFS_BASE__TOKEN__VBC__UNICODE_CODE_POINT:
        if (!t.link_prev() || !t.link_next()) {
          return "main: internal error: unexpected unlinked token";
        }
        TRY(handle_unicode_code_point(vbd));
        query.incremental_match_code_point(vbd);
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__LITERAL:
      case WUFFS_BASE__TOKEN__VBC__NUMBER:
        TRY(write_dst(src.data.ptr + curr_token_end_src_index - len, len));
        goto after_value;
    }

    // Return an error if we didn't match the (vbc, vbd) pair.
    return "main: internal error: unexpected token";
  } while (0);

  // Book-keeping after completing a value (whether a container value or a
  // simple value). Empty parent containers are no longer empty. If the parent
  // container is a "{...}" object, toggle between keys and values.
after_value:
  if (depth == 0) {
    return eod;
  }
  switch (ctx) {
    case context::in_list_after_bracket:
      ctx = context::in_list_after_value;
      break;
    case context::in_dict_after_brace:
      ctx = context::in_dict_after_key;
      break;
    case context::in_dict_after_key:
      ctx = context::in_dict_after_value;
      break;
    case context::in_dict_after_value:
      ctx = context::in_dict_after_key;
      break;
  }
  return nullptr;
}

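/*
The after_value book-keeping in handle_token is a small state machine over
ctx. A minimal sketch of just that transition follows. The enumerator names
are the ones used in this file, but their declaration order here and the
next_context function name are assumptions for illustration. States that the
switch in after_value does not list (none and in_list_after_value) are left
unchanged.

```cpp
#include <cassert>

enum class context {
  none,
  in_list_after_bracket,
  in_list_after_value,
  in_dict_after_brace,
  in_dict_after_key,
  in_dict_after_value,
};

// next_context is a hypothetical helper (an assumption for illustration). It
// returns the context that follows a completed value.
context next_context(context ctx) {
  switch (ctx) {
    case context::in_list_after_bracket:
      return context::in_list_after_value;
    case context::in_dict_after_brace:
      return context::in_dict_after_key;
    case context::in_dict_after_key:
      return context::in_dict_after_value;
    case context::in_dict_after_value:
      return context::in_dict_after_key;
    default:
      // An empty list becomes non-empty once; after that nothing changes.
      return ctx;
  }
}
```

Inside a "{...}" object, repeated calls alternate between in_dict_after_key
and in_dict_after_value, which is how the formatter decides whether to emit
":" or "," before the next token.
*/
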
const char*  //
main1(int argc, char** argv) {
  TRY(initialize_globals(argc, argv));

  while (true) {
    wuffs_base__status status = dec.decode_tokens(&tok, &src);

    while (tok.meta.ri < tok.meta.wi) {
      wuffs_base__token t = tok.data.ptr[tok.meta.ri++];
      uint64_t n = t.length();
      if ((src.meta.ri - curr_token_end_src_index) < n) {
        return "main: internal error: inconsistent src indexes";
      }
      curr_token_end_src_index += n;

      // Skip filler tokens (e.g. whitespace).
      if (t.value() == 0) {
        continue;
      }

      const char* z = handle_token(t);
      if (z == nullptr) {
        continue;
      } else if (z == eod) {
        goto end_of_data;
      }
      return z;
    }

    if (status.repr == nullptr) {
      return "main: internal error: unexpected end of token stream";
    } else if (status.repr == wuffs_base__suspension__short_read) {
      if (curr_token_end_src_index != src.meta.ri) {
        return "main: internal error: inconsistent src indexes";
      }
      TRY(read_src());
      curr_token_end_src_index = src.meta.ri;
    } else if (status.repr == wuffs_base__suspension__short_write) {
      tok.compact();
    } else {
      return status.message();
    }
  }
end_of_data:

  // With a non-empty query, don't try to consume trailing whitespace or
  // confirm that we've processed all the tokens.
  if (flags.query_c_string && *flags.query_c_string) {
    return nullptr;
  }

  // Consume an optional whitespace trailer. This isn't part of the JSON spec,
  // but it works better with line oriented Unix tools (such as "echo 123 |
  // jsonptr" where it's "echo", not "echo -n") or hand-edited JSON files which
  // can accidentally contain trailing whitespace.
  //
  // A whitespace trailer is zero or more ' ' and then zero or one '\n'.
  while (true) {
    if (src.meta.ri < src.meta.wi) {
      uint8_t c = src.data.ptr[src.meta.ri];
      if (c == ' ') {
        src.meta.ri++;
        continue;
      } else if (c == '\n') {
        src.meta.ri++;
        break;
      }
      // The "exhausted the input" check below will fail.
      break;
    } else if (src.meta.closed) {
      break;
    }
    TRY(read_src());
  }

  // Check that we've exhausted the input.
  if ((src.meta.ri == src.meta.wi) && !src.meta.closed) {
    TRY(read_src());
  }
  if ((src.meta.ri < src.meta.wi) || !src.meta.closed) {
    return "main: valid JSON followed by further (unexpected) data";
  }

  // Check that we've used all of the decoded tokens, other than trailing
  // filler tokens. For example, a bare `"foo"` string is valid JSON, but even
  // without a trailing '\n', the Wuffs JSON parser emits a filler token for
  // the final '\"'.
  for (; tok.meta.ri < tok.meta.wi; tok.meta.ri++) {
    if (tok.data.ptr[tok.meta.ri].value_base_category() !=
        WUFFS_BASE__TOKEN__VBC__FILLER) {
      return "main: internal error: decoded OK but unprocessed tokens remain";
    }
  }

  return nullptr;
}

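/*
The whitespace trailer rule in main1 (zero or more ' ' and then zero or one
'\n') can be sketched over an in-memory string. This is an illustration only:
skip_whitespace_trailer is a hypothetical name and the std::string input is an
assumption, since main1 works incrementally on the src buffer and may call
read_src between bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// skip_whitespace_trailer is a hypothetical helper (an assumption for
// illustration). It returns the index just past the trailer starting at i.
size_t skip_whitespace_trailer(const std::string& s, size_t i) {
  assert(i <= s.size());
  // Zero or more spaces.
  while ((i < s.size()) && (s[i] == ' ')) {
    i++;
  }
  // Then zero or one newline.
  if ((i < s.size()) && (s[i] == '\n')) {
    i++;
  }
  return i;
}
```

If the returned index is less than s.size(), further data follows the trailer,
which corresponds to the "valid JSON followed by further (unexpected) data"
error above.
*/
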
int  //
compute_exit_code(const char* status_msg) {
  if (!status_msg) {
    return 0;
  }
  size_t n;
  if (status_msg == usage) {
    n = strlen(status_msg);
  } else {
    n = strnlen(status_msg, 2047);
    if (n >= 2047) {
      status_msg = "main: internal error: error message is too long";
      n = strnlen(status_msg, 2047);
    }
  }
  const int stderr_fd = 2;
  ignore_return_value(write(stderr_fd, status_msg, n));
  ignore_return_value(write(stderr_fd, "\n", 1));
  // Return an exit code of 1 for regular (foreseen) errors, e.g. badly
  // formatted or unsupported input.
  //
  // Return an exit code of 2 for internal (exceptional) errors, e.g. defensive
  // run-time checks found that an internal invariant did not hold.
  //
  // Automated testing, including badly formatted inputs, can therefore
  // discriminate between expected failure (exit code 1) and unexpected failure
  // (other non-zero exit codes). Specifically, exit code 2 for internal
  // invariant violation, exit code 139 (which is 128 + SIGSEGV on x86_64
  // Linux) for a segmentation fault (e.g. null pointer dereference).
  return strstr(status_msg, "internal error:") ? 2 : 1;
}

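/*
The exit code convention described in compute_exit_code can be sketched
without the stderr writes. This is an illustration only: classify_status is a
hypothetical name, and it omits the message-length clamp that the real
function applies.

```cpp
#include <cassert>
#include <cstring>

// classify_status is a hypothetical helper (an assumption for illustration).
// It maps a status message to 0 (success), 1 (foreseen error) or 2 (internal
// error).
int classify_status(const char* status_msg) {
  if (!status_msg) {
    return 0;
  }
  return strstr(status_msg, "internal error:") ? 2 : 1;
}
```

For example, "main: no match for query" maps to 1, while "main: internal
error: inconsistent depth" maps to 2.
*/
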
int  //
main(int argc, char** argv) {
  // Look for an input filename (the first non-flag argument) in argv. If there
  // is one, open it (but do not read from it) before we self-impose a sandbox.
  //
  // An argument that starts with "-" is a flag, unless it comes after a bare
  // "--" argument.
  {
    bool dash_dash = false;
    int a;
    for (a = 1; a < argc; a++) {
      char* arg = argv[a];
      if ((arg[0] == '-') && !dash_dash) {
        dash_dash = (arg[1] == '-') && (arg[2] == '\x00');
        continue;
      }
      input_file_descriptor = open(arg, O_RDONLY);
      if (input_file_descriptor < 0) {
        fprintf(stderr, "%s: %s\n", arg, strerror(errno));
        return 1;
      }
      break;
    }
  }

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
  sandboxed = true;
#endif

  const char* z = main1(argc, argv);
  if (wrote_to_dst) {
    const char* z1 = write_dst("\n", 1);
    const char* z2 = flush_dst();
    z = z ? z : (z1 ? z1 : z2);
  }
  int exit_code = compute_exit_code(z);

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  // Call SYS_exit explicitly, instead of calling SYS_exit_group implicitly by
  // either calling _exit or returning from main. SECCOMP_MODE_STRICT allows
  // only SYS_exit.
  syscall(SYS_exit, exit_code);
#endif
  return exit_code;
}
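
/*
The argv scan at the top of main can be sketched as a standalone function that
only locates the first non-flag argument, leaving the open call to the caller.
This is an illustration only: first_non_flag_arg is a hypothetical name, and
returning -1 for "no filename given" is an assumption made for this sketch.

```cpp
#include <cassert>

// first_non_flag_arg is a hypothetical helper (an assumption for
// illustration). It returns the index of the first non-flag argument, or -1
// if there is none.
int first_non_flag_arg(int argc, char** argv) {
  assert(argc >= 0);
  bool dash_dash = false;
  for (int a = 1; a < argc; a++) {
    const char* arg = argv[a];
    if ((arg[0] == '-') && !dash_dash) {
      // A bare "--" means that later arguments are never flags.
      dash_dash = (arg[1] == '-') && (arg[2] == '\x00');
      continue;
    }
    return a;
  }
  return -1;
}
```

Doing this scan (and the open) before the SECCOMP_MODE_STRICT sandbox is
enabled matters because open is not one of the system calls that the strict
mode permits.
*/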