// Copyright 2020 The Wuffs Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//    https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// ----------------

/*
jsonptr is a JSON formatter (pretty-printer) that supports the JSON Pointer
(RFC 6901) query syntax. It reads UTF-8 JSON from stdin and writes
canonicalized, formatted UTF-8 JSON to stdout.

See the "const char* g_usage" string below for details.

----

JSON Pointer (and this program's implementation) is one of many JSON query
languages and JSON tools, such as jq, jql and JMESPath. This one is relatively
simple and has fewer features than those others.

One benefit of simplicity is that this program's JSON and JSON Pointer
implementations do not dynamically allocate or free memory (yet it does not
require that the entire input fits in memory at once). They are therefore
trivially protected against certain bug classes: memory leaks, double-frees and
use-after-frees.

The core JSON implementation is also written in the Wuffs programming language
(and then transpiled to C/C++), which is memory-safe (e.g. array indexing is
bounds-checked) and also guards against integer arithmetic overflows.

For defense in depth, on Linux, this program also self-imposes a
SECCOMP_MODE_STRICT sandbox before reading (or otherwise processing) its input
or writing its output. Under this sandbox, the only permitted system calls are
read, write, exit and sigreturn.

Altogether, this program aims to safely handle untrusted JSON files without
fear of security bugs such as remote code execution.

----

As of 2020-02-24, this program passes all 318 "test_parsing" cases from the
JSON test suite (https://github.com/nst/JSONTestSuite), an appendix to the
"Parsing JSON is a Minefield" article (http://seriot.ch/parsing_json.php) that
was first published on 2016-10-26 and updated on 2018-03-30.

After modifying this program, run "build-example.sh example/jsonptr/" and then
"script/run-json-test-suite.sh" to catch correctness regressions.

----

This program uses Wuffs' JSON decoder at a relatively low level, processing the
decoder's token-stream output individually. The core loop, in pseudo-code, is
"for_each_token { handle_token(etc); }", where the handle_token function
changes global state (e.g. the `g_depth` and `g_ctx` variables) and prints
output text based on that state and the token's source text. Notably,
handle_token is not recursive, even though JSON values can nest.

This approach is centered around JSON tokens. Each JSON 'thing' (e.g. number,
string, object) comprises one or more JSON tokens.
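
As a rough illustration of that non-recursive, token-centered loop, here is a
simplified sketch. It is not this program's actual code: the Token, State,
handle_token and format_tokens names are hypothetical stand-ins for Wuffs'
token stream and this file's globals, and separators such as commas are
ignored. Nesting is tracked by a plain depth counter rather than by recursion.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified token kinds. Wuffs' real tokens carry more detail.
enum class Kind { open, close, scalar };

struct Token {
  Kind kind;
  std::string text;
};

// State plays the role of globals like g_depth in this sketch.
struct State {
  uint32_t depth = 0;
  std::string out;
};

// handle_token updates the state and appends one line of output. It never
// calls itself: nesting is tracked by the depth counter alone.
void handle_token(State& s, const Token& t) {
  if (t.kind == Kind::close) {
    assert(s.depth > 0);  // This sketch assumes balanced input.
    s.depth--;
  }
  s.out.append(2 * s.depth, ' ');
  s.out += t.text;
  s.out += '\n';
  if (t.kind == Kind::open) {
    s.depth++;
  }
}

// format_tokens is the "for_each_token { handle_token(etc); }" loop.
std::string format_tokens(const std::vector<Token>& tokens) {
  State s;
  for (const Token& t : tokens) {
    handle_token(s, t);
  }
  return s.out;
}
```

Because all of the sketch's state lives in one small struct, its memory use
does not grow with the nesting depth of the input.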

An alternative, higher-level approach is in the sibling example/jsonfindptrs
program. Neither approach is better or worse per se, but when studying this
program, be aware that there are multiple ways to use Wuffs' JSON decoder.

The two programs, jsonfindptrs and jsonptr, also demonstrate different
trade-offs with regard to JSON object duplicate keys. The JSON spec permits
different implementations to allow or reject duplicate keys. It is not always
clear which approach is safer. Rejecting them is certainly unambiguous, and
security bugs can lurk in ambiguous corners of a file format, if two different
implementations both silently accept a file but differ on how to interpret it.
On the other hand, in the worst case, detecting duplicate keys requires O(N)
memory, where N is the size of the (potentially untrusted) input.

This program (jsonptr) allows duplicate keys and requires only O(1) memory. As
mentioned above, it doesn't dynamically allocate memory at all, and on Linux,
it runs in a SECCOMP_MODE_STRICT sandbox.
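
To make that memory trade-off concrete, here is a hedged sketch of
duplicate-key detection for a single object. It is not code from either
program: the has_duplicate_key helper is hypothetical, and it assumes the keys
have already been decoded into strings. It has to remember every distinct key
seen so far, which is the O(N) cost that jsonptr avoids by allowing duplicates.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// has_duplicate_key reports whether any key repeats. The set holds every
// distinct key seen so far, so worst-case memory is O(N) in the input size.
bool has_duplicate_key(const std::vector<std::string>& keys) {
  std::unordered_set<std::string> seen;
  for (const std::string& k : keys) {
    if (!seen.insert(k).second) {
      return true;  // The insert failed, so k was already present.
    }
  }
  return false;
}
```

A streaming formatter that never builds such a set can stay at O(1) memory,
at the price of passing duplicate keys through unchanged.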

----

This example program differs from most other example Wuffs programs in that it
is written in C++, not C.

    $CXX jsonptr.cc && ./a.out < ../../test/data/github-tags.json; rm -f a.out

for a C++ compiler $CXX, such as clang++ or g++.
*/

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Wuffs ships as a "single file C library" or "header file library" as per
// https://github.com/nothings/stb/blob/master/docs/stb_howto.txt
//
// To use that single file as a "foo.c"-like implementation, instead of a
// "foo.h"-like header, #define WUFFS_IMPLEMENTATION before #include'ing or
// compiling it.
#define WUFFS_IMPLEMENTATION

// Defining the WUFFS_CONFIG__MODULE* macros is optional, but it lets users of
// release/c/etc.c whitelist which parts of Wuffs to build. That file contains
// the entire Wuffs standard library, implementing a variety of codecs and file
// formats. Without this macro definition, an optimizing compiler or linker may
// very well discard Wuffs code for unused codecs, but listing the Wuffs
// modules we use makes that process explicit. Preprocessing means that such
// code simply isn't compiled.
#define WUFFS_CONFIG__MODULES
#define WUFFS_CONFIG__MODULE__BASE
#define WUFFS_CONFIG__MODULE__JSON

// If building this program in an environment that doesn't easily accommodate
// relative includes, you can use the script/inline-c-relative-includes.go
// program to generate a stand-alone C++ file.
#include "../../release/c/wuffs-unsupported-snapshot.c"

#if defined(__linux__)
#include <linux/prctl.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#define WUFFS_EXAMPLE_USE_SECCOMP
#endif

#define TRY(error_msg)         \
  do {                         \
    const char* z = error_msg; \
    if (z) {                   \
      return z;                \
    }                          \
  } while (false)

static const char* g_eod = "main: end of data";

static const char* g_usage =
    "Usage: jsonptr -flags input.json\n"
    "\n"
    "Flags:\n"
    "    -c      -compact-output\n"
    "    -i=NUM  -indent=NUM\n"
    "    -o=NUM  -max-output-depth=NUM\n"
    "    -q=STR  -query=STR\n"
    "    -s      -strict-json-pointer-syntax\n"
    "    -t      -tabs\n"
    "            -fail-if-unsandboxed\n"
    "\n"
    "The input.json filename is optional. If absent, it reads from stdin.\n"
    "\n"
    "----\n"
    "\n"
    "jsonptr is a JSON formatter (pretty-printer) that supports the JSON\n"
    "Pointer (RFC 6901) query syntax. It reads UTF-8 JSON from stdin and\n"
    "writes canonicalized, formatted UTF-8 JSON to stdout.\n"
    "\n"
    "Canonicalized means that e.g. \"abc\\u000A\\tx\\u0177z\" is re-written\n"
    "as \"abc\\n\\txŷz\". It does not sort object keys, nor does it reject\n"
    "duplicate keys. Canonicalization does not imply Unicode normalization.\n"
    "\n"
    "Formatted means that arrays' and objects' elements are indented, each\n"
    "on its own line. Configure this with the -c / -compact-output, -i=NUM /\n"
    "-indent=NUM (for NUM ranging from 0 to 8) and -t / -tabs flags.\n"
    "\n"
    "----\n"
    "\n"
    "The -q=STR or -query=STR flag gives an optional JSON Pointer query, to\n"
    "print a subset of the input. For example, given RFC 6901 section 5's\n"
    "sample input (https://tools.ietf.org/rfc/rfc6901.txt), this command:\n"
    "    jsonptr -query=/foo/1 rfc-6901-json-pointer.json\n"
    "will print:\n"
    "    \"baz\"\n"
    "\n"
    "An absent query is equivalent to the empty query, which identifies the\n"
    "entire input (the root value). Unlike a file system, the \"/\" query\n"
    "does not identify the root. Instead, \"\" is the root and \"/\" is the\n"
    "child (the value in a key-value pair) of the root whose key is the empty\n"
    "string. Similarly, \"/xyz\" and \"/xyz/\" are two different nodes.\n"
    "\n"
    "If the query found a valid JSON value, this program will return a zero\n"
    "exit code even if the rest of the input isn't valid JSON. If the query\n"
    "did not find a value, or found an invalid one, this program returns a\n"
    "non-zero exit code, but may still print partial output to stdout.\n"
    "\n"
    "The JSON specification (https://json.org/) permits implementations that\n"
    "allow duplicate keys, as this one does. This JSON Pointer implementation\n"
    "is also greedy, following the first match for each fragment without\n"
    "back-tracking. For example, the \"/foo/bar\" query will fail if the root\n"
    "object has multiple \"foo\" children but the first one doesn't have a\n"
    "\"bar\" child, even if later ones do.\n"
    "\n"
    "The -s or -strict-json-pointer-syntax flag restricts the -query=STR\n"
    "string to exactly RFC 6901, with only two escape sequences: \"~0\" and\n"
    "\"~1\" for \"~\" and \"/\". Without this flag, this program also lets\n"
    "\"~n\" and \"~r\" escape the New Line and Carriage Return ASCII control\n"
    "characters, which can work better with line oriented Unix tools that\n"
    "assume exactly one value (i.e. one JSON Pointer string) per line.\n"
    "\n"
    "----\n"
    "\n"
    "The -o=NUM or -max-output-depth=NUM flag gives the maximum (inclusive)\n"
    "output depth. JSON containers ([] arrays and {} objects) can hold other\n"
    "containers. When this flag is set, containers at depth NUM are replaced\n"
    "with \"[…]\" or \"{…}\". A bare -o or -max-output-depth is equivalent to\n"
    "-o=1. The flag's absence is equivalent to an unlimited output depth.\n"
    "\n"
    "The -max-output-depth flag only affects the program's output. It doesn't\n"
    "affect whether or not the input is considered valid JSON. The JSON\n"
    "specification permits implementations to set their own maximum input\n"
    "depth. This JSON implementation sets it to 1024.\n"
    "\n"
    "Depth is measured in terms of nested containers. It is unaffected by the\n"
    "number of spaces or tabs used to indent.\n"
    "\n"
    "When both -max-output-depth and -query are set, the output depth is\n"
    "measured from when the query resolves, not from the input root. The\n"
    "input depth (measured from the root) is still limited to 1024.\n"
    "\n"
    "----\n"
    "\n"
    "The -fail-if-unsandboxed flag causes the program to exit if it does not\n"
    "self-impose a sandbox. On Linux, it self-imposes a SECCOMP_MODE_STRICT\n"
    "sandbox, regardless of whether this flag was set.";

// ----

// Wuffs allows either statically or dynamically allocated work buffers. This
// program exercises static allocation.
#define WORK_BUFFER_ARRAY_SIZE \
  WUFFS_JSON__DECODER_WORKBUF_LEN_MAX_INCL_WORST_CASE
#if WORK_BUFFER_ARRAY_SIZE > 0
uint8_t g_work_buffer_array[WORK_BUFFER_ARRAY_SIZE];
#else
// Not all C/C++ compilers support 0-length arrays.
uint8_t g_work_buffer_array[1];
#endif

bool g_sandboxed = false;

int g_input_file_descriptor = 0;  // A 0 default means stdin.

#define MAX_INDENT 8
#define INDENT_SPACES_STRING "        "
#define INDENT_TAB_STRING "\t"

#ifndef DST_BUFFER_ARRAY_SIZE
#define DST_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef SRC_BUFFER_ARRAY_SIZE
#define SRC_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef TOKEN_BUFFER_ARRAY_SIZE
#define TOKEN_BUFFER_ARRAY_SIZE (4 * 1024)
#endif

uint8_t g_dst_array[DST_BUFFER_ARRAY_SIZE];
uint8_t g_src_array[SRC_BUFFER_ARRAY_SIZE];
wuffs_base__token g_tok_array[TOKEN_BUFFER_ARRAY_SIZE];

wuffs_base__io_buffer g_dst;
wuffs_base__io_buffer g_src;
wuffs_base__token_buffer g_tok;

// g_curr_token_end_src_index is the g_src.data.ptr index of the end of the
// current token. An invariant is that (g_curr_token_end_src_index <=
// g_src.meta.ri).
size_t g_curr_token_end_src_index;

uint32_t g_depth;

enum class context {
  none,
  in_list_after_bracket,
  in_list_after_value,
  in_dict_after_brace,
  in_dict_after_key,
  in_dict_after_value,
} g_ctx;

bool  //
in_dict_before_key() {
  return (g_ctx == context::in_dict_after_brace) ||
         (g_ctx == context::in_dict_after_value);
}

uint32_t g_suppress_write_dst;
bool g_wrote_to_dst;

wuffs_json__decoder g_dec;

// ----

// Query is a JSON Pointer query. After initializing with a NUL-terminated C
303// data from stdin. For example, letting "$" denote a NUL, suppose that we
304// started with a query string of "/apple/banana/12/durian" and are currently
Nigel Taob48ee752020-03-13 09:27:33 +1100305// trying to match the second fragment, "banana", so that Query::m_depth is 2:
Nigel Tao0cd2f982020-03-03 23:03:02 +1100306//
307// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
308// / a p p l e / b a n a n a / 1 2 / d u r i a n $
309// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
310// ^ ^
Nigel Taob48ee752020-03-13 09:27:33 +1100311// m_frag_i m_frag_k
Nigel Tao0cd2f982020-03-03 23:03:02 +1100312//
Nigel Taob48ee752020-03-13 09:27:33 +1100313// The two pointers m_frag_i and m_frag_k (abbreviated as mfi and mfk) are the
314// start (inclusive) and end (exclusive) of the query fragment. They satisfy
315// (mfi <= mfk) and may be equal if the fragment empty (note that "" is a valid
316// JSON object key).
Nigel Tao0cd2f982020-03-03 23:03:02 +1100317//
Nigel Taob48ee752020-03-13 09:27:33 +1100318// The m_frag_j (mfj) pointer moves between these two, or is nullptr. An
319// invariant is that (((mfi <= mfj) && (mfj <= mfk)) || (mfj == nullptr)).
Nigel Tao0cd2f982020-03-03 23:03:02 +1100320//
321// Wuffs' JSON tokenizer can portray a single JSON string as multiple Wuffs
322// tokens, as backslash-escaped values within that JSON string may each get
323// their own token.
324//
Nigel Taob48ee752020-03-13 09:27:33 +1100325// At the start of each object key (a JSON string), mfj is set to mfi.
Nigel Tao0cd2f982020-03-03 23:03:02 +1100326//
Nigel Taob48ee752020-03-13 09:27:33 +1100327// While mfj remains non-nullptr, each token's unescaped contents are then
328// compared to that part of the fragment from mfj to mfk. If it is a prefix
329// (including the case of an exact match), then mfj is advanced by the
330// unescaped length. Otherwise, mfj is set to nullptr.
Nigel Tao0cd2f982020-03-03 23:03:02 +1100331//
332// Comparison accounts for JSON Pointer's escaping notation: "~0" and "~1" in
333// the query (not the JSON value) are unescaped to "~" and "/" respectively.
Nigel Taob48ee752020-03-13 09:27:33 +1100334// "~n" and "~r" are also unescaped to "\n" and "\r". The program is
335// responsible for calling Query::validate (with a strict_json_pointer_syntax
336// argument) before otherwise using this class.
Nigel Tao0cd2f982020-03-03 23:03:02 +1100337//
Nigel Taob48ee752020-03-13 09:27:33 +1100338// The mfj pointer therefore advances from mfi to mfk, or drops out, as we
339// incrementally match the object key with the query fragment. For example, if
340// we have already matched the "ban" of "banana", then we would accept any of
341// an "ana" token, an "a" token or a "\u0061" token, amongst others. They would
342// advance mfj by 3, 1 or 1 bytes respectively.
Nigel Tao0cd2f982020-03-03 23:03:02 +1100343//
Nigel Taob48ee752020-03-13 09:27:33 +1100344// mfj
Nigel Tao0cd2f982020-03-03 23:03:02 +1100345// v
346// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
347// / a p p l e / b a n a n a / 1 2 / d u r i a n $
348// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
349// ^ ^
Nigel Taob48ee752020-03-13 09:27:33 +1100350// mfi mfk
Nigel Tao0cd2f982020-03-03 23:03:02 +1100351//
352// At the end of each object key (or equivalently, at the start of each object
Nigel Taob48ee752020-03-13 09:27:33 +1100353// value), if mfj is non-nullptr and equal to (but not less than) mfk then we
354// have a fragment match: the query fragment equals the object key. If there is
355// a next fragment (in this example, "12") we move the frag_etc pointers to its
356// start and end and increment Query::m_depth. Otherwise, we have matched the
357// complete query, and the upcoming JSON value is the result of that query.
Nigel Tao0cd2f982020-03-03 23:03:02 +1100358//
359// The discussion above centers on object keys. If the query fragment is
360// numeric then it can also match as an array index: the string fragment "12"
361// will match an array's 13th element (starting counting from zero). See RFC
362// 6901 for its precise definition of an "array index" number.
363//
Nigel Taob48ee752020-03-13 09:27:33 +1100364// Array index fragment match is represented by the Query::m_array_index field,
Nigel Tao0cd2f982020-03-03 23:03:02 +1100365// whose type (wuffs_base__result_u64) is a result type. An error result means
366// that the fragment is not an array index. A value result holds the number of
367// list elements remaining. When matching a query fragment in an array (instead
368// of in an object), each element ticks this number down towards zero. At zero,
369// the upcoming JSON value is the one that matches the query fragment.
370class Query {
 private:
  uint8_t* m_frag_i;
  uint8_t* m_frag_j;
  uint8_t* m_frag_k;

  uint32_t m_depth;

  wuffs_base__result_u64 m_array_index;

 public:
  void reset(char* query_c_string) {
    m_frag_i = (uint8_t*)query_c_string;
    m_frag_j = (uint8_t*)query_c_string;
    m_frag_k = (uint8_t*)query_c_string;
    m_depth = 0;
    m_array_index.status.repr = "#main: not an array index query fragment";
    m_array_index.value = 0;
  }

  void restart_fragment(bool enable) { m_frag_j = enable ? m_frag_i : nullptr; }

  bool is_at(uint32_t depth) { return m_depth == depth; }

  // tick returns whether the fragment is a valid array index whose value is
  // zero. If valid but non-zero, it decrements it and returns false.
  bool tick() {
    if (m_array_index.status.is_ok()) {
      if (m_array_index.value == 0) {
        return true;
      }
      m_array_index.value--;
    }
    return false;
  }

  // next_fragment moves to the next fragment, returning whether it existed.
  bool next_fragment() {
    uint8_t* k = m_frag_k;
    uint32_t d = m_depth;

    this->reset(nullptr);

    if (!k || (*k != '/')) {
      return false;
    }
    k++;

    bool all_digits = true;
    uint8_t* i = k;
    while ((*k != '\x00') && (*k != '/')) {
      all_digits = all_digits && ('0' <= *k) && (*k <= '9');
      k++;
    }
    m_frag_i = i;
    m_frag_j = i;
    m_frag_k = k;
    m_depth = d + 1;
    if (all_digits) {
      // wuffs_base__parse_number_u64 rejects leading zeroes, e.g. "00", "07".
      m_array_index =
          wuffs_base__parse_number_u64(wuffs_base__make_slice_u8(i, k - i));
    }
    return true;
  }

  bool matched_all() { return m_frag_k == nullptr; }

  bool matched_fragment() { return m_frag_j && (m_frag_j == m_frag_k); }

  void incremental_match_slice(uint8_t* ptr, size_t len) {
    if (!m_frag_j) {
      return;
    }
    uint8_t* j = m_frag_j;
    while (true) {
      if (len == 0) {
        m_frag_j = j;
        return;
      }

      if (*j == '\x00') {
        break;

      } else if (*j == '~') {
        j++;
        if (*j == '0') {
          if (*ptr != '~') {
            break;
          }
        } else if (*j == '1') {
          if (*ptr != '/') {
            break;
          }
        } else if (*j == 'n') {
          if (*ptr != '\n') {
            break;
          }
        } else if (*j == 'r') {
          if (*ptr != '\r') {
            break;
          }
        } else {
          break;
        }

      } else if (*j != *ptr) {
        break;
      }

      j++;
      ptr++;
      len--;
    }
    m_frag_j = nullptr;
  }

  void incremental_match_code_point(uint32_t code_point) {
    if (!m_frag_j) {
      return;
    }
    uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
    size_t n = wuffs_base__utf_8__encode(
        wuffs_base__make_slice_u8(&u[0],
                                  WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
        code_point);
    if (n > 0) {
      this->incremental_match_slice(&u[0], n);
    }
  }

  // validate returns whether the (ptr, len) arguments form a valid JSON
  // Pointer. In particular, it must be valid UTF-8, and either be empty or
  // start with a '/'. Any '~' within must immediately be followed by either
  // '0' or '1'. If strict_json_pointer_syntax is false, a '~' may also be
  // followed by either 'n' or 'r'.
  static bool validate(char* query_c_string,
                       size_t length,
                       bool strict_json_pointer_syntax) {
    if (length <= 0) {
      return true;
    }
    if (query_c_string[0] != '/') {
      return false;
    }
    wuffs_base__slice_u8 s =
        wuffs_base__make_slice_u8((uint8_t*)query_c_string, length);
    bool previous_was_tilde = false;
    while (s.len > 0) {
      wuffs_base__utf_8__next__output o = wuffs_base__utf_8__next(s);
      if (!o.is_valid()) {
        return false;
      }

      if (previous_was_tilde) {
        switch (o.code_point) {
          case '0':
          case '1':
            break;
          case 'n':
          case 'r':
            if (strict_json_pointer_syntax) {
              return false;
            }
            break;
          default:
            return false;
        }
      }
      previous_was_tilde = o.code_point == '~';

      s.ptr += o.byte_length;
      s.len -= o.byte_length;
    }
    return !previous_was_tilde;
  }
} g_query;

// ----

struct {
  int remaining_argc;
  char** remaining_argv;

  bool compact_output;
  bool fail_if_unsandboxed;
  size_t indent;
  uint32_t max_output_depth;
  char* query_c_string;
  bool strict_json_pointer_syntax;
  bool tabs;
} g_flags = {0};

const char*  //
parse_flags(int argc, char** argv) {
  g_flags.indent = 4;
  g_flags.max_output_depth = 0xFFFFFFFF;

  int c = (argc > 0) ? 1 : 0;  // Skip argv[0], the program name.
  for (; c < argc; c++) {
    char* arg = argv[c];
    if (*arg++ != '-') {
      break;
    }

    // A double-dash "--foo" is equivalent to a single-dash "-foo". As special
    // cases, a bare "-" is not a flag (some programs may interpret it as
    // stdin) and a bare "--" means to stop parsing flags.
    if (*arg == '\x00') {
      break;
    } else if (*arg == '-') {
      arg++;
      if (*arg == '\x00') {
        c++;
        break;
      }
    }

    if (!strcmp(arg, "c") || !strcmp(arg, "compact-output")) {
      g_flags.compact_output = true;
      continue;
    }
    if (!strcmp(arg, "fail-if-unsandboxed")) {
      g_flags.fail_if_unsandboxed = true;
      continue;
    }
    if (!strncmp(arg, "i=", 2) || !strncmp(arg, "indent=", 7)) {
      while (*arg++ != '=') {
      }
      if (('0' <= arg[0]) && (arg[0] <= '8') && (arg[1] == '\x00')) {
        g_flags.indent = arg[0] - '0';
        continue;
      }
      return g_usage;
    }
    if (!strcmp(arg, "o") || !strcmp(arg, "max-output-depth")) {
      g_flags.max_output_depth = 1;
      continue;
    } else if (!strncmp(arg, "o=", 2) ||
               !strncmp(arg, "max-output-depth=", 17)) {
      while (*arg++ != '=') {
      }
      wuffs_base__result_u64 u = wuffs_base__parse_number_u64(
          wuffs_base__make_slice_u8((uint8_t*)arg, strlen(arg)));
      if (wuffs_base__status__is_ok(&u.status) && (u.value <= 0xFFFFFFFF)) {
        g_flags.max_output_depth = (uint32_t)(u.value);
        continue;
      }
      return g_usage;
    }
    if (!strncmp(arg, "q=", 2) || !strncmp(arg, "query=", 6)) {
      while (*arg++ != '=') {
      }
      g_flags.query_c_string = arg;
      continue;
    }
    if (!strcmp(arg, "s") || !strcmp(arg, "strict-json-pointer-syntax")) {
      g_flags.strict_json_pointer_syntax = true;
      continue;
    }
    if (!strcmp(arg, "t") || !strcmp(arg, "tabs")) {
      g_flags.tabs = true;
      continue;
    }

    return g_usage;
  }

  if (g_flags.query_c_string &&
      !Query::validate(g_flags.query_c_string, strlen(g_flags.query_c_string),
                       g_flags.strict_json_pointer_syntax)) {
    return "main: bad JSON Pointer (RFC 6901) syntax for the -query=STR flag";
  }

  g_flags.remaining_argc = argc - c;
  g_flags.remaining_argv = argv + c;
  return nullptr;
}

const char*  //
650initialize_globals(int argc, char** argv) {
Nigel Taod60815c2020-03-26 14:32:35 +1100651 g_dst = wuffs_base__make_io_buffer(
652 wuffs_base__make_slice_u8(g_dst_array, DST_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100653 wuffs_base__empty_io_buffer_meta());
Nigel Tao1b073492020-02-16 22:11:36 +1100654
Nigel Taod60815c2020-03-26 14:32:35 +1100655 g_src = wuffs_base__make_io_buffer(
656 wuffs_base__make_slice_u8(g_src_array, SRC_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100657 wuffs_base__empty_io_buffer_meta());
658
Nigel Taod60815c2020-03-26 14:32:35 +1100659 g_tok = wuffs_base__make_token_buffer(
660 wuffs_base__make_slice_token(g_tok_array, TOKEN_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100661 wuffs_base__empty_token_buffer_meta());
662
Nigel Taod60815c2020-03-26 14:32:35 +1100663 g_curr_token_end_src_index = 0;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100664
Nigel Taod60815c2020-03-26 14:32:35 +1100665 g_depth = 0;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100666
Nigel Taod60815c2020-03-26 14:32:35 +1100667 g_ctx = context::none;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100668
Nigel Tao68920952020-03-03 11:25:18 +1100669 TRY(parse_flags(argc, argv));
Nigel Taod60815c2020-03-26 14:32:35 +1100670 if (g_flags.fail_if_unsandboxed && !g_sandboxed) {
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100671 return "main: unsandboxed";
672 }
Nigel Tao01abc842020-03-06 21:42:33 +1100673 const int stdin_fd = 0;
Nigel Taod60815c2020-03-26 14:32:35 +1100674 if (g_flags.remaining_argc >
675 ((g_input_file_descriptor != stdin_fd) ? 1 : 0)) {
676 return g_usage;
Nigel Tao107f0ef2020-03-01 21:35:02 +1100677 }
678
Nigel Taod60815c2020-03-26 14:32:35 +1100679 g_query.reset(g_flags.query_c_string);
Nigel Tao0cd2f982020-03-03 23:03:02 +1100680
681 // If the query is non-empty, suprress writing to stdout until we've
682 // completed the query.
Nigel Taod60815c2020-03-26 14:32:35 +1100683 g_suppress_write_dst = g_query.next_fragment() ? 1 : 0;
684 g_wrote_to_dst = false;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100685
Nigel Taod60815c2020-03-26 14:32:35 +1100686 TRY(g_dec.initialize(sizeof__wuffs_json__decoder(), WUFFS_VERSION, 0)
Nigel Tao4b186b02020-03-18 14:25:21 +1100687 .message());
688
689 // Consume an optional whitespace trailer. This isn't part of the JSON spec,
690 // but it works better with line oriented Unix tools (such as "echo 123 |
691 // jsonptr" where it's "echo", not "echo -n") or hand-edited JSON files which
692 // can accidentally contain trailing whitespace.
Nigel Taod60815c2020-03-26 14:32:35 +1100693 g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE, true);
Nigel Tao4b186b02020-03-18 14:25:21 +1100694
695 return nullptr;
}

// ----

// ignore_return_value suppresses errors from -Wall -Werror.
static void  //
ignore_return_value(int ignored) {}

const char*  //
read_src() {
  if (g_src.meta.closed) {
    return "main: internal error: read requested on a closed source";
  }
  g_src.compact();
  if (g_src.meta.wi >= g_src.data.len) {
    return "main: g_src buffer is full";
  }
  while (true) {
    ssize_t n = read(g_input_file_descriptor, g_src.data.ptr + g_src.meta.wi,
                     g_src.data.len - g_src.meta.wi);
    if (n >= 0) {
      g_src.meta.wi += n;
      g_src.meta.closed = n == 0;
      break;
    } else if (errno != EINTR) {
      return strerror(errno);
    }
  }
  return nullptr;
}

const char*  //
flush_dst() {
  while (true) {
    size_t n = g_dst.meta.wi - g_dst.meta.ri;
    if (n == 0) {
      break;
    }
    const int stdout_fd = 1;
    ssize_t i = write(stdout_fd, g_dst.data.ptr + g_dst.meta.ri, n);
    if (i >= 0) {
      g_dst.meta.ri += i;
    } else if (errno != EINTR) {
      return strerror(errno);
    }
  }
  g_dst.compact();
  return nullptr;
}

const char*  //
write_dst(const void* s, size_t n) {
  if (g_suppress_write_dst > 0) {
    return nullptr;
  }
  const uint8_t* p = static_cast<const uint8_t*>(s);
  while (n > 0) {
    size_t i = g_dst.writer_available();
    if (i == 0) {
      const char* z = flush_dst();
      if (z) {
        return z;
      }
      i = g_dst.writer_available();
      if (i == 0) {
        return "main: g_dst buffer is full";
      }
    }

    if (i > n) {
      i = n;
    }
    memcpy(g_dst.data.ptr + g_dst.meta.wi, p, i);
    g_dst.meta.wi += i;
    p += i;
    n -= i;
    g_wrote_to_dst = true;
  }
  return nullptr;
}

// ----

uint8_t  //
hex_digit(uint8_t nibble) {
  nibble &= 0x0F;
  if (nibble <= 9) {
    return '0' + nibble;
  }
  return ('A' - 10) + nibble;
}

const char*  //
handle_unicode_code_point(uint32_t ucp) {
  if (ucp < 0x0020) {
    switch (ucp) {
      case '\b':
        return write_dst("\\b", 2);
      case '\f':
        return write_dst("\\f", 2);
      case '\n':
        return write_dst("\\n", 2);
      case '\r':
        return write_dst("\\r", 2);
      case '\t':
        return write_dst("\\t", 2);
      default: {
        // Other bytes less than 0x0020 are valid UTF-8 but not valid in a
        // JSON string. They need to remain escaped.
        uint8_t esc6[6];
        esc6[0] = '\\';
        esc6[1] = 'u';
        esc6[2] = '0';
        esc6[3] = '0';
        esc6[4] = hex_digit(ucp >> 4);
        esc6[5] = hex_digit(ucp >> 0);
        return write_dst(&esc6[0], 6);
      }
    }

  } else if (ucp == '\"') {
    return write_dst("\\\"", 2);

  } else if (ucp == '\\') {
    return write_dst("\\\\", 2);

  } else {
    uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
    size_t n = wuffs_base__utf_8__encode(
        wuffs_base__make_slice_u8(&u[0],
                                  WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
        ucp);
    if (n > 0) {
      return write_dst(&u[0], n);
    }
  }

  return "main: internal error: unexpected Unicode code point";
}

const char*  //
handle_token(wuffs_base__token t) {
  do {
    uint64_t vbc = t.value_base_category();
    uint64_t vbd = t.value_base_detail();
    uint64_t len = t.length();

    // Handle ']' or '}'.
    if ((vbc == WUFFS_BASE__TOKEN__VBC__STRUCTURE) &&
        (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__POP)) {
      if (g_query.is_at(g_depth)) {
        return "main: no match for query";
      }
      if (g_depth <= 0) {
        return "main: internal error: inconsistent g_depth";
      }
      g_depth--;

      if (g_query.matched_all() && (g_depth >= g_flags.max_output_depth)) {
        g_suppress_write_dst--;
        // '…' is U+2026 HORIZONTAL ELLIPSIS, which is 3 UTF-8 bytes.
        TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST)
                          ? "\"[…]\""
                          : "\"{…}\"",
                      7));
      } else {
        // Write preceding whitespace.
        if ((g_ctx != context::in_list_after_bracket) &&
            (g_ctx != context::in_dict_after_brace) &&
            !g_flags.compact_output) {
          TRY(write_dst("\n", 1));
          for (uint32_t i = 0; i < g_depth; i++) {
            TRY(write_dst(
                g_flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
                g_flags.tabs ? 1 : g_flags.indent));
          }
        }

        TRY(write_dst(
            (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST) ? "]" : "}",
            1));
      }

      g_ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                  ? context::in_list_after_value
                  : context::in_dict_after_key;
      goto after_value;
    }

    // Write preceding whitespace and punctuation, if it wasn't ']', '}' or a
    // continuation of a multi-token chain.
    if (!t.link_prev()) {
      if (g_ctx == context::in_dict_after_key) {
        TRY(write_dst(": ", g_flags.compact_output ? 1 : 2));
      } else if (g_ctx != context::none) {
        if ((g_ctx != context::in_list_after_bracket) &&
            (g_ctx != context::in_dict_after_brace)) {
          TRY(write_dst(",", 1));
        }
        if (!g_flags.compact_output) {
          TRY(write_dst("\n", 1));
          for (size_t i = 0; i < g_depth; i++) {
            TRY(write_dst(
                g_flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
                g_flags.tabs ? 1 : g_flags.indent));
          }
        }
      }

      bool query_matched_fragment = false;
      if (g_query.is_at(g_depth)) {
        switch (g_ctx) {
          case context::in_list_after_bracket:
          case context::in_list_after_value:
            query_matched_fragment = g_query.tick();
            break;
          case context::in_dict_after_key:
            query_matched_fragment = g_query.matched_fragment();
            break;
          default:
            break;
        }
      }
      if (!query_matched_fragment) {
        // No-op.
      } else if (!g_query.next_fragment()) {
        // There is no next fragment. We have matched the complete query, and
        // the upcoming JSON value is the result of that query.
        //
        // Un-suppress writing to stdout and reset the g_ctx and g_depth as if
        // we were about to decode a top-level value. This makes any subsequent
        // indentation be relative to this point, and we will return g_eod
        // after the upcoming JSON value is complete.
        if (g_suppress_write_dst != 1) {
          return "main: internal error: inconsistent g_suppress_write_dst";
        }
        g_suppress_write_dst = 0;
        g_ctx = context::none;
        g_depth = 0;
      } else if ((vbc != WUFFS_BASE__TOKEN__VBC__STRUCTURE) ||
                 !(vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__PUSH)) {
        // The query has moved on to the next fragment but the upcoming JSON
        // value is not a container.
        return "main: no match for query";
      }
    }

    // Handle the token itself: either a container ('[' or '{') or a simple
    // value: string (a chain of raw or escaped parts), literal or number.
    switch (vbc) {
      case WUFFS_BASE__TOKEN__VBC__STRUCTURE:
        if (g_query.matched_all() && (g_depth >= g_flags.max_output_depth)) {
          g_suppress_write_dst++;
        } else {
          TRY(write_dst(
              (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST) ? "[" : "{",
              1));
        }
        g_depth++;
        g_ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                    ? context::in_list_after_bracket
                    : context::in_dict_after_brace;
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__STRING:
        if (!t.link_prev()) {
          TRY(write_dst("\"", 1));
          g_query.restart_fragment(in_dict_before_key() &&
                                   g_query.is_at(g_depth));
        }

        if (vbd & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_0_DST_1_SRC_DROP) {
          // No-op.
        } else if (vbd &
                   WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
          uint8_t* ptr = g_src.data.ptr + g_curr_token_end_src_index - len;
          TRY(write_dst(ptr, len));
          g_query.incremental_match_slice(ptr, len);
        } else {
          return "main: internal error: unexpected string-token conversion";
        }

        if (t.link_next()) {
          return nullptr;
        }
        TRY(write_dst("\"", 1));
        goto after_value;

      case WUFFS_BASE__TOKEN__VBC__UNICODE_CODE_POINT:
        if (!t.link_prev() || !t.link_next()) {
          return "main: internal error: unexpected unlinked token";
        }
        TRY(handle_unicode_code_point(vbd));
        g_query.incremental_match_code_point(vbd);
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__LITERAL:
      case WUFFS_BASE__TOKEN__VBC__NUMBER:
        TRY(write_dst(g_src.data.ptr + g_curr_token_end_src_index - len, len));
        goto after_value;
    }

    // Return an error if we didn't match the (vbc, vbd) pair.
    return "main: internal error: unexpected token";
  } while (0);

  // Book-keeping after completing a value (whether a container value or a
  // simple value). Empty parent containers are no longer empty. If the parent
  // container is a "{...}" object, toggle between keys and values.
after_value:
  if (g_depth == 0) {
    return g_eod;
  }
  switch (g_ctx) {
    case context::in_list_after_bracket:
      g_ctx = context::in_list_after_value;
      break;
    case context::in_dict_after_brace:
      g_ctx = context::in_dict_after_key;
      break;
    case context::in_dict_after_key:
      g_ctx = context::in_dict_after_value;
      break;
    case context::in_dict_after_value:
      g_ctx = context::in_dict_after_key;
      break;
    default:
      break;
  }
  return nullptr;
}

const char*  //
main1(int argc, char** argv) {
  TRY(initialize_globals(argc, argv));

  while (true) {
    wuffs_base__status status = g_dec.decode_tokens(
        &g_tok, &g_src,
        wuffs_base__make_slice_u8(g_work_buffer_array, WORK_BUFFER_ARRAY_SIZE));

    while (g_tok.meta.ri < g_tok.meta.wi) {
      wuffs_base__token t = g_tok.data.ptr[g_tok.meta.ri++];
      uint64_t n = t.length();
      if ((g_src.meta.ri - g_curr_token_end_src_index) < n) {
        return "main: internal error: inconsistent g_src indexes";
      }
      g_curr_token_end_src_index += n;

      // Skip filler tokens (e.g. whitespace).
      if (t.value() == 0) {
        continue;
      }

      const char* z = handle_token(t);
      if (z == nullptr) {
        continue;
      } else if (z == g_eod) {
        goto end_of_data;
      }
      return z;
    }

    if (status.repr == nullptr) {
      return "main: internal error: unexpected end of token stream";
    } else if (status.repr == wuffs_base__suspension__short_read) {
      if (g_curr_token_end_src_index != g_src.meta.ri) {
        return "main: internal error: inconsistent g_src indexes";
      }
      TRY(read_src());
      g_curr_token_end_src_index = g_src.meta.ri;
    } else if (status.repr == wuffs_base__suspension__short_write) {
      g_tok.compact();
    } else {
      return status.message();
    }
  }
end_of_data:

  // With a non-empty g_query, don't try to consume trailing whitespace or
  // confirm that we've processed all the tokens.
  if (g_flags.query_c_string && *g_flags.query_c_string) {
    return nullptr;
  }

  // Check that we've exhausted the input.
  if ((g_src.meta.ri == g_src.meta.wi) && !g_src.meta.closed) {
    TRY(read_src());
  }
  if ((g_src.meta.ri < g_src.meta.wi) || !g_src.meta.closed) {
    return "main: valid JSON followed by further (unexpected) data";
  }

  // Check that we've used all of the decoded tokens, other than trailing
  // filler tokens. For example, "true\n" is valid JSON (and fully consumed
  // with WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE enabled) with a trailing
  // filler token for the "\n".
  for (; g_tok.meta.ri < g_tok.meta.wi; g_tok.meta.ri++) {
    if (g_tok.data.ptr[g_tok.meta.ri].value_base_category() !=
        WUFFS_BASE__TOKEN__VBC__FILLER) {
      return "main: internal error: decoded OK but unprocessed tokens remain";
    }
  }

  return nullptr;
}

int  //
compute_exit_code(const char* status_msg) {
  if (!status_msg) {
    return 0;
  }
  size_t n;
  if (status_msg == g_usage) {
    n = strlen(status_msg);
  } else {
    n = strnlen(status_msg, 2047);
    if (n >= 2047) {
      status_msg = "main: internal error: error message is too long";
      n = strnlen(status_msg, 2047);
    }
  }
  const int stderr_fd = 2;
  ignore_return_value(write(stderr_fd, status_msg, n));
  ignore_return_value(write(stderr_fd, "\n", 1));
  // Return an exit code of 1 for regular (foreseen) errors, e.g. badly
  // formatted or unsupported input.
  //
  // Return an exit code of 2 for internal (exceptional) errors, e.g. defensive
  // run-time checks found that an internal invariant did not hold.
  //
  // Automated testing, including badly formatted inputs, can therefore
  // discriminate between expected failure (exit code 1) and unexpected failure
  // (other non-zero exit codes). Specifically, exit code 2 for internal
  // invariant violation, exit code 139 (which is 128 + SIGSEGV on x86_64
  // Linux) for a segmentation fault (e.g. null pointer dereference).
  return strstr(status_msg, "internal error:") ? 2 : 1;
}

int  //
main(int argc, char** argv) {
  // Look for an input filename (the first non-flag argument) in argv. If there
  // is one, open it (but do not read from it) before we self-impose a sandbox.
  //
  // Flags start with "-", unless they come after a bare "--" arg.
  {
    bool dash_dash = false;
    int a;
    for (a = 1; a < argc; a++) {
      char* arg = argv[a];
      if ((arg[0] == '-') && !dash_dash) {
        dash_dash = (arg[1] == '-') && (arg[2] == '\x00');
        continue;
      }
      g_input_file_descriptor = open(arg, O_RDONLY);
      if (g_input_file_descriptor < 0) {
        fprintf(stderr, "%s: %s\n", arg, strerror(errno));
        return 1;
      }
      break;
    }
  }

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
  g_sandboxed = true;
#endif

  const char* z = main1(argc, argv);
  if (g_wrote_to_dst) {
    const char* z1 = write_dst("\n", 1);
    const char* z2 = flush_dst();
    z = z ? z : (z1 ? z1 : z2);
  }
  int exit_code = compute_exit_code(z);

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  // Call SYS_exit explicitly, instead of calling SYS_exit_group implicitly by
  // either calling _exit or returning from main. SECCOMP_MODE_STRICT allows
  // only SYS_exit.
  syscall(SYS_exit, exit_code);
#endif
  return exit_code;
}