// Copyright 2020 The Wuffs Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//    https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// ----------------

/*
jsonptr is a JSON formatter (pretty-printer) that supports the JSON Pointer
(RFC 6901) query syntax. It reads UTF-8 JSON from stdin and writes
canonicalized, formatted UTF-8 JSON to stdout.

See the "const char* g_usage" string below for details.

----

JSON Pointer (and this program's implementation) is one of many JSON query
languages and JSON tools, such as jq, jql and JMESPath. This one is relatively
simple and fewer-featured compared to those others.

One benefit of simplicity is that this program's JSON and JSON Pointer
implementations do not dynamically allocate or free memory (yet they do not
require that the entire input fit in memory at once). They are therefore
trivially protected against certain bug classes: memory leaks, double-frees and
use-after-frees.

The core JSON implementation is also written in the Wuffs programming language
(and then transpiled to C/C++), which is not only memory-safe (e.g. array
indexing is bounds-checked) but also guards against integer arithmetic
overflows.

For defense in depth, on Linux, this program also self-imposes a
SECCOMP_MODE_STRICT sandbox before reading (or otherwise processing) its input
or writing its output. Under this sandbox, the only permitted system calls are
read, write, exit and sigreturn.

All together, this program aims to safely handle untrusted JSON files without
fear of security bugs such as remote code execution.

----

As of 2020-02-24, this program passes all 318 "test_parsing" cases from the
JSON test suite (https://github.com/nst/JSONTestSuite), an appendix to the
"Parsing JSON is a Minefield" article (http://seriot.ch/parsing_json.php) that
was first published on 2016-10-26 and updated on 2018-03-30.

After modifying this program, run "build-example.sh example/jsonptr/" and then
"script/run-json-test-suite.sh" to catch correctness regressions.

----

This program uses Wuffs' JSON decoder at a relatively low level, processing the
decoder's token-stream output individually. The core loop, in pseudo-code, is
"for_each_token { handle_token(etc); }", where the handle_token function
changes global state (e.g. the `g_depth` and `g_ctx` variables) and prints
output text based on that state and the token's source text. Notably,
handle_token is not recursive, even though JSON values can nest.
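
As a rough illustration of that non-recursive shape, here is a minimal sketch.
It is not this program's code: the Kind, Token, State and format names are
hypothetical, Wuffs' real tokens carry much more detail, and std::string
allocates memory (which this program avoids). The point is only that a flat
loop can track nesting with a depth counter instead of recursion:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical token kinds, far simpler than Wuffs' real token encoding.
enum class Kind { open, close, scalar };

struct Token {
  Kind kind;
  std::string text;
};

// State stands in for this program's global state (e.g. g_depth).
struct State {
  uint32_t depth = 0;
  std::string out;
};

// handle_token never calls itself. Nesting is tracked only by state.depth.
void handle_token(State& state, const Token& token) {
  if ((token.kind == Kind::close) && (state.depth > 0)) {
    state.depth--;
  }
  state.out.append(2 * state.depth, ' ');
  state.out += token.text;
  state.out += '\n';
  if (token.kind == Kind::open) {
    state.depth++;
  }
}

// format is the "for_each_token { handle_token(etc); }" loop.
std::string format(const std::vector<Token>& tokens) {
  State state;
  for (const Token& token : tokens) {
    handle_token(state, token);
  }
  return state.out;
}
```

Applied to the five tokens "[", "1", "{", "}" and "]", format emits one token
per line, indented by two spaces per nesting level.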

This approach is centered around JSON tokens. Each JSON 'thing' (e.g. number,
string, object) comprises one or more JSON tokens.

An alternative, higher-level approach is in the sibling example/jsonfindptrs
program. Neither approach is better or worse per se, but when studying this
program, be aware that there are multiple ways to use Wuffs' JSON decoder.

The two programs, jsonfindptrs and jsonptr, also demonstrate different
trade-offs with regard to JSON object duplicate keys. The JSON spec permits
different implementations to allow or reject duplicate keys. It is not always
clear which approach is safer. Rejecting them is certainly unambiguous, and
security bugs can lurk in ambiguous corners of a file format, if two different
implementations both silently accept a file but differ on how to interpret it.
On the other hand, in the worst case, detecting duplicate keys requires O(N)
memory, where N is the size of the (potentially untrusted) input.

This program (jsonptr) allows duplicate keys and requires only O(1) memory. As
mentioned above, it doesn't dynamically allocate memory at all, and on Linux,
it runs in a SECCOMP_MODE_STRICT sandbox.
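
To see where the O(N) memory cost of rejecting duplicate keys comes from,
consider the sketch below. It is an assumption-laden illustration, not code
from either program: has_duplicate_key is a hypothetical helper, and it
dynamically allocates (which jsonptr deliberately does not). It has to
remember every key already seen in an object:

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// has_duplicate_key is a hypothetical helper. It reports whether any key
// occurs more than once in a single JSON object's list of keys.
bool has_duplicate_key(const std::vector<std::string>& keys) {
  // The set can grow to hold every key: O(N) memory in the worst case.
  std::unordered_set<std::string> seen;
  for (const std::string& key : keys) {
    if (!seen.insert(key).second) {
      return true;
    }
  }
  return false;
}
```

Allowing duplicates, as jsonptr does, needs no such set.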

----

To run:

$CXX jsonptr.cc && ./a.out < ../../test/data/github-tags.json; rm -f a.out

for a C++ compiler $CXX, such as clang++ or g++.
*/

#if defined(__cplusplus) && (__cplusplus < 201103L)
#error "This C++ program requires -std=c++11 or later"
#endif

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Wuffs ships as a "single file C library" or "header file library" as per
// https://github.com/nothings/stb/blob/master/docs/stb_howto.txt
//
// To use that single file as a "foo.c"-like implementation, instead of a
// "foo.h"-like header, #define WUFFS_IMPLEMENTATION before #include'ing or
// compiling it.
#define WUFFS_IMPLEMENTATION

// Defining the WUFFS_CONFIG__MODULE* macros is optional, but it lets users of
// release/c/etc.c whitelist which parts of Wuffs to build. That file contains
// the entire Wuffs standard library, implementing a variety of codecs and file
// formats. Without this macro definition, an optimizing compiler or linker may
// very well discard Wuffs code for unused codecs, but listing the Wuffs
// modules we use makes that process explicit. Preprocessing means that such
// code simply isn't compiled.
#define WUFFS_CONFIG__MODULES
#define WUFFS_CONFIG__MODULE__BASE
#define WUFFS_CONFIG__MODULE__JSON

// If building this program in an environment that doesn't easily accommodate
// relative includes, you can use the script/inline-c-relative-includes.go
// program to generate a stand-alone C++ file.
#include "../../release/c/wuffs-unsupported-snapshot.c"

#if defined(__linux__)
#include <linux/prctl.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#define WUFFS_EXAMPLE_USE_SECCOMP
#endif

#define TRY(error_msg)         \
  do {                         \
    const char* z = error_msg; \
    if (z) {                   \
      return z;                \
    }                          \
  } while (false)

static const char* g_eod = "main: end of data";

static const char* g_usage =
    "Usage: jsonptr -flags input.json\n"
    "\n"
    "Flags:\n"
    "    -c      -compact-output\n"
    "    -d=NUM  -max-output-depth=NUM\n"
    "    -q=STR  -query=STR\n"
    "    -s=NUM  -spaces=NUM\n"
    "    -t      -tabs\n"
    "            -fail-if-unsandboxed\n"
    "            -input-allow-comments\n"
    "            -input-allow-extra-comma\n"
    "            -input-allow-inf-nan-numbers\n"
    "            -output-extra-comma\n"
    "            -strict-json-pointer-syntax\n"
    "\n"
    "The input.json filename is optional. If absent, it reads from stdin.\n"
    "\n"
    "----\n"
    "\n"
    "jsonptr is a JSON formatter (pretty-printer) that supports the JSON\n"
    "Pointer (RFC 6901) query syntax. It reads UTF-8 JSON from stdin and\n"
    "writes canonicalized, formatted UTF-8 JSON to stdout.\n"
    "\n"
    "Canonicalized means that e.g. \"abc\\u000A\\tx\\u0177z\" is re-written\n"
    "as \"abc\\n\\txŷz\". It does not sort object keys, nor does it reject\n"
    "duplicate keys. Canonicalization does not imply Unicode normalization.\n"
    "\n"
    "Formatted means that arrays' and objects' elements are indented, each\n"
    "on its own line. Configure this with the -c / -compact-output, -s=NUM /\n"
    "-spaces=NUM (for NUM ranging from 0 to 8) and -t / -tabs flags.\n"
    "\n"
    "The -input-allow-comments flag allows \"/*slash-star*/\" and\n"
    "\"//slash-slash\" C-style comments within JSON input. Such comments are\n"
    "stripped from the output.\n"
    "\n"
    "The -input-allow-extra-comma flag allows input like \"[1,2,]\", with a\n"
    "comma after the final element of a JSON list or dictionary.\n"
    "\n"
    "The -input-allow-inf-nan-numbers flag allows non-finite floating point\n"
    "numbers (infinities and not-a-numbers) within JSON input.\n"
    "\n"
    "The -output-extra-comma flag writes output like \"[1,2,]\", with a comma\n"
    "after the final element of a JSON list or dictionary. Such commas are\n"
    "non-compliant with the JSON specification but many parsers accept them\n"
    "and they can produce simpler line-based diffs. This flag is ignored when\n"
    "-compact-output is set.\n"
    "\n"
    "----\n"
    "\n"
    "The -q=STR or -query=STR flag gives an optional JSON Pointer query, to\n"
    "print a subset of the input. For example, given RFC 6901 section 5's\n"
    "sample input (https://tools.ietf.org/rfc/rfc6901.txt), this command:\n"
    "    jsonptr -query=/foo/1 rfc-6901-json-pointer.json\n"
    "will print:\n"
    "    \"baz\"\n"
    "\n"
    "An absent query is equivalent to the empty query, which identifies the\n"
    "entire input (the root value). Unlike a file system, the \"/\" query\n"
    "does not identify the root. Instead, \"\" is the root and \"/\" is the\n"
    "child (the value in a key-value pair) of the root whose key is the empty\n"
    "string. Similarly, \"/xyz\" and \"/xyz/\" are two different nodes.\n"
    "\n"
    "If the query found a valid JSON value, this program will return a zero\n"
    "exit code even if the rest of the input isn't valid JSON. If the query\n"
    "did not find a value, or found an invalid one, this program returns a\n"
    "non-zero exit code, but may still print partial output to stdout.\n"
    "\n"
    "The JSON specification (https://json.org/) permits implementations that\n"
    "allow duplicate keys, as this one does. This JSON Pointer implementation\n"
    "is also greedy, following the first match for each fragment without\n"
    "back-tracking. For example, the \"/foo/bar\" query will fail if the root\n"
    "object has multiple \"foo\" children but the first one doesn't have a\n"
    "\"bar\" child, even if later ones do.\n"
    "\n"
    "The -strict-json-pointer-syntax flag restricts the -query=STR string to\n"
    "exactly RFC 6901, with only two escape sequences: \"~0\" and \"~1\" for\n"
    "\"~\" and \"/\". Without this flag, this program also lets \"~n\" and\n"
    "\"~r\" escape the New Line and Carriage Return ASCII control characters,\n"
    "which can work better with line oriented Unix tools that assume exactly\n"
    "one value (i.e. one JSON Pointer string) per line.\n"
    "\n"
    "----\n"
    "\n"
    "The -d=NUM or -max-output-depth=NUM flag gives the maximum (inclusive)\n"
    "output depth. JSON containers ([] arrays and {} objects) can hold other\n"
    "containers. When this flag is set, containers at depth NUM are replaced\n"
    "with \"[…]\" or \"{…}\". A bare -d or -max-output-depth is equivalent to\n"
    "-d=1. The flag's absence is equivalent to an unlimited output depth.\n"
    "\n"
    "The -max-output-depth flag only affects the program's output. It doesn't\n"
    "affect whether or not the input is considered valid JSON. The JSON\n"
    "specification permits implementations to set their own maximum input\n"
    "depth. This JSON implementation sets it to 1024.\n"
    "\n"
    "Depth is measured in terms of nested containers. It is unaffected by the\n"
    "number of spaces or tabs used to indent.\n"
    "\n"
    "When both -max-output-depth and -query are set, the output depth is\n"
    "measured from when the query resolves, not from the input root. The\n"
    "input depth (measured from the root) is still limited to 1024.\n"
    "\n"
    "----\n"
    "\n"
    "The -fail-if-unsandboxed flag causes the program to exit if it does not\n"
    "self-impose a sandbox. On Linux, it self-imposes a SECCOMP_MODE_STRICT\n"
    "sandbox, regardless of whether this flag was set.";

// ----

// Wuffs allows either statically or dynamically allocated work buffers. This
// program exercises static allocation.
#define WORK_BUFFER_ARRAY_SIZE \
  WUFFS_JSON__DECODER_WORKBUF_LEN_MAX_INCL_WORST_CASE
#if WORK_BUFFER_ARRAY_SIZE > 0
uint8_t g_work_buffer_array[WORK_BUFFER_ARRAY_SIZE];
#else
// Not all C/C++ compilers support 0-length arrays.
uint8_t g_work_buffer_array[1];
#endif

bool g_sandboxed = false;

int g_input_file_descriptor = 0;  // A 0 default means stdin.

#define NEW_LINE_THEN_256_SPACES \
  "\n                                                                        " \
  "                                                                        " \
  "                                                                        " \
  "                                        "
278#define NEW_LINE_THEN_256_TABS \
279 "\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t" \
280 "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t" \
281 "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t" \
282 "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t" \
283 "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t" \
284 "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t" \
285 "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"
286
const char* g_new_line_then_256_indent_bytes;
uint32_t g_bytes_per_indent_depth;

#ifndef DST_BUFFER_ARRAY_SIZE
#define DST_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef SRC_BUFFER_ARRAY_SIZE
#define SRC_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef TOKEN_BUFFER_ARRAY_SIZE
#define TOKEN_BUFFER_ARRAY_SIZE (4 * 1024)
#endif

uint8_t g_dst_array[DST_BUFFER_ARRAY_SIZE];
uint8_t g_src_array[SRC_BUFFER_ARRAY_SIZE];
wuffs_base__token g_tok_array[TOKEN_BUFFER_ARRAY_SIZE];

wuffs_base__io_buffer g_dst;
wuffs_base__io_buffer g_src;
wuffs_base__token_buffer g_tok;

// g_cursor_index is the g_src.data.ptr index between the previous and current
// token. An invariant is that (g_cursor_index <= g_src.meta.ri).
size_t g_cursor_index;

uint32_t g_depth;

enum class context {
  none,
  in_list_after_bracket,
  in_list_after_value,
  in_dict_after_brace,
  in_dict_after_key,
  in_dict_after_value,
} g_ctx;

bool  //
in_dict_before_key() {
  return (g_ctx == context::in_dict_after_brace) ||
         (g_ctx == context::in_dict_after_value);
}

uint32_t g_suppress_write_dst;
bool g_wrote_to_dst;

wuffs_json__decoder g_dec;

// ----

// Query is a JSON Pointer query. After initializing with a NUL-terminated C
// string, its multiple fragments are consumed as the program walks the JSON
// data from stdin. For example, letting "$" denote a NUL, suppose that we
// started with a query string of "/apple/banana/12/durian" and are currently
// trying to match the second fragment, "banana", so that Query::m_depth is 2:
//
//    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//    / a p p l e / b a n a n a / 1 2 / d u r i a n $
//    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                  ^           ^
//                  m_frag_i    m_frag_k
//
// The two pointers m_frag_i and m_frag_k (abbreviated as mfi and mfk) are the
// start (inclusive) and end (exclusive) of the query fragment. They satisfy
// (mfi <= mfk) and may be equal if the fragment is empty (note that "" is a
// valid JSON object key).
//
// The m_frag_j (mfj) pointer moves between these two, or is nullptr. An
// invariant is that (((mfi <= mfj) && (mfj <= mfk)) || (mfj == nullptr)).
//
// Wuffs' JSON tokenizer can portray a single JSON string as multiple Wuffs
// tokens, as backslash-escaped values within that JSON string may each get
// their own token.
//
// At the start of each object key (a JSON string), mfj is set to mfi.
//
// While mfj remains non-nullptr, each token's unescaped contents are then
// compared to that part of the fragment from mfj to mfk. If it is a prefix
// (including the case of an exact match), then mfj is advanced by the
// unescaped length. Otherwise, mfj is set to nullptr.
//
// Comparison accounts for JSON Pointer's escaping notation: "~0" and "~1" in
// the query (not the JSON value) are unescaped to "~" and "/" respectively.
// "~n" and "~r" are also unescaped to "\n" and "\r". The program is
// responsible for calling Query::validate (with a strict_json_pointer_syntax
// argument) before otherwise using this class.
//
// The mfj pointer therefore advances from mfi to mfk, or drops out, as we
// incrementally match the object key with the query fragment. For example, if
// we have already matched the "ban" of "banana", then we would accept any of
// an "ana" token, an "a" token or a "\u0061" token, amongst others. They would
// advance mfj by 3, 1 or 1 bytes respectively.
//
//                       mfj
//                        v
//    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//    / a p p l e / b a n a n a / 1 2 / d u r i a n $
//    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                  ^           ^
//                  mfi         mfk
//
// At the end of each object key (or equivalently, at the start of each object
// value), if mfj is non-nullptr and equal to (but not less than) mfk then we
// have a fragment match: the query fragment equals the object key. If there is
// a next fragment (in this example, "12") we move the frag_etc pointers to its
// start and end and increment Query::m_depth. Otherwise, we have matched the
// complete query, and the upcoming JSON value is the result of that query.
//
// The discussion above centers on object keys. If the query fragment is
// numeric then it can also match as an array index: the string fragment "12"
// will match an array's 13th element (starting counting from zero). See RFC
// 6901 for its precise definition of an "array index" number.
//
// Array index fragment match is represented by the Query::m_array_index field,
// whose type (wuffs_base__result_u64) is a result type. An error result means
// that the fragment is not an array index. A value result holds the number of
// list elements remaining. When matching a query fragment in an array (instead
// of in an object), each element ticks this number down towards zero. At zero,
// the upcoming JSON value is the one that matches the query fragment.
class Query {
 private:
  uint8_t* m_frag_i;
  uint8_t* m_frag_j;
  uint8_t* m_frag_k;

  uint32_t m_depth;

  wuffs_base__result_u64 m_array_index;

 public:
  void reset(char* query_c_string) {
    m_frag_i = (uint8_t*)query_c_string;
    m_frag_j = (uint8_t*)query_c_string;
    m_frag_k = (uint8_t*)query_c_string;
    m_depth = 0;
    m_array_index.status.repr = "#main: not an array index query fragment";
    m_array_index.value = 0;
  }

  void restart_fragment(bool enable) { m_frag_j = enable ? m_frag_i : nullptr; }

  bool is_at(uint32_t depth) { return m_depth == depth; }

  // tick returns whether the fragment is a valid array index whose value is
  // zero. If valid but non-zero, it decrements it and returns false.
  bool tick() {
    if (m_array_index.status.is_ok()) {
      if (m_array_index.value == 0) {
        return true;
      }
      m_array_index.value--;
    }
    return false;
  }

  // next_fragment moves to the next fragment, returning whether it existed.
  bool next_fragment() {
    uint8_t* k = m_frag_k;
    uint32_t d = m_depth;

    this->reset(nullptr);

    if (!k || (*k != '/')) {
      return false;
    }
    k++;

    bool all_digits = true;
    uint8_t* i = k;
    while ((*k != '\x00') && (*k != '/')) {
      all_digits = all_digits && ('0' <= *k) && (*k <= '9');
      k++;
    }
    m_frag_i = i;
    m_frag_j = i;
    m_frag_k = k;
    m_depth = d + 1;
    if (all_digits) {
      // wuffs_base__parse_number_u64 rejects leading zeroes, e.g. "00", "07".
      m_array_index = wuffs_base__parse_number_u64(
          wuffs_base__make_slice_u8(i, k - i),
          WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
    }
    return true;
  }

  bool matched_all() { return m_frag_k == nullptr; }

  bool matched_fragment() { return m_frag_j && (m_frag_j == m_frag_k); }

  void incremental_match_slice(uint8_t* ptr, size_t len) {
    if (!m_frag_j) {
      return;
    }
    uint8_t* j = m_frag_j;
    while (true) {
      if (len == 0) {
        m_frag_j = j;
        return;
      }

      if (*j == '\x00') {
        break;

      } else if (*j == '~') {
        j++;
        if (*j == '0') {
          if (*ptr != '~') {
            break;
          }
        } else if (*j == '1') {
          if (*ptr != '/') {
            break;
          }
        } else if (*j == 'n') {
          if (*ptr != '\n') {
            break;
          }
        } else if (*j == 'r') {
          if (*ptr != '\r') {
            break;
          }
        } else {
          break;
        }

      } else if (*j != *ptr) {
        break;
      }

      j++;
      ptr++;
      len--;
    }
    m_frag_j = nullptr;
  }

  void incremental_match_code_point(uint32_t code_point) {
    if (!m_frag_j) {
      return;
    }
    uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
    size_t n = wuffs_base__utf_8__encode(
        wuffs_base__make_slice_u8(&u[0],
                                  WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
        code_point);
    if (n > 0) {
      this->incremental_match_slice(&u[0], n);
    }
  }

  // validate returns whether the (query_c_string, length) arguments form a
  // valid JSON Pointer. In particular, it must be valid UTF-8, and either be
  // empty or start with a '/'. Any '~' within must immediately be followed by
  // either '0' or '1'. If strict_json_pointer_syntax is false, a '~' may also
  // be followed by either 'n' or 'r'.
  static bool validate(char* query_c_string,
                       size_t length,
                       bool strict_json_pointer_syntax) {
    if (length <= 0) {
      return true;
    }
    if (query_c_string[0] != '/') {
      return false;
    }
    wuffs_base__slice_u8 s =
        wuffs_base__make_slice_u8((uint8_t*)query_c_string, length);
    bool previous_was_tilde = false;
    while (s.len > 0) {
      wuffs_base__utf_8__next__output o = wuffs_base__utf_8__next(s.ptr, s.len);
      if (!o.is_valid()) {
        return false;
      }

      if (previous_was_tilde) {
        switch (o.code_point) {
          case '0':
          case '1':
            break;
          case 'n':
          case 'r':
            if (strict_json_pointer_syntax) {
              return false;
            }
            break;
          default:
            return false;
        }
      }
      previous_was_tilde = o.code_point == '~';

      s.ptr += o.byte_length;
      s.len -= o.byte_length;
    }
    return !previous_was_tilde;
  }
} g_query;
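
/*
The escape handling that Query performs incrementally can also be sketched in
isolation. The helper below is illustrative only and is not used by this
program: unescape_fragment is a hypothetical name, and it builds a std::string
(which allocates), whereas Query compares bytes in place. It assumes strict RFC
6901 syntax, so only "~0" and "~1" are accepted.

```cpp
#include <cassert>
#include <string>

// unescape_fragment decodes one JSON Pointer fragment (the text between two
// '/' separators) into out. It returns false if a '~' is not followed by
// either '0' or '1'.
bool unescape_fragment(const std::string& fragment, std::string* out) {
  out->clear();
  for (size_t i = 0; i < fragment.size(); i++) {
    if (fragment[i] != '~') {
      out->push_back(fragment[i]);
      continue;
    }
    i++;
    if (i >= fragment.size()) {
      return false;  // A trailing '~' has nothing to escape.
    } else if (fragment[i] == '0') {
      out->push_back('~');
    } else if (fragment[i] == '1') {
      out->push_back('/');
    } else {
      return false;
    }
  }
  return true;
}
```

For example, the fragment "a~1b~0c" names the object key "a/b~c".
*/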

// ----

struct {
  int remaining_argc;
  char** remaining_argv;

  bool compact_output;
  bool fail_if_unsandboxed;
  bool input_allow_comments;
  bool input_allow_extra_comma;
  bool input_allow_inf_nan_numbers;
  bool output_extra_comma;
  bool strict_json_pointer_syntax;
  bool tabs;

  uint32_t max_output_depth;
  uint32_t spaces;

  char* query_c_string;
} g_flags = {0};

const char*  //
parse_flags(int argc, char** argv) {
  g_flags.spaces = 4;
  g_flags.max_output_depth = 0xFFFFFFFF;

  int c = (argc > 0) ? 1 : 0;  // Skip argv[0], the program name.
  for (; c < argc; c++) {
    char* arg = argv[c];
    if (*arg++ != '-') {
      break;
    }

    // A double-dash "--foo" is equivalent to a single-dash "-foo". As special
    // cases, a bare "-" is not a flag (some programs may interpret it as
    // stdin) and a bare "--" means to stop parsing flags.
    if (*arg == '\x00') {
      break;
    } else if (*arg == '-') {
      arg++;
      if (*arg == '\x00') {
        c++;
        break;
      }
    }

    if (!strcmp(arg, "c") || !strcmp(arg, "compact-output")) {
      g_flags.compact_output = true;
      continue;
    }
    if (!strcmp(arg, "d") || !strcmp(arg, "max-output-depth")) {
      g_flags.max_output_depth = 1;
      continue;
    } else if (!strncmp(arg, "d=", 2) ||
               !strncmp(arg, "max-output-depth=", 17)) {
      while (*arg++ != '=') {
      }
      wuffs_base__result_u64 u = wuffs_base__parse_number_u64(
          wuffs_base__make_slice_u8((uint8_t*)arg, strlen(arg)),
          WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
      if (u.status.is_ok() && (u.value <= 0xFFFFFFFF)) {
        g_flags.max_output_depth = (uint32_t)(u.value);
        continue;
      }
      return g_usage;
    }
    if (!strcmp(arg, "fail-if-unsandboxed")) {
      g_flags.fail_if_unsandboxed = true;
      continue;
    }
    if (!strcmp(arg, "input-allow-comments")) {
      g_flags.input_allow_comments = true;
      continue;
    }
    if (!strcmp(arg, "input-allow-extra-comma")) {
      g_flags.input_allow_extra_comma = true;
      continue;
    }
    if (!strcmp(arg, "input-allow-inf-nan-numbers")) {
      g_flags.input_allow_inf_nan_numbers = true;
      continue;
    }
    if (!strcmp(arg, "output-extra-comma")) {
      g_flags.output_extra_comma = true;
      continue;
    }
    if (!strncmp(arg, "q=", 2) || !strncmp(arg, "query=", 6)) {
      while (*arg++ != '=') {
      }
      g_flags.query_c_string = arg;
      continue;
    }
    if (!strncmp(arg, "s=", 2) || !strncmp(arg, "spaces=", 7)) {
      while (*arg++ != '=') {
      }
      if (('0' <= arg[0]) && (arg[0] <= '8') && (arg[1] == '\x00')) {
        g_flags.spaces = arg[0] - '0';
        continue;
      }
      return g_usage;
    }
    if (!strcmp(arg, "strict-json-pointer-syntax")) {
      g_flags.strict_json_pointer_syntax = true;
      continue;
    }
    if (!strcmp(arg, "t") || !strcmp(arg, "tabs")) {
      g_flags.tabs = true;
      continue;
    }

    return g_usage;
  }

  if (g_flags.query_c_string &&
      !Query::validate(g_flags.query_c_string, strlen(g_flags.query_c_string),
                       g_flags.strict_json_pointer_syntax)) {
    return "main: bad JSON Pointer (RFC 6901) syntax for the -query=STR flag";
  }

  g_flags.remaining_argc = argc - c;
  g_flags.remaining_argv = argv + c;
  return nullptr;
}
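/*
Illustrative sketch, not part of jsonptr: the dash handling at the top of the
parse_flags loop can be looked at in isolation, as below. The ArgKind,
ClassifiedArg and classify_arg names are hypothetical, introduced only for this
example, and the sketch covers only the prefix rules, not the flag names.

```cpp
#include <cstring>

// Hypothetical classification of a single argv entry.
enum class ArgKind { kNotAFlag, kEndOfFlags, kFlag };

struct ClassifiedArg {
  ArgKind kind;
  const char* name;  // Flag text with leading dashes stripped, or nullptr.
};

// Applies the same prefix rules as the parse_flags loop: one or two leading
// dashes introduce a flag, a bare "-" is not a flag and a bare "--" means to
// stop parsing flags.
ClassifiedArg classify_arg(const char* arg) {
  if (*arg++ != '-') {
    return {ArgKind::kNotAFlag, nullptr};  // e.g. "file.json"
  }
  if (*arg == '\0') {
    return {ArgKind::kNotAFlag, nullptr};  // A bare "-".
  }
  if (*arg == '-') {
    arg++;
    if (*arg == '\0') {
      return {ArgKind::kEndOfFlags, nullptr};  // A bare "--".
    }
  }
  return {ArgKind::kFlag, arg};
}
```

Under this sketch, classify_arg("-tabs") and classify_arg("--tabs") both yield
the name "tabs", which is why each flag needs only one strcmp per spelling.
*/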

const char*  //
initialize_globals(int argc, char** argv) {
  g_dst = wuffs_base__make_io_buffer(
      wuffs_base__make_slice_u8(g_dst_array, DST_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_io_buffer_meta());

  g_src = wuffs_base__make_io_buffer(
      wuffs_base__make_slice_u8(g_src_array, SRC_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_io_buffer_meta());

  g_tok = wuffs_base__make_token_buffer(
      wuffs_base__make_slice_token(g_tok_array, TOKEN_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_token_buffer_meta());

  g_cursor_index = 0;

  g_depth = 0;

  g_ctx = context::none;

  TRY(parse_flags(argc, argv));
  if (g_flags.fail_if_unsandboxed && !g_sandboxed) {
    return "main: unsandboxed";
  }
  const int stdin_fd = 0;
  if (g_flags.remaining_argc >
      ((g_input_file_descriptor != stdin_fd) ? 1 : 0)) {
    return g_usage;
  }

  g_new_line_then_256_indent_bytes =
      g_flags.tabs ? NEW_LINE_THEN_256_TABS : NEW_LINE_THEN_256_SPACES;
  g_bytes_per_indent_depth = g_flags.tabs ? 1 : g_flags.spaces;

  g_query.reset(g_flags.query_c_string);

  // If the query is non-empty, suppress writing to stdout until we've
  // completed the query.
  g_suppress_write_dst = g_query.next_fragment() ? 1 : 0;
  g_wrote_to_dst = false;

  TRY(g_dec.initialize(sizeof__wuffs_json__decoder(), WUFFS_VERSION, 0)
          .message());

  if (g_flags.input_allow_comments) {
    g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_COMMENT_BLOCK, true);
    g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_COMMENT_LINE, true);
  }
  if (g_flags.input_allow_extra_comma) {
    g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_EXTRA_COMMA, true);
  }
  if (g_flags.input_allow_inf_nan_numbers) {
    g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_INF_NAN_NUMBERS, true);
  }

  // Consume an optional whitespace trailer. This isn't part of the JSON spec,
  // but it works better with line oriented Unix tools (such as "echo 123 |
  // jsonptr" where it's "echo", not "echo -n") or hand-edited JSON files which
  // can accidentally contain trailing whitespace.
  g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE, true);

  return nullptr;
}

// ----

// ignore_return_value suppresses errors from -Wall -Werror.
static void  //
ignore_return_value(int ignored) {}

const char*  //
read_src() {
  if (g_src.meta.closed) {
    return "main: internal error: read requested on a closed source";
  }
  g_src.compact();
  if (g_src.meta.wi >= g_src.data.len) {
    return "main: g_src buffer is full";
  }
  while (true) {
    ssize_t n = read(g_input_file_descriptor, g_src.writer_pointer(),
                     g_src.writer_length());
    if (n >= 0) {
      g_src.meta.wi += n;
      g_src.meta.closed = n == 0;
      break;
    } else if (errno != EINTR) {
      return strerror(errno);
    }
  }
  return nullptr;
}
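/*
Illustrative sketch, not part of jsonptr: read_src and flush_dst both retry
when a system call is interrupted by a signal (errno == EINTR). A minimal
stand-alone version of that loop for read, assuming a POSIX system, might look
like the following. The read_retrying_on_eintr name is hypothetical.

```cpp
#include <cerrno>
#include <unistd.h>

// Calls read(2) until it either succeeds (returning the byte count, where 0
// means end of file) or fails with an error other than EINTR (returning -1
// with errno set by read).
ssize_t read_retrying_on_eintr(int fd, void* buf, size_t len) {
  while (true) {
    ssize_t n = read(fd, buf, len);
    if ((n >= 0) || (errno != EINTR)) {
      return n;
    }
  }
}
```

As in read_src, a return value of 0 is the caller's cue to mark the source as
closed rather than to retry.
*/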

const char*  //
flush_dst() {
  while (true) {
    size_t n = g_dst.reader_length();
    if (n == 0) {
      break;
    }
    const int stdout_fd = 1;
    ssize_t i = write(stdout_fd, g_dst.reader_pointer(), n);
    if (i >= 0) {
      g_dst.meta.ri += i;
    } else if (errno != EINTR) {
      return strerror(errno);
    }
  }
  g_dst.compact();
  return nullptr;
}

const char*  //
write_dst_slow(const void* s, size_t n) {
  const uint8_t* p = static_cast<const uint8_t*>(s);
  while (n > 0) {
    size_t i = g_dst.writer_length();
    if (i == 0) {
      const char* z = flush_dst();
      if (z) {
        return z;
      }
      i = g_dst.writer_length();
      if (i == 0) {
        return "main: g_dst buffer is full";
      }
    }

    if (i > n) {
      i = n;
    }
    memcpy(g_dst.data.ptr + g_dst.meta.wi, p, i);
    g_dst.meta.wi += i;
    p += i;
    n -= i;
    g_wrote_to_dst = true;
  }
  return nullptr;
}

inline const char*  //
write_dst(const void* s, size_t n) {
  if (g_suppress_write_dst > 0) {
    return nullptr;
  } else if (n <= (DST_BUFFER_ARRAY_SIZE - g_dst.meta.wi)) {
    memcpy(g_dst.data.ptr + g_dst.meta.wi, s, n);
    g_dst.meta.wi += n;
    g_wrote_to_dst = true;
    return nullptr;
  }
  return write_dst_slow(s, n);
}

// ----

uint8_t  //
hex_digit(uint8_t nibble) {
  nibble &= 0x0F;
  if (nibble <= 9) {
    return '0' + nibble;
  }
  return ('A' - 10) + nibble;
}

const char*  //
handle_unicode_code_point(uint32_t ucp) {
  if (ucp < 0x0020) {
    switch (ucp) {
      case '\b':
        return write_dst("\\b", 2);
      case '\f':
        return write_dst("\\f", 2);
      case '\n':
        return write_dst("\\n", 2);
      case '\r':
        return write_dst("\\r", 2);
      case '\t':
        return write_dst("\\t", 2);
    }

    // Other bytes less than 0x0020 are valid UTF-8 but not valid in a
    // JSON string. They need to remain escaped.
    uint8_t esc6[6];
    esc6[0] = '\\';
    esc6[1] = 'u';
    esc6[2] = '0';
    esc6[3] = '0';
    esc6[4] = hex_digit(ucp >> 4);
    esc6[5] = hex_digit(ucp >> 0);
    return write_dst(&esc6[0], 6);

  } else if (ucp == '\"') {
    return write_dst("\\\"", 2);

  } else if (ucp == '\\') {
    return write_dst("\\\\", 2);
  }

  uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
  size_t n = wuffs_base__utf_8__encode(
      wuffs_base__make_slice_u8(&u[0],
                                WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
      ucp);
  if (n == 0) {
    return "main: internal error: unexpected Unicode code point";
  }
  return write_dst(&u[0], n);
}
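/*
Illustrative sketch, not part of jsonptr: the control-character branch of
handle_unicode_code_point can be restated as a pure function that returns the
escape sequence instead of writing it. The escape_control name is hypothetical,
and the sketch uses std::string for brevity, whereas the function above writes
into a fixed-size array and does not allocate.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Returns the JSON string escape for a code point below 0x20: a two-byte
// short form where JSON defines one, otherwise a six-byte \u00XX form with
// upper-case hexadecimal digits.
std::string escape_control(uint32_t ucp) {
  assert(ucp < 0x20);  // Only control code points are handled here.
  switch (ucp) {
    case '\b':
      return "\\b";
    case '\f':
      return "\\f";
    case '\n':
      return "\\n";
    case '\r':
      return "\\r";
    case '\t':
      return "\\t";
  }
  static const char kHex[] = "0123456789ABCDEF";
  std::string s = "\\u00";
  s += kHex[(ucp >> 4) & 0x0F];
  s += kHex[ucp & 0x0F];
  return s;
}
```

For instance, a new line (0x0A) maps to the two bytes of "\n" in escaped form,
while 0x1F has no short form and maps to the six bytes of "\u001F".
*/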

// ----

const char*  //
handle_token(wuffs_base__token t, bool start_of_token_chain) {
  do {
    int64_t vbc = t.value_base_category();
    uint64_t vbd = t.value_base_detail();
    uint64_t token_length = t.length();
    // The "- token_length" is because we incremented g_cursor_index before
    // calling handle_token.
    wuffs_base__slice_u8 tok = wuffs_base__make_slice_u8(
        g_src.data.ptr + g_cursor_index - token_length, token_length);

    // Handle ']' or '}'.
    if ((vbc == WUFFS_BASE__TOKEN__VBC__STRUCTURE) &&
        (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__POP)) {
      if (g_query.is_at(g_depth)) {
        return "main: no match for query";
      }
      if (g_depth <= 0) {
        return "main: internal error: inconsistent g_depth";
      }
      g_depth--;

      if (g_query.matched_all() && (g_depth >= g_flags.max_output_depth)) {
        g_suppress_write_dst--;
        // '…' is U+2026 HORIZONTAL ELLIPSIS, which is 3 UTF-8 bytes.
        TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST)
                          ? "\"[…]\""
                          : "\"{…}\"",
                      7));
      } else {
        // Write preceding whitespace.
        if ((g_ctx != context::in_list_after_bracket) &&
            (g_ctx != context::in_dict_after_brace) &&
            !g_flags.compact_output) {
          if (g_flags.output_extra_comma) {
            TRY(write_dst(",", 1));
          }
          uint32_t indent = g_depth * g_bytes_per_indent_depth;
          TRY(write_dst(g_new_line_then_256_indent_bytes, 1 + (indent & 0xFF)));
          for (indent >>= 8; indent > 0; indent--) {
            TRY(write_dst(g_new_line_then_256_indent_bytes + 1, 0x100));
          }
        }

        TRY(write_dst(
            (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST) ? "]" : "}",
            1));
      }

      g_ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                  ? context::in_list_after_value
                  : context::in_dict_after_key;
      goto after_value;
    }

    // Write preceding whitespace and punctuation, if it wasn't ']', '}' or a
    // continuation of a multi-token chain.
    if (start_of_token_chain) {
      if (g_ctx == context::in_dict_after_key) {
        TRY(write_dst(": ", g_flags.compact_output ? 1 : 2));
      } else if (g_ctx != context::none) {
        if ((g_ctx != context::in_list_after_bracket) &&
            (g_ctx != context::in_dict_after_brace)) {
          TRY(write_dst(",", 1));
        }
        if (!g_flags.compact_output) {
          uint32_t indent = g_depth * g_bytes_per_indent_depth;
          TRY(write_dst(g_new_line_then_256_indent_bytes, 1 + (indent & 0xFF)));
          for (indent >>= 8; indent > 0; indent--) {
            TRY(write_dst(g_new_line_then_256_indent_bytes + 1, 0x100));
          }
        }
      }

      bool query_matched_fragment = false;
      if (g_query.is_at(g_depth)) {
        switch (g_ctx) {
          case context::in_list_after_bracket:
          case context::in_list_after_value:
            query_matched_fragment = g_query.tick();
            break;
          case context::in_dict_after_key:
            query_matched_fragment = g_query.matched_fragment();
            break;
          default:
            break;
        }
      }
      if (!query_matched_fragment) {
        // No-op.
      } else if (!g_query.next_fragment()) {
        // There is no next fragment. We have matched the complete query, and
        // the upcoming JSON value is the result of that query.
        //
        // Un-suppress writing to stdout and reset the g_ctx and g_depth as if
        // we were about to decode a top-level value. This makes any subsequent
        // indentation be relative to this point, and we will return g_eod
        // after the upcoming JSON value is complete.
        if (g_suppress_write_dst != 1) {
          return "main: internal error: inconsistent g_suppress_write_dst";
        }
        g_suppress_write_dst = 0;
        g_ctx = context::none;
        g_depth = 0;
      } else if ((vbc != WUFFS_BASE__TOKEN__VBC__STRUCTURE) ||
                 !(vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__PUSH)) {
        // The query has moved on to the next fragment but the upcoming JSON
        // value is not a container.
        return "main: no match for query";
      }
    }

    // Handle the token itself: either a container ('[' or '{') or a simple
    // value: string (a chain of raw or escaped parts), literal or number.
    switch (vbc) {
      case WUFFS_BASE__TOKEN__VBC__STRUCTURE:
        if (g_query.matched_all() && (g_depth >= g_flags.max_output_depth)) {
          g_suppress_write_dst++;
        } else {
          TRY(write_dst(
              (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST) ? "[" : "{",
              1));
        }
        g_depth++;
        g_ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                    ? context::in_list_after_bracket
                    : context::in_dict_after_brace;
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__STRING:
        if (start_of_token_chain) {
          TRY(write_dst("\"", 1));
          g_query.restart_fragment(in_dict_before_key() &&
                                   g_query.is_at(g_depth));
        }

        if (vbd & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_0_DST_1_SRC_DROP) {
          // No-op.
        } else if (vbd &
                   WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
          TRY(write_dst(tok.ptr, tok.len));
          g_query.incremental_match_slice(tok.ptr, tok.len);
        } else {
          return "main: internal error: unexpected string-token conversion";
        }

        if (t.continued()) {
          return nullptr;
        }
        TRY(write_dst("\"", 1));
        goto after_value;

      case WUFFS_BASE__TOKEN__VBC__UNICODE_CODE_POINT:
        if (!t.continued()) {
          return "main: internal error: unexpected non-continued UCP token";
        }
        TRY(handle_unicode_code_point(vbd));
        g_query.incremental_match_code_point(vbd);
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__LITERAL:
      case WUFFS_BASE__TOKEN__VBC__NUMBER:
        TRY(write_dst(tok.ptr, tok.len));
        goto after_value;
    }

    // Return an error if we didn't match the (vbc, vbd) pair.
    return "main: internal error: unexpected token";
  } while (0);

  // Book-keeping after completing a value (whether a container value or a
  // simple value). Empty parent containers are no longer empty. If the parent
  // container is a "{...}" object, toggle between keys and values.
after_value:
  if (g_depth == 0) {
    return g_eod;
  }
  switch (g_ctx) {
    case context::in_list_after_bracket:
      g_ctx = context::in_list_after_value;
      break;
    case context::in_dict_after_brace:
      g_ctx = context::in_dict_after_key;
      break;
    case context::in_dict_after_key:
      g_ctx = context::in_dict_after_value;
      break;
    case context::in_dict_after_value:
      g_ctx = context::in_dict_after_key;
      break;
    default:
      break;
  }
  return nullptr;
}
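/*
Illustrative sketch, not part of jsonptr: handle_token emits a new line plus
indentation by slicing a constant buffer, first (indent & 0xFF) bytes and then
whole 256-byte chunks. The sketch below reproduces that arithmetic with
std::string. It assumes, from the name and from how it is indexed above, that
NEW_LINE_THEN_256_SPACES is a "\n" followed by 256 spaces. The function names
are hypothetical.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for NEW_LINE_THEN_256_SPACES: one "\n" byte followed
// by 256 space bytes.
std::string new_line_then_256_spaces() {
  return "\n" + std::string(256, ' ');
}

// Appends a new line and then `indent` spaces to out. The first append takes
// the "\n" plus the low 8 bits of indent; each loop iteration then appends a
// further 256 spaces, skipping the leading "\n".
void append_new_line_and_indent(std::string& out, uint32_t indent) {
  const std::string buf = new_line_then_256_spaces();
  out.append(buf, 0, 1 + (indent & 0xFF));
  for (indent >>= 8; indent > 0; indent--) {
    out.append(buf, 1, 0x100);
  }
}
```

With an indent of 600 bytes, the first append contributes 1 + 88 bytes and the
loop runs twice, adding 512 more, so 601 bytes are written in three calls
rather than one call per space.
*/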

const char*  //
main1(int argc, char** argv) {
  TRY(initialize_globals(argc, argv));

  bool start_of_token_chain = true;
  while (true) {
    wuffs_base__status status = g_dec.decode_tokens(
        &g_tok, &g_src,
        wuffs_base__make_slice_u8(g_work_buffer_array, WORK_BUFFER_ARRAY_SIZE));

    while (g_tok.meta.ri < g_tok.meta.wi) {
      wuffs_base__token t = g_tok.data.ptr[g_tok.meta.ri++];
      uint64_t token_length = t.length();
      if ((g_src.meta.ri - g_cursor_index) < token_length) {
        return "main: internal error: inconsistent g_src indexes";
      }
      g_cursor_index += token_length;

      // Skip filler tokens (e.g. whitespace).
      if (t.value_base_category() == WUFFS_BASE__TOKEN__VBC__FILLER) {
        start_of_token_chain = !t.continued();
        continue;
      }

      const char* z = handle_token(t, start_of_token_chain);
      start_of_token_chain = !t.continued();
      if (z == nullptr) {
        continue;
      } else if (z == g_eod) {
        goto end_of_data;
      }
      return z;
    }

    if (status.repr == nullptr) {
      return "main: internal error: unexpected end of token stream";
    } else if (status.repr == wuffs_base__suspension__short_read) {
      if (g_cursor_index != g_src.meta.ri) {
        return "main: internal error: inconsistent g_src indexes";
      }
      TRY(read_src());
      g_cursor_index = g_src.meta.ri;
    } else if (status.repr == wuffs_base__suspension__short_write) {
      g_tok.compact();
    } else {
      return status.message();
    }
  }
end_of_data:

  // With a non-empty g_query, don't try to consume trailing whitespace or
  // confirm that we've processed all the tokens.
  if (g_flags.query_c_string && *g_flags.query_c_string) {
    return nullptr;
  }

  // Check that we've exhausted the input.
  if ((g_src.meta.ri == g_src.meta.wi) && !g_src.meta.closed) {
    TRY(read_src());
  }
  if ((g_src.meta.ri < g_src.meta.wi) || !g_src.meta.closed) {
    return "main: valid JSON followed by further (unexpected) data";
  }

  // Check that we've used all of the decoded tokens, other than trailing
  // filler tokens. For example, "true\n" is valid JSON (and fully consumed
  // with WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE enabled) with a trailing
  // filler token for the "\n".
  for (; g_tok.meta.ri < g_tok.meta.wi; g_tok.meta.ri++) {
    if (g_tok.data.ptr[g_tok.meta.ri].value_base_category() !=
        WUFFS_BASE__TOKEN__VBC__FILLER) {
      return "main: internal error: decoded OK but unprocessed tokens remain";
    }
  }

  return nullptr;
}

int  //
compute_exit_code(const char* status_msg) {
  if (!status_msg) {
    return 0;
  }
  size_t n;
  if (status_msg == g_usage) {
    n = strlen(status_msg);
  } else {
    n = strnlen(status_msg, 2047);
    if (n >= 2047) {
      status_msg = "main: internal error: error message is too long";
      n = strnlen(status_msg, 2047);
    }
  }
  const int stderr_fd = 2;
  ignore_return_value(write(stderr_fd, status_msg, n));
  ignore_return_value(write(stderr_fd, "\n", 1));
  // Return an exit code of 1 for regular (foreseen) errors, e.g. badly
  // formatted or unsupported input.
  //
  // Return an exit code of 2 for internal (exceptional) errors, e.g. defensive
  // run-time checks found that an internal invariant did not hold.
  //
  // Automated testing, including badly formatted inputs, can therefore
  // discriminate between expected failure (exit code 1) and unexpected failure
  // (other non-zero exit codes). Specifically, exit code 2 for internal
  // invariant violation, exit code 139 (which is 128 + SIGSEGV on x86_64
  // Linux) for a segmentation fault (e.g. null pointer dereference).
  return strstr(status_msg, "internal error:") ? 2 : 1;
}

int  //
main(int argc, char** argv) {
  // Look for an input filename (the first non-flag argument) in argv. If there
  // is one, open it (but do not read from it) before we self-impose a sandbox.
  //
  // Flags start with "-", unless they come after a bare "--" arg.
  {
    bool dash_dash = false;
    int a;
    for (a = 1; a < argc; a++) {
      char* arg = argv[a];
      if ((arg[0] == '-') && !dash_dash) {
        dash_dash = (arg[1] == '-') && (arg[2] == '\x00');
        continue;
      }
      g_input_file_descriptor = open(arg, O_RDONLY);
      if (g_input_file_descriptor < 0) {
        fprintf(stderr, "%s: %s\n", arg, strerror(errno));
        return 1;
      }
      break;
    }
  }

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
  g_sandboxed = true;
#endif

  const char* z = main1(argc, argv);
  if (g_wrote_to_dst) {
    const char* z1 = write_dst("\n", 1);
    const char* z2 = flush_dst();
    z = z ? z : (z1 ? z1 : z2);
  }
  int exit_code = compute_exit_code(z);

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  // Call SYS_exit explicitly, instead of calling SYS_exit_group implicitly by
  // either calling _exit or returning from main. SECCOMP_MODE_STRICT allows
  // only SYS_exit.
  syscall(SYS_exit, exit_code);
#endif
  return exit_code;
}