// Copyright 2020 The Wuffs Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//    https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// ----------------

/*
jsonptr is a JSON formatter (pretty-printer) that supports the JSON Pointer
(RFC 6901) query syntax. It reads UTF-8 JSON from stdin and writes
canonicalized, formatted UTF-8 JSON to stdout.

See the "const char* g_usage" string below for details.

----

JSON Pointer (and this program's implementation) is one of many JSON query
languages and JSON tools, such as jq, jql and JMESPath. This one is relatively
simple and fewer-featured compared to those others.

One benefit of simplicity is that this program's JSON and JSON Pointer
implementations do not dynamically allocate or free memory (yet it does not
require that the entire input fits in memory at once). They are therefore
trivially protected against certain bug classes: memory leaks, double-frees and
use-after-frees.

The core JSON implementation is also written in the Wuffs programming language
(and then transpiled to C/C++), which is memory-safe (e.g. array indexing is
bounds-checked) but also guards against integer arithmetic overflows.

For defense in depth, on Linux, this program also self-imposes a
SECCOMP_MODE_STRICT sandbox before reading (or otherwise processing) its input
or writing its output. Under this sandbox, the only permitted system calls are
read, write, exit and sigreturn.

All together, this program aims to safely handle untrusted JSON files without
fear of security bugs such as remote code execution.

----

As of 2020-02-24, this program passes all 318 "test_parsing" cases from the
JSON test suite (https://github.com/nst/JSONTestSuite), an appendix to the
"Parsing JSON is a Minefield" article (http://seriot.ch/parsing_json.php) that
was first published on 2016-10-26 and updated on 2018-03-30.

After modifying this program, run "build-example.sh example/jsonptr/" and then
"script/run-json-test-suite.sh" to catch correctness regressions.

----

This program uses Wuffs' JSON decoder at a relatively low level, processing the
decoder's token-stream output individually. The core loop, in pseudo-code, is
"for_each_token { handle_token(etc); }", where the handle_token function
changes global state (e.g. the `g_depth` and `g_ctx` variables) and prints
output text based on that state and the token's source text. Notably,
handle_token is not recursive, even though JSON values can nest.
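
As a rough illustration of that shape (and not this program's actual code), the
toy sketch below treats every input byte as one "token" and tracks nesting with
a depth counter instead of recursion. All of its names (ToyToken, g_toy_depth,
g_toy_max_depth, toy_max_depth) are hypothetical, and its handle_token is a
stand-in that only counts depth. It ignores JSON strings, so it is not a JSON
parser.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical toy token kinds. Wuffs' real tokens carry much more detail.
enum class ToyToken { open, close, other };

// Global state that handle_token mutates (compare g_depth in this program).
static uint32_t g_toy_depth = 0;
static uint32_t g_toy_max_depth = 0;

// handle_token updates the global state for one token and returns an error
// message or nullptr. It never calls itself, however deeply the input nests.
static const char* handle_token(ToyToken t) {
  switch (t) {
    case ToyToken::open:
      g_toy_depth++;
      if (g_toy_max_depth < g_toy_depth) {
        g_toy_max_depth = g_toy_depth;
      }
      break;
    case ToyToken::close:
      if (g_toy_depth == 0) {
        return "toy: unbalanced close";
      }
      g_toy_depth--;
      break;
    case ToyToken::other:
      break;
  }
  return nullptr;
}

// toy_max_depth is the "for_each_token { handle_token(etc); }" loop. It
// classifies each byte as a token, so brackets inside string literals are
// (wrongly, for real JSON) counted too.
static const char* toy_max_depth(const std::string& src, uint32_t* out) {
  g_toy_depth = 0;
  g_toy_max_depth = 0;
  for (char c : src) {
    ToyToken t = ToyToken::other;
    if ((c == '[') || (c == '{')) {
      t = ToyToken::open;
    } else if ((c == ']') || (c == '}')) {
      t = ToyToken::close;
    }
    const char* err = handle_token(t);
    if (err) {
      return err;
    }
  }
  if (g_toy_depth != 0) {
    return "toy: unbalanced open";
  }
  *out = g_toy_max_depth;
  return nullptr;
}
```

The memory used by this sketch does not grow with the nesting depth: the only
per-input state is the pair of counters.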

This approach is centered around JSON tokens. Each JSON 'thing' (e.g. number,
string, object) comprises one or more JSON tokens.

An alternative, higher-level approach is in the sibling example/jsonfindptrs
program. Neither approach is better or worse per se, but when studying this
program, be aware that there are multiple ways to use Wuffs' JSON decoder.

The two programs, jsonfindptrs and jsonptr, also demonstrate different
trade-offs with regard to JSON object duplicate keys. The JSON spec permits
different implementations to allow or reject duplicate keys. It is not always
clear which approach is safer. Rejecting them is certainly unambiguous, and
security bugs can lurk in ambiguous corners of a file format, if two different
implementations both silently accept a file but differ on how to interpret it.
On the other hand, in the worst case, detecting duplicate keys requires O(N)
memory, where N is the size of the (potentially untrusted) input.
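
To make the memory cost concrete, here is a minimal sketch of duplicate-key
detection for a single object's keys. It is an assumption-laden illustration,
not code from either program: has_duplicate_key is a hypothetical helper, and
it uses a heap-allocating std::unordered_set, which this program deliberately
avoids.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// has_duplicate_key reports whether any key occurs more than once. The "seen"
// set has to remember every distinct key so far, so in the worst case (all
// keys distinct) it holds a copy of all of them: memory proportional to the
// input size.
static bool has_duplicate_key(const std::vector<std::string>& keys) {
  std::unordered_set<std::string> seen;
  for (const std::string& k : keys) {
    if (!seen.insert(k).second) {
      return true;  // The insert failed because k was already present.
    }
  }
  return false;
}
```

An input consisting of one large object with many distinct keys is therefore
the expensive case for a rejecting implementation.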

This program (jsonptr) allows duplicate keys and requires only O(1) memory. As
mentioned above, it doesn't dynamically allocate memory at all, and on Linux,
it runs in a SECCOMP_MODE_STRICT sandbox.

----

To run:

$CXX jsonptr.cc && ./a.out < ../../test/data/github-tags.json; rm -f a.out

for a C++ compiler $CXX, such as clang++ or g++.
*/

#if defined(__cplusplus) && (__cplusplus < 201103L)
#error "This C++ program requires -std=c++11 or later"
#endif

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Wuffs ships as a "single file C library" or "header file library" as per
// https://github.com/nothings/stb/blob/master/docs/stb_howto.txt
//
// To use that single file as a "foo.c"-like implementation, instead of a
// "foo.h"-like header, #define WUFFS_IMPLEMENTATION before #include'ing or
// compiling it.
#define WUFFS_IMPLEMENTATION

// Defining the WUFFS_CONFIG__MODULE* macros is optional, but it lets users of
// release/c/etc.c whitelist which parts of Wuffs to build. That file contains
// the entire Wuffs standard library, implementing a variety of codecs and file
// formats. Without this macro definition, an optimizing compiler or linker may
// very well discard Wuffs code for unused codecs, but listing the Wuffs
// modules we use makes that process explicit. Preprocessing means that such
// code simply isn't compiled.
#define WUFFS_CONFIG__MODULES
#define WUFFS_CONFIG__MODULE__BASE
#define WUFFS_CONFIG__MODULE__JSON

// If building this program in an environment that doesn't easily accommodate
// relative includes, you can use the script/inline-c-relative-includes.go
// program to generate a stand-alone C++ file.
#include "../../release/c/wuffs-unsupported-snapshot.c"

#if defined(__linux__)
#include <linux/prctl.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#define WUFFS_EXAMPLE_USE_SECCOMP
#endif

#define TRY(error_msg)         \
  do {                         \
    const char* z = error_msg; \
    if (z) {                   \
      return z;                \
    }                          \
  } while (false)

static const char* g_eod = "main: end of data";

static const char* g_usage =
    "Usage: jsonptr -flags input.json\n"
    "\n"
    "Flags:\n"
    "    -c      -compact-output\n"
    "    -d=NUM  -max-output-depth=NUM\n"
    "    -q=STR  -query=STR\n"
    "    -s=NUM  -spaces=NUM\n"
    "    -t      -tabs\n"
    "            -fail-if-unsandboxed\n"
    "            -input-allow-comments\n"
    "            -input-allow-extra-comma\n"
    "            -input-allow-inf-nan-numbers\n"
    "            -output-extra-comma\n"
    "            -strict-json-pointer-syntax\n"
    "\n"
    "The input.json filename is optional. If absent, it reads from stdin.\n"
    "\n"
    "----\n"
    "\n"
    "jsonptr is a JSON formatter (pretty-printer) that supports the JSON\n"
    "Pointer (RFC 6901) query syntax. It reads UTF-8 JSON from stdin and\n"
    "writes canonicalized, formatted UTF-8 JSON to stdout.\n"
    "\n"
    "Canonicalized means that e.g. \"abc\\u000A\\tx\\u0177z\" is re-written\n"
    "as \"abc\\n\\txŷz\". It does not sort object keys, nor does it reject\n"
    "duplicate keys. Canonicalization does not imply Unicode normalization.\n"
    "\n"
    "Formatted means that arrays' and objects' elements are indented, each\n"
    "on its own line. Configure this with the -c / -compact-output, -s=NUM /\n"
    "-spaces=NUM (for NUM ranging from 0 to 8) and -t / -tabs flags.\n"
    "\n"
    "The -input-allow-comments flag allows \"/*slash-star*/\" and\n"
    "\"//slash-slash\" C-style comments within JSON input. Such comments are\n"
    "stripped from the output.\n"
    "\n"
    "The -input-allow-extra-comma flag allows input like \"[1,2,]\", with a\n"
    "comma after the final element of a JSON list or dictionary.\n"
    "\n"
    "The -input-allow-inf-nan-numbers flag allows non-finite floating point\n"
    "numbers (infinities and not-a-numbers) within JSON input.\n"
    "\n"
    "The -output-extra-comma flag writes output like \"[1,2,]\", with a comma\n"
    "after the final element of a JSON list or dictionary. Such commas are\n"
    "non-compliant with the JSON specification but many parsers accept them\n"
    "and they can produce simpler line-based diffs. This flag is ignored when\n"
    "-compact-output is set.\n"
    "\n"
    "----\n"
    "\n"
    "The -q=STR or -query=STR flag gives an optional JSON Pointer query, to\n"
    "print a subset of the input. For example, given RFC 6901 section 5's\n"
    "sample input (https://tools.ietf.org/rfc/rfc6901.txt), this command:\n"
    "    jsonptr -query=/foo/1 rfc-6901-json-pointer.json\n"
    "will print:\n"
    "    \"baz\"\n"
    "\n"
    "An absent query is equivalent to the empty query, which identifies the\n"
    "entire input (the root value). Unlike a file system, the \"/\" query\n"
    "does not identify the root. Instead, \"\" is the root and \"/\" is the\n"
    "child (the value in a key-value pair) of the root whose key is the empty\n"
    "string. Similarly, \"/xyz\" and \"/xyz/\" are two different nodes.\n"
    "\n"
    "If the query found a valid JSON value, this program will return a zero\n"
    "exit code even if the rest of the input isn't valid JSON. If the query\n"
    "did not find a value, or found an invalid one, this program returns a\n"
    "non-zero exit code, but may still print partial output to stdout.\n"
    "\n"
    "The JSON specification (https://json.org/) permits implementations that\n"
    "allow duplicate keys, as this one does. This JSON Pointer implementation\n"
    "is also greedy, following the first match for each fragment without\n"
    "back-tracking. For example, the \"/foo/bar\" query will fail if the root\n"
    "object has multiple \"foo\" children but the first one doesn't have a\n"
    "\"bar\" child, even if later ones do.\n"
    "\n"
    "The -strict-json-pointer-syntax flag restricts the -query=STR string to\n"
    "exactly RFC 6901, with only two escape sequences: \"~0\" and \"~1\" for\n"
    "\"~\" and \"/\". Without this flag, this program also lets \"~n\" and\n"
    "\"~r\" escape the New Line and Carriage Return ASCII control characters,\n"
    "which can work better with line oriented Unix tools that assume exactly\n"
    "one value (i.e. one JSON Pointer string) per line.\n"
    "\n"
    "----\n"
    "\n"
    "The -d=NUM or -max-output-depth=NUM flag gives the maximum (inclusive)\n"
    "output depth. JSON containers ([] arrays and {} objects) can hold other\n"
    "containers. When this flag is set, containers at depth NUM are replaced\n"
    "with \"[…]\" or \"{…}\". A bare -d or -max-output-depth is equivalent to\n"
    "-d=1. The flag's absence is equivalent to an unlimited output depth.\n"
    "\n"
    "The -max-output-depth flag only affects the program's output. It doesn't\n"
    "affect whether or not the input is considered valid JSON. The JSON\n"
    "specification permits implementations to set their own maximum input\n"
    "depth. This JSON implementation sets it to 1024.\n"
    "\n"
    "Depth is measured in terms of nested containers. It is unaffected by the\n"
    "number of spaces or tabs used to indent.\n"
    "\n"
    "When both -max-output-depth and -query are set, the output depth is\n"
    "measured from when the query resolves, not from the input root. The\n"
    "input depth (measured from the root) is still limited to 1024.\n"
    "\n"
    "----\n"
    "\n"
    "The -fail-if-unsandboxed flag causes the program to exit if it does not\n"
    "self-impose a sandbox. On Linux, it self-imposes a SECCOMP_MODE_STRICT\n"
    "sandbox, regardless of whether this flag was set.";

// ----

// Wuffs allows either statically or dynamically allocated work buffers. This
// program exercises static allocation.
#define WORK_BUFFER_ARRAY_SIZE \
  WUFFS_JSON__DECODER_WORKBUF_LEN_MAX_INCL_WORST_CASE
#if WORK_BUFFER_ARRAY_SIZE > 0
uint8_t g_work_buffer_array[WORK_BUFFER_ARRAY_SIZE];
#else
// Not all C/C++ compilers support 0-length arrays.
uint8_t g_work_buffer_array[1];
#endif

bool g_sandboxed = false;

int g_input_file_descriptor = 0;  // A 0 default means stdin.

// parse_flags enforces that g_flags.spaces <= 8 (the length of
// INDENT_SPACES_STRING).
#define INDENT_SPACES_STRING "        "
#define INDENT_TAB_STRING "\t"

#ifndef DST_BUFFER_ARRAY_SIZE
#define DST_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef SRC_BUFFER_ARRAY_SIZE
#define SRC_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef TOKEN_BUFFER_ARRAY_SIZE
#define TOKEN_BUFFER_ARRAY_SIZE (4 * 1024)
#endif

uint8_t g_dst_array[DST_BUFFER_ARRAY_SIZE];
uint8_t g_src_array[SRC_BUFFER_ARRAY_SIZE];
wuffs_base__token g_tok_array[TOKEN_BUFFER_ARRAY_SIZE];

wuffs_base__io_buffer g_dst;
wuffs_base__io_buffer g_src;
wuffs_base__token_buffer g_tok;

// g_curr_token_end_src_index is the g_src.data.ptr index of the end of the
// current token. An invariant is that (g_curr_token_end_src_index <=
// g_src.meta.ri).
size_t g_curr_token_end_src_index;

uint32_t g_depth;

enum class context {
  none,
  in_list_after_bracket,
  in_list_after_value,
  in_dict_after_brace,
  in_dict_after_key,
  in_dict_after_value,
} g_ctx;

bool  //
in_dict_before_key() {
  return (g_ctx == context::in_dict_after_brace) ||
         (g_ctx == context::in_dict_after_value);
}

uint32_t g_suppress_write_dst;
bool g_wrote_to_dst;

wuffs_json__decoder g_dec;

// ----

// Query is a JSON Pointer query. After initializing with a NUL-terminated C
// string, its multiple fragments are consumed as the program walks the JSON
// data from stdin. For example, letting "$" denote a NUL, suppose that we
// started with a query string of "/apple/banana/12/durian" and are currently
// trying to match the second fragment, "banana", so that Query::m_depth is 2:
//
//    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//    / a p p l e / b a n a n a / 1 2 / d u r i a n $
//    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                  ^           ^
//                  m_frag_i    m_frag_k
//
// The two pointers m_frag_i and m_frag_k (abbreviated as mfi and mfk) are the
// start (inclusive) and end (exclusive) of the query fragment. They satisfy
// (mfi <= mfk) and may be equal if the fragment is empty (note that "" is a
// valid JSON object key).
//
// The m_frag_j (mfj) pointer moves between these two, or is nullptr. An
// invariant is that (((mfi <= mfj) && (mfj <= mfk)) || (mfj == nullptr)).
//
// Wuffs' JSON tokenizer can portray a single JSON string as multiple Wuffs
// tokens, as backslash-escaped values within that JSON string may each get
// their own token.
//
// At the start of each object key (a JSON string), mfj is set to mfi.
//
// While mfj remains non-nullptr, each token's unescaped contents are then
// compared to that part of the fragment from mfj to mfk. If it is a prefix
// (including the case of an exact match), then mfj is advanced by the
// unescaped length. Otherwise, mfj is set to nullptr.
//
// Comparison accounts for JSON Pointer's escaping notation: "~0" and "~1" in
// the query (not the JSON value) are unescaped to "~" and "/" respectively.
// "~n" and "~r" are also unescaped to "\n" and "\r". The program is
// responsible for calling Query::validate (with a strict_json_pointer_syntax
// argument) before otherwise using this class.
//
// The mfj pointer therefore advances from mfi to mfk, or drops out, as we
// incrementally match the object key with the query fragment. For example, if
// we have already matched the "ban" of "banana", then we would accept any of
// an "ana" token, an "a" token or a "\u0061" token, amongst others. They would
// advance mfj by 3, 1 or 1 bytes respectively.
//
//                        mfj
//                        v
//    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//    / a p p l e / b a n a n a / 1 2 / d u r i a n $
//    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                  ^           ^
//                  mfi         mfk
//
// At the end of each object key (or equivalently, at the start of each object
// value), if mfj is non-nullptr and equal to (but not less than) mfk then we
// have a fragment match: the query fragment equals the object key. If there is
// a next fragment (in this example, "12") we move the frag_etc pointers to its
// start and end and increment Query::m_depth. Otherwise, we have matched the
// complete query, and the upcoming JSON value is the result of that query.
//
// The discussion above centers on object keys. If the query fragment is
// numeric then it can also match as an array index: the string fragment "12"
// will match an array's 13th element (array indexes count from zero). See RFC
// 6901 for its precise definition of an "array index" number.
//
// Array index fragment match is represented by the Query::m_array_index field,
// whose type (wuffs_base__result_u64) is a result type. An error result means
// that the fragment is not an array index. A value result holds the number of
// list elements remaining. When matching a query fragment in an array (instead
// of in an object), each element ticks this number down towards zero. At zero,
// the upcoming JSON value is the one that matches the query fragment.
class Query {
 private:
  uint8_t* m_frag_i;
  uint8_t* m_frag_j;
  uint8_t* m_frag_k;

  uint32_t m_depth;

  wuffs_base__result_u64 m_array_index;

 public:
  void reset(char* query_c_string) {
    m_frag_i = (uint8_t*)query_c_string;
    m_frag_j = (uint8_t*)query_c_string;
    m_frag_k = (uint8_t*)query_c_string;
    m_depth = 0;
    m_array_index.status.repr = "#main: not an array index query fragment";
    m_array_index.value = 0;
  }

  void restart_fragment(bool enable) { m_frag_j = enable ? m_frag_i : nullptr; }

  bool is_at(uint32_t depth) { return m_depth == depth; }

  // tick returns whether the fragment is a valid array index whose value is
  // zero. If valid but non-zero, it decrements it and returns false.
  bool tick() {
    if (m_array_index.status.is_ok()) {
      if (m_array_index.value == 0) {
        return true;
      }
      m_array_index.value--;
    }
    return false;
  }

  // next_fragment moves to the next fragment, returning whether it existed.
  bool next_fragment() {
    uint8_t* k = m_frag_k;
    uint32_t d = m_depth;

    this->reset(nullptr);

    if (!k || (*k != '/')) {
      return false;
    }
    k++;

    bool all_digits = true;
    uint8_t* i = k;
    while ((*k != '\x00') && (*k != '/')) {
      all_digits = all_digits && ('0' <= *k) && (*k <= '9');
      k++;
    }
    m_frag_i = i;
    m_frag_j = i;
    m_frag_k = k;
    m_depth = d + 1;
    if (all_digits) {
      // wuffs_base__parse_number_u64 rejects leading zeroes, e.g. "00", "07".
      m_array_index = wuffs_base__parse_number_u64(
          wuffs_base__make_slice_u8(i, k - i),
          WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
    }
    return true;
  }

  bool matched_all() { return m_frag_k == nullptr; }

  bool matched_fragment() { return m_frag_j && (m_frag_j == m_frag_k); }

  void incremental_match_slice(uint8_t* ptr, size_t len) {
    if (!m_frag_j) {
      return;
    }
    uint8_t* j = m_frag_j;
    while (true) {
      if (len == 0) {
        m_frag_j = j;
        return;
      }

      if (*j == '\x00') {
        break;

      } else if (*j == '~') {
        j++;
        if (*j == '0') {
          if (*ptr != '~') {
            break;
          }
        } else if (*j == '1') {
          if (*ptr != '/') {
            break;
          }
        } else if (*j == 'n') {
          if (*ptr != '\n') {
            break;
          }
        } else if (*j == 'r') {
          if (*ptr != '\r') {
            break;
          }
        } else {
          break;
        }

      } else if (*j != *ptr) {
        break;
      }

      j++;
      ptr++;
      len--;
    }
    m_frag_j = nullptr;
  }

  void incremental_match_code_point(uint32_t code_point) {
    if (!m_frag_j) {
      return;
    }
    uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
    size_t n = wuffs_base__utf_8__encode(
        wuffs_base__make_slice_u8(&u[0],
                                  WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
        code_point);
    if (n > 0) {
      this->incremental_match_slice(&u[0], n);
    }
  }

  // validate returns whether the (ptr, len) arguments form a valid JSON
  // Pointer. In particular, it must be valid UTF-8, and either be empty or
  // start with a '/'. Any '~' within must immediately be followed by either
  // '0' or '1'. If strict_json_pointer_syntax is false, a '~' may also be
  // followed by either 'n' or 'r'.
  static bool validate(char* query_c_string,
                       size_t length,
                       bool strict_json_pointer_syntax) {
    if (length <= 0) {
      return true;
    }
    if (query_c_string[0] != '/') {
      return false;
    }
    wuffs_base__slice_u8 s =
        wuffs_base__make_slice_u8((uint8_t*)query_c_string, length);
    bool previous_was_tilde = false;
    while (s.len > 0) {
      wuffs_base__utf_8__next__output o = wuffs_base__utf_8__next(s.ptr, s.len);
      if (!o.is_valid()) {
        return false;
      }

      if (previous_was_tilde) {
        switch (o.code_point) {
          case '0':
          case '1':
            break;
          case 'n':
          case 'r':
            if (strict_json_pointer_syntax) {
              return false;
            }
            break;
          default:
            return false;
        }
      }
      previous_was_tilde = o.code_point == '~';

      s.ptr += o.byte_length;
      s.len -= o.byte_length;
    }
    return !previous_was_tilde;
  }
} g_query;
573// ----
574
Nigel Tao68920952020-03-03 11:25:18 +1100575struct {
576 int remaining_argc;
577 char** remaining_argv;
578
Nigel Tao3690e832020-03-12 16:52:26 +1100579 bool compact_output;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100580 bool fail_if_unsandboxed;
Nigel Tao0291a472020-08-13 22:40:10 +1000581 bool input_allow_comments;
582 bool input_allow_extra_comma;
583 bool input_allow_inf_nan_numbers;
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100584 uint32_t max_output_depth;
Nigel Tao0291a472020-08-13 22:40:10 +1000585 bool output_extra_comma;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100586 char* query_c_string;
Nigel Taoecadf722020-07-13 08:22:34 +1000587 size_t spaces;
Nigel Taod6fdfb12020-03-11 12:24:14 +1100588 bool strict_json_pointer_syntax;
Nigel Tao68920952020-03-03 11:25:18 +1100589 bool tabs;
Nigel Taod60815c2020-03-26 14:32:35 +1100590} g_flags = {0};
Nigel Tao68920952020-03-03 11:25:18 +1100591
592const char* //
593parse_flags(int argc, char** argv) {
Nigel Taoecadf722020-07-13 08:22:34 +1000594 g_flags.spaces = 4;
Nigel Taod60815c2020-03-26 14:32:35 +1100595 g_flags.max_output_depth = 0xFFFFFFFF;
Nigel Tao68920952020-03-03 11:25:18 +1100596
597 int c = (argc > 0) ? 1 : 0; // Skip argv[0], the program name.
598 for (; c < argc; c++) {
599 char* arg = argv[c];
600 if (*arg++ != '-') {
601 break;
602 }
603
604 // A double-dash "--foo" is equivalent to a single-dash "-foo". As special
605 // cases, a bare "-" is not a flag (some programs may interpret it as
606 // stdin) and a bare "--" means to stop parsing flags.
607 if (*arg == '\x00') {
608 break;
609 } else if (*arg == '-') {
610 arg++;
611 if (*arg == '\x00') {
612 c++;
613 break;
614 }
615 }
616
Nigel Tao3690e832020-03-12 16:52:26 +1100617 if (!strcmp(arg, "c") || !strcmp(arg, "compact-output")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100618 g_flags.compact_output = true;
Nigel Tao68920952020-03-03 11:25:18 +1100619 continue;
620 }
    if (!strcmp(arg, "d") || !strcmp(arg, "max-output-depth")) {
      g_flags.max_output_depth = 1;
      continue;
    } else if (!strncmp(arg, "d=", 2) ||
               !strncmp(arg, "max-output-depth=", 17)) {
      // The strncmp lengths include the '=', so the loop below always finds
      // one before reaching the terminating NUL.
      while (*arg++ != '=') {
      }
      wuffs_base__result_u64 u = wuffs_base__parse_number_u64(
          wuffs_base__make_slice_u8((uint8_t*)arg, strlen(arg)),
          WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
      if (u.status.is_ok() && (u.value <= 0xFFFFFFFF)) {
        g_flags.max_output_depth = (uint32_t)(u.value);
        continue;
      }
      return g_usage;
    }
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100637 if (!strcmp(arg, "fail-if-unsandboxed")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100638 g_flags.fail_if_unsandboxed = true;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100639 continue;
640 }
Nigel Tao0291a472020-08-13 22:40:10 +1000641 if (!strcmp(arg, "input-allow-comments")) {
642 g_flags.input_allow_comments = true;
Nigel Tao4e193592020-07-15 12:48:57 +1000643 continue;
644 }
Nigel Tao0291a472020-08-13 22:40:10 +1000645 if (!strcmp(arg, "input-allow-extra-comma")) {
646 g_flags.input_allow_extra_comma = true;
Nigel Tao4e193592020-07-15 12:48:57 +1000647 continue;
648 }
Nigel Tao0291a472020-08-13 22:40:10 +1000649 if (!strcmp(arg, "input-allow-inf-nan-numbers")) {
650 g_flags.input_allow_inf_nan_numbers = true;
Nigel Tao3c8589b2020-07-19 21:49:00 +1000651 continue;
652 }
Nigel Tao0291a472020-08-13 22:40:10 +1000653 if (!strcmp(arg, "output-extra-comma")) {
654 g_flags.output_extra_comma = true;
Nigel Taodd114692020-07-25 21:54:12 +1000655 continue;
656 }
Nigel Tao0cd2f982020-03-03 23:03:02 +1100657 if (!strncmp(arg, "q=", 2) || !strncmp(arg, "query=", 6)) {
658 while (*arg++ != '=') {
659 }
Nigel Taod60815c2020-03-26 14:32:35 +1100660 g_flags.query_c_string = arg;
Nigel Taod6fdfb12020-03-11 12:24:14 +1100661 continue;
662 }
Nigel Taoecadf722020-07-13 08:22:34 +1000663 if (!strncmp(arg, "s=", 2) || !strncmp(arg, "spaces=", 7)) {
664 while (*arg++ != '=') {
665 }
666 if (('0' <= arg[0]) && (arg[0] <= '8') && (arg[1] == '\x00')) {
667 g_flags.spaces = arg[0] - '0';
668 continue;
669 }
670 return g_usage;
671 }
672 if (!strcmp(arg, "strict-json-pointer-syntax")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100673 g_flags.strict_json_pointer_syntax = true;
Nigel Taod6fdfb12020-03-11 12:24:14 +1100674 continue;
Nigel Tao68920952020-03-03 11:25:18 +1100675 }
676 if (!strcmp(arg, "t") || !strcmp(arg, "tabs")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100677 g_flags.tabs = true;
Nigel Tao68920952020-03-03 11:25:18 +1100678 continue;
679 }
680
Nigel Taod60815c2020-03-26 14:32:35 +1100681 return g_usage;
Nigel Tao68920952020-03-03 11:25:18 +1100682 }
683
Nigel Taod60815c2020-03-26 14:32:35 +1100684 if (g_flags.query_c_string &&
685 !Query::validate(g_flags.query_c_string, strlen(g_flags.query_c_string),
686 g_flags.strict_json_pointer_syntax)) {
Nigel Taod6fdfb12020-03-11 12:24:14 +1100687 return "main: bad JSON Pointer (RFC 6901) syntax for the -query=STR flag";
688 }
689
Nigel Taod60815c2020-03-26 14:32:35 +1100690 g_flags.remaining_argc = argc - c;
691 g_flags.remaining_argv = argv + c;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100692 return nullptr;
Nigel Tao68920952020-03-03 11:25:18 +1100693}

const char*  //
initialize_globals(int argc, char** argv) {
  g_dst = wuffs_base__make_io_buffer(
      wuffs_base__make_slice_u8(g_dst_array, DST_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_io_buffer_meta());

  g_src = wuffs_base__make_io_buffer(
      wuffs_base__make_slice_u8(g_src_array, SRC_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_io_buffer_meta());

  g_tok = wuffs_base__make_token_buffer(
      wuffs_base__make_slice_token(g_tok_array, TOKEN_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_token_buffer_meta());

  g_curr_token_end_src_index = 0;

  g_depth = 0;

  g_ctx = context::none;

  TRY(parse_flags(argc, argv));
  if (g_flags.fail_if_unsandboxed && !g_sandboxed) {
    return "main: unsandboxed";
  }
  const int stdin_fd = 0;
  if (g_flags.remaining_argc >
      ((g_input_file_descriptor != stdin_fd) ? 1 : 0)) {
    return g_usage;
  }

  g_query.reset(g_flags.query_c_string);

  // If the query is non-empty, suppress writing to stdout until we've
  // completed the query.
  g_suppress_write_dst = g_query.next_fragment() ? 1 : 0;
  g_wrote_to_dst = false;

  TRY(g_dec.initialize(sizeof__wuffs_json__decoder(), WUFFS_VERSION, 0)
          .message());

  if (g_flags.input_allow_comments) {
    g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_COMMENT_BLOCK, true);
    g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_COMMENT_LINE, true);
  }
  if (g_flags.input_allow_extra_comma) {
    g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_EXTRA_COMMA, true);
  }
  if (g_flags.input_allow_inf_nan_numbers) {
    g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_INF_NAN_NUMBERS, true);
  }

  // Consume an optional whitespace trailer. This isn't part of the JSON spec,
  // but it works better with line oriented Unix tools (such as "echo 123 |
  // jsonptr" where it's "echo", not "echo -n") or hand-edited JSON files which
  // can accidentally contain trailing whitespace.
  g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE, true);

  return nullptr;
}

// ----

// ignore_return_value suppresses errors from -Wall -Werror.
static void  //
ignore_return_value(int ignored) {}

const char*  //
read_src() {
  if (g_src.meta.closed) {
    return "main: internal error: read requested on a closed source";
  }
  g_src.compact();
  if (g_src.meta.wi >= g_src.data.len) {
    return "main: g_src buffer is full";
  }
  while (true) {
    ssize_t n = read(g_input_file_descriptor, g_src.writer_pointer(),
                     g_src.writer_length());
    if (n >= 0) {
      g_src.meta.wi += n;
      g_src.meta.closed = n == 0;
      break;
    } else if (errno != EINTR) {
      return strerror(errno);
    }
  }
  return nullptr;
}

const char*  //
flush_dst() {
  while (true) {
    size_t n = g_dst.reader_length();
    if (n == 0) {
      break;
    }
    const int stdout_fd = 1;
    ssize_t i = write(stdout_fd, g_dst.reader_pointer(), n);
    if (i >= 0) {
      g_dst.meta.ri += i;
    } else if (errno != EINTR) {
      return strerror(errno);
    }
  }
  g_dst.compact();
  return nullptr;
}

const char*  //
write_dst(const void* s, size_t n) {
  if (g_suppress_write_dst > 0) {
    return nullptr;
  }
  const uint8_t* p = static_cast<const uint8_t*>(s);
  while (n > 0) {
    size_t i = g_dst.writer_length();
    if (i == 0) {
      const char* z = flush_dst();
      if (z) {
        return z;
      }
      i = g_dst.writer_length();
      if (i == 0) {
        return "main: g_dst buffer is full";
      }
    }

    if (i > n) {
      i = n;
    }
    memcpy(g_dst.data.ptr + g_dst.meta.wi, p, i);
    g_dst.meta.wi += i;
    p += i;
    n -= i;
    g_wrote_to_dst = true;
  }
  return nullptr;
}
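
// The copy-then-flush loop in write_dst can be sketched with a tiny
// fixed-size buffer. This is a simplified illustration, not jsonptr's code:
// FixedBuffer is a hypothetical type, its 8-byte capacity is arbitrary, and
// a std::string (which allocates, unlike jsonptr) stands in for stdout so
// that the result is easy to inspect.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

struct FixedBuffer {
  uint8_t data[8];   // Deliberately small, to force several flushes.
  size_t wi = 0;     // Write index: the number of buffered bytes.
  std::string sink;  // Assumption: stands in for the stdout file descriptor.

  // flush moves all buffered bytes to the sink and empties the buffer.
  void flush() {
    sink.append(reinterpret_cast<const char*>(data), wi);
    wi = 0;
  }

  // write copies n bytes in chunks that fit the remaining buffer space,
  // flushing whenever the buffer is full.
  void write(const void* s, size_t n) {
    const uint8_t* p = static_cast<const uint8_t*>(s);
    while (n > 0) {
      size_t i = sizeof(data) - wi;
      if (i == 0) {
        flush();
        i = sizeof(data);
      }
      if (i > n) {
        i = n;
      }
      std::memcpy(data + wi, p, i);
      wi += i;
      p += i;
      n -= i;
    }
  }
};
```

// Writing the 13 bytes of "hello, world!" through this sketch fills the
// buffer once, flushes 8 bytes, and leaves 5 bytes buffered until the final
// flush. jsonptr's main function does the equivalent final flush_dst call.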

// ----

uint8_t  //
hex_digit(uint8_t nibble) {
  nibble &= 0x0F;
  if (nibble <= 9) {
    return '0' + nibble;
  }
  return ('A' - 10) + nibble;
}

const char*  //
handle_unicode_code_point(uint32_t ucp) {
  if (ucp < 0x0020) {
    switch (ucp) {
      case '\b':
        return write_dst("\\b", 2);
      case '\f':
        return write_dst("\\f", 2);
      case '\n':
        return write_dst("\\n", 2);
      case '\r':
        return write_dst("\\r", 2);
      case '\t':
        return write_dst("\\t", 2);
    }

    // Other bytes less than 0x0020 are valid UTF-8 but not valid in a
    // JSON string. They need to remain escaped.
    uint8_t esc6[6];
    esc6[0] = '\\';
    esc6[1] = 'u';
    esc6[2] = '0';
    esc6[3] = '0';
    esc6[4] = hex_digit(ucp >> 4);
    esc6[5] = hex_digit(ucp >> 0);
    return write_dst(&esc6[0], 6);

  } else if (ucp == '\"') {
    return write_dst("\\\"", 2);

  } else if (ucp == '\\') {
    return write_dst("\\\\", 2);
  }

  uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
  size_t n = wuffs_base__utf_8__encode(
      wuffs_base__make_slice_u8(&u[0],
                                WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
      ucp);
  if (n == 0) {
    return "main: internal error: unexpected Unicode code point";
  }
  return write_dst(&u[0], n);
}
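
// The control-character branch of handle_unicode_code_point can be sketched
// as a pure function. This is only an illustration: escape_control_char is a
// hypothetical name, it is meant for inputs below 0x20 only, and it returns a
// std::string (which allocates) where jsonptr writes to g_dst instead.

```cpp
#include <cstdint>
#include <string>

// escape_control_char returns the JSON escape for a byte below 0x20: a short
// two-byte form where JSON defines one, else the six-byte "\u00XX" form with
// upper-case hex digits.
std::string escape_control_char(uint8_t c) {
  switch (c) {
    case '\b':
      return "\\b";
    case '\f':
      return "\\f";
    case '\n':
      return "\\n";
    case '\r':
      return "\\r";
    case '\t':
      return "\\t";
  }
  static const char kHex[] = "0123456789ABCDEF";
  std::string s = "\\u00";
  s += kHex[(c >> 4) & 0x0F];
  s += kHex[c & 0x0F];
  return s;
}
```

// For example, a line feed (0x0A) takes the short form while a vertical tab
// (0x0B) has no short form in JSON and so takes the six-byte form.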

// ----

const char*  //
handle_token(wuffs_base__token t, bool start_of_token_chain) {
  do {
    int64_t vbc = t.value_base_category();
    uint64_t vbd = t.value_base_detail();
    uint64_t token_length = t.length();
    wuffs_base__slice_u8 tok = wuffs_base__make_slice_u8(
        g_src.data.ptr + g_curr_token_end_src_index - token_length,
        token_length);

    // Handle ']' or '}'.
    if ((vbc == WUFFS_BASE__TOKEN__VBC__STRUCTURE) &&
        (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__POP)) {
      if (g_query.is_at(g_depth)) {
        return "main: no match for query";
      }
      if (g_depth <= 0) {
        return "main: internal error: inconsistent g_depth";
      }
      g_depth--;

      if (g_query.matched_all() && (g_depth >= g_flags.max_output_depth)) {
        g_suppress_write_dst--;
        // '…' is U+2026 HORIZONTAL ELLIPSIS, which is 3 UTF-8 bytes.
        TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST)
                          ? "\"[…]\""
                          : "\"{…}\"",
                      7));
      } else {
        // Write preceding whitespace.
        if ((g_ctx != context::in_list_after_bracket) &&
            (g_ctx != context::in_dict_after_brace) &&
            !g_flags.compact_output) {
          if (g_flags.output_extra_comma) {
            TRY(write_dst(",\n", 2));
          } else {
            TRY(write_dst("\n", 1));
          }
          for (uint32_t i = 0; i < g_depth; i++) {
            TRY(write_dst(
                g_flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
                g_flags.tabs ? 1 : g_flags.spaces));
          }
        }

        TRY(write_dst(
            (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST) ? "]" : "}",
            1));
      }

      g_ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                  ? context::in_list_after_value
                  : context::in_dict_after_key;
      goto after_value;
    }

    // Write preceding whitespace and punctuation, if it wasn't ']', '}' or a
    // continuation of a multi-token chain.
    if (start_of_token_chain) {
      if (g_ctx == context::in_dict_after_key) {
        TRY(write_dst(": ", g_flags.compact_output ? 1 : 2));
      } else if (g_ctx != context::none) {
        if ((g_ctx != context::in_list_after_bracket) &&
            (g_ctx != context::in_dict_after_brace)) {
          TRY(write_dst(",", 1));
        }
        if (!g_flags.compact_output) {
          TRY(write_dst("\n", 1));
          for (size_t i = 0; i < g_depth; i++) {
            TRY(write_dst(
                g_flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
                g_flags.tabs ? 1 : g_flags.spaces));
          }
        }
      }

      bool query_matched_fragment = false;
      if (g_query.is_at(g_depth)) {
        switch (g_ctx) {
          case context::in_list_after_bracket:
          case context::in_list_after_value:
            query_matched_fragment = g_query.tick();
            break;
          case context::in_dict_after_key:
            query_matched_fragment = g_query.matched_fragment();
            break;
          default:
            break;
        }
      }
      if (!query_matched_fragment) {
        // No-op.
      } else if (!g_query.next_fragment()) {
        // There is no next fragment. We have matched the complete query, and
        // the upcoming JSON value is the result of that query.
        //
        // Un-suppress writing to stdout and reset the g_ctx and g_depth as if
        // we were about to decode a top-level value. This makes any subsequent
        // indentation be relative to this point, and we will return g_eod
        // after the upcoming JSON value is complete.
        if (g_suppress_write_dst != 1) {
          return "main: internal error: inconsistent g_suppress_write_dst";
        }
        g_suppress_write_dst = 0;
        g_ctx = context::none;
        g_depth = 0;
      } else if ((vbc != WUFFS_BASE__TOKEN__VBC__STRUCTURE) ||
                 !(vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__PUSH)) {
        // The query has moved on to the next fragment but the upcoming JSON
        // value is not a container.
        return "main: no match for query";
      }
    }

    // Handle the token itself: either a container ('[' or '{') or a simple
    // value: string (a chain of raw or escaped parts), literal or number.
    switch (vbc) {
      case WUFFS_BASE__TOKEN__VBC__STRUCTURE:
        if (g_query.matched_all() && (g_depth >= g_flags.max_output_depth)) {
          g_suppress_write_dst++;
        } else {
          TRY(write_dst(
              (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST) ? "[" : "{",
              1));
        }
        g_depth++;
        g_ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                    ? context::in_list_after_bracket
                    : context::in_dict_after_brace;
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__STRING:
        if (start_of_token_chain) {
          TRY(write_dst("\"", 1));
          g_query.restart_fragment(in_dict_before_key() &&
                                   g_query.is_at(g_depth));
        }

        if (vbd & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_0_DST_1_SRC_DROP) {
          // No-op.
        } else if (vbd &
                   WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
          TRY(write_dst(tok.ptr, tok.len));
          g_query.incremental_match_slice(tok.ptr, tok.len);
        } else {
          return "main: internal error: unexpected string-token conversion";
        }

        if (t.continued()) {
          return nullptr;
        }
        TRY(write_dst("\"", 1));
        goto after_value;

      case WUFFS_BASE__TOKEN__VBC__UNICODE_CODE_POINT:
        if (!t.continued()) {
          return "main: internal error: unexpected non-continued UCP token";
        }
        TRY(handle_unicode_code_point(vbd));
        g_query.incremental_match_code_point(vbd);
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__LITERAL:
      case WUFFS_BASE__TOKEN__VBC__NUMBER:
        TRY(write_dst(tok.ptr, tok.len));
        goto after_value;
    }

    // Return an error if we didn't match the (vbc, vbd) pair.
    return "main: internal error: unexpected token";
  } while (0);

  // Book-keeping after completing a value (whether a container value or a
  // simple value). Empty parent containers are no longer empty. If the parent
  // container is a "{...}" object, toggle between keys and values.
after_value:
  if (g_depth == 0) {
    return g_eod;
  }
  switch (g_ctx) {
    case context::in_list_after_bracket:
      g_ctx = context::in_list_after_value;
      break;
    case context::in_dict_after_brace:
      g_ctx = context::in_dict_after_key;
      break;
    case context::in_dict_after_key:
      g_ctx = context::in_dict_after_value;
      break;
    case context::in_dict_after_value:
      g_ctx = context::in_dict_after_key;
      break;
    default:
      break;
  }
  return nullptr;
}

const char*  //
main1(int argc, char** argv) {
  TRY(initialize_globals(argc, argv));

  bool start_of_token_chain = true;
  while (true) {
    wuffs_base__status status = g_dec.decode_tokens(
        &g_tok, &g_src,
        wuffs_base__make_slice_u8(g_work_buffer_array, WORK_BUFFER_ARRAY_SIZE));

    while (g_tok.meta.ri < g_tok.meta.wi) {
      wuffs_base__token t = g_tok.data.ptr[g_tok.meta.ri++];
      uint64_t n = t.length();
      if ((g_src.meta.ri - g_curr_token_end_src_index) < n) {
        return "main: internal error: inconsistent g_src indexes";
      }
      g_curr_token_end_src_index += n;

      // Skip filler tokens (e.g. whitespace).
      if (t.value_base_category() == WUFFS_BASE__TOKEN__VBC__FILLER) {
        start_of_token_chain = !t.continued();
        continue;
      }

      const char* z = handle_token(t, start_of_token_chain);
      start_of_token_chain = !t.continued();
      if (z == nullptr) {
        continue;
      } else if (z == g_eod) {
        goto end_of_data;
      }
      return z;
    }

    if (status.repr == nullptr) {
      return "main: internal error: unexpected end of token stream";
    } else if (status.repr == wuffs_base__suspension__short_read) {
      if (g_curr_token_end_src_index != g_src.meta.ri) {
        return "main: internal error: inconsistent g_src indexes";
      }
      TRY(read_src());
      g_curr_token_end_src_index = g_src.meta.ri;
    } else if (status.repr == wuffs_base__suspension__short_write) {
      g_tok.compact();
    } else {
      return status.message();
    }
  }
end_of_data:

  // With a non-empty g_query, don't try to consume trailing whitespace or
  // confirm that we've processed all the tokens.
  if (g_flags.query_c_string && *g_flags.query_c_string) {
    return nullptr;
  }

  // Check that we've exhausted the input.
  if ((g_src.meta.ri == g_src.meta.wi) && !g_src.meta.closed) {
    TRY(read_src());
  }
  if ((g_src.meta.ri < g_src.meta.wi) || !g_src.meta.closed) {
    return "main: valid JSON followed by further (unexpected) data";
  }

  // Check that we've used all of the decoded tokens, other than trailing
  // filler tokens. For example, "true\n" is valid JSON (and fully consumed
  // with WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE enabled) with a trailing
  // filler token for the "\n".
  for (; g_tok.meta.ri < g_tok.meta.wi; g_tok.meta.ri++) {
    if (g_tok.data.ptr[g_tok.meta.ri].value_base_category() !=
        WUFFS_BASE__TOKEN__VBC__FILLER) {
      return "main: internal error: decoded OK but unprocessed tokens remain";
    }
  }

  return nullptr;
}

int  //
compute_exit_code(const char* status_msg) {
  if (!status_msg) {
    return 0;
  }
  size_t n;
  if (status_msg == g_usage) {
    n = strlen(status_msg);
  } else {
    n = strnlen(status_msg, 2047);
    if (n >= 2047) {
      status_msg = "main: internal error: error message is too long";
      n = strnlen(status_msg, 2047);
    }
  }
  const int stderr_fd = 2;
  ignore_return_value(write(stderr_fd, status_msg, n));
  ignore_return_value(write(stderr_fd, "\n", 1));
  // Return an exit code of 1 for regular (foreseen) errors, e.g. badly
  // formatted or unsupported input.
  //
  // Return an exit code of 2 for internal (exceptional) errors, e.g. defensive
  // run-time checks found that an internal invariant did not hold.
  //
  // Automated testing, including badly formatted inputs, can therefore
  // discriminate between expected failure (exit code 1) and unexpected failure
  // (other non-zero exit codes). Specifically, exit code 2 for internal
  // invariant violation, exit code 139 (which is 128 + SIGSEGV on x86_64
  // linux) for a segmentation fault (e.g. null pointer dereference).
  return strstr(status_msg, "internal error:") ? 2 : 1;
}
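
// The exit code policy described in compute_exit_code can be sketched
// without the stderr writes. This is only an illustration: exit_code_for is
// a hypothetical name, and the sample messages used with it below are copied
// from error strings elsewhere in this file.

```cpp
#include <cstring>

// exit_code_for maps a status message to a process exit code: 0 for success
// (a null message), 2 when the message contains "internal error:" and 1 for
// every other error.
int exit_code_for(const char* status_msg) {
  if (!status_msg) {
    return 0;
  }
  return std::strstr(status_msg, "internal error:") ? 2 : 1;
}
```

// Under this policy, "main: no match for query" is a foreseen error (exit
// code 1) while "main: internal error: unexpected token" is an invariant
// violation (exit code 2). A test harness can treat any exit code other than
// 0 or 1 as a bug in the program rather than in its input.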

int  //
main(int argc, char** argv) {
  // Look for an input filename (the first non-flag argument) in argv. If there
  // is one, open it (but do not read from it) before we self-impose a sandbox.
  //
  // Flags start with "-", unless it comes after a bare "--" arg.
  {
    bool dash_dash = false;
    int a;
    for (a = 1; a < argc; a++) {
      char* arg = argv[a];
      if ((arg[0] == '-') && !dash_dash) {
        dash_dash = (arg[1] == '-') && (arg[2] == '\x00');
        continue;
      }
      g_input_file_descriptor = open(arg, O_RDONLY);
      if (g_input_file_descriptor < 0) {
        fprintf(stderr, "%s: %s\n", arg, strerror(errno));
        return 1;
      }
      break;
    }
  }

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
  g_sandboxed = true;
#endif

  const char* z = main1(argc, argv);
  if (g_wrote_to_dst) {
    const char* z1 = write_dst("\n", 1);
    const char* z2 = flush_dst();
    z = z ? z : (z1 ? z1 : z2);
  }
  int exit_code = compute_exit_code(z);

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  // Call SYS_exit explicitly, instead of calling SYS_exit_group implicitly by
  // either calling _exit or returning from main. SECCOMP_MODE_STRICT allows
  // only SYS_exit.
  syscall(SYS_exit, exit_code);
#endif
  return exit_code;
}