JSON, YAML, and TOML

Last Updated: 2023-07-31 05:53:56 Monday


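JSON is the mainstream format for data serialization and deserialization. The YAML standard is somewhat more complex, while TOML is aimed at configuration files.

The definition of the JSON data format rests on a deep abstraction. YAML appeared at about the same time as JSON, and YAML (as of version 1.2) is a superset of JSON.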

JSON

JSON standard: https://www.rfc-editor.org/rfc/rfc8259.txt

JSON can represent four primitive types (strings, numbers, booleans, and null) and two structured types (objects and arrays).

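In other words, JSON has four primitive data types (string, number, boolean and null) and two structured types: object (written with {}) and array (written with []). An object is a key/value map, and a value of any type, including another object, can be placed in an array.

JSON data is simply these four primitive types and two structured types nested inside one another in arbitrary combinations.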

{
    "Image": {
        "Width":  800,
        "Height": 600,
        "Title":  "View from 15th Floor",
        "Thumbnail": {
            "Url":    "http://www.example.com/image/481989943",
            "Height": 125,
            "Width":  100
        },
        "Animated" : false,
        "IDs": [116, 943, 234, 38793]
    }
}
[
    {
       "precision": "zip",
       "Latitude":  37.7668,
       "Longitude": -122.3959,
       "Address":   "",
       "City":      "SAN FRANCISCO",
       "State":     "CA",
       "Zip":       "94107",
       "Country":   "US"
    },
    {
       "precision": "zip",
       "Latitude":  37.371991,
       "Longitude": -122.026020,
       "Address":   "",
       "City":      "SUNNYVALE",
       "State":     "CA",
       "Zip":       "94085",
       "Country":   "US"
    }
]
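
The six kinds listed above can be seen by parsing a small document with Python's standard json module. This is a minimal sketch; the sample text and the variable names are made up for illustration.

```python
import json

# A small document that uses every JSON kind: string, number (float and
# integer), boolean, null, array and object.
text = '{"s": "hi", "n": 1.5, "i": 3, "b": true, "z": null, "a": [1, "x"], "o": {"k": "v"}}'
data = json.loads(text)

# Map each key to the name of the Python type that json.loads produced.
kinds = {key: type(value).__name__ for key, value in data.items()}
print(kinds)
```

Objects come back as dict, arrays as list, and null as None; JSON numbers become either int or float depending on how they are written.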

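JSON is mainly used to exchange data between systems, and its design gives little weight to human readability. The standard, fixed encoding is UTF-8. JSON written by hand is error-prone; a very common mistake is one comma too many or too few. Long strings are another pain point: there is no syntax for breaking a string across lines (seen from another angle, this is also a form of simplicity).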
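
The comma mistake is easy to reproduce. The sketch below uses Python's json module; the helper name is_valid_json is hypothetical and only exists for this illustration.

```python
import json

good = '{"a": 1, "b": [1, 2, 3]}'
extra_comma = '{"a": 1, "b": [1, 2, 3],}'   # one comma too many
missing_comma = '{"a": 1 "b": [1, 2, 3]}'   # one comma too few

def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

for sample in (good, extra_comma, missing_comma):
    print(is_valid_json(sample), sample)
```

Only the first sample parses; the other two raise json.JSONDecodeError, which the helper turns into False.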

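JSON Lines

If JSON data is kept in files, one option is what was described above: one file holds one JSON document.

Another option is to store one complete JSON document per line. This is JSON Lines data, for example files with the .jsonl extension.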

["Name", "Session", "Score", "Completed"]
["Gilbert", "2013", 24, true]
["Alexa", "2013", 29, true]
["May", "2012B", 14, false]
["Deloise", "2012A", 19, true] 

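The example above looks very much like a table stored as JSON Lines.

Because the JSON format is backed by a standard, many people consider this a better text format than CSV.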
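
Reading such data needs nothing beyond parsing each line on its own. The following sketch assumes the table above is held in a string rather than a .jsonl file, and treats the first row as a header; both choices are for illustration only.

```python
import json

lines = """\
["Name", "Session", "Score", "Completed"]
["Gilbert", "2013", 24, true]
["Alexa", "2013", 29, true]
["May", "2012B", 14, false]
["Deloise", "2012A", 19, true]
"""

# Each non-empty line is parsed independently as one JSON value.
rows = [json.loads(line) for line in lines.splitlines() if line.strip()]
header, records = rows[0], rows[1:]

# Turn the table into a list of dicts keyed by the header row.
table = [dict(zip(header, record)) for record in records]
print(table[0])
```

Unlike CSV, the cell types survive: 24 is loaded as an int and true as a bool, with no per-column conversion code.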

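The json module in Python

Python's standard library provides a module for working with JSON data, and it also thoughtfully ships a small command-line JSON tool.

Using json.tool on the command line

We start with the json.tool module on the command line and create a JSON file that will later be used to test the json module in code.

Viewing the json.tool command-line options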

$ python3 -m json.tool -h

Validating the JSON in a file and pretty-printing it

$ python3 -m json.tool json.txt
{
    "c": "ccc",
    "a": 1,
    "b": 2.34,
    "d": true,
    "e": [
        1,
        2,
        3,
        4
    ],
    "f": {
        "c": 3,
        "a": 1,
        "b": 2
    }
}

Sorting keys while pretty-printing

In the example above, "c" comes first in both objects. --sort-keys sorts the keys, which makes the output even tidier.

$ python3 -m json.tool --sort-keys json.txt
{
    "a": 1,
    "b": 2.34,
    "c": "ccc",
    "d": true,
    "e": [
        1,
        2,
        3,
        4
    ],
    "f": {
        "a": 1,
        "b": 2,
        "c": 3
    }
}

Writing the pretty-printed JSON to a file

$ cat json.txt
{
    "c":"ccc",
    "a":1,
    "b":2.34,
    "d":true,
    "e":[1,2,3,4],
    "f":{"c":3,"a":1,"b":2}
}
$ python3 -m json.tool --sort-keys json.txt json-pretty-sorted.txt
$ cat json-pretty-sorted.txt
{
    "a": 1,
    "b": 2.34,
    "c": "ccc",
    "d": true,
    "e": [
        1,
        2,
        3,
        4
    ],
    "f": {
        "a": 1,
        "b": 2,
        "c": 3
    }
}

Redirection works as well:

$ python3 -m json.tool --sort-keys json.txt > json-pretty-sorted.txt

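Note that with redirection the input and output file names must not be the same. This is a common mistake: the moment the command is entered, the (non-appending) redirection truncates the file, so json.tool ends up reporting an error.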
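
When the goal really is to rewrite a file in place, one way around the truncation problem is to do it in Python and finish reading before opening the file for writing. This is a sketch under that assumption; the function name pretty_print_in_place and the temporary demo file are hypothetical.

```python
import json
import os
import tempfile

def pretty_print_in_place(path):
    """Rewrite a JSON file with sorted keys and 4-space indentation."""
    # Read and parse everything first, so the file is never opened for
    # writing while its content still has to be read.
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, sort_keys=True)
        f.write("\n")

# Demo on a throwaway file inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "json.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"c":"ccc","a":1}')
    pretty_print_in_place(path)
    with open(path, encoding="utf-8") as f:
        result = f.read()

print(result, end="")
```

If parsing fails, json.load raises before the second open call, so a file with invalid JSON is left untouched.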

Compact output

--no-indent and --compact produce almost the same output; the latter is just slightly more compact.

$ cat json.txt
[{
    "c":"ccc",
    "a":1,
    "b":2.34,
    "d":true,
    "e":[1,2,3,4],
    "f":{"c":3,"a":1,"b":2}
},
{
    "c":"ccc",
    "a":1,
    "b":2.34,
    "d":true,
    "e":[1,2,3,4],
    "f":{"c":3,"a":1,"b":2}
}]
$ python3 -m json.tool --no-indent json.txt
[{"c": "ccc", "a": 1, "b": 2.34, "d": true, "e": [1, 2, 3, 4], "f": {"c": 3, "a": 1, "b": 2}}, {"c": "ccc", "a": 1, "b": 2.34, "d": true, "e": [1, 2, 3, 4], "f": {"c": 3, "a": 1, "b": 2}}]
$ python3 -m json.tool --compact json.txt
[{"c":"ccc","a":1,"b":2.34,"d":true,"e":[1,2,3,4],"f":{"c":3,"a":1,"b":2}},{"c":"ccc","a":1,"b":2.34,"d":true,"e":[1,2,3,4],"f":{"c":3,"a":1,"b":2}}]
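
The same two layouts can be produced in code with json.dumps. The sketch below is an approximation of what the two options print, not a description of how json.tool is implemented; the sample data is shortened.

```python
import json

data = [{"c": "ccc", "a": 1, "e": [1, 2]}]

# The default separators are ", " and ": ", which resembles --no-indent.
one_line = json.dumps(data)

# Dropping the spaces from the separators resembles --compact.
compact = json.dumps(data, separators=(",", ":"))

print(one_line)
print(compact)
```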

JSON Lines support

JSON Lines means that every line of the text is one complete JSON document. (In the earlier examples, one file held one complete JSON document.)

$ cat json_compact.txt
{"c":"ccc","a":1,"b":2.34,"d":true,"e":[1,2,3,4],"f":{"c":3,"a":1,"b":2}}
{"c":"ccc","a":1,"b":2.34,"d":true,"e":[1,2,3,4],"f":{"c":3,"a":1,"b":2}}
{"c":"ccc","a":1,"b":2.34,"d":true,"e":[1,2,3,4],"f":{"c":3,"a":1,"b":2}}
$ cat json_compact.txt | python3 -m json.tool --json-lines --sort-keys --compact
{"a":1,"b":2.34,"c":"ccc","d":true,"e":[1,2,3,4],"f":{"a":1,"b":2,"c":3}}
{"a":1,"b":2.34,"c":"ccc","d":true,"e":[1,2,3,4],"f":{"a":1,"b":2,"c":3}}
{"a":1,"b":2.34,"c":"ccc","d":true,"e":[1,2,3,4],"f":{"a":1,"b":2,"c":3}}

With --json-lines, JSON data stored one document per line is parsed correctly.

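Using the json module in code

Because the JSON format is extremely similar to Python's built-in dict, the interface of the json module mainly loads JSON data from a file into a dict, or writes a dict to a file in JSON form; alternatively, it converts between a JSON string in memory and a dict, with no file involved.

load & dump (no s)

The interface names without an s operate on files.

The following code demonstrates json.load and json.dump: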

$ cat json.txt
{
    "c":"ccc",
    "a":1,
    "b":2.34,
    "d":true,
    "e":[1,2,3,4],
    "f":{"c":3,"a":1,"b":2}
}
$ python3 -q
>>> import json
>>> with open('json.txt') as f:
...   data = json.load(f)
...
>>> data
{'c': 'ccc', 'a': 1, 'b': 2.34, 'd': True, 'e': [1, 2, 3, 4], 'f': {'c': 3, 'a': 1, 'b': 2}}
>>> with open('json_copy.txt', 'w') as f:
...   json.dump(data, f)  # writing a trailing '\n' afterwards would be better
...
>>> exit()
$ cat json_copy.txt
{"c": "ccc", "a": 1, "b": 2.34, "d": true, "e": [1, 2, 3, 4], "f": {"c": 3, "a": 1, "b": 2}}

json.dump accepts many parameters that mirror the command-line tool's options, such as sort_keys.
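
As a small illustration of those parameters, the sketch below passes sort_keys and indent to json.dump. It writes to an io.StringIO buffer instead of a real file, purely to keep the example self-contained.

```python
import io
import json

data = {"c": "ccc", "a": 1, "f": {"c": 3, "a": 1}}

# io.StringIO stands in for a file opened in text mode.
buffer = io.StringIO()
json.dump(data, buffer, sort_keys=True, indent=2)
buffer.write("\n")  # json.dump does not add a trailing newline itself
text = buffer.getvalue()

print(text, end="")
```

sort_keys applies at every nesting level, so the keys of the inner object "f" are sorted too.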

loads & dumps (with s)

The interface names with an s work on strings.

>>> import json
>>> d = {'a':1,'b':2}
>>> json.dumps(d)
'{"a": 1, "b": 2}'
>>> json.loads(json.dumps(d))
{'a': 1, 'b': 2}
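
The round trip above returns an equal dict, but that is not guaranteed for every Python value, because JSON has fewer types than Python. A short sketch with made-up data:

```python
import json

original = {1: "one", "point": (3, 4), "flag": None}
restored = json.loads(json.dumps(original))

print(restored)
print(restored == original)
```

JSON object keys are always strings, so the integer key 1 comes back as "1", and the tuple comes back as a list because JSON only has arrays.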

YAML

YAML website: https://yaml.org/

Reading the YAML spec helped me understand why JSON became popular, and why YAML is a superset of JSON.

YAML™ (rhymes with “camel”, a recursive acronym for “YAML Ain’t Markup Language”) is a human-friendly, cross language, Unicode based data serialization language designed around the common native data types of dynamic programming languages. It is broadly useful for programming needs ranging from configuration files to internet messaging to object persistence to data auditing and visualization.

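The design goals for YAML are, in decreasing priority:

1. YAML should be easily readable by humans.
2. YAML data should be portable between programming languages.
3. YAML should match the native data structures of dynamic languages.
4. YAML should have a consistent model to support generic tools.
5. YAML should support one-pass processing.
6. YAML should be expressive and extensible.
7. YAML should be easy to implement and use.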

Three basic data structures

YAML represents any native data structure using three node kinds: sequence - an ordered series of entries; mapping - an unordered association of unique keys to values; and scalar - any datum with opaque structure presentable as a series of Unicode characters.

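Now look back at the structure of JSON: it can be defined and explained entirely in terms of these three basic data structures. They are a very high-level abstraction; almost any data can be described with them and their nested combinations. A tree, for example, can be expressed as sequences nested in sequences. This is also why YAML can be a superset of JSON, and why JSON is used far beyond the JavaScript language: an abstraction at this level naturally has a wide range of applications. (Consider, too, why Python's dict fits JSON so well even though JSON did not exist when Python first appeared. Different roads lead to the same place.)

Think back to a data structures course: every data structure covered there can be mapped onto these three basic kinds.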
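
To make the correspondence concrete, the sketch below classifies every value in a parsed JSON document as a mapping, a sequence or a scalar. The function names node_kind and count_kinds are hypothetical, and object keys are not counted as separate nodes here, which is a simplification.

```python
import json

def node_kind(value):
    """Classify a parsed JSON value as one of YAML's three node kinds."""
    if isinstance(value, dict):
        return "mapping"
    if isinstance(value, list):
        return "sequence"
    return "scalar"  # str, int, float, bool or None

def count_kinds(value, counts=None):
    """Walk a nested value and count the nodes of each kind (keys excluded)."""
    if counts is None:
        counts = {"mapping": 0, "sequence": 0, "scalar": 0}
    counts[node_kind(value)] += 1
    if isinstance(value, dict):
        for child in value.values():
            count_kinds(child, counts)
    elif isinstance(value, list):
        for child in value:
            count_kinds(child, counts)
    return counts

doc = json.loads('{"Image": {"Width": 800, "IDs": [116, 943], "Animated": false}}')
print(count_kinds(doc))
```

Every value falls into exactly one of the three branches, which is the sense in which JSON fits the three-kind model.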

A YAML node represents a single native data structure. Such nodes have content of one of three kinds: scalar, sequence or mapping. In addition, each node has a tag which serves to restrict the set of possible values the content can have.

A node in YAML represents one data structure, and it can in turn contain other nodes.

Scalar

The content of a scalar node is an opaque datum that can be presented as a series of zero or more Unicode characters.

Sequence

The content of a sequence node is an ordered series of zero or more nodes. In particular, a sequence may contain the same node more than once. It could even contain itself.

Mapping

The content of a mapping node is an unordered set of key/value node pairs, with the restriction that each of the keys is unique. YAML places no further restrictions on the nodes. In particular, keys may be arbitrary nodes, the same node may be used as the value of several key/value pairs and a mapping could even contain itself as a key or a value.
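
The remark that a sequence "could even contain itself" marks a real difference from JSON, whose text form is a pure tree. The sketch below shows the Python side of this using only the standard library; it does not involve a YAML parser.

```python
import json

# A Python list can contain itself, mirroring a self-containing sequence.
seq = [1, 2]
seq.append(seq)
print(seq[2] is seq)

# JSON text cannot express the cycle, so serializing it fails.
try:
    json.dumps(seq)
    cyclic_ok = True
except ValueError:
    cyclic_ok = False
print(cyclic_ok)
```

json.dumps checks for circular references by default and raises ValueError when it finds one.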

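YAML data format

YAML data must first of all be human-readable, and its format is richer than JSON's. Some aspects of the format resemble Python, again for the sake of readability, such as using indentation to define blocks. As in many scripting languages, # starts a comment. - and [] define a sequence, and : defines a mapping. Note that - and : must be followed by at least one space, which also helps readability.

PyYAML website: https://pyyaml.org/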

>>> import yaml
>>> from pprint import pprint
>>> a = """
... - abc
... - def
... - hjk
... """
>>> b = yaml.full_load(a)
>>> b
['abc', 'def', 'hjk']
>>> a = """\
... a: 1
... # line comment
... b: 2
... c:   # can be empty
... """
>>> b = yaml.full_load(a)
>>> b
{'a': 1, 'b': 2, 'c': None}

A sequence can also be written with []:

>>> a = """
... [1,2,3,4,4]
... """
>>> b = yaml.full_load(a)
>>> b
[1, 2, 3, 4, 4]

Now for some more complex combinations. Most of the examples come from the YAML website; I tested them with Python:

>>> # mapping scalar to list
>>> a = """
... american:
... - Boston Red Sox
... - Detroit Tigers
... - New York Yankees
... national:
... - New York Mets
... - Chicago Cubs
... - Atlanta Braves
... """
>>> pprint(yaml.full_load(a))
{'american': ['Boston Red Sox', 'Detroit Tigers', 'New York Yankees'],
 'national': ['New York Mets', 'Chicago Cubs', 'Atlanta Braves']}

The example above maps scalars to lists. The list items, introduced by -, do not have to be indented.

>>> # list of mapping
>>> a = """
... -
...   name: Mark McGwire
...   hr:   65
...   avg:  0.278
... -
...   name: Sammy Sosa
...   hr:   63
...   avg:  0.288
... """
>>> pprint(yaml.full_load(a))
[{'avg': 0.278, 'hr': 65, 'name': 'Mark McGwire'},
 {'avg': 0.288, 'hr': 63, 'name': 'Sammy Sosa'}]

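In a case like the one above, - may be followed directly by a line break rather than a space, but : must still be followed by at least one space.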

>>> # compact nested mapping
>>> a = """
... - name: Mark McGwire
...   hr:   65
...   avg:  0.278
... - name: Sammy Sosa
...   hr:   63
...   avg:  0.288
... """
>>> pprint(yaml.full_load(a))
[{'avg': 0.278, 'hr': 65, 'name': 'Mark McGwire'},
 {'avg': 0.288, 'hr': 63, 'name': 'Sammy Sosa'}]
>>> # sequence of sequence, or list of list
>>> a = """
... - [name        , hr, avg  ]
... - [Mark McGwire, 65, 0.278]
... - [Sammy Sosa  , 63, 0.288]
... """
>>> pprint(yaml.full_load(a))
[['name', 'hr', 'avg'], ['Mark McGwire', 65, 0.278], ['Sammy Sosa', 63, 0.288]]
>>> # mapping of mapping
>>> a = """
... Mark McGwire: {hr: 65, avg: 0.278}
... Sammy Sosa: {
...     hr: 63,
...     avg: 0.288,
...  }
... """
>>> pprint(yaml.full_load(a))
{'Mark McGwire': {'avg': 0.278, 'hr': 65},
 'Sammy Sosa': {'avg': 0.288, 'hr': 63}}

YAML uses three dashes (“---”) to separate directives from document content. This also serves to signal the start of a document if no directives are present. Three dots ( “...”) indicate the end of a document without starting a new one, for use in communication channels.

In other words, --- marks the start of a document, and ... marks the end of a document with no new document started after it.

>>> # two documents in stream
>>> a = """
... # Ranking of 1998 home runs
... ---
... - Mark McGwire
... - Sammy Sosa
... - Ken Griffey
... 
... # Team ranking
... ---
... - Chicago Cubs
... - St Louis Cardinals
... """
>>> b = yaml.full_load_all(a)
>>> for doc in b:
...   pprint(doc)
... 
['Mark McGwire', 'Sammy Sosa', 'Ken Griffey']
['Chicago Cubs', 'St Louis Cardinals']
>>> # Play by Play Feed from a Game
>>> a = """
... ---
... time: 20:03:20
... player: Sammy Sosa
... action: strike (miss)
... ...
... ---
... time: 20:03:47
... player: Sammy Sosa
... action: grand slam
... ...
... """
>>> b = yaml.full_load_all(a)
>>> for doc in b:
...   pprint(doc)
... 
{'action': 'strike (miss)', 'player': 'Sammy Sosa', 'time': 72200}
{'action': 'grand slam', 'player': 'Sammy Sosa', 'time': 72227}

After ... ends a document, the next document in this example is opened with a new ---.
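
The time values in the output above are integers, not strings. A likely explanation, stated here as an assumption rather than taken from the original text: the loader follows YAML 1.1, which reads colon-separated digits such as 20:03:20 as a base-60 (sexagesimal) integer. The arithmetic can be checked with a small helper; the name sexagesimal_to_int is hypothetical.

```python
def sexagesimal_to_int(text):
    """Interpret colon-separated digit groups as a base-60 integer."""
    value = 0
    for part in text.split(":"):
        value = value * 60 + int(part)
    return value

print(sexagesimal_to_int("20:03:20"))  # 72200
print(sexagesimal_to_int("20:03:47"))  # 72227
```

Both results match the integers printed by the transcript. Quoting the value in the YAML source, as in time: "20:03:20", keeps it a string.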

>>> # Single Document with Two Comments
>>> a = """
... ---
... hr: # 1998 hr ranking
... - Mark McGwire
... - Sammy Sosa
... # 1998 rbi ranking
... rbi:
... - Sammy Sosa
... - Ken Griffey
... """
>>> pprint(yaml.full_load(a))
{'hr': ['Mark McGwire', 'Sammy Sosa'], 'rbi': ['Sammy Sosa', 'Ken Griffey']}

A node that occurs more than once can be labeled with & (an anchor) and referenced later with * (an alias):

>>> # Node for “Sammy Sosa” appears twice in this document
>>> a = """
... ---
... hr:
... - Mark McGwire
... # Following node labeled SS
... - &SS Sammy Sosa
... rbi:
... - *SS # Subsequent occurrence
... - Ken Griffey
... """
>>> pprint(yaml.full_load(a))
{'hr': ['Mark McGwire', 'Sammy Sosa'], 'rbi': ['Sammy Sosa', 'Ken Griffey']}

? followed by a space introduces a complex mapping key:

? - Detroit Tigers
  - Chicago cubs
: - 2001-07-23

? [ New York Yankees,
    Atlanta Braves ]
: [ 2001-07-02, 2001-08-12,
    2001-08-14 ]

This maps a list to a list. I do not yet know how to parse this format with Python.
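
One constraint on the Python side is certain: dict keys must be hashable, and a list is not. The sketch below shows how the same data could be held in Python with tuples as keys. It is written by hand with the standard library only; how a particular YAML loader treats sequence keys is not covered here.

```python
from datetime import date

teams_as_list = ["Detroit Tigers", "Chicago cubs"]

# A list is unhashable, so using it as a dict key raises TypeError.
try:
    {teams_as_list: [date(2001, 7, 23)]}
    list_key_ok = True
except TypeError:
    list_key_ok = False

# A tuple holds the same items and is hashable, so it can serve as the key.
schedule = {
    tuple(teams_as_list): [date(2001, 7, 23)],
    ("New York Yankees", "Atlanta Braves"): [
        date(2001, 7, 2), date(2001, 8, 12), date(2001, 8, 14)],
}

print(list_key_ok)
print(schedule[("Detroit Tigers", "Chicago cubs")])
```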

The | indicator means that line breaks inside the scalar node are preserved:

>>> # | means all line breaks are significant.
>>> a = """
... - |
...   a b c d
...   1 2 3 4
... - |
...   abcd
...   1234
... """
>>> pprint(yaml.full_load(a))
['a b c d\n1 2 3 4\n', 'abcd\n1234\n']
>>> b = yaml.full_load(a)
>>> for it in b:
...   print(it)
... 
a b c d
1 2 3 4

abcd
1234

>>> 

The > indicator means that line breaks inside the scalar node are replaced by spaces:

>>> # Folded newlines are preserved for “more indented” and blank lines
>>> a = """
... --- >
...  Sammy Sosa completed another
...  fine season with great stats.
... 
...    63 Home Runs
...    0.288 Batting Average
... 
...  What a year!\
... """
>>> print(yaml.full_load(a))
Sammy Sosa completed another fine season with great stats.

  63 Home Runs
  0.288 Batting Average

What a year!

Folded newlines are preserved for “more indented” and blank lines

The default style differs from both | and >: it is an unquoted (plain) flow scalar:

>>> a = """
... -
...   a b c d
...   1 2 3 4
... - 
...   abcd
...   1234
... """
>>> print(yaml.full_load(a))
['a b c d 1 2 3 4', 'abcd 1234']  # the last \n is missing
>>> # Indentation determines scope
>>> a = """
... name: Mark McGwire
... accomplishment: >
...   Mark set a major league
...   home run record in 1998.
... stats: |
...   65 Home Runs
...   0.278 Batting Average
... """
>>> pprint(yaml.full_load(a))
{'accomplishment': 'Mark set a major league home run record in 1998.\n',
 'name': 'Mark McGwire',
 'stats': '65 Home Runs\n0.278 Batting Average\n'}

A YAML flow scalar is, as I understand it, a string. It comes in three styles: double-quoted, single-quoted and plain (unquoted). Each provides a different trade-off between readability and expressive power.

The double-quoted style provides escape sequences. The single-quoted style is useful when escaping is not needed. All flow scalars can span multiple lines; line breaks are always folded. In other words, double quotes support escapes, single quotes do not, and a line break is always replaced by a space.

>>> # quoted scalars and multi-line flow scalar
>>> a = """
... unicode: "Sosa did fine.\u263A"
... hex esc: "\x0d\x0a is \r\n"
... 
... single: '"Howdy!" he cried.'
... quoted: ' # Not a ''comment''.'
... tie-fighter: '|\-*-/|'
... 
... plain:
...   This unquoted scalar
...   spans many lines.
... 
... quoted: "So does this
...   quoted scalar.\n"
... """
>>> pprint(yaml.full_load(a))
{'hex esc': ' is ',
 'plain': 'This unquoted scalar spans many lines.',
 'quoted': 'So does this quoted scalar. ',
 'single': '"Howdy!" he cried.',
 'tie-fighter': '|\\-*-/|',
 'unicode': 'Sosa did fine.☺'}

Note that a is an ordinary (non-raw) Python string here, so Python itself expands \u263A, \x0d\x0a, \r\n and \n before the YAML parser sees the text. That is why 'hex esc' loads as ' is ' and the 'quoted' value ends with a space instead of a newline. Writing the literal as r"""...""" would hand the escapes to YAML unchanged. The key quoted also appears twice, and only the later value survives in the result.

>>> # data type in python
>>> a = """
... decimal_neg: -123
... decimal_pos: +123
... decimal: 12345
... octal: 014
... hex: 0xFFFF
... float: 1.23456
... exponential: 1.234e+3
... neg inf: -.inf
... not a number: .nan
... kong: null
... kong_2: 
... booleans: [true, false]
... string: '12312abc'
... """
>>> pprint(yaml.full_load(a))
{'booleans': [True, False],
 'decimal': 12345,
 'decimal_neg': -123,
 'decimal_pos': 123,
 'exponential': 1234.0,
 'float': 1.23456,
 'hex': 65535,
 'kong': None,
 'kong_2': None,
 'neg inf': -inf,
 'not a number': nan,
 'octal': 12,
 'string': '12312abc'}

That is enough of a summary for now. It should at least be sufficient to read most .yml files.

TOML

TOML—Tom’s Obvious Minimal Language—is a reasonably new configuration file format that the Python community has embraced over the last couple of years. TOML plays an essential part in the Python ecosystem. Many of your favorite tools rely on TOML for configuration, and you’ll use pyproject.toml when you build and distribute your own packages.

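TOML was created by Tom Preston-Werner, a co-founder of GitHub, and is positioned as a configuration file format. The Python community has taken to TOML and already uses it widely; Python 3.11 added a package for parsing TOML to the standard library: tomllib. (Code that targets versions before 3.11 can use tomli, which is where tomllib's code comes from.)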

TOML aims to be a minimal configuration file format that's easy to read due to obvious semantics. TOML is designed to map unambiguously to a hash table. TOML should be easy to parse into data structures in a wide variety of languages.

TOML is designed to map unambiguously to a hash table, and mainstream programming languages now widely support TOML configuration files.
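
In Python the hash table is a dict. The sketch below parses a small TOML text with tomllib, falling back to tomli on older interpreters as mentioned above. The configuration content (title, server, host, port, tags) is invented for illustration.

```python
try:
    import tomllib  # standard library since Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same interface

text = """
title = "demo"

[server]
host = "127.0.0.1"
port = 8080
tags = ["a", "b"]
"""

# loads() takes a str; tomllib.load() expects a file opened in binary mode.
config = tomllib.loads(text)
print(config)
```

The [server] table becomes a nested dict, and the values arrive already typed: port is an int and tags is a list of strings.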

TOML website: https://toml.io/en/

Link to this article: https://cs.pynote.net/sf/202201141/
