Python YAML: A Comprehensive Guide for Beginners

Python's extensive standard library includes modules that meet most of an average developer's coding needs. On top of that, the Python Package Index hosts hundreds of thousands of third-party packages that make a developer's life easier.

However, Python still has one drawback. Its standard library does not support the YAML data format, which is popular for configuration and serialization, even though YAML's syntax resembles Python's.

In this comprehensive guide, we will walk you through working with YAML in Python with third-party libraries, specifically PyYAML.

Understanding YAML's Syntax

As mentioned earlier, YAML shares similarities with Python, and when you look at YAML, you will realize that its block indentation is the same as that of Python.

YAML draws heavily on several other languages and data formats. To delimit blocks, it needs no brackets or closing tags; it relies on the leading whitespace in each line to define the block's scope.

family_tree:
  father:
    son:
      name: James
    daughter:
      name: Emily

In this document, we have a family tree with "family_tree" as the root element.

The immediate child of the root element is the "father," who has two children: a son named James and a daughter named Emily. Each individual's name is defined using the "name" attribute at the lowest level in the tree.

It's interesting to note that YAML doesn't allow tabs for indentation. If a tab character slips into the indentation of your YAML, the parser will report a syntax error.

This is in line with PEP 8, a Python style guide published in 2001 that aims to improve the readability and consistency of code and recommends spaces over tabs.

If you prefer, you can use the inline syntax, also known as flow style, that YAML takes from JSON. Here's how you could write the same YAML as above in this syntax:

family_tree:
  father:
    son: {name: James}
    daughter: {'name': "Emily"}

As you can see, YAML doesn't have a problem with you mixing the inline and indented blocks in the same document.
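To check that both notations describe the same data, here is a minimal sketch. It assumes PyYAML, the third-party library this guide installs later, is already available, and it uses yaml.safe_load() to parse each document.

```python
import yaml  # third-party package: PyYAML

block_style = """
family_tree:
  father:
    son:
      name: James
    daughter:
      name: Emily
"""

flow_style = """
family_tree:
  father:
    son: {name: James}
    daughter: {'name': "Emily"}
"""

# Parse both documents into plain Python dictionaries.
block_data = yaml.safe_load(block_style)
flow_data = yaml.safe_load(flow_style)

# The indented and inline notations should yield identical data.
print(block_data == flow_data)
print(flow_data["family_tree"]["father"]["daughter"]["name"])
```

If the two notations are equivalent, the comparison prints True, and the nested lookup returns the daughter's name.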

You might have also noticed that one key and its value are enclosed in single and double quotes, respectively. Quoting is usually optional, but the two styles behave differently: double-quoted strings interpret YAML's special character sequences, such as \n, while single-quoted strings keep them as literal text.

These sequences start with a backslash (\), similar to how it's done in Python.

The most common case where you must use quotes in YAML is when declaring a string that could otherwise be interpreted as another data type. The best example of this is True – it will be treated as a Python Boolean unless it is enclosed in quotes.
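The quoting rules can be tried out with a short sketch, assuming PyYAML is installed. The key names below are made up for illustration, and the raw string keeps the backslashes intact so that the YAML parser, not Python, interprets them.

```python
import yaml  # third-party package: PyYAML

# A raw string, so "\n" reaches the YAML parser as a backslash and an "n".
document = r"""
plain: True
quoted: "True"
double: "first\nsecond"
single: 'first\nsecond'
"""

values = yaml.safe_load(document)

# Unquoted True becomes a Boolean, while the quoted one stays a string.
print(type(values["plain"]).__name__, type(values["quoted"]).__name__)

# Double quotes turn \n into a real newline; single quotes keep it literal.
print(repr(values["double"]))
print(repr(values["single"]))
```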

Data Structures in YAML

Like most computer languages, YAML has data structures. But there are only three of them, and they are inspired by Perl, which used to be a popular scripting language. The structures are scalars, arrays, and hashes.

Scalars are numeric, string, or Boolean values. Arrays, which YAML calls sequences, are ordered lists of values. Finally, hashes are associative arrays composed of key-value pairs, which is why they're sometimes called maps or dictionaries.

Defining a scalar in YAML is similar to defining a literal in Python. Here's how you can define different data types in YAML:

Data Type	YAML
Null	null, ~
Boolean	true, false (Before YAML 1.2: yes, no, on, off)
Integer	0x10, 0b10, 10, 0o10 (Before YAML 1.2: 010)
Floating-Point	12.5e-9, .nan, .inf, 5.21
String	This is a string
Date and Time	2023-11-01, 14:59; 2023-11-01, 14:59:59

YAML allows you to write its reserved words in uppercase, lowercase, or title case, and it parses each of these variants into the relevant data type. But bear in mind that YAML will treat any other case variant of the words as plain text.

This means writing null, NULL, and Null will mean the same thing. However, if letters are capitalized randomly (like in nUll), YAML will treat it as text. It's worth mentioning that leaving a value blank has the same effect as writing null or mentioning its alternative, the ~ symbol.

While it's great that YAML offers implicit typing, its developers realized how this can cause serious issues in edge cases. This is why built-in literals like yes and no were removed from YAML in version 1.2.
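The following sketch shows how these literals come out in Python. It assumes PyYAML, which follows YAML 1.1 and therefore still accepts the older yes literal. The key names are placeholders chosen for this example.

```python
import yaml  # third-party package: PyYAML (implements YAML 1.1)

document = """
lower: null
upper: NULL
title: Null
mixed: nUll
tilde: ~
blank:
legacy_bool: yes
hex_int: 0x10
"""

values = yaml.safe_load(document)

# The three supported spellings, the tilde, and a blank value all become None.
print([values[key] for key in ("lower", "upper", "title", "tilde", "blank")])

# A randomly capitalized variant is kept as plain text.
print(repr(values["mixed"]))

# Under YAML 1.1 rules, "yes" is still read as a Boolean.
print(values["legacy_bool"], values["hex_int"])
```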

Now that you're familiar with scalars, it's time to learn about sequences. YAML sequences work just like JSON arrays or Python lists. To define a sequence inline, enclose its values in square brackets and separate them with commas.

You can also define sequences in block style, which involves placing each item on its own line with a leading dash. Here's what both styles look like:

desserts: [cake, ice cream, pie]

snacks:
  - popcorn
  - pretzels
  - chips

When working with sequences in YAML, don't hesitate to add indentation if it makes it easier for you to read. The extra whitespace won't cause any parsing problems.
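As a quick check, the sketch below (assuming PyYAML is installed) loads both sequence styles and also a variant of the snacks list without the extra indentation, to show that the parser treats them alike.

```python
import yaml  # third-party package: PyYAML

document = """
desserts: [cake, ice cream, pie]
snacks:
  - popcorn
  - pretzels
  - chips
"""

# The same block sequence written without indenting the dashes.
compact = "snacks:\n- popcorn\n- pretzels\n- chips\n"

menu = yaml.safe_load(document)
compact_menu = yaml.safe_load(compact)

# Both styles become ordinary Python lists.
print(menu["desserts"])
print(menu["snacks"] == compact_menu["snacks"])
```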

The third type of data structure in YAML is the hash. Hashes are like objects in JavaScript or dicts in Python. As you'd expect, they consist of key-value pairs, and the keys are also called property names or attributes.

The keys are followed by a colon (:) and then the corresponding values. You've seen an instance of YAML hashes in this post already, but here's one that's more fleshed out:

employee:
  name: Alice
  jobTitle: Manager
  department: Sales
  salary: 60000
  contact:
    email: [email protected]
    phone: +1-123-456-7890
  projects:
    - name: Project A
      status: In Progress
    - name: Project B
      status: Completed

In this code, all the important details of an employee have been defined. This employee is named Alice and holds the position of a Manager in the Sales department. Her annual salary is set at $60,000.

Additionally, her contact information is specified, which includes an email address ([email protected]) and a phone number (+1-123-456-7890).

Furthermore, you can see that this employee is associated with a list of projects they are involved in. There are two projects listed.

The first project, named "Project A," is currently marked as "In Progress," while the second project, named "Project B," is marked as "Completed."
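Once loaded, a hash like this behaves as nested dictionaries and lists in Python. The sketch below assumes PyYAML and uses a trimmed copy of the employee document, leaving out the contact block, to show how the values might be accessed.

```python
import yaml  # third-party package: PyYAML

document = """
employee:
  name: Alice
  jobTitle: Manager
  department: Sales
  salary: 60000
  projects:
    - name: Project A
      status: In Progress
    - name: Project B
      status: Completed
"""

employee = yaml.safe_load(document)["employee"]

# Scalars keep their types: the salary is an int, the name a str.
print(employee["name"], employee["salary"] + 1000)

# The projects sequence becomes a list of dictionaries.
in_progress = [
    project["name"]
    for project in employee["projects"]
    if project["status"] == "In Progress"
]
print(in_progress)
```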

YAML is also flexible about key names: they can contain whitespace characters, so a key such as job title is valid without quotes. With the explicit key syntax, which starts with a question mark, a key can even span multiple lines!

It's more impressive that YAML doesn't restrict you to using strings only. You can create a hash that holds any data type as a key, which is something you cannot do with JSON.
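To illustrate non-string keys, here is a small sketch assuming PyYAML. The keys and values are invented for the example. It also tries to serialize the result with the standard json module, which is expected to reject the date key.

```python
import json

import yaml  # third-party package: PyYAML

document = """
42: an integer key
2.5: a float key
2023-11-01: a date key
null: a null key
"""

data = yaml.safe_load(document)

# Each key keeps the type that YAML inferred for it.
key_types = {type(key).__name__ for key in data}
print(sorted(key_types))

# JSON object keys must be strings, so the date key cannot be encoded.
try:
    json.dumps(data)
    json_accepts_keys = True
except TypeError:
    json_accepts_keys = False
print(json_accepts_keys)
```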

So far, we've only covered the basics of YAML. But it has a lot more features packed into it. We'll explore some of them in the next section.

YAML's Advanced Features

YAML offers several advanced features, including tags, merged attributes, anchors, aliases, multi-doc streams, and flow and block styles. You'll learn how to use some of these in this section.

Type Systems in YAML

One of the features that makes YAML stand out is how it integrates with the type systems of modern programming languages. YAML's alternatives, XML and JSON, cannot work with data types at the level YAML does.

For example, YAML can serialize data types that are built into Python, including date and time:

YAML	Python
2023-11-01 08:30:45	datetime.datetime(2023, 11, 1, 8, 30, 45)
2023-11-01	datetime.date(2023, 11, 1)
8:30:45	30645
30:45	1845

As you can see, YAML is familiar with date and time formats. It can also work with different time zones.

The above table showcases how the dates and timestamps are deserialized for use by Python. For instance, the 8:30:45 time value gets deserialized into the number of seconds that have passed since midnight, because YAML 1.1 reads colon-separated digits as a base-60 (sexagesimal) integer.

Using YAML seems like a convenient way to convert timestamps into the number of seconds. However, PyYAML, the library used in this guide, implements YAML 1.1, which is an older version of the specification, and YAML 1.2 no longer includes this notation.

Parsing such literals in YAML 1.1 can also produce unexpected outcomes. For example, PyYAML leaves a time value with a leading zero, such as 08:30:45, as a string rather than converting it to a number.
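The behavior described above can be observed with the following sketch, which assumes PyYAML and feeds it a handful of date and time literals one at a time.

```python
import datetime

import yaml  # third-party package: PyYAML (implements YAML 1.1)

samples = [
    "2023-11-01 08:30:45",
    "2023-11-01",
    "8:30:45",
    "30:45",
    "08:30:45",
]

# Parse every literal on its own and remember the resulting Python object.
parsed = {text: yaml.safe_load(text) for text in samples}

for text, value in parsed.items():
    print(f"{text!r:>24} -> {value!r}")

# Base-60 arithmetic behind the 8:30:45 literal.
print(8 * 3600 + 30 * 60 + 45)
```

The first two literals should come back as datetime objects, the next two as integers, and the one with a leading zero as a plain string.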

If you want to rule out the ambiguity for YAML, you will need to cast values to specific data types. You can do this with YAML tags, which start with two exclamation points.

Though you can alternatively use language-independent tags, your parser may or may not support them. It's better to write !! followed by the data type you want to store, like so:

vehicle: !!str Car
make: !!str Toyota
model: !!str Camry
year: 2023
color: Blue
coordinates:
  latitude: 37.7749
  longitude: -122.4194
features: !!set
  ? Air Conditioning
  ? GPS Navigation
  ? Bluetooth Connectivity
owner:
  name: Sarah Smith
  age: 35
  city: New York
rating: 4.7
is_available: true

As you can see, the !!str tag makes YAML treat the values as regular strings. The question marks in the middle represent a mapping key.

Adding question marks in code isn't necessary, but it helps define a compound key from a different collection of data or a key using reserved words.

What's interesting is that you can add images and other resources using the !!binary tag. The tag embeds Base64-encoded binary files and turns them into instances of bytes for Python to process.
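Here is a minimal sketch of explicit tags in action, assuming PyYAML. The key names are hypothetical, and the Base64 payload is simply the word hello, standing in for a real image.

```python
import yaml  # third-party package: PyYAML

document = """
zip_code: !!str 90210
untagged: 90210
ratio: !!float 3
greeting: !!binary aGVsbG8=
"""

values = yaml.safe_load(document)

# The !!str tag keeps the digits as text; without it they become an int.
print(repr(values["zip_code"]), repr(values["untagged"]))

# The !!float tag casts the integer literal to a float.
print(repr(values["ratio"]))

# The !!binary tag decodes the Base64 text into a bytes object.
print(repr(values["greeting"]))
```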

The YAML example above translates to the Python dictionary below:

{
    'vehicle': 'Car',
    'make': 'Toyota',
    'model': 'Camry',
    'year': 2023,
    'color': 'Blue',
    'coordinates': {
        'latitude': 37.7749,
        'longitude': -122.4194
    },
    'features': {'Air Conditioning', 'GPS Navigation', 'Bluetooth Connectivity'},
    'owner': {
        'name': 'Sarah Smith',
        'age': 35,
        'city': 'New York'
    },

    'rating': 4.7,
    'is_available': True
}

Can you see how the parser has turned the attributes into specific data types?
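To see where a set like the one under 'features' might come from, here is a hedged sketch assuming PyYAML. It uses a shortened, illustrative document in which the !!set tag and question-mark keys describe the features.

```python
import yaml  # third-party package: PyYAML

document = """
vehicle: Car
year: 2023
features: !!set
  ? Air Conditioning
  ? GPS Navigation
  ? Bluetooth Connectivity
is_available: true
"""

car = yaml.safe_load(document)

# Each "? key" entry is a mapping key without a value; !!set collects the keys.
print(type(car["features"]).__name__)
print(sorted(car["features"]))

# The untagged scalars are typed implicitly.
print(type(car["year"]).__name__, type(car["is_available"]).__name__)
```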

Anchors and Aliases

This interesting feature allows you to define an element once so you can refer to it several times in the document. This feature can come in handy in many situations, for instance, when reusing one address to create an invoice.

You can do this by declaring an anchor with the ampersand (&) symbol. You can then dereference it with the asterisk (*) symbol later in the document. Let's see anchors in action:

fruits:
  - &apple
    name: Apple
    color: Red
  - &banana
    name: Banana
    color: Yellow

fruit_basket:
  - *apple
  - *banana
  - &orange
    name: Orange
    color: Orange

selected_fruit: *banana

In this example, we define a list of fruits under the "fruits" key. Two fruits, Apple and Banana, are given as items in the list. We use anchor labels (&apple and &banana) to mark these items, allowing us to reuse their properties later.

Next, we create a "fruit_basket" list. We use the * notation to refer to the previously defined fruits. The first two items in the "fruit_basket" are aliases for the Apple and Banana fruits, while the third item, an Orange, is defined along with its properties.

Finally, we create "selected_fruit" and set it to the alias *banana. This means that "selected_fruit" references the properties of the Banana fruit.
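The sketch below, which assumes PyYAML, loads a shortened version of this document and checks what the aliases turn into on the Python side. With PyYAML, an alias resolves to the very same object as its anchor rather than to a copy.

```python
import yaml  # third-party package: PyYAML

document = """
fruits:
  - &apple
    name: Apple
    color: Red
  - &banana
    name: Banana
    color: Yellow

fruit_basket:
  - *apple
  - *banana

selected_fruit: *banana
"""

data = yaml.safe_load(document)

# The alias carries all the properties of the anchored item.
print(data["selected_fruit"])

# Anchor and alias point at one shared dictionary.
print(data["selected_fruit"] is data["fruits"][1])
```

Because the objects are shared, changing data["selected_fruit"] in Python would also change the banana entry in the other two lists.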

YAML also supplies the flexibility of inheriting and overriding attributes with the merge (<<) feature. Here's how this works:

base_properties: &base
  shape: Circle
  color: Red

custom_properties: &custom
  size: Large

merged_object:
  <<: [*base, *custom]
  color: Blue
  weight: 2 kg

Here, we have two sets of properties defined as anchors, namely base_properties and custom_properties.

The merged_object is created by merging the attributes from both base_properties and custom_properties. This results in the shape and color properties being inherited from base_properties and the size property from custom_properties.

In the merged_object, we override the color property by setting it to "Blue" and introduce a new property, weight, with a value of "2 kg."
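To confirm the result of the merge, here is a sketch assuming PyYAML. It writes the two merges as a single list, which avoids repeating the << key within one mapping.

```python
import yaml  # third-party package: PyYAML

document = """
base_properties: &base
  shape: Circle
  color: Red

custom_properties: &custom
  size: Large

merged_object:
  <<: [*base, *custom]
  color: Blue
  weight: 2 kg
"""

merged = yaml.safe_load(document)["merged_object"]

# Inherited keys come from the anchors; keys set locally take precedence.
for key in sorted(merged):
    print(key, "=", merged[key])
```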

If you're interested in folding lines but want to maintain the indentation determined by the paragraph's first line, you can use the > indicator like this:

folded_text: >
  This is an example of folding lines with
  the greater than sign (>) indicator in YAML.
  It preserves the first-line indentation.

  It allows you to represent multi-line text
  while keeping the initial indentation determined
  by the first line of the paragraph.

This is a convenient way to represent multi-line text with consistent formatting in YAML. Each line break is folded into a space, while a blank line becomes a newline. Its output in Python is:

{'folded_text': 'This is an example of folding lines with the greater than sign (>) indicator in YAML. It preserves the first-line indentation.\nIt allows you to represent multi-line text while keeping the initial indentation determined by the first line of the paragraph.\n'}
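The folding rules can be verified with a shorter document. This is a sketch assuming PyYAML, and the sample text is made up for the example.

```python
import yaml  # third-party package: PyYAML

document = """
folded_text: >
  This is an example of folding lines with
  the greater than sign (>) indicator in YAML.

  A blank line starts a new paragraph.
"""

text = yaml.safe_load(document)["folded_text"]

# Single line breaks turn into spaces; the blank line turns into one newline.
print(repr(text))
print(text.count("\n"))
```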

Flow and Block Styles

The two styles of scalars in YAML provide you with different levels of control over multi-line strings and newline handling.

A flow scalar typically starts on the same line as its key and can span several lines, like so:

details: This is an example of flow style
  that spans multiple lines.
  It provides a concise way to represent
  complex data structures in YAML.

Leading and trailing whitespace on each line of these scalars is stripped, and each line break is folded into a single space. This turns paragraphs into lines, similar to how HTML works.

On the other hand, block scalars can alter how the indentation, newlines, and trailing newlines work. For instance, putting a "|" after the name of an attribute preserves the newlines. This comes in handy when embedding scripts:

code_block: |
  def calculate_sum(a, b):
      """
      This function calculates the sum of two numbers.
      """
      return a + b
  if __name__ == "__main__":
      result = calculate_sum(5, 7)
      print("The result is:", result)

This YAML document defines a property called "code_block." The document holds a Python script, but without the pipe indicator, the YAML parser would have treated the script as nested elements.
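The difference between the two styles shows up clearly once the values are loaded. The sketch below assumes PyYAML and uses a shortened script. It checks that the flow scalar collapses into one line, while the literal block keeps its line breaks and indentation well enough to compile as Python.

```python
import yaml  # third-party package: PyYAML

document = '''
details: This is a plain scalar
  that spans multiple lines.
code_block: |
  def calculate_sum(a, b):
      return a + b
  print(calculate_sum(5, 7))
'''

data = yaml.safe_load(document)

# The flow scalar is folded into a single line.
print(repr(data["details"]))

# The literal block keeps every newline and the inner indentation.
print(data["code_block"])

# Compiling succeeds only if the script's indentation survived intact.
compiled = compile(data["code_block"], "<yaml>", "exec")
```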

It's worth noting that you can store several YAML documents in one file by separating the documents with a triple dash (---). You can also mark the end of a document explicitly with three dots (...).
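A stream with several documents can be read in one go. Here is a sketch assuming PyYAML, whose yaml.safe_load_all() function yields one Python object per document. The document contents are placeholders.

```python
import yaml  # third-party package: PyYAML

stream = """
name: first
---
name: second
...
---
name: third
"""

# safe_load_all() returns a generator, so collect the documents in a list.
documents = list(yaml.safe_load_all(stream))

print(len(documents))
print([document["name"] for document in documents])
```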

Setting Up PyYAML On Your Machine

As mentioned earlier, Python doesn't support YAML out of the box, so you must take some extra steps to work with the format.

Specifically, you need to set up a third-party library that can deserialize YAML into Python objects and serialize the objects into YAML.

Additionally, installing the following command-line tools using pip can help you debug your code:

shyaml: A command-line YAML processor.
yq: Another YAML processor that works on CLI. Based on jq, it is suitable for filtering data.
yamllint: A linter for YAML. It checks for syntactical issues and more.

Instead of these Python-only tools, you could choose to install yq, a popular Go implementation, to work with YAML and Python. Bear in mind that this tool has a slightly different command-line interface.

If you're not interested in downloading anything on your machine, feel free to use one of the popular online tools like YAML Parser. There are several similar tools available online, and most are free to use!

Serializing YAML as JSON

Every Python programmer reads about how flexible the language is, but the proof is in the pudding. Though Python doesn't support YAML directly, you can get it working using the json module that is built into Python.

If you're unfamiliar with how the two formats relate, JSON is, for practical purposes, a subset of YAML, and officially so since YAML 1.2. This means you can dump your Python data in the JSON format, and YAML parsers will accept it as an input.

Let's see this in action by creating a simple Python script that prints out a JSON dictionary:

# print_data.py
import json
data = {
    "name": "Alice",
    "age": 30,
    "city": "New York",
    "interests": ["Reading", "Traveling"],
    "is_student": True,
}

print(json.dumps(data, indent=4))

At the end of the code, we're calling json.dumps() to serialize the dictionary into a JSON-formatted string, which print() then writes to standard output.

Now, run the script and feed its output to a command-line YAML parser (e.g., yq or shyaml) through a Unix pipeline (|):

$ python print_data.py | yq -y .
name: Alice
age: 30
city: New York
interests:
  - Reading
  - Traveling
is_student: true

$ python print_data.py | shyaml get-value
name: Alice
age: 30
city: New York
interests:
  - Reading
  - Traveling
is_student: true

As you can see, both the parsers accepted the data and formatted it in YAML – without any hiccups.
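The same check can be done without leaving Python. The sketch below assumes PyYAML is installed, which the next section covers, and passes the JSON text straight to a YAML parser.

```python
import json

import yaml  # third-party package: PyYAML

data = {
    "name": "Alice",
    "age": 30,
    "city": "New York",
    "interests": ["Reading", "Traveling"],
    "is_student": True,
}

# Serialize with the standard library, then parse the text as YAML.
json_text = json.dumps(data, indent=4)
round_tripped = yaml.safe_load(json_text)

# If JSON is valid YAML here, the data survives unchanged.
print(round_tripped == data)
```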

But notice how we've requested that the yq parser do the transcoding with the -y option since yq is only a thin wrapper built on top of JSON's jq. This also means you will need to install jq on your machine before you can install yq.

Of course, we've also put a trailing dot after the -y flag as a filtering expression. You may also notice a small difference between yq and shyaml in the resulting indentation.

Aren't these parsers great? It might feel like cheating to make working with YAML this easy, but the problem with this trick is that it only works one way.

The json module can produce output that YAML parsers accept, but you cannot use it to read a YAML document back into Python.

The good news? There is another way to do it.

Installing the PyYAML Library

As far as YAML libraries go, the PyYAML library has remained the go-to choice for Python programmers ever since its release. To this day, it's one of the top packages downloaded on PyPI.

Its interface is similar to that of the built-in json module, and it is recommended on the official YAML website, which reinforces its reputation. As you'd expect, it's actively maintained, and installing it in your virtual environment is as simple as running:

(venv) $ python -m pip install pyyaml

One of the best things about this library is that it requires no dependencies because it is written in pure Python.

However, it's commonplace for distribution bundles to come with a compiled C binding for the LibYAML library. The binding makes PyYAML a lot faster and a must-have for any serious programmer.

You can double-check that your installation comes with a C binding by launching Python's interpreter and running the following:

>>> import yaml
>>> yaml.__with_libyaml__

If it returns True, your installation of PyYAML can run as fast as possible, with one extra step. You will need to instruct the library to take advantage of the shared C library by running the following code:

>>> try:
...     from yaml import CSafeLoader as SafeLoader
... except ImportError:
...     from yaml import SafeLoader
>>> SafeLoader
<class 'yaml.cyaml.CSafeLoader'>

Without explicit instruction, PyYAML uses its pure-Python implementation. As you can see, we import one of the loader classes with the prefix "C," which denotes using the C library.

If this fails, the code falls back to the corresponding class implemented in Python. This approach makes the code more verbose, because you must pass the loader class explicitly to yaml.load() instead of relying on shortcut functions such as yaml.safe_load(), which always use the pure-Python classes.
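Putting the pieces together, here is a sketch of how the imported classes might be used for both loading and dumping. It assumes PyYAML is installed and falls back to the pure-Python classes when the C binding is missing. The sample data is invented for the example.

```python
import yaml  # third-party package: PyYAML

# Prefer the LibYAML-backed classes and fall back to pure Python.
try:
    from yaml import CSafeLoader as SafeLoader, CSafeDumper as SafeDumper
except ImportError:
    from yaml import SafeLoader, SafeDumper

# Pass the chosen classes explicitly instead of using yaml.safe_load().
config = yaml.load("name: Alice\nage: 30\n", Loader=SafeLoader)
text = yaml.dump(config, Dumper=SafeDumper)

print(config)
print(text)

# Loading the dumped text again should give back the same data.
restored = yaml.load(text, Loader=SafeLoader)
```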

Conclusion

Though PyYAML is popular, it comes with some unique drawbacks. The main drawback is that it's not optimized for using the features introduced in YAML 1.2.

If you want to use features such as safe literals and JSON compliance, it's better for you to use the ruamel.yaml library, which is a fork of an older version of PyYAML. The library can also pull off round-trip parsing, allowing you to preserve the original formatting of the YAML document and also the comments.

In contrast, if you're worried about type safety or need to validate your YAML documents against schemas, the StrictYAML library is right for you. It disables the high-risk features of YAML to give you a smooth experience. However, it works slower than other libraries.

Now that you've understood the basics of YAML and how to make it work with Python, you're ready to start working with Python libraries like PyYAML that support it.