Saturday, December 7, 2019

5 Reasons To Use YAML Files In Your Machine Learning Projects

During my Master in Data Science, no one had told me about the necessity of config files and how much easier my life could have been with them. To be honest, no one had told me many very important aspects that mattered as a Data Scientist and I severely lacked knowledge about best practices. However, I already addressed most of the issues in a previous article and I will not go over it again.
In short, when I graduated, my projects were sort of quick and dirty. They worked, had (sometimes) great results, but flexibility and reusability were obviously not a priority during the development of my solutions. The longer a project went on, the more I was yearning for a way to run different parts of my project easily, to reuse code or simply customize my environment without feeling sorry for myself by thinking at the amount of code I would have to duplicate or rewrite.

What is YAML?

The acronym YAML used to stand for Yet Another Markup Language. It now stands for YAML Aint Markup Language in order to better represent the data-oriented purpose that YAML currently has.
YAML is a human friendly data serialization standard that, among other characterics, follows Python indentation and is a JSON superset. YAML is not the only format that can be used for configuration files; configuration files can be made in .py files, in JSON or INI. I chose YAML for the simplicity of it, but I will come back to it later as one (or two) of my reasons in this article are based on its simplicity (spoiler alert!).

If you do not know what a configuration file is:

Config files are used to store configuration parameters, meaning initial parameters & settings for an application. The word application here is used with its larger definition. If your code does something with data, it fits in the application definition. If you want an easy way to understand one of the purposes of a configuration file, think about everything you might hard code in a project such as (but not exclusively!):
  • Input or output file locations
  • Parameter input for a machine learning model
  • Parameters for your pre-processing (e.g. size training vs test set)
  • Location of your log files

1. Even in small projects, scrawling through code to change parameters is never fun.

Without a doubt, I can commonly use a YAML file from the very beginning of a project, but I often skip that step. I always realize later that it was a mistake not to start with a config.yaml earlier. Knowing best practices does not always mean that I willingly apply them automatically, unfortunately.
Anything that can be hard coded in my script, can take its rightful place in a config file. While you might often start with just one data source, most projects are likely to get more complex as time passes by and starting with a config file from the start is a lot less painful than setting it up later.
Config files are not only useful to avoid hard coded paths in your scripts. Whenever I see myself changing a part of my code again and again to see the influence on the output or commenting out code to avoid one step to run each time, I (should) realize that this part of the code or this parameter represents a perfect candidate to be set in my config file. I will expand on these functionalities later in the article.

2. Writing your first YAML file is a piece of cake

The best part of YAML files is the simple syntax and the easiness in creating one. It is my main reason for choosing YAML files instead of any other type of configuration files in any of my (Machine Learning or not) project. Here is what would be in our config.yaml file if we wanted to give it an input and output location for our files:
files:
   location: 'C/Users/Yourname/Mydocuments/theinputlocation.txt'
   output: 'C/Users/Yourname/Mydocuments/theoutputlocation.txt'
Easy, right? But here are a few characteristics that make it even easier!
  • I can also write a string without quotes (as long as it does not include any special character that could be seen as YAML syntax )! And if I really want or need to use quotes, both single quotes and double quotes can do the job.
files:
   location: C/Users/Yourname/Mydocuments/theinputlocation.txt
   output: C/Users/Yourname/Mydocuments/theoutputlocation.txt
  • YAML uses comments the same way other languages such as Python do using a # which comments out the rest of the line.
files:
#put the locations of your files here
   location: 'C/Users/Yourname/Mydocuments/theinputlocation.txt'
   output: 'C/Users/Yourname/Mydocuments/theoutputlocation.ext'
  • No need for specific statements, you start directly with the content of your YAML file.
I cheated a little by adding subreasons to this second reason, but if we can put lists in lists, I can definitely put reasons in reasons.

3. Reading a YAML file in Python is incredibly easy as well

So if writing the .yaml file is so easy, what about reading the yaml file and telling Python how to deal with its content?
pip install pyyaml
Only one package is needed to read YAML files in Python.
The following code allows you to properly and safely load the .yaml file in Python, returning a clear error message if you messed up your configuration file.
import yaml
try: 
    with open (path_to_yaml, 'r') as file:
    config = yaml.safe_load(file)
except Exception as e:
    print('Error reading the config file')
Taking the same example we used above for my input location and my output location, here is how I would access my settings:
location_input = config['files']['location']
location_output = config['files']['output']
  • What if you do not want to write the file each time?
What should be in your YAML file:
write_file: True #turn to False to prevent from running
What should be in your python code (assuming your data is in a pandas dataframe):
if config['write_file']:
    mydf.to_csv(location_output)

4. People (developers or not) will love you when you share your application with them

What happens if you need to share your program and people have a different location for their file? As much as most data scientists & analysts love to read through Python code, they will really appreciate you for not forcing them to scroll through each script to find out where to change the input/output files path.
What if an entire data science project needs to move to a server? If you only need to import one file from one single location in your project (which rarely happens to me), it might not be such an issue. What if you need to import several files from several locations (and the files are likely to change names sometimes)? Anyone else running the code or pulling it from your git repository would have to go through the code and find each single path and replace it with their own.
Moreover, having people run through your code to change parameters is not always a great idea. Restricting the access only to a YAML file can also help limiting the damage that can be done to your code while keeping your project flexible.

5. It can help you focus on your Machine Learning code (or any other part of your project)

The current project I am in requires a bunch load of code, a ton of file loading and very extensive processing. It would be unthinkable for us to develop it while running the entire project every single time to test the code. Using a YAML file (or another configuration file) allows us to set up our project in a modular way.
When we think back to Machine Learning, most projects require expensive data cleaning. Once your data is cleaned, you might not need to rerun each step. Of course, Jupyter Notebook allows you to run different parts of my code separately as well. But in all honesty, I crashed a kernel more than once with stupid infinite loops, or heavy models or for very obscure reasons that I will never know. Everyone crashes a kernel and gets freaking upset about the entire experience and about having to rerun the entire code.
For instance, you could save your cleaned dataframe in a .csv file at the end of the preprocessing steps, allowing you to run the steps only when necessary.
steps:
   preprocessing: False
   machinelearningstep: Truetemp_location: mytempfile.csv
And an idea of how it should look like in Python:
if config['preprocessing']:
    #do extensive cleaning on my dataframe
else:
    #load the temp file that is being saved each time I run 
    #the preprocessing step in the location indicated in the config
    #file

YAML files come with advantages and limitations, just like any language or format for a configuration file. I realize that many of the reasons I gave could be valid as well for another type of configuration file (my second favorite being a .py file). Whether my colleagues or I choose for a YAML file or .py file does not matter, but structuring a project in an easy, readable, friendly way matters a lot for the productivity and happiness of a team in the long run. If we can avoid frustrations in the long run, why wouldn’t we?

Towards Data Science

Sharing concepts, ideas, and codes.

No comments:

Must Watch YouTube Videos for Databricks Platform Administrators

  While written word is clearly the medium of choice for this platform, sometimes a picture or a video can be worth 1,000 words. Below are  ...