Shuffling a large out-of-memory data file in Python

sourajit roy chowdhury
2 min read · Dec 19, 2021

Shuffling data is one of the important steps in the data preparation stage for any machine learning model. Once we load the data through any library into a data structure (e.g. a pandas.DataFrame or a Python list), we can shuffle it easily. At its core, this typically relies on Python's random module.
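When the data is small, the whole workflow is only a few lines. Here is a minimal sketch of such an in-memory shuffle (the file names are illustrative):

```python
import random

# Load every line of the file into a list and shuffle it in memory.
# This only works when the whole file fits comfortably in RAM.
with open("data.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

random.shuffle(lines)

with open("data_shuffled.txt", "w", encoding="utf-8") as f:
    f.writelines(lines)
```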

Now, this is quite simple when the data is small and fits into memory. The challenge comes when the data is large and doesn’t fit into the RAM of a single box.

For example, suppose we have a text file with 1 billion lines of text, around 20 GB in size, and a laptop with 16 GB of RAM. It is obvious that the above code snippet won’t work: we will get some sort of out-of-memory error, or the system will eventually crash.

In this article we will try to shuffle large files that don’t fit into memory.

Execution Steps

  1. Set a buffer_size so that the large file can be split into chunks that each fit into memory.
  2. Read the large file sequentially, buffer_size lines at a time, and shuffle each chunk in memory.
  3. Write the shuffled chunk to the output file, appending sequentially.
  4. Repeat step-2 and step-3 until the entire large file has been processed (see the sketch after this list).
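A minimal sketch of these steps is below; the buffer_size value and file paths are assumptions for illustration, not the exact code from the linked repository.

```python
import random

buffer_size = 100_000  # number of lines per in-memory chunk (illustrative)

with open("big_file.txt", "r", encoding="utf-8") as infile, \
     open("big_file_shuffled.txt", "w", encoding="utf-8") as outfile:
    buffer = []
    for line in infile:
        buffer.append(line)
        # Once buffer_size lines have been collected, shuffle the chunk
        # in memory and append it to the output file.
        if len(buffer) >= buffer_size:
            random.shuffle(buffer)
            outfile.writelines(buffer)
            buffer = []
    # Shuffle and write whatever remains in the final, partial chunk.
    if buffer:
        random.shuffle(buffer)
        outfile.writelines(buffer)
```

Note that this shuffles lines only within each buffer-sized chunk; repeating the pass or also randomizing the order in which chunks are written gives a more thorough shuffle.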

You can find the complete, polished code at the link below.
