Shuffling a large out-of-memory data file in Python
Shuffling data is an important step in the data-preparation stage for any machine learning model. Once we load the data through any library into a data structure (e.g. a pandas.DataFrame or a Python list), we can shuffle it easily. At its core, this typically uses Python's random module.
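When the data fits in memory, the shuffle itself is a one-liner. A minimal sketch (the file names and sample data are illustrative, not from the article):

```python
import random

# create a small sample file as a stand-in for the real dataset
with open("data.txt", "w") as f:
    f.writelines(f"line {i}\n" for i in range(10))

with open("data.txt") as f:
    lines = f.readlines()        # loads the ENTIRE file into memory at once

random.shuffle(lines)            # in-place Fisher-Yates shuffle

with open("shuffled.txt", "w") as f:
    f.writelines(lines)
```

Note that `readlines()` materializes every line in memory, which is exactly what breaks down for very large files.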
Now, this is quite simple to do when the data is small and fits into memory. The challenge comes when the data is so large that it doesn't fit into a single machine's RAM.
For example, suppose we have a text file containing 1 billion lines of text, around 20 GB in size, and a laptop with 16 GB of RAM. It is obvious that the naive in-memory approach above won't work: we will hit a MemoryError, or the system will eventually crash.
In this article we will try to shuffle large files that don't fit into memory.
Execution Steps
- Set a `buffer_size` so that we can split the large file into several chunks, each of which fits into memory.
- Sequentially read the large file in chunks of `buffer_size` lines and shuffle each chunk in memory.
- Write each shuffled chunk to the output file sequentially.
- Repeat step 2 and step 3 until the large file has been processed completely.
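The steps above can be sketched as a single function; the function name and default `buffer_size` are illustrative, not the article's actual code. Note that this shuffles lines only *within* each chunk, so the result is not a uniform global shuffle of the whole file:

```python
import random

def shuffle_large_file(input_path, output_path, buffer_size=100_000):
    """Shuffle a file too large for memory, one chunk at a time.

    Reads up to `buffer_size` lines into a buffer, shuffles the buffer
    in memory, and appends it to the output file before reading more.
    """
    with open(input_path) as src, open(output_path, "w") as dst:
        buffer = []
        for line in src:
            buffer.append(line)
            if len(buffer) >= buffer_size:   # chunk is full: shuffle and flush
                random.shuffle(buffer)
                dst.writelines(buffer)
                buffer = []
        if buffer:                           # flush the final partial chunk
            random.shuffle(buffer)
            dst.writelines(buffer)
```

Peak memory use is bounded by `buffer_size` lines rather than the whole file, which is the design trade-off: lines never move between chunks, but the file can be arbitrarily large.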
You can find the complete code at the link below.