Python in Memory
This following is a simple concept that I was struggling with last week but managed to figure out with the help of some pair programming.
Here’s the setup. I have a few tables in RethinkDB that were ~100 MB each. RethinkDB is a NoSQL database and it stores data in JSON. The bits I specifically needed were nested inside an array. So something like this
To get at the bits I care about, I’d have to iterate through each top level object, get the other
key, then iterate through each element of the array in other
. My initial thought was to do this procedurally.
Read the data without modifications to a variable. Then iterate through each element of the array in the key other
. Then get the bits I need.
Doing this on one table was fine. But I needed to compare the bits_I_need between multiple tables. It consistently was bombing out whenever I tried to do this due to memory constraints. Even when I tried to use iPython’s magic commands to clear memory each time, it kept bombing out.
The solution was, of course, to get the bits_I_need as I’m reading the data into memory. Instead of doing something like this:
which is way too memory intensive and has much more than I need, I could do something like this:
Which runs like a breeze. Even better, since I was comparing between tables, I could wrap this in a function to get exactly what I want:
Then I can just iterate through all the tables I need to compare, call this function with the appropriate tables, and be on my merry way.
Of course, this could be optimized further, but it can also be extended. If I needed to only add to my list after satisfying a certain condition, I could do so. I should also probably use thing.get('other', [])
instead of thing['other']
because it allows the program to continue even if that key doesn’t exist.
This was a fun little challenge that was solved with the help of a second set of eyes and some insight into what I was actually trying to get accomplish. Instead of getting all the data in case I needed, we took a look at the goal I was trying to achieve and coded it in that manner.