Tuesday, December 9, 2014

Building strings in Python

The Saturday morning Python Christmas tree was fun and brought up a few interesting questions. A friend, Chris, made the comment that building strings as a list and then joining is more efficient than concatenating as you go along (like I did). I'm sitting in a room waiting for new tires to be placed on my truck, so this is a great little problem to take a look at.

Here's the code in question:

We build each line of the tree by concatenating strings together. This is a common thing to do and worthy of closer examination.

Here's an example of what I did in the Christmas tree code:
string = "" # start with a blank string
for _ in xrange(i): # make a loop
    string += " " # for each iteration of the loop, redefine the string

Here's what I think Chris recommended:
appendlist = [] # start with an empty list
for _ in xrange(i): # make a loop
  appendlist.append(" ") # append to the list
string = string.join(appendlist) # join the list into a string

As a first step, I wrote each of these and timed them using the Linux time command. By just picking random sizes of strings and timing the results, it seemed like my version was faster.

I then used Python's cProfile module to take a closer look, but wasn't able to discover anything useful with it. Plotting the two methods over a range of string lengths seemed like a logical next step and ended up making things much more clear. With a timing decorator that wrote results to csv files and another loop to iterate over a range of string lengths, I was able to create the data to make this chart:

It appears that "append as you go" does outperform "build a list" up until a point. The erratic behavior of appending to larger strings probably has to do with how memory is allocate for strings vs lists. Lists are made to grow smoothly, while strings are more difficult to allocate memory for because growing them in this manner isn't something that happens often. Even though strings can be abstracted as a list of characters, they aren't the same thing according to this. All that said, if your concatenating a string that's under 6 million characters (even on my little XPS12 laptop), it seems there's no real benefit to either method.

The work on my truck is complete, so I have to end this post. Maybe I'll pick this back up some other time and try it on different computers. If you end up looking at this, share your code or thoughts on the github repository.

Here's the code I used to generate the data for the chart which I made with OpenOffice Calc:

(thanks reddit user dangerbird2)
The better performance of += operations can be explained by optimizations in the CPython compiler. This might not hold true for other compilers (but probably does). However, the Python Style guidelines recommend against relying on compiler specific optimizations. So it's still good form to use list based string concatenation.

There's also a great reference to an article by Spolsky with a joke about a painter that illustrates the problem we'd expect to see by concatenating strings poorly.

Here's the article:

Here's the joke:
This code uses the Shlemiel the painter's algorithm. Who is Shlemiel? He's the guy in this joke:
Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. "That's pretty good!" says his boss, "you're a fast worker!" and pays him a kopeck.
The next day Shlemiel only gets 150 yards done. "Well, that's not nearly as good as yesterday, but you're still a fast worker. 150 yards is respectable," and pays him a kopeck.
The next day Shlemiel paints 30 yards of the road. "Only 30!" shouts his boss. "That's unacceptable! On the first day you did ten times that much work! What's going on?"
"I can't help it," says Shlemiel. "Every day I get farther and farther away from the paint can!"
If you ever hear the term "Schlemiel the Painter algorithms" this is what is being referenced.