Archive for the ‘Python’ Category

it’s a geeky meme

Sunday, April 13th, 2008

lars@bozeman:~$ history|awk '{a[$2]++} END{for(i in a){printf "%5d\t%s\n",a[i],i}}'|sort -rn|head
133 cd
114 ls
44 svn
31 vi
28 python
24 ssh
21 ./ConfigurationManager.py
17 make
13 rsync

It looks to me like I spend too much time moving around the file system. I should try to type more pathnames and stick around in one place…

Python Generators: Searching Java Jar Files

Sunday, March 2nd, 2008

Here is an example of a utility that uses a recursive generator. It is a command line utility that assists Java programmers in finding missing classes. I wrote this script several years ago when I was dragged kicking and screaming into a Java project. The script recursively searches a directory tree for jar files. When it finds a jar file, it scans the file’s directory for the target Java class.

#!/usr/bin/python

import sys, os, os.path
import fnmatch

def findFileGenerator(rootDirectory, acceptanceFunction):
  for aCurrentDirectoryItem in [ os.path.join(rootDirectory, x) for x in os.listdir(rootDirectory) ]:
    if acceptanceFunction(aCurrentDirectoryItem):
      yield aCurrentDirectoryItem
    if os.path.isdir(aCurrentDirectoryItem):
      for aSubdirectoryItem in findFileGenerator(aCurrentDirectoryItem, acceptanceFunction):
        yield aSubdirectoryItem

if __name__ == "__main__":
  rootOfSearch = '.'
  if sys.argv[1:]:
    rootOfSearch = sys.argv[1]
  if sys.argv[2:]:
    classnameFragment = sys.argv[2].replace('.', '/')
    def anAcceptanceFunction (itemToTest):
      return (not os.path.isdir(itemToTest) and fnmatch.fnmatch(itemToTest, '*.jar') and
              classnameFragment in os.popen('jar -tf %s' % itemToTest).read())
  else:
    def anAcceptanceFunction (itemToTest):
      return not os.path.isdir(itemToTest) and fnmatch.fnmatch(itemToTest, '*.jar')

  try:
    for x in findFileGenerator(rootOfSearch, anAcceptanceFunction):
      print(x)
  except Exception as anException:
    print(anException)

The focus is on the generator function findFileGenerator. It creates an iterator for the results of a recursive search through a directory tree. It accepts as parameters a path to begin the search and a function to determine if a given file satisfies the search parameters.

Generators can be kind of confusing because even though they look like functions, they do not execute immediately when called. They return a reference to an object that works like an iterator. The code defined in the generator function is executed by that iterator object. The first time that the iterator’s ‘next’ function is called, execution begins at the beginning of the code and goes until it encounters a ‘yield’ statement. The ‘yield’ statement returns the next value of the iterator. The next time the ‘next’ function is called, execution resumes at the next statement after the ‘yield’.
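The pause-and-resume behavior is easier to see in a tiny generator that has nothing to do with file systems. This is only a sketch: ‘countdown’ and the ‘events’ list are names made up for the illustration, and it is written for Python 3, where the built-in next() takes the place of calling the iterator’s ‘next’ function directly.

```python
events = []  # hypothetical log of what the generator body does, and when


def countdown(start):
    events.append("body started")     # runs on the first next(), not on the call
    while start > 0:
        yield start                   # hand back a value and pause right here
        start -= 1                    # the following next() resumes at this line
    events.append("body finished")    # reached only once the loop runs out


gen = countdown(2)            # calling the generator only creates the iterator
events_before = list(events)  # still empty: none of the body has run yet
first = next(gen)             # runs the body up to the first yield
second = next(gen)            # resumes after the yield, loops, yields again
print(events_before, first, second)
```

Asking for a third value would run the rest of the body and then signal that the iterator is exhausted.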

Let’s examine findFileGenerator closely. Imagine that the first call to the generator has happened and we’ve got the resultant iterator-like object. The first call on that object to ‘next’ starts execution at this line:

for aCurrentDirectoryItem in [ os.path.join(rootDirectory, x) for x in os.listdir(rootDirectory) ]:

Here we’re getting our first directory listing of all the files in the current directory. Because the call to ‘os.listdir(rootDirectory)’ returns a list of file names with their paths stripped off, we’re going to have to re-attach them. The list comprehension (the code between the [ … ]) welds the current directory path to each of the files in the list and returns a new list. The for loop then sets us up to iterate through that list.
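To see the stripping and re-attaching in isolation, here is a small sketch that builds a throwaway directory and lists it. The file names and the temporary directory are invented for the demonstration, and it assumes Python 3 for tempfile.TemporaryDirectory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Create two empty files so there is something to list.
    for name in ("a.jar", "b.txt"):
        open(os.path.join(root, name), "w").close()

    # os.listdir gives bare names, in no guaranteed order, so sort them.
    bare_names = sorted(os.listdir(root))

    # The same list comprehension as in the generator: weld the directory back on.
    full_paths = [os.path.join(root, x) for x in bare_names]

    # The bare names only resolve from inside root; the full paths work anywhere.
    all_exist = all(os.path.exists(p) for p in full_paths)

print(bare_names, all_exist)
```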

    if acceptanceFunction(aCurrentDirectoryItem):
      yield aCurrentDirectoryItem

Here’s where we decide if the current entry in this directory is interesting or not. We call the acceptance function on the item. Since the acceptance function is passed in when we originally called this generator, it could be anything the programmer desired. In the case of this particular utility, we’re looking for Java Jar files that meet certain criteria. But it really could have been anything at all: find all files that have vowels in their name, or all files that have a specific type or content.

If the acceptance function returns ‘True’, then we yield. The current file is returned by the iterator and execution stops until the ‘next’ function is called again.

  if os.path.isdir(aCurrentDirectoryItem):

If the acceptance function rejected the item, this is immediately the next line to execute. If the acceptance function accepted the item, this line won’t be executed until after the next call to ‘next’. In either case, our goal is to find the next item for the iterator to return.

Since we’re iterating through a list of entries in a directory, some of those will be directories themselves. The item that we sent to the acceptance function could have been a subdirectory. Regardless of the outcome of the acceptance function, we need to recurse into subdirectories.

      for aSubdirectoryItem in findFileGenerator(aCurrentDirectoryItem, acceptanceFunction):
        yield aSubdirectoryItem

Hang onto your hat, here’s where your brain may explode. We’ve got a sub-directory and we need to recurse into it and iterate through its entries. Well, we’ve got this handy generator that does exactly that: it returns an iterator that will cycle through the contents of a directory. ‘for’ statements in Python have a special relationship with iterators. You can provide one instead of a list and the ‘for’ loop will dutifully iterate through it for you. We recursively call the generator, passing in the subdirectory and the acceptance function. The generator returns an iterator to us and the ‘for’ statement starts the iteration by silently calling the ‘next’ function. Remember that the iterator returns only items that have passed the acceptance function, so each item that we get here we’re just going to pass on as the next item in our iterator. Hence, we yield every item that we get in this loop.
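For anyone who wants to watch the recursion run, here is a self-contained sketch of the same idea against a small scratch tree. It is not the original utility: ‘find_files’ is a renamed stand-in, the directory layout is invented, the listing is sorted so the order is predictable, and it assumes Python 3.3 or later, where ‘yield from’ is shorthand for the inner for/yield loop.

```python
import os
import tempfile


def find_files(root, accept):
    """Yield every path under root for which accept(path) is true."""
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if accept(path):
            yield path
        if os.path.isdir(path):
            # Same effect as: for item in find_files(path, accept): yield item
            yield from find_files(path, accept)


with tempfile.TemporaryDirectory() as root:
    # Invented layout: jar files at three depths plus one file to reject.
    os.makedirs(os.path.join(root, "lib", "ext"))
    for rel in ("top.jar", "lib/mid.jar", "lib/notes.txt", "lib/ext/deep.jar"):
        open(os.path.join(root, *rel.split("/")), "w").close()

    # Report paths relative to the scratch root, with '/' on every platform.
    found = [os.path.relpath(p, root).replace(os.sep, "/")
             for p in find_files(root, lambda p: p.endswith(".jar"))]

print(found)
```

Because each directory entry is visited in sorted order and subdirectories are descended into as soon as they are met, the deepest jar under ‘lib’ comes out before the jar sitting at the top level.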

The rest of the file is in the problem domain: a command line utility that will find Java Jar files with certain classes in them.

if __name__ == "__main__":

Perhaps someday in the future, we’ll want to use the generator in another application. By putting the code of the command line utility under this ‘if’, we’ll prevent it from executing when we use the ‘import’ statement on this file.

  rootOfSearch = '.'
  if sys.argv[1:]:
    rootOfSearch = sys.argv[1]

The root of the path that we’re to search is optional on the command line. If no path is specified, we’ll assume that we’re to start in the current working directory.

  if sys.argv[2:]:
    classnameFragment = sys.argv[2].replace('.', '/')
    def anAcceptanceFunction (itemToTest):
      return (not os.path.isdir(itemToTest) and fnmatch.fnmatch(itemToTest, '*.jar') and
              classnameFragment in os.popen('jar -tf %s' % itemToTest).read())

The name of the class that we’re to search for is also optional. If the user does not provide one, then we’ll assume that we’re to just find all jar files regardless of their content.

This code fragment is the other case: a fragment of a class name has been given. It is our task here to create an acceptance function that meets the criterion.

The first thing to do is cook the class name a bit. In Java, class names are qualified with package paths. Inside Java code, ‘.’ is used as a separator. However, inside jar files, ‘/’ is the separator. To be friendly, we want Java programmers to be able to use either notation. We make sure the command line argument is converted to the ‘/’ notation and stored in ‘classnameFragment’. Next we define an acceptance function that receives a pathname as a parameter. All we have to do is subject that pathname to some tests and give it either a thumbs up or down. In this case, we test that the pathname is not a directory, then test that it names a jar file, and finally we run the command line tool ‘jar -tf’ to get a listing of the jar and see if our class name fragment is in there. Since Python does “short-circuit” evaluation of boolean expressions, if any of the earlier tests fails, the later tests do not get executed.
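The same three tests can be tried without a JDK on the path. The sketch below swaps the ‘jar -tf’ call for Python’s zipfile module, which works because a jar file is a zip archive; that substitution is mine and not what the script does. ‘make_jar_acceptor’, ‘jar_entries’ and the class names are hypothetical, and Python 3 is assumed.

```python
import fnmatch
import os
import tempfile
import zipfile


def jar_entries(path):
    """List the entry names inside a jar (a jar is a zip archive)."""
    with zipfile.ZipFile(path) as jar:
        return jar.namelist()


def make_jar_acceptor(classname):
    """Return an acceptance function for jars that contain classname."""
    fragment = classname.replace(".", "/")   # com.example.Foo -> com/example/Foo

    def accept(path):
        # Short-circuit: the archive is opened only when both cheap tests pass.
        return (not os.path.isdir(path)
                and fnmatch.fnmatch(path, "*.jar")
                and any(fragment in entry for entry in jar_entries(path)))
    return accept


with tempfile.TemporaryDirectory() as root:
    jar_path = os.path.join(root, "demo.jar")
    with zipfile.ZipFile(jar_path, "w") as jar:
        jar.writestr("com/example/Foo.class", b"")   # empty stand-in class file
    txt_path = os.path.join(root, "notes.txt")
    open(txt_path, "w").close()

    accept = make_jar_acceptor("com.example.Foo")
    results = (accept(jar_path),    # a jar that holds the class
               accept(txt_path),    # not a jar: never opened as an archive
               accept(root),        # a directory: rejected by the first test
               make_jar_acceptor("com.example.Bar")(jar_path))  # class absent

print(results)
```

The text file is the interesting case: it is not a valid archive, so if the expression did not short-circuit, trying to open it would raise an error instead of quietly returning False.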

  else:
    def anAcceptanceFunction (itemToTest):
      return not os.path.isdir(itemToTest) and fnmatch.fnmatch(itemToTest, '*.jar')

In the case where the user did not provide a class name fragment, we assume that we’re looking for all jar files. The acceptance function here just drops the additional criterion where we look into the content of the jar file.

  try:
    for x in findFileGenerator(rootOfSearch, anAcceptanceFunction):
      print(x)
  except Exception as anException:
    print(anException)

Finally, we’re ready to actually use the tools. We call the generator function with the path from which to start the search and our acceptance function. That returns an iterator that we loop through, printing the matching jar files.

a Pythonic Ospid

Monday, February 11th, 2008

I’m suffering an ospid: I wrote some code last weekend that I keep looking at over and over again because I like it so much.

I’ve got a relational database schema that looks like this:

For this blog posting, I am only interested in the first six tables of the top cascade of tables and the ‘updateParameters’ table just below them.

I’m trying to populate this schema with its initial data by walking a filesystem tree. I search for files within the filesystem fetching each file’s pathname. The directories in a pathname correspond to values in the cascading tables.

listOfTables = ['product','version','buildTarget','buildId','locale','channel']

I wrote a function that takes the name of a directory as an argument. The function’s objective is to put the directory name into an appropriate table whenever the value isn’t already there. I could have written the function such that the target table name is also a parameter to the function, but I took a different path instead. I decided that each table should have its own function. This didn’t mean that I had to individually write the function for each table; I could get Python to do that for me.

def getInsertFunctionForTable(tableName,  databaseConnection, cache,
                              insertSqlTemplate = genericInsertSql,
                              fetchSqlTemplate = genericFetchIdSql):
  insertSql = insertSqlTemplate.replace('TABLENAME', tableName)
  fetchSql = fetchSqlTemplate.replace('TABLENAME', tableName)
  def insertIntoTable(value):
    try:
      return cache[tableName][value]
    except KeyError:
      databaseConnection.executeSql(insertSql % value)
      id = databaseConnection.singleValueSql(fetchSql % value)
      cache[tableName][value] = id
      return id
  return insertIntoTable

In this code, I define a function that, when given the name of a table, will return another function. This returned function is the one that I described earlier. I take my list of table names and use a list comprehension to create a second list of functions, one for each of the directories in a pathname.

databaseConnection = ...
cache = collections.defaultdict(dict)
databaseInsertFunctions = [ getInsertFunctionForTable(x, databaseConnection, cache) for x in listOfTables ]

Now I can take a pathname and my list of functions and use another list comprehension to process them:

pathname = 'firefox/2.00.12/linux-gcc3.1/2008020101/en/somechannel/file.txt'
idForPathname = [x[0](x[1]) for x in zip(databaseInsertFunctions, pathname.split('/'))]

The result is a list of the database’s id for each of the directory names in their respective tables.
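Since the database connection above is elided, here is a sketch of the same closure trick that runs on its own. Everything standing in for the database is an assumption made for the demonstration: ‘store’ is a plain dictionary playing the part of the tables, ids are handed out by a per-table counter, and the second pathname is invented. Python 3 is assumed.

```python
import collections
import itertools


def make_insert_function(table_name, store, cache):
    """Return a function that maps a value to its id in table_name.

    store is a hypothetical stand-in for the database:
    a dict of table name -> {value: id}.
    """
    next_id = itertools.count(1)   # each table gets its own id sequence

    def insert_into_table(value):
        try:
            return cache[table_name][value]    # seen before: skip the "database"
        except KeyError:
            new_id = next(next_id)
            store[table_name][value] = new_id  # stands in for the INSERT
            cache[table_name][value] = new_id
            return new_id
    return insert_into_table


tables = ['product', 'version', 'buildTarget', 'buildId', 'locale', 'channel']
store = collections.defaultdict(dict)
cache = collections.defaultdict(dict)
insert_functions = [make_insert_function(t, store, cache) for t in tables]


def ids_for(pathname):
    # zip stops at the shorter list, so the trailing file name is ignored.
    return [f(part) for f, part in zip(insert_functions, pathname.split('/'))]


first = ids_for('firefox/2.00.12/linux-gcc3.1/2008020101/en/somechannel/file.txt')
second = ids_for('firefox/3.0/linux-gcc3.1/2008030101/en/somechannel/file.txt')
print(first, second)
```

The second path shares four of its six directory names with the first, so only the ‘version’ and ‘buildId’ tables receive a new row; the other four ids come straight out of the cache.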

As it happens, that list of ids is the value that I need to populate the next table in my diagram. Now I can use the same function again for this next table:

updateParametersInsertFunction = getInsertFunctionForTable('updateParameters', databaseConnection, cache,
                       updateParametersInsertSql, updateParametersFetchIdSql)

Using that idea, I can process an entire tree of data, inserting all the values into all the tables with this loop:

for path, name, pathname in cse.FileSystem.findFileGenerator(root,lambda a: a[1] == 'complete.txt' ):
  updateParametersId = updateParametersInsertFunction(tuple([x[0](x[1]) for x in zip(databaseInsertFunctions, path.split('/'))]))

I keep looking at this over and over again. I really like it.

The actual software that I wrote was a touch more complicated. I added the capability to translate values in the tables with a reference to a translation function. I also took into account the rest of the tables that I’ve not mentioned in this posting.

Lars at OSCON 2005 - on Python’s whitespace

Monday, August 8th, 2005

Python uses syntactically significant whitespace to define blocks of code. When casually talking with some Ruby and Java developers at OSCON about Python, I inevitably had to wait for them to stop ranting about mandatory whitespace. When I asked why they were so opposed, I never once really understood their objections. I really love Python’s whitespace rules.

One noteworthy complaint I heard was that code written in one editor by programmer A can get its indentation munged by another editor used by programmer B. Yeah, I can see how that can happen, but only if tabs are expanded or replaced by the second editor. That would break a Python program. In another language, you’d have to make the same corrections anyway or the code could be rendered unreadable or deceptively indented even though it still compiles. In Python, properly indented readable code is a requirement. Maintenance programmers in the future will call this a good thing. The problem can be avoided entirely by not using tabs.

The first time I encountered indented code was in Pascal in 1978. It made the code so much more readable. The “begin” was by itself on a line, subsequent lines were indented until, on a line by itself, the “end” statement terminated the block. Later, when I was programming in Fortran 77, I used indentation in a similar manner, though without the “begin” and with an “endif”, a “wend” (DEC variant) or a labeled “continue” to terminate the block. C was just like Pascal, only with braces to define the block.

Programmers in C started an unsavory habit of placing the opening brace at the end of the line preceding the block while the close brace lived on a line of its own. This meant that the opening and closing braces rarely, if ever, lined up horizontally. To this day, my eye/brain combination has a terrible time matching braces. To my dismay, that style has become the standard, not only in C, but in all its children: C++, Objective-C, Java and C#. Yet, it is the convention to also use indentation. Why? Because the braces alone are not good enough at making blocks readable. Try eliminating the indentation and the code becomes nearly impossible to read. Braces are good enough for the compiler but not enough for humans (at least this one). It’s like the parentheses in Lisp: once I get four or more opening or closing parentheses in a row, I can no longer mentally match them up at a glance. Perhaps my brain is defective.

Have you ever heard the doctrine of “once and only once”? Originating in our own software world, it suggests that it is a bad idea to duplicate effort. Throughout the realm of programming, trying to maintain two parallel solutions to the same problem leads to problems. Remember the relief that Java gave us over maintaining separate header and implementation files in C and C++? Remember how difficult it is to keep documentation synchronized with the source code?

Why do the C-derived languages use two separate techniques to indicate a block of code? We use braces for the compiler and whitespace indentation for the programmers. Python demonstrates that the compiler can use the same cues that we use to see blocks of code.

Honestly, the technique used to delineate a block of code in a language is a minor issue. I really like Python’s technique because it synchronizes with how my brain seems to work. I’ve lived with braces in my code for years and years. It hardly seems a reason for such vehemence as I heard at OSCON from some other developers.