
The Models Parser

South comes in two parts: the migrations engine, which actually applies the migrations, and startmigration, the bit that writes them for you.

One problem we had from the start was how exactly the migrations should be made; we needed some way to, given a model, spit out tuples of (fieldname, models.SomeField(foo=bar)).

This job is one of the muckiest in the code, and is the job of the models parser.

The job

The models parser is responsible for reading the models.py file of the given model, and extracting the definitions so we can use them.

Sensible people, at this point, dare to ask the important question: why not just look at the field classes (they're available at creation time, after all), and reconstruct the field definition that way?

The problem with doing that is that the arguments that get passed into fields don't get stored anywhere verbatim - they're split up into umpteen different internal variables, so we would have to go and examine every one (bearing in mind they could change between Django versions) to reconstruct the definition. Even going that far - with the several hundred lines of special cases for each field class - we can't reconstruct custom fields, since we don't know how they store arguments.

The second solution is to do the traditional Python thing, and pickle the fields using the pickle module. The security concerns of that aren't a worry to us - migrations are already Python code to start with - but it's just not transparent enough. Johnny User isn't going to be able to go round editing pickles, and auto-generated migrations are supposed to be editable, if tweaking is needed.
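To make the transparency point concrete, here is a small sketch comparing the two forms. The `field_spec` tuple is a hypothetical stand-in for a field description, not anything South or Django actually stores.

```python
import pickle

# Hypothetical description of a field: class name plus keyword arguments.
field_spec = ("CharField", {"max_length": 100, "blank": True})

# What a pickle-based migration would have to embed.
pickled = pickle.dumps(field_spec)

# What a parsed definition looks like inside a migration file.
parsed = "models.CharField(max_length=100, blank=True)"

print(repr(pickled))  # opaque bytes, awkward to edit by hand
print(parsed)         # plain source, trivial to tweak

# The pickle still round-trips; the problem is readability, not correctness.
restored = pickle.loads(pickled)
```

Changing `max_length` in the second form is a one-character edit; doing the same to the first means unpickling, modifying and repickling.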

There are also some wonderfully crazy ideas involving overriding every Field's __init__ to capture arguments and store them, but that's tricky to get right, very hard to extend to custom fields, and just feels dirty.

Thus, parsing is the only way. In our opinion, it's also cleaner and more forwards-compatible than other approaches. But how is it done?

South 0.4 and below

Andy McCurdy was the main push behind the first models parser, and it consisted of the inspect module combined with some rather interesting regular expressions to try to get the right bits. It also kept passing things through the Python parser until it didn't get a SyntaxError to see when to stop adding lines onto a field's definition.
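The "keep adding lines until it parses" trick can be sketched roughly like this. This is not the original code: `grab_statement` is a made-up name, and the built-in `compile()` stands in for whatever the old parser actually called.

```python
def grab_statement(lines, start):
    """Return the source of the statement that begins at lines[start].

    Lines are appended one at a time until the accumulated text compiles,
    which is taken as the sign that the statement is complete.
    """
    collected = []
    for line in lines[start:]:
        # Strip indentation so a class-body line compiles on its own.
        collected.append(line.strip())
        candidate = "\n".join(collected)
        try:
            compile(candidate, "<field>", "exec")
        except SyntaxError:
            continue  # still incomplete: keep adding lines
        return candidate
    raise ValueError("no complete statement found from line %d" % start)


source = '''class Author(models.Model):
    name = models.CharField(
        max_length=100,
        blank=True,
    )
    age = models.IntegerField()
'''.splitlines()

print(grab_statement(source, 1))  # the four-line CharField definition
print(grab_statement(source, 5))  # the one-line IntegerField definition
```

Nothing is executed or imported here, only compiled, so `models` doesn't need to exist. Stripping every line is a shortcut that would mangle multi-line strings, which is the kind of corner case that makes this approach fragile.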

This, surprisingly (to both of us), worked shockingly well, proving that Python really does enforce a reasonably strict grammar. It did, however, have a few shortcomings: the inspect module trips up if there's no newline at the end of a file (#47), you can't import the name _ from models.py (#9), and others. So, it was rewritten.

South 0.5 and up

The new models parser uses Python's parser module - for those who don't know, this gives you Python's internal syntax tree of a bit of code. The new parser gets the models.py file, takes its syntax tree, and then the magic happens.

The modelsparser.py file has a class called STTree that implements some very helpful wrappers around syntax trees; in particular, it allows two vital things:

  • Searching them using CSS-like selectors (to find the right clauses)
  • Turning trees back into source code (something Python can't do)

Using this and many searching functions, the new parser splits the tree into model class definitions, finds all the class body assignments, turns those into {name: defn} dictionaries, and passes them all back out.
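For a feel of the overall flow (classes, then class-body assignments, then a {name: defn} dictionary), here is a rough sketch. It is not how modelsparser.py does it: it swaps the parser module and STTree for the standard `ast` module, because parser was removed in Python 3.10, and it relies on `ast.get_source_segment` (Python 3.8+) to turn a node back into source. The name `field_definitions` is invented for the example.

```python
import ast


def field_definitions(source):
    """Map each top-level class name to a {attribute: definition_source} dict."""
    tree = ast.parse(source)
    models = {}
    for node in tree.body:
        if not isinstance(node, ast.ClassDef):
            continue
        fields = {}
        for stmt in node.body:
            # Only simple "name = some_call(...)" assignments are treated as fields.
            if (isinstance(stmt, ast.Assign)
                    and len(stmt.targets) == 1
                    and isinstance(stmt.targets[0], ast.Name)
                    and isinstance(stmt.value, ast.Call)):
                # Recover the right-hand side exactly as it was written.
                name = stmt.targets[0].id
                fields[name] = ast.get_source_segment(source, stmt.value)
        models[node.name] = fields
    return models


example = '''
from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=100)
    age = models.IntegerField(null=True)

    def __str__(self):
        return self.name
'''

print(field_definitions(example))
```

The source is only parsed, never imported, so Django doesn't have to be installed for this to run.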

This is all done in fewer than 400 lines of well-commented, spaced-out code (there's probably only 250 of actual code; a big dict of token-to-string mappings takes up some space, too).

So far, in tests, it's proven to be pretty solid, and takes every model file I've been able to come up with, which I consider Quite Good.

That, everyone, is the story of how a project dedicated entirely to playing with database schemas grew up to have a Python parser in it. We have, however, achieved our goal: it Just Works™. And that, I'd say, is worth 5 hours of horrible syntax trees*.

* Python's syntax trees are bad. The code """Some documentation.""" turns into this:
(257,
 (264,
  (265,
   (266,
    (267,
     (307,
      (287,
       (288,
        (289,
         (290,
          (292,
           (293,
            (294,
             (295,
              (296,
               (297,
                (298,
                 (299,
                  (300, (3, '"""Some documentation."""'))))))))))))))))),
   (4, ''))),
 (4, ''),
 (0, ''))

...and that's pretty-printed. The AST module is, alas, only in 2.5 and up. Those of a syntax-tree-using nature might want to have a look for our better prettyprinters in modelsparser.py that use the token names, not numbers.
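As an illustration of what a name-based prettyprinter buys you, here is a small sketch. It is not the one in modelsparser.py: `pretty` is a hypothetical helper that only consults the standard `token.tok_name` table, so terminal tokens get names while nonterminal numbers are left as they are unless you pass your own `names` mapping.

```python
import token


def pretty(tree, names=None, indent=0):
    """Render a nested (code, child, ...) tuple with names instead of numbers."""
    names = names if names is not None else token.tok_name
    code, rest = tree[0], tree[1:]
    label = names.get(code, str(code))
    pad = "  " * indent
    # A leaf looks like (code, 'text').
    if len(rest) == 1 and isinstance(rest[0], str):
        return "%s%s %r" % (pad, label, rest[0])
    lines = [pad + label]
    for child in rest:
        lines.append(pretty(child, names, indent + 1))
    return "\n".join(lines)


# A cut-down version of the tree above, with most of the nesting removed.
small = (257,
         (264, (3, '"""Some documentation."""'), (4, '')),
         (0, ''))

print(pretty(small))
```

Even with the nonterminals still numeric, seeing STRING, NEWLINE and ENDMARKER at the leaves makes the shape of the tree much easier to follow.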