Tuesday, December 02, 2008

Data Structures in Imperative Languages

So in the past few weeks I have realized I don't really know how to write data structures in Java or in any other imperative language for that matter.

The issue is that once one wants follow certain patterns like visitors or immutable classes, lot of code has to be written by hand in a tedious and repetitive fashion.

The reason is that these design decisions affect all the data structures in a similar way: a visitor pattern requires an accept method for each class; equality, hash code, cloning have the same properties over all classes (such as deep cloning); etc.

Code generation does seem like a good path to follow here.

Thanks to a colleague who is an avid UML modeler I had the opportunity to work with classes generated via the Eclipse Modeling Framework (EMF).

So how did I like it? Not too much. There's almost no control over the code being generated as the template is well embedded in the innards of the framework.

No equals method, no hashCode method, no make methods (which create a class from the given attributes). All have to be written by hand - get me out of here!

Hence I went and looked for other technologies.

A colleague of mine Radu developed a light-weight tool for the purpose in question. The class diagram is specified in a custom language similar to grammar description languages. For instance:

BinaryOperator :> Implies, Iff;

tells us that the classes Implies and Iff extend the class BinaryOperator. The code is generated using templates, which are language-ignorant and may access properties of the classes using special keywords.

For instance, \ClassName is expanded to the name of currently processed class and \BaseName to the name of the base class. This enables us to write templates like this:

public class \ClassName extends \BaseName { ....

More details about Radu's tool can be found in this article.

Radu's approach is nice, concise, and more flexible than EMF. The problem is, however, the template language. Very soon I have discovered that there are properties of the classes that are hard to access. Moreover, it's not quite clear how one extends the generated classes with custom code.

Even though tempted to extend the existing tool with constructs of my desire, I decided to look for other existing approaches based on code generation using templates.

Surprisingly enough, little did I find. The most promising is described in a tutorial using the velocity tool. The tutorial can be found here.

The template language is richer, it supports if statements, variables, and other goodies. Theinclude construct, I believe, could be used to extend classes with custom code.

The language is a bit awkward at times, this could probably be overcome by defining some velocity variables. It seems that one would have to write support for navigating the structures describing the class diagrams one wants to support.

All in all, I have a nagging, ominous feeling that I'm missing something. Is it not the case that pretty much every object orient program needs data structures? Isn't there like million papers on UML and/or code generation?

Could someone please point me to a decent tool that would let me generate data structure code using templates driven by class diagrams?

Labels: ,


Blogger rgrig said...

I resent the use of "Radu" and "class diagram" in the same paragraph.

03 December, 2008 10:25  
Blogger mikolas said...

Should we change class diagram to Entity-relationship diagram?

03 December, 2008 10:35  
Blogger Ed Merks said...

Dynamic templates in EMF would easily allow you to add things like equals and hashCode, but doing so would violate framework assumptions about modeled objects. EMF supports things like EcoreUtil.equals to do fully general structural equality testing. Implementing structural equality is not as easy as you might assume in a fully general data structure that supports arbitrary references. It's also important to keep in mind that mutable data structures that override hashCode cannot be used safely as keys most maps and sets. I'd be willing to bet that your first attempt to implement equals correctly will likely be wrong...

Of course you could also generate something entirely different from an Ecore model, e.g., I've seen people generate C code. But most important of all, you've completely overlooked the fact that in EMF you wouldn't need to write a visitor because EMF reflection supports that directly. As an example of the power of reflection, XML serialization, copying, equality testing, and change recording are all implemented using a reflective approach to visit the data structures. Note that deep cloning and structural equality testing, two of your stated goals, are directly supported by this.

It seems an unfortunate fact of life that people are constantly reinventing the wheel without looking closely at the thousand other wheels that have already been invented. They end up with not a better wheel, but rather yet another different wheel. When building large scale applications composed from the contributions of thousands of inventive people, the variations in the wheel design don't end up helping to create an integrated result, but rather help to create complexity or even chaos. Those who do not learn from history are doomed to repeat its mistakes.

All that being said, using an EMF EClass to model something like a simple immutable data structure, e.g., a Point or a Rectangle, really isn't the right tool for the job. Such things are best supported as EDataTypes that present the leaf "nodes" in a model structure.

03 December, 2008 11:15  
Blogger mikolas said...

Thanks Ed for your comment. I was hoping somebody will tell me I'm an idiot. Not to reinvent the wheel was the main motivation for this post.

I should say that I wouldn't probably consider code generation for mutable data structures. Moreover I think I very seldom need them (typically I deal with some logic formulas of programming languages ASTs).

I know EMF has predefined cloning and equality mechanisms but I have no control over that.

When I was talking about deep cloning, I meant that this is a design decision that spans across multiple structs.

I might as well decide not to use deep cloning and implement equality via reference equality but then I need to know what I'm doing and I believe that such decisions have nothing to do with the modeling part.

The clone, equals, hashCode, and compareTo methods need to be compatible and that's why I prefer them encapsulated.

Another problem with the EMF EqualityHelper is that I can't use the any API that uses equals.

My point is that templates give me more flexibility without which I find my life difficult. There's always going to be something that I will miss. (toString method and make method pop up in my mind right now).

Isn't this exactly the point of the EMS's JET project? (about which I didn't know until yesterday).

03 December, 2008 12:54  
Blogger Ed Merks said...

I didn't say you were an idiot; you're reading between the lines.

EMF is perfectly suited to supporting ASTs so you should consider using it for that. Check out the Xtext project, for example. Then you'll be spending your time on creative things, i.e., designing your model and syntax, rather than tedious on things that a generator can do better, e.g., writing getters and setters.

The equality and copy utilities are designed around extensible utility classes that you could use to impose our own specializations. In any case, I suspect your concerns about using APIs that use equals is more an anticipated problem than one that arises often in reality. That's because equality is generally well defined (easily defined) only on simpler data structures, not on arbitrarily complex models. Consider the problems that arise for a list which contains itself. Generally complex models don't suffer as a result of not having a specialized equality method.

You can always add convenience methods to the generated factory if creating a bare object and calling setters is too tedious or verbose.

EMF's generator is based on JET templates and with dynamic templates you can write a small template snippet to generate additional methods in each class and interface. Sure it's buried in the framework, but where do you expect it to be?!

Your blog title is "Applied Formal Methods" and something like EMF's Ecore is the epitome of applying formal modeling to your every day development tasks. Just look at the huge stack of technology based on it at Eclipse. Even IBM Rational's entire tool structure is grounded on it. If you don't see significant value in it, you're likely overlooking something.

03 December, 2008 15:05  
Blogger rgrig said...

Two things:

1. My tool has nothing to do with any `diagrams': it merely allows one to use the nicer syntax of functional languages (available since the 70s) for sum and product types when working with Java.

2. I developed the first version in two weeks in the beginning of 2004. Then I reimplemented a much stripped down version later (for legal reasons).

03 December, 2008 17:35  
Blogger mikolas said...

Radu: I would say that "nothing to do with diagrams" is a too strong statement. The information captured is basically the same. You have inheritance and containment relations.

Just because some people draw this rather than write it, I wouldn't say there's any substantial difference (except that I can't read the former in emacs or vi).

03 December, 2008 19:35  
Blogger mikolas said...

Thanks Ed for your response. Despite your assumption that people do not use Java API's sets or maps, I do. I do so frequently.

Which is why I had to implement manually the equals method for classes generated from EMF. I must say I felt rather stupid while carrying out that task.

My first idea was to stick the EMF's EqualityHelper into the equals method. That does not work. It doesn't work because the EqualityHelper is using Java API's hashtable to record the object, which in turn uses the equals method, which loops happily until the stack overflows.

So not only that the EMF generated code does not generate equals methods, it requires them not to be extended. That is clearly an unadulterated misconstruction of Java and Java API's design.

Ed's explanation for this is, as I read it, that writing equals methods is hard for some classes. With all due respect, that sounds to me as: "Programming first-order theorem provers is hard, let's stop programming."

I think there's also another reason. I've learned, after fiddling with various pieces of program code for 20 years, that there's only so much that my brain is capable focusing on at the same time. Therefore I prefer simple program units to complex ones. I think that simple, immutable data structures are great.

Interestingly enough, it seems that some people prefer their code complicated and complex. While I wouldn't sleep well at night if I were incapable of writing an equals method for my ASTs because they are too complex, the EMF community seems to be enjoying it.

26 December, 2008 12:24  

Post a Comment

<< Home