How can one create a data flow graph (DFG/SDFG) for any application from its source code

I have done a lot of research to figure out how a DFG can be created for an application from its source code. There are DFGs available online for certain applications such as an MP3 decoder, JPEG compression, and an H.263 decoder.

I haven't been able to figure out how I can create a DFG for an application such as HEVC from its source code. Are there any tools which can automatically generate data flow graphs for such elaborate applications, or does it have to be done manually?

Please advise me regarding this matter.

EDIT: I used Doxygen for HEVC and I could see how the different functions interact with each other. However, every function had many entry and exit points, and the output of Doxygen became too confusing to follow after a while.

I also looked at StreamIt: http://camlunity.ru/swap/Library/Conflux/Stream%20Programming/streamit-cc_stream_graph_programming_language.pdf

It seemed handy, but the graphs it generated even for simpler applications (like an MP3 decoder) were too complex. In order to generate a coherent DFG, will I have to rewrite the entire source code?
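
For what it is worth, the kind of graph I am after is just producer-to-consumer edges between values. Here is a toy sketch of the idea in Python (the function names are made up and have nothing to do with the real HEVC code; a real tool would obviously have to do this for C/C++ and handle control flow properly):

```python
# Crude illustration of the def-use edges I mean by "data flow graph":
# for each assignment, record which previously assigned names feed into it.
import ast

SOURCE = """
coeffs = parse_bitstream(data)
residual = inverse_transform(coeffs)
pred = intra_predict(neighbors)
block = residual + pred
frame = deblock(block)
"""

def build_dfg(source):
    """Return (producer, consumer) edges between assigned names."""
    defined = set()   # names assigned so far
    edges = []
    for stmt in ast.parse(source).body:
        if not isinstance(stmt, ast.Assign):
            continue
        target = stmt.targets[0].id
        # every previously defined name read on the right-hand side is a data dependence
        for node in ast.walk(stmt.value):
            if isinstance(node, ast.Name) and node.id in defined:
                edges.append((node.id, target))
        defined.add(target)
    return edges

for src, dst in build_dfg(SOURCE):
    print(f"{src} -> {dst}")   # coeffs -> residual, residual -> block, ...
```

Doing this by hand (or with something this naive) for a codebase the size of the HEVC reference software is clearly not feasible, which is why I am asking about tools.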

1 Answer

Ira Baxter (accepted answer):

You want to extract data flow graphs from arbitrary languages. You imply you want a single way to do it. This isn't practical by hand... you need a tool.

Such a tool is singularly hard to build.

To do this, for each language you must be able to:

  • Define the language to the tool, in the form you find it in practice (not just the language reference manual version). C++ in the wild is bent in a lot of funny ways compared to the standard.
  • Parse programs in the language as found in the field, perhaps as one file, perhaps as tens of thousands; some programs aren't small.
  • Build structures representing the language elements and their relations to one another (this is often done as an abstract syntax tree).
  • Determine for each literal what its actual value is; "a\xbc" has very different values depending on whether the language thinks it is ASCII or Unicode text with escape sequences (see the literal-decoding sketch after this list).
  • Find all the identifiers in the code, and determine for each one which definitional/type information is associated with it according to the language's scoping rules (see the scope-resolution sketch after this list).
  • Determine the sources of data (literal values, inputs from the outside world, results of expressions) and track where those data values are used in other parts of the program, across the various control flow constructs.
  • Presumably draw some picture of the resulting data flow.
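
To make the literal-value point above concrete, here is a tiny illustration using Python's rules as a stand-in (the C and C++ rules the question is really about differ in detail, but the tool has to encode some such rule set for every language it handles):

```python
# The same spelling denotes different values depending on whether the
# language treats it as text with escape sequences or as raw bytes.
text_literal = "a\xbc"     # two characters: 'a' and U+00BC ('¼')
byte_literal = b"a\xbc"    # two bytes: 0x61 and 0xBC, no character meaning at all

print(len(text_literal), text_literal)    # 2 a¼
print(len(byte_literal), byte_literal)    # 2 b'a\xbc'
print(text_literal.encode("utf-8"))       # b'a\xc2\xbc' -- yet another byte-level value
```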
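
Likewise for the scope-resolution step: the sketch below uses Python's standard-library symtable module as a small-scale analogue (the code being analyzed is invented for the example); a multi-language tool has to implement the equivalent resolution rules itself for every language and dialect it supports.

```python
# Resolve, for each name used in a function, which kind of binding it refers to.
import symtable

CODE = """
scale = 4

def quantize(block):
    offset = scale // 2        # 'scale' resolves to the module-level binding
    return [(v + offset) // scale for v in block]
"""

module_table = symtable.symtable(CODE, "<example>", "exec")
func_table = next(t for t in module_table.get_children() if t.get_name() == "quantize")

for name in ("block", "offset", "scale"):
    sym = func_table.lookup(name)
    kind = ("parameter" if sym.is_parameter()
            else "local" if sym.is_local()
            else "global" if sym.is_global()
            else "free")
    print(f"{name}: {kind}")   # block: parameter, offset: local, scale: global
```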

Each of these tasks is hard by itself, because the languages tend to be complex. Most language tools that can do this at all (mostly compilers) do it only for one dialect of the language.

To do it for more than one language/dialect, you need a tool that can be configured with all the details of each language, and you have to configure it for all the languages of interest. [Realistically you cannot "do them all"; there are thousands of computer languages in use right now.]

Even limiting yourself to the "everyday" common programming languages, this is an enormous amount of work; it can take a few years to do all this well for a single mainstream language. You won't succeed in doing this by yourself.

My company builds a single, unified tool that is intended to be capable of doing this: the DMS Software Reengineering Toolkit. The simple "secret" is to realize that the machinery needed to accomplish the above tasks is actually very similar across languages, and can be designed to be configured for a particular language with relatively modest (that doesn't mean "small") effort.

After 20 linear years of engineering with a team of PhD-level engineers, we have parsers (even this is hard) for a surprising variety of languages, with full-up data flow analyzers of the type you are talking about for C++ (check this link for examples), C, COBOL, and (almost) Java 8.

I don't know of any other unified tools that are this far down the path toward your ideal. Check my bio before you decide I'm clueless about this. (Rascal/MPL has some ambitions, but it is a research tool at this point; they don't do C or C++ at all.) We're only part way there, with many language and scale battles still to fight.

[The goal of DMS isn't data flow analysis; that's just a stepping stone. It is to do automated code transformation, which requires data flow analysis if it is to be done safely and correctly.]

Of course, you could just hope to find a separate tool for each language. You wouldn't get consistent quality or consistent style/granularity of data flow graphs from separate tools from different authors, if you can indeed get a full set of such tools at all.