Announcing Madeline 2.0

Reinventing and expanding the functionality of Madeline using modern object-oriented software design

Fig. 1. The Madeline 2.0 development team. From left to right: Adrian Marrs, Ed Trager, and Ritu Khanna.

July 29, 2005 -- In recent years Madeline has become known as a flexible package for preparing, visualizing, and exploring pedigree data used in genetic linkage studies. Although the program has gained important new functionality and stability since version 0.935 was released, architectural issues in the legacy code base have begun to hinder the extent, ease, and frequency with which new features can be added.

To get around these limitations, we recently decided to rewrite Madeline from the ground up using a modern, object-oriented design with a carefully thought-out application programming interface (API) that will provide a solid foundation for extending the program well into the future.

The new master plan includes many practical and some innovative improvements that we anticipate will make the program much more flexible, easier to use, and vastly more powerful. In addition to numerous improvements in base functionality, the program will, for the first time, feature an integrated graphical user interface (GUI). We plan to make the program available on all major operating system platforms (Linux, Mac OS X, and Windows 2000/XP).

This coding effort has become possible with the creation of a software development team with a significant number of man-hours dedicated to this project. Ed Trager, author of the original version of Madeline and the senior engineer on the Madeline 2.0 team in Dr. Julia Richards' lab, is delighted to be joined in this effort by Ritu Khanna, database programmer in Dr. Anand Swaroop's lab, and Adrian Marrs, a new C++ programmer also in Dr. Richards' lab.

We are currently writing and testing the set of foundation classes that will form the basis of the new program. No release date has been set. Planned features of the new program are listed below. We look forward to receiving suggestions and comments from current Madeline users regarding ways to make the new program as useful as possible. If you have comments or suggestions, please email them directly to Ed Trager <ehtrager@umich.edu>.

New features in Madeline 2.0

Graphical User Interface

Fig. 2. GUI. Madeline 2.0 will provide a graphical user interface to complement the command line interface.

For the first time, Madeline will have a modern graphical user interface (GUI) to complement the traditional command line interface (CLI). The GUI will be built using Trolltech's Qt toolkit, which provides a completely cross-platform user interface solution.

The command line interface will not go away. On the contrary, we are looking at streamlining the command set and investigating how best to take advantage of the respective strengths of the CLI and GUI paradigms. We hope eventually to provide a program that can be run in either graphical or command-line mode with equal facility.

Support for Approximate Values

When collecting medical information on extended families, it is common to obtain only approximate information about some family members or sampled individuals. Doctors may only record approximate values in medical records and patients may not be able to recall exactly when they began having symptoms. Most software has no intrinsic support for approximations whatsoever. In contrast, Madeline 2.0 provides robust support for precise values, missing values, and imprecise values in the form of approximations or ranges for both numeric attributes and dates. Some examples are provided in Table 1 below.

Table 1. Examples of precise and approximate date and numeric values in Madeline 2.0
Attribute | Example Entry | Type of Value | Description
Date of Birth | 1972-12-11 | precise | A precise date of birth in ISO format: December 11, 1972. Dates in this format are supported in all versions of Madeline.
Date of Birth | 1943 | ranged approximation | Information on a relative's date of birth is limited to knowledge of the year of birth only. Madeline 2 automatically converts this entry internally to a range spanning the entire year, i.e., {1943-01-01 R 1943-12-31}.
Date of Last Exam | 1983-07 | ranged approximation | The patient's last exam occurred in July 1983. This is equivalent to saying sometime between July 1 and July 31, 1983. Madeline 2 stores the date internally as a range, {1983-07-01 R 1983-07-31}.
Date of Last Exam | 1995-06 R 1996-03-14 | ranged approximation | Based on various recollections, the son of an elderly patient deduces that he took his mother to the clinic sometime between the latter half of 1995 and the time he moved (on March 15, 1996). In the program, the letter 'R' can be used to delimit the start and end of a range. Madeline stores the date internally as an inclusive range, {1995-06-01 R 1996-03-14}.
Systolic Blood Pressure | 125 | precise | Precisely measured values, such as the systolic blood pressure value shown here, are supported in all versions of Madeline.
Number of Drusen | 7 R 12 | ranged approximation | In a medical record, the doctor recorded the number of drusen as a range from 7 to 12. Madeline stores the value internally as a range spanning from 7 to 12 inclusive.
Ocular Pressure | ~21 | unranged approximation | Although it is generally preferable to record an exact value or a range of possible values as shown in the previous two examples, Madeline 2 also supports the notion of an unranged approximate value entered as shown here.
Blood Sugar | . | missing value | Madeline accepts the dot character, ".", as a universal missing value indicator in numeric and other types of fields. The value is stored internally as #missing.
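
The date conventions in Table 1 can be illustrated with a short sketch. This is not Madeline's implementation (Madeline 2.0 is being written in C++); it is a minimal Python illustration, and the function names `parse_date_entry` and `_bounds` are hypothetical. It assumes that entries follow the YYYY, YYYY-MM, or YYYY-MM-DD forms shown in the table and that the letter 'R' is surrounded by single spaces.

```python
import calendar
import re
from datetime import date

# Matches YYYY, YYYY-MM, or YYYY-MM-DD (the forms shown in Table 1).
_PARTIAL = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$")


def _bounds(text):
    """Return the (earliest, latest) dates consistent with one date entry."""
    match = _PARTIAL.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized date: {text!r}")
    year = int(match.group(1))
    month = int(match.group(2)) if match.group(2) else None
    day = int(match.group(3)) if match.group(3) else None
    if month is None:
        # Year only: the range spans the whole year.
        return date(year, 1, 1), date(year, 12, 31)
    if day is None:
        # Year and month: the range spans the whole month.
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    exact = date(year, month, day)
    return exact, exact


def parse_date_entry(entry):
    """Return an inclusive (start, end) range for a precise or approximate date."""
    if " R " in entry:
        start_text, end_text = entry.split(" R ", 1)
        # Earliest date of the left side, latest date of the right side.
        return _bounds(start_text)[0], _bounds(end_text)[1]
    return _bounds(entry)
```

With this sketch, a precise date is simply a range whose start and end coincide, so precise and approximate entries can share one representation.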

When performing queries or calculating descriptive statistics (such as means or standard deviations), the user will have the option of including or excluding approximate values as appropriate. When approximate values are included in calculations, each range will be represented by its midpoint, that is, the simple mean of its lower and upper bounds (for example, 7 R 12 is treated as 9.5).
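
The include-or-exclude behavior for numeric values might look roughly like the following Python sketch. The `Value` class and the `parse_value` and `mean_of` functions are hypothetical names used only for illustration, not Madeline's API. One assumption goes beyond the text: an unranged approximation such as ~21 is treated as the single value 21 when it is included.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Value:
    low: Optional[float]   # None when the value is missing
    high: Optional[float]
    approximate: bool

    @property
    def missing(self):
        return self.low is None

    @property
    def point(self):
        """Midpoint of the range; equals the value itself when precise."""
        return (self.low + self.high) / 2


def parse_value(entry):
    """Parse '125', '7 R 12', '~21', or '.' into a Value."""
    text = entry.strip()
    if text == ".":
        return Value(None, None, False)
    if " R " in text:
        low, high = (float(part) for part in text.split(" R ", 1))
        return Value(low, high, True)
    if text.startswith("~"):
        number = float(text[1:])  # assumption: use the stated value as-is
        return Value(number, number, True)
    number = float(text)
    return Value(number, number, False)


def mean_of(entries, include_approximate=True):
    """Mean of the non-missing entries, optionally skipping approximations."""
    values = [parse_value(entry) for entry in entries]
    usable = [
        value.point
        for value in values
        if not value.missing and (include_approximate or not value.approximate)
    ]
    return mean(usable) if usable else None
```

For the entries "10", "7 R 12", and ".", the sketch averages 10 and 9.5 when approximations are included and uses only 10 when they are excluded; the missing value is skipped in both cases.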

For example, in the study of an adult onset disease, a researcher might want to treat families with relatively young age of onset differently from families with relatively high age of onset. Since age of onset is often acquired based on a patient's own memory of the past, self-reported ranges such as "40-45" may appear frequently. The researcher may consider these values perfectly valid for subsetting families even though, as approximations, they cannot be used directly as quantitative traits in, for example, a QTL analysis.

Madeline goes global: non-Latin scripts and numbers

Fig. 3. Madeline 2.0 goes global by providing extensive Unicode support, including transparent support for non-ASCII digits, such as Thai and Arabic as shown here.

Madeline 2.0 will provide substantial support for non-Latin scripts and numbers. Data files encoded using the Unicode UTF-8 transformation format will be fully supported (partial support is already provided in the current version of Madeline). It will be possible to enter and display data in almost all of the world's writing systems both on screen and in all graphical output (pedigree drawings, LOD plots, etc.). Specifically, all scripts supported by the latest version of the Unicode standard (version 4.1 as of July, 2005) will be supported.

Fig. 4. Transparent interpretation of international numbers. The constructors of the Number, Date, and Genotype classes provide transparent, built-in support for reading international digits.

In addition, Madeline 2.0 will have native support for reading non-ASCII digits directly in numeric values, dates, and genotypes (Figs. 3 and 4). There will be no need to transcribe digits into their ASCII (Western Arabic) equivalents when importing international data into Madeline.
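
As a rough illustration of the idea (not the Number, Date, or Genotype constructors themselves), the Unicode character database already records a decimal value for every digit character, so digits from any script can be mapped to ASCII before parsing. The helper name `to_ascii_digits` below is hypothetical; `unicodedata.decimal` is part of the Python standard library.

```python
import unicodedata


def to_ascii_digits(text):
    """Replace every Unicode decimal digit with its ASCII equivalent.

    Characters that are not decimal digits (separators, letters) pass
    through unchanged.
    """
    converted = []
    for character in text:
        value = unicodedata.decimal(character, None)
        converted.append(str(value) if value is not None else character)
    return "".join(converted)


# Thai digits for 1972 (U+0E51 U+0E59 U+0E57 U+0E52).
thai_year = "\u0e51\u0e59\u0e57\u0e52"

# Arabic-Indic digits for the ISO date 1972-12-11.
arabic_date = "\u0661\u0669\u0667\u0662-\u0661\u0662-\u0661\u0661"

ascii_year = to_ascii_digits(thai_year)
ascii_date = to_ascii_digits(arabic_date)
```

Because the lookup relies on the Unicode character database rather than a hand-written table, it covers every script that defines decimal digits without listing them one by one.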

Vastly Expanded File Format Support

In addition to supporting Madeline's native legacy flat file format, Madeline 2.0 will also support a host of XML-based file formats, including the legacy OpenOffice.org format, the upcoming OASIS OpenDocument format, the Microsoft Excel 2003 XML format, and the W3C XHTML format.

It will now be possible for multiple tables to be parsed from a single file. For example, you could now place a pedigree table on the first sheet or tab of a spreadsheet, a genetic map of the markers on the second sheet, and an allele frequency table on the third sheet. Madeline will be able to automatically recognize and parse all three tables from a single document.
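
To make the multiple-table idea concrete, here is a minimal Python sketch that pulls every table out of a single XHTML document. It illustrates the concept only and is not Madeline's parser: the function names, the sample document, and its column names are invented for the example, and nested tables are not handled.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://www.w3.org/1999/xhtml}'."""
    return tag.rsplit("}", 1)[-1]


def read_tables(xhtml_text):
    """Return every table in the document as a list of rows of cell text."""
    root = ET.fromstring(xhtml_text)
    tables = []
    for element in root.iter():
        if local_name(element.tag) != "table":
            continue
        rows = []
        for row in element.iter():
            if local_name(row.tag) != "tr":
                continue
            cells = [
                "".join(cell.itertext()).strip()
                for cell in row
                if local_name(cell.tag) in ("td", "th")
            ]
            rows.append(cells)
        tables.append(rows)
    return tables


# Hypothetical document holding a pedigree table and a marker map.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table><tr><th>FamilyId</th><th>IndividualId</th></tr>
<tr><td>F1</td><td>I1</td></tr></table>
<table><tr><th>Marker</th><th>Position</th></tr>
<tr><td>D1S1</td><td>12.5</td></tr></table>
</body></html>"""

tables = read_tables(SAMPLE)
```

Once the tables are separated, the header row of each one could be used to decide whether it is a pedigree table, a genetic map, or an allele frequency table.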

Transparent Decompression of Compressed Files

M>open "/data/linkage/xlcrd.data.bz2"
Bzip2-compressed file successfully opened with 1547 records as follows:
...

M>

Fig. 5. On-the-fly decompression. Madeline 2.0 will be capable of decompressing files transparently.

Madeline 2.0 will provide completely transparent support for decompressing zip, gzip, and bzip2-compressed files from local and network file systems.
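
One common way to achieve this kind of transparency is to inspect the first few bytes of a file (its "magic number") rather than trusting the file extension. The Python sketch below shows the idea for local files using the standard library; it is an assumption about how such a feature could work, not a description of Madeline's code. The function name `open_transparently` is hypothetical, and for zip archives the sketch simply reads the first member.

```python
import bz2
import gzip
import io
import zipfile


def open_transparently(path):
    """Return a binary file object yielding the uncompressed bytes of `path`."""
    with open(path, "rb") as handle:
        magic = handle.read(4)
    if magic.startswith(b"\x1f\x8b"):        # gzip signature
        return gzip.open(path, "rb")
    if magic.startswith(b"BZh"):             # bzip2 signature
        return bz2.open(path, "rb")
    if magic.startswith(b"PK\x03\x04"):      # zip local file header
        with zipfile.ZipFile(path) as archive:
            first_member = archive.namelist()[0]
            return io.BytesIO(archive.read(first_member))
    return open(path, "rb")                  # not compressed
```

The caller reads from the returned object in the same way regardless of how the file was stored, which is what makes the decompression transparent.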

Network Transparency

M>open "https://linkage.data.net/xlcrd.xml"
File successfully opened with 1547 records as follows:
...

M>

Fig. 6. Network transparency. Madeline 2.0 will be capable of opening files across the network using standard protocols like HTTP and HTTPS as seamlessly as files on a local file system.

Madeline 2.0 will be able to open data files directly across the internet and office intranets using both the standard and secure versions of the Hypertext Transfer Protocol (HTTP and HTTPS).
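
The underlying idea can be sketched as a single "open" routine that inspects the location and dispatches to either the file system or an HTTP client. The Python sketch below uses the standard library's `urllib`; the names `is_remote` and `open_location` are hypothetical and are not part of Madeline.

```python
from urllib.parse import urlparse
from urllib.request import urlopen

# Schemes treated as network locations in this sketch.
REMOTE_SCHEMES = {"http", "https"}


def is_remote(location):
    """True when `location` is an HTTP or HTTPS URL rather than a local path."""
    return urlparse(location).scheme.lower() in REMOTE_SCHEMES


def open_location(location):
    """Return a binary file-like object for a local path or an HTTP(S) URL."""
    if is_remote(location):
        return urlopen(location)
    return open(location, "rb")
```

Both branches return an object with a `read()` method, so the code that parses the data does not need to know where the bytes came from.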

MySQL Database Support

M>mysql query "select * from linkageData.diabetes where sibship_bmi>=37"
237 records retrieved as follows:
...

M>

Fig. 7. MySQL database support. Retrieve records from a MySQL database directly into Madeline 2.0.

In Madeline 2.0, you will be able to connect to MySQL data sources and retrieve data directly into Madeline using standard SQL queries.
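
The general pattern of running an SQL query and turning the result set into records can be sketched as follows. To keep the example self-contained, it uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for MySQL; the table contents, the `individual_id` column, and the `fetch_records` function are invented for illustration.

```python
import sqlite3


def fetch_records(connection, query, parameters=()):
    """Run a SELECT and return each row as a dict keyed by column name."""
    cursor = connection.execute(query, parameters)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# In-memory SQLite database standing in for a MySQL data source.
connection = sqlite3.connect(":memory:")
connection.execute(
    "create table diabetes (individual_id text, sibship_bmi real)"
)
connection.executemany(
    "insert into diabetes values (?, ?)",
    [("I1", 41.2), ("I2", 29.8), ("I3", 37.0)],
)

records = fetch_records(
    connection,
    "select * from diabetes where sibship_bmi >= ? order by individual_id",
    (37,),
)
```

Keying each record by column name means the retrieved rows can be handled in the same way as rows read from a flat file with a header line.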

Improved Pedigree Drawing

Due to the wide interest in graphically displaying family data used in human genetic studies, a fair number of non-commercial and commercial pedigree drawing packages have become available in recent years. Algorithmic investigations by researchers like Tores and Barillot have shown that ideal solutions for drawing pedigrees only exist for limited subsets of pedigrees [1]. As soon as multiple consanguineous loops, individuals with multiple mates, or several related families are present together in a pedigree, calculating an "optimal" pedigree drawing becomes a very difficult problem.

Although Madeline v. 0.935+ is very flexible in terms of the number of data labels that can be displayed on pedigree drawings and can draw hundreds of pedigrees with small or large numbers of people very quickly, its pedigree drawing algorithms were originally designed only to handle simple single-founding-pair pedigrees that form inverted "v"-shaped trees. Support for multiple spouses, consanguineous loops, and multiple founding pairs has been added to Madeline with varying degrees of success. For example, Madeline handles multiple spouses fairly well, but support for consanguineous loops remains buggy at best.

In Madeline 2.0, we plan to change this situation completely. We have gone back to the drawing board and have designed a new set of much more powerful algorithms that we expect will provide much more robust pedigree drawing in the new program. Although solving for "ideal" representations of complex pedigrees remains a thorny problem, we believe that we will be able to achieve very usable representations of complex pedigrees by employing algorithms that minimize a limited subset of undesirable features, such as extensive line crossings caused by consanguineous spouse connections.
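
To give a flavor of what "minimizing an undesirable feature" can mean, the Python sketch below counts line crossings between two adjacent rows of a drawing and searches for the ordering of the lower row with the fewest crossings. This is a generic technique from layered graph drawing, shown with a brute-force search that is only practical for small rows. It is not the algorithm planned for Madeline 2.0, and the function names and node labels are hypothetical.

```python
from itertools import combinations, permutations


def count_crossings(edges, upper_order, lower_order):
    """Count pairs of edges that cross between two horizontally ordered rows.

    Each edge is a pair (upper_node, lower_node). Two edges cross when their
    endpoints appear in opposite left-to-right order on the two rows.
    """
    upper = {node: index for index, node in enumerate(upper_order)}
    lower = {node: index for index, node in enumerate(lower_order)}
    crossings = 0
    for (u1, v1), (u2, v2) in combinations(edges, 2):
        if (upper[u1] - upper[u2]) * (lower[v1] - lower[v2]) < 0:
            crossings += 1
    return crossings


def best_lower_order(edges, upper_order, lower_nodes):
    """Try every ordering of the lower row and keep one with fewest crossings."""
    return min(
        permutations(lower_nodes),
        key=lambda order: count_crossings(edges, upper_order, order),
    )


# Two parental units (A, B) on the upper row and three children below.
edges = [("A", "x"), ("A", "y"), ("B", "z")]
upper_order = ["A", "B"]

initial = count_crossings(edges, upper_order, ["z", "x", "y"])
improved_order = best_lower_order(edges, upper_order, ["z", "x", "y"])
improved = count_crossings(edges, upper_order, improved_order)
```

In this small example, placing child z to the left of its siblings forces the line from B to cross both lines from A; reordering the lower row removes both crossings.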

Since we are still in the stage of creating the foundational class structure for the new program, a lot of work on pedigree drawing still remains for us to tackle. Stay tuned for more complete information in the future!

References

1. F. Tores and E. Barillot. The art of pedigree drawing: algorithmic aspects. Bioinformatics, 17(2):174-179, 2001.