
Madeline Version 0.935 Documentation

by Edward H. Trager <ehtrager@umich.edu> (June 2004)

© 2003, 2004 by the Regents of the University of Michigan ALL RIGHTS RESERVED

This software program is released under the GNU General Public License.



Section 1. Overview and Features
  What is Madeline?
  A Brief History of the Program
  Credits
  Supported Platforms
  Tutorial
  FUSION Study Support
  Running the Program Interactively and in Batch Mode
  Start Up File -- initial.script
  Tables
    Overview
    Supported Formats
    Madeline Format
    Storing Data Containing Accented Latin Letters or Non-Latin Characters
    Opening Madeline-formatted Tables
    Recognizing Unannotated Tables
    Tables in Legacy Formats
    Commands To Open Tables
    Pedigree Tables
  Tables continued ...
    Pedigree Tables Containing Allele Columns
    Genetic Map Tables
    Decomposed and Composed Tables
    Analysis Results Tables
  Data
    Supported Data Types
    Character Data
    Numeric Data
    Logical or Boolean Data
    Date Data
    Display of Dates
    Extent of Date Support
    Missing Value Support
    Categorization of Data
    Core Data Fields
    Interpretation of Core Data
    Database Field Naming Conventions
    Family Identifier
    Individual and Parental Identifiers
    Gender Data
    Monozygotic and Dizygotic Twin Data
    Affection Status Data
    Death Status Field
    Proband Field
  Data continued ...
    Liability Class Field
    Date of Birth and Death Data
    Genotype Data
    Estimation of Allele Frequencies from Genotype Data
    Phenotype Data
    Marking and Ordering Data Fields for Output
    Genetic Map Data
  Log and Error Reporting Features
  Display of Warning and Error Levels
  Pedigree Reconstruction and the Categorization of Individuals
  Data Classifications of Individuals
  Twin Management
  Consanguinity
  Multiple Mates
  Multiple Original Founders
  Data Evaluation And Management
  Tracking Inclusion and Exclusion of Pedigrees and Individuals
  Queries and Subsetting
  References
  References to Internal Information About An Individual
  References To Relatives
  Query and Subsetting Commands
  Pedigree Drawings
  Producing Output Files for Analysis

Section 1 Overview and Features

What is Madeline?

Madeline is a program for preparing, visualizing, and exploring human pedigree data used in genetic linkage studies. In addition to converting pedigree and marker data into various formats required by linkage analysis software, including Crimap, Genehunter, Allegro, Mendel, Merlin, PedCheck, and Simwalk2, Madeline also provides functionality for querying pedigree data sets and drawing pedigrees.

By combining a database engine with a software engine that understands the relationships between people in pedigrees, Madeline provides functions for investigating data on individuals and pedigrees in genetic linkage studies (Fig. 1).

Fig. 1. Madeline combines pedigree and database engines to provide useful functions for investigating and formatting data used in genetic linkage analyses.

Note that this release of the program, version 0.935, has numerous changes compared to the previous release (version 0.933). Even if you are well versed in the workings of version 0.933, you are advised to look carefully at the changes and new features described in this documentation.

A Brief History of the Program

As the old adage says, necessity is the mother of invention. When I first started this project a number of years ago, I had never written a recursive descent parser, never implemented balanced binary sorted trees, and I think at that time I had not even heard of the Postscript graphics language! There certainly was no master plan for this program, only the desire to get work done more easily with fewer data conversion headaches. Nor did I or my colleagues sit down and specify a coding standard for the program, much less a documentation standard. Whatever I could write in a reasonable amount of time that happened to work well enough to get the job done was just that, good enough.

Fortunately, a few early decisions were fundamentally correct, even if my implementations were less than perfect. The program began to take shape back in the days when all I knew was DOS and one of my first decisions was to use a DOS 32-bit protected mode library for the Borland compiler. Since the program used a 32-bit flat memory model from day one, it proved easy to port it over to Solaris and HP-UX when I finally got around to learning Unix.

Another early decision was to add an interactive command interface to the program. Early versions of the program required arguments passed from the command line, and it quickly became evident that hundreds of command-line arguments would be difficult to remember and would impair program flexibility. After reading a book on programming in C by Herbert Schildt which showed how to write a BASIC interpreter, as an experiment I decided to create a version of the program with an interactive command parser. That of course proved to be much nicer than the earlier non-interactive versions.

A third pivotal decision was to add pedigree drawing. When I initially suggested doing this, I remember being told that this was a hard problem which I should not waste my time on since other programs existed which already provided that functionality. In retrospect, I'm exceedingly glad I didn't listen to that advice, since the ability to display pedigree data graphically is probably one of the program's greatest strengths. The graphics were originally created using HP's PCL printer language; Postscript was introduced in a later revision of the program.

The program --and the programmer!-- have now begun to mature, but maturation has occurred, and continues to occur, as a slow process. A number of people in the labs where I have worked and elsewhere have found Madeline useful and a timesaver. As a result, I am encouraged to move the program closer to the ideal that I imagine it could be. Although the program and its code still have numerous shortcomings, you can still use these intermediate releases to help you complete your work more quickly, with less hassle, and fewer errors.

Finally, please note that this version of Madeline is released under the GNU General Public License which grants authors and users certain rights. I encourage you to read the license if you are not already familiar with it. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Credits

A lot of work goes into a program like this and I am indebted to many people for their help and suggestions. I would especially like to thank the following people:

Supported Platforms

Madeline v. 0.935 has been successfully compiled using at least the following hardware and compiler combinations:

Intel C++ v. 7.0 Compiler on:
  SuSE v. 8.1 Linux (2.4.19 i386 kernel)
GNU g++ v. 3.3.1 or v. 3.3.3 on:
  SuSE v. 9.1 Linux (2.6.4 i386 kernel) (g++ 3.3.3)
  SuSE v. 9.0 Linux (2.4.21 i386 kernel)
  FreeBSD v. 5.2.1 on i386 (g++ 3.3.3)
  Solaris (SunOS v. 5.8) on UltraAX-e2
GNU g++ v. 3.2.2 on:
  SuSE v. 7.3 Linux (2.4.16 i386 kernel)
  SuSE v. 8.1 Linux (2.4.19 i386 kernel)
  OpenBSD v. 3.2 (i386)
GNU gcc 2.95.3 on:
  RedHat v. 6.2 Linux (2.2 i386 kernel)
  SuSE v. 7.2 Linux (2.4 i386 kernel)
  SuSE v. 7.3 Linux (2.4 i386 kernel)
  OpenBSD v. 2.9 on i386
  FreeBSD v. 4.4 on i386
  Sun Solaris 8 (SunOS 5.8) on i386
  SunOS 5.6 on Sparc Ultra-1
  Solaris 8 (SunOS 5.8) on Sparc UltraSPARC-IIi
  Cygwin on Windows 2000 on i386
GNU gcc 2.95.2 on:
  Apple Macintosh OS X 10.1.5 on G4
GNU gcc 3.3 on:
  Apple Macintosh OS X 10.3.4 on G4
Sun Forte Workshop 6 Update 2 C/C++ v. 5.3 Compiler:
  Solaris 8 (SunOS 5.8) UltraAX-e2 32-bit executable
  Solaris 8 (SunOS 5.8) UltraAX-e2 64-bit executable
  Solaris 8 (SunOS 5.8) on Sparc UltraSPARC-IIi 32-bit executable
  Solaris 8 (SunOS 5.8) on Sparc UltraSPARC-IIi 64-bit executable
  SunOS 5.6 on Sparc Ultra-1 32-bit executable

Madeline now uses the GNU Autoconf system for automatic configuration. In light of this, we expect that the program can be built successfully on virtually all modern UNIX-like platforms.

Please review Notes on Installing Madeline Version 0.935 for more information about compiling and installing Madeline on specific platforms.

FUSION Study Support

Madeline was originally designed to meet the needs of the Finland-United States Investigation of NIDDM Genetics (FUSION) study. Because of this, Madeline has specific knowledge about FUSION study IDs, and a narrow subset of Madeline's functionality makes use of this knowledge. Paragraphs or headings preceded by "FUSION:" describe FUSION-specific functionality. This functionality is available only when FusionSupport is set to on. FusionSupport is off by default.

Note: All current development on Madeline focuses on providing general support for genetic linkage studies. FUSION-specific development ceased long ago.

Tutorial

Included with the distribution of the program is an extensive tutorial that will guide you through the entire linkage analysis process using Madeline interactively. The tutorial is located in the tutorial/Documentation subdirectory under the name MadelineTutorial.html. The tutorial subdirectory also contains all of the data files needed to complete the tutorial.

The tutorial can serve as a quick introduction to the program and give you a feel for how the program works. After completing the tutorial, you can return to the main documentation for an in-depth treatment of the program's features.

Running the Program Interactively and in Batch Mode

Instructions to Madeline are entered at a command prompt. Madeline's command interpreter is not sensitive to capitalization. However, capitalization is often used in this document for clarity of presentation.

Madeline can be run interactively or in batch mode (Fig. 2). To run Madeline interactively, type "madeline" at your system prompt and press return. Madeline's "M>" prompt will appear.

Batch files contain a sequence of Madeline commands that have been saved in a text file in ASCII or UTF-8 format. There are two ways to run batch files. The first way is to provide the name of a batch file as a command line parameter after the name of the program. The second way is to start Madeline interactively and then use the run command to execute the batch file. Madeline returns to interactive mode if an error occurs, or when a batch file terminates without a goodbye or quit command.

edtrager@retina:~> madeline                        <-- starting Madeline in interactive mode
 ______________________________________________________________________________
 ______________________________________________________________________________
  __    __       _       ______     _______   _          _   __    _   _______
 |  \  /  |     / \     |  ___  \  |  _____| | |        | | |  \  | | |  _____|
 |   \/   |    / ^ \    | |   \  \ | |___    | |        | | |   \ | | | |___
 | |\  /| |   / /_\ \   | |    | | |  ___|   | |        | | | |\ \| | |  ___|
 | | \/ | |  /  ___  \  | |___/  / | |_____  | |______  | | | | \   | | |_____
 |_|    |_| /__/   \__\ |_______/  |_______| |________| |_| |_|  \__| |_______|
 ______________________________________________________________________________
 ______________________________________________________________________________

                                 Version  0.935
                           Written by Edward H. Trager
                              <ehtrager@umich.edu>

COPYRIGHT  2003
THE REGENTS OF THE UNIVERSITY OF MICHIGAN
PORTIONS COPYRIGHT  1995 EDWARD H. TRAGER
ALL RIGHTS RESERVED

Madeline comes with ABSOLUTELY NO WARRANTY.  This is free software and
you are welcome to redistribute it under certain conditions.  For details,
type "license"

+-----------------------+-----------+-----------------------------------------+
| Variable or State Flag| Setting   | Description                             |
+-----------------------+-----------+-----------------------------------------+
| EXTERNAL PROGRAMS     |           |                                         |
+-----------------------+-----------+-----------------------------------------+
| Editor                | edith     | Program used to edit files              |
| PostscriptViewer      | gv        | Program used to view Postscript drawings|
| PrintCommand          | lpr       | System program used to print files      |
| WebBrowser            | mozilla   | Program used to view HTML documentation |
+-----------------------+-----------+-----------------------------------------+
| EVALUATION SETTINGS   |           |                                         |
+-----------------------+-----------+-----------------------------------------+
| EvaluationInterval    |   0.50 cM | Value to write to control file.         |
| OffEndDistance        |  10.00 cM | Value to write to control file          |
+-----------------------+-----------+-----------------------------------------+
| DRAWING SETTINGS      |           |                                         |
+-----------------------+-----------+-----------------------------------------+
| Color                 | ON        | Draw pedigrees in color                 |
| ReverseShading        | OFF       | Black is first icon shade               |
| DividedDrawings       | ON        | Paginate drawings by founding group     |
| HighlightRows         | ON        | Alternately highlight data on drawings  |
| LabelCreatedIndividual| ON        | Label virtuals created by Madeline      |
| Orientation           | AUTOMATIC | Automatic based on drawing dimensions   |
| PaperMargin           | 1.00 cm   | Margin (in cm) on all four sides        |
| PaperSize             | USLETTER  | 8.5 x 11.0 inches                       |
+-----------------------+-----------+-----------------------------------------+
| OTHER SETTINGS        |           |                                         |
+-----------------------+-----------+-----------------------------------------+
| AutoExclude           | ON        | Exclude pedigrees automatically         |
| AutoCheckInheritance  | ON        | Check inheritance on OPEN               |
| ConsoleHighlights     | ON        | Use bold/color highlights on console    |
| Delimiter             | TAB       | Delimiter for tables and other output.  |
| FusionSupport         | OFF       | FUSION customizations disabled          |
| HaplotypeDisplay      | OFF       | Display genotypes delimited with "/"    |
| Language              | American E| Language convention used for date, time |
| MapDetails            | OFF       | LIST MAP summary display                |
| SaveAlleleFrequencies | OFF       | Calculate new frequencies on next OPEN  |
| Time                  |           | Friday, December 5, 2003                |
| Verbosity             | VERBOSE   | All messages are printed to the console |
+-----------------------+-----------+-----------------------------------------+
M>
M>quit                                             <-- entering a command in interactive mode
Releasing resources ...
Goodbye!
edtrager@retina:~>madeline chromosome20.script     <-- starting Madeline in batch mode
  open 'linkage/chr20.data.mfh'                    <-- executing first batch command
  Calculating allele frequencies for 7. D20S173... 
  Calculating allele frequencies for 10. D20S889... 
  Calculating allele frequencies for 13. D20S898... 
  ...
Fig. 2. Starting Madeline. Madeline can be run either interactively or in batch mode.

Start Up File -- initial.script

You can set parameters and run commonly needed commands automatically each time Madeline is started by providing a script file called "initial.script". Madeline will first look for a local version of initial.script in the current working directory from which Madeline is invoked. Failing to find initial.script there, Madeline will look in the share/madeline/ subdirectory under the directory prefix where Madeline was installed. For example, if Madeline was installed in /usr/local, then the program will look for /usr/local/share/madeline/initial.script.
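The search order described above can be sketched in a few lines of Python. This is only an illustration of the lookup rule as stated in the text, not Madeline's own code; the function name find_initial_script and its two parameters are hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_initial_script(cwd: Path, prefix: Path) -> Optional[Path]:
    """Return the initial.script that would be found first, or None.

    Search order (as described in the text): the current working
    directory, then <prefix>/share/madeline/.
    """
    candidates = [
        cwd / "initial.script",
        prefix / "share" / "madeline" / "initial.script",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # With an install prefix of /usr/local, the fallback location is
    # /usr/local/share/madeline/initial.script.
    print(find_initial_script(Path.cwd(), Path("/usr/local")))
```

A local initial.script therefore always shadows the installed one, which is convenient for per-project settings.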

Any commands that can normally be invoked on the command line or in a batch file can be placed into initial.script. Assignments to specify default field names or environmental settings are typically placed in initial.script (Fig. 4).

//
// Typical initial.script file:
//

//
// Environment settings:
//
quiet
set language to English
Editor="emacs"
PostscriptViewer="gv"
//
// Pedigree drawing-specific settings:
//
set color off
set PaperSize to A4
// margin in centimeters:
set PaperMargin to 1.5
set orientation to automatic
//
// Pedigree database-specific settings:
//
GenderField='GENDER'
FamilyIDField='FAMILY'
IndividualIDField='INDIVIDUAL'
//
// Map standard missing value indicators:
//
NumericMissingValue[1]=-1
NumericMissingValue[2]=-9
//
// Map database-specific settings:
//
PositionField="POSTN"
OrdinalField ="ORDNL"
Fig. 4. Example initial.script file.

Note: Starting with Madeline v. 0.933, it is recommended that site-wide defaults be compiled directly into Madeline by customizing the config.h source file generated by the configure script run by you or your system administrator when Madeline is installed. Many (but not all) parameters can be configured in config.h. Remaining settings can be specified in the initial.script as necessary. If you don't require any site-specific customizations, you can just leave the global default initial.script as is.

Tables

Overview

Madeline processes data stored in tables. A database table is a rectangular array of data. A record is a row in the array. A field is a column in the array. Each record contains one or more identifiers or keys which identify the entity, and the data -- all the measured variables -- for the entity. The measured entity may be an individual in a pedigree, a genetic marker, a position along a genetic map, or something else.

Supported Formats

The program currently supports the following table formats:

  1. Madeline column-aligned, space-delimited ASCII and UTF-8 flat files. This is the recommended format.
  2. FoxPro and other generic xbase databases (such as dBaseIII, IV).
  3. Visual FoxPro which is a variant on the xbase structure.
  4. SAS transport file format (theoretically regardless of platform of origin).

Note: Of these four formats, we now recommend using only the Madeline native format because it is open, non-proprietary, human-readable, and editable in any text editor (in the case of UTF-8 files, in any UTF-8 capable text editor). Although supported in versions 0.933 and 0.935, the legacy xbase and SAS transport formats are deprecated and we may not support them at all in future versions of the program.

The structure of the Madeline format is described below. Sample PHP code for creating Madeline files from database tables is also provided.

Madeline Format

A Madeline-formatted table is a human-readable flat file containing ASCII or UTF-8 characters. It consists of a header block, at least one blank line, and a data block.

The following figure illustrates how the example "relationships.dat" data file included in the software distribution conforms to this structure:


Madeline Table Format. Tables in the Madeline format are flat files divided into a header block consisting of one or more lines containing column labels and optional type designators, and a data block consisting of even-length records divided into space-delimited data columns. (The vertical blue arrow illustrates how data in a single column can, if necessary, contain embedded white space, as long as that white space does not stretch uninterrupted from the first to the last record: vertically uninterrupted white space delimits columns).

The header contains column labels and optional column type designators.

Column labels should be CAPITALIZED. Column labels are separated by any amount of white space and can span as many lines as necessary. Each line in the header can contain one or more column labels (in the figure above, seventeen column labels span fifteen header lines).

The order of the column labels from left-to-right and top-to-bottom indicates the order of the columns in the data block. Lines in the header can be of varying lengths.

A column type designator consists of a single capital letter following a column label. Any amount of white space can be used between a column label and its type designator. The following column type designators are recognized:

Column Type Designators in Madeline Tables

  C -- Designates (1) that a column contains character (string) data, or (2) that a column containing numeric data be treated as string data (individual and pedigree identifiers, even if they are completely numeric, must be treated as string identifiers in Madeline). Example: STUDYID C
  X -- Designates the gender (sex) column. The gender column may alternatively be designated with a "C" for character codes or with an "N" for numeric gender codes. Example: SEX X
  N -- Designates that a column containing numeric values be treated as numeric data. Example: AGE_DX N
  D -- Designates that a column contains date data in ISO-8601 format (four-digit year followed by a two-digit month and finally a two-digit day). Madeline permits any of the usual date delimiters (".", "-", or "/") between YYYY, MM, and DD. Example: DOB D
  G -- Designates that a column contains genotypes as numeric allele labels separated by a forward slash "/" character. Genotype columns can alternatively be designated with a "C": the program will automatically recognize character columns containing genotypes. Example: D12S4321 G
  A -- Designates that a column contains one of the two alleles that constitute a genotype. Allele columns must exist as identically-named pairs. Example: D12S4321 A  D12S4321 A
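As an illustration of the date layout accepted in "D" columns, the following Python sketch parses a year-month-day value written with any of the three delimiters. It is an assumption-laden sketch rather than Madeline's parser: the function name parse_table_date is hypothetical, and whether Madeline itself accepts a mix of delimiters within one value is not stated in this document (the sketch does).

```python
import re
from datetime import date

# Four-digit year, two-digit month, two-digit day, each pair separated
# by ".", "-", or "/".
_DATE_RE = re.compile(r"^(\d{4})[./-](\d{2})[./-](\d{2})$")


def parse_table_date(text: str) -> date:
    """Parse a YYYY-MM-DD style value; raise ValueError if it does not fit."""
    match = _DATE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a year-month-day date: {text!r}")
    year, month, day = (int(part) for part in match.groups())
    # date() itself rejects impossible dates such as month 13.
    return date(year, month, day)
```

A value written month-first, such as 12/05/2003, is rejected because the first group must be a four-digit year.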

At least one blank line must follow the header in order to separate the header block from the data block.
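The header rules above (labels in order, optional single-letter designators, a blank line before the data) can be sketched as a small Python reader. This is a simplified illustration, not the logic of the recognize command; the names split_blocks and parse_header are hypothetical, and the sketch assumes no column is itself labelled with a single designator letter.

```python
from typing import List, Optional, Tuple

TYPE_DESIGNATORS = {"C", "X", "N", "D", "G", "A"}


def split_blocks(text: str) -> Tuple[List[str], List[str]]:
    """Split a table into header lines and data lines at the first blank line."""
    lines = text.splitlines()
    i = 0
    while i < len(lines) and not lines[i].strip():
        i += 1  # skip blank lines before the header
    header = []
    while i < len(lines) and lines[i].strip():
        header.append(lines[i])
        i += 1
    data = [line for line in lines[i:] if line.strip()]
    return header, data


def parse_header(header_lines: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Return (label, designator) pairs in column order; designator may be None."""
    columns: List[Tuple[str, Optional[str]]] = []
    for token in " ".join(header_lines).split():
        follows_untyped_label = bool(columns) and columns[-1][1] is None
        if token in TYPE_DESIGNATORS and follows_untyped_label:
            columns[-1] = (columns[-1][0], token)
        else:
            columns.append((token, None))
    return columns
```

Because labels are read left-to-right and top-to-bottom, the position of each pair in the returned list is the position of the column in the data block.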

The data block consists of records (rows) of equal length divided into space-delimited data columns.

Each record (or row) contains the identifiers and data measured for one entity. For example, a pedigree table contains one row for each individual in a family, whereas a map table contains one row for each marker in a genetic map.

The identifiers and data for each record are formatted in vertically aligned columns separated by white space.

Note #1: Column type designators are technically optional in most (but not all) cases. The program contains code to automatically detect column types. Certain type promotions, such as from "C" to "G" and from "C" to "X" are permitted and performed automatically when required. The primary exception occurs when completely numeric individual and pedigree identifiers are used: the program automatically detects the columns as being "N" numeric, but you MUST cast them as "C" because the program requires string identifiers for individuals and pedigrees. The best practice is to include column type designators since this increases file readability for humans and prevents surprises.
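Following the best practice in Note #1, a script that generates a table can always write the designators explicitly. The Python sketch below is one hedged way to do so for ASCII data; it is not the sample code distributed with Madeline, the function name format_madeline_table is hypothetical, and it assumes every value is a non-empty string without embedded spaces.

```python
from typing import List, Sequence, Tuple


def format_madeline_table(columns: Sequence[Tuple[str, str]],
                          rows: Sequence[Sequence[str]]) -> str:
    """Format (label, designator) columns and string rows as a flat table.

    The header holds one "LABEL T" pair per line, a blank line follows,
    and every data row is padded so that all rows have the same length.
    """
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("every row needs exactly one value per column")
    # Each column is as wide as its widest value.
    widths: List[int] = [
        max((len(row[i]) for row in rows), default=1)
        for i in range(len(columns))
    ]
    header = [f"{label} {designator}" for label, designator in columns]
    body = [
        " ".join(value.ljust(width) for value, width in zip(row, widths))
        for row in rows
    ]
    return "\n".join(header + [""] + body) + "\n"
```

Note that completely numeric identifiers are passed in as strings and declared "C", which is what the program requires for individual and pedigree identifiers.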

Note #2: A problem that can occur with hand-edited data files (as opposed to those generated by a script, program, or database system) is embedded tabs, or extra spaces or tabs appended to the ends of lines, which are invisible when viewed in a typical editor. These make the data block non-rectangular, which is not allowed.

Madeline's recognize command now specifically looks for embedded tabs and inconsistent data row lengths caused by extra tabs or space characters at the ends of rows. The rectify command will fix both types of problems in most cases. You should also open the data file using a good file editor such as Edith and use the column-highlighting feature (i.e., CTRL+<Left Mouse Button>) to select white space after the last data column. This quickly reveals whether the lines are all the same length or -- as is quite likely going to be the case in problematic files -- not (See figure below). If they are not, a simple CTRL-X in Edith trims off all of the selected segments.

Having trouble getting Madeline to recognize your data file? Often the culprit is hidden spaces and/or tab characters trailing after the last column of data, making it non-rectangular. Another culprit could be tab characters embedded within rows (not illustrated). The rectify command will handle both types of problems. Alternatively, you can use Edith or a similar file editor to display embedded tabs and terminal tabs and spaces. Hold down the CTRL key while pressing the left mouse button to select the trailing white space and remove it (CTRL-X).
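If you prefer to check a file before handing it to Madeline, the two problems described above are easy to detect with a short script. The following Python sketch is not what the recognize or rectify commands do internally; the function name find_layout_problems is hypothetical, and it assumes the most common row length is the intended one.

```python
from collections import Counter
from typing import List


def find_layout_problems(data_lines: List[str]) -> List[str]:
    """Report tab characters and rows whose byte length differs from the norm."""
    problems: List[str] = []
    if not data_lines:
        return problems
    # Lengths are measured in UTF-8 bytes, since columns align on bytes.
    lengths = [len(line.encode("utf-8")) for line in data_lines]
    expected, _ = Counter(lengths).most_common(1)[0]
    for number, (line, length) in enumerate(zip(data_lines, lengths), start=1):
        if "\t" in line:
            problems.append(f"line {number}: contains a tab character")
        if length != expected:
            problems.append(f"line {number}: length {length}, expected {expected}")
    return problems
```

Pass only the data block (the lines after the blank line that ends the header), since header lines may legitimately have varying lengths.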

Storing Data Containing Accented Latin Letters or Non-Latin Characters

If your data contain strings in non-English languages that use accented Latin letters (such as "ç" and "é" in French, "ñ" in Spanish, and "ü" in German) or non-Latin scripts (Cyrillic, Japanese, Chinese, etc.), then your data must be encoded in the Unicode UTF-8 format and stored using the Madeline table format. You will also need to run Madeline in a Unicode-capable terminal emulator under a UTF-8 locale.

What does this mean? For European users, it means that Madeline does not support any of the legacy ISO-8859-x character sets, not even ISO-8859-1. For other users, it means that your country's legacy character sets, whether they be KOI-8, JIS, GB18030, or something else, are not supported. The primary issue is that many people's computers are still set to use some legacy encoding system. Fortunately, major Linux distributions are now enabling UTF-8 locales by default. If you are using a recent release of SuSE or Redhat in North America or Europe, your system will be set to use UTF-8 by default. However, if you are using a recent Linux distribution in East Asia (China, Japan, etc.), you should check your locale settings. This document should provide you with most of the information you need to switch over to Unicode under Linux or a similar *nix-based system.

The rules for formatting a UTF-8 table in Madeline format are the same as those for ASCII. In particular, note that columns in the data block must be aligned on byte boundaries with white space (ASCII 0x0020) separating columns. Trying to format such a file manually is not recommended. Instead, store your data in a database and use a scripting language like Perl or PHP to extract the data into the correct format.
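The difference between byte alignment and character alignment is easy to get wrong in a script, because most string-padding functions count characters. The Python sketch below pads by encoded length instead; it is only an illustration, and the names byte_width and pad_to_bytes are hypothetical.

```python
from typing import Iterable


def byte_width(values: Iterable[str]) -> int:
    """Width of a column in UTF-8 bytes: the longest encoded value."""
    return max(len(value.encode("utf-8")) for value in values)


def pad_to_bytes(value: str, width: int) -> str:
    """Pad value with ASCII spaces until it occupies width UTF-8 bytes."""
    used = len(value.encode("utf-8"))
    if used > width:
        raise ValueError(f"{value!r} needs {used} bytes, column has {width}")
    return value + " " * (width - used)


if __name__ == "__main__":
    names = ["José", "Ann"]
    width = byte_width(names)
    for name in names:
        padded = pad_to_bytes(name, width)
        # Character counts differ between rows; byte counts do not.
        print(repr(padded), len(padded), len(padded.encode("utf-8")))
```

This is why a correctly formatted UTF-8 table can look ragged in an editor: an accented letter occupies one display cell but two or more bytes.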

See the examples/utf8/utf8.data file for an example of a properly formatted UTF-8 file:


Data files in the Madeline flat file format can contain data in any of the world's scripts encoded in Unicode UTF-8. The last column in the file appears unaligned, but is actually aligned on byte boundaries, not on displayed character boundaries.

Opening Madeline-formatted Tables

Before you can use a flat-file table, you must first run the recognize command, which creates a small, complementary binary file with a ".mfh" (Madeline File Header) extension. This complementary file stores meta information about the table (names and number of columns, number of rows, etc.) in a binary format which the program uses to optimize table access. To open the table, specify the complementary ".mfh" file name in place of the original data file.
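The recognize-then-open sequence lends itself to a generated batch file. The Python sketch below only builds the text of such a file and the command line to run it; the function names are hypothetical, the quoting style is copied from the examples in this document, and in practice you may also need to set field name variables before the open command succeeds.

```python
from typing import List


def header_file_name(data_file: str) -> str:
    """Name of the complementary header file: the data file name plus ".mfh"."""
    return data_file + ".mfh"


def pedigree_open_script(data_file: str) -> str:
    """Text of a batch file that recognizes a flat file and then opens it."""
    lines = [
        "// Generated batch file: recognize a flat file, then open it",
        f"recognize '{data_file}'",
        f"open '{header_file_name(data_file)}'",
        # Without a final quit, Madeline returns to interactive mode.
        "quit",
    ]
    return "\n".join(lines) + "\n"


def batch_command(script_path: str) -> List[str]:
    """Argument list for running a batch file from the system prompt."""
    return ["madeline", script_path]


if __name__ == "__main__":
    print(pedigree_open_script("linkage/chr20.data"), end="")
```

The script text can be written to a file and passed to Madeline as a command line parameter, or executed from the "M>" prompt with the run command.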

Recognizing Unannotated Tables

An unannotated data file contains no header describing the fields. What do you do if you receive an unannotated data file from a colleague or client? Let Madeline do some of the work for you!

The recognize command contains code to automatically detect columns in unmarked files. The program can often even identify the core gender, individual, father, and mother fields required for pedigree reconstruction. This can save time and tedium. Consider the following excerpt from the "unannotated.data" file included in the examples subdirectory of the software distribution:

NT5641 M AB0115 FM_012 A A AB0116 190/202 166/169 0/0     172/175 154/154 65 89.94 
NT5661 F AB0147 FM_012 A A AB0148 208/211 153/157 201/207 160/175 154/160 50 81.03 
NT5675 F AB0119 FM_012 A A AB0120 205/211 153/166 204/207 166/175 157/157 73 82.06 
NT5676 F AB0123 FM_012 A A AB0124 190/208 156/166 201/207 175/177 154/157 63 88.76 
NT5678 F AB0140 FM_012 A I NT5676 190/208 157/166 201/207 175/177 154/157 60 65.86 
NT5679 F AB0115 FM_012 A I AB0116 202/211 157/169 207/207 172/175 154/154 78 82.92 
NT5724 F AB0135 FM_012 A I AB0136 205/205 166/169 204/210 166/172 154/157 69 71.82 
NT5728 F AB0113 FM_012 A A AB0114 190/190 156/169 204/207 160/175 148/157 64 84.55 
NT5749 F AB0121 FM_012 A A AB0122 205/211 157/157 0/0     166/175 154/160 .  87.46 
NT5752 F AB0121 FM_012 A I AB0122 205/211 153/157 204/207 166/175 148/154 83 88.57 
NT5753 F NT5641 FM_012 U I AB0132 .       .       .       .       .       .  62.94 
NT5757 F AB0130 FM_012 A A AB0131 202/211 157/169 201/207 166/172 154/160 55 70.16 
NT5790 F AB0138 FM_012 U I AB0139 208/208 157/160 207/207 172/172 154/154 .  71.18 
  . . .

Unannotated Table. Having trouble deciphering which column is which? Let Madeline help you!

Running the recognize command on this file produces the following:

M>recognize 'unannotated.data'
Recognizing file "unannotated.dat" to "unannotated.dat.mfh" ...
Skipping a total of 1 line at top.
There are 0 non-empty header lines and 54 data lines.
Data records are 83 bytes long.
The gender field has been identified and will appear in the ".run" file
The individual, father, and mother ID fields have been identified
and will appear in the ".run" command file

 # . Field Name  Start End   Length Prec. Space Type
---- ----------- ----- ----- ------ ----- ----- -----
  1. INDIVIDUAL      1     6     6     0     1 C
  2. GENDER          8     8     1     0     1 X
  3. FATHER         10    15     6     0     1 C
  4. CHAR_003       17    22     6     0     1 C
  5. CHAR_004       24    24     1     0     1 C
  6. CHAR_005       26    26     1     0     1 C
  7. MOTHER         28    33     6     0     1 C
  8. GENO_001       35    41     7     0     1 G
  9. GENO_002       43    49     7     0     1 G
 10. GENO_003       51    57     7     0     1 G
 11. GENO_004       59    65     7     0     1 G
 12. GENO_005       67    73     7     0     1 G
 13. NUME_001       75    76     2     0     1 N
 14. NUME_002       78    82     5     2     1 N
Binary recognition header file (".mfh") written.
   --> If this is a pedigree file,    type 'open "unannotated.dat.mfh" '.
   --> If this is a genetic map file, type 'load "unannotated.dat.mfh"'

The template batch file unannotated.dat.run has been created.

NOTE: The ".run" file contains commands and parameters to assist
      you in opening a flat file database, but generally requires
      editing before use.
M>

Clearly the program cannot perform magic. For example, the program was not able to determine the FamilyIDField (the fourth column in the table) and there are other limitations to what the program can do. Nevertheless, being able to correctly identify the core pedigree structure and genotype columns in an unannotated data file can save you from a tedious manual investigation of the data.
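The column boundaries in the listing above follow from the rule that vertically uninterrupted white space delimits columns. The Python sketch below illustrates that rule only; it is not the recognize command's implementation, the function name find_columns is hypothetical, and it works on characters, so it assumes ASCII data.

```python
from typing import List, Optional, Tuple


def find_columns(rows: List[str]) -> List[Tuple[int, int]]:
    """Return 1-based (start, end) positions of each column.

    A position separates columns only if it is a space in every row, so
    white space embedded in some rows of a column does not split it.
    """
    width = max(len(row) for row in rows)
    padded = [row.ljust(width) for row in rows]
    columns: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for pos in range(width):
        blank_in_all_rows = all(row[pos] == " " for row in padded)
        if not blank_in_all_rows and start is None:
            start = pos  # a new column begins here
        elif blank_in_all_rows and start is not None:
            columns.append((start + 1, pos))  # column ended at pos - 1
            start = None
    if start is not None:
        columns.append((start + 1, width))
    return columns
```

A short genotype such as 0/0 padded with spaces stays inside its seven-character column because other rows contain characters at those positions.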

Tables in Legacy Formats

When using xbase or SAS transport formats, Madeline column type designators like X, G, and A do not exist. For legacy formats, use only character, numeric, and date column types. For example, the gender attribute can be stored in a character field coded using "M" and "F". Genotypes can also be stored in character fields. Floating-point and integer fields are cast to Madeline's double-precision numeric floating-point type. Also avoid the logical and date-time field types found in dBase and FoxPro, as Madeline won't understand these.

Commands To Open Tables

The command used to open or manipulate a table depends on the type of table being processed:

Madeline's database engine detects operating system and file byte-ordering at run time, permitting tables from PCs to be opened on UNIX workstations, and vice versa.

The different types of tables are described below in turn.

Tables Pedigree Tables

In a pedigree table, each row (record) contains the data for one individual. In Madeline, the names of the family and individual ID fields are stored in variables called FamilyIDField and IndividualIDField, respectively. Basic pedigree reconstruction additionally requires knowledge of the father (FatherIDField), mother (MotherIDField), and gender (GenderField) of each individual. Together, these field variables comprise the five core fields that must be present in every pedigree table:

  1. FamilyIDField -- key identifier
  2. IndividualIDField -- key identifier
  3. FatherIDField -- required for pedigree reconstruction
  4. MotherIDField -- required for pedigree reconstruction
  5. GenderField -- required for pedigree reconstruction

The remaining identifiable data fields in a pedigree table fall into two groups: (1) phenotype and (2) genotype. Together with the core fields, this means that Madeline automatically classifies every identifiable field in a pedigree table into one of three categories. Each identifiable field is tagged with one of the single-letter identifiers shown below:

  1. C -- Core fields
  2. P -- Phenotype fields
  3. G -- Genotype fields

Core fields are identified by matching field names against the names stored in the field name variables (e.g., FamilyIDField, IndividualIDField, GenderField). Genotype fields are identified by scanning the data. Remaining fields are classified as phenotype fields.

Field classifications are shown in the figure below, illustrating a portion of a typical pedigree table:

Pedigree Tables. All identifiable fields in a pedigree table are classified into one of three categories in Madeline: "C" for core fields, "P" for phenotype fields, and "G" for genotype fields.

Note: The complete set of core fields consists of the five required core fields shown above, plus optional core fields such as AffectionStatusField, MZTwinField, and DateOfBirthField. See Core Data Fields for a complete listing.

Fields containing only missing value indicators that cannot be categorized as "C", "P", or "G" will be marked with an asterisk, "*". Phenotype "P" fields can be tagged by the user as being covariate "V" fields using the toggle command. Phenotype fields are never automatically classified as covariate "V" fields.

Tables Pedigree Tables Containing Allele Columns

When a pedigree table containing allele "A" columns instead of genotype columns, such as a Linkage file, is opened in Madeline, the paired and identically-named allele columns automatically appear as single genotype "G" fields.

Below is an excerpt from the "linkage.ped" data set distributed with the program. Note that while the data block is in the unchanged Linkage format, Madeline still requires that you provide a conformant header in Madeline format to identify the columns:

FAMID    C
STUDYID  C
FATHER   C
MOTHER   C
SEX      N
AFFECTED N
MARKER1  A
MARKER1  A
MARKER2  A
MARKER2  A
MARKER3  A
MARKER3  A
MARKER4  A
MARKER4  A
MARKER5  A
MARKER5  A
MARKER6  A
MARKER6  A


F0021 K0001A 0      0      1 0      0   0    0   0    0   0    0   0    0   0    0   0  
F0021 K0001B 0      0      2 0      0   0    0   0    0   0    0   0    0   0    0   0  
F0021 K00158 K0001A K0001B 1 1      1   4    1   3    1   1    3   4    1   3    1   3  
F0021 K00159 K0001A K0001B 1 0      1   3    1   2    1   1    1   2    1   2    1   2
. . .

Here are the commands to recognize and open this file. Note how the pairs of allele columns are recognized as genotype columns:

M>recognize "linkage.ped"
Recognizing file "linkage.ped" to "linkage.ped.mfh" ...
  ...
 # . Field Name  Start End   Length Prec. Space Type
---- ----------- ----- ----- ------ ----- ----- -----
  1. FAMID           1     5     5     0     1 C
  2. STUDYID         7    12     6     0     1 C
  3. FATHER         14    19     6     0     1 C
  4. MOTHER         21    26     6     0     1 C
  5. SEX            28    28     1     0     1 N
  6. AFFECTED       30    30     1     0     6 N
  7. MARKER1        37    41     5     0     4 G
  8. MARKER2        46    50     5     0     4 G
  9. MARKER3        55    59     5     0     4 G
 10. MARKER4        64    68     5     0     4 G
 11. MARKER5        73    77     5     0     4 G
 12. MARKER6        82    86     5     0     2 G
  ...
M>// Because the LINKAGE file uses zeros to represent
M>// missing parents, we need to add zero to the 
M>// CharacterMissingValue[] array:
M>list CharacterMissingValue
CharacterMissingValue has 5 elements:
CharacterMissingValue[ 1]="."
CharacterMissingValue[ 2]="/"
CharacterMissingValue[ 3]="0/0"
CharacterMissingValue[ 4]="0/ 0"
CharacterMissingValue[ 5]="0/  0"
M>cmv[6]="0"
M>// Madeline's GenderStatus[] and AffectionStatus[]
M>// arrays now contain LINKAGE code mappings by default, so 
M>// we can immediately open the file:
M>open "linkage.ped.mfh"
  6. AFFECTED has 3 levels.
Calculating allele frequencies for   7. MARKER1...
Calculating allele frequencies for   8. MARKER2...
Calculating allele frequencies for   9. MARKER3...
Calculating allele frequencies for  10. MARKER4...
Calculating allele frequencies for  11. MARKER5...
Calculating allele frequencies for  12. MARKER6...
Pedigree table "linkage.ped.mfh" opened with        13 records
  ...
  1.FAMID      Co__1    5.SEX        Co__5    9.MARKER3    Go__3
  2.STUDYID    Co__2    6.AFFECTED   Co__6+  10.MARKER4    Go__4
  3.FATHER     Co__3    7.MARKER1    Go__1   11.MARKER5    Go__5
  4.MOTHER     Co__4    8.MARKER2    Go__2   12.MARKER6    Go__6
M>
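
The pairing of allele columns can be pictured with a short Python sketch. This is an illustration only, not Madeline's source code: the function name pair_allele_columns is hypothetical, and the sketch assumes that two adjacent "A" columns with the same name form one genotype written as "allele1/allele2".

```python
def pair_allele_columns(names, types, row):
    """Collapse adjacent, identically named "A" columns into one "G" field."""
    out_names, out_types, out_values = [], [], []
    i = 0
    while i < len(names):
        is_pair = (
            types[i] == "A"
            and i + 1 < len(names)
            and types[i + 1] == "A"
            and names[i + 1] == names[i]
        )
        if is_pair:
            # Two allele columns become a single genotype value "a/b".
            out_names.append(names[i])
            out_types.append("G")
            out_values.append(f"{row[i]}/{row[i + 1]}")
            i += 2
        else:
            out_names.append(names[i])
            out_types.append(types[i])
            out_values.append(row[i])
            i += 1
    return out_names, out_types, out_values


# First ten columns of the third data row of the "linkage.ped" excerpt.
names = ["FAMID", "STUDYID", "FATHER", "MOTHER", "SEX", "AFFECTED",
         "MARKER1", "MARKER1", "MARKER2", "MARKER2"]
types = ["C", "C", "C", "C", "N", "N", "A", "A", "A", "A"]
row = "F0021 K00158 K0001A K0001B 1 1 1 4 1 3".split()

print(pair_allele_columns(names, types, row))
```

In this sketch the four allele columns collapse to two genotype fields, MARKER1 and MARKER2, which mirrors how twelve allele columns appear as six "G" fields in the session above.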

Tables Genetic Map Tables

A map table contains map information related to markers on one or more chromosomes. The key fields in a map table are:

  1. MapChromosomeField -- chromosome on which marker appears
  2. MapMarkerField -- name of the marker

The data fields in a map table are:

  1. MapPositionField -- map position from p terminus in centiMorgans
  2. MapOrdinalField -- ordinal ranking of the marker in the map from 1 to n where n is the number of markers mapped for the given chromosome

The following optional fields may also be present:

  1. MapFemalePositionField -- map position for a female-specific map
  2. MapMalePositionField -- map position for a male-specific map
  3. MapPositionBPField -- physical map position in base pairs. This is defined but not currently used in the program.

An example of a map table is shown below:

MARKERNAME
CHROMOSOME
ORDINAL
POSITION

D17S944  17  1   82.6
D17S949  17  2   93.3
D17S1304 17  3   94.0
D17S1351 17  4   96.0
D17S1352 17  5   98.1
D17S1807 17  6   99.2
D17S929  17  7   99.2
D17S1301 17  8  100.0
D17S785  17  9  103.5
D17S674  17 10  105.7

A marker map table.

Note: A single map table can contain marker map data for multiple chromosomes. A single table is easier to maintain than multiple tables.
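
If a map source provides positions but no ordinal ranking, the ordinals can be derived before the table is given to Madeline. The Python sketch below is one possible way to do this outside the program; the function name assign_ordinals is hypothetical, and the sketch assumes that markers sharing a position keep their input order.

```python
def assign_ordinals(markers):
    """markers: list of (name, chromosome, position_cM) tuples.

    Returns (name, chromosome, ordinal, position) tuples, with ordinals
    running from 1 to n separately within each chromosome.
    """
    # sorted() is stable, so markers with equal positions keep input order.
    ordered = sorted(markers, key=lambda m: (m[1], m[2]))
    counts = {}
    result = []
    for name, chromosome, position in ordered:
        counts[chromosome] = counts.get(chromosome, 0) + 1
        result.append((name, chromosome, counts[chromosome], position))
    return result


example = [
    ("D17S949", 17, 93.3),
    ("D17S944", 17, 82.6),
    ("D17S1807", 17, 99.2),
    ("D17S929", 17, 99.2),
]
for record in assign_ordinals(example):
    print(*record)
```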

Tables Decomposed and Composed Tables

In traditional metal typography, metal casts of individual letters are "composed" into rows of type for printing on a printing press. In Madeline, we have borrowed the idea of "composing" text to refer to the operation of rearranging a table in which each row contains the alleles for one marker measured on one individual:

FAMID STUDYID MARKERNAME ALLELE1 ALLELE2
----- ------- ---------- ------- -------
P0001 I00001  D12S1304     121     127
P0001 I00001  D12S127       98     104
P0001 I00001  D12S341      134     142
P0001 I00002  D12S1304     117     119
P0001 I00002  D12S127      102     102
P0001 I00002  D12S341      136     138
.     .       .            .       .

Excerpt from a decomposed table. One row contains the alleles for one marker measured on one individual.

... into a table in which each row contains the genotypes for all of the markers measured on one individual ...

FAMID STUDYID  D12S1304  D12S127  D12S341
----- -------  --------  -------  -------
P0001 I00001    121/127   98/104  134/142
P0001 I00002    117/119  102/102  136/138
  . . .

Excerpt of the same data after composition. One row contains the genotypes for all markers measured on one individual.

A marker table contains the alleles for a specific marker measured on a specific individual. Output from the ABI Genotyper software is in this table format. This type of table has three key fields:

  1. FamilyIDField -- family of the individual
  2. IndividualIDField -- ID of the individual
  3. MarkerField -- name of the marker

There are only two essential data fields in a marker table:

  1. Allele1Field -- positive integer label assigned to first allele
  2. Allele2Field -- positive integer label assigned to second allele

Madeline provides support for integrating the information in a marker table into a pedigree table via the compose and merge commands. The compose command takes care of converting the paired allele fields into the single genotype fields expected in a pedigree table. The merge command allows you to integrate family structure, phenotype, or genotype data from separate tables into a combined table for use in Madeline.
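
The rearrangement performed by composition can be sketched in a few lines of Python. This is a conceptual illustration, not the implementation of the compose command: the function name compose is reused only as a label, and writing "." for a marker that was not measured on an individual is an assumption of the sketch.

```python
def compose(rows):
    """rows: (famid, studyid, marker, allele1, allele2) tuples, one per
    marker measured on one individual (the decomposed layout).

    Returns a header and one row per individual (the composed layout).
    """
    markers = []   # marker names in order of first appearance
    people = {}    # (famid, studyid) -> {marker: "a1/a2"}
    for famid, studyid, marker, a1, a2 in rows:
        if marker not in markers:
            markers.append(marker)
        people.setdefault((famid, studyid), {})[marker] = f"{a1}/{a2}"
    header = ["FAMID", "STUDYID"] + markers
    table = [
        [famid, studyid] + [genotypes.get(m, ".") for m in markers]
        for (famid, studyid), genotypes in people.items()
    ]
    return header, table


rows = [
    ("P0001", "I00001", "D12S1304", 121, 127),
    ("P0001", "I00001", "D12S127", 98, 104),
    ("P0001", "I00001", "D12S341", 134, 142),
    ("P0001", "I00002", "D12S1304", 117, 119),
    ("P0001", "I00002", "D12S127", 102, 102),
    ("P0001", "I00002", "D12S341", 136, 138),
]
header, table = compose(rows)
print(header)
for line in table:
    print(line)
```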

Tables Analysis Results Tables

Output files containing logarithm of odds (LOD) results from certain analysis programs such as Simwalk2 can be converted by Madeline directly into a table format which the program can then use for graphing the results. Output from other analysis programs may require a small amount of manual formatting which is usually not difficult to do.

The convert command may be used to produce a table in the right format:

M>convert simwalk file 'sw_chr10.out' to 'chr10.results'
Converting input file "sw_chr10.out" to Madeline-formatted output files ...
        ...
M>

Converting an analysis result file directly into a Madeline table format.

An analysis results table must contain at least a POSITION and a SCORE column. Other columns may also be present. Here is a table of results from a non-parametric Simwalk2 analysis already formatted for use by Madeline. For graphing this data in Madeline, you would have to specify which of the score columns to use by setting Madeline's GraphScoreField to one of the "STAT" columns:

POSITION
STAT_A
STAT_B
STAT_C
STAT_D
STAT_E

 0.0000  0.073   0.046   0.072   0.069   0.077
 4.1691  0.182   0.076   0.180   0.167   0.181
 7.9053  0.163   0.076   0.162   0.156   0.183
10.0506  0.298   0.156   0.311   0.300   0.325
18.0591  0.578   0.409   0.614   0.633   0.665
21.3661  0.633   0.527   0.675   0.726   0.779
36.2864  1.526   0.690   1.478   1.445   1.359
39.7003  1.222   0.529   1.201   1.150   1.056

An analysis results table from Simwalk2 formatted for graphing in Madeline.

For more information on how to graph tables of LOD results, see the graph command.

Supported Data Types

Madeline's database engine supports character string, numeric (floating-point and integer), and date data types.

A logical data type (such as the "L" field type of xbase) or boolean data type is not distinguished from the numeric data type. Use appropriately coded numeric columns for boolean attributes (for example, 1=true, 0=false). Other derived types, such as date-time or monetary types are not supported.

Data Character Data

Character data are read from tables by trimming leading and trailing space characters. Thus, blank entries in a database appear as the empty string, "", which Madeline interprets as a missing value indicator. When entered on the command line, literal character data must be delimited by a pair of matching single or double quotes, e.g., "0001-230" or '0980A'.

Madeline's interpretation of character data is affected by the values stored in the CharacterMissingValue (aliased as CMV) array. Each value stored in this array is interpreted as an additional value to treat as a missing value indicator. The CharacterMissingValue array contains a set of default values that are appropriate for most data sets. Users can reassign current values, or assign additional values as needed.

Users planning on using Linkage files with Madeline in particular should note that the character string "0" (zero, used to represent missing parent IDs in Linkage files) is not considered a missing value by default in Madeline. Recall that Madeline treats individual and pedigree identifiers as character strings.

Data Numeric Data

All numeric data types are converted to double-precision floating point numbers. Literal numeric values are entered on the command line without delimiters. Interpretation of numeric data is affected by the values stored in the NumericMissingValue (aliased as NMV) array. Each value stored in this array is interpreted as an additional value to treat as a missing value indicator. Users should verify whether Madeline's default numeric missing values are appropriate for their data and reassign or add values to this array as necessary.

Data Logical or Boolean Data

Madeline does not recognize a logical or boolean data type separate from the numeric data type. In contexts where a value is to be interpreted as a logical value, Madeline treats zero (0) as #false, and any non-zero, non-missing value as #true. True/false data should thus be coded using a numeric field type with values of 0, 1, and a numeric missing value indicator if required.

Data Date Data

Dates are converted internally into Julian day integers. When entered at the command line, dates must be delimited between curly braces, { }.

Dates in tables or entered on the command line should be in ISO-8601 format (a four-digit year followed by a two-digit month and finally a two-digit day). In tables and on the command line, Madeline permits use of any of the following date delimiters:

YYYY [ . | - | / ] MM [ . | - | / ] DD

In addition, on the command line you can also use single spaces as delimiters:

M>? {2003 05 17}
{Saturday, May 17, 2003}
M>? ( {2003 05 17} - {1965 07 28} ) / 365.2425
37.8023
M>
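
The arithmetic in the session above can be checked independently. The following Python sketch uses the standard datetime module; the helper parse_iso_date is hypothetical and simply accepts the delimiters described above (period, hyphen, forward slash, or a single space).

```python
import re
from datetime import date


def parse_iso_date(text):
    """Parse YYYY MM DD written with '.', '-', '/', or one space as delimiter."""
    match = re.fullmatch(r"(\d{4})[.\-/ ](\d{2})[.\-/ ](\d{2})", text.strip())
    if match is None:
        raise ValueError(f"not an ISO-style date: {text!r}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


# Subtracting two dates gives a number of days, as in Madeline.
days = (parse_iso_date("2003 05 17") - parse_iso_date("1965 07 28")).days
print(days)                       # 13807
print(round(days / 365.2425, 4))  # 37.8023
```

Dividing by 365.2425, the mean length of the Gregorian year in days, converts the difference to years.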

Note: Previous versions of Madeline supported entry of dates in non-ISO formats, such as "{December 11, 1963}". However, the program only supported a few locales (American English, British English, etc.) and we felt that locale-specific date entry conventions could lead to confusion or even errors for international collaborations. The program continues to support the display of dates in numerous locales but date entry has been standardized to the ISO format. For example:

M> set language to Japanese
 ...
M>? {2003 05 17}
{2003年5月17日 (土曜日)}

Dates must be entered in ISO YYYY MM DD format, but can be displayed using specific locale conventions.

Pope Gregory XIII instituted the calendar that is now used internationally in October of 1582. However, only the Catholic countries of Europe adopted the Gregorian calendar immediately; other countries adopted it much later. For example, neither England nor the American colonies adopted it until the middle of the 18th century, and countries such as Thailand did not adopt the international calendar until the late 19th century. Madeline reports all dates since October of 1582 using the Gregorian calendar. To make Easter fall at the right time of the year once again, ten days were skipped in October of 1582. Madeline handles this correctly:

M>?{1582.10.04}
{Thursday, October 4, 1582}
M>?{1582.10.04} + 1
{Friday, October 15, 1582}
M>

Ten days were skipped in October of 1582 by Pope Gregory XIII.

Dates prior to October of 1582 are reported using a proleptic calendar that projects the Gregorian-Julian calendar back in time.
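
The ten-day skip follows naturally when dates are stored as Julian day integers. The Python sketch below uses a widely published integer formula for Julian day numbers; it is not taken from Madeline's source code, the names to_julian_day and weekday_name are hypothetical, and the switch to Gregorian rules at 15 October 1582 is an assumption based on the example above.

```python
def to_julian_day(year, month, day):
    """Return the Julian day number for a calendar date.

    Dates before 15 October 1582 are read in the Julian calendar and
    later dates in the Gregorian calendar. The ten skipped dates
    (5 to 14 October 1582) are not rejected by this sketch.
    """
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    base = day + (153 * m + 2) // 5 + 365 * y + y // 4
    if (year, month, day) >= (1582, 10, 15):
        return base - y // 100 + y // 400 - 32045  # Gregorian rules
    return base - 32083                            # Julian rules


WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_name(julian_day):
    """Julian day numbers divisible by 7 fall on a Monday."""
    return WEEKDAYS[julian_day % 7]


last_julian = to_julian_day(1582, 10, 4)
first_gregorian = to_julian_day(1582, 10, 15)
print(weekday_name(last_julian), last_julian)
print(weekday_name(first_gregorian), first_gregorian)
```

Because the two dates map to consecutive integers, adding one day to 4 October 1582 lands on 15 October 1582.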

Data Display of Dates

By default, Madeline displays dates based on your computer's locale setting. If your computer is set to the "C" or "POSIX" locale or any other non-UTF-8 locale, Madeline defaults to AmericanEnglish conventions for displaying dates.

When you run Madeline under a UTF-8 locale, Madeline displays dates in the selected locale if possible. The program evaluates the following environment variables in order of precedence: LC_ALL, LC_DATE, LC_CTYPE, and LANG to determine how to display dates. Note that Madeline needs to be run in a Unicode-enabled terminal emulator, such as mlterm, for proper display of many languages.

You can also change the conventions used to display dates interactively using the set language command.

Examples are shown below:

edtrager@eyegene:~>LANG=fr_FR.UTF-8 madeline
...
+-----------------------+-----------+-----------------------------------------+
| OTHER SETTINGS        |           |                                         |
+-----------------------+-----------+-----------------------------------------+
| AutoExclude           | ON        |Exclude pedigrees automatically          |
| AutoCheckInheritance  | ON        |Check inheritance on OPEN                |
| ConsoleHighlights     | ON        |Use bold/color highlights on console     |
| Delimiter             | TAB       |Delimiter for tables and other output.   |
| FusionSupport         | OFF       |FUSION customizations disabled           |
| HaplotypeDisplay      | OFF       |Display genotypes delimited with "/"     |
| Language              | French    |Language convention used for date, time  |
| MapDetails            | OFF       |LIST MAP summary display                 |
| SaveAlleleFrequencies | OFF       |Calculate new frequencies on next OPEN   |
| Date                  |           |le lundi 22 décembre 2003                |
| Verbosity             | VERBOSE   |All messages are printed to the console  |
+-----------------------+-----------+-----------------------------------------+
M>? {1983.12.03}
{le samedi  3 décembre 1983}
M>set language to japanese
...
M>? {1983.12.03}
{1983年12月3日 (土曜日)}
M>

UTF-8-based locale settings read from environment variables are used for determining how to display dates. You can also change the language settings interactively using the set language command.

Data Extent of Date Support

Dates support addition and subtraction, with the difference between two dates expressed in days. Date data may be displayed on pedigree drawings. Dates may also be used in an expression passed to a view command, a draw command, a subsetting command such as exclude, or the sort command (which sets the order in which siblings appear on a pedigree drawing).

Most statistical genetics programs for which Madeline provides formatted files as output do not support date data. However, dates can be written to output files in Madeline's generic formats. You will need to toggle date fields on for output since they are toggled off by default.

Data Missing Value Support

Madeline supports entry of missing values from the command line, and also provides a simple mechanism for the user to define sets of values that should be mapped as missing values when data are read from files.

On the command line, Madeline provides the following ways to represent missing values:

Protocols in scientific studies often require that missing values be coded to specify reasons for missingness. For example, a set of negative integers outside the range of a measured phenotype may be chosen to represent missing conditions, such as assay pending, no assay, no tube, or similar conditions that result in missing data.

To accommodate such conventions, Madeline permits the user to specify lists of values that are to be treated as missing values. These lists of missing value indicators are stored in two arrays. CharacterMissingValue[] is used whenever character fields, including genotype fields, are referenced. NumericMissingValue[] is used whenever numeric fields are referenced (see table below). For expediency, these arrays can be referenced using abbreviated names, cmv[] and nmv[], respectively. There is currently no missing value array for dates.

Character and numeric missing value arrays in Madeline.

Full Name                Abbreviated Name  Default Values
-----------------------  ----------------  ----------------
CharacterMissingValue[]  cmv[]             cmv[1] = "."
                                           cmv[2] = "/"
                                           cmv[3] = "0/0"
                                           cmv[4] = "0/ 0"
                                           cmv[5] = "0/  0"
NumericMissingValue[]    nmv[]             nmv[1] = -9999

In ASCII and UTF-8 data files, a space-padded blank entry or a single dot (i.e., a period) in a character or numeric column is treated as a "native" missing value. (Therefore, the empty string "" and single dot "." need not be included in the missing value arrays). A typical Madeline-ready data file using single dots as missing value placeholders is shown below:

FAMID C  
STUDYID C 
SEX X  
FATHER C  
MOTHER C  
MZTWIN C  
DZTWIN C  
AFFECTED C
AGE_DX N
D9S247 G 
D9S325 G 
D9S462 G 
D9S1017 G 
D9S1321 G 

L0012  M02448   F N00332   N00333  . . A  68  0/0      349/349  0/0      .        0/0     
L0012  M05605   F N00334   N00335  . . A  63  244/252  349/353  157/167  240/244  234/238 
L0012  M06039   F N00334   N00335  . . A  68  252/254  0/353    157/157  228/240  234/238 
L0012  N00332   M .        .       . . I  .   .        .        .        .        .       
L0012  N00333   F N00336   N00337  . . I  .   .        .        .        .        .       
L0012  N00334   M N00336   N00337  . . I  .   .        .        .        .        .       
L0012  N00335   F .        .       . . I  .   .        .        .        .        .       
L0012  N00336   M .        .       . . I  .   .        .        .        .        .       
L0012  N00337   F .        .       . . I  .   .        .        .        .        .       
L0034  M02453   M N00167   M05758  . . A  48  242/244  0/0      .        232/244  234/242 
L0034  M05758   F N00165   N00166  . . U  89  242/248  0/0      .        232/0    242/246 
L0034  M05759   M N00167   M05758  . . A  45  0/256    0/0      .        232/240  238/242 
L0034  M05856   M N00167   M05758  . . U  53  242/256  0/0      .        232/232  238/246 
L0034  M05876   F N00167   M05758  . . U  59  244/248  0/0      .        240/244  234/242 
L0034  N00165   M .        .       . . I  .   .        .        .        .        .       
L0034  N00166   F .        .       . . I  .   .        .        .        .        .       
L0034  N00167   M .        .       . . I  .   .        .        .        .        .       
L0075  M02454   F N00207   N00208  . . A  61  252/254  355/357  157/169  .        234/236 
L0075  M05526   F N00205   N00206  . . A  83  0/0      339/359  0/0      .        0/0     
...

A typical Madeline-ready data file with single dots (periods) as missing value placeholders. Madeline automatically recognizes single dots and blank entries as missing values in ASCII and UTF-8 data files.

Note: To increase human and machine readability, we highly recommend using dots (periods) as missing-value column placeholders in data files, as shown in the example above.

When data are read from a file, all "native" missing values (blank and single dot entries) and any values that match the values specified in Madeline's CharacterMissingValue[] or NumericMissingValue[] arrays are treated as missing values by Madeline. When data are written back out to files, missing values are automatically translated according to the conventions required by each file format. For example, when Madeline is used to create a file in the Linkage format, the digit zero (0) is used to represent missing values.
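
The read-time rule can be summarized in a short Python sketch. It is a simplified model, not Madeline's code: the names CMV, NMV, is_missing_character, and is_missing_numeric are hypothetical, and the lists hold the documented default values.

```python
CMV = [".", "/", "0/0", "0/ 0", "0/  0"]  # documented character defaults
NMV = [-9999.0]                           # documented numeric default


def is_missing_character(raw, cmv=CMV):
    """Blank and single-dot entries are native missing values;
    anything listed in cmv is missing as well."""
    value = raw.strip(" ")
    return value == "" or value == "." or value in cmv


def is_missing_numeric(raw, nmv=NMV):
    """Same rule for numeric columns, compared after conversion to float."""
    value = raw.strip(" ")
    if value == "" or value == ".":
        return True
    return float(value) in nmv


print(is_missing_character("0"))                # False with the defaults
print(is_missing_character("0", CMV + ["0"]))   # True once "0" is added
print(is_missing_numeric("-9999"))              # True
print(is_missing_numeric("68"))                 # False
```

The first two calls show why a Linkage file, which uses "0" for missing parents, needs "0" added to the character list before it is opened.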

At startup, CharacterMissingValue[] contains a set of default missing value indicators appropriate for most character and genotype data. NumericMissingValue[] contains the single missing value of -9999 by default. It is the user's responsibility to determine whether these defaults are appropriate for the data and to make adjustments as required before attempting to open or load data files.

New values can be assigned to existing cells or appended to the end of these lists as required:

M>list cmv            view CharacterMissingValue array
CMV has 5 elements:
CMV[ 1]="."
CMV[ 2]="/"
CMV[ 3]="0/0"
CMV[ 4]="0/ 0"
CMV[ 5]="0/  0"
M>cmv[6]="./."        append new value to the end of the list
M>list cmv
CMV has 6 elements:
CMV[ 1]="."
CMV[ 2]="/"
CMV[ 3]="0/0"
CMV[ 4]="0/ 0"
CMV[ 5]="0/  0"
CMV[ 6]="./."
M>list nmv            view NumericMissingValue array
NMV has 1 element:
NMV[ 1]=         -9999
M>nmv[1]=-1           overwrite one value
M>nmv[2]=-9           and append another value
M>list nmv
NMV has 2 elements:
NMV[ 1]=            -1
NMV[ 2]=            -9
M>

Assigning missing value indicators. Missing value indicators may be assigned to existing cells or appended to the ends of Madeline's character and numeric missing value lists.

Assignments should be done before a data table is opened so that the values will be recognized appropriately. The initial.script script file is an appropriate place to set character and numeric missing value indicator defaults.

Data Categorization of Data

Upon opening a pedigree table, Madeline categorizes each field into one of three categories: core ("C"), genotype ("G"), or phenotype ("P").

When a field is completely empty or contains only missing values, Madeline assigns the field to a null category represented by an asterisk, "*".

When required, Madeline allows the user to designate a subset of "P" phenotype fields as "V" covariate fields using the toggle command. Madeline does not automatically assign fields to the "V" covariate category. Field categories are summarized in the table below and described in greater depth in the sections that follow.

Field Categories in Madeline.

Category   Symbolic Designation  Description
---------  --------------------  -----------
Core       C                     Set of five required fields like GenderField that must be present in all pedigree tables, plus additional optional fields, like AffectionStatusField, that are not required by default but may be required for some operations.
Genotype   G                     Character fields containing two numeric labels separated by a forward slash character representing allele calls, e.g., "141/142"
Phenotype  P                     Character, numeric, or date fields that contain categorical or continuous phenotype information.
Covariate  V                     A subset of phenotype fields that are to be used as covariates. The user must use the toggle command to change the designation of a "P" field to "V".
Null       *                     Character, numeric, or date fields that are completely empty or contain only missing value indicators. These fields cannot be operated upon.
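
The categorization rules can be modelled with a short Python sketch. It is an approximation for illustration, not Madeline's algorithm: the function name classify_field and the sets CORE_NAMES and MISSING are hypothetical, and checking the field name before checking for an empty column is an assumption.

```python
import re

# Two numeric allele labels separated by a forward slash, e.g. "141/142".
GENOTYPE = re.compile(r"\s*\d+\s*/\s*\d+\s*")
# Default names of some core fields, as listed in this manual.
CORE_NAMES = {"FAMID", "STUDYID", "FATHER", "MOTHER", "SEX", "AFFECTED"}
# Native missing values plus the default character missing values.
MISSING = {"", ".", "/", "0/0", "0/ 0", "0/  0"}


def classify_field(name, values, core_names=CORE_NAMES):
    """Return "C", "G", "P", or "*" for one column of a pedigree table."""
    if name in core_names:
        return "C"   # core fields are recognized by name
    present = [v for v in values if v.strip() not in MISSING]
    if not present:
        return "*"   # empty, or only missing value indicators
    if all(GENOTYPE.fullmatch(v) for v in present):
        return "G"   # genotype fields are recognized by scanning the data
    return "P"       # everything else is a phenotype field


print(classify_field("SEX", ["F", "M"]))
print(classify_field("D9S247", ["0/0", "244/252", "."]))
print(classify_field("AGE_DX", ["68", "63", "."]))
print(classify_field("EMPTY", [".", ""]))
```

Covariate "V" is deliberately absent from the sketch, because that designation is only ever assigned by the user.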

Data Core Data Fields

Core "C" data fields provide key information about an individual (see table below). Madeline identifies core fields by their names (in contrast, "G" and "P" fields are distinguished by scanning the data in the table). These names are stored in variables whose values may be reassigned by the user.

In conformance with the requirements of the supported legacy database types, field names must be capitalized, and cannot exceed 10 letters in length. When assigning names to the field variables, Madeline will automatically capitalize and truncate non-conforming names, and will issue warning messages to the user.

Note: Limitations on field name length will likely be relaxed in the next release of the program when support for SAS and dBase/xBase file formats is removed.

Core data fields are either required or optional. The absence of one or more of the five required core fields will generate an error when a pedigree table is opened.

Optional core fields may be required for some operations, but are not required by default. Madeline makes use of the additional information provided in optional core fields whenever they are present. In particular, Madeline's pedigree drawing functionality is greatly enhanced by the presence of optional core fields such as AffectionStatusField and MZTwinField, among others.

Core fields representing categorical attributes of an individual, such as the GenderField and AffectionStatusField, have corresponding associative arrays for mapping user data codes to Madeline internal codes.

Core Data Fields in Madeline.

  # Variable Name         Description                        Default Value  Allowed Field Types   Associative Array
--- --------------------- ---------------------------------- -------------- --------------------- -----------------
Required Core Fields
 1. IndividualIdField     Individual identifier              "STUDYID"      Character only        n/a
 2. FatherIdField         Father's identifier                "FATHER"       Character only        n/a
 3. MotherIdField         Mother's identifier                "MOTHER"       Character only        n/a
 4. GenderField           Gender                             "SEX"          Character or Numeric  GenderStatus[]
 5. FamilyIdField         Family identifier                  "FAMID"        Character only        n/a
Optional Core Fields
 6. AffectionStatusField  Affection status                   "AFFECTED"     Character or Numeric  AffectionStatus[]
 7. DeathStatusField      Death status                       "DECEASED"     Character or Numeric  DeathStatus[]
 8. ProbandField          Index case or proband indicator    "PROBAND"      Character or Numeric  ProbandStatus[]
 9. LiabilityClassField   Liability class                    "LCLASS"       Numeric or Character  LiabilityClass[]
10. MZTwinField           Monozygotic twin status indicator  "TWIN"         Character only        n/a
11. DZTwinField           Dizygotic twin status indicator    "DZTWIN"       Character only        n/a
12. DateOfBirthField      Date of birth                      "DOB"          Date only             n/a
13. DateOfDeathField      Date of death                      "DOD"          Date only             n/a

Data Interpretation of Core Data

Madeline provides a simple mechanism for translating the coded information stored in your data files into values that the program knows about and can process.

Coded values in the GenderField, AffectionStatusField, DeathStatusField, ProbandField, and LiabilityClassField are mapped to Madeline constants using the set of associative arrays shown in the table in the preceding section. Madeline provides default mappings that are appropriate for reading many character-coded and Linkage-coded data tables. For example, the GenderStatus[] array contains the following values by default:

M>list GenderStatus
GenderStatus has 6 elements:
GENDERSTATUS[ 1 ]=0  zero is defined as male in Madeline
GENDERSTATUS[ 2 ]=1 one is defined as female in Madeline
GENDERSTATUS["F"]=1
GENDERSTATUS["M"]=0
GENDERSTATUS["♀"]=1 For animal studies; requires UTF-8 data files in a UTF-8 locale.
GENDERSTATUS["♂"]=0 For animal studies; requires UTF-8 data files in a UTF-8 locale.

These defaults are equivalent to issuing the following sequence of map commands. Note that Madeline constants are prefixed by the hash sign (#):

M>map GenderStatus  1  as #male
M>map GenderStatus  2  as #female
M>map GenderStatus "F" as #female
M>map GenderStatus "M" as #male
M>map GenderStatus "♀" as #female
M>map GenderStatus "♂" as #male

Note how the mapping of 1 as #male and 2 as #female is appropriate for reading data coded according to Linkage file format conventions. The mappings of "♀" (Unicode U+2640) and "♂" (Unicode U+2642) are appropriate for animal studies.

Assignments can be made to the associative arrays directly without using the map command. The following assignment statements replicate the default mappings for the AffectionStatus[] array. The first three assignments allow Madeline to process files coded using the Linkage format conventions. The remaining three assignments support the processing of a substantially more intuitive coding convention that we prefer:

M>// Assignments to support the Linkage/Genehunter format:
M>AffectionStatus[ 0 ]=#missing
M>AffectionStatus[ 1 ]=#unaffected
M>AffectionStatus[ 2 ]=#affected
M>// Assignments to support a substantially more intuitive coding convention:
M>AffectionStatus["A"]=#affected
M>AffectionStatus["I"]=#missing
M>AffectionStatus["U"]=#unaffected

The mappings shown above for GenderStatus[] and AffectionStatus[] are the default mappings present when you start a Madeline session. Codes not present in these associative arrays will be mapped to #missing by default. If your codes match these codes, then you don't need to do anything. If your codes differ from the defaults, then you will need to provide the correct mappings in the initial.script, in a batch file, or on the command line. For example, if your pedigree table used the capitalized words "MALE" and "FEMALE" to indicate males and females respectively, then you would want to execute the following:

M>map GenderStatus "FEMALE" as #female
M>map GenderStatus "MALE" as #male

For the default values in all associative arrays, see Table 4.4.
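
Conceptually, each associative array behaves like a lookup table with #missing as the fallback. The Python sketch below models this with a dictionary; it is not Madeline's implementation, the names gender_status and map_gender are hypothetical, and treating every code as a string is a simplification.

```python
MALE, FEMALE, MISSING = "#male", "#female", "#missing"

# Default GenderStatus[] mappings described above, keyed by the code as
# it would appear in a data file.
gender_status = {
    "1": MALE, "2": FEMALE,
    "M": MALE, "F": FEMALE,
    "♂": MALE, "♀": FEMALE,
}


def map_gender(code, table=gender_status):
    """Translate a user code; codes not in the table map to #missing."""
    return table.get(code.strip(), MISSING)


before = map_gender("FEMALE")   # not mapped yet, so "#missing"

# Equivalent in spirit to: map GenderStatus "FEMALE" as #female
gender_status["FEMALE"] = FEMALE
gender_status["MALE"] = MALE

print(before, map_gender("FEMALE"), map_gender("MALE"))
```

The value captured in before shows what happens to unmapped codes, which is why such mappings need to be in place before a table is opened.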

Note: To ensure that Madeline recognizes values in core fields correctly, assignment of values in associative arrays that affect the interpretation of core fields must be made before any open or load command.

Data Database Field Naming Conventions

Different databases impose different restrictions on the length and format of field names. Up to 10 characters can be used for field names in an xbase file, but only up to 8 characters in a SAS transport file. Although Madeline now supports several different file formats, the program originally only supported the xbase file format. As a result of this legacy, Madeline restricts field name identifiers as follows:

Field names are restricted to capitalized labels of 10 or fewer characters in length.

Here is an example:

M>AffectionStatusField="AffectionStatus"
Field name assignment has been truncated and capitalized to "AFFECTIONS".
M>

Note: Madeline will warn you if you try to assign a name with embedded spaces or control characters to a field name variable. However, the program does not actively check for all possible errors in field identifiers. This is the user's responsibility. Madeline also has no way of knowing in advance what type of database file will be opened. For example, the program will not notice if you enter a ten-letter name for use with a SAS transport file that permits only 8-letter field identifiers.
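The truncation and capitalization rule can be sketched in Python as follows. The function name `normalize_field_name` and the wording of the warnings are hypothetical; the sketch only mirrors the rule stated above and makes no claim about how Madeline implements it.

```python
MAX_XBASE_LEN = 10  # xbase limit stated in the text; SAS transport allows only 8

def normalize_field_name(name, max_len=MAX_XBASE_LEN):
    """Return (normalized_name, warnings) for a proposed field name.

    The name is truncated to max_len characters and capitalized.
    A warning is recorded for embedded spaces or control characters.
    """
    warnings = []
    if any(ch.isspace() or ord(ch) < 32 for ch in name):
        warnings.append("name contains spaces or control characters")
    normalized = name[:max_len].upper()
    if normalized != name:
        warnings.append('name changed to "%s"' % normalized)
    return normalized, warnings

print(normalize_field_name("AffectionStatus"))
```

Passing `max_len=8` shows what a SAS transport file would require, a check that, as noted above, Madeline itself does not perform.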

Note: Support for legacy xbase and SAS transport formats may be removed in the next version of the program. Field name limitations would then become less restrictive.

Data Family Identifier

The value in FamilyIDField tells Madeline the name of the family ID field to look for in a pedigree table. The default value is "FAMID".

Data Individual and Parental Identifiers

The values in IndividualIDField, FatherIDField, and MotherIDField identify the individual and parent identifier fields for Madeline to look for in a pedigree table. The default values are "STUDYID", "FATHER", and "MOTHER", respectively.

Note: We recognize that the default value of IndividualIDField as STUDYID is not a good choice. The default will very likely become INDIVIDUALID in the next version of the program.

Parent IDs should be present in both the FatherIDField and MotherIDField of all non-founder individuals. The program interprets any individual with missing value indicators for both parents as a founder.

In the event that one of the two parent identifiers is missing for one or more individuals in a sibship, Madeline automatically generates a random eight-letter identifier to represent the missing parent. The randomly-generated IDs begin and end with exclamation marks to distinguish them from regular IDs. Using the generated ID, Madeline constructs a virtual parent in memory who will appear on pedigree drawings (figure below) and in output generated by the write command. Madeline assumes that all sibs with the one identified parent are full sibs who share both the identified parent and the assumed parent.

Virtual constructed parent in Madeline. A virtual parent with a randomly-generated ID (male on the right) is constructed when the ID of one parent is missing among a sibship of individuals (not shown). Sibs are assumed to be full sibs.
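The founder rule and the construction of virtual parents can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the function names, the use of uppercase letters, and the reading of "eight-letter" as eight letters between the two exclamation marks. It is not Madeline's code.

```python
import random
import string

def virtual_parent_id():
    """Eight random letters wrapped in exclamation marks (exact format assumed)."""
    letters = "".join(random.choice(string.ascii_uppercase) for _ in range(8))
    return "!" + letters + "!"

def resolve_parents(father, mother, virtual):
    """Return (father, mother, is_founder). None marks a missing parent ID.

    `virtual` caches generated IDs by the known parent, so that all sibs
    sharing that parent also share one virtual parent (the full-sib assumption).
    """
    if father is None and mother is None:
        return None, None, True           # both parents missing: a founder
    if father is None:
        key = ("father", mother)
        if key not in virtual:
            virtual[key] = virtual_parent_id()
        father = virtual[key]
    elif mother is None:
        key = ("mother", father)
        if key not in virtual:
            virtual[key] = virtual_parent_id()
        mother = virtual[key]
    return father, mother, False

virtual = {}
print(resolve_parents(None, "M1", virtual))   # virtual father is generated
print(resolve_parents(None, "M1", virtual))   # a sib gets the same virtual father
print(resolve_parents(None, None, virtual))   # founder
```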

Note: Lack of one parent usually indicates that a data set has not yet been thoroughly examined for errors or missing data. Unlike other programs, Madeline tolerates certain types of missingness and errors. This enhances the program's utility as a proofing tool. However, in the end you still have to fix your errors ;-).

Data Gender Data

The default value for GenderField is "SEX". The GenderField can be either numeric or character. Madeline detects the field type when the pedigree table is opened. Madeline defines two symbolic constants for gender, #male and #female.

The GenderStatus[] array is used to map external gender codes to Madeline's internal gender constants, #male and #female, as described above under Interpretation of Core Data.

Only terminal individuals without offspring may retain a gender attribute of #missing. During pedigree reconstruction, if Madeline detects any father or mother with a missing gender attribute, the program will automatically change the gender of the individual in memory to be consistent with the reconstruction, and will warn the user of the change (example below). The database file on disk will not be changed.

Madeline will also automatically correct the gender attribute of mislabeled individuals in memory, for example, of a male listed as a mother, or of a female listed as a father (example below), to the extent that these changes still result in logical consistency. Madeline always warns the user of these types of data errors. Again, the data file on disk will not be changed; that is the user's responsibility.

M>open "family.mfh"
   ...
   
ConnectIndividual(): Gender in database is incomplete:
        Gender of G-10-162's mother, G-10-159, changed from MISSING to FEMALE
ConnectIndividual(): Gender in database is incorrect:
        Gender of G-15-012's father, G-15-003, changed from FEMALE to MALE
13 WARNINGS, 11 SEVERE WARNINGS
M>

Inconsistencies in Gender. During pedigree reconstruction, Madeline automatically corrects inconsistencies in gender in the data set (as long as such changes do not violate the logical consistency of the reconstruction) and warns the user.

Madeline will warn the user and terminate if conflicting and unresolvable gender roles exist for an individual (for example if an individual is listed as both a mother and a father in the data set).
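The gender reconciliation described above can be sketched in Python. This is a simplified model under stated assumptions: the record layout and the `reconcile_genders` function are hypothetical, and the sketch treats "listed as both a mother and a father" as the only unresolvable conflict. The two parent IDs come from the example session above; the genders given to the children are made-up sample data.

```python
def reconcile_genders(records):
    """Make genders in memory consistent with parental roles.

    records maps an ID to {"gender": ..., "father": ID or None, "mother": ID or None}.
    Returns (genders, warnings); raises ValueError on an unresolvable conflict.
    The input records are left unchanged, just as the file on disk is.
    """
    fathers = {r["father"] for r in records.values() if r["father"]}
    mothers = {r["mother"] for r in records.values() if r["mother"]}
    conflicts = fathers & mothers
    if conflicts:
        raise ValueError("listed as both father and mother: %s" % sorted(conflicts))

    genders = {pid: r["gender"] for pid, r in records.items()}
    warnings = []
    for role_ids, expected in ((fathers, "#male"), (mothers, "#female")):
        for pid in sorted(role_ids):
            current = genders.get(pid, "#missing")
            if current != expected:
                warnings.append("Gender of %s changed from %s to %s"
                                % (pid, current, expected))
                genders[pid] = expected
    return genders, warnings

records = {
    "G-10-159": {"gender": "#missing", "father": None, "mother": None},
    "G-10-162": {"gender": "#female", "father": None, "mother": "G-10-159"},
    "G-15-003": {"gender": "#female", "father": None, "mother": None},
    "G-15-012": {"gender": "#male", "father": "G-15-003", "mother": None},
}
genders, warnings = reconcile_genders(records)
for line in warnings:
    print(line)
```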

Note: We recommend coding the GenderField as a character field using conventional codes such as "M" and "F", or "male" and "female". This not only enhances human interpretability of the raw data files, but also enhances Madeline's ability to automatically identify columns when the recognize command is used. Numerically-coded fields, such as those in Linkage/Genehunter files, generally cause unnecessary confusion and introduce a greater potential for errors.

Data Monozygotic and Dizygotic Twin Data

The MZTwinField should remain blank (or use a single dot) for non-twins, and should contain a single-letter identifier for each twin pair or group of monozygotic siblings. For example, "A" can be used to designate the first twin pair in a family, "B" the second pair, and so on. Since version 0.90 of the program, the MZTwinField has been considered an optional core field.

The optional DZTwinField, used to show dizygotic twins on pedigree drawings, should be coded in the same manner to designate dizygotic twins.

Data Affection Status Data

The AffectionStatusField may be either character or numeric. Madeline defines two symbolic constants for describing the affection status of sampled individuals, #affected and #unaffected.

Madeline provides the AffectionStatus[] associative array for mapping affection status codes.

Note: Coding the AffectionStatusField as a character field using mnemonic codes is recommended to enhance interpretability of the data in the absence of additional metadata. Numeric fields tend to cause confusion and may increase the potential for human error.

Data Death Status Field

The optional DeathStatusField may be either character or numeric. The default value of DeathStatusField is "DECEASED". Madeline defines the constants #alive, with a value of 0, and #dead, with a value of 1. The DeathStatus[] associative array contains a set of defaults for mapping the DeathStatusField. This is shown below:

M>
M>?DeathStatusField
"DECEASED"
M>?#alive
0
M>?#dead
1
M>list DeathStatus
DeathStatus has 4 elements:
DeathStatus[0]=0
DeathStatus[1]=1
DeathStatus["N"]=0
DeathStatus["Y"]=1
M>

The DeathStatusField, DeathStatus array, and #alive and #dead constants.

Note: Coding the DeathStatusField as a character field using mnemonic codes is recommended to enhance interpretability of the data in the absence of additional metadata. Consider how coding a column using "L" for "living" and "D" for "deceased" is more meaningful than using "1" for "living" and "0" for "deceased" -- especially when you realize that Madeline's default encoding is exactly the opposite, with "1" being "deceased" and "0" being "living"!

Data Proband Field

The optional ProbandField must be numeric. Madeline assumes that the probands or index cases will be coded using a value of 1, and all other individuals with a value of 0.

Data Liability Class Field

Some output formats, such as Genehunter, have the option of including liability class information. The LiabilityClassField may be numeric or character. Madeline does not interpret the values in this field, but simply passes the values on directly. This means that if a program like Genehunter requires a numeric encoding of liability classes, you must ensure that the source data are encoded numerically in a conformant manner.

Data Date of Birth and Death Data

The DateOfBirthField and DateOfDeathField are optional core date fields. When present, Madeline performs checks to ensure that dates in these fields are reasonable, and looks for twins based on date of birth who have not been designated as such in the MZTwinField or DZTwinField.
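A date-based search for undesignated twins might look like the following Python sketch. The exact checks Madeline performs are not specified here, so both functions are assumptions: `dates_plausible` tests only that death does not precede birth, and `undesignated_twin_candidates` groups sibs by shared parents and identical date of birth.

```python
from collections import defaultdict
from datetime import date

def dates_plausible(dob, dod):
    """One assumed reasonableness check: death may not precede birth."""
    return dob is None or dod is None or dod >= dob

def undesignated_twin_candidates(people):
    """Return groups of sibs with the same parents and date of birth that
    include at least one individual with no MZ or DZ twin marker.

    people is a list of dicts with keys: id, father, mother, dob, mz, dz.
    """
    groups = defaultdict(list)
    for p in people:
        if p["dob"] is not None and p["father"] and p["mother"]:
            groups[(p["father"], p["mother"], p["dob"])].append(p)
    candidates = []
    for sibs in groups.values():
        if len(sibs) > 1 and any(not s["mz"] and not s["dz"] for s in sibs):
            candidates.append(sorted(s["id"] for s in sibs))
    return candidates

people = [
    {"id": "A", "father": "F", "mother": "M", "dob": date(2000, 1, 1), "mz": None, "dz": None},
    {"id": "B", "father": "F", "mother": "M", "dob": date(2000, 1, 1), "mz": None, "dz": None},
    {"id": "C", "father": "F", "mother": "M", "dob": date(2002, 5, 9), "mz": None, "dz": None},
]
print(undesignated_twin_candidates(people))
```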

Data Genotype Data

Genotype "G" data are character fields that contain allelic marker data separated by the forward slash "/" character. The allele labels themselves must be numeric, non-alphabetic labels, e.g. "1/2" or "141/142".

The names of genotype fields should be the capitalized names of the markers themselves. This allows Madeline to automatically place the genotype fields into map order whenever a map database for the markers is loaded using the load command. Make sure that marker names in the map table are capitalized to correspond with the required capitalization of field names.

Data Estimation of Allele Frequencies from Genotype Data

When a database is opened, Madeline automatically estimates allele frequencies for all genotype fields by gene counting, ignoring family relationships. Allele frequencies are estimated from all records in a database.
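Gene counting that ignores family relationships amounts to tallying every observed allele across all records and dividing by the total. A minimal Python sketch follows, assuming genotypes are strings such as "1/2" and that any value without a "/" (for example a single dot) is treated as missing. The function name and the handling of missing values are assumptions, not Madeline's documented behavior.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Estimate allele frequencies for one marker by simple gene counting.

    Every numeric allele label in every record counts once; relationships
    between individuals are ignored.
    """
    counts = Counter()
    for genotype in genotypes:
        if not genotype or "/" not in genotype:
            continue                      # assumed missing-value handling
        for allele in genotype.split("/"):
            if allele.isdigit():          # allele labels must be numeric
                counts[allele] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}

# Alleles observed: "1" four times, "2" twice, out of six in total.
print(allele_frequencies(["1/2", "1/1", ".", "2/1"]))
```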

Allele frequencies calculated from one pedigree table may be saved out using the save command. A table of allele frequency information can subsequently be read into Madeline using the read command. The format of the allele frequencies table is nearly identical to the format used by Mendel v. 4.1. You need only modify the header at the top of an allele frequency table to conform with the Mendel program convention.

Data Phenotype Data

Phenotype "P" fields are any remaining fields that are not core "C" or genotype "G" fields. Phenotype fields may be character, numeric, or date fields, and are assumed to contain categorical or continuous phenotype information. Because date fields cannot be written to output from the write command, date fields are the only type of phenotype field not flagged for output when a pedigree table is opened.

For some types of output, it may be necessary to designate certain phenotype fields as representing covariates. Madeline therefore maintains a separate covariate or "V" field category which is a subset of the "P" category. Covariate fields are automatically recognized as phenotype fields when writing any format that does not distinguish between phenotype and covariate fields. "P" fields can be marked as "V" fields using the toggle command.

Data Marking and Ordering Data Fields for Output

When a pedigree table is opened, most core "C" fields, all genotype "G" fields, and all phenotype "P" fields (except date fields) are flagged, or toggled on, for output by default. Madeline indicates which fields in a database are toggled for output by placing the letter "o" after the category indicator "C", "G", or "P" (example below). A number after the "o" indicates the order in which fields will appear in pedigree drawings and in output from the write command. Fields may be manually reordered using the set field order command.

M>list fields
  1.FAMID      Co__1   20.D20S482    Go__6   39.D20S96     Go_25
  2.STUDYID    Co__2   21.D20S849    Go__7   40.D20S119    Go_26
  3.SEX        Co__3   22.D20S905    Go__8   41.D20S481    Go_27
  4.FATHER     Co__4   23.D20S846    Go__9   42.D20S836    Go_28
  5.MOTHER     Co__5   24.D20S892    Go_10   43.D20S888    Go_29
  6.TWIN       Co__6   25.D20S115    Go_11   44.D20S886    Go_30
  7.AFFECTED   Co__7+  26.D20S851    Go_12   45.D20S197    Go_31
  8.BMI        Po__1   27.D20S917    Go_13   46.D20S178N   Go_32
  9.INS_FAST   Po__2   28.D20S894    Go_14   47.D20S866    Go_33
 10.INS_2H     Po__3   29.D20S189    Go_15   48.D20S196    Go_34
 11.BW_REAL    Po__4   30.D20S898    Go_16   49.D20S857    Go_35
 12.GLU_FAST   Po__5   31.D20S114    Go_17   50.D20S480    Go_36
 13.GLU_2H     Po__6   32.D20S912    Go_18   51.D20S211    Go_37
 14.GAD_DUP    Po__7   33.D20S477    Go_19   52.D20S840    Go_38
 15.D20S103    Go__1   34.D20S874    Go_20   53.D20S120    Go_39
 16.D20S117    Go__2   35.D20S195    Go_21   54.D20S100    Go_40
 17.D20S906    Go__3   36.D20S909    Go_22   55.D20S102    Go_41
 18.D20S193    Go__4   37.D20S107    Go_23   56.D20S171    Go_42
 19.D20S889    Go__5   38.D20S170    Go_24   57.D20S173    Go_43
M>

Field Categorization and Ordering in Madeline. Core "C" fields are detected by name. Genotype "G" fields are detected by scanning the data; all remaining fields are assumed to be phenotype "P" fields. Fields are ordered for output separately within each of the three groups, "C", "G", and "P". The plus "+" sign after AFFECTED indicates that Madeline has detected this field as the AffectionStatusField: categorical levels of this field will be used to color icon symbols on pedigree drawings.

A field listing is shown when a pedigree table is first opened or at any other time using the list fields command.

The order of genotype fields is automatically set to map order when a marker map database is loaded using the load command. Load can be issued either before (the preferred method) or after an open command. Genotype fields whose names match the names of markers in the map database will be set to the map order.
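Matching genotype field names against a marker map and sorting them into map order can be sketched as below. The function `order_genotype_fields` is hypothetical, and placing unmatched fields after the matched ones, in their original order, is an assumption; what Madeline does with genotype fields missing from the map is not stated here. The marker names and ordinals are taken from the chromosome 7 map listing in this documentation.

```python
def order_genotype_fields(field_names, marker_map):
    """Sort genotype fields by (chromosome, ordinal) from a marker map.

    marker_map is a list of (chromosome, ordinal, marker_name) tuples.
    Marker names are capitalized before matching, because field names are
    capitalized. Fields absent from the map are appended unchanged (assumed).
    """
    rank = {name.upper(): (chrom, ordinal) for chrom, ordinal, name in marker_map}
    mapped = [f for f in field_names if f in rank]
    unmapped = [f for f in field_names if f not in rank]
    return sorted(mapped, key=lambda f: rank[f]) + unmapped

marker_map = [(7, 2, "GATA24F03"), (7, 3, "GATA61G06"), (7, 4, "TATC010")]
fields = ["TATC010", "D20S103", "GATA24F03", "GATA61G06"]
print(order_genotype_fields(fields, marker_map))
```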

Fields toggled on for output are displayed in pedigree drawings created with the draw command.

When a write command is executed, the set of core "C" fields required by the specific format being produced will generally be output regardless of the on/off output flag status. For example, Madeline will output the GenderField even if you toggle it off because it is required for almost all output formats. This behavior is required to ensure proper file construction. Genotype "Go" fields toggled for output will be written, along with phenotype "Po" (and possibly covariate "Vo") fields toggled for output if the analysis format supports phenotype fields. Some analysis programs, such as Genehunter and Siblink, do not use phenotype data beyond affection status (which is a core field).

Fields may be toggled on or off for output using the toggle command.

Data Genetic Map Data

Madeline makes use of marker map information to:

The load command is used to load a table containing genetic maps for one or more chromosomes. A genetic map table may contain only one map for each chromosome. At a minimum, the map table must have columns specifying the chromosome, rank or ordinal position of the marker within the map for a given chromosome, name of the marker, and the position of the marker in centiMorgans:

Minimum Required Fields in a Map Table

Variable For Storing Field Name  Default Value  Description
-------------------------------  -------------  -----------
ChromosomeField                  "CHROMOSOME"   Numeric field storing the chromosome number.
OrdinalField                     "ORDINAL"      Numeric field storing the ordinal position or rank of the marker on the map for this chromosome.
MarkerField                      "MARKERNAME"   Character field storing the name of the marker.
PositionField                    "POSITION"     Numeric field storing the map position from the p terminus in centiMorgans.

Additional columns for sex-specific maps may also be present: see the load command for details.

A map may be viewed using the list map command:

M>load 'marshfield.map.mfh'
Marker maps based on marshfield.map.mfh are now installed.
M>list map for chromosome 7

                    Map Position (Kosambi cM)
                  -----------------------------
Ch Or Marker Name Sex-avg.   Female     Male
-- -- ----------- --------- --------- ---------
 7  1 035XB9         0.0000     .         .
 7  2 GATA24F03      0.0001     .         .
 7  3 GATA61G06      3.7001     .         .
 7  4 TATC010        6.4001     .         .
 7  5 GATA119B03    10.0001     .         .
 7  6 TATT019       17.3001     .         .
 7  7 GATA137H02N   22.0001     .         .
 7  8 GATA41G07     26.0001     .         .
 7  9 GATA137A12    30.5001     .         .
 7 10 GGAA3F06      35.0001     .         .
 7 11 AGAT103       37.7001     .         .
 7 12 GATA13G11     43.0001     .         .
 7 13 GATA026       48.8001     .         .
 7 14 GATA31A10     51.0001     .         .
 7 15 ATA31F09      55.1001     .         .
 7 16 TAT028        57.3001     .         .
 7 17 GATA24D12     63.0001     .         .
 7 18 GATA4E04      65.8001     .         .
 7 19 GATA118G10    72.0001     .         .
 7 20 GATA21D12     77.0001     .         .
 7 21 GATA73D10N    84.0001     .         .
 7 22 GATA87D11     88.4001     .         .
 7 23 GATA3F01      91.0001     .         .
 7 24 ATA78C09NZ    96.6001     .         .
 7 25 GATA5D08     102.0001     .         .
 7 26 GATA23F05    107.0001     .         .
 7 27 ATAC037      112.9001     .         .
 7 28 TTTA001      118.1001     .         .
 7 29 AGAT133      119.1001     .         .
 7 30 GGAA6D03N    121.0001     .         .
 7 31 ATA55A05     123.2001     .         .
 7 32 GATA145G10   127.6001     .         .
 7 33 GATA43C11    130.0001     .         .
 7 34 GATA63F08    143.0001     .         .
 7 35 GATA32C12    143.0002     .         .
 7 36 GATA104      148.0002     .         .
 7 37 AGAT049      150.8002     .         .
 7 38 GATA189C06   156.0002     .         .
 7 39 TATG002      161.0002     .         .
 7 40 GATA30D09N   167.0002     .         .
 7 41 MFD442-GTTT  171.6002     .         .
M>

Loading and viewing marker maps. A map table is loaded using the load command. The list map command is used to print a table showing marker name, chromosome, mapped order, and position in centiMorgans.

Log and Error Reporting Features

Madeline produces three types of log files (table below). The first is a summary file that has a ".log" extension by default and records each command that was entered and a summary of execution results. The second is a detail file that has a ".dtl" extension by default. It provides details of command results, such as which pedigrees and individuals were included or excluded and why. The third log file is an error log that has a ".err" extension by default. It records warning and error conditions that occur.

Three Types of Log Files

Type of File  Default Name  Purpose
------------  ------------  -------
Summary       madeline.log  Records commands and summaries of execution results.
Detail        madeline.dtl  Records details regarding inclusion and exclusion of individuals and pedigrees.
Error         madeline.err  Records warning and error conditions.

You can change the names of the log files individually or en masse, as shown below:

M>?LogFile
"madeline.log"
M>LogFile="MyLogFile.log"
LogFile has been changed from "madeline.log" to "MyLogFile.log"
M>DetailFile="MyDetailLogFile.dtl"
DetailFile has been changed fro