Madeline Version 0.935 Documentation
by Edward H. Trager <ehtrager@umich.edu> (June 2004)
© 2003, 2004 by the Regents of the University of Michigan ALL RIGHTS RESERVED
This software program is released under the GNU General Public License.
Madeline is a program for preparing, visualizing, and exploring human pedigree data used in genetic linkage studies. In addition to converting pedigree and marker data into various formats required by linkage analysis software, including Crimap, Genehunter, Allegro, Mendel, Merlin, PedCheck, and Simwalk2, Madeline also provides functionality for querying pedigree data sets and drawing pedigrees.
By combining a database engine with a software engine that understands the relationships between people in pedigrees, Madeline provides functions for investigating data on individuals and pedigrees in genetic linkage studies (Fig. 1).
Note that this release of the program, version 0.935, has numerous changes compared to the previous release (version 0.933). Even if you are well versed in the workings of version 0.933, you are advised to take a careful look at the changes and new features described in this documentation.
As the old adage says, necessity is the mother of invention. When I first started this project a number of years ago, I had never written a recursive descent parser, never implemented balanced binary sorted trees, and I think at that time I had not even heard of the Postscript graphics language! There certainly was no master plan for this program, only the desire to get work done more easily with fewer data conversion headaches. Nor did I or my colleagues sit down and specify a coding standard for the program, much less a documentation standard. Whatever I could write in a reasonable amount of time that happened to work well enough to get the job done was just that, good enough.
Fortunately, a few early decisions were fundamentally correct, even if my implementations were less than perfect. The program began to take shape back in the days when all I knew was DOS and one of my first decisions was to use a DOS 32-bit protected mode library for the Borland compiler. Since the program used a 32-bit flat memory model from day one, it proved easy to port it over to Solaris and HP-UX when I finally got around to learning Unix.
Another early decision was to add an interactive command interface to the program. Early versions of the program required arguments passed from the command line, and it quickly became evident that hundreds of command-line arguments would be difficult to remember and would impair program flexibility. After reading a book on programming in C by Herbert Schildt which showed how to write a BASIC interpreter, as an experiment I decided to create a version of the program with an interactive command parser. That of course proved to be much nicer than the earlier non-interactive versions.
A third pivotal decision was to add pedigree drawing. When I initially suggested doing this, I remember being told that this was a hard problem which I should not waste my time on since other programs existed which already provided that functionality. In retrospect, I'm exceedingly glad I didn't listen to that advice, since the ability to display pedigree data graphically is probably one of the program's greatest strengths. The graphics were originally created using HP's PCL printer language; Postscript was introduced in a later revision of the program.
The program --and the programmer!-- have now begun to mature, but maturation has occurred, and continues to occur, as a slow process. A number of people in the labs where I have worked and elsewhere have found Madeline useful and a timesaver. As a result, I am encouraged to move the program closer to the ideal that I imagine it could be. Although the program and its code still have numerous shortcomings, you can still use these intermediate releases to help you complete your work more quickly, with less hassle, and fewer errors.
Finally, please note that this version of Madeline is released under the GNU General Public License which grants authors and users certain rights. I encourage you to read the license if you are not already familiar with it. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
A lot of work goes into a program like this and I am indebted to many people for their help and suggestions. I would especially like to thank the following people:
Madeline v. 0.935 has been successfully compiled using at least the following hardware and compiler combinations:
| Compiler | Platforms |
|---|---|
| Intel C++ v. 7.0 Compiler | SuSE v. 8.1 Linux (2.4.19 i386 kernel) |
| GNU g++ v. 3.3.1 or v. 3.3.3 | SuSE v. 9.1 Linux (2.6.4 i386 kernel) (g++ 3.3.3); SuSE v. 9.0 Linux (2.4.21 i386 kernel); FreeBSD v. 5.2.1 on i386 (g++ 3.3.3); Solaris (SunOS v. 5.8) on UltraAX-e2 |
| GNU g++ v. 3.2.2 | SuSE v. 7.3 Linux (2.4.16 i386 kernel); SuSE v. 8.1 Linux (2.4.19 i386 kernel); OpenBSD v. 3.2 (i386) |
| GNU gcc 2.95.3 | RedHat v. 6.2 Linux (2.2 i386 kernel); SuSE v. 7.2 Linux (2.4 i386 kernel); SuSE v. 7.3 Linux (2.4 i386 kernel); OpenBSD v. 2.9 on i386; FreeBSD v. 4.4 on i386; Sun Solaris 8 (SunOS 5.8) on i386; SunOS 5.6 on Sparc Ultra-1; Solaris 8 (SunOS 5.8) on Sparc UltraSPARC-IIi; Cygwin on Windows 2000 on i386 |
| GNU gcc 2.95.2 | Apple Macintosh OS X 10.1.5 on G4 |
| GNU gcc 3.3 | Apple Macintosh OS X 10.3.4 on G4 |
| Sun Forte Workshop 6 Update 2 C/C++ v. 5.3 Compiler | Solaris 8 (SunOS 5.8) UltraAX-e2 32-bit executable; Solaris 8 (SunOS 5.8) UltraAX-e2 64-bit executable; Solaris 8 (SunOS 5.8) on Sparc UltraSPARC-IIi 32-bit executable; Solaris 8 (SunOS 5.8) on Sparc UltraSPARC-IIi 64-bit executable; SunOS 5.6 on Sparc Ultra-1 32-bit executable |
Madeline now uses the GNU Autoconf system for automatic configuration. In light of this, we expect that the program can be built successfully on virtually all modern UNIX-like platforms.
Please review Notes on Installing Madeline Version 0.935 for more information about compiling and installing Madeline on specific platforms.
Madeline was originally designed to meet the needs of the
Finland-United States Investigation of NIDDM Genetics (FUSION) study. Because of this, Madeline has specific knowledge about
FUSION study IDs. A narrow subset of Madeline's functionality makes use of this knowledge.
Click here if you are interested in learning more about Madeline's FUSION-specific functionality.
Paragraphs or headings preceded by "FUSION:"
describe FUSION-specific functionality. This functionality is only available when FusionSupport is
set on.
FusionSupport is off by default.
Note: All current development on Madeline focuses on providing general support for genetic linkage studies. FUSION-specific development ceased long ago.
Included with the distribution of the program is an extensive tutorial
that will guide you through the entire linkage analysis process using Madeline
interactively. The tutorial is located in the tutorial/Documentation
subdirectory under the name MadelineTutorial.html. The
tutorial subdirectory also contains all of the data files
needed to complete the tutorial.
The tutorial can serve as a quick introduction to the program and give you a feel for how the program works. After completing the tutorial, you can return to reading the main documentation for an in-depth treatment of the program's features.
Instructions to Madeline are entered at a command prompt. Madeline's command interpreter is not sensitive to capitalization. However, capitalization is often used in this document for clarity of presentation.
Madeline can be run interactively or in batch mode (Fig 2). To run Madeline interactively, type "madeline" at your system prompt and press return. Madeline's "M>" prompt will appear.
Batch files contain a sequence of Madeline commands that have been saved in a text file in
ASCII or
UTF-8 format.
There are two ways to run batch files. The first way is to provide the name of a batch file
as a command line parameter after the name of the program. The second way is to
start Madeline interactively and then use the run command to
execute the batch file. Madeline returns to interactive mode if an error occurs, or when a
batch file terminates without a goodbye or
quit command.
edtrager@retina:~> madeline          <-- starting Madeline in interactive mode
______________________________________________________________________________
______________________________________________________________________________
[MADELINE banner in ASCII art]
______________________________________________________________________________
______________________________________________________________________________
Version 0.935
Written by Edward H. Trager <ehtrager@umich.edu>
COPYRIGHT 2003 THE REGENTS OF THE UNIVERSITY OF MICHIGAN
PORTIONS COPYRIGHT 1995 EDWARD H. TRAGER
ALL RIGHTS RESERVED
Madeline comes with ABSOLUTELY NO WARRANTY.
This is free software and you are welcome to redistribute it under certain conditions.
For details, type "license"
+-----------------------+-----------+-----------------------------------------+
| Variable or State Flag| Setting   | Description                             |
+-----------------------+-----------+-----------------------------------------+
| EXTERNAL PROGRAMS     |           |                                         |
+-----------------------+-----------+-----------------------------------------+
| Editor                | edith     | Program used to edit files              |
| PostscriptViewer      | gv        | Program used to view Postscript drawings|
| PrintCommand          | lpr       | System program used to print files      |
| WebBrowser            | mozilla   | Program used to view HTML documentation |
+-----------------------+-----------+-----------------------------------------+
| EVALUATION SETTINGS   |           |                                         |
+-----------------------+-----------+-----------------------------------------+
| EvaluationInterval    | 0.50 cM   | Value to write to control file.         |
| OffEndDistance        | 10.00 cM  | Value to write to control file          |
+-----------------------+-----------+-----------------------------------------+
| DRAWING SETTINGS      |           |                                         |
+-----------------------+-----------+-----------------------------------------+
| Color                 | ON        | Draw pedigrees in color                 |
| ReverseShading        | OFF       | Black is first icon shade               |
| DividedDrawings       | ON        | Paginate drawings by founding group     |
| HighlightRows         | ON        | Alternately highlight data on drawings  |
| LabelCreatedIndividual| ON        | Label virtuals created by Madeline      |
| Orientation           | AUTOMATIC | Automatic based on drawing dimensions   |
| PaperMargin           | 1.00 cm   | Margin (in cm) on all four sides        |
| PaperSize             | USLETTER  | 8.5 x 11.0 inches                       |
+-----------------------+-----------+-----------------------------------------+
| OTHER SETTINGS        |           |                                         |
+-----------------------+-----------+-----------------------------------------+
| AutoExclude           | ON        | Exclude pedigrees automatically         |
| AutoCheckInheritance  | ON        | Check inheritance on OPEN               |
| ConsoleHighlights     | ON        | Use bold/color highlights on console    |
| Delimiter             | TAB       | Delimiter for tables and other output.  |
| FusionSupport         | OFF       | FUSION customizations disabled          |
| HaplotypeDisplay      | OFF       | Display genotypes delimited with "/"    |
| Language              | American E| Language convention used for date, time |
| MapDetails            | OFF       | LIST MAP summary display                |
| SaveAlleleFrequencies | OFF       | Calculate new frequencies on next OPEN  |
| Time                  |           | Friday, December 5, 2003                |
| Verbosity             | VERBOSE   | All messages are printed to the console |
+-----------------------+-----------+-----------------------------------------+
M>
M>quit                               <-- entering a command in interactive mode
Releasing resources ...
Goodbye!
edtrager@retina:~>madeline chromosome20.script          <-- starting Madeline in batch mode
open 'linkage/chr20.data.mfh'        <-- executing first batch command
Calculating allele frequencies for 7. D20S173...
Calculating allele frequencies for 10. D20S889...
Calculating allele frequencies for 13. D20S898...
...
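Because a batch file is only a text file of commands, it can also be generated by another program. The following Python sketch writes a batch file and builds the command line shown in the session above. The helper names are hypothetical, and the sketch assumes one command per line and a `madeline` executable on the search path; it does not start Madeline itself.

```python
from pathlib import Path
import tempfile


def write_batch_file(path, commands):
    """Write one Madeline command per line and make sure the file ends with quit."""
    lines = list(commands)
    # Without a final goodbye or quit, Madeline returns to interactive mode.
    # The command interpreter ignores capitalization, so compare in lower case.
    if not lines or lines[-1].strip().lower() not in ("quit", "goodbye"):
        lines.append("quit")
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def batch_command(script_path, program="madeline"):
    """Argument list for batch mode: the program name followed by the batch file name."""
    return [program, str(script_path)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "chromosome20.script"
        write_batch_file(script, ["open 'linkage/chr20.data.mfh'"])
        print(batch_command(script))
        # To run it for real (requires an installed Madeline):
        # subprocess.run(batch_command(script), check=True)
```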
You can set parameters and run commonly needed commands
automatically each time Madeline is started by providing a script file
called "initial.script". Madeline will first look for a local version
of initial.script in the current working directory from which Madeline
is invoked. Failing to find initial.script there, Madeline will look in
the share/madeline/ subdirectory under the directory prefix where Madeline was
installed. For example, if Madeline was installed in /usr/local, then the program
will look for /usr/local/share/madeline/initial.script.
Any commands that can normally be invoked on the command line or in a batch file
can be placed into initial.script. Assignments to specify default field names or
environmental settings are typically placed in initial.script (Fig. 4).
//
// Typical initial.script file:
//

//
// Environment settings:
//
quiet
set language to English
Editor="emacs"
PostscriptViewer="gv"

//
// Pedigree drawing-specific settings:
//
set color off
set PaperSize to A4
// margin in centimeters:
set PaperMargin to 1.5
set orientation to automatic

//
// Pedigree database-specific settings:
//
GenderField='GENDER'
FamilyIDField='FAMILY'
IndividualIDField='INDIVIDUAL'

//
// Map standard missing value indicators:
//
NumericMissingValue[1]=-1
NumericMissingValue[2]=-9

//
// Map database-specific settings:
//
PositionField="POSTN"
OrdinalField ="ORDNL"
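The search order for initial.script described above can be mirrored in a few lines of Python, which is handy for checking which file a given working directory will pick up. This is only a sketch of the documented lookup order, not Madeline's own code; the function name and parameters are hypothetical.

```python
from pathlib import Path


def find_initial_script(cwd, prefix):
    """Return the initial.script that would be used, or None if there is none.

    The current working directory is checked first, then the
    share/madeline/ subdirectory under the installation prefix.
    """
    candidates = [
        Path(cwd) / "initial.script",
        Path(prefix) / "share" / "madeline" / "initial.script",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # With Madeline installed in /usr/local, the fallback location is
    # /usr/local/share/madeline/initial.script.
    print(find_initial_script(Path.cwd(), "/usr/local"))
```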
Note: Starting with Madeline v. 0.933, it is recommended that
site-wide defaults be compiled directly into Madeline by customizing the config.h
source file generated by the configure script run by you or your system administrator
when Madeline is installed. Many (but not all) parameters can be configured in config.h.
Remaining settings can be specified in the initial.script as necessary. If you don't
require any site-specific customizations, you can just leave the global default initial.script
as is.
Madeline processes data stored in tables. A database table is a rectangular array of data. A record is a row in the array. A field is a column in the array. Each record contains one or more identifiers or keys which identify the entity, and the data -- all the measured variables -- for the entity. The measured entity may be an individual in a pedigree, a genetic marker, a position along a genetic map, or something else.
The program currently supports the following table formats:
Note: Of these four formats, we now recommend using only the Madeline native format because it is open, non-proprietary, human-readable, and editable in any text editor (in the case of UTF-8 files, in any UTF-8 capable text editor: see this link). Although supported in versions 0.933 and 0.935, the legacy xbase and SAS transport formats are deprecated and we may not support them at all in future versions of the program.
The structure of the Madeline format is described below. Sample PHP code for creating Madeline files from database tables is also provided.
A Madeline-formatted table is a human-readable flat file containing ASCII or UTF-8 characters having the following structure:
The following figure illustrates how the example "relationships.dat"
data file included in the software distribution conforms to this structure:
Madeline Table Format. Tables in the Madeline format are flat files divided into a header block consisting of one or more lines containing column labels and optional type designators, and a data block consisting of even-length records divided into space-delimited data columns. (The vertical blue arrow illustrates how data in a single column can, if necessary, contain embedded white space, as long as that white space does not stretch uninterrupted from the first to the last record: vertically uninterrupted white space delimits columns).
The header contains:
Column labels should be CAPITALIZED. Column labels are separated by any amount of white space and can span as many lines as necessary. Each line in the header can contain one or more column labels (This is illustrated above where seventeen column labels span fifteen header lines).
The order of the column labels from left-to-right and top-to-bottom indicates the order of the columns in the data block. Lines in the header can be of varying lengths.
A column type designator consists of a single capital letter following a column label. Any amount of white space can be used between a column label and its type designator. The following column type designators are recognized:
Column Type Designators in Madeline Tables
| Column Type Designator | Description | Example |
|---|---|---|
| C | Designates: | STUDYID C |
| X | Designates the gender (sex) column. The gender column may alternatively be designated with a "C" for character codes or with an "N" for numeric gender codes. | SEX X |
| N | Designates that a column containing numeric values be treated as numeric data. | AGE_DX N |
| D | Designates that a column contains date data in ISO-8601 format (four-digit year followed by a two-digit month and finally a two-digit day). Note that Madeline permits use of any of the usual date delimiters: | DOB D |
| G | Designates that a column contains genotypes as numeric allele labels separated by a forward slash "/" character. Genotype columns can alternatively be designated with a "C": the program will automatically recognize character columns containing genotypes. | D12S4321 G |
| A | Designates that a column contains one of the two alleles that constitutes a genotype. Allele columns must exist as identically-named pairs. | D12S4321 A |
At least one blank line must follow after the header in order to separate the header block from the data block.
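The header rules above (labels separated by any amount of white space, each optionally followed by a single-letter type designator, and a blank line before the data block) can be illustrated with a short Python sketch. It is not Madeline's parser: the function names are hypothetical, it assumes the file starts directly with the header, and it assumes that no column is itself named with one of the designator letters.

```python
TYPE_DESIGNATORS = {"C", "X", "N", "D", "G", "A"}


def parse_header(header_text):
    """Return a list of (label, designator_or_None) in column order."""
    columns = []
    for token in header_text.split():
        # A designator letter directly after a label that has no designator
        # yet is attached to that label; anything else starts a new column.
        if token in TYPE_DESIGNATORS and columns and columns[-1][1] is None:
            columns[-1] = (columns[-1][0], token)
        else:
            columns.append((token, None))
    return columns


def split_header_and_data(text):
    """Split a table at the first blank line, which separates header and data."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if not line.strip():
            data = [row for row in lines[i + 1:] if row.strip()]
            return "\n".join(lines[:i]), data
    return "\n".join(lines), []


if __name__ == "__main__":
    table = "FAMID C STUDYID C\nSEX X\nAGE_DX N\n\nF01 I001 M 54\nF01 I002 F 51\n"
    header, data = split_header_and_data(table)
    print(parse_header(header))
    print(data)
```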
The data block consists of:
Each record (or row) contains the identifiers and data measured for one entity. For example, a pedigree table contains one row for each individual in a family, whereas a map table contains one row for each marker in a genetic map.
The identifiers and data for each record are formatted in:
Note #1: Column type designators are technically optional in most (but not all) cases. The program contains code to automatically detect column types. Certain type promotions, such as from "C" to "G" and from "C" to "X" are permitted and performed automatically when required. The primary exception occurs when completely numeric individual and pedigree identifiers are used: the program automatically detects the columns as being "N" numeric, but you MUST cast them as "C" because the program requires string identifiers for individuals and pedigrees. The best practice is to include column type designators since this increases file readability for humans and prevents surprises.
Note #2: A problem that can occur with hand-edited data files (as opposed to those generated by a script, program, or database system), is embedded tabs or extra spaces or tabs appended to the end of various lines of the data -- but quite invisible when viewed in a typical editor! These make the data block non-rectangular, which is not allowed.
Madeline's recognize command now specifically looks for embedded tabs
and inconsistent data row lengths caused by extra tabs or space characters at the ends of rows. The
rectify command will fix both types of problems in most cases.
You should also open the data file using a good file editor such as
Edith and use the column-highlighting feature (i.e., CTRL+<Left Mouse Button>) to select white space after
the last data column. This quickly reveals whether the lines are all the same length or -- as is
quite likely going to be the case in problematic files -- not (See figure below).
If they are not, a simple CTRL-X
in Edith trims off all of the selected segments.
Having trouble getting Madeline to recognize your data file?
Often the culprit is hidden spaces and/or tab characters trailing after the last
column of data, making it non-rectangular. Another culprit could be tab characters
embedded within rows (not illustrated). The rectify command
will handle both types of problems. Alternatively, you can use Edith
or a similar file editor to display embedded tabs and terminal tabs and spaces.
Hold down the CTRL key while pressing
the left mouse button to select the trailing white space and remove it (CTRL-X).
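The two problems described above, embedded tabs and rows of unequal length, can also be spotted with a short script before running recognize. The Python sketch below is an independent check, not a reimplementation of the recognize or rectify commands; the function name is hypothetical, and it assumes it is given only the lines of the data block, with the most common row length taken as the intended one.

```python
from collections import Counter


def check_data_block(lines):
    """Return a list of (line_number, problem) tuples for a data block."""
    problems = []
    # Lengths are measured in bytes because columns align on byte boundaries.
    lengths = [len(line.encode("utf-8")) for line in lines]
    expected = Counter(lengths).most_common(1)[0][0] if lengths else 0
    for number, (line, length) in enumerate(zip(lines, lengths), start=1):
        if "\t" in line:
            problems.append((number, "embedded tab"))
        if length != expected:
            problems.append((number, f"length {length}, expected {expected}"))
    return problems


if __name__ == "__main__":
    block = ["I001 M 54", "I002 F 51  ", "I003\tM 7", "I004 F 12"]
    for number, problem in check_data_block(block):
        print(f"line {number}: {problem}")
```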
If your data contain strings in non-English languages that use accented Latin letters (such as "ç" and "é" in French, "ñ" in Spanish, and "ü" in German) or non-Latin scripts (Cyrillic, Japanese, Chinese, etc.), then your data must be encoded in the Unicode UTF-8 format and stored using the Madeline table format. You will also need to run Madeline in a Unicode-capable terminal emulator under a UTF-8 locale.
What does this mean? For European users, it means that Madeline does not support any of the legacy ISO-8859-x character sets, not even ISO-8859-1. For other users, it means that your country's legacy character sets, whether KOI-8, JIS, GB18030, or something else, are not supported. The primary issue is that many people's computers are still set to use some legacy encoding system. Fortunately, major Linux distributions are now enabling UTF-8 locales by default. If you are using a recent release of SuSE or Redhat in North America or Europe, your system will be set to use UTF-8 by default. However, if you are using a recent Linux distribution in East Asia (China, Japan, etc.), you should check your locale settings. This document should provide you with most of the information you need to switch over to Unicode under Linux or a similar *nix-based system.
The rules for formatting a UTF-8 table in Madeline format are the same as those for ASCII. In particular, note that columns in the data block must be aligned on byte boundaries with white space (ASCII 0x0020) separating columns. Trying to format such a file manually is not recommended. Instead, store your data in a database and use a scripting language like Perl or PHP to extract the data into the correct format.
See the examples/utf8/utf8.data file as an example of a properly
formatted UTF-8 file:
Data files in the Madeline flat file format can contain data in any of the world's scripts encoded in Unicode UTF-8. The last column in the file appears unaligned, but is actually aligned on byte boundaries, not on displayed character boundaries.
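Byte alignment is easy to get wrong by hand because padding must be computed from the encoded length of each value, not from the number of characters displayed. The Python sketch below shows one way a script might pad columns; the function names are hypothetical, and it assumes every cell is non-empty (missing values written as "." rather than left blank).

```python
def pad_bytes(value, width):
    """Left-justify value in a field that is `width` bytes wide in UTF-8."""
    encoded = value.encode("utf-8")
    if len(encoded) > width:
        raise ValueError(f"{value!r} needs {len(encoded)} bytes, field has {width}")
    return encoded + b" " * (width - len(encoded))


def format_rows(rows):
    """Return UTF-8 encoded records whose columns align on byte boundaries."""
    # Each column is as wide as its longest value, measured in bytes.
    widths = [max(len(cell.encode("utf-8")) for cell in column)
              for column in zip(*rows)]
    return [b" ".join(pad_bytes(cell, width) for cell, width in zip(row, widths))
            for row in rows]


if __name__ == "__main__":
    records = format_rows([["I001", "José"], ["I002", "Ann"]])
    for record in records:
        # Every record has the same length in bytes, although "José"
        # occupies four character cells on screen and five bytes on disk.
        print(len(record), record)
```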
Before you can use a flat-file table, you must first run the recognize command, which creates a small, complementary binary file with a ".mfh" (Madeline File Header) extension. This complementary file stores meta information about the table (names and number of columns, number of rows, etc.) in a binary format which the program uses to optimize table access. To open the table, specify the complementary ".mfh" file name in place of the original data file.
An unannotated data file contains no header describing the fields. What do you do if you receive an unannotated data file from a colleague or client? Let Madeline do some of the work for you!
The recognize command contains code to automatically detect columns in unmarked files.
The program can even often identify the core gender, individual, father, and mother fields
required for pedigree reconstruction. This can save time and tedium. Consider the following excerpt from the
"unannotated.data" file included in the examples subdirectory of the software distribution:
NT5641 M AB0115 FM_012 A A AB0116 190/202 166/169 0/0     172/175 154/154 65 89.94
NT5661 F AB0147 FM_012 A A AB0148 208/211 153/157 201/207 160/175 154/160 50 81.03
NT5675 F AB0119 FM_012 A A AB0120 205/211 153/166 204/207 166/175 157/157 73 82.06
NT5676 F AB0123 FM_012 A A AB0124 190/208 156/166 201/207 175/177 154/157 63 88.76
NT5678 F AB0140 FM_012 A I NT5676 190/208 157/166 201/207 175/177 154/157 60 65.86
NT5679 F AB0115 FM_012 A I AB0116 202/211 157/169 207/207 172/175 154/154 78 82.92
NT5724 F AB0135 FM_012 A I AB0136 205/205 166/169 204/210 166/172 154/157 69 71.82
NT5728 F AB0113 FM_012 A A AB0114 190/190 156/169 204/207 160/175 148/157 64 84.55
NT5749 F AB0121 FM_012 A A AB0122 205/211 157/157 0/0     166/175 154/160 .  87.46
NT5752 F AB0121 FM_012 A I AB0122 205/211 153/157 204/207 166/175 148/154 83 88.57
NT5753 F NT5641 FM_012 U I AB0132 .       .       .       .       .       .  62.94
NT5757 F AB0130 FM_012 A A AB0131 202/211 157/169 201/207 166/172 154/160 55 70.16
NT5790 F AB0138 FM_012 U I AB0139 208/208 157/160 207/207 172/172 154/154 .  71.18
. . .
Unannotated Table. Having trouble deciphering which column is which? Let Madeline help you!
Running the recognize command on this file produces the following:
M>recognize 'unannotated.data'
Recognizing file "unannotated.dat" to "unannotated.dat.mfh" ...
Skipping a total of 1 line at top.
There are 0 non-empty header lines and 54 data lines.
Data records are 83 bytes long.
The gender field has been identified and will appear in the ".run" file
The individual, father, and mother ID fields have been identified
and will appear in the ".run" command file
# . Field Name Start End Length Prec. Space Type
---- ----------- ----- ----- ------ ----- ----- -----
1. INDIVIDUAL 1 6 6 0 1 C
2. GENDER 8 8 1 0 1 X
3. FATHER 10 15 6 0 1 C
4. CHAR_003 17 22 6 0 1 C
5. CHAR_004 24 24 1 0 1 C
6. CHAR_005 26 26 1 0 1 C
7. MOTHER 28 33 6 0 1 C
8. GENO_001 35 41 7 0 1 G
9. GENO_002 43 49 7 0 1 G
10. GENO_003 51 57 7 0 1 G
11. GENO_004 59 65 7 0 1 G
12. GENO_005 67 73 7 0 1 G
13. NUME_001 75 76 2 0 1 N
14. NUME_002 78 82 5 2 1 N
Binary recognition header file (".mfh") written.
--> If this is a pedigree file, type 'open "unannotated.dat.mfh" '.
--> If this is a genetic map file, type 'load "unannotated.dat.mfh"'
The template batch file unannotated.dat.run has been created.
NOTE: The ".run" file contains commands and parameters to assist
you in opening a flat file database, but generally requires
editing before use.
M>
Clearly the program cannot perform magic. For example, the program was not able to determine the
FamilyIDField (the fourth column in the table) and there are other limitations to what the program
can do. Nevertheless, being able to correctly identify the core pedigree structure and genotype columns in an unannotated
data file can save you from a tedious manual investigation of the data.
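The Start and End positions in the listing above follow from the rule that vertically uninterrupted white space delimits columns. The Python sketch below applies that rule to a few records. It is an illustration of the idea only and makes no claim about how recognize is implemented: the function name is hypothetical, and it works on characters, so it assumes ASCII data.

```python
def detect_columns(lines):
    """Return (start, end) positions, 1-based and inclusive, of each column."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # A position separates columns only if it is blank in every record.
    blank = [all(line[i] == " " for line in padded) for i in range(width)]
    columns, start = [], None
    for i, is_blank in enumerate(blank):
        if not is_blank and start is None:
            start = i
        elif is_blank and start is not None:
            columns.append((start + 1, i))
            start = None
    if start is not None:
        columns.append((start + 1, width))
    return columns


if __name__ == "__main__":
    records = [
        "NT5641 M 190/202 65",
        "NT5661 F 0/0" + " " * 5 + ". ",
    ]
    print(detect_columns(records))
```

White space inside a value, such as the padding after "0/0", does not split a column as long as some other record has data at that position.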
When using xbase or SAS transport formats, Madeline column type designators like X, G, and A do not exist. For legacy formats, only use character, numeric, and date column types. For example, the gender attribute can be stored in a character field coded using "M" and "F". Genotypes can also be stored in character fields. Floating-point and integer fields are cast to Madeline's double-precision numeric floating-point type. Also avoid the logical and date-time field types found in dBase and FoxPro as Madeline won't understand these.
The command used to open or manipulate a table depends on the type of table being processed:
open command.
load or graph load command.
compose and decompose commands, respectively. Composed tables can also be merged both row-wise and column-wise using the merge command.
read command. They can be saved out using the save command.
graph open command.

Madeline's database engine detects operating system and file byte-ordering at run time, permitting tables from PCs to be opened on UNIX workstations, and vice versa.
The different types of tables are described below in turn.
In a pedigree table, each row (record) contains the data for one individual.
In Madeline, the names of the family and individual ID fields are stored in variables
called FamilyIDField and
IndividualIDField, respectively.
Basic pedigree reconstruction additionally requires knowledge of
the father (FatherIDField),
mother (MotherIDField),
and gender (GenderField)
of each individual. Together, these field variables comprise the five core fields that
must be present in every pedigree table:
FamilyIDField -- key identifier
IndividualIDField -- key identifier
FatherIDField -- required for pedigree reconstruction
MotherIDField -- required for pedigree reconstruction
GenderField -- required for pedigree reconstruction

The remaining identifiable data fields in a pedigree table are classified by Madeline into two groups: (1) phenotype and (2) genotype. Madeline therefore automatically classifies every identifiable field in a pedigree table into one of three categories: core, phenotype, or genotype. Each identifiable field is tagged with a single-letter identifier shown below:
Core fields are identified by matching field names against the names stored in the
field name variables (i.e., FamilyIDField,
StudyIDField, GenderField, etc.).
Genotype fields are identified by scanning the data. Remaining fields are classified as phenotype fields.
Field classifications are shown in the figure below, illustrating a portion of a typical pedigree table:
Pedigree Tables. All identifiable fields in a pedigree table are classified into one of three categories in Madeline: "C" for core fields, "P" for phenotype fields, and "G" for genotype fields.
Note: The complete set of core fields consists of the five required core fields shown above, plus optional core fields such as AffectionStatusField, MZTwinField, and DateOfBirthField. See Core Data Fields for a complete listing.
Fields containing only missing value indicators that cannot be categorized
as "C", "P", or "G" will be marked with an asterisk, "*".
Phenotype "P" fields can be tagged by the user as being covariate "V" fields using the
toggle command. Phenotype fields are never automatically classified as
covariate "V" fields.
When a pedigree table containing allele "A" columns instead of genotype columns, such as a Linkage file, is opened in Madeline, the paired and identically-named allele columns automatically appear as single genotype "G" fields.
Below is an excerpt from
the "linkage.ped" data set distributed with the program. Note that while
the data block is in the unchanged Linkage format, Madeline still requires that you provide a conformant header
in Madeline format to identify the columns:
FAMID C  STUDYID C  FATHER C  MOTHER C  SEX N  AFFECTED N
MARKER1 A  MARKER1 A  MARKER2 A  MARKER2 A  MARKER3 A  MARKER3 A
MARKER4 A  MARKER4 A  MARKER5 A  MARKER5 A  MARKER6 A  MARKER6 A

F0021 K0001A 0      0      1 0      0   0    0   0    0   0    0   0    0   0    0   0
F0021 K0001B 0      0      2 0      0   0    0   0    0   0    0   0    0   0    0   0
F0021 K00158 K0001A K0001B 1 1      1   4    1   3    1   1    3   4    1   3    1   3
F0021 K00159 K0001A K0001B 1 0      1   3    1   2    1   1    1   2    1   2    1   2
. . .
Here are the commands to recognize and open this file. Note how the pairs of allele columns are recognized as genotype columns:
M>recognize "linkage.ped"
Recognizing file "linkage.ped" to "linkage.ped.mfh" ...
...
# . Field Name Start End Length Prec. Space Type
---- ----------- ----- ----- ------ ----- ----- -----
1. FAMID 1 5 5 0 1 C
2. STUDYID 7 12 6 0 1 C
3. FATHER 14 19 6 0 1 C
4. MOTHER 21 26 6 0 1 C
5. SEX 28 28 1 0 1 N
6. AFFECTED 30 30 1 0 6 N
7. MARKER1 37 41 5 0 4 G
8. MARKER2 46 50 5 0 4 G
9. MARKER3 55 59 5 0 4 G
10. MARKER4 64 68 5 0 4 G
11. MARKER5 73 77 5 0 4 G
12. MARKER6 82 86 5 0 2 G
...
M>// Because the LINKAGE file uses zeros to represent
M>// missing parents, we need to add zero to the
M>// CharacterMissingValue[] array:
M>list CharacterMissingValue
CharacterMissingValue has 5 elements:
CharacterMissingValue[ 1]="."
CharacterMissingValue[ 2]="/"
CharacterMissingValue[ 3]="0/0"
CharacterMissingValue[ 4]="0/ 0"
CharacterMissingValue[ 5]="0/ 0"
M>cmv[6]="0"
M>// Madeline's GenderStatus[] and AffectionStatus[]
M>// arrays now contain LINKAGE code mappings by default, so
M>// we can immediately open the file:
M>open "linkage.ped.mfh"
6. AFFECTED has 3 levels.
Calculating allele frequencies for 7. MARKER1...
Calculating allele frequencies for 8. MARKER2...
Calculating allele frequencies for 9. MARKER3...
Calculating allele frequencies for 10. MARKER4...
Calculating allele frequencies for 11. MARKER5...
Calculating allele frequencies for 12. MARKER6...
Pedigree table "linkage.ped.mfh" opened with 13 records ...
1.FAMID    Co__1    5.SEX      Co__5     9.MARKER3  Go__3
2.STUDYID  Co__2    6.AFFECTED Co__6+   10.MARKER4  Go__4
3.FATHER   Co__3    7.MARKER1  Go__1    11.MARKER5  Go__5
4.MOTHER   Co__4    8.MARKER2  Go__2    12.MARKER6  Go__6
M>
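The way paired allele columns collapse into single genotype fields can be sketched in Python as follows. This only illustrates the pairing rule (identically-named "A" columns form one "G" field); it is not Madeline's code, the function name is hypothetical, and it assumes the two columns of a pair are adjacent, as they are in linkage.ped.

```python
def pair_alleles(labels, record):
    """Merge identically named adjacent allele columns into 'a/b' genotypes.

    labels: list of (name, designator) tuples in column order
    record: list of values in the same order
    Returns (names, values) with each allele pair replaced by one genotype.
    """
    names, values = [], []
    i = 0
    while i < len(labels):
        name, designator = labels[i]
        if designator == "A":
            # An allele column is only valid as half of an identically named pair.
            if i + 1 >= len(labels) or labels[i + 1] != (name, "A"):
                raise ValueError(f"allele column {name!r} has no partner")
            names.append(name)
            values.append(f"{record[i]}/{record[i + 1]}")
            i += 2
        else:
            names.append(name)
            values.append(record[i])
            i += 1
    return names, values


if __name__ == "__main__":
    labels = [("FAMID", "C"), ("STUDYID", "C"), ("MARKER1", "A"), ("MARKER1", "A")]
    print(pair_alleles(labels, ["F0021", "K00158", "1", "4"]))
```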
A map table contains map information related to markers on one or more chromosomes. The key fields in a map table are:
The data fields in a map table are:
The following optional fields may also be present:
An example of a map table is shown below:
MARKERNAME  CHROMOSOME  ORDINAL  POSITION
D17S944     17          1        82.6
D17S949     17          2        93.3
D17S1304    17          3        94.0
D17S1351    17          4        96.0
D17S1352    17          5        98.1
D17S1807    17          6        99.2
D17S929     17          7        99.2
D17S1301    17          8        100.0
D17S785     17          9        103.5
D17S674     17          10       105.7
A marker map table.
Note: A single map table can contain marker map data for multiple chromosomes. A single table is easier to maintain than multiple tables.
In traditional metal typography, metal casts of individual letters are "composed" into rows of type for printing on a printing press. In Madeline, we have borrowed the idea of "composing" text to refer to the operation of rearranging a table in which each row contains the alleles for one marker measured on one individual:
FAMID  STUDYID  MARKERNAME  ALLELE1  ALLELE2
-----  -------  ----------  -------  -------
P0001  I00001   D12S1304    121      127
P0001  I00001   D12S127     98       104
P0001  I00001   D12S341     134      142
P0001  I00002   D12S1304    117      119
P0001  I00002   D12S127     102      102
P0001  I00002   D12S341     136      138
.      .        .           .        .
Excerpt from a decomposed table. One row contains the alleles for one marker measured on one individual.
... into a table in which each row contains the genotypes for all of the markers measured on one individual ...
FAMID  STUDYID  D12S1304  D12S127  D12S341
-----  -------  --------  -------  -------
P0001  I00001   121/127   98/104   134/142
P0001  I00002   117/119   102/102  136/138
. . .
Excerpt of the same data after composition. One row contains the genotypes for all markers measured on one individual.
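The rearrangement described above is essentially a pivot from one row per marker to one row per individual. The following Python sketch illustrates the idea only; it is not Madeline's implementation, and the function name `compose` and the use of a dot for markers that were never typed on an individual are assumptions made for this example.

```python
def compose(rows):
    """Pivot one-marker-per-row records into one-individual-per-row records."""
    markers = []   # marker names, in order of first appearance
    people = {}    # (famid, studyid) -> {marker name: "allele1/allele2"}
    for famid, studyid, marker, allele1, allele2 in rows:
        if marker not in markers:
            markers.append(marker)
        people.setdefault((famid, studyid), {})[marker] = f"{allele1}/{allele2}"
    header = ["FAMID", "STUDYID"] + markers
    table = [
        # Assumption: a marker not measured on an individual becomes ".".
        [famid, studyid] + [genotypes.get(m, ".") for m in markers]
        for (famid, studyid), genotypes in people.items()
    ]
    return header, table


# The decomposed excerpt shown earlier in this section.
decomposed = [
    ("P0001", "I00001", "D12S1304", 121, 127),
    ("P0001", "I00001", "D12S127", 98, 104),
    ("P0001", "I00001", "D12S341", 134, 142),
    ("P0001", "I00002", "D12S1304", 117, 119),
    ("P0001", "I00002", "D12S127", 102, 102),
    ("P0001", "I00002", "D12S341", 136, 138),
]
header, table = compose(decomposed)
for row in [header] + table:
    print("\t".join(row))
```

Each output row carries the family and individual identifiers followed by one slash-delimited genotype per marker, which is the layout a pedigree table expects.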
A marker table contains the alleles for a specific marker measured on a specific individual. Output from the ABI Genotyper software is in this table format. This type of table has three key fields:
There are only two essential data fields in a marker table:
Madeline provides support for integrating the information in a marker table into a
pedigree table via the compose and merge commands.
The compose command takes care of converting the paired allele
fields into the single genotype fields expected in a pedigree table. The merge command
allows you to integrate family structure, phenotype, or genotype data from separate tables into a combined table for use
in Madeline.
Output files containing logarithm of odds (LOD) results from certain analysis programs such as Simwalk2 can be converted by Madeline directly into a table format which the program can then use for graphing the results. Output from other analysis programs may require a small amount of manual formatting which is usually not difficult to do.
The convert command may be used to produce a table in the right format:
M>convert simwalk file 'sw_chr10.out' to 'chr10.results'
Converting input file "sw_chr10.out" to Madeline-formatted output files ...
...
M>
Converting an analysis result file directly into a Madeline table format.
An analysis results table must contain at least a POSITION and a SCORE column. Other columns may also be present.
Here is a table of results from a non-parametric Simwalk2 analysis already formatted for use by
Madeline. For graphing this data in Madeline, you would have to specify which of the score columns to use
by setting Madeline's GraphScoreField to one of the "STAT" columns:
POSITION  STAT_A  STAT_B  STAT_C  STAT_D  STAT_E
 0.0000   0.073   0.046   0.072   0.069   0.077
 4.1691   0.182   0.076   0.180   0.167   0.181
 7.9053   0.163   0.076   0.162   0.156   0.183
10.0506   0.298   0.156   0.311   0.300   0.325
18.0591   0.578   0.409   0.614   0.633   0.665
21.3661   0.633   0.527   0.675   0.726   0.779
36.2864   1.526   0.690   1.478   1.445   1.359
39.7003   1.222   0.529   1.201   1.150   1.056
An analysis results table from Simwalk2 formatted for graphing in Madeline.
For more information on how to graph tables of LOD results, see the graph command.
Madeline's database engine supports: character string, numeric (floating point and integer), and date data types.
A logical data type (such as the "L" field type of xbase) or boolean data type is not distinguished from the numeric data type. Use appropriately coded numeric columns for boolean attributes (for example, 1=true, 0=false). Other derived types, such as date-time or monetary types are not supported.
Character data are read from tables by trimming leading and trailing space characters. Thus, blank entries in a database appear as the empty string, "", which Madeline interprets as a missing value indicator. When entered on the command line, literal character data must be delimited by a pair of matching single or double quotes, e.g., "0001-230" or '0980A'.
Madeline's interpretation of character data is affected by the values stored in the
CharacterMissingValue
(aliased as CMV) array. Each value stored in this array is
interpreted as an additional value to treat as a missing value indicator. The
CharacterMissingValue array contains a set of default values that
are appropriate for most data sets. Users can reassign current values, or assign additional values as
needed.
Users planning on using Linkage files with Madeline in particular should note that the character string "0" (zero, used to represent missing parent IDs in Linkage files) is not considered a missing value by default in Madeline. Recall that Madeline treats individual and pedigree identifiers as character strings.
All numeric data types are converted to double-precision floating point numbers.
Literal numeric values are entered on the command line without delimiters. Interpretation of
numeric data is affected by the values stored in the NumericMissingValue
(aliased as NMV) array. Each value stored in this array is
interpreted as an additional value to treat as a missing value indicator.
Users should verify whether Madeline's default numeric missing values
are appropriate for their data and reassign or add values to this
array as necessary.
Madeline does not recognize a logical or boolean data type separate
from the numeric data type. In contexts where a value is to be
interpreted as a logical value, Madeline treats zero (0) as #false,
and any non-zero, non-missing value as #true.
True/false data should thus be coded using a numeric field type
with values of 0, 1, and a numeric missing value indicator
if required.
Dates are converted internally into Julian day integers.
When entered at the command line, dates must be delimited between curly braces, { }.
Dates in tables or entered on the command line should be in ISO-8601 format (a four-digit year followed by a two-digit month and finally a two-digit day). In tables and on the command line, Madeline permits use of any of the following date delimiters:
YYYY (. or - or /) MM (. or - or /) DD
In addition, on the command line you can also use single spaces as delimiters:
M>? {2003 05 17}
{Saturday, May 17, 2003}
M>? ( {2003 05 17} - {1965 07 28} ) / 365.2425
37.8023
M>
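For readers who want to cross-check the date arithmetic outside Madeline, the same calculation can be sketched with Python's standard `datetime` module. This only mirrors the console example above; the helper name `years_between` is an assumption made for the illustration.

```python
from datetime import date


def years_between(later, earlier):
    """Difference between two dates in mean Gregorian years (365.2425 days)."""
    return (later - earlier).days / 365.2425


later = date(2003, 5, 17)
earlier = date(1965, 7, 28)

# Subtracting two dates yields a whole number of days.
print((later - earlier).days)                    # 13807
print(f"{years_between(later, earlier):.4f}")    # 37.8023
```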
Note: Previous versions of Madeline supported entry of dates in non-ISO formats,
such as "{December 11, 1963}". However, the program only supported a few locales (American English,
British English, etc.) and we felt that locale-specific date entry conventions could lead to confusion or even
errors for international collaborations. The program continues to support the display of dates in numerous
locales but date entry has been standardized to the ISO format. For example:
M> set language to Japanese
...
M>? {2003 05 17}
{2003年5月17日 (土曜日)}
Dates must be entered in ISO YYYY MM DD format, but can be displayed using specific locale conventions.
Pope Gregory XIII instituted the calendar that is now used internationally in October of 1582. However, only the Catholic countries in Europe adopted the Gregorian calendar immediately; other countries adopted it much later. For example, neither England nor the American colonies adopted it until the middle of the 18th century, and countries such as Thailand did not adopt the international calendar until the late 19th century. Madeline reports all dates since October of 1582 using the Gregorian calendar. To bring Easter back to the right time of the year, ten days were skipped in October of 1582. Madeline handles this correctly:
M>?{1582.10.04}
{Thursday, October 4, 1582}
M>?{1582.10.04} + 1
{Friday, October 15, 1582}
M>
Ten days were skipped in October of 1582 by Pope Gregory XIII.
Dates prior to October of 1582 are reported using a proleptic calendar that projects the Gregorian-Julian calendar back in time.
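The behavior around October 1582 follows naturally once dates are stored as Julian day integers: adding one to the day number of Thursday, October 4 gives the day number of Friday, October 15. The Python sketch below uses the standard integer formulas for Julian day numbers. It is an illustration only, not Madeline's code; the function names are hypothetical, and the choice to apply Julian-calendar rules to every date before October 15, 1582 is an assumption of this sketch.

```python
GREGORIAN_START_JDN = 2299161  # Julian day number of 15 October 1582 (Gregorian)
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def to_jdn(year, month, day):
    """Calendar date -> Julian day number (Julian rules before 1582-10-15)."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    base = day + (153 * m + 2) // 5 + 365 * y + y // 4
    if (year, month, day) >= (1582, 10, 15):
        return base - y // 100 + y // 400 - 32045   # Gregorian leap-year rules
    return base - 32083                              # Julian leap-year rules


def from_jdn(jdn):
    """Julian day number -> (year, month, day), switching calendars in 1582."""
    f = jdn + 1401
    if jdn >= GREGORIAN_START_JDN:
        f += (((4 * jdn + 274277) // 146097) * 3) // 4 - 38
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return year, month, day


def weekday_name(jdn):
    """Day of the week for a Julian day number (JDN 0 fell on a Monday)."""
    return WEEKDAYS[jdn % 7]


jdn = to_jdn(1582, 10, 4)
print(weekday_name(jdn), from_jdn(jdn))          # Thursday (1582, 10, 4)
print(weekday_name(jdn + 1), from_jdn(jdn + 1))  # Friday (1582, 10, 15)
```

Because the stored value is a plain integer, date subtraction and addition need no special handling for the ten skipped days; only the conversion to and from a displayed calendar date has to know about the switch.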
By default, Madeline displays dates based on your computer's locale setting. If your computer is set to the "C" or
"POSIX" locale or any other non-UTF-8 locale, Madeline defaults to AmericanEnglish conventions
for displaying dates.
When you run Madeline under a UTF-8 locale, Madeline displays dates in the selected locale if possible. The program
evaluates the following environment variables in order of precedence: LC_ALL, LC_DATE, LC_CTYPE,
and LANG to determine how to display dates. Note that Madeline needs to be run in a
Unicode-enabled terminal emulator,
such as mlterm, for proper display of many languages.
You can also change the conventions used to display dates interactively using the set language command.
Examples are shown below:
edtrager@eyegene:~>LANG=fr_FR.UTF-8 madeline
...
+-----------------------+-----------+-----------------------------------------+
| OTHER SETTINGS        |           |                                         |
+-----------------------+-----------+-----------------------------------------+
| AutoExclude           | ON        |Exclude pedigrees automatically          |
| AutoCheckInheritance  | ON        |Check inheritance on OPEN                |
| ConsoleHighlights     | ON        |Use bold/color highlights on console     |
| Delimiter             | TAB       |Delimiter for tables and other output.   |
| FusionSupport         | OFF       |FUSION customizations disabled           |
| HaplotypeDisplay      | OFF       |Display genotypes delimited with "/"     |
| Language              | French    |Language convention used for date, time  |
| MapDetails            | OFF       |LIST MAP summary display                 |
| SaveAlleleFrequencies | OFF       |Calculate new frequencies on next OPEN   |
| Date                  |           |le lundi 22 décembre 2003                |
| Verbosity             | VERBOSE   |All messages are printed to the console  |
+-----------------------+-----------+-----------------------------------------+
M>? {1983.12.03}
{le samedi 3 décembre 1983}
M>set language to japanese
...
M>? {1983.12.03}
{1983年12月3日 (土曜日)}
M>
UTF-8-based locale settings read from environment variables are used for determining how to display dates. You can also
change the language settings interactively using the set language command.
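The precedence rule for the environment variables can be sketched as follows. This Python fragment is a hedged illustration of the documented order (LC_ALL, LC_DATE, LC_CTYPE, LANG) and of the fallback to AmericanEnglish for "C", "POSIX", and non-UTF-8 locales. The function name and the decision to return the bare locale name (for example `fr_FR`) are assumptions; how Madeline maps a locale name to one of its language settings is not described here.

```python
FALLBACK = "AmericanEnglish"
PRECEDENCE = ("LC_ALL", "LC_DATE", "LC_CTYPE", "LANG")


def select_date_locale(env):
    """Return the locale name used for date display, or the fallback."""
    for name in PRECEDENCE:
        value = env.get(name)
        if not value:
            continue  # unset or empty: try the next variable
        if value in ("C", "POSIX"):
            return FALLBACK
        locale_name, _, codeset = value.partition(".")
        codeset = codeset.split("@")[0]  # drop modifiers such as "@euro"
        if codeset.upper().replace("-", "") != "UTF8":
            return FALLBACK  # non-UTF-8 locale
        return locale_name
    return FALLBACK  # nothing set at all


print(select_date_locale({"LANG": "fr_FR.UTF-8"}))
print(select_date_locale({"LC_ALL": "ja_JP.UTF-8", "LANG": "fr_FR.UTF-8"}))
print(select_date_locale({"LANG": "C"}))
```

In a real session the mapping passed in would be the process environment (`os.environ` in Python); a plain dictionary is used here so the precedence can be exercised directly.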
Dates may be added and subtracted from one another, with the results being expressed in days.
Date data may be displayed on pedigree drawings. Dates may also be used in an
expression passed to a view command,
a draw command,
or to a subsetting command such as exclude, or to the
sort command (which sorts the order in which siblings appear
on a pedigree drawing).
Most statistical genetics programs for which Madeline provides formatted files as output do not support
date data. However, dates can be written to output files in Madeline's generic
formats. You will need to toggle date fields
on for output since they are toggled off by default.
Madeline supports entry of missing values from the command line, and also provides a simple mechanism for the user to define sets of values that should be mapped as missing values when data are read from files.
On the command line, Madeline provides the following ways to represent missing values:
#missing for missing numeric data
Protocols in scientific studies often require that missing values be coded to specify reasons for missingness. For example, a set of negative integers outside the range of a measured phenotype may be chosen to represent missing conditions, such as assay pending, no assay, no tube, or similar conditions that result in missing data.
To accommodate such conventions, Madeline permits the user to specify lists of
values that are to be treated as missing values. These lists of missing value
indicators are stored in two arrays. CharacterMissingValue[] is used
whenever character fields, including genotype fields, are referenced.
NumericMissingValue[] is used whenever numeric fields are referenced
(see table below). For expediency, these arrays can be referenced using
abbreviated names, cmv[] and nmv[], respectively.
There is currently no missing value array for dates.
Character and numeric missing value arrays in Madeline.
| Full Name | Abbreviated Name | Default Values |
|---|---|---|
| CharacterMissingValue[] | cmv[] | cmv[1] = "." |
| NumericMissingValue[] | nmv[] | nmv[1] = -9999 |
In ASCII and UTF-8 data files, a space-padded blank entry or a single dot (i.e., a period) in a character or numeric column is treated as a "native" missing value. (Therefore, the empty string "" and single dot "." need not be included in the missing value arrays). A typical Madeline-ready data file using single dots as missing value placeholders is shown below:
FAMID C STUDYID C SEX X FATHER C MOTHER C MZTWIN C DZTWIN C AFFECTED C AGE_DX N D9S247 G D9S325 G D9S462 G D9S1017 G D9S1321 G
L0012 M02448 F N00332 N00333 . . A 68 0/0 349/349 0/0 . 0/0
L0012 M05605 F N00334 N00335 . . A 63 244/252 349/353 157/167 240/244 234/238
L0012 M06039 F N00334 N00335 . . A 68 252/254 0/353 157/157 228/240 234/238
L0012 N00332 M . . . . I . . . . . .
L0012 N00333 F N00336 N00337 . . I . . . . . .
L0012 N00334 M N00336 N00337 . . I . . . . . .
L0012 N00335 F . . . . I . . . . . .
L0012 N00336 M . . . . I . . . . . .
L0012 N00337 F . . . . I . . . . . .
L0034 M02453 M N00167 M05758 . . A 48 242/244 0/0 . 232/244 234/242
L0034 M05758 F N00165 N00166 . . U 89 242/248 0/0 . 232/0 242/246
L0034 M05759 M N00167 M05758 . . A 45 0/256 0/0 . 232/240 238/242
L0034 M05856 M N00167 M05758 . . U 53 242/256 0/0 . 232/232 238/246
L0034 M05876 F N00167 M05758 . . U 59 244/248 0/0 . 240/244 234/242
L0034 N00165 M . . . . I . . . . . .
L0034 N00166 F . . . . I . . . . . .
L0034 N00167 M . . . . I . . . . . .
L0075 M02454 F N00207 N00208 . . A 61 252/254 355/357 157/169 . 234/236
L0075 M05526 F N00205 N00206 . . A 83 0/0 339/359 0/0 . 0/0
...
A typical Madeline-ready data file with single dots (periods) as missing value placeholders. Madeline automatically recognizes single dots and blank entries as missing values in ASCII and UTF-8 data files.
Note: To increase human and machine readability, we highly recommend using dots (periods) as missing-value column placeholders in data files, as shown in the example above.
When data are read from a file, all "native" missing values (blank and single dot entries) and any values that match the
values specified in Madeline's CharacterMissingValue[] or
NumericMissingValue[] arrays are
treated as missing values by Madeline. When data are written back out to files, missing values are automatically translated
according to the conventions required by each file format. For example, when Madeline is used to create a file in the Linkage
format, the digit zero (0) is used to represent missing values.
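The reading rule described above (native missing values plus anything listed in the missing value arrays) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Madeline's code: the function names are hypothetical, and the `cmv` list below is an abbreviated copy of the character defaults listed in this section.

```python
NATIVE_MISSING = {"", "."}  # blank and single-dot entries are always missing


def is_missing_character(raw, cmv):
    """True if a character or genotype cell should be read as missing."""
    value = raw.strip(" ")  # leading and trailing spaces are trimmed on input
    return value in NATIVE_MISSING or value in cmv


def is_missing_numeric(raw, nmv):
    """True if a numeric cell should be read as missing."""
    value = raw.strip(" ")
    if value in NATIVE_MISSING:
        return True
    return float(value) in nmv


cmv = [".", "/", "0/0", "0/ 0"]   # abbreviated character defaults
nmv = [-9999.0]                   # numeric default

print(is_missing_character("0", cmv))   # False: "0" is not a default indicator
cmv.append("0")                         # the analogue of cmv[6]="0"
print(is_missing_character("0", cmv))   # True once the indicator is added
print(is_missing_numeric("-9999", nmv)) # True
```

The example also shows why indicators must be assigned before a table is read: the check is applied cell by cell at read time, so values added afterwards cannot affect data that have already been loaded.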
At startup, CharacterMissingValue[] contains a set of default missing
value indicators appropriate for most character and genotype data. NumericMissingValue[]
contains the single missing value of -9999 by default. It is the user's responsibility to check whether these
defaults are appropriate for the data at hand and to make adjustments as required before attempting to open or
load data files.
New values can be assigned to existing cells or appended to the end of these lists as required:
M>list cmv                  (view CharacterMissingValue array)
CMV has 5 elements:
CMV[ 1]="."
CMV[ 2]="/"
CMV[ 3]="0/0"
CMV[ 4]="0/ 0"
CMV[ 5]="0/ 0"
M>cmv[6]="./."              (append new value to the end of the list)
M>list cmv
CMV has 6 elements:
CMV[ 1]="."
CMV[ 2]="/"
CMV[ 3]="0/0"
CMV[ 4]="0/ 0"
CMV[ 5]="0/ 0"
CMV[ 6]="./."
M>list nmv                  (view NumericMissingValue array)
NMV has 1 element:
NMV[ 1]= -9999
M>nmv[1]=-1                 (overwrite one value)
M>nmv[2]=-9                 (and append another value)
M>list nmv
NMV has 2 elements:
NMV[ 1]= -1
NMV[ 2]= -9
M>
Assigning missing value indicators. Missing value indicators may be assigned to existing cells or appended to the ends of Madeline's character and numeric missing value lists.
Assignments should be done before a data table
is opened so that the values will be recognized appropriately.
The initial.script script file is an appropriate place to set character and
numeric missing value indicator defaults.
Upon opening a pedigree table, Madeline assigns each field to one of three categories: core ("C"), genotype ("G"), or phenotype ("P").
When a field is completely empty or contains only missing values,
Madeline assigns the field to a null category represented by an asterisk, "*".
When required, Madeline allows the user to designate a subset of "P" phenotype fields
as "V" covariate fields using the toggle command. Madeline does not
automatically assign fields to the "V" covariate category.
Field categories are summarized in the table below and described in greater depth
in the sections that follow.
Field Categories in Madeline.
| Category | Symbolic Designation | Description |
|---|---|---|
| Core | C | Set of five required fields like GenderField that must be present in all pedigree tables, plus additional optional fields, like AffectionStatusField, that are not required by default but may be required for some operations. |
| Genotype | G | Character fields containing two numeric labels separated by a forward slash character representing allele calls, e.g., "141/142" |
| Phenotype | P | Character, numeric, or date fields that contain categorical or continuous phenotype information. |
| Covariate | V | A subset of phenotype fields that are to be used as covariates. The user must use the toggle command to change the designation of a "P" field to "V". |
| Null | * | Character, numeric, or date fields that are completely empty or contain only missing value indicators. These fields cannot be operated upon. |
Core "C" data fields provide key information about an individual (see table below).
Madeline identifies core fields by their names (in contrast, "G" and "P"
fields are distinguished by scanning the data in the table). These names are stored in
variables whose values may be reassigned by the user.
In conformance with the requirements of the supported legacy database types, field names must be capitalized, and cannot exceed 10 letters in length. When assigning names to the field variables, Madeline will automatically capitalize and truncate non-conforming names, and will issue warning messages to the user.
Note: Limitations on field name length will likely be relaxed in the next release of the program, when support for the SAS and dBase/xBase file formats is removed.
Core data fields are either required or optional. The absence of one or more of the five required core fields will generate an error when a pedigree table is opened.
Optional core fields may be required for some operations,
but are not required by default.
Madeline makes use of the additional information provided in optional
core fields whenever they are present. In particular, Madeline's pedigree drawing
functionality is greatly enhanced by the presence of optional core fields such as
AffectionStatusField and MZTwinField, among others.
Core fields representing categorical attributes of an individual, such as the GenderField
and AffectionStatusField have corresponding associative arrays for mapping
user data codes to Madeline internal codes.
Core Data Fields in Madeline.
| Variable Name | Description | Default Value | Allowed Field Types | Associative Array |
|---|---|---|---|---|
| Required Core Fields | | | | |
| 1. IndividualIdField | Individual identifier | "STUDYID" | Character only | n/a |
| 2. FatherIdField | Father's identifier | "FATHER" | Character only | n/a |
| 3. MotherIdField | Mother's identifier | "MOTHER" | Character only | n/a |
| 4. GenderField | Gender | "SEX" | Character or Numeric | GenderStatus[] |
| 5. FamilyIdField | Family identifier | "FAMID" | Character only | n/a |
| Optional Core Fields | | | | |
| 6. AffectionStatusField | Affection status | "AFFECTED" | Character or Numeric | AffectionStatus[] |
| 7. DeathStatusField | Death status | "DECEASED" | Character or Numeric | DeathStatus[] |
| 8. ProbandField | Index case or proband indicator | "PROBAND" | Character or Numeric | ProbandStatus[] |
| 9. LiabilityClassField | Liability class | "LCLASS" | Numeric or Character | LiabilityClass[] |
| 10. MZTwinField | Monozygotic twin status indicator | "TWIN" | Character only | n/a |
| 11. DZTwinField | Dizygotic twin status indicator | "DZTWIN" | Character only | n/a |
| 12. DateOfBirthField | Date of birth | "DOB" | Date only | n/a |
| 13. DateOfDeathField | Date of death | "DOD" | Date only | n/a |
It is extremely easy to tell Madeline how to translate coded information stored in your data files into values that the program knows about and can process.
Coded values in the
GenderField,
AffectionStatusField,
DeathStatusField,
ProbandField, and
LiabilityClassField
are mapped to Madeline constants using the set of
associative arrays shown in the table in the preceding section. Madeline provides default mappings that
are appropriate for reading many character-coded and Linkage-coded data tables.
For example, the GenderStatus[] array contains
the following values by default:
M>list GenderStatus
GenderStatus has 6 elements:
GENDERSTATUS[ 1 ]=0      (zero is defined as male in Madeline)
GENDERSTATUS[ 2 ]=1      (one is defined as female in Madeline)
GENDERSTATUS["F"]=1
GENDERSTATUS["M"]=0
GENDERSTATUS["♀"]=1      (For animal studies; requires UTF-8 data files in a UTF-8 locale.)
GENDERSTATUS["♂"]=0      (For animal studies; requires UTF-8 data files in a UTF-8 locale.)
These defaults are equivalent to issuing the following sequence of
map commands. Note that Madeline constants are prefixed by
the hash sign (#):
M>map GenderStatus 1 as #male
M>map GenderStatus 2 as #female
M>map GenderStatus "F" as #female
M>map GenderStatus "M" as #male
M>map GenderStatus "♀" as #female
M>map GenderStatus "♂" as #male
Note how the mapping of 1 as #male and
2 as #female is appropriate for reading data coded
according to Linkage file format conventions. The mappings of ♀ (Unicode
U+2640)
and ♂ (Unicode U+2642) are appropriate for animal studies.
Assignments can be made to the associative arrays directly without using the
map command.
The following assignment statements replicate the default mappings for the
AffectionStatus[] array. The first
three assignments allow Madeline to process files coded using the
Linkage format conventions. The remaining three assignments
support the processing of a substantially more intuitive coding convention
that we prefer:
M>// Assignments to support the Linkage/Genehunter format:
M>AffectionStatus[ 0 ]=#missing
M>AffectionStatus[ 1 ]=#unaffected
M>AffectionStatus[ 2 ]=#affected
M>// Assignments to support a substantially more intuitive coding convention:
M>AffectionStatus["A"]=#affected
M>AffectionStatus["I"]=#missing
M>AffectionStatus["U"]=#unaffected
The mappings shown above for GenderStatus[] and
AffectionStatus[] are the default mappings present when
you start a Madeline session. Codes not present in these associative arrays
will be mapped to #missing by default. If your codes
match these codes, then you don't need to do anything. If your codes differ from the defaults, then
you will need to provide the correct mappings in the initial.script,
in a batch file, or on the command line. For example, if your pedigree table used the capitalized words
"MALE" and "FEMALE" to indicate
males and females respectively, then you would want to execute the following:
M>map GenderStatus "FEMALE" as #female
M>map GenderStatus "MALE" as #male
For the default values in all associative arrays, see Table 4.4.
Note:
To ensure that Madeline recognizes values in core fields correctly,
assignment of values in associative arrays that affect the interpretation of core fields
must be made before any open
or load command.
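Conceptually, each associative array is a lookup table from user codes to Madeline constants, with unknown codes falling through to #missing. The Python sketch below models that behavior with a dictionary. It is illustrative only: the names are hypothetical, the values 0 and 1 follow the constants described in this documentation, and representing #missing as `None` is an assumption of the sketch.

```python
MALE, FEMALE = 0, 1   # values of #male and #female as documented
MISSING = None        # stand-in for #missing (assumed representation)

# Default GenderStatus[] mappings as listed in this section.
gender_status = {
    1: MALE, 2: FEMALE,        # Linkage-style numeric codes
    "F": FEMALE, "M": MALE,    # character codes
    "♀": FEMALE, "♂": MALE,    # symbols used in animal studies
}


def map_code(table, code):
    """Translate a user code; codes not in the table map to missing."""
    return table.get(code, MISSING)


print(map_code(gender_status, "M"))       # 0
print(map_code(gender_status, "MALE"))    # None: not mapped by default

# The analogue of: map GenderStatus "MALE" as #male
gender_status["MALE"] = MALE
gender_status["FEMALE"] = FEMALE
print(map_code(gender_status, "MALE"))    # 0
```

The second lookup shows what happens when custom codes are not mapped before the table is opened: every "MALE" and "FEMALE" entry would be read as a missing gender.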
Different databases impose different restrictions on the length and format of field names. Up to 10 characters can be used for field names in an xbase file, but only up to 8 characters in a SAS transport file. Although Madeline now supports several different file formats, the program originally only supported the xbase file format. As a result of this legacy, Madeline restricts field name identifiers as follows:
Here is an example:
M>AffectionStatusField="AffectionStatus"
Field name assignment has been truncated and capitalized to "AFFECTIONS".
M>
Field names are restricted to capitalized labels of 10 or fewer characters in length.
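The capitalize-and-truncate rule is simple enough to state in code. The following Python sketch reproduces the example above; the function name and the `max_len` parameter are assumptions made for the illustration, not part of Madeline.

```python
def normalize_field_name(name, max_len=10):
    """Capitalize and truncate a field name; report whether it was changed."""
    normalized = name.upper()[:max_len]
    return normalized, normalized != name


print(normalize_field_name("AffectionStatus"))   # ('AFFECTIONS', True)
print(normalize_field_name("SEX"))               # ('SEX', False)

# A stricter limit, such as the 8 characters allowed in a SAS transport
# file, is not applied automatically and would have to be requested:
print(normalize_field_name("AffectionStatus", max_len=8))  # ('AFFECTIO', True)
```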
Note: Madeline will warn you if you try to assign a name with embedded spaces or control characters to a field name variable. However, the program does not actively check for all possible errors in field identifiers. This is the user's responsibility. Madeline also has no way of knowing in advance what type of database file will be opened. For example, the program will not notice if you enter a ten-letter name for use with a SAS transport file that permits only 8-letter field identifiers.
Note: Support for legacy xbase and SAS transport formats may be removed in the next version of the program. Field name limitations would then become less restrictive.
The value in FamilyIDField tells Madeline the name of the
family ID field to look for in a pedigree table.
The default value is "FAMID".
The values in IndividualIDField,
FatherIDField,
and MotherIDField
identify the individual and parent identifier fields for Madeline to look for in a pedigree table.
The default values are "STUDYID", "FATHER", and "MOTHER", respectively.
Note: We recognize that the default value of
IndividualIDField as STUDYID is not a good choice. The default will very
likely become INDIVIDUALID in the next version of the program.
Parent IDs should be present in both the FatherIDField and
MotherIDField of all non-founder individuals.
The program interprets any individual with missing value indicators for
both parents as a founder.
In the event that one of the two parent identifiers is missing for
an individual or individuals in a sibship, Madeline automatically generates a random
eight-letter identifier to represent the missing parent. The randomly-generated IDs
begin and end with exclamation marks to distinguish them from regular IDs. Using
the generated ID, Madeline constructs a virtual parent in memory who will appear on
pedigree drawings (figure below) and in output generated by
the write command.
Madeline assumes that all sibs who share the one identified parent are full sibs,
that is, that they also share the constructed virtual parent.
Virtual constructed parent in Madeline. A virtual parent with a randomly-generated ID (male on the right) is constructed when the ID of one parent is missing among a sibship of individuals (not shown). Sibs are assumed to be full sibs.
Note: Lack of one parent usually indicates that a data set has not yet been thoroughly examined for errors or missing data. Unlike other programs, Madeline tolerates certain types of missingness and errors. This enhances the program's utility as a proofing tool. However, in the end you still have to fix your errors ;-).
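The construction of virtual parents can be sketched as follows. This Python fragment is a hedged illustration, not Madeline's algorithm: the function names and record layout are hypothetical, and it assumes that the eight characters of a generated ID include the two exclamation marks, which the description above leaves open.

```python
import random
import string


def make_virtual_id(rng, length=8):
    """Random ID that begins and ends with '!' (total length is an assumption)."""
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(length - 2))
    return f"!{letters}!"


def add_virtual_parents(records, rng):
    """Fill in a single missing parent; sibs with the same known parent share it."""
    generated = {}  # (missing role, known parent ID) -> generated ID
    for rec in records:
        father, mother = rec["father"], rec["mother"]
        if (father is None) == (mother is None):
            continue  # founder (both missing) or both parents known
        role, known = ("father", mother) if father is None else ("mother", father)
        if (role, known) not in generated:
            generated[(role, known)] = make_virtual_id(rng)
        rec[role] = generated[(role, known)]
    return records


rng = random.Random(0)
sibship = [
    {"id": "K1", "father": None, "mother": "K9"},
    {"id": "K2", "father": None, "mother": "K9"},
    {"id": "K9", "father": None, "mother": None},   # founder, left untouched
]
add_virtual_parents(sibship, rng)
for rec in sibship:
    print(rec)
```

Keying the generated IDs on the known parent is what implements the full-sib assumption: both children of K9 receive the same constructed father.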
The default value for GenderField is "SEX". The
GenderField can
be either numeric or character. Madeline detects the field type
when the pedigree table is opened. Madeline defines two symbolic constants for gender:
#female which has a value of 1
#male which has a value of 0
The GenderStatus[]
array is used to map external gender codes to Madeline's internal gender constants, #male
and #female, as described above under
Interpretation of Core Data.
Only terminal individuals without offspring may retain a gender attribute
of #missing. During pedigree reconstruction, if Madeline detects any father or
mother with a missing gender attribute, the program will automatically change the
gender of the individual in memory to be consistent with the reconstruction, and will
warn the user of the change (example below). The database file on disk will not be changed.
Madeline will also automatically correct the gender attribute of mislabeled individuals in memory, for example, of a male listed as a mother, or of a female listed as a father (example below), to the extent that these changes still result in logical consistency. Madeline always warns the user of these types of data errors. Again, the data file on disk will not be changed; that is the user's responsibility.
M>open "family.mfh"
...
ConnectIndividual(): Gender in database is incomplete:
Gender of G-10-162's mother, G-10-159, changed from MISSING to FEMALE
ConnectIndividual(): Gender in database is incorrect:
Gender of G-15-012's father, G-15-003, changed from FEMALE to MALE
13 WARNINGS, 11 SEVERE WARNINGS
M>
Inconsistencies in Gender. During pedigree reconstruction, Madeline automatically corrects inconsistencies in gender in the data set (as long as such changes do not violate the logical consistency of the reconstruction) and warns the user.
Madeline will warn the user and terminate if conflicting and unresolvable gender roles exist for an individual (for example if an individual is listed as both a mother and a father in the data set).
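The gender checks described in this section amount to comparing each person's recorded gender with the parental roles in which that person appears. The Python sketch below illustrates the logic under simplifying assumptions; the function name, the data layout, and the warning text are hypothetical and do not reproduce Madeline's messages or internals.

```python
def reconcile_genders(gender, links):
    """Return a corrected copy of `gender` plus warnings.

    gender: dict mapping ID -> "MALE", "FEMALE", or None (missing)
    links:  iterable of (child ID, father ID, mother ID); parent IDs may be None
    """
    implied = {}  # ID -> set of genders implied by the person's parental roles
    for _child, father, mother in links:
        if father is not None:
            implied.setdefault(father, set()).add("MALE")
        if mother is not None:
            implied.setdefault(mother, set()).add("FEMALE")

    corrected = dict(gender)  # work on a copy; the source data are not changed
    warnings = []
    for person, roles in implied.items():
        if len(roles) > 1:
            # Listed as both a father and a mother: cannot be resolved.
            raise ValueError(f"{person} has conflicting gender roles")
        (role,) = roles
        recorded = corrected.get(person)
        if recorded != role:
            warnings.append(
                f"Gender of {person} changed from {recorded or 'MISSING'} to {role}"
            )
            corrected[person] = role
    return corrected, warnings


gender = {"G-10-159": None, "G-15-003": "FEMALE", "G-10-162": "MALE"}
links = [("G-10-162", None, "G-10-159"), ("G-15-012", "G-15-003", None)]
corrected, warnings = reconcile_genders(gender, links)
print("\n".join(warnings))
```

Terminal individuals never appear as a parent in `links`, so their gender, including a missing one, is left as recorded.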
Note: We recommend coding the GenderField as a
character field using conventional codes such as "M" and "F", or "male"
and "female".
Not only does this enhance human interpretability of the raw data files, it also enhances Madeline's ability to
automatically identify columns when the recognize command is used.
Numerically-coded fields, such as those in Linkage/Genehunter files,
generally cause unnecessary confusion and introduce a greater potential for errors.
The MZTwinField should remain blank (or use a single dot) for non-twins, and should
contain a single-letter identifier for each twin pair or group of monozygotic siblings. For example, "A" can
be used to designate the first twin pair in a family, "B" the second pair, and so on.
Since version 0.90 of the program, the MZTwinField has been
considered an optional core field.
The optional DZTwinField, used to show dizygotic twins on pedigree drawings,
should be coded in the same manner to designate dizygotic twins.
The AffectionStatusField
may be either character or numeric. Madeline defines
two symbolic constants for describing the affection status of sampled individuals:
#unaffected which has a value of 0
#affected which has a value of 1
Madeline provides the AffectionStatus[] associative array
for mapping affection status codes.
Note:
Coding the AffectionStatusField as a character
field using mnemonic codes is recommended to enhance interpretability of the data in the absence of
additional metadata. Numeric fields tend to cause confusion and may increase the potential for
human error.
The optional DeathStatusField may be either character
or numeric. The default value of DeathStatusField is "DECEASED".
Madeline defines the constants #alive, with a value of 0,
and #dead,
with a value of 1.
The DeathStatus[]
associative array contains a set of defaults for mapping the
DeathStatusField. This is shown below:
M>
M>?DeathStatusField
"DECEASED"
M>?#alive
0
M>?#dead
1
M>list DeathStatus
DeathStatus has 4 elements:
DeathStatus[0]=0
DeathStatus[1]=1
DeathStatus["N"]=0
DeathStatus["Y"]=1
M>
The DeathStatusField, DeathStatus array, and #alive and #dead constants.
Note:
Coding the DeathStatusField as a character
field using mnemonic codes is recommended to enhance interpretability of the data in the absence of additional metadata.
Consider how coding a column using "L" for "living" and "D" for "deceased" is more meaningful than
using "1" for "living" and "0" for "deceased",
especially when you realize that Madeline's default encoding is exactly the opposite, with
"1" being "deceased" and
"0" being "living"!
The optional ProbandField must be numeric. Madeline assumes that the probands
or index cases will be coded using a value of 1, and all other individuals
with a value of 0.
Some output formats, such as Genehunter, have the option of including liability class
information. The LiabilityClassField may be numeric or character.
Madeline does not interpret the values in this field, but simply passes them on directly. This means that if a program like
Genehunter requires a numeric encoding of liability classes, you must ensure that the source
data are encoded numerically in a conforming manner.
The DateOfBirthField and
DateOfDeathField are optional core
date fields.
When these fields are present, Madeline checks that the dates are
reasonable and looks for twins, based on date of birth, who have not been
designated as such in the MZTwinField or DZTwinField.
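The idea behind the twin check can be sketched in Python as follows. This is a guess at the general approach, not Madeline's actual algorithm: it assumes that possible twins are individuals who share both parents and a date of birth, and the record keys ("id", "father", "mother", "dob", "twin") are hypothetical names chosen for the example.

```python
from collections import defaultdict


def undesignated_twins(records):
    """Return ids of individuals who share parents and birth date with a
    sibling but carry no twin designation (illustrative sketch only)."""
    groups = defaultdict(list)
    for rec in records:
        # Founders and individuals without a birth date cannot be compared.
        if rec["father"] and rec["mother"] and rec["dob"]:
            groups[(rec["father"], rec["mother"], rec["dob"])].append(rec)

    flagged = []
    for siblings in groups.values():
        if len(siblings) > 1:
            # Report only those siblings whose twin field is empty.
            flagged.extend(s["id"] for s in siblings if not s["twin"])
    return flagged
```

Any individual returned by such a check would be a candidate for review, since a shared birth date may also reflect a data entry error rather than an unrecorded twin pair.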
Genotype "G" data are character fields that contain allelic marker data
separated by the forward slash "/" character. The allele labels themselves
must be numeric, non-alphabetic labels, e.g. "1/2" or "141/142".
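As an illustration of this format, the Python sketch below splits a genotype value into its two allele labels and rejects non-numeric labels. The function name parse_genotype is hypothetical, and the coding of missing genotypes is not covered here, so the sketch treats anything other than two numeric labels as an error.

```python
def parse_genotype(value):
    """Split a genotype such as "141/142" into a pair of allele labels."""
    parts = value.split("/")
    if len(parts) != 2:
        raise ValueError(f"expected two alleles separated by '/': {value!r}")
    for label in parts:
        # Allele labels must be numeric, for example "1" or "141".
        if not (label.isascii() and label.isdigit()):
            raise ValueError(f"non-numeric allele label: {label!r}")
    return parts[0], parts[1]
```

For example, parse_genotype("141/142") returns the pair ("141", "142"), whereas a value such as "A/B" raises a ValueError.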
The names of genotype fields should be the capitalized names of the markers themselves.
This allows Madeline to automatically place the genotype fields into map order
whenever a map database for the markers is loaded using the
load command.
Make sure that marker names in the map table are capitalized to correspond
with the required capitalization of field names.
When a database is opened, Madeline automatically estimates allele frequencies for all genotype fields using gene counting, ignoring family relationships. Allele frequencies are estimated from all records in a database.
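Gene counting that ignores family relationships amounts to tallying every allele observed across all records and dividing by the total. The Python sketch below illustrates that arithmetic for a single genotype field. It is not Madeline's implementation, the function name is hypothetical, and it assumes that missing genotypes have already been removed from the input.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Estimate allele frequencies by simple counting over all records."""
    counts = Counter()
    for genotype in genotypes:
        first, second = genotype.split("/")
        # Each individual contributes two alleles, related or not.
        counts[first] += 1
        counts[second] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}
```

Because relatives share alleles by descent, estimates obtained this way treat related individuals as if they were independent observations.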
Allele frequencies calculated from one pedigree table may be saved out
using the save command. A table of allele
frequency information can subsequently be read into Madeline using the
read command. The format of the allele
frequencies table is nearly identical to the format used by Mendel v. 4.1.
You need only modify the header at the top of an allele frequency table to
conform with the Mendel program convention.
Phenotype "P" fields are any remaining fields that are not core "C"
or genotype "G" fields.
Phenotype fields may be character, numeric, or date fields,
and are assumed to contain categorical or continuous phenotype information.
Because date fields cannot be written to output from the
write command,
date fields are the only type of phenotype field not flagged for output
when a pedigree table is opened.
For some types of output, it may be necessary to designate certain phenotype fields as
representing covariates. Madeline therefore maintains a separate covariate or
"V" field category which is a subset of the "P" category.
Covariate fields are automatically recognized as phenotype
fields when writing any format that does not distinguish between phenotype and
covariate fields. "P" fields can be marked as "V" fields using
the toggle command.
When a pedigree table is opened, most core "C" fields, all genotype "G" fields,
and all phenotype "P" fields (except date fields), are flagged,
or toggled on, for output by default. Madeline indicates which fields
in a database are toggled for output by placing the letter "o" after the
category indicator "C", "G", or "P" (example below).
A number after the "o" indicates the order in which fields will
appear in pedigree drawings and in output from the write command.
Fields may be manually reordered using the set field order command.
M>list fields
1.FAMID Co__1 20.D20S482 Go__6 39.D20S96 Go_25
2.STUDYID Co__2 21.D20S849 Go__7 40.D20S119 Go_26
3.SEX Co__3 22.D20S905 Go__8 41.D20S481 Go_27
4.FATHER Co__4 23.D20S846 Go__9 42.D20S836 Go_28
5.MOTHER Co__5 24.D20S892 Go_10 43.D20S888 Go_29
6.TWIN Co__6 25.D20S115 Go_11 44.D20S886 Go_30
7.AFFECTED Co__7+ 26.D20S851 Go_12 45.D20S197 Go_31
8.BMI Po__1 27.D20S917 Go_13 46.D20S178N Go_32
9.INS_FAST Po__2 28.D20S894 Go_14 47.D20S866 Go_33
10.INS_2H Po__3 29.D20S189 Go_15 48.D20S196 Go_34
11.BW_REAL Po__4 30.D20S898 Go_16 49.D20S857 Go_35
12.GLU_FAST Po__5 31.D20S114 Go_17 50.D20S480 Go_36
13.GLU_2H Po__6 32.D20S912 Go_18 51.D20S211 Go_37
14.GAD_DUP Po__7 33.D20S477 Go_19 52.D20S840 Go_38
15.D20S103 Go__1 34.D20S874 Go_20 53.D20S120 Go_39
16.D20S117 Go__2 35.D20S195 Go_21 54.D20S100 Go_40
17.D20S906 Go__3 36.D20S909 Go_22 55.D20S102 Go_41
18.D20S193 Go__4 37.D20S107 Go_23 56.D20S171 Go_42
19.D20S889 Go__5 38.D20S170 Go_24 57.D20S173 Go_43
M>
Field Categorization and Ordering in Madeline. Core "C" fields
are detected by name. Genotype "G" fields are detected by scanning the data.
All remaining fields are assumed to be phenotype "P" fields. Fields
are ordered for output separately within each of the three groups, "C",
"G", and "P".
The plus "+"
sign after AFFECTED indicates that Madeline has detected this field as
the AffectionStatusField: categorical levels of this field
will be used to color icon symbols on pedigree drawings.
A field listing is shown when a pedigree table is first opened
or at any other time using the list fields
command.
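To make the notation in the field listing concrete, the Python sketch below decodes an indicator such as "Co__7+" or "Go_25" into its parts. It is based only on the example listing above: the underscores are assumed to be padding, and the way Madeline displays a field that is toggled off is not shown in this section, so the handling of a missing "o" or a missing number is a guess. The name parse_flag is hypothetical.

```python
import re

# Category letter, optional "o", padding underscores, optional order number,
# optional "+" marking the AffectionStatusField.
FLAG = re.compile(
    r"^(?P<category>[CGPV])(?P<output>o?)_*(?P<order>\d*)(?P<affection>\+?)$"
)


def parse_flag(text):
    """Decode a field indicator from a field listing (illustrative sketch)."""
    match = FLAG.match(text)
    if match is None:
        raise ValueError(f"unrecognized field indicator: {text!r}")
    order = match["order"]
    return {
        "category": match["category"],
        "output": match["output"] == "o",
        "order": int(order) if order else None,
        "affection_status": match["affection"] == "+",
    }
```

Applied to "Co__7+", the sketch reports a core field, toggled on for output, seventh in its group, and detected as the AffectionStatusField.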
The order of genotype fields is automatically set to map order when
a marker map database is loaded using the load command.
Load can be issued either before (the preferred method) or after an
open command. Genotype fields whose names
match the names of markers in the map database will be set to the map order.
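The effect of loading a map on genotype field order can be sketched in Python as a sort keyed on chromosome and ordinal position. This is an illustration, not Madeline's code. The function name is hypothetical, and placing unmatched fields after the matched ones in their original order is an assumption, since this section does not say where such fields end up.

```python
def order_by_map(field_names, marker_map):
    """Reorder genotype field names by (chromosome, ordinal) from a map.

    marker_map maps a marker name to a (chromosome, ordinal) tuple.
    """
    mapped = [name for name in field_names if name in marker_map]
    # Assumption: fields without a map entry keep their relative order
    # and follow the mapped fields.
    unmapped = [name for name in field_names if name not in marker_map]
    mapped.sort(key=lambda name: marker_map[name])
    return mapped + unmapped


# Usage with three chromosome 7 markers and their ordinal positions.
marker_map = {"035XB9": (7, 1), "GATA24F03": (7, 2), "GATA61G06": (7, 3)}
ordered = order_by_map(["GATA61G06", "UNKNOWN1", "035XB9", "GATA24F03"], marker_map)
```

Note that the lookup is an exact string match, which is why the capitalization of field names and marker names must agree.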
Fields toggled on for output are displayed in pedigree drawings
created with the draw command.
When a write command is executed, the set of core "C" fields required by the
specific format being produced will generally be output regardless of the on/off
output flag status. For example, Madeline will output the GenderField even if
you toggle it off because it is required for almost all output formats.
This behavior is required to ensure proper file construction. Genotype "Go"
fields toggled for output will be written, along with phenotype "Po"
(and possibly covariate "Vo") fields toggled for output if
the analysis format supports phenotype fields. Some analysis programs, such as
Genehunter and Siblink, do not use phenotype data beyond affection status
(which is a core field).
Fields may be toggled on or off for output using the
toggle command.
Madeline makes use of marker map information to
order genotype "Go" fields for output.
The load command is used to load a table containing
genetic maps for one or more chromosomes. A genetic map table may contain only one map for each
chromosome. At a minimum, the map table must have columns specifying the
chromosome, rank or ordinal position of the marker within the
map for a given chromosome, name of the marker, and the position of the
marker in centiMorgans:
Minimum Required Fields in a Map Table
| Variable For Storing Field Name | Default Value | Description |
|---|---|---|
| ChromosomeField | "CHROMOSOME" | Numeric field storing the chromosome number. |
| OrdinalField | "ORDINAL" | Numeric field storing the ordinal position or rank of the marker on the map for this chromosome. |
| MarkerField | "MARKERNAME" | Character field storing the name of the marker. |
| PositionField | "POSITION" | Numeric field storing the map position from the p terminus in centiMorgans. |
Additional columns for sex-specific maps may also be present: see the load
command for details.
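Before loading a map table, it can be useful to confirm that the four minimum columns are present. The Python sketch below performs that check against the default field names from the table above. It is a stand-alone illustration rather than part of Madeline, the function name is hypothetical, and it assumes the column names are compared exactly as written.

```python
# Default names of the minimum required map table fields.
REQUIRED_MAP_FIELDS = {
    "ChromosomeField": "CHROMOSOME",
    "OrdinalField": "ORDINAL",
    "MarkerField": "MARKERNAME",
    "PositionField": "POSITION",
}


def missing_map_columns(header):
    """Return the required column names absent from a map table header."""
    present = set(header)
    return [name for name in REQUIRED_MAP_FIELDS.values() if name not in present]
```

If any of the field name variables has been changed from its default, the corresponding entry in REQUIRED_MAP_FIELDS would need to be changed to match.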
A map may be viewed using the list map command:
M>load 'marshfield.map.mfh'
Marker maps based on marshfield.map.mfh are now installed.
M>list map for chromosome 7
                  Map Position (Kosambi cM)
                  -----------------------------
Ch Or Marker Name  Sex-avg.    Female      Male
-- -- ----------- --------- --------- ---------
 7  1 035XB9         0.0000         .         .
 7  2 GATA24F03      0.0001         .         .
 7  3 GATA61G06      3.7001         .         .
 7  4 TATC010        6.4001         .         .
 7  5 GATA119B03    10.0001         .         .
 7  6 TATT019       17.3001         .         .
 7  7 GATA137H02N   22.0001         .         .
 7  8 GATA41G07     26.0001         .         .
 7  9 GATA137A12    30.5001         .         .
 7 10 GGAA3F06      35.0001         .         .
 7 11 AGAT103       37.7001         .         .
 7 12 GATA13G11     43.0001         .         .
 7 13 GATA026       48.8001         .         .
 7 14 GATA31A10     51.0001         .         .
 7 15 ATA31F09      55.1001         .         .
 7 16 TAT028        57.3001         .         .
 7 17 GATA24D12     63.0001         .         .
 7 18 GATA4E04      65.8001         .         .
 7 19 GATA118G10    72.0001         .         .
 7 20 GATA21D12     77.0001         .         .
 7 21 GATA73D10N    84.0001         .         .
 7 22 GATA87D11     88.4001         .         .
 7 23 GATA3F01      91.0001         .         .
 7 24 ATA78C09NZ    96.6001         .         .
 7 25 GATA5D08     102.0001         .         .
 7 26 GATA23F05    107.0001         .         .
 7 27 ATAC037      112.9001         .         .
 7 28 TTTA001      118.1001         .         .
 7 29 AGAT133      119.1001         .         .
 7 30 GGAA6D03N    121.0001         .         .
 7 31 ATA55A05     123.2001         .         .
 7 32 GATA145G10   127.6001         .         .
 7 33 GATA43C11    130.0001         .         .
 7 34 GATA63F08    143.0001         .         .
 7 35 GATA32C12    143.0002         .         .
 7 36 GATA104      148.0002         .         .
 7 37 AGAT049      150.8002         .         .
 7 38 GATA189C06   156.0002         .         .
 7 39 TATG002      161.0002         .         .
 7 40 GATA30D09N   167.0002         .         .
 7 41 MFD442-GTTT  171.6002         .         .
M>
Loading and viewing marker maps. A map table is
loaded using the load command. The
list map command is used to print a table
showing marker name, chromosome, mapped order, and position
in centiMorgans.
Madeline produces three types of log files (table below). The first is a
summary file that has a ".log" extension by default and records each command that
was entered and a summary of execution results. The second is a
detail file that has a ".dtl" extension by default. It provides details of command
results, such as which pedigrees and individuals were included or excluded and
why. The third log file is an error log that has a ".err"
extension by default. It records warning and error conditions that occur.
Three Types of Log Files
| Type of File | Default Name | Purpose |
|---|---|---|
| Summary | madeline.log | Records commands and summaries of execution results. |
| Detail | madeline.dtl | Records details regarding inclusion and exclusion of individuals and pedigrees. |
| Error | madeline.err | Records warning and error conditions. |
You can change the names of the log files individually or en masse, as shown below:
M>?LogFile "madeline.log" M>LogFile="MyLogFile.log" LogFile has been changed from "madeline.log" to "MyLogFile.log" M>DetailFile="MyDetailLogFile.dtl" DetailFile has been changed fro