SO, WHAT IS SMILES??
Simplified Molecular-Input Line-Entry System
ITS FUNCTION??
- describing the structure of chemical molecules using short ASCII strings in the form of line notation
- SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.
ITS FOUNDER??
ASSISTS BY??
- Gilman Veith and Rose Russo (USEPA) and Albert Leo and Corwin Hansch (Pomona College)
- Arthur Weininger (Pomona; Daylight CIS) and Jeremy Scofield (Cedar River Software, Renton, WA) for assistance in programming the system
PROJECT'S FUNDER??
MODIFIED BY??
- Daylight Chemical Information Systems Inc
- Blue Obelisk (2007) - "OpenSMILES"
- Wiswesser Line Notation (WLN)
- ROSDAL
- SLN (Tripos Inc)
- IUPAC - InChI
Here are some examples of SMILES notation:
SMILES | NAME |
---|---|
CC | ethane |
O=C=O | carbon dioxide |
C#N | hydrogen cyanide |
CCN(CC)CC | triethylamine |
CC(=O)O | acetic acid |
C1CCCCC1 | cyclohexane |
c1ccccc1 | benzene |
SMILES Specification Rules
SMILES notation consists of a series of characters containing no spaces. Hydrogen atoms may be omitted (hydrogen-suppressed graphs) or included (hydrogen-complete graphs). Aromatic structures may be specified directly or in Kekulé form.
There are five generic SMILES encoding rules, corresponding to specification of atoms, bonds, branches, ring closures, and disconnections.
3.2.1 Atoms
Atoms are represented by their atomic symbols: this is the only required use of letters in SMILES. Each non-hydrogen atom is specified independently by its atomic symbol enclosed in square brackets, [ ]. The second letter of two-character symbols must be entered in lower case. Elements in the "organic subset" B, C, N, O, P, S, F, Cl, Br, and I may be written without brackets if the number of attached hydrogens conforms to the lowest normal valence consistent with explicit bonds. "Lowest normal valences" are B (3), C (4), N (3,5), O (2), P (3,5), S (2,4,6), and 1 for the halogens. Atoms in aromatic rings are specified by lower case letters, e.g., aliphatic carbon is represented by the capital letter C, aromatic carbon by lower case c. Since attached hydrogens are implied in the absence of brackets, the following atomic symbols are valid SMILES notations.
C | methane | (CH4) |
P | phosphine | (PH3) |
N | ammonia | (NH3) |
S | hydrogen sulfide | (H2S) |
O | water | (H2O) |
Cl | hydrogen chloride | (HCl) |
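The implied-hydrogen rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `NORMAL_VALENCES` and `implicit_hydrogens` are made up for this example, and the function assumes the caller already knows the sum of the explicit bond orders on the atom.

```python
# Lowest normal valences for the organic subset, as listed in the text.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol, bond_order_sum):
    """Return the hydrogens implied for an unbracketed organic-subset atom.

    The lowest normal valence that is not smaller than the sum of the
    explicit bond orders is chosen; the remainder is filled with hydrogens.
    """
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    # Assumption of this sketch: if the explicit bonds exceed every normal
    # valence, no hydrogens are added.
    return 0


if __name__ == "__main__":
    for symbol in ("C", "P", "N", "S", "O", "Cl"):
        print(symbol, implicit_hydrogens(symbol, 0))
```

With no explicit bonds the function reproduces the table: C gets 4 hydrogens, P and N get 3, S and O get 2, and Cl gets 1.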
Atoms with valences other than "normal" and elements not in the "organic subset" must be described in brackets.
[S] | elemental sulfur |
[Au] | elemental gold |
Within brackets, any attached hydrogens and formal charges must always be specified. The number of attached hydrogens is shown by the symbol H followed by an optional digit. Similarly, a formal charge is shown by one of the symbols + or -, followed by an optional digit. If unspecified, the number of attached hydrogens and charge are assumed to be zero for an atom inside brackets. Constructions of the form [Fe+++] are synonymous with the form [Fe+3]. Examples are:
[H+] | proton |
[Fe+2] | iron (II) cation |
[OH-] | hydroxide anion |
[Fe++] | iron (II) cation |
[OH3+] | hydronium cation |
[NH4+] | ammonium cation |
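As a rough sketch of how the bracket rules could be read by a program, the following Python snippet parses the simple bracket atoms shown above into a symbol, a hydrogen count, and a formal charge. The pattern and the function names are assumptions made for this example; isotopes, chirality, and other bracket features are deliberately ignored.

```python
import re

# Simple bracket atom: element symbol, optional H count, optional charge.
# The charge is either a sign followed by digits (+3) or repeated signs (+++).
BRACKET_ATOM = re.compile(
    r"^\[(?P<symbol>[A-Z][a-z]?)"
    r"(?:H(?P<hcount>\d*))?"
    r"(?P<charge>\+\d+|\++|-\d+|-+)?\]$"
)


def charge_value(text):
    """Convert a charge string such as '+', '+++', or '-2' to an integer."""
    if not text:
        return 0  # unspecified charge inside brackets means zero
    sign = 1 if text[0] == "+" else -1
    digits = text[1:]
    if digits.isdigit():
        return sign * int(digits)
    return sign * len(text)  # repeated signs: count them


def parse_bracket_atom(text):
    """Return (symbol, hydrogens, charge) for a simple bracket atom."""
    match = BRACKET_ATOM.match(text)
    if match is None:
        raise ValueError(f"not a simple bracket atom: {text!r}")
    hcount = match.group("hcount")
    if hcount is None:
        hydrogens = 0  # no H given inside brackets means zero hydrogens
    elif hcount == "":
        hydrogens = 1  # a bare H means exactly one hydrogen
    else:
        hydrogens = int(hcount)
    return match.group("symbol"), hydrogens, charge_value(match.group("charge"))


if __name__ == "__main__":
    for atom in ("[H+]", "[Fe+2]", "[Fe++]", "[OH-]", "[OH3+]", "[NH4+]"):
        print(atom, parse_bracket_atom(atom))
```

Under these assumptions `[Fe+++]` and `[Fe+3]` parse to the same result, which mirrors the statement that the two forms are synonymous.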
3.2.2 Bonds
Single, double, triple, and aromatic bonds are represented by the symbols -, =, #, and :, respectively. Adjacent atoms are assumed to be connected to each other by a single or aromatic bond (single and aromatic bonds may always be omitted). Examples are:
CC | ethane | (CH3CH3) |
C=O | formaldehyde | (CH2O) |
C=C | ethene | (CH2=CH2) |
O=C=O | carbon dioxide | (CO2) |
COC | dimethyl ether | (CH3OCH3) |
C#N | hydrogen cyanide | (HCN) |
CCO | ethanol | (CH3CH2OH) |
[H][H] | molecular hydrogen | (H2) |
For linear structures, SMILES notation corresponds to conventional diagrammatic notation except that hydrogens and single bonds are generally omitted. For example, 6-hydroxy-1,4-hexadiene can be represented by many equally valid SMILES, including the following three:
Structure | Valid SMILES |
---|---|
CH2=CH-CH2-CH=CH-CH2-OH | C=CCC=CCO |
 | C=C-C-C=C-C-O |
 | OCC=CCC=C |
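To see why these three strings describe the same molecule, a small Python sketch can reduce each one to its sequence of atoms and bond orders and compare the sequences in both reading directions. The helper names `parse_chain` and `same_chain` are hypothetical, and the sketch only handles unbranched, acyclic SMILES made of unbracketed organic-subset atoms.

```python
import re

BOND_ORDER = {"-": 1, "=": 2, "#": 3}
# Two-letter symbols are listed first so that "Cl" is not read as "C".
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#]")


def parse_chain(smiles):
    """Return (atoms, bond_orders) for a linear SMILES string."""
    atoms, bonds = [], []
    pending = 1  # an omitted bond symbol means a single bond
    position = 0
    while position < len(smiles):
        match = TOKEN.match(smiles, position)
        if match is None:
            raise ValueError(f"unsupported character at {position} in {smiles!r}")
        token = match.group()
        if token in BOND_ORDER:
            pending = BOND_ORDER[token]
        else:
            if atoms:
                bonds.append(pending)
            atoms.append(token)
            pending = 1
        position = match.end()
    return atoms, bonds


def same_chain(first, second):
    """True if two linear SMILES describe the same chain, read either way."""
    atoms_a, bonds_a = parse_chain(first)
    atoms_b, bonds_b = parse_chain(second)
    forward = (atoms_a, bonds_a) == (atoms_b, bonds_b)
    backward = (atoms_a[::-1], bonds_a[::-1]) == (atoms_b, bonds_b)
    return forward or backward


if __name__ == "__main__":
    print(same_chain("C=CCC=CCO", "C=C-C-C=C-C-O"))
    print(same_chain("C=CCC=CCO", "OCC=CCC=C"))
```

Writing the single bonds explicitly changes nothing, and starting from the oxygen end simply reverses the sequence.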
3.2.3 Branches
Branches are specified by enclosing them in parentheses, and can be nested or stacked. In all cases, the implicit connection to a parenthesized expression (a "branch") is to the left. Examples are:
SMILES | NAME |
---|---|
CCN(CC)CC | triethylamine |
CC(C)C(=O)O | isobutyric acid |
C=CC(CCC)C(C(C)C)CCC | 3-propyl-4-isopropyl-1-heptene |
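One common way to handle parentheses is a stack: an opening parenthesis remembers the atom the branch hangs from, and a closing parenthesis returns to it. The Python sketch below illustrates that idea under simplifying assumptions (unbracketed organic-subset atoms only, no rings); `parse_branched` and `neighbors` are names invented for this example.

```python
import re

BOND_ORDER = {"-": 1, "=": 2, "#": 3}
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#()]")


def parse_branched(smiles):
    """Return (atoms, bonds); each bond is (atom_index, atom_index, order)."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported character in {smiles!r}")
    atoms, bonds = [], []
    stack = []        # atoms to return to when a branch closes
    previous = None   # atom that the next atom will bond to
    pending = 1       # an omitted bond symbol means a single bond
    for token in tokens:
        if token == "(":
            stack.append(previous)      # the branch attaches to the left
        elif token == ")":
            previous = stack.pop()      # resume from the branch point
        elif token in BOND_ORDER:
            pending = BOND_ORDER[token]
        else:
            index = len(atoms)
            atoms.append(token)
            if previous is not None:
                bonds.append((previous, index, pending))
            previous = index
            pending = 1
    return atoms, bonds


def neighbors(bonds, index):
    """List the atoms bonded to the atom at the given index."""
    result = []
    for a, b, _order in bonds:
        if a == index:
            result.append(b)
        elif b == index:
            result.append(a)
    return result


if __name__ == "__main__":
    atoms, bonds = parse_branched("CCN(CC)CC")
    print(atoms[2], neighbors(bonds, 2))
```

For triethylamine the nitrogen (index 2) ends up bonded to three carbons, one from the main chain on each side and one from the parenthesized branch.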
3.2.4 Cyclic Structures
Cyclic structures are represented by breaking one bond in each ring. The bonds are numbered in any order, designating ring opening (or ring closure) bonds by a digit immediately following the atomic symbol at each ring closure. This leaves a connected non-cyclic graph which is written as a non-cyclic structure using the three rules described above. Cyclohexane is a typical example:
C1CCCCC1 | cyclohexane |
There are usually many different but equally valid descriptions of the same structure, e.g., the following SMILES notations for 1-methyl-3-bromo-cyclohexene-1:
(a) CC1=CC(Br)CCC1
(b) CC1=CC(CCC1)Br
Many other notations may be written for the same structure, deriving from different ring closures. SMILES does not have a preferred entry on input; although (a) above may be simplest, others are just as valid.
A single atom may have more than one ring closure. This is illustrated by the structure of cubane in which two atoms have more than one ring closure:
Generation of SMILES for cubane: C12C3C4C1C5C4C3C25.
If desired, digits denoting ring closures can be reused. As an example, the digit 1 is used twice in the following specification:
O1CCCCC1N1CCCCC1 |
The ability to re-use ring closure digits makes it possible to specify structures with 10 or more rings. Structures that require more than 10 ring closures to be open at once are exceedingly rare. If necessary or desired, higher-numbered ring closures may be specified by prefacing a two-digit number with a percent sign (%). For example, C2%13%24 is a carbon atom with ring closures 2, 13, and 24.
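The ring closure rules can be sketched in Python by keeping a table of currently open ring numbers: the first occurrence of a number opens a ring, the second closes it with a bond and frees the number for reuse. This is only an illustration under simplifying assumptions (unbracketed organic-subset atoms, single bonds, no branches), and the names `parse_rings` and `closure_numbers` are hypothetical.

```python
import re

# Atoms, two-digit closures written as %nn, and single-digit closures.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|%\d\d|\d")


def closure_numbers(atom_text):
    """Return the ring closure numbers written after one atom."""
    return [int(t.lstrip("%")) for t in re.findall(r"%\d\d|\d", atom_text)]


def parse_rings(smiles):
    """Return (atoms, bonds) for an unbranched SMILES with ring closures."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported character in {smiles!r}")
    atoms, bonds = [], []
    open_rings = {}  # ring number -> index of the atom that opened it
    for token in tokens:
        if token[0] == "%" or token.isdigit():
            if not atoms:
                raise ValueError("ring closure before any atom")
            number = int(token.lstrip("%"))
            current = len(atoms) - 1
            if number in open_rings:
                # Second occurrence: bond the two atoms and free the number.
                bonds.append((open_rings.pop(number), current))
            else:
                open_rings[number] = current
        else:
            if atoms:
                bonds.append((len(atoms) - 1, len(atoms)))  # chain bond
            atoms.append(token)
    if open_rings:
        raise ValueError(f"unclosed ring numbers: {sorted(open_rings)}")
    return atoms, bonds


if __name__ == "__main__":
    atoms, bonds = parse_rings("C12C3C4C1C5C4C3C25")
    print(len(atoms), len(bonds))
    print(closure_numbers("C2%13%24"))
```

For the cubane string this yields 8 atoms and 12 bonds with every carbon bonded to three others, and `O1CCCCC1N1CCCCC1` parses cleanly because the digit 1 is free again after the first ring closes.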
3.2.5 Disconnected Structures
Disconnected compounds are written as individual structures separated by a "." (period). The order in which ions or ligands are listed is arbitrary. There is no implied pairing of one charge with another, nor is it necessary to have a net zero charge. If desired, the SMILES of one ion may be embedded within another, as in the example of sodium phenoxide, which can be written as [Na+].[O-]c1ccccc1 or as c1cc([O-].[Na+])ccc1.
Matching pairs of digits following atom specifications imply that the atoms are bonded to each other. The bond may be explicit (bond symbol and/or direction preceding the ring closure digit) or implicit (a nondirectional single or aromatic bond). This is true whether or not the bond ends up as part of a ring.
Adjacent atoms separated by a dot (.) are not bonded to each other. This is true whether or not the atoms are in the same connected component.
For example, C1.C1 specifies the same molecule as CC (ethane).
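The interplay of dots and ring closure digits can be checked with a short Python sketch: a dot suppresses the bond between neighbouring atoms, a matching pair of digits still creates one, and a union-find pass counts the connected components. The function names are invented for this example, the parser ignores branches and bond symbols, and `[Na+].[O-]c1ccccc1` is used as an illustrative way of writing sodium phenoxide.

```python
import re

# Bracket atoms, organic-subset atoms, aromatic atoms, ring digits, and dots.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|\d|\.")


def parse(smiles):
    """Return (atoms, bonds) for an unbranched SMILES that may contain dots."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported character in {smiles!r}")
    atoms, bonds = [], []
    open_rings = {}   # ring digit -> index of the atom that opened it
    previous = None   # None at the start and right after a dot
    for token in tokens:
        if token == ".":
            previous = None  # the next atom is not bonded to the last one
        elif token.isdigit():
            if previous is None:
                raise ValueError("ring closure digit without an atom")
            if token in open_rings:
                bonds.append((open_rings.pop(token), previous))
            else:
                open_rings[token] = previous
        else:
            index = len(atoms)
            atoms.append(token)
            if previous is not None:
                bonds.append((previous, index))
            previous = index
    return atoms, bonds


def count_components(atom_count, bonds):
    """Count connected components with a small union-find structure."""
    parent = list(range(atom_count))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in bonds:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(atom_count)})


if __name__ == "__main__":
    for text in ("CC", "C1.C1", "[Na+].[O-]c1ccccc1"):
        atoms, bonds = parse(text)
        print(text, count_components(len(atoms), bonds))
```

In this sketch `C1.C1` and `CC` produce the same atoms and the same single bond, while the sodium phenoxide string yields two components: the sodium cation and the phenoxide anion.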