Source lines of code (SLOC or LOC) is a software metric used to measure the size of a software program by counting the number of lines in the text of the program's source code. SLOC is typically used to predict the amount of effort that will be required to develop a program, as well as to estimate programming productivity or effort once the software is produced.
There are two major types of SLOC measures: physical SLOC (LOC) and logical SLOC (LLOC). Specific definitions of these two measures vary, but the most common definition of physical SLOC is a count of lines in the text of the program's source code including comment lines. Blank lines are also included unless the lines of code in a section consists of more than 25% blank lines. In this case blank lines in excess of 25% are not counted toward lines of code.
Logical LOC attempts to measure the number of "statements", but their specific definitions are tied to specific computer languages (one simple logical LOC measure for C-like programming languages is the number of statement-terminating semicolons). It is much easier to create tools that measure physical SLOC, and physical SLOC definitions are easier to explain. However, physical SLOC measures are sensitive to logically irrelevant formatting and style conventions, while logical LOC is less sensitive to formatting and style conventions. Unfortunately, SLOC measures are often stated without giving their definition, and logical LOC can often be significantly different from physical SLOC.
Consider this snippet of C code as an example of the ambiguity encountered when determining SLOC:
for (i = 0; i < 100; i += 1) printf("hello"); /* How many lines of code is this? */
In this example we have:
Depending on the programmer and/or coding standards, the above "line of code" could be written on many separate lines:
for (i = 0; i < 100; i += 1)
{
printf("hello");
} /* Now how many lines of code is this? */
In this example we have:
Even the "logical" and "physical" SLOC values can have a large number of varying definitions. Robert E. Park (while at the Software Engineering Institute) et al. developed a framework for defining SLOC values, to enable people to carefully explain and define the SLOC measure used in a project. For example, most software systems reuse code, and determining which (if any) reused code to include is important when reporting a measure.
At the time that people began using SLOC as a metric, the most commonly used languages, such as FORTRAN and assembler, were line-oriented languages. These languages were developed at the time when punched cards were the main form of data entry for programming. One punched card usually represented one line of code. It was one discrete object that was easily counted. It was the visible output of the programmer so it made sense to managers to count lines of code as a measurement of a programmer's productivity, even referring to such as "card images". Today, the most commonly used computer languages allow a lot more leeway for formatting. Text lines are no longer limited to 80 or 96 columns, and one line of text no longer necessarily corresponds to one line of code.
SLOC measures are somewhat controversial, particularly in the way that they are sometimes misused. Experiments have repeatedly confirmed that effort is correlated with SLOC, that is, programs with larger SLOC values take more time to develop. Thus, SLOC can be effective in estimating effort. However, functionality is less well correlated with SLOC: skilled developers may be able to develop the same functionality with far less code, so one program with less SLOC may exhibit more functionality than another similar program. In particular, SLOC is a poor productivity measure of individuals, since a developer can develop only a few lines and yet be far more productive in terms of functionality than a developer who ends up creating more lines (and generally spending more effort). Good developers may merge multiple code modules into a single module, improving the system yet appearing to have negative productivity because they remove code. Also, especially skilled developers tend to be assigned the most difficult tasks, and thus may sometimes appear less "productive" than other developers on a task by this measure. Furthermore, inexperienced developers often resort to code duplication, which is highly discouraged as it is more bug-prone and costly to maintain, but it results in higher SLOC.
SLOC is particularly ineffective at comparing programs written in different languages unless adjustment factors are applied to normalize languages. Various computer languages balance brevity and clarity in different ways; as an extreme example, most assembly languages would require hundreds of lines of code to perform the same task as a few characters in APL. The following example shows a comparison of a "hello world" program written in C, and the same program written in COBOL - a language known for being particularly verbose.
C | COBOL |
---|---|
#include <stdio.h> |
000100 IDENTIFICATION DIVISION. |
Lines of code: 5 (excluding whitespace) |
Lines of code: 17 (excluding whitespace) |
Another increasingly common problem in comparing SLOC metrics is the difference between auto-generated and hand-written code. Modern software tools often have the capability to auto-generate enormous amounts of code with a few clicks of a mouse. For instance, GUI builders automatically generate all the source code for a GUI object simply by dragging an icon onto a workspace. The work involved in creating this code cannot reasonably be compared to the work necessary to write a device driver, for instance. By the same token, a hand-coded custom GUI class could easily be more demanding than a simple device driver; hence the shortcoming of this metric.
There are several cost, schedule, and effort estimation models which use SLOC as an input parameter, including the widely-used Constructive Cost Model (COCOMO) series of models by Barry Boehm et al., PRICE Systems True S and Galorath's SEER-SEM. While these models have shown good predictive power, they are only as good as the estimates (particularly the SLOC estimates) fed to them.
According to Vincent Maraia[1], the SLOC values for various operating systems in Microsoft's Windows NT product line are as follows:
Year | Operating System | SLOC (Million) |
---|---|---|
1993 | Windows NT 3.1 | 4-5[1] |
1994 | Windows NT 3.5 | 7-8[1] |
1996 | Windows NT 4.0 | 11-12[1] |
2000 | Windows 2000 | more than 29[1] |
2001 | Windows XP | 40[1] |
2003 | Windows Server 2003 | 50[1] |
David A. Wheeler studied the Red Hat distribution of the Linux operating system, and reported that Red Hat Linux version 7.1 (released April 2001) contained over 30 million physical SLOC. He also extrapolated that, had it been developed by conventional proprietary means, it would have required about 8,000 person-years of development effort and would have cost over $1 billion (in year 2000 U.S. dollars).
A similar study was later made of Debian Linux version 2.2 (also known as "Potato"); this version of Linux was originally released in August 2000. This study found that Debian Linux 2.2 included over 55 million SLOC, and if developed in a conventional proprietary way would have required 14,005 person-years and cost $1.9 billion USD to develop. Later runs of the tools used report that the following release of Debian had 104 million SLOC, and as of year 2005, the newest release is going to include over 213 million SLOC.[update]
One can find figures of major operating systems (the various Windows versions have been presented in a table above)
Operating System | SLOC (Million) |
---|---|
Debian 2.2 | 55-59[2][3] |
Debian 3.0 | 104[3] |
Debian 3.1 | 215[3] |
Debian 4.0 | 283[3] |
Debian 5.0 | 324[3] |
OpenSolaris | 9.7 |
FreeBSD | 8.8 |
Mac OS X 10.4 | 86[4] |
Linux kernel 2.6.0 | 5.2 |
Linux kernel 2.6.29 | 11.0 |
Linux kernel 2.6.32 | 12.6[5] |
In the PBS documentary Triumph of the Nerds, Microsoft executive Steve Ballmer criticized the use of counting lines of code:
In IBM there's a religion in software that says you have to count K-LOCs, and a K-LOC is a thousand line of code. How big a project is it? Oh, it's sort of a 10K-LOC project. This is a 20K-LOCer. And this is 50K-LOCs. And IBM wanted to sort of make it the religion about how we got paid. How much money we made off OS/2, how much they did. How many K-LOCs did you do? And we kept trying to convince them - hey, if we have - a developer's got a good idea and he can get something done in 4K-LOCs instead of 20K-LOCs, should we make less money? Because he's made something smaller and faster, less K-LOC. K-LOCs, K-LOCs, that's the methodology. Ugh! Anyway, that always makes my back just crinkle up at the thought of the whole thing.