This article is provided by special arrangement with the
Open Web Application Security
Project (OWASP). This article is covered by the
Creative
Commons Share-Alike Attribution 2.5 license. You can find the
latest version of this article and more free and open application
security tools and documentation at
http://www.owasp.org.
Data Validation
Objective
To ensure that the application is robust against all forms of input
data, whether obtained from the user, infrastructure, external
entities or database systems.
Platforms Affected
All.
Relevant COBIT Topics
DS11 – Manage Data. All sections should be reviewed.
Description
The most common web application security weakness is the failure to
properly
validate input from the client or environment. This weakness leads
to almost all of
the major vulnerabilities in applications, such as interpreter
injection, locale/Unicode
attacks, file system attacks and buffer overflows.
Data from the client should never be trusted, because the client has
every opportunity to tamper with the data.
Definitions
These definitions are used within this document:
- Integrity checks
Ensure that the data has not been tampered with and is the same as
before
- Validation
Ensure that the data is strongly typed, has correct syntax, is within
length boundaries, contains only permitted characters, or if
numeric is correctly signed and within range boundaries
- Business rules
Ensure that data is not only validated, but business rule correct.
For example, interest rates fall within permitted boundaries.
Some documentation and references use these terms interchangeably,
which is very confusing to all concerned. This confusion directly
causes continuing financial loss to the organization.
Where to include integrity checks
Integrity checks must be included wherever data passes from a
trusted to a less trusted boundary, such as from the application to
the user's browser in a hidden field, or to a third party payment
gateway (for example, a transaction ID that is used internally when
the gateway returns).
The type of integrity control (checksum, HMAC, encryption,
digital signature)
should be directly related to the risk of the data transiting the
trust boundary.
Where to include validation
Validation must be performed on every tier. However, validation
should be
performed as per the function of the server executing the code. For
example, the
web / presentation tier should validate for web related issues,
persistence layers
should validate for persistence issues such as SQL / HQL injection,
directory lookups
should check for LDAP injection, and so on.
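As an illustration of tier-specific validation, a directory lookup tier might escape user-supplied values before placing them in a search filter. The following is a minimal sketch only; the class name LdapFilterEscaper is hypothetical, and it covers just the special characters that RFC 4515 requires to be escaped inside a filter value.

```java
final class LdapFilterEscaper {
    private LdapFilterEscaper() {}

    /** Escapes the characters that are special inside an LDAP search filter value. */
    static String escape(String value) {
        StringBuilder out = new StringBuilder(value.length() + 8);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': out.append("\\5c"); break; // backslash
                case '*':  out.append("\\2a"); break; // wildcard
                case '(':  out.append("\\28"); break; // filter grouping
                case ')':  out.append("\\29"); break;
                case '\0': out.append("\\00"); break; // NUL
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

A caller would build the filter as "(uid=" + LdapFilterEscaper.escape(input) + ")". Escaping complements a positive check on the expected format of the value; it does not replace it.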
Where to include business rule validation
Business rules are known during design, and they influence
implementation.
However, there are bad, good and "best" approaches. Often the best
approach is the
simplest in terms of code.
Example - Scenario
- You are to populate a list with accounts provided by the
back-end system.
- The user will choose an account, choose a biller, and press
next.
Wrong Way
The account select option is read directly and passed in a message
to the backend system without validating that the account number is
one of the accounts provided by the backend system.
Why this is bad:
An attacker can change the HTML in any way they choose:
- The lack of validation requires a round-trip to the backend to
provide an error message that the front end code could easily have
eliminated
- The back end may not be able to cope with the data payload the
front-end code could have easily eliminated. For example, buffer
overflows, XML injection, or similar.
Acceptable Method
The account select option parameter is read by the code, and
compared to the
previously rendered list.
if (account.inList(session.getParameter("payeelstid"))) {
    backend.performTransfer(session.getParameter("payeelstid"));
}
This prevents parameter tampering, but still makes the browser
do a lot of
work.
Best Method
The original code emitted indexes
rather than account
names.
// the parameter arrives as a string, so convert it before use
int payeeLstId = Integer.parseInt(session.getParameter("payeelstid"));
accountFrom = account.getAcctNumberByIndex(payeeLstId);
Not only is this easier to render in HTML, it makes validation
and business rule validation trivial. Tampering with the field can
at most select another entry from the user's own list, and an
out-of-range index is simply rejected.
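The index lookup can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the class name AccountIndexLookup is hypothetical, the list of accounts is assumed to be held on the server (for example in the session), and indexes are assumed to be zero-based.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class AccountIndexLookup {
    // Account numbers rendered for this user; held server-side only.
    private final List<String> accounts;

    AccountIndexLookup(List<String> accounts) {
        this.accounts = Collections.unmodifiableList(new ArrayList<>(accounts));
    }

    /** Returns the account number for a submitted index, or null if the index is invalid. */
    String getAcctNumberByIndex(String submittedIndex) {
        final int index;
        try {
            index = Integer.parseInt(submittedIndex);
        } catch (NumberFormatException e) {
            return null; // not a number (this also covers a missing parameter)
        }
        if (index < 0 || index >= accounts.size()) {
            return null; // outside the list that was rendered
        }
        return accounts.get(index);
    }
}
```

Because the real account numbers never leave the server, the only thing the client can influence is which entry of its own list is chosen.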
Conclusion
To provide defense in depth and to prevent attack payloads from
crossing trust boundaries to systems such as backend hosts, which
are probably incapable of handling arbitrary input data, business
rule validation is to be performed (preferably in workflow or
command patterns), even if it is known that the back end code
performs business rule validation.
This is not to say that the entire set of business rules need be
applied - it
means that the fundamentals are performed to prevent unnecessary
round trips to the
backend and to prevent the backend from receiving most tampered
data.
Data Validation Strategies
There are four strategies for validating data, and they should be
used in this
order:
Accept known good
If you expect a postcode, validate for a postcode (type, length and
syntax):
public String validateAUpostCode(String postcode) {
    // returns the postcode if it matches the expected format, otherwise an empty string
    return (Pattern.matches("^(((2|8|9)\\d{2})|((02|08|09)\\d{2})|([1-9]\\d{3}))$", postcode)) ? postcode : "";
}
Reject known bad
If you don't expect to see characters such as %3f or
JavaScript or similar, reject strings containing them:
public String removeJavascript(String input) {
    Pattern p = Pattern.compile("javascript", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(input);
    // find() looks for the pattern anywhere in the input
    return (!m.find()) ? input : "";
}
It can take upwards of 90 regular expressions (see the CSS Cheat
Sheet in the
Guide 2.0) to eliminate known malicious software, and each regex
needs to be run
over every field. Obviously, this is slow and not secure.
Sanitize
Eliminate or translate characters (such as to HTML entities or to
remove quotes) in
an effort to make the input "safe":
public String quoteApostrophe(String input) {
    // translate apostrophes to an HTML entity
    return input.replaceAll("[']", "&rsquo;");
}
This does not work well in practice, as there are many, many
exceptions to the
rule.
No validation
account.setAcctId(getParameter("formAcctNo"));
...
public void setAcctId(String acctId) {
    cAcctId = acctId;
}
This is inherently unsafe and strongly discouraged. The business
must sign off each and every instance of no validation, as the lack
of validation usually leads directly to the circumvention of
application, host and network security controls.
Just rejecting "current known bad" (which is at the time of
writing hundreds of strings and literally millions of combinations)
is insufficient if the input is a string. This strategy is directly
akin to anti-virus pattern updates. Unless the business will allow
updating "bad" regexes on a daily basis and support someone to
research new attacks regularly, this approach will be bypassed
before long.
As most fields have a particular grammar, it is simpler, faster,
and more secure
to simply validate a single correct positive test than to try to
include complex and
slow sanitization routines for all current and future attacks.
Data should be:
- Strongly typed at all times
- Length checked, with field lengths minimized
- Range checked if numeric
- Unsigned unless required to be signed
- Syntax or grammar checked prior to first use or inspection
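The checks in this list can be combined into one small positive test per field. The sketch below assumes a hypothetical "quantity" field that must be an unsigned whole number between 1 and 500; the class name and the limits are illustrative only.

```java
import java.util.OptionalInt;
import java.util.regex.Pattern;

final class QuantityField {
    // Syntax and length in one test: one to three ASCII digits, no sign.
    private static final Pattern DIGITS = Pattern.compile("[0-9]{1,3}");
    private static final int MIN = 1;
    private static final int MAX = 500;

    private QuantityField() {}

    /** Returns the strongly typed value, or an empty result if any check fails. */
    static OptionalInt parse(String raw) {
        if (raw == null || !DIGITS.matcher(raw).matches()) {
            return OptionalInt.empty();
        }
        int value = Integer.parseInt(raw); // cannot overflow: at most three digits
        return (value >= MIN && value <= MAX) ? OptionalInt.of(value) : OptionalInt.empty();
    }
}
```

The caller receives an int or nothing at all, so the unvalidated string never travels further into the application.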
Coding guidelines should use some form of visible tainting on
input from the client or untrusted sources, such as third party
connectors, to make it obvious that the input is unsafe:
String taintPostcode = getParameter("postcode");   // unvalidated
Validation validation = new Validation();
String postcode = validation.isPostcode(taintPostcode);   // validated
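Naming conventions rely on discipline. One possible alternative, shown here only as a sketch, is to make tainting visible in the type system so that unvalidated input cannot be used as a String by accident. The Tainted class below is hypothetical and is not part of any framework mentioned in this article.

```java
import java.util.Optional;
import java.util.function.Predicate;

final class Tainted {
    private final String raw;

    private Tainted(String raw) {
        this.raw = raw;
    }

    /** Wraps a value received from an untrusted source. */
    static Tainted of(String raw) {
        return new Tainted(raw);
    }

    /** Releases the value only if the supplied check accepts it. */
    Optional<String> validate(Predicate<String> check) {
        return (raw != null && check.test(raw)) ? Optional.of(raw) : Optional.empty();
    }

    @Override
    public String toString() {
        return "Tainted[unvalidated]"; // never expose the raw value, e.g. in logs
    }
}
```

For example, Tainted.of(getParameter("postcode")).validate(s -> s.matches("[0-9]{4}")) yields the postcode only when it consists of exactly four digits.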
Prevent parameter tampering
There are many input sources:
- HTTP headers, such as REMOTE_ADDR, PROXY_VIA or similar
- Environment variables, such as getenv() or via server
properties
- All GET, POST and Cookie data. This includes supposedly tamper
resistant fields such as radio buttons, drop downs, etc - any client
side HTML can be re-written to suit the attacker
- Configuration data (mistakes happen :))
- External systems (via any form of input mechanism, such as XML
input, RMI, web services, etc)
All of these data sources supply untrusted input. Data received
from untrusted
data sources must be properly checked before first use.
Hidden fields
Hidden fields are a simple way to avoid storing state on the
server. Their use is
particularly prevalent in "wizard-style" multi-page forms. However,
their use exposes
the inner workings of your application, and exposes data to trivial
tampering, replay,
and validation attacks. In general, only use hidden fields for page
sequence.
If you have to use hidden fields, there are some rules:
- Secrets, such as passwords, should never be sent in the clear.
- Hidden fields need to have integrity checks and should preferably
be encrypted using non-constant initialization vectors (i.e.
different users at different times have different yet
cryptographically strong random IVs).
- Encrypted hidden fields must be robust against replay attacks,
which means some form of temporal keying.
- Data sent to the user must be validated on the server once the
last page has been received, even if it has been previously
validated on the server - this helps reduce the risk from replay
attacks.
The preferred integrity control should be at least an HMAC using
SHA-256, or preferably a digital signature or encryption using PGP.
IBMJCE supports SHA-256, but PGP JCE support requires the inclusion
of the Legion of the Bouncy Castle (http://www.bouncycastle.org/)
JCE classes.
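A minimal sketch of the HMAC option follows, using only the standard javax.crypto classes. The class name SignedField, the "value|expiry|mac" layout and the use of an expiry time as a simple stand-in for temporal keying are assumptions made for illustration. The sketch provides integrity only, not confidentiality, and the expiry merely narrows the window for replay rather than preventing it.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class SignedField {
    private final SecretKeySpec key;

    SignedField(byte[] secret) {
        this.key = new SecretKeySpec(secret, "HmacSHA256");
    }

    /** Produces "value|expiryMillis|mac". The value must not contain '|'. */
    String protect(String value, long expiryMillis) {
        if (value.indexOf('|') >= 0) {
            throw new IllegalArgumentException("value must not contain '|'");
        }
        String payload = value + "|" + expiryMillis;
        return payload + "|" + mac(payload);
    }

    /** Returns the value, or null if the token is malformed, tampered with or expired. */
    String verify(String token, long nowMillis) {
        if (token == null) return null;
        String[] parts = token.split("\\|", -1);
        if (parts.length != 3) return null;
        byte[] expected = mac(parts[0] + "|" + parts[1]).getBytes(StandardCharsets.US_ASCII);
        byte[] actual = parts[2].getBytes(StandardCharsets.US_ASCII);
        if (!MessageDigest.isEqual(expected, actual)) return null; // constant-time comparison
        long expiry = Long.parseLong(parts[1]); // parsed only after the MAC has been verified
        return nowMillis <= expiry ? parts[0] : null;
    }

    private String mac(String payload) {
        try {
            Mac hmac = Mac.getInstance("HmacSHA256");
            hmac.init(key);
            byte[] tag = hmac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(tag);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }
}
```

The secret must be generated randomly and kept on the server. A typical call would be protect(value, System.currentTimeMillis() + 600000L) when rendering the field and verify(token, System.currentTimeMillis()) when it returns.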
It is simpler to store this data temporarily in the session
object. Using the
session object is the safest option as data is never visible to the
user, requires (far)
less code, nearly no CPU, disk or I/O utilization, less memory
(particularly on large
multi-page forms), and less network consumption.
In the case of the session object being backed by a database,
large session
objects may become too large for the inbuilt handler. In this case,
the recommended
strategy is to store the validated data in the database, but mark
the transaction as
"incomplete". Each page will update the incomplete transaction
until it is ready for
submission. This minimizes the database load, session size, and
activity between the
users whilst remaining tamperproof.
Code containing hidden fields should be rejected during code
reviews.
ASP.NET Viewstate
ASP.NET sends form data back to the client in a hidden "Viewstate"
field. Despite looking forbidding, this "encryption" is simply an
encoding that is equivalent to plain text, and in ASP.NET 1.1 it has
no data integrity without further action on your part. In ASP.NET
2.0, tamper proofing is on by default.
Any application framework with a similar mechanism might be at
fault – you
should investigate your application framework's support for sending
data back to the
user. Preferably it should not round trip.
How to determine if you are vulnerable
Investigate the machine.config:
- If the enableViewStateMac is not set to "true", you are at risk
if your viewstate contains authorization state.
- If the viewStateEncryptionMode is not set to "always", you are
at risk if your viewstate contains secrets such as credentials.
- If you share a host with many other customers, you all share
the same machine key by default in ASP.NET 1.1. In ASP.NET 2.0, it
is possible to configure unique viewstate keys per application.
How to protect yourself
- If your application relies on data returning from the viewstate
without being tampered with, you should turn on viewstate integrity
checks at the least, and strongly consider the following:
- Encrypt viewstate if any of the data is application sensitive.
- Upgrade to ASP.NET 2.0 as soon as practical if you are on a
shared hosting arrangement.
- Move truly sensitive viewstate data to the session variable
instead.
Selects, radio buttons, and checkboxes
It is a commonly held belief that the value settings for these items
cannot be easily tampered with. This is wrong. In the following
example, actual account numbers are used, which can lead to
compromise:
<html:radio value="<%=acct.getCardNumber(1).toString()%>" property="acctNo">
<bean:message key="msg.card.name" arg0="<%=acct.getCardName(1).toString()%>" />
</html:radio>
<html:radio value="<%=acct.getCardNumber(2).toString()%>" property="acctNo">
<bean:message key="msg.card.name" arg0="<%=acct.getCardName(2).toString()%>" />
</html:radio>
This produces (for example):
<input type="radio" name="acctNo" value="455712341234">Gold Card
<input type="radio" name="acctNo" value="455712341235">Platinum Card
If the value is retrieved and then used directly in a SQL query,
an interesting form of SQL injection may occur: authorization
tampering leading to information disclosure. As the connection pool
connects to the database using a single user, it may be possible to
see other users' accounts if the SQL looks something like this:
String acctNo = getParameter("acctNo");
// placeholders must not be quoted, or they are treated as the literal string '?'
String sql = "SELECT acctBal FROM accounts WHERE acctNo = ?";
PreparedStatement st = conn.prepareStatement(sql);
st.setString(1, acctNo);
ResultSet rs = st.executeQuery();
This should be re-written to retrieve the account number via
index, and include the client's unique ID to ensure that other
valid account numbers are not exposed:
// look the account number up on the server from the submitted index
String acctNo = acct.getCardNumber(Integer.parseInt(getParameter("acctIndex"))).toString();
String sql = "SELECT acctBal FROM accounts WHERE acct_id = ? AND acctNo = ?";
PreparedStatement st = conn.prepareStatement(sql);
st.setString(1, acct.getID());
st.setString(2, acctNo);
ResultSet rs = st.executeQuery();
This approach requires rendering input values from 1 to x.
Assuming the accounts are stored in a Collection, they can be
iterated using logic:iterate:
<logic:iterate id="loopVar" name="MyForm" property="values">
<html:radio property="acctIndex" idName="loopVar" value="value"/>
<bean:write name="loopVar" property="name"/> <br />
</logic:iterate>
The code will emit HTML with the values "1" .. "x" as per the
collection's
content.
<input type="radio" name="acctIndex" value="1" />Gold Credit Card
<input type="radio" name="acctIndex" value="2" />Platinum Credit Card
This approach should be used for any input type that allows a
value to be set:
radio buttons, checkboxes, and particularly select / option
lists.
Per-User Data
In fully normalized databases, the aim is to minimize the amount of
repeated
data. However, some data is inferred. For example, users can see
messages that are
stored in a messages table. Some messages are private to the user.
However, in a
fully normalized database, the list of message IDs is kept within
another table.
If a user marks a message for deletion, the usual way is to
recover the message
ID from the user, and delete that:
DELETE FROM message WHERE msgid='frmMsgId'
However, how do you know if the user is eligible to delete that
message ID?
Such tables need to be denormalized slightly to include a user ID
or make it easy to
perform a single query to delete the message safely. For example,
by adding back an
(optional) uid column, the delete is now made reasonably safe:
DELETE FROM message WHERE uid='session.myUserID' and msgid='frmMsgId';
Where the data is potentially both a private resource and a
public resource (for
example, in the secure message service, broadcast messages are just
a special type of
private message), additional precautions need to be taken to
prevent users from
deleting public resources without authorization. This can be done
using role based
checks, as well as using SQL statements to discriminate by message
type:
DELETE FROM message
WHERE
uid='session.myUserID' AND
msgid='frmMsgId' AND
broadcastFlag = false;
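In Java, the same statement could be issued through a PreparedStatement so that neither value is concatenated into the SQL. This is a sketch under assumptions: the class name MessageDao is hypothetical, uid is assumed to be a string column and msgid a numeric one, and the table and column names are taken from the statement above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class MessageDao {
    static final String DELETE_OWN_MESSAGE =
        "DELETE FROM message WHERE uid = ? AND msgid = ? AND broadcastFlag = false";

    private MessageDao() {}

    /** Returns true only if exactly one private message owned by userId was deleted. */
    static boolean deleteOwnMessage(Connection conn, String userId, long msgId)
            throws SQLException {
        try (PreparedStatement st = conn.prepareStatement(DELETE_OWN_MESSAGE)) {
            st.setString(1, userId); // taken from the server-side session, never from the form
            st.setLong(2, msgId);    // the only client-supplied value, already typed as a number
            return st.executeUpdate() == 1;
        }
    }
}
```

If the message belongs to another user or is a broadcast message, no row matches and the method returns false, which the caller can treat as an authorization failure.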
URL encoding
Data sent via the URL, which is strongly discouraged, should be URL
encoded and decoded. This reduces the likelihood of cross-site
scripting attacks succeeding.
In general, do not send data via GET requests except for
navigational purposes.
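Where a value does have to travel in a URL, the standard java.net classes can perform the encoding and decoding. The helper below is a minimal sketch; the class name UrlParam is hypothetical. Note that URLEncoder applies form encoding rules (a space becomes "+"), so it suits query string values rather than path segments.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class UrlParam {
    private UrlParam() {}

    /** Encodes a single query string value. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    /** Decodes a value exactly once; the result must still be validated before use. */
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }
}
```

For example, UrlParam.encode("a b&c=<d>") returns "a+b%26c%3D%3Cd%3E", so the markup characters can no longer terminate the parameter or open a tag.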
HTML encoding
Data sent to the user needs to be safe for the user to view. This
can be done using
<bean:write ... > and friends. Do not use <%=var%>
unless it is used to supply
an argument for <bean:write... > or similar.
HTML encoding translates a range of characters into their HTML
entities. For example, > becomes &gt;
This will still display as > in the user's browser, but it is a
safe alternative.
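Outside a tag library, the same translation can be sketched by hand. The class below is illustrative only (HtmlEncoder is a hypothetical name) and covers the five characters that matter in element content and quoted attribute values; it is not sufficient for data placed inside scripts, style sheets or URLs.

```java
final class HtmlEncoder {
    private HtmlEncoder() {}

    /** Replaces the characters that are significant in HTML with entities. */
    static String encode(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

In practice a maintained encoder from the framework in use is preferable to a hand-written one.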
Encoded strings
Some strings may be received in encoded form. It is essential to
send the correct locale to the user so that the web server and
application server can provide a single level of canonicalization
prior to the first use.
Do not use getReader() or getInputStream() as these input
methods do not decode encoded strings. If you need to use these
constructs, you must decode and canonicalize the data by hand.
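One form that canonicalization by hand might take is Unicode normalization, applied once and before validation so that the check and every later use see the same form. The sketch below uses java.text.Normalizer; the class name, the choice of NFKC and the permitted character set are assumptions for illustration, and percent-decoding is assumed to have already been done by the container.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class Canonicalizer {
    // Positive test applied to the canonical form only.
    private static final Pattern NAME = Pattern.compile("[A-Za-z0-9 ]{1,40}");

    private Canonicalizer() {}

    /** Normalizes once (NFKC), then validates. Returns null if the result is rejected. */
    static String canonicalizeAndValidate(String raw) {
        if (raw == null) return null;
        String canonical = Normalizer.normalize(raw, Normalizer.Form.NFKC);
        return NAME.matcher(canonical).matches() ? canonical : null;
    }
}
```

Under NFKC, full-width look-alikes such as U+FF1C are folded to their ASCII equivalents (here "<") before the check runs, so they cannot slip past validation and be converted later.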
Delimiter and special characters
There are many characters that mean something special to various
programs. If
you followed the advice only to accept characters that are
considered good, it is very
likely that only a few delimiters will catch you out.
Here are the usual suspects:
- NULL (zero) %00
- LF - ANSI chr(10) "\n"
- CR - ANSI chr(13) "\r"
- CRLF - "\r\n"
- CR - EBCDIC 0x0f
- Quotes " '
- Commas, slashes, spaces, tabs and other white space - used in
CSV, tab delimited output, and other specialist formats
- <> - XML and HTML tag markers, redirection characters
- ; & - Unix and NT file system continuance
- @ - used for e-mail addresses
- 0xff
- ... more
Whenever you code to a particular technology, you should
determine which characters are "special" and either prevent them
from appearing in input or properly escape them.
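As a small example of preventing special characters from appearing in input, the sketch below rejects control characters (including NULL, CR and LF) in a value that is about to be written to a line-oriented format such as an HTTP header or a log file. The class name HeaderValue and the choice of target format are assumptions for illustration.

```java
final class HeaderValue {
    private HeaderValue() {}

    /** Returns true if the value contains no control characters (NUL, CR, LF, tab and so on). */
    static boolean isSafe(String value) {
        if (value == null) return false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c < 0x20 || c == 0x7f) {
                return false; // control character found
            }
        }
        return true;
    }
}
```

Each technology needs its own list: the characters rejected here are harmless in many other contexts, while quotes or angle brackets, which pass this check, are not.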
Further reading