Regular Expression Tutorial

Published October 04, 2014 by Simon Laroche, category PHP


You've always wanted to learn to speak Chinese?

Good thing! In this tutorial, I'll teach you to write something like this:

#(((https?|ftp)://(w{3}\.)?)(?<!www)(\w+-?)*\.([a-z]{2,4}))#

Believe it or not, this unpronounceable gibberish really means something. Yes, yes, I swear!

Regular expressions are a very powerful and very fast system for searching in strings (sentences, for example). Think of them as a supercharged Find / Replace that you will not want to give up once you know how to use it.

Regular expressions will allow us to search and replace within text. Here are some examples of what you'll be able to do by the end of this tutorial: check that a phone number or an e-mail address is well formed, turn URLs into clickable links, and create your own bbCode.

Open your ears and fasten your seatbelts!

Where to use a regex?

POSIX or PCRE?

Good news: you will not have to activate anything to do regular expressions (unlike the GD library).

There are two types of regular expressions, which go by these sweet names: POSIX and PCRE.

PHP has historically offered a choice between POSIX and PCRE. For me, the choice is made: we will study PCRE.
Rest assured, it is not much more complicated than POSIX, and it has the advantage of being very fast. Besides, the POSIX functions (ereg and friends) were deprecated in PHP 5.3 and removed in PHP 7, so PCRE is the one to learn.

Functions that interest us

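We therefore chose PCRE. There are various functions using the "PCRE language", which all start with preg_: preg_grep, preg_split, preg_quote, preg_match, preg_match_all, preg_replace and preg_replace_callback.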

Each function has its own characteristics: some are used to simply do a search, others a search and replacement, but the big thing in common is that they use the same "language" to do a search.
Once you have learned the PCRE language, you can use each of them without problem.

To avoid too much theory, we will begin to practice, and use one of these functions: preg_match.

preg_match

Using this function, you can practice along with me and see whether you gradually get the hang of the PCRE language.
Just be aware that this function returns 1 if it finds what you are looking for in the string and 0 if it does not (or false if an error occurs). In a condition, PHP treats these values as true and false.

You must give it two pieces of information: your regex (the nickname given to a "regular expression") and the string in which you search.

Here is how it can be used, with an if statement:

<?php
if (preg_match("** Your REGEX **", "String in which you do the search")) {
	echo 'The word you are looking for is in the string';
} else {
	echo 'The word you are looking for is not in the string';
}
?>

Instead of "** Your REGEX **", you would type something in PCRE language, as I showed you earlier in this chapter:

#(((https?|ftp)://(w{3}\.)?)(?<!www)(\w+-?)*\.([a-z]{2,4}))#

That is precisely what interests us, and we are going to look into it.
Because, in case you have not noticed, this thing is frankly not easy to read... By comparison, Chinese looks simpler!

Simple searches

We'll start with very simple, very basic searches. Normally you shouldn't have any difficulty following for now; it's later, when we mix everything together, that things get complicated.

First important thing to know: a regex (regular expression) is always surrounded by special characters called delimiters.
You can choose almost any special character as a delimiter (anything that is not a letter, a digit, a backslash or a space). To avoid going around in circles for too long, I will impose one: the pound sign (#)!
Your regex is then surrounded by pound signs, like this:

#My regex#

Uh, but what are the pound signs for, since the regex is surrounded by quotes in the PHP function anyway?

Because, if we want, we can use options. We will not talk about options right now because we do not need them to get started, but be aware that they are placed after the second pound sign, like this:

#My regex#Options

Instead of "My regex", you put the word you are looking for.

An example: you want to know whether a variable contains the word "guitar". Use the following regex to do the search:

#guitar#

In PHP code, it gives:

<?php
if (preg_match("#guitar#", "I love playing guitar")) {
	echo 'TRUE';
} else {
	echo 'FALSE';
}
?>

If you run this code, you will see that it displays TRUE because the word "guitar" was found in the phrase "I love playing guitar".

Remember this snippet. We'll keep it for a while, sometimes changing the regex, sometimes the sentence we search in.
To help you understand how regex behave, I will present the results in a table, like this:

String                  Regex      Result
I love playing guitar   #guitar#   TRUE
I love playing guitar   #piano#    FALSE

OK, got it so far?
We found the word "guitar" in the first sentence, but not "piano" in the second.
So far it's easy, but I'm about to complicate things!

There is something you must know: regex distinguish between upper and lower case; they are said to be "case sensitive". Just look at these two examples:

String                  Regex      Result
I love playing guitar   #Guitar#   FALSE
I love playing guitar   #GUITAR#   FALSE

What if we want our regex to ignore the difference between upper and lower case?
We'll just use an option. It's the only one you'll need to remember for now. We add the letter "i" after the second pound sign, and the regex is no longer case sensitive:

String                  Regex       Result
I love playing guitar   #Guitar#i   TRUE
Cheers GUITAR!          #guitar#i   TRUE
Cheers GUITAR!          #guitar#    FALSE

In the last example, I did not add the "i" option, so it returned FALSE.
But in the other examples, you can see that the "i" option made the regex ignore the difference between uppercase and lowercase.
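To try the table above yourself, here is a minimal sketch. The variable names are mine and purely illustrative; only preg_match and the "i" option come from the tutorial.

```php
<?php
// Test sentence taken from the table above.
$sentence = 'Cheers GUITAR!';

// Without the "i" option the search is case sensitive: no match (0).
$caseSensitive = preg_match('#guitar#', $sentence);

// With the "i" option, upper and lower case are treated alike: match (1).
$caseInsensitive = preg_match('#guitar#i', $sentence);

echo $caseSensitive ? 'TRUE' : 'FALSE', "\n";   // FALSE
echo $caseInsensitive ? 'TRUE' : 'FALSE', "\n"; // TRUE
```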

The symbol OR

We will now use the OR symbol: the vertical bar "|".
With it, you can give your regex several alternatives. So if you type:

#guitar|piano#

... it means you're looking for the word "guitar" OR the word "piano". If one of these words is found, the regex returns TRUE. Here are some examples:

String                  Regex                  Result
I love playing guitar   #guitar|piano#         TRUE
I love playing piano    #guitar|piano#         TRUE
I love playing banjo    #guitar|piano#         FALSE
I love playing banjo    #guitar|piano|banjo#   TRUE

In the last example, I used the vertical bar twice. This means we are looking for guitar OR piano OR banjo.

Are you still following?
Perfect!
We can now look at the beginning and the end of a string, and then move on to the next level.

Beginning and end of string

Regex let you be very, very precise, as you will soon realize.
So far, the word could be anywhere in the string. But suppose we want the sentence to begin or end with this word.

We will need the following two symbols, remember them:
  • ^ (circumflex): indicates the beginning of a string;
  • $ (dollar): indicates the end of a string.

So if you want a string to begin with "Hello", you will use the regex:

#^Hello#

If you place the symbol "^" before the word, then that word will necessarily be at the beginning of the string, otherwise it will return FALSE.

Similarly, if we want to check that the string ends with "zero", we write this regex:

#zero$#

Got that? Here is a series of tests so you can see how it works:

String                 Regex      Result
Hello little zero      #^Hello#   TRUE
Hello little zero      #zero$#    TRUE
Hello little zero      #^zero#    FALSE
Hello little zero!!!   #zero$#    FALSE

Simple, right?
In the last case it does not work because the string does not end with "zero" but with "!!!". So naturally, we got FALSE...
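The tests in this table can be reproduced with a short sketch (variable names are illustrative only). The regex are written in single quotes so that PHP does not try to interpret the "$" as the start of a variable name.

```php
<?php
$sentence = 'Hello little zero';

$startsWithHello = preg_match('#^Hello#', $sentence); // 1: "Hello" is at the beginning
$endsWithZero    = preg_match('#zero$#', $sentence);  // 1: "zero" is at the end
$startsWithZero  = preg_match('#^zero#', $sentence);  // 0: "zero" is not at the beginning

// With trailing punctuation the string no longer ends with "zero".
$endsWithZeroBang = preg_match('#zero$#', $sentence . '!!!'); // 0
```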

Character classes

So far you have been able to do some pretty simple searches, but nothing really special. The search tool in Word does the same, after all.
But rest assured, regex are much richer (and more complex) than the Word search tool, as you'll see.

With so-called character classes, we can greatly vary the search possibilities.

All of this revolves around square brackets. We place a character class between brackets in a regex.
This allows us to test many possibilities at once, while staying very precise.

Simple classes

Look carefully at this regex:

#h[ao]t#

What sits between the brackets is called a character class. It means that any one of the letters inside is acceptable.
In this case, our regex recognizes two words: "hat" and "hot". It's a bit like the OR we learned earlier, except that it applies to a single letter, not a word.

Besides, if you put several letters like this:

#h[aou]t#

It means "a" OR "o" OR "u". So our regex recognizes the words "hat", "hot" and "hut"!
Come on, let's try a few examples:

String                      Regex         Result
I wear a hat                #h[aou]t#     TRUE
Too much sun, hot weather   #h[aou]t#     TRUE
Too much sun, hot weather   #h[aou]t$#    FALSE
I wear a hat                #[lmtu]$#     TRUE
I wear a hat                #^[lmtu]#     FALSE

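I guess you understand the first two regex. But I think the last three need an explanation:
  • "Too much sun, hot weather" with #h[aou]t$# is FALSE because the string does not end with "hat", "hot" or "hut" (it ends with "weather");
  • "I wear a hat" with #[lmtu]$# is TRUE because the string ends with one of the letters l, m, t or u (here, a "t");
  • "I wear a hat" with #^[lmtu]# is FALSE because the string does not begin with l, m, t or u (it begins with an uppercase "I").

Okay, I haven't lost you along the way?
If at any point you feel lost, feel free to reread what is above; it will do you no harm.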

Class intervals

This is the point where classes should begin to amaze you.

With the symbol "-" (hyphen), we can allow a whole range of characters.
For example, earlier we used the class [lmtu]. OK, that is not too long.
But what do you think about the class [abcdefghijklmnopqrstuvwxyz]? All that just to say that you want a letter?

I have better!
You are allowed to write [a-z]! Admit that it's shorter! And if you want to stop at the letter "e", no problem either: [a-e].
It also works with digits: [0-9]. If you'd rather have a digit between 1 and 8, type [1-8].

Even better! You can write two intervals at once in a class: [a-z0-9]. It means "any (lowercase) letter OR a digit".

Of course, you can also allow uppercase letters without using the option we saw earlier. That gives [a-zA-Z0-9], which is a shorter way of writing:

[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]

Let's do some tests:

String                                           Regex         Result
This sentence contains a letter                  #[a-z]#       TRUE
this sentence has neither uppercase nor number   #[A-Z0-9]#    FALSE
I live in the 21st century                       #^[0-9]#      FALSE
<h1>HTML title tag</h1>                          #<h[1-6]>#    TRUE

The last example is particularly interesting because we're slowly moving towards practical use. We check whether the string contains an HTML title tag (<h1>, <h2>, and so on up to <h6>).
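As an illustration of that last row, here is a small sketch that wraps the regex in a function. The name containsTitleTag is my own invention for the example, not something PHP provides.

```php
<?php
// Hypothetical helper: does the HTML contain an opening title tag, <h1> to <h6>?
function containsTitleTag(string $html): bool
{
    // [1-6] accepts exactly one digit between 1 and 6 after the "h".
    return preg_match('#<h[1-6]>#', $html) === 1;
}

echo containsTitleTag('<h1>HTML title tag</h1>') ? 'TRUE' : 'FALSE', "\n"; // TRUE
echo containsTitleTag('<h7>No such tag</h7>') ? 'TRUE' : 'FALSE', "\n";    // FALSE
```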

And to say what I do not want?

If you DO NOT want the characters that you list in your class, place the symbol "^" at the beginning of the class.

But!
I thought this character was used to indicate the beginning of a string?

Yes, but if you place it at the start of a class, it says that you DO NOT WANT what is listed in that class.

Thus, the following regex:

#[^0-9]#

... means you want your string to contain at least one character that is not a digit.

Now I am going to heat up your brains (table below).

String                                                           Regex           Result
This sentence contains a letter                                  #[^0-9]#        TRUE
this sentence contains other things than uppercase and numbers   #[^A-Z0-9]#     TRUE
This sentence does not begin with a lowercase                    #^[^a-z]#       TRUE
This sentence does not end with a vowel...hello                  #[^aeiouy]$#    FALSE
ScrrmmmblllGnngngnngnMmmmmffff                                   #[^aeiouy]#     TRUE
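A negated class is handy for rejecting a string as soon as one forbidden character shows up. The sketch below is one possible use, with a hypothetical helper name of my choosing (hasOnlyDigits).

```php
<?php
// Hypothetical helper: true when the string contains nothing but digits.
// The class [^0-9] matches any character that is NOT a digit, so finding
// a single such character is enough to reject the string.
function hasOnlyDigits(string $value): bool
{
    // An empty string contains no forbidden character, so exclude it explicitly.
    return $value !== '' && preg_match('#[^0-9]#', $value) === 0;
}

echo hasOnlyDigits('546781') ? 'TRUE' : 'FALSE', "\n"; // TRUE
echo hasOnlyDigits('54a781') ? 'TRUE' : 'FALSE', "\n"; // FALSE
```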

I advise you to take a little break, because it gets harder from here! We will now explore quantifiers, which let us handle repetition.

Quantifiers

Quantifiers are symbols used to say how many times a character or sequence of characters can be repeated.
For example, to recognize an e-mail address like john@gmail.com, we'll have to say: "It starts with one or more letters, followed by a @ (at sign), followed by at least two letters, followed by a period, and then two to four letters (for .com or .info (yes, it exists!))".

For now, our goal is not to write a regex that checks whether the e-mail address entered by the visitor is well formed (it is too early for that). The point is that it is essential to be able to say how many times a letter can be repeated!

The most common symbols

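You must remember three symbols:
  • ? (question mark): the letter is optional. It can appear 0 or 1 time. Thus, #a?# recognizes 0 or 1 "a";
  • + (plus sign): the letter is required. It can appear 1 or more times. Thus, #a+# recognizes "a", "aa", "aaa", "aaaa", etc.;
  • * (star): the letter is optional. It can appear 0, 1 or more times. Thus, #a*# recognizes "a", "aa", "aaa", "aaaa", etc. But if there is no "a", it works too!

Note that these symbols apply to the letter directly in front of them. We can allow the word "dog", whether singular or plural, with the regex #dogs?# (it works for "dog" and "dogs").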

So you can make a letter optional or let it repeat. I just showed you the case of "dogs", but it also works for a letter in the middle of a word, like this:

#gr?ave#

This code will recognize "grave" and "gave"!

And if I want a group of two or more letters to be repeated, how do I do it?

We must use parentheses. For example, to recognize "Ayayayayayay" (Speedy Gonzales' battle cry!), we write the following regex:

#Ay(ay)*#
This code will recognize "Ay", "Ayay", "Ayayay", "Ayayayay"…

You can use the symbol "|" inside the parentheses. For example, the regex #Ay(ay|oy)*# will return TRUE for "Ayayayoyayayayoyoyoyoyayoy"! It is "ay" OR "oy", repeated several times. Simple!

More good news: you can put a quantifier after a character class (you know, with the brackets!). So #[0-9]+# recognizes any number, as long as there is at least one digit!

Let's do some tests (next table).

String               Regex            Result
eeeee                #e+#             TRUE
ooo                  #u?#             TRUE
wonderfull           #[0-9]+#         FALSE
Yahoooooo            #^Yaho+$#        TRUE
Yahoooooo amazing!   #^Yaho+$#        FALSE
Blablablablabla      #^Bla(bla)*$#    TRUE

The last examples are very interesting. The regex #^Yaho+$# means that the whole string must be "Yah" followed by one or more "o". Thus "Yaho", "Yahoo", "Yahooo", etc. work... But you mustn't put anything before or after, since ^ and $ mark the beginning AND the end of the string.

The last regex allows the words "Bla", "Blabla", "Blablabla", etc. I used parentheses to indicate that "bla" can be repeated 0, 1 or more times.
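To watch the quantifiers at work, here is a minimal sketch that loops over a few of the strings discussed in this section. The $cases and $results arrays are only scaffolding for the example, and the list-style foreach assumes PHP 7.1 or later.

```php
<?php
// Each pair is [string to test, regex]; the examples come from the text above.
$cases = [
    ['dog',                '#^dogs?$#'],      // the "s" is optional
    ['dogs',               '#^dogs?$#'],
    ['Ayayoyay',           '#^Ay(ay|oy)*$#'], // a repeated group with OR
    ['Yahoooooo',          '#^Yaho+$#'],      // one or more "o"
    ['Yahoooooo amazing!', '#^Yaho+$#'],      // extra text after the word
];

$results = [];
foreach ($cases as [$string, $regex]) {
    $found = preg_match($regex, $string);
    $results[] = $found;
    echo $string, '  ', $regex, '  ', $found ? 'TRUE' : 'FALSE', "\n";
}
```

Only the last case prints FALSE, because of the text that follows the word.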

Be more precise with braces

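Sometimes we would like to say that a letter can be repeated four times, or four to six times... in short, we would like to be more specific about the number of repetitions.
This is where braces come in. You'll see: if you have understood the last examples, it will seem simple.

There are three ways to use braces:
  • {3}: with just a number, the letter (or the group of letters, if it is in parentheses) must be repeated exactly 3 times. Thus, #a{3}# works for "aaa";
  • {3,5}: here the letter can appear 3, 4 or 5 times. Thus, #a{3,5}# works for "aaa", "aaaa" and "aaaaa";
  • {3,}: with a comma but no second number, there is no upper limit. Here, the letter must appear at least 3 times.

If you pay attention, you will notice that:
  • ? is the same as writing {0,1};
  • + is the same as writing {1,};
  • * is the same as writing {0,}.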

Let's make a few examples, just to say we're ready (table below):

String         Regex              Result
eeeee          #e{2,}#            TRUE
Blablablabla   #^Bla(bla){4}$#    FALSE
546781         #^[0-9]{6}$#       TRUE
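Here is a short sketch of braces in PHP, reusing the six-digit test from the table and adding a couple of variations of my own (the variable names are illustrative).

```php
<?php
// Exactly six digits, nothing before or after.
$exactlySix = preg_match('#^[0-9]{6}$#', '546781'); // 1
$onlyFive   = preg_match('#^[0-9]{6}$#', '54678');  // 0: one digit short

// Between two and four "e".
$twoToFour = preg_match('#^e{2,4}$#', 'eee');       // 1
$fiveE     = preg_match('#^e{2,4}$#', 'eeeee');     // 0: one "e" too many

// At least two "e", no upper limit.
$atLeastTwo = preg_match('#^e{2,}$#', 'eeeee');     // 1
```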

Ok? Take a good break because ... in the next chapter, we mix everything we've just learned!

A matter of wildcards

To begin, and before going further, I need to bring a new concept to your attention: wildcards (you will more often see them called metacharacters).
It's not a programmer's insult, but a word that simply means "special characters". These are characters unlike the others, with a special role or meaning.

In the PCRE language, the wildcards you need to know are:

# ! ^ $ ( ) [ ] { } ? + * . \ |

You must remember them. You know most of them already.
Thus, the dollar "$" is a special character because it indicates the end of a string.
The same goes for the circumflex, the pound sign, parentheses, brackets, braces and the symbols "? + * |": we used them all in the previous chapter, remember.
As for the dot "." and the backslash "\", you do not know them yet, but you are going to learn them quickly.

Well, these are special characters and each of them means something specific. So what?

So the problem arises the day you want to search for, say, "What?" in a string.
How would you write the regex? Like this?

#What?#

Nope, certainly not! The question mark, as you know, says that the letter just before it is optional (it can appear 0 or 1 time). Here, the "t" before the question mark would be optional, which is not what we want!

So how do we search for "What?" if the question mark already has a meaning?
We have to escape it. This means placing a backslash "\" before the special character. Thus, the correct regex would be:

#What\?#

Here, the backslash says that the question mark is not a special symbol, but a character like any other!

It is the same for all the other wildcards I showed you earlier (# ! ^ $ ( ) [ ] { } ? + * . \ |): you have to put a backslash in front of them if you want to use them in your search.
Notice that to use a backslash you need... a backslash in front of it! Like this: \\.

All you need to remember is simple: if you want to use a special character in your search, place a backslash before it. Done.
Here are some examples:

String              Regex               Result
I am anxious!       #anxious\!#         TRUE
I am (very) tired   #\(very\) tired#    TRUE
I'm sleepy...       #sleepy\.\.\.#      TRUE
The smiley :-\      #:-\\#              TRUE
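In PHP code, escaping has one extra twist: the backslash is also special inside PHP strings. The sketch below shows escaping by hand, then PHP's preg_quote function, which adds the backslashes for you, and finally the literal backslash, which ends up written four times. The variable names are illustrative.

```php
<?php
// Escaping by hand: the question mark loses its special meaning.
$foundByHand = preg_match('#What\?#', 'What? Already?'); // 1

// preg_quote() escapes the special characters of a search string.
// Passing '#' as second argument makes sure our delimiter is escaped as well.
$search      = 'What?';
$pattern     = '#' . preg_quote($search, '#') . '#';    // gives #What\?#
$foundQuoted = preg_match($pattern, 'What? Already?');  // 1

// A literal backslash: the regex needs two backslashes, and inside a PHP
// string each of them must itself be doubled, hence four in the source code.
$foundSmiley = preg_match('#:-\\\\#', 'The smiley :-\\'); // 1
```

With only two backslashes in the PHP source, the regex engine would receive a single backslash in front of the closing pound sign and would complain that the ending delimiter is missing.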

The class case

There is one more little thing to see (yet another special case), and it concerns character classes.
So far you have put letters and digits inside brackets; for example:

#[a-z0-9]#

Yes, you guessed it: you are allowed to put other characters too, such as accented letters (but in that case you must list them one by one). For example: [a-zéèàêâùïüë] and so on. If your text is encoded in UTF-8, also add the "u" option so that each accented letter is treated as a single character.

So far, so good. But what if you also want to list special characters? For example, a question mark (at random). Well, there it does not count! No need to escape: inside brackets, wildcards lose their special meaning!
So this regex works very well:

#[a-z?+*{}]#

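It means that we allow a letter, a question mark, a plus sign, etc.

Three special cases, however:
  • "#" (pound sign): it still marks the end of the regex. To use it, you MUST put a backslash in front of it, even inside a character class;
  • "]" (closing bracket): normally it marks the end of the class. To use it inside a class, you must also put a backslash in front of it;
  • "-" (hyphen): it is used to define an interval. If you want the hyphen itself among the allowed characters, put it at the beginning or at the end of the class, for example [a-z0-9-].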

The abbreviated classes

The good news is that you are now ready to perform almost any regex you want.
The bad news is that I just said "almost".

Oh do not worry, it will not be long and you will not feel any pain (at this point, we no longer feel the pain anyway).
I just want to show you what are called abbreviated classes, which I call shortcuts.

Some of these shortcuts will not be essential to you, but since you may come across them one day or another, I do not want you to be surprised and think I hid things from you.

Here's what to remember:

Shortcut   Meaning
\d         A digit. Exactly the same as [0-9]
\D         NOT a digit. The same as [^0-9]
\w         An alphanumeric character or an underscore. Corresponds to [a-zA-Z0-9_]
\W         NOT a word character. Corresponds to [^a-zA-Z0-9_]
\t         A tab
\n         A new line
\r         A carriage return
\s         A whitespace character (space, \t, \n, \r)
\S         NOT a whitespace character
.          Any character. It allows everything!

These are normal letters, but when a backslash is placed before them they take on a special meaning.
It is the opposite of what we were doing earlier: there, we put a backslash before wildcards to take away their special meaning.

For the dot ".", there is an exception: it matches anything except a new line (\n).
To make the dot match everything, new lines included, you need the "s" option of PCRE. Example:
#[0-9]-.#s
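A quick sketch of both points, the \d shortcut and the "s" option. The test strings are mine, chosen only to make the difference visible.

```php
<?php
// \d{4} is the same as [0-9]{4}.
$yearWithClass    = preg_match('#^[0-9]{4}$#', '2014'); // 1
$yearWithShortcut = preg_match('#^\d{4}$#', '2014');    // 1

// A digit, a hyphen, then a new line (double quotes so that \n is a real new line).
$text = "7-\n";

// By default the dot does not match a new line...
$withoutS = preg_match('#[0-9]-.#', $text);  // 0
// ...but with the "s" option it does.
$withS    = preg_match('#[0-9]-.#s', $text); // 1
```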

Come on, this time you know enough, let's practice!

Build a complete regex

We will build big regex together, so that you understand the method. Then you will be quite capable of inventing your own regex and use them for your PHP scripts!

A phone number

For this first real regex, we'll check whether a variable (entered by a visitor via a form, for example) looks like a phone number.
I will base myself on French telephone numbers, so excuse me if you are not French. The advantage is that you can then practice by writing this regex for phone numbers in your own country.

To recap (and for those who do not know), a French phone number has 10 digits. For example: "01 52 45 18 62". It must respect the following rules:
  • the first digit is ALWAYS a 0;
  • the second digit goes from 1 to 6, plus 8 for special numbers;
  • then come the remaining 8 digits, each from 0 to 9.

For starters, and to keep it simple, we will assume that the user enters the phone number without spaces or anything else.
So the phone number should look like this: "0152451862". How do we write a regex that matches such a number?

Here is how I proceed, to build this regex:

  1. First, we want the string to contain ONLY the phone number. So we start by putting the ^ and $ symbols to indicate the beginning and the end of the string:
    #^$#

  2. Second, we know that the first character is always a 0. So we type:
    #^0$#

  3. The 0 is followed by a digit from 1 to 6, not forgetting the 8 for special numbers. So we use the class [1-68], which means "a digit from 1 to 6 OR the 8":
    #^0[1-68]$#

  4. Then come the remaining 8 digits, each from 0 to 9. So we just write [0-9]{8} to indicate that we want 8 digits:
    #^0[1-68][0-9]{8}$#

And that's it!

Now we will assume that the person can type a space every two digits (as is common in France), but also a period or a hyphen. Our regex must therefore accept telephone numbers such as:
  • 0152451862
  • 01 52 45 18 62
  • 01-52-45-18-62
  • 01.52.45.18.62

This is where the power of regex comes in!
The possibilities are numerous, and yet you just need to write one regex that covers them.

Let's get back to building our regex:

  1. First, the 0 and the number from 1 to 6 without forgetting the 8. That does not change:
    #^0[1-68]$#

  2. After the first two numbers, there may be a space or a hyphen or a dot, or nothing at all (if the digits are attached). So we will use the class [-. ] (dash, dot, space).
    But how do you say that the dot (or the dash, or the space) is not compulsory? With the question mark, of course! It gives us: #^0[1-68][-. ]?$#

  3. After the first dash (or dot, or space, or nothing), we have the next two digits. We must add [0-9]{2} to our regex:
    #^0[1-68][-. ]?[0-9]{2}$#

  4. And now, think. There is a way to finish quickly: we just need to say that "[-. ]?[0-9]{2}" must be repeated four times, and this regex is over! We will use parentheses to enclose the whole and place a {4} just after indicating that all this must be repeated four times. This gives us finally:

#^0[1-68]([-. ]?[0-9]{2}){4}$#

Here is a small script that I put together quickly, so you can test the power of regex:

<p>
<?php
if (isset($_POST['telephone']) && is_string($_POST['telephone'])) {
	// Validate the raw input; escape it only for display.
	$telephone = $_POST['telephone'];
	$display = htmlspecialchars($telephone);

	// Single quotes keep PHP from interpreting the "$" inside the regex.
	if (preg_match('#^0[1-68]([-. ]?[0-9]{2}){4}$#', $telephone)) {
		echo $display . ' is a <strong>valid</strong> number!';
	} else {
		echo $display . ' is not valid!';
	}
}
?>
</p>

<form method="post">
<p>
    <label for="telephone">Your telephone?</label> <input id="telephone" name="telephone" /><br />
    <input type="submit" value="Check the number" />
</p>
</form>

You can try all the phone numbers you want, with or without separators in the middle: the regex handles all these cases.

You could also have used the shortcut \d to indicate a digit in your regex:
#^0[1-68]([-. ]?\d{2}){4}$#
Personally, I find that [0-9] is clearer.
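If you prefer to test without a form, the same regex can be wrapped in a function and run against a handful of sample numbers. This is only a sketch: the function name isFrenchPhoneNumber and the sample list are my own, not part of the original script.

```php
<?php
// Hypothetical helper wrapping the regex built above.
function isFrenchPhoneNumber(string $number): bool
{
    return preg_match('#^0[1-68]([-. ]?[0-9]{2}){4}$#', $number) === 1;
}

$samples = [
    '0152451862',     // digits only
    '01 52 45 18 62', // spaces
    '01-52-45-18-62', // hyphens
    '01.52.45.18.62', // dots
    '0152 45.18-62',  // mixed separators are accepted too
    '015245186',      // only 9 digits
    '1152451862',     // does not start with 0
];

foreach ($samples as $sample) {
    echo $sample, ' => ', isFrenchPhoneNumber($sample) ? 'valid' : 'not valid', "\n";
}
```

The first five samples are reported as valid and the last two as not valid.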

An E-mail address

Here is a second example that will certainly be useful to you: testing whether an email address is valid or not.

So, before we start anything, and to be clear, let me recall how an email address is built:

  1. First, we have the username (at least one character, though a single-character one is rather rare). It may contain lowercase letters (no caps), digits, dots, dashes and underscores "_".

  2. Second, there is the "at" sign: @.

  3. Then there is the domain name. The same rule applies as for the username: only lowercase letters, digits, dashes, dots and underscores. The only difference, which you could not necessarily guess, is that it has at least two characters (for example "a.com" does not exist, but "aa.com" does).

  4. Finally, there is an extension (such as ".com"). This extension is a dot followed by two to four lowercase letters. Indeed, there are ".es", ".de", ".fr" but also ".net", ".org", ".info", etc.

The email address can look like p.smith_2@gmail.com. Keep in mind that these rules are deliberately simplified: real addresses can contain uppercase letters, and some extensions are longer than four letters.

Let's build the regex.

  1. First, as before, we want ONLY an e-mail address, so we ask for a beginning and an end of string:
    #^$#

  2. Second, we have letters, digits, dashes, dots and underscores, at least once. So we use the class [a-z0-9._-] followed by the + sign to require at least one character:
    #^[a-z0-9._-]+$#

  3. Then comes the "at" sign (nothing complicated, we just type the character):
    #^[a-z0-9._-]+@$#

  4. Then again a sequence of letters, digits, dots, hyphens and underscores, at least twice. So we type {2,} to say "two or more times":
    #^[a-z0-9._-]+@[a-z0-9._-]{2,}$#

  5. Then comes the dot (the one in ".com", for example). As I told you earlier, this is a special character that means "any character" (even accented ones). But here we want to take away that meaning and say that we want an actual dot in our regex. So we put a backslash before it:
    #^[a-z0-9._-]+@[a-z0-9._-]{2,}\.$#

  6. To conclude, we have two to four letters. These are necessarily lowercase letters, and this time no digits, dashes, etc. We write:

#^[a-z0-9._-]+@[a-z0-9._-]{2,}\.[a-z]{2,4}$#
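Since this regex follows deliberately simple rules, it can be instructive to compare it with PHP's built-in filter_var function and its FILTER_VALIDATE_EMAIL filter. The sketch below does that on two addresses; the second address (with a six-letter extension) is an example of my own, picked to show where the two approaches disagree.

```php
<?php
// The regex built step by step above.
$pattern = '#^[a-z0-9._-]+@[a-z0-9._-]{2,}\.[a-z]{2,4}$#';

$address = 'p.smith_2@gmail.com';
$regexSaysValid  = preg_match($pattern, $address) === 1;                      // true
// filter_var() returns the address itself when it is valid, false otherwise.
$filterSaysValid = filter_var($address, FILTER_VALIDATE_EMAIL) !== false;     // true

// An extension longer than four letters: rejected by our simple regex,
// accepted by the built-in filter.
$longExtension = 'curator@example.museum';
$regexOnLong  = preg_match($pattern, $longExtension) === 1;                   // false
$filterOnLong = filter_var($longExtension, FILTER_VALIDATE_EMAIL) !== false;  // true
```

Writing the regex by hand remains a good exercise; for a real form, the built-in filter is probably the safer choice.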

Here is the PHP script to test this regex:

<p>
<?php
if (isset($_POST['mail']) && is_string($_POST['mail'])) {
	// Validate the raw input; escape it only for display.
	$mail = $_POST['mail'];
	$display = htmlspecialchars($mail);

	// Single quotes keep PHP from interpreting the "$" and the backslash.
	if (preg_match('#^[a-z0-9._-]+@[a-z0-9._-]{2,}\.[a-z]{2,4}$#', $mail)) {
		echo 'The address ' . $display . ' is <strong>valid</strong>!';
	} else {
		echo 'The address ' . $display . ' is not valid!';
	}
}
?>
</p>

<form method="post">
<p>
    <label for="mail">Your email?</label> <input id="mail" name="mail" /><br /> 
    <input type="submit" value="Check the email" />
</p>
</form>

I just want to show you one last thing before we go to the last important concept that we will discuss (capture and replacement).

Regex ... with MySQL!

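You have just learned to write regex, so you have almost nothing more to learn to use them with MySQL.
Be aware, however, that MySQL understands the POSIX regex language, not PCRE as we learned.

You just need to remember the following to write a POSIX regex:
  • there are no delimiters and no options: the regex is simply written between quotes;
  • there are no abbreviated classes, so no \d and so on. You can still use the dot to mean "any character".

The best, of course, is always a good example. Suppose you have stored the IP addresses of your visitors in a "visitors" table and you want the names of the visitors whose IP starts with "84.254":

SELECT name FROM visitors WHERE ip REGEXP '^84\\.254(\\.[0-9]{1,3}){2}$'

Notice the doubled backslashes: MySQL processes backslash escapes in string literals itself, so \\. is what delivers \. to the regex engine.

This means: select all the names in the table "visitors" whose IP starts with "84.254" and ends with two numbers of one to three digits (e.g. 84.254.6.177).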

The power of regex in a MySQL query to make a very specific search ... We can't refuse it!
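This particular pattern only uses syntax that POSIX and PCRE have in common, so its logic can be checked in PHP before it goes into a query. The sketch below does just that; it does not talk to MySQL at all, and the sample IP addresses other than 84.254.6.177 are mine.

```php
<?php
// The pattern from the SQL query, with PCRE delimiters added.
// In a PHP single-quoted string a single backslash before the dot is enough.
$pattern = '#^84\.254(\.[0-9]{1,3}){2}$#';

$inRange    = preg_match($pattern, '84.254.6.177'); // 1: starts with 84.254
$otherRange = preg_match($pattern, '84.25.6.177');  // 0: 84.25 is not 84.254
$incomplete = preg_match($pattern, '84.254.6');     // 0: one number is missing
```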

Now let's talk about the last important concept with regex: "capture and replacement"!

Capture and replacement

I told you at the beginning that regex are used to make powerful searches, but also to search and replace.
This will allow us, for example, to do the following:

  1. look for email addresses in a message left by a visitor;
  2. automatically modify the message to put a link <a href="mailto:someone@gmail.com"> around each address, which makes the email addresses clickable!

With this technique, we can do the same to make http:// links automatically clickable too. We can also, as you will see, create our own simplified language for the visitor, such as the famous bbCode used on most forums ([b][/b] for bold, does that ring a bell?).

The capturing parentheses

Everything we will see now revolves around parentheses. You have already used them to surround a portion of your regex and say that it had to be repeated four times, for example (as we did for the phone number).
That is the first use of parentheses, but they can also be used for something else.

From now on, we will work with the function preg_replace.
It is with this function that we can perform what is called a "capture".

What you should know is that every pair of parentheses creates a "variable" containing what it surrounds.
Let me explain with a regex:

#\[b\](.+)\[/b\]#

It means "Search for a [b] in the string, followed by one or more characters, followed by a [/b]".

I had to put a backslash "\" before the brackets so that the regex engine does not confuse them with a character class (such as [a-z]).

Every pair of parentheses creates a variable: $1 for the first one, $2 for the second, and so on. There is also $0, which always exists and holds the whole text matched by the regex.
We will then use these variables to modify the string (to make the replacement).

In the regex I just showed you, there is only one pair of parentheses, you agree? So there will be just one variable, $1, which will contain what is between [b] and [/b]. That is the text we want to put in bold.

Now, I'll show you how to put in bold everything between [b] and [/b]:

<?php
$text = preg_replace('#\[b\](.+)\[/b\]#i', '<strong>$1</strong>', $text);
?>

Here's how to use preg_replace. It takes three parameters:

  1. the regex;
  2. the replacement text;
  3. the text in which to search and replace.

Here, I added the "i" option, so the code also works with uppercase tags ([B][/B]).

Between the HTML <strong> tags, I put $1. This means that what is in the capturing parentheses (between [b] and [/b]) will be surrounded by <strong> tags!

The preg_replace function returns the resulting text after the replacements.
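Here is a small runnable sketch of capturing in action. The first call is the [b] example from above applied to a sample sentence; the second, with two pairs of parentheses swapped in the replacement, is an extra illustration of my own and not part of the bbCode parser.

```php
<?php
// One pair of parentheses: $1 holds the text between [b] and [/b].
$text = 'I [b]love[/b] playing guitar';
$bold = preg_replace('#\[b\](.+)\[/b\]#i', '<strong>$1</strong>', $text);
echo $bold, "\n"; // I <strong>love</strong> playing guitar

// Two pairs of parentheses: $1 and $2 can be reordered in the replacement.
$name    = 'Gonzales, Speedy';
$swapped = preg_replace('#^([a-z]+), ([a-z]+)$#i', '$2 $1', $name);
echo $swapped, "\n"; // Speedy Gonzales
```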

Create your bbCode

We can now move on to practice and learn to use capturing parentheses.

We will build what is called a parser.
The parser converts the text written by a visitor (a message on a forum, in a guestbook or even in a mini-chat!) into harmless text (HTML tags are neutralized by htmlspecialchars) which nevertheless accepts bbCode!
We will not handle all existing bbCode (it would take too long), but to practice, we will use:
  • [b][/b]: to put text in bold;
  • [i][/i]: to put text in italics;
  • [color=red][/color]: to color text (we will offer a choice of several colors).

And we'll also make sure that URLs (http://) are automatically replaced by clickable links.

Let's start with [b] and [i] (they work the same way).
You have already seen the code for [b], and it is indeed almost right. There is a problem though: options are missing. To make it work, we need three options:
  • i: to accept uppercase as well as lowercase tags ([B] and [b]);
  • s: so that the dot also matches new lines (the text in bold can then span several lines);
  • U: the "Ungreedy" option, which makes the regex stop as soon as possible. Take this sentence:

"This text is [b]important[/b], please [b]understand[/b] me!"... Without the Ungreedy option, the regex would put in bold everything between the first [b] and the last [/b] (that is, "important[/b], please [b]understand"). With "U", the regex stops at the first [/b], and that's what we want.

Here is the correct code for bold and italic with bbCode:

<?php
$text = preg_replace('#\[b\](.+)\[/b\]#isU', '<strong>$1</strong>', $text);
$text = preg_replace('#\[i\](.+)\[/i\]#isU', '<em>$1</em>', $text);
?>

Now a slightly more complex case, the tag [color=stuff]. We'll give a choice of colors with the symbol "|" (OR), and we will use two capturing parentheses:

  1. the first to retrieve the name of the color that was chosen;
  2. the second to retrieve the text between [color=stuff] and [/color] (such as bold and italic).

Here is the result:

<?php
$text = preg_replace('#\[color=(red|green|blue|yellow|purple|olive)\](.+)\[/color\]#isU', '<span style="color:$1">$2</span>', $text);
?>

For example, if you type [color=blue]text[/color], the text will be written in blue. You can try with the other colors too!
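All the bbCode replacements above rely on the "U" (Ungreedy) option. To see what it changes, the sketch below runs the [b] regex on a sentence containing two bold passages, first without "U", then with it. The third variant, which writes the quantifier as .+? instead of using the option, is an equivalent alternative that I am adding for comparison.

```php
<?php
$sentence = 'This text is [b]important[/b], please [b]understand[/b] me!';

// Greedy (no "U"): the match runs from the first [b] to the LAST [/b].
$greedy = preg_replace('#\[b\](.+)\[/b\]#is', '<strong>$1</strong>', $sentence);
// This text is <strong>important[/b], please [b]understand</strong> me!

// Ungreedy ("U"): the match stops at the FIRST [/b], so both passages are handled.
$ungreedy = preg_replace('#\[b\](.+)\[/b\]#isU', '<strong>$1</strong>', $sentence);
// This text is <strong>important</strong>, please <strong>understand</strong> me!

// Same effect without the option: a "?" after the quantifier makes it lazy.
$lazy = preg_replace('#\[b\](.+?)\[/b\]#is', '<strong>$1</strong>', $sentence);

echo $greedy, "\n", $ungreedy, "\n", $lazy, "\n";
```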

Come on, last step, and then I'll let you try.
I want http:// links to be automatically converted into clickable links.

Here's the solution:

<?php
$text = preg_replace('#http://[a-z0-9._/-]+#i', '<a href="$0">$0</a>', $text);
?>
In the replacement text, I used $0, which holds all the text matched by the regex (so here, the whole URL).
There are no "s" and "U" options because a URL never contains a line break, and the "Ungreedy" mode is not needed here.
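Here is the URL replacement applied to a sample sentence. The second call is a variation of my own, not part of the original parser: it assumes you also want to accept https:// links, which only takes an optional "s" in the regex.

```php
<?php
// The regex from above: http:// followed by letters, digits, dots, slashes, etc.
$text   = 'Visit http://www.tipocode.com today';
$linked = preg_replace('#http://[a-z0-9._/-]+#i', '<a href="$0">$0</a>', $text);
echo $linked, "\n";
// Visit <a href="http://www.tipocode.com">http://www.tipocode.com</a> today

// Assumption: accepting https:// as well, thanks to the optional "s".
$secureText   = 'See https://www.tipocode.com/php';
$secureLinked = preg_replace('#https?://[a-z0-9._/-]+#i', '<a href="$0">$0</a>', $secureText);
echo $secureLinked, "\n";
// See <a href="https://www.tipocode.com/php">https://www.tipocode.com/php</a>
```

Note that a URL placed right before a final period would swallow that period, since the dot is part of the allowed class.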

Now let's put together our complete bbCode parser:

<?php
if (isset($_POST['text']) && is_string($_POST['text'])) {
	// Neutralize HTML first, then turn line breaks into <br /> tags.
	// No stripslashes() here: it was only useful with magic quotes, removed in PHP 5.4.
	$text = htmlspecialchars($_POST['text']);
	$text = nl2br($text);
    
	// bbCode tags: bold, italic, color.
	$text = preg_replace('#\[b\](.+)\[/b\]#isU', '<strong>$1</strong>', $text);
	$text = preg_replace('#\[i\](.+)\[/i\]#isU', '<em>$1</em>', $text);
	$text = preg_replace('#\[color=(red|green|blue|yellow|purple|olive)\](.+)\[/color\]#isU', '<span style="color:$1">$2</span>', $text);
	// Clickable http:// links.
	$text = preg_replace('#http://[a-z0-9._/-]+#i', '<a href="$0">$0</a>', $text);

	echo $text . '<br /><hr />';
}
?>

<p>Have fun. Type for example :</p>

<blockquote style="font-size:0.8em">
<p>
	Hello how are you [b]today[/b]? It is a [i]sunny[/i] day.<br />
	Please [b][color=green]visit[/color][/b] my [i][color=purple]beautiful website[/color][/i]: http://www.tipocode.com
</p>
</blockquote>

<form method="post">
<p>
    <label for="text">Your message:</label><br />
    <textarea id="text" name="text" cols="50" rows="8"></textarea><br />
    <input type="submit" value="Show me all the power of regex" />
</p>
</form>

Do not hesitate to practice and improve this code snippet.
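As a starting point for experimenting, here is a sketch that moves the same replacements into a function, so that the parser can be tested without a form. The function name parseBbCode is hypothetical; the regex are the ones from the script above.

```php
<?php
// Hypothetical wrapper around the parser above, so it can be reused and tested.
function parseBbCode(string $text): string
{
    // Neutralize HTML first, then turn line breaks into <br /> tags.
    $text = htmlspecialchars($text);
    $text = nl2br($text);

    // bbCode tags: bold, italic, color.
    $text = preg_replace('#\[b\](.+)\[/b\]#isU', '<strong>$1</strong>', $text);
    $text = preg_replace('#\[i\](.+)\[/i\]#isU', '<em>$1</em>', $text);
    $text = preg_replace('#\[color=(red|green|blue|yellow|purple|olive)\](.+)\[/color\]#isU', '<span style="color:$1">$2</span>', $text);

    // Clickable http:// links.
    $text = preg_replace('#http://[a-z0-9._/-]+#i', '<a href="$0">$0</a>', $text);

    return $text;
}

// Nested tags work because each replacement runs on the result of the previous one.
$html = parseBbCode('Please [b][color=green]visit[/color][/b] <now>');
echo $html, "\n";
// Please <strong><span style="color:green">visit</span></strong> &lt;now&gt;
```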

That's it for the regex. Hope you had fun with this tutorial. Happy coding...

About Simon Laroche: I am a coder, designer, webmaster and SEO consulting expert. I'm also a wise traveller and an avid amateur photographer. I created the website TipoCode and many others, such as Landolia: a World of Photos...
