Tapping the Internet Database
By Matt Hart
This article first appeared in Visual Developer Magazine.
(Note: This is my unedited original and may differ slightly from the published version.)
SYSTEM REQUIREMENTS: VB 5, VB 6, Professional or Enterprise
LEVEL: Intermediate Programmers
One of my all-time favorite books is The Long Run by Daniel Keys Moran. The hero, Trent, is able to outwit the bad guys by using his skills as a "Player", or super-hacker, to access the global "InfoNet", gathering intelligence and crashing their systems. While our Internet hasn't quite caught up with science fiction, we do have the tools to tap it like a vast database. In this article, I'll explain how to use the Microsoft Internet Transfer Control (Inet) to query Internet database engines and parse the returned information for use in any application.
Table 1 – Inet control methods that retrieve data
| Method | Description | Usage |
| Execute | Runs an FTP or HTTP command | Inet1.Execute , "POST", aData |
| GetChunk | Retrieves data after an Execute | aData = Inet1.GetChunk(1024, icString) |
| OpenURL | Loads a URL | aData = Inet1.OpenURL("www.visual-developer.com") |
There are several ways to retrieve data using the Inet control (see Table 1). The OpenURL method is the easiest to use. It can retrieve information from any database that accepts its query parameters in the URL, and it returns the information as a string. For instance, the following code would query the government's consumer information database about vehicles.
Dim aDat As String
aDat = Inet1.OpenURL("http://www.info.gov/cgi-bin/web_evaluate.CONSUMER" & _
"?query=vehicles&dataset=ws.dst")
The parameters begin after the question mark. Each parameter has a name (query, dataset) and a value (vehicles, ws.dst). This query was originally longer, but I pulled out the two essential parameters and it worked with those. Some query engines will require every parameter.
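When the parameter values come from user input rather than from constants, the URL has to be assembled at run time. The following is a minimal sketch of one way to do that. The function names UrlEncode and AddQueryParam are hypothetical helpers, not part of the Inet control, and the encoder assumes single-byte (ANSI) characters.

```vb
' Percent-encode a value for use in a URL query string.
' Letters, digits and - . _ are kept, a space becomes "+",
' and every other character becomes %XX.
' Assumption: single-byte (ANSI) text only.
Public Function UrlEncode(ByVal sText As String) As String
    Dim i As Long
    Dim nCode As Integer
    Dim sOut As String

    For i = 1 To Len(sText)
        nCode = Asc(Mid$(sText, i, 1))
        Select Case nCode
            Case 48 To 57, 65 To 90, 97 To 122, 45, 46, 95
                sOut = sOut & Chr$(nCode)      ' safe character
            Case 32
                sOut = sOut & "+"              ' space
            Case Else
                sOut = sOut & "%" & Right$("0" & Hex$(nCode), 2)
        End Select
    Next i
    UrlEncode = sOut
End Function

' Append one name=value pair to a URL, using "?" for the first
' parameter and "&" for every parameter after it.
Public Function AddQueryParam(ByVal sUrl As String, _
                              ByVal sName As String, _
                              ByVal sValue As String) As String
    Dim sSep As String

    If InStr(sUrl, "?") = 0 Then
        sSep = "?"
    Else
        sSep = "&"
    End If
    AddQueryParam = sUrl & sSep & UrlEncode(sName) & "=" & UrlEncode(sValue)
End Function
```

Calling AddQueryParam twice on the base URL, first with "query" and "vehicles" and then with "dataset" and "ws.dst", rebuilds the URL shown in the listing above. The result can be passed to Inet1.OpenURL.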
Query Engines
A query is usually made by typing information into a text box and clicking a Submit button. Load the database web page into your browser and view the page's source. Find the <form> tag and its method and action attributes. The action attribute gives the URL that receives the query, and the method attribute tells the browser how to submit the information, usually GET or POST. The GET method is what the consumer information database uses. All of the form's object names and the values they contain are placed after the actual URL. The parameters begin with a question mark and are separated by ampersands (see Sidebar).
Sidebar – HTML form tags define the data that is sent to the query engine.
| HTML | Meaning |
| <form method="POST" action="/cgi-bin/zip4/zip4inq"> | The form’s posting method: method="POST". The URL to post to: action="/cgi-bin/zip4/zip4inq" |
| <input name="city"> | TextBox named "city" |
| <input name="state"> | TextBox named "state" |
| <input name="submit"> | Button that will perform the POST method |
| </form> | Defines the end of the form |
Posted forms build a data block containing the input names and their values, separated by an ampersand "&" character. This data is sent after the URL request. The data block for the above form would be:
http://www.searchurl.com/cgi-bin/query?address=entry&city=entry&state=entry
The POST method is also quite common but more complex to use. The query data is not placed on the URL line; it is sent to the query engine in a data packet that follows the URL request. Parameters are normally separated by ampersands, although a query engine can define its own separator characters. The Execute method of the Inet control is used to accomplish a POST. In addition to the URL and the POST query data, you also need to tell the query engine the "language" (the encoding) that the Inet control is using. This is passed as a request header in the fourth argument of Execute: "Content-Type: application/x-www-form-urlencoded".
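A POST along these lines could be sketched as follows. This is an illustration under stated assumptions, not the ZIPCheck source: the control names txtCity, txtState and cmdLookup, the helper functions, and the host name www.example.com are all hypothetical, while the field names city and state are taken from the sidebar form.

```vb
' Assumes this code lives in a form that holds an Internet Transfer
' Control named Inet1, text boxes named txtCity and txtState, and a
' button named cmdLookup (all hypothetical names).

' Replace spaces with "+" without using Replace, which VB5 lacks.
' A full encoder would also escape &, =, % and other reserved characters.
Private Function SpacesToPlus(ByVal sText As String) As String
    Dim lPos As Long

    lPos = InStr(sText, " ")
    Do While lPos > 0
        Mid$(sText, lPos, 1) = "+"
        lPos = InStr(lPos + 1, sText, " ")
    Loop
    SpacesToPlus = sText
End Function

' Build the data block: name=value pairs joined by "&".
Private Function BuildPostData(ByVal sCity As String, _
                               ByVal sState As String) As String
    BuildPostData = "city=" & SpacesToPlus(sCity) & _
                    "&state=" & SpacesToPlus(sState)
End Function

Private Sub cmdLookup_Click()
    Dim sHeaders As String

    ' Do not start a new request while one is still running.
    If Inet1.StillExecuting Then Exit Sub

    sHeaders = "Content-Type: application/x-www-form-urlencoded" & vbCrLf

    ' Placeholder host; substitute the server that hosts the form.
    ' Execute returns at once; the reply arrives via StateChanged.
    Inet1.Execute "http://www.example.com/cgi-bin/zip4/zip4inq", "POST", _
                  BuildPostData(txtCity.Text, txtState.Text), sHeaders
End Sub
```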
Gathering information from the Internet is tricky because the data is mixed with HTML tags and sometimes JavaScript or VBScript code. Examine the submission forms to learn how to submit queries, and look at the pages returned by the query engine to figure out how to extract the needed data. Both the submission forms and the format of the returned data can and will change, so plan to make minor updates as they do.
There are several VB functions that will get a good workout. The InStr function searches for a string within another string. The Left and Mid functions extract or truncate strings. VB6 added the Split function to easily parse strings, but you can emulate its capability in VB5 with the three other functions. The easiest way to arrange the returned HTML page and extract the data is to first eliminate all unneeded information that occurs before and after the data. Then remove all HTML tags and scripting code.
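As an illustration of that emulation, here is a minimal sketch of a Split substitute built only from InStr, Left$ and Mid$. The name SplitToArray is hypothetical. Because VB5 functions cannot return arrays, the pieces are written into a dynamic array supplied by the caller.

```vb
' Fills sParts() with the pieces of sText that lie between
' occurrences of sDelim and returns the number of pieces.
' sParts must be declared by the caller as a dynamic array,
' for example: Dim sParts() As String
Public Function SplitToArray(ByVal sText As String, _
                             ByVal sDelim As String, _
                             sParts() As String) As Long
    Dim lCount As Long
    Dim lPos As Long

    Erase sParts                        ' start from an empty array

    If Len(sDelim) = 0 Then             ' nothing to split on
        ReDim sParts(0 To 0)
        sParts(0) = sText
        SplitToArray = 1
        Exit Function
    End If

    Do
        lPos = InStr(sText, sDelim)
        ReDim Preserve sParts(0 To lCount)
        If lPos = 0 Then
            sParts(lCount) = sText      ' last piece
        Else
            sParts(lCount) = Left$(sText, lPos - 1)
            sText = Mid$(sText, lPos + Len(sDelim))
        End If
        lCount = lCount + 1
    Loop While lPos > 0

    SplitToArray = lCount
End Function
```

Growing the array with ReDim Preserve keeps the routine to a single pass; for very long inputs, counting the delimiters first and dimensioning the array once would avoid the repeated reallocation.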
Trial and Error
Parsing HTML pages for data is like trying to read an infinite number of languages. Every page has its own unique way of rendering its information, and often more than one way on the same page. A little trial and error will be necessary before the data can be extracted correctly every time. Well, at least almost every time.
The ZIPCheck application queries the USPS database. The query data is gathered from the various text boxes and concatenated into a single string, with an ampersand separating each parameter. Note that unlike the OpenURL method, the Execute method of the Inet control returns immediately; it doesn't wait until the command sent to the URL is finished. Instead, progress is reported to your program via the StateChanged event. Any errors or status changes, including a completed query, are indicated by the State parameter of that event. ZIPCheck's StateChanged handler retrieves the data returned by the USPS server, but it doesn't handle all of the possible State values. Download the sample code listings for this issue to see the various states. The algorithm for extracting the data starts with retrieving the HTML page. This is done by a loop that repeatedly calls the GetChunk method and appends each returned chunk of data to the strData string.
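A StateChanged handler of the kind described might look like the sketch below. It is not the ZIPCheck listing: it assumes a control named Inet1, handles only the completed and error states, and stores the page in a module-level variable named mstrData (a hypothetical name).

```vb
' Assumes a form with an Internet Transfer Control named Inet1.
Private mstrData As String   ' receives the complete HTML page

Private Sub Inet1_StateChanged(ByVal State As Integer)
    Dim vtChunk As Variant

    Select Case State
        Case icResponseCompleted
            ' The server has finished; read the reply 1K at a time
            ' until GetChunk returns an empty string.
            mstrData = ""
            vtChunk = Inet1.GetChunk(1024, icString)
            Do While Len(vtChunk) > 0
                mstrData = mstrData & vtChunk
                DoEvents
                vtChunk = Inet1.GetChunk(1024, icString)
            Loop
            ' mstrData is now ready to be parsed.

        Case icError
            MsgBox "Request failed: " & Inet1.ResponseCode & _
                   " " & Inet1.ResponseInfo
    End Select
End Sub
```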
Next, a starting point for the data is found. The program sets the lSrch1 variable to the position of the first <b> tag (bold) that follows the first occurrence of the <hr> tag (horizontal rule, or line) on the page. If the program can't find that sequence of tags, it assumes that the server did not return an address.
A table follows the information we want, so the next step finds the ending point of the data by assigning the start of that table to the lSrch2 variable. Then the needed information is extracted to the variable aDat. Most of the extraneous HTML code has now been eliminated, but aDat still contains a few HTML tags.
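The two searches and the extraction can be sketched as one function. This is an approximation of the steps just described rather than the ZIPCheck code; the function name is hypothetical, and the searches are made case-insensitive on the assumption that the server may emit tags in either case.

```vb
' Returns the text from the first <b> that follows the first <hr>
' up to (but not including) the next <table>. Returns "" when that
' sequence of tags is not found, meaning no address was returned.
Public Function ExtractAddressBlock(ByVal sPage As String) As String
    Dim lHr As Long
    Dim lStart As Long
    Dim lEnd As Long

    lHr = InStr(1, sPage, "<hr", vbTextCompare)
    If lHr = 0 Then Exit Function

    lStart = InStr(lHr, sPage, "<b>", vbTextCompare)
    If lStart = 0 Then Exit Function

    lEnd = InStr(lStart, sPage, "<table", vbTextCompare)
    If lEnd = 0 Then Exit Function

    ExtractAddressBlock = Mid$(sPage, lStart, lEnd - lStart)
End Function
```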
Search and Destroy
To remove those tags, the program searches for a less-than symbol "<":
lSrch1 = InStr(aDat, "<")
This indicates the start of a tag. Inside the loop, a second search, beginning at that position so that a stray ">" earlier in the text is ignored, finds the character indicating the end of the tag - the greater-than symbol ">":
lSrch2 = InStr(lSrch1, aDat, ">")
Then the tag is removed from the aDat string:
aDat = Left$(aDat, lSrch1 - 1) & Mid$(aDat, lSrch2 + 1)
The loop continues until all tags have been removed.
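Put together, the tag-removal loop could be written as a reusable function along these lines. StripTags is a hypothetical name, and the sketch makes no attempt to handle a ">" inside an attribute value or to remove the contents of script blocks.

```vb
' Removes everything between each "<" and the next ">" after it.
Public Function StripTags(ByVal sHtml As String) As String
    Dim lOpen As Long
    Dim lClose As Long

    lOpen = InStr(sHtml, "<")
    Do While lOpen > 0
        ' Look for the closing bracket only after the opening one.
        lClose = InStr(lOpen, sHtml, ">")
        If lClose = 0 Then Exit Do      ' unterminated tag: stop here
        sHtml = Left$(sHtml, lOpen - 1) & Mid$(sHtml, lClose + 1)
        ' Continue from where the removed tag used to start.
        lOpen = InStr(lOpen, sHtml, "<")
    Loop
    StripTags = sHtml
End Function
```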
Finally, the data itself is extracted. A line feed character separates each line of data. The program first counts the number of lines so that it can create an array to receive the data, then it extracts the data. It first finds the endpoint of a line:
lSrch2 = InStr(aDat, vbLf)
Then it assigns the line to an array element:
alines(lSrch1) = Trim$(Left$(aDat, lSrch2 - 1))
Finally, the line that was just processed is removed:
aDat = Mid$(aDat, lSrch2 + 1)
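The count-then-extract sequence can be sketched as a single routine. This is an illustration rather than the ZIPCheck source: the name LinesToArray is hypothetical, the array is zero-based, and the sketch additionally drops a trailing carriage return from each line on the assumption that some servers send CR+LF pairs.

```vb
' Splits sData on line feeds into the caller's dynamic array
' and returns the number of lines stored.
Public Function LinesToArray(ByVal sData As String, _
                             sLines() As String) As Long
    Dim lCount As Long
    Dim lPos As Long
    Dim i As Long
    Dim sLine As String

    ' Terminate the last line so that every line ends in vbLf.
    If Right$(sData, 1) <> vbLf Then sData = sData & vbLf

    ' Pass 1: count the line feeds to size the array once.
    lPos = InStr(sData, vbLf)
    Do While lPos > 0
        lCount = lCount + 1
        lPos = InStr(lPos + 1, sData, vbLf)
    Loop
    ReDim sLines(0 To lCount - 1)

    ' Pass 2: peel one line at a time off the front of the string.
    For i = 0 To lCount - 1
        lPos = InStr(sData, vbLf)
        sLine = Left$(sData, lPos - 1)
        If Right$(sLine, 1) = vbCr Then sLine = Left$(sLine, Len(sLine) - 1)
        sLines(i) = Trim$(sLine)
        sData = Mid$(sData, lPos + 1)
    Next i

    LinesToArray = lCount
End Function
```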
Figure 1 shows the extracted data as well as the entire HTML text.
Figure 1 – The ZIPCheck sample application queries the US Postal Service’s ZIP code database and returns the standardized address and ZIP code.

The basic algorithm for extracting data from an HTML page will almost always be the same, but the details will differ. Sometimes the program will need to extract data from tables, sometimes from multiple locations. The key to getting the data is to identify unique HTML elements that precede and follow the data. Remove the unneeded information. Then remove any unnecessary tags or characters. Finally, parse the remaining data. Note that an HTML tag may actually be the unique data separator. A <br> tag could be the only thing between one piece of data and the next.
The code archive for this article contains a second project that queries the National Weather Service database for current weather conditions across the country (see Figure 2). It must gather the data in several stages. It first retrieves a list of states and territories, then a list of areas within those states, and finally the current weather conditions for the selected area.
Figure 2 – The Weather sample application queries the National Weather Service current conditions database.

Summary
Understanding how to use the Inet control methods is just the first step to utilizing the vast Internet as a database. You'll also need a grasp of basic HTML commands and a willingness to experiment. Once you've looked at the HTML source of a few submission forms and examined the data returned, you'll realize how easy it is to extract the data and present it in your own programs.