|
This is the first part of a project I was commissioned to do. The script takes a URL and fetches the page at that address. It then runs the page through Tidy to repair the markup and convert the document to XML. From the XML it reads every sentence on the page and displays all the words that are adjacent in the document. I designed this multi-step process so that virtually all pages are read correctly, even when their markup contains errors. For security reasons, the script can read only one page every 30 seconds, strips anything after a ? in the URL, and times out after 20 seconds.
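As a rough illustration of two of the steps described above (stripping the query string and listing adjacent words), here is a minimal PHP sketch. It is not the original script: the function names `stripQuery` and `adjacentWordPairs` are hypothetical, the sentence splitting is deliberately naive, and it assumes that "adjacent" means neighbouring words within the same sentence. Fetching the page, the Tidy cleanup, the 30 second rate limit, and the timeout are left out.

```php
<?php
// Hypothetical helper: drop the "?" and everything after it from a URL.
function stripQuery(string $url): string
{
    $pos = strpos($url, '?');
    return $pos === false ? $url : substr($url, 0, $pos);
}

// Hypothetical helper: return each pair of neighbouring words,
// sentence by sentence (assumption: pairs do not cross sentence ends).
function adjacentWordPairs(string $text): array
{
    $pairs = [];
    // Naive split: a sentence ends at ., ! or ? followed by whitespace.
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($sentences as $sentence) {
        // A "word" here is a run of letters, digits, or apostrophes.
        preg_match_all("/[\\p{L}\\p{N}']+/u", $sentence, $matches);
        $words = $matches[0];
        for ($i = 0; $i + 1 < count($words); $i++) {
            $pairs[] = [$words[$i], $words[$i + 1]];
        }
    }
    return $pairs;
}
```

For example, `stripQuery('http://example.com/page.php?id=5')` returns `http://example.com/page.php`, and `adjacentWordPairs('The cat sat. Dogs bark!')` yields the pairs The/cat, cat/sat, and Dogs/bark.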
Skills and Technologies Used:
- PHP
- Tidy
- XML
- Regular Expressions
|