parsing - Text parser with PHP, like Instapaper


I'm trying to write a text parser in PHP that does what Instapaper does: fetch a web page and render it in text-only mode.

Fetching a web page with cURL and stripping the HTML tags is easy. But every web page has common areas such as the header, navigation, sidebar, footer, and banners. I want only the article text and want to exclude everything else. If I know the "id" or "class" of those parts, excluding them is also easy, but I want a process that works on any page, as Instapaper does.
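For reference, the "easy" case described above could look roughly like the sketch below. The function names fetchHtml and textWithoutIds are made up for illustration, and the ids passed in are assumed to be known in advance, which is exactly what does not hold for arbitrary pages.

```php
<?php
// Fetch a page with cURL; returns the raw HTML, or null on failure.
function fetchHtml(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 15,
    ]);
    $html = curl_exec($ch);
    return is_string($html) ? $html : null;
}

// Remove scripts, styles, and elements whose id is in $ids, then return plain text.
// Note: loadHTML() assumes ISO-8859-1 unless the page declares its charset.
function textWithoutIds(string $html, array $ids): string
{
    if (trim($html) === '') {
        return '';
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Snapshot the node list first, because we modify the tree while looping.
    $nodes = iterator_to_array($xpath->query('//script | //style | //*[@id]'));
    foreach ($nodes as $node) {
        $isNoise = in_array($node->nodeName, ['script', 'style'], true)
            || in_array($node->getAttribute('id'), $ids, true);
        if ($isNoise && $node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    $root = $doc->documentElement;
    $text = $root !== null ? $root->textContent : '';
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/u', ' ', $text) ?? '');
}
```

A call such as textWithoutIds(fetchHtml($url) ?? '', ['header', 'sidebar', 'footer']) would then give the remaining text, but only for sites that happen to use those ids.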

I can get all the content, but I do not know how to exclude headers, sidebars, and footers and keep only the main article body. I need to work out the logic that finds the main article section.

I don't need exact code; it would be just as helpful to understand how to separate out the unnecessary parts, because I can then write the PHP myself. Examples in other languages would be useful too.

Thank you for helping.

You can look at the algorithms behind this bookmarklet, which removes all the clutter from a web page. It has a good success rate.

A friend of mine made it, which is why I am recommending it: I know that it works, and I am aware of the many techniques it uses to parse the data. You can adapt these techniques to your needs.
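The bookmarklet's own techniques are not spelled out here, so the following is not its algorithm. It is only a minimal sketch of one common heuristic for this kind of tool: score each container by how much non-link paragraph text it holds, then keep the container with the highest score. The function name extractMainText and the 25-character threshold are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: guess the main article text of an arbitrary HTML page.
// Heuristic: navigation, sidebars, and footers are mostly links or short
// snippets, while an article body holds long paragraphs of plain text.
function extractMainText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);            // assumes ISO-8859-1 unless a charset is declared
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Scripts and styles never contain article text.
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    $scores = []; // node path => accumulated score
    $nodes  = []; // node path => container element
    foreach ($xpath->query('//p') as $p) {
        $text = trim($p->textContent);
        $len  = strlen($text); // byte length is good enough for scoring
        if ($len < 25) {
            continue; // too short to be evidence of an article (arbitrary threshold)
        }
        // Text inside links does not count toward the score.
        $linkLen = 0;
        foreach ($xpath->query('.//a', $p) as $a) {
            $linkLen += strlen(trim($a->textContent));
        }
        $parent = $p->parentNode;
        if (!$parent instanceof DOMElement) {
            continue;
        }
        $key = $parent->getNodePath(); // unique path, used as an array key
        $scores[$key] = ($scores[$key] ?? 0) + max(0, $len - $linkLen);
        $nodes[$key]  = $parent;
    }
    if ($scores === []) {
        return '';
    }

    arsort($scores); // highest score first
    $best = $nodes[array_key_first($scores)];

    // Return the direct paragraphs of the winning container as plain text.
    $parts = [];
    foreach ($xpath->query('./p', $best) as $p) {
        $clean = trim(preg_replace('/\s+/u', ' ', $p->textContent) ?? '');
        if ($clean !== '') {
            $parts[] = $clean;
        }
    }
    return implode("\n\n", $parts);
}
```

Real pages will need more than this, for example sharing part of a paragraph's score with the grandparent container or penalizing class names such as "comment" or "sidebar", but the scoring idea stays the same.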

