dom - Logic for implementing a dynamic web scraper in C#
I am trying to develop a web scraper in C# Windows Forms. What I am trying to accomplish is as follows:
- Get the URL from the user.
- Load the web page in the IE UI control (embedded browser) in WinForms.
- Allow the user to select some text (small, roughly no more than 50 characters) from the loaded web page.
- When the user wishes to persist the location (the HTML DOM location), it must be stored in the DB so that the user can use it on subsequent visits to fetch the data at that location. Assuming the loaded website is a price-listing site and the quoted rate keeps changing, the idea is to persist the DOM hierarchy so that I can traverse it next time.
I would be able to do this if all HTML elements had ID attributes. In cases where the ID is null, I am not able to accomplish it.
Can someone suggest a valid idea for this (if possible, a minimal code snippet)?
Even sharing some online resources would be helpful.
Thank you,
vijay
One approach is:
Create a stack of the tags / styles / IDs from that element upward, stopping at the closest ancestor that has an ID. This way you skip most of the header etc. Then build a sequence to look for.
Example:
<html><body><!-- lots of HTML --><div id="main"><div><span><div class="pricearea"><table><!-- with the value data -->
For example, you would store a sequence in your DB: [id=main], div, span, div, table, or maybe div[class=pricearea], table.
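The idea of walking up from the selected element to the nearest ID can be sketched roughly as follows. This is only an illustration under some assumptions: it uses System.Xml.Linq on well-formed markup as a stand-in for the browser DOM, it ignores XML namespaces, and the `DomPath` class, the `Build` method and the `[id=...]/tag[n]` path format are hypothetical names chosen for the example. It also records each element's position among same-named siblings, which the sequence above does not, so the path stays unambiguous.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: not part of any library.
public static class DomPath
{
    // Walks from the selected element up to the nearest ancestor that has an id
    // (or to the root if none has one) and returns the steps from top to bottom.
    public static string Build(XElement selected)
    {
        var steps = new Stack<string>();
        for (XElement node = selected; node != null; node = node.Parent)
        {
            string id = (string)node.Attribute("id");
            if (!string.IsNullOrEmpty(id))
            {
                steps.Push("[id=" + id + "]");
                break; // an id is a stable anchor, so stop climbing here
            }

            // 1-based position among siblings with the same tag name
            int index = node.ElementsBeforeSelf(node.Name).Count() + 1;
            steps.Push(node.Name.LocalName + "[" + index + "]");
        }

        // A Stack enumerates last-pushed first, i.e. anchor first, selected element last.
        return string.Join("/", steps);
    }
}
```

For the sample markup above (with its tags closed), calling `DomPath.Build` on the table element would give `[id=main]/div[1]/span[1]/div[1]/table[1]`, which is the string you would store in the DB.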
You can also build your own path using styles / classes, or whatever feature of a tag (or combination of features) you want to look for. You want the path to be as precise as possible and as robust as possible, so anchor it on the most stable elements you can find.
If the layout rarely changes, it will allow you to navigate to the same place every time.
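Navigating back to the same place on a later visit could then look something like this. Again this is a sketch under assumptions: the stored path is assumed to have the form `[id=main]/div[1]/span[1]/div[1]/table[1]` (an ID anchor followed by tag names with 1-based positions), the page is assumed to be well-formed markup without namespaces so that System.Xml.Linq can parse it, and `DomPathResolver` and `Resolve` are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: not part of any library.
public static class DomPathResolver
{
    // Follows a stored path such as "[id=main]/div[1]/span[1]/div[1]/table[1]"
    // and returns the matching element, or null if the layout no longer matches.
    public static XElement Resolve(XElement root, string path)
    {
        XElement current = null;
        foreach (string step in path.Split('/'))
        {
            if (step.StartsWith("[id="))
            {
                // Anchor step: find the element carrying this id anywhere in the page.
                string id = step.Substring(4, step.Length - 5);
                current = root.DescendantsAndSelf()
                              .FirstOrDefault(e => (string)e.Attribute("id") == id);
            }
            else
            {
                // Tag step: "name[n]" means the n-th child with that tag name.
                int open = step.IndexOf('[');
                string name = step.Substring(0, open);
                int index = int.Parse(step.Substring(open + 1, step.Length - open - 2));

                if (current == null)
                {
                    // No anchor yet, so the first step must describe the root itself.
                    current = (root.Name.LocalName == name && index == 1) ? root : null;
                }
                else
                {
                    current = current.Elements(name).ElementAtOrDefault(index - 1);
                }
            }

            if (current == null) return null; // path broke: the layout has changed
        }
        return current;
    }
}
```

A null result is the signal that the page layout changed and the user needs to select the value again; otherwise the element's `Value` property gives the current text, for example the latest price.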
I would also suggest using a DOM parsing library or something similar for this, since the IE control is slow.
Screen scraping is fun, but it is hard to get it 100% right for all pages. Good luck!