0Day Forums
Language/libraries for downloading & parsing web pages?


Language/libraries for downloading & parsing web pages? - briellamrexpd - 07-19-2023

What language and libraries are suitable for a script to parse and download small numbers of web resources?

For example, some websites publish pseudo-podcasts, but not as proper RSS feeds; they just publish an MP3 file regularly with a web page containing the playlist. I want to write a script to run regularly and parse the relevant pages for the link and playlist info, download the MP3, and put the playlist in the MP3 tags so it shows up nicely in my iPod. There are a bunch of similar applications that I could write too.

What language would you recommend? I would like the script to run on Windows and macOS. Here are some alternatives:

- **JavaScript**. Just so I could use jQuery for the parsing. I don't know if jQuery works outside a browser though.
- **Python**. Probably good library support for doing what I want. But I don't love Python syntax.
- **Ruby**. I've done simple stuff (manual parsing) in Ruby before.
- **Clojure**. Because I want to spend a bit of time with it.

What's your favourite language and libraries for doing this? And why? Are there any nice jQuery-like libraries for other languages?


RE: Language/libraries for downloading & parsing web pages? - sitter668 - 07-19-2023

You should really give **Python** a shot.

When I design a crawler, I usually reproduce the same *pattern*.

For each step there is a worker, which picks its data from a container (usually a queue). There is a container between each type of worker. After the first connection to the target site, all types of workers can be threaded, so access to these queues has to be synchronized.

1. **Connector:** the Session object from the requests library is remarkable.
2. **Loader:** with multiple threaded Loaders, many requests can be launched in no time.
3. **Parser:** xpath is used intensively on each etree object created with lxml.
4. **Validator:** a set of assertions and heuristics to check the validity of the parsed data.
5. **Archiver:** the choice depends on what is stored, how much, and how fast, but NoSQL is often the easiest way to store the retrieved data, for example MongoDB with pymongo.
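
The worker-and-queue pattern above can be sketched roughly as follows. This is only an illustration under some assumptions: it uses the standard library's `threading` and `queue` modules rather than requests, lxml, or pymongo, it runs one thread per stage, and `fake_load`, `parse_title`, `has_text`, and `FAKE_PAGES` are hypothetical stand-ins so the example runs without network access.

```python
import queue
import threading

STOP = object()  # sentinel: tells a worker that no more items are coming


def worker(step, inbox, outbox):
    """Take items from inbox, apply step, and pass the result downstream."""
    while True:
        item = inbox.get()          # Queue.get() blocks and is thread-safe
        if item is STOP:
            outbox.put(STOP)        # propagate shutdown to the next stage
            return
        outbox.put(step(item))


def run_pipeline(items, steps, is_valid):
    """Run each step in its own thread, with one queue between stages."""
    queues = [queue.Queue() for _ in range(len(steps) + 1)]
    threads = [
        threading.Thread(target=worker, args=(step, queues[i], queues[i + 1]))
        for i, step in enumerate(steps)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)

    archive = []                    # stand-in for a real store such as a database
    while True:
        result = queues[-1].get()
        if result is STOP:
            break
        if is_valid(result):        # validator: drop records that fail the check
            archive.append(result)
    for t in threads:
        t.join()
    return archive


# Hypothetical stand-ins so the sketch runs without network access.
FAKE_PAGES = {
    "http://example.com/a": "<title>Episode 1</title>",
    "http://example.com/b": "<title></title>",
}


def fake_load(url):
    return FAKE_PAGES[url]


def parse_title(html):
    start = html.index("<title>") + len("<title>")
    return html[start:html.index("</title>")]


def has_text(title):
    return bool(title.strip())
```

Calling `run_pipeline(list(FAKE_PAGES), [fake_load, parse_title], has_text)` pushes both URLs through the loader and parser threads and keeps only the record that passes validation. `queue.Queue` does its own locking, which covers the synchronization mentioned above. Running several loaders per stage would need one `STOP` per worker, and a step that raises would stall this simple version.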


RE: Language/libraries for downloading & parsing web pages? - zuzanau - 07-19-2023

Clojure link dumps, covering Enlive (which is based on TagSoup) and agents for parallel downloads. Roundups and link dumps aren't pretty, but I did spend some time searching for different libs. Spidering/crawling can be very easy or pretty involved depending on the structure of the sites crawled (HTML, XHTML, etc.).

Also: Apache HttpClient, for the HTTP side.


RE: Language/libraries for downloading & parsing web pages? - prehunger120662 - 07-19-2023

For a jQuery-like CSS selector library in Perl, take a look at [`pQuery`][pQuery].

Also have a look at this previous SO question for examples of HTML parsing & scraping in many languages.

* Can you provide an example of parsing HTML with your favorite parser?

/I3az/

[pQuery]:http://search.cpan.org/dist/pQuery/


RE: Language/libraries for downloading & parsing web pages? - westernisation596262 - 07-19-2023

What do you really want to do? If you want to learn Clojure, Ruby, or C, do that. If you just want to get it done, use whatever is fastest for you. At the very least, when you say Clojure and library you are also saying Java and library; there are lots of those, and some are very good (I don't know which, though). The same was said for Ruby and Python above. So what do you want to do?


RE: Language/libraries for downloading & parsing web pages? - strickenezzbchkcqv - 07-19-2023

Beautiful Soup is a good Python library for this. It specializes in dealing with malformed markup.
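
To give a feel for pulling an MP3 link out of sloppy markup, here is a minimal sketch. Beautiful Soup is a third-party package, so this uses the standard library's `html.parser` instead (a parser Beautiful Soup can also use as its backend); it is not Beautiful Soup's own API. The class name, the sample page, and the URLs are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class Mp3LinkFinder(HTMLParser):
    """Collect the href of every <a> tag that points at an .mp3 file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".mp3"):
            # Resolve relative links against the URL of the page itself.
            self.links.append(urljoin(self.base_url, href))


def find_mp3_links(html, base_url):
    finder = Mp3LinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.links


# Deliberately sloppy markup: unclosed <p> and <b>, a stray </div>,
# and no closing </html>. The event-based parser does not care.
SAMPLE = """
<html><body>
<p>Latest show <b>new!
<a href="shows/2023-07-19.mp3">Download</a>
</div>
<p><a href="/about.html">About</a>
</body>
"""
```

`find_mp3_links(SAMPLE, "http://example.com/radio/")` returns the single MP3 link as an absolute URL. Because the parser only reacts to start tags, the missing close tags do not matter here; a tree-building library like Beautiful Soup earns its keep when you need to navigate the repaired document structure.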


RE: Language/libraries for downloading & parsing web pages? - andrzejmqcdezxbrr - 07-19-2023

In Ruby you also have Nokogiri. Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser. Among its many features is the ability to search documents via XPath or CSS3 selectors.
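
For anyone comparing with the Python suggestions in this thread, searching a document by path expression looks roughly like this. This is not Nokogiri: it is Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath and needs well-formed input. The page layout, ids, and class names are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical playlist page; it has to be well-formed for ElementTree.
PAGE = """
<html>
  <body>
    <ol id="playlist">
      <li class="track"><span class="artist">Artist A</span> <span class="title">Song One</span></li>
      <li class="track"><span class="artist">Artist B</span> <span class="title">Song Two</span></li>
      <li class="note">Recorded live</li>
    </ol>
  </body>
</html>
"""


def playlist(xml_text):
    """Return (artist, title) pairs for every track in the playlist."""
    root = ET.fromstring(xml_text.strip())
    tracks = []
    # Select only <li class="track"> children of the playlist list.
    for li in root.findall(".//ol[@id='playlist']/li[@class='track']"):
        artist = li.find("span[@class='artist']").text
        title = li.find("span[@class='title']").text
        tracks.append((artist, title))
    return tracks
```

`playlist(PAGE)` yields one tuple per track and skips the `note` item. Real-world HTML is rarely well-formed, which is where a lenient parser (Nokogiri in Ruby, lxml or Beautiful Soup in Python) comes in.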


RE: Language/libraries for downloading & parsing web pages? - unchristianizes954974 - 07-19-2023

If you want to spend some time with Clojure (a very good idea IMO!), give Enlive a shot. The GitHub description reads

> a selector-based (à la CSS) templating and transformation system for Clojure

In addition to being useful for templating, it's a capable web scraping library; see the initial part of this tutorial for some simple scraping examples. (The third one is the New York Times homepage, so actually not as simple as all that.)

There are other tutorials available on the Web if you look for them; Enlive itself comes with some docs / examples. (Plus the code is < 1000 lines in total and very readable, though I suppose this might be less so for someone new to the language.)


RE: Language/libraries for downloading & parsing web pages? - roric585266 - 07-19-2023

As Mikael S has mentioned, hpricot is a great Ruby HTML parser. However, for page retrieval, you may consider using a screen-scraping library like scRUBYt or Mechanize.


RE: Language/libraries for downloading & parsing web pages? - hybrid976 - 07-19-2023

I highly recommend using Ruby and the hpricot library.

