Ruby on Rails Saturday, November 10, 2018



On Saturday, November 10, 2018 at 10:35:03 AM UTC-5, Walter Lee Davis wrote:

> On Nov 9, 2018, at 6:22 PM, fugee ohu <fuge...@gmail.com> wrote:
>
>
>
> On Wednesday, November 7, 2018 at 12:28:05 PM UTC-5, Jake Niemiec wrote:
> The ui-box class would indicate that it is a React component: https://github.com/segmentio/ui-box
>
> React components are rendered client-side, meaning the text you are looking for is inserted into the document after the page runs its <script> tags. I would take a look at the Sources tab in Chrome; you can find all the loaded scripts there.
>
> On Wed, Nov 7, 2018 at 10:17 AM fugee ohu <fuge...@gmail.com> wrote:
>
>
> On Wednesday, November 7, 2018 at 11:01:32 AM UTC-5, Colin Law wrote:
> I should think that JavaScript is involved. I am sure you asked a
> similar question before when you were trying to scrape a website and
> couldn't find the text in the HTML.
>
> Colin
> On Wed, 7 Nov 2018 at 15:35, fugee ohu <fuge...@gmail.com> wrote:
> >
> > I'm not very good with the consoles in Chrome and Firefox, but I couldn't find the text I was looking for in the source, even though it's seemingly displayed as text (the cursor changes to a vertical line on mouse-over). I found the HTML below in the source. How does this HTML create the text that displays?
> >
> >    <div class="ui-box product-description-main" id="j-product-description">
> >         <div class="ui-box-title">Product Description</div>
> >         <div class="ui-box-body">
> >
> >             <div class="description-content" data-role="description" data-spm="1000023">
> >             <div class="loading32"></div>
> >             </div>
> >
> >         </div>
> >     </div>
> >
> > --
> > You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to rubyonrails-ta...@googlegroups.com.
> > To post to this group, send email to rubyonra...@googlegroups.com.
> > To view this discussion on the web visit https://groups.google.com/d/msgid/rubyonrails-talk/8e0eb26a-517a-4216-bb9c-8bd05e4412a5%40googlegroups.com.
> > For more options, visit https://groups.google.com/d/optout.
>
> Yes, JavaScript. Within that context, how does it happen that the text I'm viewing in the browser isn't visible in the source?
>
>
> So far I'm trying to get up to the table, the last element shown below. doc.at_css("div#j-product-description div.ui-box-body div.description-content") gets me back the div class="description-content" element, but doc.at_css("div#j-product-description div.ui-box-body div.description-content div.origin-part") returns nil. There's a lot inside kse:widget that I'm not including here.
>
> <div class="ui-box product-description-main" id="j-product-description" data-widget-cid="widget-27">
>         <div class="ui-box-title">Product Description</div>
>         <div class="ui-box-body">
> <div class="description-content" data-role="description" data-spm="1000023"><div class="origin-part"><p> <br> <br> <br> &nbsp; </p>
> <kse:widget data-widget-type="relatedProduct" id="24226336" title="TOP" type="relation">...</kse:widget>
> <table border="2">
>

It seems to me that you are going to have to identify the data source that the in-page JavaScript is using to generate the dynamic table data, and query that rather than trying to work everything out from the HTML (which is just a template for the in-page script to fill). There's probably a JSON URL somewhere that is being loaded into the page, and the script is building from that. This entire approach is pretty fraught with peril, though, because (like any scraping project, only more so) any change to the scheme that the site's developer chooses to implement will break your scraper immediately.

Following this path is going to force you to learn about how the site is working on a code level -- and to figure out how they go from data to presentation.

Another approach might be to use a headless browser on the server to construct a "real" DOM of the page, and query that. To be clear -- I do not recommend you follow this path -- I am noting it here to illustrate how ridiculous this effort will be.

One way to visualize this difference is to use the Web Inspector in Safari or Chrome to look at the differences between the raw HTML (Safari labels this tab "Resources") and the DOM (Safari calls this "Elements"). There is likely very little in common outside of the overall outline, if the page is changing as dramatically as you describe. If you hunt through the Resources tab (in Safari) you may find a link to a JSON file that is being required into the page. Loading that URL, rather than the HTML, may give you a much cleaner set of data (which you can parse directly using Ruby) rather than trying to execute JS on your server in order to construct an HTML DOM that you can parse with Nokogiri.
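If you do find such a JSON URL, parsing it could look roughly like the sketch below. Everything specific here is an assumption: the example.com URL, the "rows" key, and the name/value fields are hypothetical stand-ins, since the real endpoint and payload shape are unknown and have to be read off the browser's resource list. Only Ruby's standard library is used.

```ruby
require "json"
require "net/http"
require "uri"

# Parse a JSON body into plain Ruby hashes and arrays.
def parse_payload(body)
  JSON.parse(body)
end

# Fetch a data URL found in the browser's resource list and parse it.
# Raises if the server does not answer with a 2xx status.
def fetch_payload(url)
  response = Net::HTTP.get_response(URI(url))
  raise "Request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  parse_payload(response.body)
end

# Hypothetical payload, standing in for whatever the real endpoint returns.
SAMPLE_BODY = '{"rows": [{"name": "Color", "value": "Black"}, ' \
              '{"name": "Weight", "value": "1kg"}]}'

data = parse_payload(SAMPLE_BODY)
data["rows"].each { |row| puts "#{row["name"]}: #{row["value"]}" }

# Against a real site this would become, with a URL you located yourself:
#   data = fetch_payload("https://example.com/path/to/data.json")
```

The parsing step is kept separate from the HTTP call so you can save one response to disk and work out the payload structure without hitting the site repeatedly.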

Walter


It wasn't shown in the source, but when I expanded the element recursively in Chrome developer tools I saw the text I was looking for. So, what's that gonna be worth?

