Ruby on Rails Monday, February 4, 2019



On Monday, February 4, 2019 at 8:10:33 AM UTC-5, Walter Lee Davis wrote:

> On Feb 4, 2019, at 7:35 AM, fugee ohu <fuge...@gmail.com> wrote:
>
>
>
> On Sunday, February 3, 2019 at 9:54:25 PM UTC-5, Walter Lee Davis wrote:
>
> > On Feb 3, 2019, at 7:14 PM, fugee ohu <fuge...@gmail.com> wrote:
> >
> >
> >
> > On Wednesday, January 30, 2019 at 5:16:59 PM UTC-5, Colin Law wrote:
> > On Wed, 30 Jan 2019 at 22:12, Colin Law <cla...@gmail.com> wrote:
> > >
> > > On Wed, 30 Jan 2019 at 22:09, fugee ohu <fuge...@gmail.com> wrote:
> > > >
> > > >
> > > >
> > > > On Wednesday, January 30, 2019 at 5:02:17 PM UTC-5, Colin Law wrote:
> > > >>
> > > >> On Wed, 30 Jan 2019 at 21:56, fugee ohu <fuge...@gmail.com> wrote:
> > > >> > ...
> > > > >> > Everything in the unparsed response body that I want is between [ and ], so I have to gsub it out
> > > >>
> > > >>
> > > > >> No you don't.  After you get parsed_obj["results"] (which is an array,
> > > >> that's what the [] mean) then you can get the first product by
> > > >> parsed_obj["results"][0]["productId"]
> > > >> It is just an array.  You have met ruby arrays haven't you?
> > > >>
> > > >> I am rapidly losing the will to live.
> > > >>
> > > >> Colin
> > > >
> > > >
> > > > > The response body can't be parsed by JSON.parse as is; it has to be gsub'd and chomped first. My original gsub wasn't right: it wasn't removing the text that follows the ].
> > > > JSON::ParserError: 784: unexpected token at 'myscript.js({"success":true,"code
> > >
> > > You previously posted that you had got parsed_obj where
> > > > parsed_obj["results"]  was an array.  Go back to that.
> >
> > To quote your previous message
> >
> > >puts parsed_obj["results"]  shows the entire results but puts parsed_obj["results"]["productId"] gets me error no implicit
> > > conversion of String into Integer
> >
> > The error is because it is an array, which is perfectly obvious if you
> > look at the unparsed string. So if you use
> > parsed_obj["results"][0]
> > you will get the first element
> >
> > Colin
> >
> > There are scripts in the browser page source that pass a lot of useful values like this
> > <script type="text/javascript">
> >                 if(!window.runParams) {
> >                 window.runParams = {};
> >                 }
> >                 window.runParams.minPrice="44.98";
> >                 window.runParams.maxPrice="44.98";
> >                ...
> > And more within definitions in the same script like this
> > var skuProducts=[{"skuAttr":"14:1052","skuPropIds":"1052","skuVal":{"actSkuCalPrice":"20.24","actSkuMultiCurrencyCalPrice":"20.24","actSkuMultiCurrencyDisplayPrice":"20.24","availQuantity":29,"inventory":30,"isActivity":true,"skuCalPrice":"44.98","skuMultiCurrencyCalPrice":"44.98","skuMultiCurrencyDisplayPrice":"44.98"}},{"skuAttr":"14:173","skuPropIds":"173","skuVal":{"actSkuCalPrice":"20.24","actSkuMultiCurrencyCalPrice":"20.24","actSkuMultiCurrencyDisplayPrice":"20.24","availQuantity":26,"inventory":30,"isActivity":true,"skuCalPrice":"44.98","skuMultiCurrencyCalPrice":"44.98","skuMultiCurrencyDisplayPrice":"44.98"}}];
> >                 var GaData = {
> >         pageType: "product",
> >         productIds: "en32837801078",
> >         totalValue: "US $20.24"
> >     };
> >
> > Since it's in <script> containers in page source can I parse it?
>
> Since it's in a <script> tag, you can use Nokogiri or another HTML parser to extract only that bit of the page. To be sure, you will have to do some work on the script before you can access the parts you're interested in as JSON. But JSON is the same whether it is being parsed by JavaScript or Ruby. You're going to have to work out the best way to identify the parts you want. There's no such thing as a JavaScript parser in Ruby, but if you can figure out where to start, and how to get the offsets to trim your starting code, the parts that look interesting above will be interesting to Ruby, too.
>
> I'm assuming you don't have control over this page, and that you are doing some sort of scraping exercise here. So you'll need to have lots of tests around whatever code you write, and keep checking often, because the owner of this code may change its fundamental structure at a moment's notice.
>
> Walter
>
>
> It's the 12th <script> on the page but doc.at_css("script:nth-child(12)") returns nil

Are you sure that it's there in the original HTML, or is it being put there by a script after page load? Make sure that you are only looking at the original HTML, not the DOM. (Safari shows the HTML in the tab called Resources, and the DOM in the tab called Elements. If you're using a different browser, there may be a similar distinction with different names.) The DOM can change over the life of the page, in response to scripting. The HTML is fixed at the moment that the page is served to the client. Nokogiri and other HTML parsers can only read the HTML as served, not the DOM as mutated by a browser.

You've already been around this tree a couple of times -- no, the answer is not to stand up a headless browser and read the DOM. The data is there, either in the HTML or the associated scripts served by the site. You just have to find it and isolate it from the rest of the visual page content.
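When the data arrives in one of those associated scripts as a JSONP-style response (like the `myscript.js({...` body quoted earlier in the thread), a minimal sketch of unwrapping it might look like the following. The helper name `parse_jsonp` and the sample payload are made up for illustration; the real response is truncated in the thread, so its actual fields are not known. The sketch assumes the whole body is a single call wrapping one JSON argument.

```ruby
require "json"

# Hypothetical helper: strip a JSONP wrapper such as `callback({...});`
# and parse what is inside the outermost parentheses.
# Assumes the body is one call with a single JSON argument.
def parse_jsonp(body)
  start  = body.index("(")
  finish = body.rindex(")")
  unless start && finish && finish > start
    raise ArgumentError, "not a JSONP response"
  end

  JSON.parse(body[(start + 1)...finish])
end

# Made-up sample in the general shape of the response quoted in the thread.
sample = 'myscript.js({"success":true,"results":[{"productId":"123"}]});'

parsed_obj = parse_jsonp(sample)

# "results" is an array, so index into it before asking for a key.
first_id = parsed_obj["results"][0]["productId"]
puts first_id
```

Cutting at the first `(` and the last `)` avoids a hand-tuned gsub pattern, and it also shows why `parsed_obj["results"]["productId"]` fails: an array needs an integer index first.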

Once you confirm that the data you want is there in the HTML, then instead of trying to just get the 12th script tag with at_css, use css('script').each to loop over all the scripts on the page, and see which one contains that target string. Once you figure out the correct offset, you can use the more targeted selector if you like, but because you don't own the page, you may want to continue using an enumerator to loop over the page contents -- that's likely to be more resilient in the face of change.
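As a rough sketch of that loop: the code below works on plain strings so it runs without any gems. With Nokogiri, the `scripts` array would come from something like `doc.css("script").map(&:text)`. The helper names `find_script` and `extract_json_array` are hypothetical, the sample scripts are trimmed from the page source quoted earlier, and the regular expression assumes the array literal is valid JSON and ends at the first `];`.

```ruby
require "json"

# Return the first script body that mentions `marker`, or nil if none does.
def find_script(scripts, marker)
  scripts.find { |text| text.include?(marker) }
end

# Pull the array literal assigned to `var <name>=` out of a script body.
# Assumes the literal is valid JSON and that the first "];" ends it.
def extract_json_array(script, name)
  match = script.match(/var\s+#{Regexp.escape(name)}\s*=\s*(\[.*?\]);/m)
  match && JSON.parse(match[1])
end

# Stand-ins for the text of each <script> tag on the page.
scripts = [
  'window.runParams = {}; window.runParams.minPrice="44.98";',
  'var skuProducts=[{"skuAttr":"14:1052","skuVal":{"availQuantity":29}},' \
  '{"skuAttr":"14:173","skuVal":{"availQuantity":26}}]; ' \
  'var GaData = {pageType: "product"};'
]

script = find_script(scripts, "var skuProducts")
sku_products = extract_json_array(script, "skuProducts")

sku_products.each do |sku|
  puts "#{sku["skuAttr"]}: #{sku.dig("skuVal", "availQuantity")} available"
end
```

Searching by content rather than by position means the code keeps working if the site adds or reorders script tags, although it will still break if the variable name or the shape of the literal changes.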

Walter

>
> --
> You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to rubyonrails-ta...@googlegroups.com.
> To post to this group, send email to rubyonra...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/rubyonrails-talk/86958eec-3d90-42e0-bfdb-30801a38a878%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

 
Should doc = Nokogiri::HTML.parse(browser.html) return the HTML or the elements (the DOM)?

