
Scrapy JSON export issues


By : user141072
Date : November 22 2020, 10:54 AM
There are multiple issues here.
The main problem is the invalid expressions inside the select() calls.
code :
from urllib.parse import urljoin  # Python 2's urlparse.urljoin lives here on Python 3

import scrapy
from scrapy.http import Request

from scrapy_demo.items import ScrapyDemoItem


class ScrapyDemoSpider(scrapy.Spider):  # BaseSpider was renamed to scrapy.Spider
    name = "scrapy_demo"
    allowed_domains = ["buffalo.craigslist.org"]
    start_urls = ['http://buffalo.craigslist.org/search/cps/']

    def parse(self, response):
        # processing listings
        for listing in response.css('p.row > a[data-id]'):
            link = listing.xpath('@href').extract()[0]
            yield Request(urljoin(response.url, link), callback=self.parse_listing_page)

        # following the next page
        next_page = response.xpath('//a[contains(@class, "next")]/@href').extract()
        if next_page:
            yield Request(urljoin(response.url, next_page[0]), callback=self.parse)

    def parse_listing_page(self, response):
        item = ScrapyDemoItem()
        item['link'] = response.url
        item['title'] = response.xpath('//title/text()').extract()[0].strip()
        item['content'] = response.xpath('//section[@id="postingbody"]/text()').extract()[0].strip()
        yield item
Sample of the exported JSON:
[
    {"content": "Using a web cam with your computer to video communicate with your loved ones has never been made easier and it's free (providing you have an Internet connection).  With the click of a few buttons, you are sharing your live video and audio with the person you are communicating with. It's that simple.  When you are seeing and hearing your grand kids live across the country or halfway around the world, web camming is the next best thing to being there!", "link": "http://buffalo.craigslist.org/cps/4784390462.html", "title": "Web Cam With Your Computer With Family And Friends"},
    {"content": "Looking to supplement or increase your earnings?", "link": "http://buffalo.craigslist.org/cps/4782757517.html", "title": "1k in 30 Day's"},
    {"content": "Like us on Facebook: https://www.facebook.com/pages/NFB-Systems/514380315268768", "link": "http://buffalo.craigslist.org/cps/4813039886.html", "title": "NFB SYSTEMS COMPUTER SERVICES + WEB DESIGNING"},
    {"content": "Like us on Facebook: https://www.facebook.com/pages/NFB-Systems/514380315268768", "link": "http://buffalo.craigslist.org/cps/4810219714.html", "title": "NFB Systems Computer Repair + Web Designing"},
    {"content": "I can work with you personally and we design your site together (no outsourcing or anything like that!) I'll even train you how to use your brand new site. (Wordpress is really easy to use once it is setup!)", "link": "http://buffalo.craigslist.org/cps/4792628034.html", "title": "I Make First-Class Wordpress Sites with Training"},
    ...
]
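The spider above relies on urljoin to turn the relative listing hrefs into absolute URLs before requesting them; on Python 3 the same helper lives in urllib.parse. A quick sketch with one of the listing paths from the output above:

```python
from urllib.parse import urljoin  # Python 2's urlparse.urljoin moved here

# Join a relative listing href onto the search-page URL, as the spider does
page_url = 'http://buffalo.craigslist.org/search/cps/'
print(urljoin(page_url, '/cps/4784390462.html'))
# → http://buffalo.craigslist.org/cps/4784390462.html
```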


Scrapy dynamic creation of objects + json export

By : Simi Olowolafe
Date : March 29 2020, 07:55 AM
I created a new spider that crawls a website: it gets each video game listed on the site and creates an object for it. The solution was simple. Create an item like this:
code :
from scrapy import Field, Item


class GameInfo(Item):
    title = Field()
    desc = Field()
    kind = Field()
    listeBuys = Field()


gameInfo = GameInfo()
gameInfo['listeBuys'] = []
gameInfo['listeBuys'].append(asyouwant)  # append whatever entries you need
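If Scrapy isn't at hand, the accumulate-into-a-list pattern can be sketched with a plain dict; build_game_info and its arguments are hypothetical names for illustration, not from the original question:

```python
def build_game_info(title, desc, kind, buys):
    # Mirror the GameInfo item: scalar fields plus a listeBuys list
    game_info = {"title": title, "desc": desc, "kind": kind, "listeBuys": []}
    for buy in buys:
        game_info["listeBuys"].append(buy)  # one entry per shop offer
    return game_info

info = build_game_info("Some Game", "a description", "RPG",
                       [{"shop": "shop-a", "price": 9.99}])
print(len(info["listeBuys"]))  # → 1
```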
Is there a way using scrapy to export each item that is scrapped into a separate json file?

By : user3632636
Date : March 29 2020, 07:55 AM
You can use a Scrapy item pipeline and write each item into a separate file from there.
I set a counter in my spider that increments on every item yielded and added its value to the item. I use that counter value to build the file names.
code :
from scrapy import Spider


class TestSpider(Spider):
    # spider name and all
    file_counter = 0

    def parse(self, response):
        # your code here
        ...

    def parse_item(self, response):
        # your code here
        self.file_counter += 1
        item = TestItem(
            # other fields,
            counter=self.file_counter)
        yield item
# settings.py
ITEM_PIPELINES = {'test1.pipelines.TestPipeline': 100}
class TestPipeline(object):

    def process_item(self, item, spider):
        with open('test_data_%s' % item.get('counter'), 'w') as wr:
            item.pop('counter')  # remove the counter; you don't need it in your item
            wr.write(str(item))
        return item
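A minimal, Scrapy-free sketch of the same one-file-per-item idea, using json.dump instead of str(item) so each file is valid JSON; the class name, directory handling, and file-name pattern are illustrative:

```python
import json
import os
import tempfile


class OneFilePerItemPipeline:
    """Writes each item to its own JSON file, named after the item's counter."""

    def __init__(self, out_dir):
        self.out_dir = out_dir

    def process_item(self, item, spider):
        counter = item.pop('counter')  # drop the counter before serialising
        path = os.path.join(self.out_dir, 'test_data_%s.json' % counter)
        with open(path, 'w') as wr:
            json.dump(item, wr)
        return item


out_dir = tempfile.mkdtemp()
pipeline = OneFilePerItemPipeline(out_dir)
pipeline.process_item({'title': 'first', 'counter': 1}, spider=None)
print(sorted(os.listdir(out_dir)))  # → ['test_data_1.json']
```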
Scrapy process.crawl() to export data to json

By : Shuuno
Date : March 29 2020, 07:55 AM
This might be a sub-question of "Passing arguments to process.crawl in Scrapy python", but the author there accepted an answer that doesn't address this sub-question. You need to specify the feed in the settings:
code :
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'FEED_URI': 'file:///tmp/export.json',
})

process.crawl(MySpider)
process.start()
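On newer Scrapy releases (2.1+), FEED_URI and FEED_FORMAT were superseded by the FEEDS setting; a sketch of the equivalent configuration, assuming the same MySpider class is defined elsewhere:

```python
from scrapy.crawler import CrawlerProcess

# FEEDS maps an output path to per-feed options such as the serialisation format
process = CrawlerProcess(settings={
    "FEEDS": {
        "/tmp/export.json": {"format": "json"},
    },
})
process.crawl(MySpider)
process.start()
```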
.json export formatting in Scrapy

By : user7978550
Date : March 29 2020, 07:55 AM
In terms of memory usage it's not good practice, but one option is to accumulate the items in an object and write it out at the end of the process:
code :
import json


class RautahakuPipeline(object):

    def open_spider(self, spider):
        self.items = {"pages": []}
        self.file = None  # opened in close_spider

    def close_spider(self, spider):
        self.file = open('items.json', 'w')
        self.file.write(json.dumps(self.items))
        self.file.close()

    def process_item(self, item, spider):
        self.items["pages"].append(dict(item))
        return item
import json


class RautahakuPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.json', 'w')
        self.file.write('{"pages": [')
        self.first_item = True

    def close_spider(self, spider):
        self.file.write(']}')
        self.file.close()

    def process_item(self, item, spider):
        # write a comma before every item except the first so the JSON stays valid
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(',\n')
        self.file.write(json.dumps(dict(item)))
        return item
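A quick stdlib check of the streaming approach: write the header, comma-separated items, and footer, then confirm the result parses as JSON (file handling simplified to an in-memory buffer):

```python
import io
import json

buf = io.StringIO()
buf.write('{"pages": [')
items = [{"title": "a"}, {"title": "b"}]
for i, item in enumerate(items):
    if i:
        buf.write(',\n')  # comma between items keeps the JSON valid
    buf.write(json.dumps(item))
buf.write(']}')

data = json.loads(buf.getvalue())
print(len(data["pages"]))  # → 2
```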
Scrapy: How to export Json from script

By : zuku
Date : March 29 2020, 07:55 AM
I created a web crawler with Scrapy, but I have a problem with the phone number because it is inside a script tag. It's simple if you have already crawled the contents of the script tag:
code :
import re

script = '{"@context":"http://schema.org","@type":"LocalBusiness","name":"Clínica Dental Reina Victoria 23","description":".TU CLÍNICA DENTAL DE REFERENCIA EN MADRID","logo":"https://estaticos.qdq.com/CMS/directory/logos/c/l/clinica-dental-reina-victoria.png","image":"https://estaticos.qdq.com/coverphotos/098/535/ed1c5ffcf38241f8b83a1808af51a615.jpg","url":"https://www.clinicadental-reinavictoria.es/","hasMap":"https://www.google.com/maps/search/?api=1&query=40.4469174,-3.7087934","telephone":"+34915340309","address":{"@type":"PostalAddress","streetAddress":"Av. Reina Victoria 23","addressLocality":"MADRID","addressRegion":"Madrid","postalCode":"28003"}}'

phone_number = re.search(r'"telephone":"(.*?)","address"', script).group(1)

print(phone_number)
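Since the script content here is JSON-LD, json.loads is a more robust alternative to the regex, as it won't break if the keys are reordered; the script string is abbreviated below:

```python
import json

script = '{"@context":"http://schema.org","@type":"LocalBusiness","telephone":"+34915340309"}'

# Parse the JSON-LD directly instead of pattern-matching the raw string
data = json.loads(script)
print(data["telephone"])  # → +34915340309
```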